[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882852#comment-16882852 ]

Wellington Chevreuil commented on HBASE-22665:
----------------------------------------------

{quote}After 3), the new writer is up and we will schedule a new consume task 
to write the pending entries out; this is why we add the unackedAppends back to 
toWriteAppends, as we need to write them to the new writer. This will lead to a 
new sync.
{quote}
Yep, and even if this new sync fails because the stream is already damaged (per 
the log below), the only way we could reach this deadlock is if it enters [this 
condition|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L297]
 in _syncFailed_, but I think the epoch would always be the same.

{code:java}
16:06:06.466 [AsyncFSWAL-0] WARN  org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL - sync failed
java.io.IOException: stream already broken
        at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424) ~[hbase-server-2.1.2-**.jar:2.1.2-**]
        at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513) ~[hbase-server-2.1.2-**.jar:2.1.2-**]
        at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.sync(AsyncProtobufLogWriter.java:142) ~[hbase-server-2.1.2-**.jar:2.1.2-**]
        at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:351) ~[hbase-server-2.1.2-**.jar:2.1.2-**]
        at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume(AsyncFSWAL.java:534) ~[hbase-server-2.1.2-**.jar:2.1.2-**]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_112]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_112]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
{code}
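
For context, here is a rough paraphrase of the guard being discussed (not the 
verbatim branch-2.1 source; the field and helper names are abbreviated from 
AsyncFSWAL):

{code:java}
// Rough paraphrase of AsyncFSWAL.syncFailed (branch-2.1), for illustration
// only. Each writer is tagged with an epoch; the failure callback only acts
// when the failing sync belongs to the current, not-yet-broken writer.
private void syncFailed(long epochWhenSync, Throwable error) {
  consumeLock.lock();
  try {
    int currentEpochAndState = epochAndState;
    if (epoch(currentEpochAndState) != epochWhenSync
        || writerBroken(currentEpochAndState)) {
      // The condition linked above: a stale sync failure, i.e. the writer was
      // already rolled (epoch changed) or already marked broken. If the epoch
      // is always the same here, this early return should not be reachable
      // for the first failure of the current writer.
      return;
    }
    // Otherwise: mark the writer broken, move the unacked appends back to
    // toWriteAppends so the new writer replays them, and request a log roll.
  } finally {
    consumeLock.unlock();
  }
}
{code}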

{quote}Anyway, will try to write a UT to fail the sync multiple times (maybe 
1000) and also trigger log rolls, to see if we can finish without 
hanging.{quote}
Sounds like a hard one to reproduce. We may need to forcibly close the 
underlying output stream as well?
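
Something along these lines could be a starting point (a hypothetical sketch 
only; createTestWAL and failNextSyncs are made-up helpers for illustration, 
not existing test utilities):

{code:java}
import java.io.IOException;

import org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL;
import org.junit.Test;

public class TestWALSyncFailureAndRoll {

  @Test
  public void testManySyncFailuresWithLogRolls() throws Exception {
    AbstractFSWAL<?> wal = createTestWAL(); // hypothetical helper building a WAL on a test cluster
    for (int i = 0; i < 1000; i++) {
      failNextSyncs(wal, 1); // hypothetical hook: make the next sync throw IOException
      try {
        wal.sync();
      } catch (IOException expected) {
        // a failed sync should mark the writer broken and request a roll, not hang
      }
      wal.rollWriter(); // the roll must not deadlock on rollWriterLock
    }
    // reaching here within the test timeout means no shutdown/roll deadlock
    wal.shutdown();
  }
}
{code}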

> RegionServer abort failed when AbstractFSWAL.shutdown hang
> ----------------------------------------------------------
>
>                 Key: HBASE-22665
>                 URL: https://issues.apache.org/jira/browse/HBASE-22665
>             Project: HBase
>          Issue Type: Bug
>         Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>            Reporter: Yechao Chen
>            Priority: Major
>         Attachments: image-2019-07-08-16-07-37-664.png, 
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, 
> jstack_20190625, jstack_20190704_1, jstack_20190704_2, rs.log.part1, 
> rs.log_part2.zip
>
>
> We use HBase 2.1.2; when the RS is under heavy QPS, the RS aborts with an error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 300000 ms for txid=36380334, WAL system stuck?"
>  
> The RegionServer abort then failed because AbstractFSWAL.shutdown hung.
>  
> The jstack info always shows the RegionServer hanging in AbstractFSWAL.shutdown:
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
> tid=0x00007f204282c600 nid=0x34aa waiting on condition [0x00007f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x00007f18a49b2bb8> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>  {color:#FF0000}at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285){color}
> {color:#FF0000} at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815){color}
>  at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>  at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>  at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>  {color:#FF0000}at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117){color}
> {color:#FF0000} at java.lang.Thread.run(Thread.java:745){color}


