[ https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17365629#comment-17365629 ]

Hudson commented on HBASE-25984:
--------------------------------
Results for branch branch-2.3
[build #239 on builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/239/]:
(x) *{color:red}-1 overall{color}*
----
details (if available):
(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/239/General_20Nightly_20Build_20Report/]
(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/239/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]
(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/239/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(x) {color:red}-1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/239/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color}
-- See build output for details.
(/) {color:green}+1 client integration test{color}
> FSHLog WAL lockup with sync future reuse [RS deadlock]
> ------------------------------------------------------
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
> Issue Type: Bug
> Components: regionserver, wal
> Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
> Reporter: Bharath Vissapragada
> Assignee: Bharath Vissapragada
> Priority: Critical
> Labels: deadlock, hang
> Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based). Under heavy load,
> we noticed that the WAL system locks up because of a subtle race in the code
> that reuses sync futures. This bug applies to all FSHLog implementations
> across branches.
> Symptoms:
> On heavily loaded clusters with a large write load, we noticed region servers
> hanging abruptly with full handler queues and stuck MVCC, indicating that
> appends/syncs are not making any progress.
> {noformat}
> WARN [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - STUCK for : 296000 millis. MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, regionName=1ce4003ab60120057734ffe367667dca}
> WARN [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - STUCK for : 296000 millis. MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
> java.lang.Object.wait(Native Method)
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .....
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point:
> {noformat}
> java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> Meanwhile, the ring buffer consumer thinks that some outstanding syncs still
> need to finish:
> {noformat}
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, the SyncRunner threads are idle and just waiting for work,
> implying that there are no pending SyncFutures that need to be run:
> {noformat}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall, the WAL system was deadlocked and could make no progress until it was
> aborted. I got to the bottom of this issue and have a patch that fixes it
> (more details are in the comments because of the word limit on the description).
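
Editor's illustration of the symptom shape described above, not the root cause (which the reporter covers in the issue comments) and not HBase code. It is a minimal sketch, assuming a toy future and a toy runner thread; StuckSyncDemo, ToyFuture, and run are hypothetical names. It shows how a waiter can time out on a future while the worker sits idle in LinkedBlockingQueue.take(), which is the combination visible in the stack traces: the waiter believes work is outstanding, the worker believes there is none.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Toy model of the symptom only. All names are hypothetical, not HBase APIs. */
public class StuckSyncDemo {

    /** Stand-in for a sync future: a waiter blocks until someone completes it. */
    static final class ToyFuture {
        private final CountDownLatch done = new CountDownLatch(1);

        void complete() {
            done.countDown();
        }

        boolean await(long timeout, TimeUnit unit) throws InterruptedException {
            return done.await(timeout, unit);
        }
    }

    /**
     * Returns true if the waiter was released before the timeout.
     * When handOff is false the future never reaches the queue, so the
     * runner stays parked in take() and the waiter times out.
     */
    static boolean run(boolean handOff, long timeoutMillis) throws InterruptedException {
        BlockingQueue<ToyFuture> queue = new LinkedBlockingQueue<>();

        // Toy runner: completes whatever futures it is handed, otherwise idles in take().
        Thread runner = new Thread(() -> {
            try {
                while (true) {
                    queue.take().complete();
                }
            } catch (InterruptedException e) {
                // Interrupted by the caller below: exit the loop and let the thread end.
            }
        }, "toy-sync-runner");
        runner.setDaemon(true);
        runner.start();

        ToyFuture future = new ToyFuture();
        if (handOff) {
            queue.put(future);
        }
        boolean released = future.await(timeoutMillis, TimeUnit.MILLISECONDS);

        runner.interrupt();
        runner.join();
        return released;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("handed off: released=" + run(true, 1000));
        System.out.println("lost:       released=" + run(false, 200));
    }
}
```

In the second call nothing is ever queued, so a thread dump taken during the wait would show the waiter blocked on the latch and the runner parked in take(), with neither able to make progress on its own.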