[
https://issues.apache.org/jira/browse/HBASE-14807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack updated HBASE-14807:
--------------------------
Attachment: 14807.second.attempt.txt
See below for description of the change from the patch. Let me test this on
cluster too to be sure since it makes a small change but in a critical part.
[~eclark] I think you know this section best. You good with this patch sir?
Patch includes loosening of the test so we continue when threads
run in not-expected order. Also includes minor clean ups in
FSHLog -- a formatting change, removal of an unused trace logging,
and a check so we don't create a new exception when not needed --
but it also includes a subtle so we check if we need to get to safe
point EVEN IF an outstanding exception. Previous we could by-pass
the safe point check. This should make us even more robust against
lockup (though this is a change that comes of code reading, not of
any issue seen in test).
Here is some detail on how I loosened the test:
The test can run in an unexpected order. Attempts at dictating the
order in which threads fire only had me deadlocking one latch
against another (the test latch vs the WAL zigzag latch) so I
gave up trying and instead, if we happen to go the unusual route of
rolling WALs and failing flush before the scheduled log roll
latch goes into place, just time out the run after a few seconds
and exit the test (but do not fail it); just log a WARN.
This is less than ideal but allows us keep some coverage of the
tricky scenario that was bringing on deadlock (a broken WAL that
is throwing exceptions getting stuck waiting on a sync to clear
out the ring buffer getting overshadowed by a subsequent append
added in by a concurrent flush).
> TestWALLockup is flakey
> -----------------------
>
> Key: HBASE-14807
> URL: https://issues.apache.org/jira/browse/HBASE-14807
> Project: HBase
> Issue Type: Bug
> Components: flakey, test
> Reporter: stack
> Assignee: stack
> Fix For: 2.0.0, 1.2.0, 1.3.0
>
> Attachments: 14807.patch, 14807.second.attempt.txt
>
>
> Fails frequently.
> Looks like this:
> {code}
> 2015-11-12 10:38:51,812 DEBUG [Time-limited test] regionserver.HRegion(3882):
> Found 0 recovered edits file(s) under
> /home/jenkins/jenkins-slave/workspace/HBase-1.2/jdk/latest1.7/label/Hadoop/hbase-server/target/test-data/8b8f8f12-1819-47e3-b1f1-8ffa789438ad/data/default/testLockupWhenSyncInMiddleOfZigZagSetup/c8694b53368f3301a8d370089120388d
> 2015-11-12 10:38:51,821 DEBUG [Time-limited test]
> regionserver.FlushLargeStoresPolicy(56):
> hbase.hregion.percolumnfamilyflush.size.lower.bound is not specified, use
> global config(16777216) instead
> 2015-11-12 10:38:51,880 DEBUG [Time-limited test] wal.WALSplitter(729): Wrote
> region
> seqId=/home/jenkins/jenkins-slave/workspace/HBase-1.2/jdk/latest1.7/label/Hadoop/hbase-server/target/test-data/8b8f8f12-1819-47e3-b1f1-8ffa789438ad/data/default/testLockupWhenSyncInMiddleOfZigZagSetup/c8694b53368f3301a8d370089120388d/recovered.edits/2.seqid
> to file, newSeqId=2, maxSeqId=0
> 2015-11-12 10:38:51,881 INFO [Time-limited test] regionserver.HRegion(868):
> Onlined c8694b53368f3301a8d370089120388d; next sequenceid=2
> 2015-11-12 10:38:51,994 ERROR [sync.1] wal.FSHLog$SyncRunner(1226): Error
> syncing, request close of WAL
> java.io.IOException: FAKE! Failed to replace a bad datanode...SYNC
> at
> org.apache.hadoop.hbase.regionserver.TestWALLockup$1DodgyFSLog$1.sync(TestWALLockup.java:162)
> at
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1222)
> at java.lang.Thread.run(Thread.java:745)
> 2015-11-12 10:38:51,997 DEBUG [Thread-4] regionserver.LogRoller(139): WAL
> roll requested
> 2015-11-12 10:38:52,019 DEBUG [flusher]
> regionserver.FlushLargeStoresPolicy(100): Since none of the CFs were above
> the size, flushing all.
> 2015-11-12 10:38:52,192 INFO [Thread-4]
> regionserver.TestWALLockup$1DodgyFSLog(129): LATCHED
> java.lang.InterruptedException: sleep interrupted
> at java.lang.Thread.sleep(Native Method)
> at org.apache.hadoop.hbase.util.Threads.sleep(Threads.java:146)
> at
> org.apache.hadoop.hbase.regionserver.TestWALLockup.testLockupWhenSyncInMiddleOfZigZagSetup(TestWALLockup.java:245)
> 2015-11-12 10:39:18,609 INFO [main] regionserver.TestWALLockup(91): Cleaning
> test directory:
> /home/jenkins/jenkins-slave/workspace/HBase-1.2/jdk/latest1.7/label/Hadoop/hbase-server/target/test-data/8b8f8f12-1819-47e3-b1f1-8ffa789438ad
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> at
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> at
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> at
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> ... then times out after being locked up for 30 seconds. Writes 50+MB of
> logs while spinning.
> Reported as this:
> {code}
> -------------------------------------------------------------------------------
> Test set: org.apache.hadoop.hbase.regionserver.TestWALLockup
> -------------------------------------------------------------------------------
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 198.23 sec
> <<< FAILURE! - in org.apache.hadoop.hbase.regionserver.TestWALLockup
> testLockupWhenSyncInMiddleOfZigZagSetup(org.apache.hadoop.hbase.regionserver.TestWALLockup)
> Time elapsed: 0.049 sec <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 30000
> milliseconds
> at org.apache.log4j.Category.callAppenders(Category.java:205)
> at org.apache.log4j.Category.forcedLog(Category.java:391)
> at org.apache.log4j.Category.log(Category.java:856)
> at
> org.apache.commons.logging.impl.Log4JLogger.debug(Log4JLogger.java:155)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1386)
> at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1352)
> at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1302)
> at
> org.apache.hadoop.hbase.regionserver.TestWALLockup.testLockupWhenSyncInMiddleOfZigZagSetup(TestWALLockup.java:260)
> {code}
> Failed here most recently:
> https://builds.apache.org/view/H-L/view/HBase/job/HBase-1.2/364/jdk=latest1.7,label=Hadoop/
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)