[
https://issues.apache.org/jira/browse/HBASE-14422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352383#comment-15352383
]
Konstantin Ryakhovskiy commented on HBASE-14422:
------------------------------------------------
I checked out master, reverted commit e4bf77e2de54ab6ea17b95dc116af9abf24a332d,
and modified one line so that the code compiles.
Thread 1 (T-1) is in retry mode.
Thread 2 (T-2) is in fast-fail mode.
When the mode is fast-fail, the counter "done" gets incremented (by T-2),
so at some point T-1 should no longer call latch.await():
  if (done.get() <= 1)
    latches2[priviRetryCounter.get()].await();
T-2 increments the counter only when it is in fast-fail mode:
  boolean pffe = false;
  if (!isPriviThreadLocal.get().get())
    pffe = !((FastFailInterceptorContext)context).isRetryDespiteFastFailMode();
  ...
  if (!isPriviThreadLocal.get().get()) {
    if (pffe) done.incrementAndGet();
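To make the dependency between the "done" counter and the latch concrete, here is a minimal sketch. It is not the test code: the class LatchGuardSketch and the method waiterProceeds are hypothetical, the unbounded await() is replaced by a 50 ms timed await so the example terminates, and it assumes nothing else counts the latch down.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class LatchGuardSketch {
    // Hypothetical model of T-1's guard. Returns true if the waiter gets past
    // the guard, false if it is stuck on the latch until the timeout expires.
    static boolean waiterProceeds(boolean otherThreadFastFailed) {
        AtomicInteger done = new AtomicInteger(0);
        // Assumption: nobody counts this latch down in this scenario.
        CountDownLatch latch = new CountDownLatch(1);

        if (otherThreadFastFailed) {
            // Stands in for T-2 incrementing "done" while in fast-fail mode.
            done.incrementAndGet();
            done.incrementAndGet();
        }

        // Same shape as the guard quoted above, with a timed await.
        if (done.get() <= 1) {
            try {
                return latch.await(50, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(waiterProceeds(true));  // guard skipped, waiter proceeds
        System.out.println(waiterProceeds(false)); // waiter blocks until the timeout
    }
}
```

With an unbounded await(), the second case would be the hang: if T-2 never increments "done", T-1 keeps entering the await.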
The problem is in PreemptiveFastFailInterceptor#inFastFailMode():
  return (fInfo != null &&
      EnvironmentEdgeManager.currentTime() >
      (fInfo.timeOfFirstFailureMilliSec + this.fastFailThresholdMilliSec));
With some "unlucky" timing, T-2 is in retry mode instead of fast-fail mode and
the counter "done" is not incremented:
context.isRetryDespiteFastFailMode() returns true for T-2, which should never
happen.
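The timing dependence can be sketched in isolation. This is a simplified stand-in, not the HBase method: the class FastFailTimingSketch is hypothetical, and the failure timestamp and the clock are passed in as plain arguments instead of being read from fInfo and EnvironmentEdgeManager.

```java
public class FastFailTimingSketch {
    // Hypothetical simplification of the check quoted above. A null
    // timeOfFirstFailureMs stands in for "fInfo == null".
    static boolean inFastFailMode(Long timeOfFirstFailureMs, long nowMs, long thresholdMs) {
        return timeOfFirstFailureMs != null
            && nowMs > timeOfFirstFailureMs + thresholdMs;
    }

    public static void main(String[] args) {
        long firstFailure = 1_000L;
        long threshold = 100L;

        // A call that arrives inside the threshold window is not fast-fail.
        System.out.println(inFastFailMode(firstFailure, 1_050L, threshold)); // false
        // The same call slightly later is fast-fail.
        System.out.println(inFastFailMode(firstFailure, 1_101L, threshold)); // true
        // No recorded failure means no fast-fail, whatever the time.
        System.out.println(inFastFailMode(null, 5_000L, threshold));         // false
    }
}
```

Whether T-2 ends up in fast-fail mode therefore depends only on when it happens to make the call relative to the first recorded failure, which the test does not control.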
Can I just remove the check before incrementing the "done" counter, i.e. the
if (pffe) ... ?
Decreasing fastFailThresholdMilliSec might not help: it would reduce the
probability of the heisenbug, but would not remove it.
> Fix TestFastFailWithoutTestUtil
> -------------------------------
>
> Key: HBASE-14422
> URL: https://issues.apache.org/jira/browse/HBASE-14422
> Project: HBase
> Issue Type: Task
> Components: test
> Reporter: stack
> Priority: Minor
> Labels: beginner
>
> TestFastFailWithoutTestUtil has a unit test,
> testInterceptorIntercept50Times. Usually it passes, but on occasion the
> latching between thread 1 and thread 2 goes awry and the test hangs.
> It depends on the hardware, but it seems to happen about one in
> four runs here on an internal rig.
> HBASE-14421 changed the wait-on-latch to time out, do a thread dump, and
> just let the test keep going.
> This issue is about digging in and figuring out why the latches hang, and
> then fixing it so the test doesn't need the latch timeout. Hopefully
> the thread dump helps.
> This one could be hard to fix since it is not easy to reproduce. Marking it
> beginner anyway.