[ 
https://issues.apache.org/jira/browse/HBASE-14422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352383#comment-15352383
 ] 

Konstantin Ryakhovskiy commented on HBASE-14422:
------------------------------------------------

I checked out master, reverted commit e4bf77e2de54ab6ea17b95dc116af9abf24a332d, 
and modified one line so that the code compiles.

Thread 1 (T-1) is in retry mode.
Thread 2 (T-2) is in fast-fail mode.

When the mode is fast-fail, the counter "done" gets incremented (by T-2); 
therefore, at some point T-1 shouldn't call latch.await():
if (done.get() <= 1) 
  latches2[priviRetryCounter.get()].await();

T-2 increments the counter only when T-2 is in fast-fail mode:
boolean pffe = false;
if (!isPriviThreadLocal.get().get()) 
  pffe = !((FastFailInterceptorContext)context).isRetryDespiteFastFailMode();
...
if (!isPriviThreadLocal.get().get()) {
  if (pffe) done.incrementAndGet();

The problem is in PreemptiveFastFailInterceptor#inFastFailMode():
return (fInfo != null && 
  EnvironmentEdgeManager.currentTime() >
  (fInfo.timeOfFirstFailureMilliSec + this.fastFailThresholdMilliSec));

With some unlucky timing, T-2 is in retry mode instead of fast-fail mode and 
the counter "done" is not incremented: 
context.isRetryDespiteFastFailMode() returns true for T-2, which should never 
happen.
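
To make the timing dependence concrete, here is a minimal sketch of the same comparison with the clock passed in as a parameter. FastFailTimingSketch and its nested FailureInfo are hypothetical stand-ins, not the HBase classes; only the field used by the check is modeled, and the timestamps are made up for illustration.

```java
public class FastFailTimingSketch {

    // Hypothetical stand-in for the failure record; models only the one field the check reads.
    static final class FailureInfo {
        final long timeOfFirstFailureMilliSec;

        FailureInfo(long timeOfFirstFailureMilliSec) {
            this.timeOfFirstFailureMilliSec = timeOfFirstFailureMilliSec;
        }
    }

    // Same comparison as quoted above, but "now" is an explicit argument
    // instead of EnvironmentEdgeManager.currentTime().
    static boolean inFastFailMode(FailureInfo fInfo, long nowMilliSec,
                                  long fastFailThresholdMilliSec) {
        return fInfo != null
            && nowMilliSec > fInfo.timeOfFirstFailureMilliSec + fastFailThresholdMilliSec;
    }

    public static void main(String[] args) {
        FailureInfo fInfo = new FailureInfo(1_000L); // first failure at t = 1000 ms (made up)
        long threshold = 100L;                       // made-up threshold

        // A thread that evaluates the check before the threshold has elapsed
        // is still treated as being in retry mode.
        System.out.println(inFastFailMode(fInfo, 1_050L, threshold));

        // The same failure record evaluated slightly later yields fast-fail mode.
        System.out.println(inFastFailMode(fInfo, 1_101L, threshold));
    }
}
```

With identical failure state the answer flips purely on when the check runs, which is consistent with the intermittent behaviour described above.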

Can I just remove the check before incrementing the "done" counter, i.e. the
if (pffe) ... ?
Decreasing fastFailThresholdMilliSec might not help: it would lower the 
probability of the heisenbug, but would not remove it.
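
The difference between the guarded and the unconditional increment can be sketched in isolation. DoneCounterSketch and its method names are hypothetical; the sketch models only the "done" counter and the done.get() <= 1 guard quoted above, not the real test, and whether an unconditional increment preserves the intent of the test is the open question here.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class DoneCounterSketch {

    // Guarded variant, as in the quoted test code: count only when T-2 saw fast-fail mode.
    static void recordGuarded(AtomicInteger done, boolean pffe) {
        if (pffe) {
            done.incrementAndGet();
        }
    }

    // Variant asked about above: count every pass of T-2, regardless of pffe.
    static void recordUnconditional(AtomicInteger done) {
        done.incrementAndGet();
    }

    // The guard T-1 evaluates before waiting on the latch.
    static boolean wouldAwait(AtomicInteger done) {
        return done.get() <= 1;
    }

    public static void main(String[] args) {
        // Unlucky timing: T-2 ends up in retry mode, so pffe is false on both passes.
        AtomicInteger guarded = new AtomicInteger();
        recordGuarded(guarded, false);
        recordGuarded(guarded, false);
        System.out.println("guarded, T-1 would await: " + wouldAwait(guarded));

        AtomicInteger unconditional = new AtomicInteger();
        recordUnconditional(unconditional);
        recordUnconditional(unconditional);
        System.out.println("unconditional, T-1 would await: " + wouldAwait(unconditional));
    }
}
```

In the guarded case the counter never moves when pffe stays false, so T-1 keeps reaching the await; in the unconditional case the counter passes 1 after two passes of T-2 and T-1 skips it.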

> Fix TestFastFailWithoutTestUtil
> -------------------------------
>
>                 Key: HBASE-14422
>                 URL: https://issues.apache.org/jira/browse/HBASE-14422
>             Project: HBase
>          Issue Type: Task
>          Components: test
>            Reporter: stack
>            Priority: Minor
>              Labels: beginner
>
> TestFastFailWithoutTestUtil has a unit test, 
> testInterceptorIntercept50Times. Usually it passes but on occasion, the 
> latching between thread 1 and thread 2 goes awry and the test hangs. 
> Depends on the hardware but it seems to happen about one in 
> four runs here on an internal rig.
> HBASE-14421 changed the wait-on-latch to timeout and do a thread dump and 
> just let the test keep going.
> This issue is about digging in and figuring out why it hangs on the latches, 
> then fixing it so the test doesn't have to have the latch timeout. Hopefully 
> the threaddump helps.
> This one could be hard to fix since it is not easy to reproduce. Marking it 
> beginner anyways.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
