[ https://issues.apache.org/jira/browse/HBASE-4798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151996#comment-13151996 ]
nkeywal commented on HBASE-4798:
--------------------------------
Ok, I've got it for TestRegionServerCoprocessorExceptionWithAbort as well.
It's because of my change in HRegionServer#stop:
{noformat}
public void stop(final String msg) {
  this.stopped = true;
  LOG.info("STOPPED: " + msg);
  // Wakes run() if it is sleeping
  sleeper.skipSleepCycle(); // <================= NEW
}
{noformat}
This is a notification that makes the region server stop *immediately* instead
of waiting for the current sleep to end.
So immediately that the poor client sees its region server disappear during
its put, and enters its usual retry loop. This is slow, and the test timeout
does not let it finish.
This bug was already there before my change; it's just less random now. Now
you get it 90% of the time, before it was 0.1% :-)
Note as well that the current implementation of HRegionServer#stop actually
intends to stop immediately but fails because it does the notify on the wrong
object. But the comment is clear on the intention. So a bug was hiding another
bug. Usual stuff :-).
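To illustrate the point about notifying on the right object, here is a minimal sketch of a sleeper whose sleep can be cut short. SleeperSketch, sleepLock and measureWake are hypothetical names for this example, not the actual HBase Sleeper code; the only thing it shows is that the wait and the notify must use the same monitor for the wake-up to work.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a sleeper; NOT the actual HBase Sleeper class. */
final class SleeperSketch {
    // The wait and the notify must both use this same monitor.
    private final Object sleepLock = new Object();
    private final long periodMs;
    private boolean triggerWake = false;

    SleeperSketch(long periodMs) {
        this.periodMs = periodMs;
    }

    /** Sleeps up to periodMs, returning early if skipSleepCycle() is called. */
    void sleep() throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(periodMs);
        synchronized (sleepLock) {
            // Loop to cope with spurious wake-ups.
            while (!triggerWake) {
                long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMs <= 0) {
                    break; // the full period elapsed without a wake-up
                }
                sleepLock.wait(remainingMs);
            }
            triggerWake = false;
        }
    }

    /** Wakes a thread blocked in sleep(). Notifying any other object would be a no-op. */
    void skipSleepCycle() {
        synchronized (sleepLock) {
            triggerWake = true;
            sleepLock.notifyAll();
        }
    }

    /** Returns how long sleep() lasted (ms) when skipSleepCycle() is called after delayMs. */
    static long measureWake(long periodMs, long delayMs) {
        SleeperSketch sleeper = new SleeperSketch(periodMs);
        Thread waker = new Thread(() -> {
            try {
                Thread.sleep(delayMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            sleeper.skipSleepCycle();
        });
        long start = System.nanoTime();
        waker.start();
        try {
            sleeper.sleep();
            waker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }
}
```

With a 5s period and a wake-up after 50ms, measureWake should return well before the period ends. If skipSleepCycle() notified a different object, sleep() would run for the full period.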
I am not sure how to fix this cleanly. We could launch a thread that would
wait before aborting in handleCoprocessorThrowable, but it's more of a workaround
than anything else. @stack, @eugene, what do you think?
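For discussion, a rough sketch of that workaround, under assumptions: DelayedAbortSketch, its nested Abortable interface and abortLater are hypothetical names for this example, and the real handleCoprocessorThrowable code is not shown. The idea is only that the abort runs on a separate thread after a grace delay, so the call in progress has a chance to return to the client first.

```java
final class DelayedAbortSketch {

    /** Hypothetical stand-in for the server's abort entry point. */
    interface Abortable {
        void abort(String why, Throwable cause);
    }

    /**
     * Starts a daemon thread that waits delayMs and then aborts the server.
     * Returns the thread so that callers (or tests) can join on it.
     */
    static Thread abortLater(Abortable server, String why, Throwable cause, long delayMs) {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(delayMs);
            } catch (InterruptedException e) {
                // Restore the flag and abort right away rather than skipping the abort.
                Thread.currentThread().interrupt();
            }
            server.abort(why, cause);
        }, "delayed-abort");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

The delay is arbitrary, which is why this stays a workaround: it narrows the window in which the client sees the server vanish mid-put but does not close it.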
> Sleeps and synchronisation improvements for tests
> -------------------------------------------------
>
> Key: HBASE-4798
> URL: https://issues.apache.org/jira/browse/HBASE-4798
> Project: HBase
> Issue Type: Improvement
> Components: master, regionserver, test
> Affects Versions: 0.94.0
> Environment: all
> Reporter: nkeywal
> Assignee: nkeywal
> Priority: Minor
> Attachments: 4798_trunk_all.v2.patch
>
>
> Multiple small changes:
> @committers: Removing some sleeps exposed a bug in
> JVMClusterUtil#HMaster#waitForServerOnline, so I had to add a synchronization
> point. You may want to review this.
> JVMClusterUtil#HMaster#waitForServerOnline: removed, the condition was never
> met (test on "!c && !!c"). Added a new synchronization point.
> AssignmentManager#waitForAssignment: add a timeout on the wait => not stuck
> if the notification is received before the wait.
> HMaster#loop: use a notification instead of a 1s sleep
> HRegionServer#waitForServerOnline: new method used by
> JVMClusterUtil#waitForServerOnline() to replace a 1s sleep by a notification
> HRegionServer#getMaster(): 1s sleeps replaced by one 0.1s sleep and one 0.2s
> sleep
> HRegionServer#stop: use a notification on sleeper to shorten shutdown by 0.5s
> ZooKeeperNodeTracker#start: replace a recursive call by a loop
> ZooKeeperNodeTracker#blockUntilAvailable: add a timeout on the wait => not
> stuck if the notification is received before the wait.
> HBaseTestingUtility#expireSession: use a timeout of 1s instead of 5s
> TestZooKeeper#testClientSessionExpired: use a timeout of 1s instead of 5s,
> with the change on HBaseTestingUtility we are 60s faster
> TestRegionRebalancing#waitForAllRegionsAssigned: use a sleep of 0.2s instead
> of 1s
> TestRestartCluster#testClusterRestart: send all the table creation together,
> then check creation, should be faster
> TestHLog: shutdown the whole cluster instead of DFS only (more standard)
> JVMClusterUtil#startup: lower the sleep from 1s to 0.1s
> HConnectionManager#close: Zookeeper name in debug message from
> HConnectionManager after connection close was always null because it was set
> to null in the delete.