[
https://issues.apache.org/jira/browse/HBASE-8365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635488#comment-13635488
]
Jeffrey Zhong commented on HBASE-8365:
--------------------------------------
{quote}
nodeDataChangeEvent only will give the latest data because it will not be able
to read the old data
{quote}
ZooKeeper intentionally sends out notifications without passing along the
state that triggered them; it relies on clients to fetch the latest state
themselves. In addition, a ZooKeeper watcher is a one-time trigger, which means
it fires only once and the client needs to re-set the watcher on the same znode
to get the next notification.
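For reference, here is a minimal sketch of that pattern using the plain
ZooKeeper client API (the znode path and handler are made up for illustration;
this is not HBase's ZKUtil code):
{code}
// Minimal sketch of the one-time-trigger pattern: a NodeDataChanged
// notification carries no data, so the handler re-reads the znode and, by
// doing so, re-registers the watch for the next change.
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class OneShotWatchExample implements Watcher {
  private final ZooKeeper zk;
  private final String path = "/hbase/unassigned/some-region"; // hypothetical znode

  OneShotWatchExample(ZooKeeper zk) { this.zk = zk; }

  @Override
  public void process(WatchedEvent event) {
    if (event.getType() == Event.EventType.NodeDataChanged) {
      try {
        Stat stat = new Stat();
        // Fetch the *current* state and set a new watch in the same call;
        // the data may already differ from whatever triggered this event.
        byte[] data = zk.getData(path, this, stat);
        handle(data, stat.getVersion());
      } catch (KeeperException | InterruptedException e) {
        // real code would retry and restore the watch
      }
    }
  }

  private void handle(byte[] data, int version) {
    // act on the latest data, not on the event itself
  }
}
{code}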
In our case, from the log, the updates to the region's znode while the watcher
was set are: 1) opening->opening 2) opening->failed_open 3) failed_open->offline
4) offline->opening
The first notification (the one where we got FAILED_OPEN) was triggered by the
opening->opening update. By the time the Master handled the notification the
znode had already changed to failed_open; that is the first nodeDataChanged
trace.
The thing that puzzles me is that the watcher is re-set on the failed_open
state after we receive the first failed_open notification, so we should only
get another notification once the failed_open state changes again. Yet we still
get one more failed_open later from the same znode, and the data has the same
version as in the first notification. My guess is that either the ZK client
reads stale cached data when the node state changes from failed_open -> offline,
or a race condition on the ZK side causes the duplicate notifications.
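One cheap way to confirm (and defend against) this on the client side is to
remember the last data version handled per znode and flag any notification
whose re-read version is not newer. A rough sketch with the plain ZooKeeper
API follows; the bookkeeping map is purely illustrative and does not exist in
AssignmentManager:
{code}
// Rough sketch: drop or log a notification if the re-read znode data version
// is one we have already handled for that path.
import java.util.concurrent.ConcurrentHashMap;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class DedupingHandler {
  private final ZooKeeper zk;
  private final ConcurrentHashMap<String, Integer> lastHandledVersion =
      new ConcurrentHashMap<String, Integer>();

  public DedupingHandler(ZooKeeper zk) { this.zk = zk; }

  public void onNodeDataChanged(String path) throws Exception {
    Stat stat = new Stat();
    byte[] data = zk.getData(path, true, stat); // re-read latest data, re-set watch
    Integer last = lastHandledVersion.get(path);
    if (last != null && stat.getVersion() <= last) {
      System.out.println("duplicate/stale notification for " + path
          + " at version " + stat.getVersion());
      return;
    }
    lastHandledVersion.put(path, stat.getVersion());
    // ... proceed with the real transition handling on `data`
  }
}
{code}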
> Duplicated ZK notifications cause Master abort (or other unknown issues)
> ------------------------------------------------------------------------
>
> Key: HBASE-8365
> URL: https://issues.apache.org/jira/browse/HBASE-8365
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.94.6
> Reporter: Jeffrey Zhong
> Attachments: TestResult.txt
>
>
> The duplicated ZK notifications should happen in trunk as well. Since the way
> we handle ZK notifications is different in trunk, we don't see the issue
> there. I'll explain later.
> The issue makes TestMetaReaderEditor.testRetrying flaky with the error
> message {code}reader: count=2, t=null{code} A related run is at
> https://builds.apache.org/job/HBase-0.94/941/testReport/junit/org.apache.hadoop.hbase.catalog/TestMetaReaderEditor/testRetrying/
> The test case fails because of an IllegalStateException that aborts the
> master, so the remaining test cases after testRetrying fail as well.
> Below are the steps showing why the issue happens (region
> fa0e7a5590feb69bd065fbc99c228b36 is the region of interest):
> 1) Got the first notification event RS_ZK_REGION_FAILED_OPEN at 2013-04-04
> 17:39:01,197
> {code} DEBUG [pool-1-thread-1-EventThread] master.AssignmentManager(744):
> Handling transition=RS_ZK_REGION_FAILED_OPEN,
> server=janus.apache.org,42093,1365097126155,
> region=fa0e7a5590feb69bd065fbc99c228b36{code}
> In this step, AM tries to open the region on another RS in a separate thread.
> 2) Got the second notification event RS_ZK_REGION_FAILED_OPEN at 2013-04-04
> 17:39:01,200
> {code}DEBUG [pool-1-thread-1-EventThread] master.AssignmentManager(744):
> Handling transition=RS_ZK_REGION_FAILED_OPEN,
> server=janus.apache.org,42093,1365097126155,
> region=fa0e7a5590feb69bd065fbc99c228b36{code}
> 3) Later, got the OPENING notification event resulting from step 1, at
> 2013-04-04 17:39:01,288
> {code} DEBUG [pool-1-thread-1-EventThread] master.AssignmentManager(744):
> Handling transition=RS_ZK_REGION_OPENING,
> server=janus.apache.org,54833,1365097126175,
> region=fa0e7a5590feb69bd065fbc99c228b36{code}
> The ClosedRegionHandler from step 2 throws an IllegalStateException because it
> "Cannot transit it to OFFLINE" (the state is already OPENING from notification
> 3) and aborts the Master. This can happen in 0.94 because we handle
> notifications with an executorService, which opens the door to handling events
> out of order even though they are received in the order of the updates.
> I've confirmed that we don't have duplicated AM listeners and that both events
> were triggered by the same ZK data with exactly the same version. The issue can
> typically be reproduced once by running the testRetrying test case 20 times in
> a loop.
> There are several issues behind the failure:
> 1) Duplicated ZK notifications. Since a ZK watcher is a one-time trigger,
> duplicate notifications from the same data of the same version should not
> happen in the first place.
> 2) ZooKeeper watcher handling is wrong in both 0.94 and trunk, as follows:
> a) 0.94 handles notifications asynchronously, which may lead to handling them
> out of the order in which the events happened (see the sketch after this list)
> b) In trunk, we handle ZK notifications synchronously, which slows down other
> components such as SSH, log splitting etc. because we have a single
> notification queue
> c) In both trunk and 0.94, we could act on stale event data because we have a
> long listener list and the ZK node state may have changed by the time the
> event is handled. If a listener needs to act upon the latest state, it should
> re-fetch the data and verify that the data which triggered the handler hasn't
> changed.
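> As a toy illustration of a) (plain JDK executors, not HBase code): two
> notifications submitted in arrival order to a multi-threaded pool may complete
> in either order, which is exactly how a FAILED_OPEN handler can find the region
> already moved on to OPENING:
> {code}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> public class OrderingSketch {
>   public static void main(String[] args) {
>     // 0.94-style: each notification handler is handed to a free pool thread.
>     ExecutorService pool = Executors.newFixedThreadPool(2);
>     pool.submit(() -> System.out.println("handling FAILED_OPEN (1st event)"));
>     pool.submit(() -> System.out.println("handling the 2nd event"));
>     // With more than one thread there is no guarantee the first task finishes
>     // before the second starts; Executors.newSingleThreadExecutor() would
>     // preserve arrival order at the cost of parallelism.
>     pool.shutdown();
>   }
> }
> {code}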
> Suggestions:
> For 0.94, we can band-aid CloseRegionHandler to pass in the expected ZK data
> version and skip event handling on stale data, with minimal impact (a rough
> sketch follows below).
> For trunk, I'll open an improvement JIRA on ZK notification handling to provide
> more parallelism for handling unrelated notifications.
> For the duplicated ZK notifications, we need to bring in some ZK experts to
> take a look.
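> A rough sketch of what that band-aid could look like (a hypothetical handler,
> not the real CloseRegionHandler; only the version comparison matters here):
> {code}
> // Hypothetical handler sketch: capture the znode data version when the event
> // is queued, and bail out if the znode has moved on by the time it runs.
> import org.apache.zookeeper.ZooKeeper;
> import org.apache.zookeeper.data.Stat;
>
> class VersionedRegionHandler {
>   private final ZooKeeper zk;
>   private final String znodePath;      // region transition znode
>   private final int expectedVersion;   // version seen when the event was queued
>
>   VersionedRegionHandler(ZooKeeper zk, String znodePath, int expectedVersion) {
>     this.zk = zk;
>     this.znodePath = znodePath;
>     this.expectedVersion = expectedVersion;
>   }
>
>   void process() throws Exception {
>     Stat stat = zk.exists(znodePath, false);
>     if (stat == null || stat.getVersion() != expectedVersion) {
>       return; // znode deleted or changed since the event was queued: skip
>     }
>     // ... safe to apply the state transition for this event
>   }
> }
> {code}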
> Please let me know what you think, or if you have a better idea.
> Thanks!