[
https://issues.apache.org/jira/browse/HELIX-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098684#comment-15098684
]
Marco P. commented on HELIX-621:
--------------------------------
All right, mystery solved. It turns out this is not a bug but a feature:
https://github.com/apache/helix/blob/helix-0.6.x/helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixManager.java#L808-L812
The anti-flapping mechanism is kicking in: my spectator fires one last
notification (shown above), then dies, leaving the impression of a permanently
inconsistent state view.
What I omitted before (for simplicity, though it turns out to be very relevant)
is that I'm doing failure-injection tests. I was "killing" my participants by
cutting them off from ZooKeeper.
As a side effect, my spectator keeps bouncing from one ZooKeeper server to
the next.
After a while it is marked as 'flapping' and stops delivering notifications.
Apologies for the false alarm.
What is the best way to act on this state change ("disconnected because
flapping")?
It went unnoticed here for a long time. I would like to keep it from going
unnoticed again and take some action when it happens, such as shutting down or
raising alarms (rather than just silently ceasing to deliver notifications).
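One possible approach, sketched here under assumptions rather than taken from Helix: poll a connectivity check periodically and run an alarm or shutdown action the first time it reports a disconnect. `ConnectionWatchdog` and its methods are hypothetical names. The sketch deliberately takes a plain `BooleanSupplier` so it does not depend on any Helix class.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Hypothetical watchdog: polls a connectivity check and reacts once when it fails. */
public class ConnectionWatchdog implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "connection-watchdog");
                t.setDaemon(true); // do not keep the JVM alive just for the watchdog
                return t;
            });
    private final AtomicBoolean tripped = new AtomicBoolean(false);
    private final BooleanSupplier isConnected;
    private final Runnable onDisconnected;

    public ConnectionWatchdog(BooleanSupplier isConnected, Runnable onDisconnected) {
        this.isConnected = isConnected;
        this.onDisconnected = onDisconnected;
    }

    /** Starts polling at a fixed period. */
    public void start(long period, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                check();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs; log and keep polling.
                System.err.println("watchdog check failed: " + e);
            }
        }, period, period, unit);
    }

    /** Runs one check; the action fires at most once, on the first observed disconnect. */
    public void check() {
        if (!isConnected.getAsBoolean() && tripped.compareAndSet(false, true)) {
            onDisconnected.run();
        }
    }

    public boolean hasTripped() {
        return tripped.get();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

With Helix this might be wired up as `new ConnectionWatchdog(manager::isConnected, () -> raiseAlarm())`, where `manager` is the spectator's `HelixManager` and `raiseAlarm()` is a placeholder for the application's own handling. This assumes `isConnected()` returns false after the flapping disconnect, which I have not verified.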
> Missing listener notification of LiveInstances changes (and possibly other
> state change)
> ----------------------------------------------------------------------------------------
>
> Key: HELIX-621
> URL: https://issues.apache.org/jira/browse/HELIX-621
> Project: Apache Helix
> Issue Type: Bug
> Components: helix-core
> Affects Versions: 0.6.5
> Reporter: Marco P.
>
> I noticed that sometimes my LiveInstanceChangeListener was not notified of an
> instance disconnecting.
> Digging a little bit I found out:
> - A reliable way to consistently reproduce this problem
> - The problem does not seem to be limited to LiveInstances; it can happen to
> other listeners using the same strategy
> This is bad because an application relies on notifications, and its view of the
> system (LiveInstances or otherwise) can become very outdated.
> The problem at the core is this logic:
> 1) Set watch W on some path P
> 2) Event E1 modifies P triggering W
> 3) The callback for W re-sets W on P
> If, however, a second event E2 modifies P between steps 2 and 3, W will not
> trigger (until P is modified again).
> An example of why this is bad:
> - 2 live instances L1, L2 and a spectator S watching them.
> 1) L1 disconnects
> 2) S's watch on LIVEINSTANCES fires
> 3) S reads the children of LIVEINSTANCES: {L2}
> 4) L2 disconnects
> 5) S notifies LiveInstanceChangeListeners and goes back to watching
> LIVEINSTANCES
> The application receives a notification that the live instances now consist
> of {L2}.
> And no further notification until another instance joins.
> The reality is that no instances are live.
> Again, this is not limited to LIVEINSTANCES, although that's the one I can
> reliably reproduce.
> Fixing this is not trivial: it requires firing the watch again when
> re-setting it IF the version of the watched node has changed since the last
> time the watch fired.
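The sequence in the quoted description, and the version check it proposes, can be illustrated with a small in-memory model. This is a toy sketch, not ZooKeeper or Helix code: `Node`, `Spectator` and the injected "E2" hook are hypothetical, and the one-shot watch is simulated with a plain `Runnable`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Consumer;

/** Toy model of a one-shot watch that is re-set inside its own callback. */
public class VersionedWatchDemo {

    /** A node with children, a version bumped on every change, and a one-shot watch. */
    static final class Node {
        private final Set<String> children = new TreeSet<>();
        private int version = 0;
        private Runnable watch; // cleared when it fires

        void add(String c) { if (children.add(c)) changed(); }
        void remove(String c) { if (children.remove(c)) changed(); }
        int version() { return version; }
        Set<String> children() { return new TreeSet<>(children); }
        void setWatch(Runnable w) { watch = w; }

        private void changed() {
            version++;
            Runnable w = watch;
            watch = null;
            if (w != null) w.run();
        }
    }

    /** Spectator that optionally re-checks the node version after re-setting the watch. */
    static final class Spectator {
        private final Node node;
        private final Consumer<Set<String>> listener;
        private final Runnable afterRead; // hook used to inject event E2 while unwatched
        private final boolean checkVersion;

        Spectator(Node node, Consumer<Set<String>> listener,
                  Runnable afterRead, boolean checkVersion) {
            this.node = node;
            this.listener = listener;
            this.afterRead = afterRead;
            this.checkVersion = checkVersion;
        }

        void arm() { node.setWatch(this::onWatchFired); }

        private void onWatchFired() {
            int seen;
            do {
                seen = node.version();
                listener.accept(node.children()); // read and notify
                afterRead.run();                   // E2 may land here
                node.setWatch(this::onWatchFired); // re-set the watch
                // With the check enabled, a version change since the read means
                // an event was missed, so read and notify again.
            } while (checkVersion && node.version() != seen);
        }
    }

    /** Runs the L1/L2 scenario and returns every view delivered to the listener. */
    public static List<Set<String>> run(boolean checkVersion) {
        Node live = new Node();
        live.add("L1");
        live.add("L2");
        List<Set<String>> delivered = new ArrayList<>();
        Spectator s = new Spectator(live, delivered::add,
                () -> live.remove("L2"), checkVersion);
        s.arm();
        live.remove("L1"); // E1 fires the watch; E2 (L2 leaving) follows before re-arm
        return delivered;
    }

    public static void main(String[] args) {
        System.out.println("without version check: " + run(false)); // [[L2]]
        System.out.println("with version check:    " + run(true));  // [[L2], []]
    }
}
```

Without the check the listener's last view is `{L2}` although no instance is live; with the check it receives a second, empty view.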
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)