[
https://issues.apache.org/jira/browse/HELIX-818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Junkai Xue closed HELIX-818.
----------------------------
Resolution: Fixed
Should be fixed for Pinot already.
> State transition callbacks for online -> offline and offline -> dropped are
> sometimes not received
> --------------------------------------------------------------------------------------------------
>
> Key: HELIX-818
> URL: https://issues.apache.org/jira/browse/HELIX-818
> Project: Apache Helix
> Issue Type: Bug
> Reporter: Siddharth Teotia
> Priority: Major
>
> As part of the cluster integration tests in Pinot, we have seen that state
> transition callbacks are sometimes not received. Each unit test [here
> |[https://github.com/apache/incubator-pinot/pull/4498/commits/75c0d7eb76f38fd60497876eb7aa501ae048b05c#diff-30ee437b5c9317721c0d35de40a4f36dR456]]
> rebalances tables and moves segments between servers.
> After the test finishes rebalancing (which also means that external view has
> converged to new ideal state because we ensure it), we check for stats
> related to state transitions from ONLINE to OFFLINE and OFFLINE to DROPPED
> with the expectation that as part of rebalance, if a segment lost a server,
> then it should have received these 2 transitions. The test has a custom state
> model factory registered with Helix for each fake server it creates.
> For the above 2 state transitions, the factory methods bump stats and that's
> what we check for in tests.
> Earlier, when these tests were failing intermittently, it was possibly due to
> the stat variables not being volatile. The PR linked above attempts to
> re-enable these tests by changing the stats to atomic ints, since they are
> bumped by the Helix code that invokes the callbacks.
> Even after this change, I have still occasionally seen a test fail at either
> of the 2 state transitions. This happens both in Travis builds and sometimes
> when running the test locally in an IDE.
> An example failure is [here
> |[https://travis-ci.org/apache/incubator-pinot/jobs/569442912]]
> I am wondering whether there is a bug that sometimes prevents the state
> transition callbacks from being invoked. This raises the question of how the
> external view is getting updated as expected, since our tests check for that
> too (a server that lost a segment as part of rebalancing is no longer present
> in the host-state mapping of that segment in the external view). If the
> callback invocations are sometimes missed, how can the current state and
> subsequently the external view get updated correctly?
> Thanks for the help.
>
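For readers following the report, the counting approach it describes (stats bumped from the two state transition callbacks, held in atomic ints because the callbacks run on threads other than the test thread) can be sketched roughly as below. This is only an illustration under stated assumptions: it does not use the Helix API, and every name in it (TransitionStatsSketch, TransitionStats, simulate) is hypothetical rather than taken from the Pinot test or the linked PR. The thread pool merely stands in for whatever threads invoke the callbacks.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: none of these names come from Pinot or Helix.
class TransitionStatsSketch {

  // Counters that a state model's transition callbacks would bump.
  static final class TransitionStats {
    private final AtomicInteger onlineToOffline = new AtomicInteger();
    private final AtomicInteger offlineToDropped = new AtomicInteger();

    // Would be called from the ONLINE -> OFFLINE callback.
    void recordOnlineToOffline() {
      onlineToOffline.incrementAndGet();
    }

    // Would be called from the OFFLINE -> DROPPED callback.
    void recordOfflineToDropped() {
      offlineToDropped.incrementAndGet();
    }

    int onlineToOfflineCount() {
      return onlineToOffline.get();
    }

    int offlineToDroppedCount() {
      return offlineToDropped.get();
    }
  }

  // Simulates both callbacks firing once per segment on worker threads,
  // then waits for all of them to finish before returning the stats.
  static TransitionStats simulate(int segments, int threads) {
    TransitionStats stats = new TransitionStats();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int i = 0; i < segments; i++) {
      pool.submit(() -> {
        stats.recordOnlineToOffline();
        stats.recordOfflineToDropped();
      });
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
        throw new IllegalStateException("simulated callbacks did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for callbacks", e);
    }
    return stats;
  }

  public static void main(String[] args) {
    TransitionStats stats = simulate(100, 4);
    System.out.println("ONLINE -> OFFLINE: " + stats.onlineToOfflineCount());
    System.out.println("OFFLINE -> DROPPED: " + stats.offlineToDroppedCount());
  }
}
```

With atomic counters, increments made on the callback threads are not lost and are visible to the test thread once the callbacks have completed. The sketch says nothing about whether Helix delivers the callbacks, which is the question the report raises.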