[ 
https://issues.apache.org/jira/browse/HELIX-818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junkai Xue closed HELIX-818.
----------------------------
    Resolution: Fixed

Should be fixed for Pinot already.

> State transition callbacks for online -> offline and offline -> dropped are 
> sometimes not received
> --------------------------------------------------------------------------------------------------
>
>                 Key: HELIX-818
>                 URL: https://issues.apache.org/jira/browse/HELIX-818
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Siddharth Teotia
>            Priority: Major
>
> As part of the cluster integration tests in Pinot, we have seen that state 
> transition callbacks are sometimes not received. Each unit test here:
> https://github.com/apache/incubator-pinot/pull/4498/commits/75c0d7eb76f38fd60497876eb7aa501ae048b05c#diff-30ee437b5c9317721c0d35de40a4f36dR456
> rebalances tables and moves segments between servers. 
> After the test finishes rebalancing (which also means that the external view 
> has converged to the new ideal state, because the test verifies it), we check 
> the stats for the ONLINE to OFFLINE and OFFLINE to DROPPED state transitions, 
> with the expectation that if a segment lost a server as part of the 
> rebalance, that server should have received these 2 transitions. The test 
> has a custom state model factory registered with Helix for each fake server 
> it creates. For the above 2 state transitions, the factory methods bump 
> stats, and that is what we check for in the tests. 
> Earlier, when these tests were failing intermittently, it was possibly 
> because the stat variables were not volatile. The PR linked above attempts 
> to re-enable these tests by changing the stats to atomic integers, since 
> they are bumped by the Helix code that invokes the callbacks.
> Even after this change, for some reason, I have occasionally seen a test 
> fail at either of the 2 state transitions. This happens both in Travis 
> builds and sometimes when running the test locally in an IDE.
> An example failure is here:
> https://travis-ci.org/apache/incubator-pinot/jobs/569442912
> I am wondering if there is a potential bug due to which the state 
> transition callbacks are sometimes not invoked. This raises the question of 
> how the external view is getting updated as expected, since our tests check 
> for that too (the server that lost a segment as part of rebalancing is no 
> longer present in the host-state mapping of that segment in the external 
> view). If the callback invocations are sometimes missed, how is it possible 
> for the current state, and subsequently the external view, to be updated 
> correctly?
> Thanks for the help.
>  
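
For reference, the stat counters described in the issue can be sketched roughly as follows. This is only an illustration in plain Java and does not use the Helix API: the class TransitionStats, its method names, and the thread pool that stands in for the threads invoking the callbacks are all hypothetical and are not taken from the Pinot test or the linked PR.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the stats bumped by the test's state model callbacks.
class TransitionStats {
    // AtomicInteger makes each increment atomic and visible across threads.
    private final AtomicInteger onlineToOffline = new AtomicInteger();
    private final AtomicInteger offlineToDropped = new AtomicInteger();

    // Would be called from the ONLINE -> OFFLINE callback.
    void onBecomeOfflineFromOnline() {
        onlineToOffline.incrementAndGet();
    }

    // Would be called from the OFFLINE -> DROPPED callback.
    void onBecomeDroppedFromOffline() {
        offlineToDropped.incrementAndGet();
    }

    int onlineToOfflineCount() {
        return onlineToOffline.get();
    }

    int offlineToDroppedCount() {
        return offlineToDropped.get();
    }

    // Simulates callbacks for `segments` segments arriving on `threads` worker
    // threads (an assumption standing in for the framework's callback threads).
    static TransitionStats simulate(int segments, int threads) {
        TransitionStats stats = new TransitionStats();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < segments; i++) {
            pool.execute(() -> {
                stats.onBecomeOfflineFromOnline();
                stats.onBecomeDroppedFromOffline();
            });
        }
        pool.shutdown();
        try {
            // Wait for all simulated callbacks before the counters are read.
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("callbacks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return stats;
    }

    public static void main(String[] args) {
        TransitionStats stats = simulate(1000, 8);
        System.out.println("ONLINE->OFFLINE: " + stats.onlineToOfflineCount());
        System.out.println("OFFLINE->DROPPED: " + stats.offlineToDroppedCount());
    }
}
```

With atomic counters, no increment is lost as long as the callback is actually invoked, so a lower count than expected would point at a missing invocation rather than a lost update. A plain int field incremented with ++ would not give that guarantee.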



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
