> On Nov. 24, 2014, 10:35 p.m., Kishore Gopalakrishna wrote:
> > Code looks good, thanks for getting to the root. If I understand correctly, 
> > we are not changing priority list or any of the solutions described in 541.
> > 
> > It would be great to describe the overall logic in the code.

We are not changing any transition priority.

Here is the problem.
- At time t1: Node_0 is selected LEADER for partition_0, and we have a pending 
message, say STANDBY->LEADER for (partition_0, Node_0).

- At time t2 (>t1): Node_0 finishes the STANDBY->LEADER transition but hasn't 
deleted the message. Meanwhile, the controller selects another LEADER for 
partition_0, say Node_1. Since Node_0's current state is LEADER and we only 
consider toState in the pending message, the controller thinks it's OK to send 
OFFLINE->STANDBY to Node_1.

- At time t3 (>t2): Node_0 is in LEADER and Node_1 is in STANDBY. The controller 
now enters a livelock where it can send neither STANDBY->LEADER to Node_1 
nor LEADER->STANDBY to Node_0.

The fix is to consider both fromState and toState in the pending message, so the 
controller will not send OFFLINE->STANDBY to Node_1 if Node_0 hasn't removed the 
STANDBY->LEADER message. In the next Helix pipeline run, since the transition 
priority is LEADER->STANDBY > OFFLINE->STANDBY, the controller will bring Node_0 
to STANDBY then to OFFLINE first, avoiding the livelock.
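
To make the t2 scenario concrete, here is a minimal sketch of the two checks. It is not the Helix code: the class PendingTransitionSketch, the PendingMessage type, and both method names are hypothetical, and the "old" check is only my simplified reading of "considering toState only". The real logic lives in CurrentStateOutput and MessageSelectionStage.

```java
import java.util.HashMap;
import java.util.Map;

public class PendingTransitionSketch {

    /** Hypothetical stand-in for a pending state-transition message. */
    static final class PendingMessage {
        final String node;
        final String fromState;
        final String toState;

        PendingMessage(String node, String fromState, String toState) {
            this.node = node;
            this.fromState = fromState;
            this.toState = toState;
        }
    }

    /**
     * Assumed old check (toState only): the pending message is ignored once
     * the node's current state already equals the message's toState.
     */
    static boolean canSendToStateOnly(PendingMessage pending,
                                      Map<String, String> currentStates) {
        return pending == null
            || pending.toState.equals(currentStates.get(pending.node));
    }

    /**
     * Sketch of the fixed check (fromState and toState): while the message
     * exists, the node may be in either state, so no new transition is sent
     * for the partition until the message is removed.
     */
    static boolean canSendFromAndToState(PendingMessage pending) {
        return pending == null;
    }

    public static void main(String[] args) {
        // Time t2: Node_0 is already LEADER, but STANDBY->LEADER is not deleted.
        Map<String, String> currentStates = new HashMap<>();
        currentStates.put("Node_0", "LEADER");
        currentStates.put("Node_1", "OFFLINE");
        PendingMessage pending = new PendingMessage("Node_0", "STANDBY", "LEADER");

        // The old check allows OFFLINE->STANDBY for Node_1; the new one holds it.
        System.out.println("toState only:          "
            + canSendToStateOnly(pending, currentStates));
        System.out.println("fromState and toState: "
            + canSendFromAndToState(pending));
    }
}
```

With the second check, the controller waits until Node_0 removes the message, and the next pipeline run can then apply the normal transition priority.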


> On Nov. 24, 2014, 10:35 p.m., Kishore Gopalakrishna wrote:
> > helix-core/src/main/java/org/apache/helix/controller/stages/CurrentStateComputationStage.java,
> >  line 134
> > <https://reviews.apache.org/r/28413/diff/1/?file=774681#file774681line134>
> >
> >     do we expect partition to be not null? why is the else condition empty

partition should not be null


> On Nov. 24, 2014, 10:35 p.m., Kishore Gopalakrishna wrote:
> > helix-core/src/main/java/org/apache/helix/controller/stages/CurrentStateOutput.java,
> >  line 211
> > <https://reviews.apache.org/r/28413/diff/1/?file=774682#file774682line211>
> >
> >     looks like this is crux of the change, can we add more comments 
> > describing the logic and concept and why we are using pendingMsg

Will add more comments.


- Zhen


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28413/#review62885
-----------------------------------------------------------


On Nov. 24, 2014, 10:22 p.m., Zhen Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/28413/
> -----------------------------------------------------------
> 
> (Updated Nov. 24, 2014, 10:22 p.m.)
> 
> 
> Review request for helix and Shi Lu.
> 
> 
> Bugs: HELIX-541
> 
> 
> Repository: helix-git
> 
> 
> Description
> -------
> 
> In message selection stage, we should consider both toState and fromState in 
> pending messages. For example, assuming we have a STANDBY->LEADER pending 
> message, and the current state of the corresponding partition is in LEADER. 
> At this time if the rebalancer selects another LEADER for the partition and 
> message generation stage generates an OFFLINE->STANDBY message for the new 
> leader, we will enter into the livelock described in HELIX-541.
> 
> Fix it by including both fromState and toState in pending state calculation. 
> Add a test case that has a high probability of reproducing this problem.
> 
> 
> Diffs
> -----
> 
>   helix-core/src/main/java/org/apache/helix/controller/stages/CurrentStateComputationStage.java 6a30a9d 
>   helix-core/src/main/java/org/apache/helix/controller/stages/CurrentStateOutput.java ac9d748 
>   helix-core/src/main/java/org/apache/helix/controller/stages/MessageGenerationPhase.java 92964e9 
>   helix-core/src/main/java/org/apache/helix/controller/stages/MessageSelectionStage.java f3a8257 
>   helix-core/src/main/java/org/apache/helix/controller/stages/TaskAssignmentStage.java 5772385 
>   helix-core/src/main/java/org/apache/helix/controller/strategy/AutoRebalanceStrategy.java 1e7f275 
>   helix-core/src/main/java/org/apache/helix/task/FixedTargetTaskRebalancer.java 53d2ee9 
>   helix-core/src/main/java/org/apache/helix/task/TaskRebalancer.java 131236e 
>   helix-core/src/test/java/org/apache/helix/controller/stages/TestCurrentStateComputationStage.java 7687e18 
>   helix-core/src/test/java/org/apache/helix/controller/stages/TestMsgSelectionStage.java 820abbe 
>   helix-core/src/test/java/org/apache/helix/integration/TestControllerLiveLock.java e69de29 
> 
> Diff: https://reviews.apache.org/r/28413/diff/
> 
> 
> Testing
> -------
> 
> mvn test
> 
> 
> Thanks,
> 
> Zhen Zhang
> 
>
