Vinayak Borkar created HELIX-595:
------------------------------------
Summary: Possible deadlock in state transition sequence
Key: HELIX-595
URL: https://issues.apache.org/jira/browse/HELIX-595
Project: Apache Helix
Issue Type: Bug
Reporter: Vinayak Borkar
In my setup I have a resource that has about 160 partitions. The resource uses
the MasterSlave state model. The partitions have been configured to have just 1
replica. For some partitions (about 5), I am observing that there are two
replicas, one in MASTER mode and one in SLAVE mode. In addition, I am observing
an imbalance with respect to the MASTER replica placement on the machines I
have.
In discussions with Kishore, the conclusion was that a deadlock is occurring:
as Helix makes state transitions to correct the imbalance, it reaches a state
where any further transition would violate the constraints of the state model.
The MasterSlave state model allows at most one MASTER and at most R SLAVES (in
my case R = 1).
Say the current MASTER of a partition is on hostA, but Helix wants to move it
to hostB. Helix would run the following transitions:
hostA: t1(M -> S), t2(S -> O)
hostB: t3(O -> S), t4(S -> M)
If t1 and t2 happen before t3, then Helix eventually achieves the correct
placement of the master on hostB. However, if t3 runs first, then hostB will
have a SLAVE of the partition while hostA still holds mastership. Once this
happens, every transition that still needs to be performed violates a state
model constraint: t1 would create a second SLAVE, t4 would create a second
MASTER, and t2 cannot run until t1 has completed. So we end up with a MASTER on
hostA and a SLAVE on hostB for this partition.
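To make the stuck state concrete, here is a minimal Python sketch of the
scenario described above. It is not Helix code and does not use any Helix API:
the names LIMITS, TRANSITIONS, apply_transition, legal_next and run are
hypothetical, and the model assumes only the two bounds stated above (at most
one MASTER, at most R = 1 SLAVE) plus the four transitions t1 to t4.

```python
from collections import Counter

# Assumed upper bounds per state, taken from the report (R = 1).
LIMITS = {"MASTER": 1, "SLAVE": 1}

# Hypothetical encoding of t1..t4 as (host, from_state, to_state).
TRANSITIONS = {
    "t1": ("hostA", "MASTER", "SLAVE"),
    "t2": ("hostA", "SLAVE", "OFFLINE"),
    "t3": ("hostB", "OFFLINE", "SLAVE"),
    "t4": ("hostB", "SLAVE", "MASTER"),
}


def within_limits(placement):
    """True if no state exceeds its upper bound across all hosts."""
    counts = Counter(placement.values())
    return all(counts[state] <= limit for state, limit in LIMITS.items())


def apply_transition(placement, name):
    """Return the new placement, or None if the transition does not apply
    to the host's current state or would exceed a bound."""
    host, src, dst = TRANSITIONS[name]
    if placement[host] != src:
        return None
    candidate = {**placement, host: dst}
    return candidate if within_limits(candidate) else None


def legal_next(placement, pending):
    """Names of the pending transitions that could run right now."""
    return [t for t in pending if apply_transition(placement, t) is not None]


def run(order):
    """Apply transitions in the given order, starting with the MASTER on
    hostA. Stops at the first transition that cannot be applied and returns
    (placement, blocked_name); blocked_name is None if all were applied."""
    placement = {"hostA": "MASTER", "hostB": "OFFLINE"}
    for name in order:
        nxt = apply_transition(placement, name)
        if nxt is None:
            return placement, name
        placement = nxt
    return placement, None


if __name__ == "__main__":
    print(run(["t1", "t2", "t3", "t4"]))   # master ends up on hostB
    print(run(["t3", "t1", "t2", "t4"]))   # blocked at t1
    stuck, _ = run(["t3"])
    print(legal_next(stuck, ["t1", "t2", "t4"]))  # nothing left that can run
```

Under these assumptions, the ordering t1, t2, t3, t4 completes, while any
ordering that starts with t3 leaves hostA as MASTER and hostB as SLAVE with no
legal transition remaining among t1, t2 and t4.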
You can find the ZK logs corresponding to the MESSAGES for such a partition
here: http://pastebin.com/zqqSk4MA
Please let me know what other details would be necessary to get to the bottom
of this issue.