For the benefit of future readers, a simple workaround for this issue is to:
# change the controller to a non-existent broker,
# delete the current assignment from zk, and
# then change the controller to an existing broker.

Maysam

On Wed, May 11, 2016 at 5:24 PM, Maysam Yabandeh <myaban...@dropbox.com> wrote:
> Hi
>
> I am wondering if it makes sense to remove
> {code}
> case nne: ZkNoNodeException =>
>   createPersistentPath(zkPath, jsonData)
>   debug("Created path %s with %s for partition reassignment".format(zkPath, jsonData))
> {code}
> from ZKUtils::updatePartitionReassignmentData, which has caused an
> incident for us.
>
> The code does not seem to do anything in the normal case: if the
> reassign path does not exist when removePartitionFromReassignedPartitions
> starts, it has nothing to write back to zk anyway. The only time the
> code kicks in is when the admin manually deletes the zk path in the
> middle of an update, and in that case it recreates the path, which
> essentially cancels the admin's attempt to stop a bad partition assignment.
>
> The incident in our case was a very large json file that was mistakenly
> used by an admin for partition assignment. The controller zk thread was in a
> busy loop removing partitions from this json file stored at zk, one by one.
> We attempted to stop the assignment by i) removing the zk path, and ii)
> changing the controller. However, due to the many zk update operations by
> the active controller, the path would be recreated over and over. Changing
> the controller also did not help, since the new controller resumes the
> badly started reassignment job by picking it up from zk.
>
> Simply removing createPersistentPath in the catch clause should avoid such
> problems and yet does not seem to change the intended semantics of
> removePartitionFromReassignedPartitions.
>
> Thoughts?
>
> Thanks
> Maysam
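
To make the race described in the quoted message concrete, here is a minimal sketch that models it with an in-memory store. Everything in it is an assumption for illustration: FakeZk, NoNodeException, and both update functions are hypothetical stand-ins, not Kafka's ZkUtils or the real ZooKeeper client, and the store simply assumes that updating a missing node fails. The second function shows one possible reading of the proposal, where a write to a deleted path is dropped.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the "node does not exist" error; not a real client class.
class NoNodeException(path: String) extends RuntimeException("no node: " + path)

// Hypothetical in-memory model of a ZooKeeper-like store.
class FakeZk {
  private val nodes = mutable.Map.empty[String, String]

  def create(path: String, data: String): Unit = nodes.update(path, data)
  def delete(path: String): Unit = { nodes.remove(path); () }
  def exists(path: String): Boolean = nodes.contains(path)

  // Assumed behavior: updating a missing node fails instead of creating it.
  def update(path: String, data: String): Unit =
    if (nodes.contains(path)) nodes.update(path, data)
    else throw new NoNodeException(path)
}

object ReassignSketch {
  val ReassignPath = "/admin/reassign_partitions"

  // Behavior described in the email: a missing node is recreated on update.
  def updateWithRecreate(zk: FakeZk, data: String): Unit =
    try zk.update(ReassignPath, data)
    catch { case _: NoNodeException => zk.create(ReassignPath, data) }

  // One reading of the proposal: a missing node means the admin cancelled
  // the reassignment, so the write is dropped.
  def updateWithoutRecreate(zk: FakeZk, data: String): Unit =
    try zk.update(ReassignPath, data)
    catch { case _: NoNodeException => () }

  // Replays the incident: the admin deletes the path, then the controller
  // writes back the remaining partitions. Returns whether the path exists afterwards.
  def pathSurvivesAdminDelete(recreate: Boolean): Boolean = {
    val zk = new FakeZk
    zk.create(ReassignPath, "p0,p1,p2") // admin submits the reassignment
    zk.delete(ReassignPath)             // admin deletes the path to cancel it
    if (recreate) updateWithRecreate(zk, "p1,p2")
    else updateWithoutRecreate(zk, "p1,p2")
    zk.exists(ReassignPath)
  }
}
```

In this model, pathSurvivesAdminDelete(recreate = true) leaves the path in place after the admin's delete, while recreate = false leaves it deleted. It says nothing about how the real controller would react to the missing path.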