For the benefit of future readers, a simple workaround for this issue
is to:
# change the controller to a non-existent broker,
# delete the current assignment from zk, and
# then change the controller to an existing broker
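
To illustrate why deleting the path alone was not enough while the active
controller kept writing, here is a minimal sketch. It is not Kafka code:
FakeZk, NoNodeException, updateWithRecreate and updateWithoutRecreate are
hypothetical names, the in-memory map only stands in for zk, and the path
string is used purely as an illustrative key. The first helper mimics the
catch clause quoted below, the second mimics the proposed change.

```scala
import scala.collection.mutable

// Hypothetical stand-in for ZkNoNodeException; not the real zkclient class.
class NoNodeException(path: String) extends RuntimeException(s"no node: $path")

// Hypothetical in-memory substitute for ZooKeeper, for illustration only.
class FakeZk {
  private val nodes = mutable.Map.empty[String, String]
  def create(path: String, data: String): Unit = nodes(path) = data
  def update(path: String, data: String): Unit =
    if (nodes.contains(path)) nodes(path) = data
    else throw new NoNodeException(path)
  def delete(path: String): Unit = { nodes.remove(path); () }
  def exists(path: String): Boolean = nodes.contains(path)
}

object ReassignRaceSketch {
  // Illustrative key for the reassignment data (an assumption of this sketch).
  val Path = "/admin/reassign_partitions"

  // Mimics the quoted catch clause: a missing node is silently recreated.
  def updateWithRecreate(zk: FakeZk, data: String): Unit =
    try zk.update(Path, data)
    catch { case _: NoNodeException => zk.create(Path, data) }

  // Mimics the proposed change: a missing node is left missing.
  // Returns true only if the data was actually written.
  def updateWithoutRecreate(zk: FakeZk, data: String): Boolean =
    try { zk.update(Path, data); true }
    catch { case _: NoNodeException => false }

  def main(args: Array[String]): Unit = {
    val zk = new FakeZk
    zk.create(Path, "p0,p1,p2")

    zk.delete(Path)                  // admin deletes the path to cancel
    updateWithRecreate(zk, "p1,p2")  // controller's next write
    println(zk.exists(Path))         // true: the deletion was undone

    zk.delete(Path)                  // admin deletes the path again
    updateWithoutRecreate(zk, "p2")  // write is dropped instead
    println(zk.exists(Path))         // false: the deletion sticks
  }
}
```

In the sketch, the deletion only sticks when no writer recreates the path,
which is what pointing the controller at a non-existent broker achieves in
the workaround above.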

Maysam

On Wed, May 11, 2016 at 5:24 PM, Maysam Yabandeh <myaban...@dropbox.com>
wrote:

> Hi
>
> I am wondering if it makes sense to remove
> {code}
>           case nne: ZkNoNodeException =>
>             createPersistentPath(zkPath, jsonData)
>             debug("Created path %s with %s for partition reassignment".format(zkPath, jsonData))
> {code}
> from ZkUtils::updatePartitionReassignmentData, which has caused an
> incident for us.
>
> The code does not seem to do anything in the normal case: if the
> reassignment path does not exist when removePartitionFromReassignedPartitions
> starts, it has nothing to write back to zk anyway. The only time that
> the code kicks in is when the admin manually deletes the zk path in the
> middle of an update, which essentially cancels the admin's attempt to stop a
> bad partition assignment.
>
> The incident in our case was a very large json file that was mistakenly
> used by an admin for partition assignment. The controller zk thread was in a
> busy loop removing partitions from this json file stored in zk, one by one.
> We attempted to stop the assignment by i) removing the zk path, and ii)
> changing the controller. However, due to the many zk update operations by
> the active controller, the path would be recreated over and over. Changing
> the controller also did not help, since the new controller resumes the
> badly started reassignment job by picking it up from zk.
>
> Simply removing createPersistentPath in the catch clause should avoid such
> problems and yet does not seem to change the intended semantics of
> removePartitionFromReassignedPartitions.
>
> Thoughts?
>
> Thanks
> Maysam
>
