[ https://issues.apache.org/jira/browse/KAFKA-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990238#comment-14990238 ]
ASF GitHub Bot commented on KAFKA-2743:
---------------------------------------
GitHub user ewencp opened a pull request:
https://github.com/apache/kafka/pull/422
KAFKA-2743: Make forwarded task reconfiguration requests asynchronous, run
on a separate thread, and backoff before retrying when they fail.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ewencp/kafka task-reconfiguration-async-with-backoff
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/kafka/pull/422.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #422
----
commit 8a30a78b9222ed8fec5143a41db5cf8e6e9efbc7
Author: Ewen Cheslack-Postava <[email protected]>
Date: 2015-11-03T05:30:32Z
KAFKA-2743: Make forwarded task reconfiguration requests asynchronous, run
on a separate thread, and backoff before retrying when they fail.
----
> Forwarding task reconfigurations in Copycat can deadlock with rebalances and
> has no backoff
> -------------------------------------------------------------------------------------------
>
> Key: KAFKA-2743
> URL: https://issues.apache.org/jira/browse/KAFKA-2743
> Project: Kafka
> Issue Type: Bug
> Components: copycat
> Reporter: Ewen Cheslack-Postava
> Assignee: Ewen Cheslack-Postava
> Fix For: 0.9.0.0
>
>
> There are two issues with the way we're currently forwarding task
> reconfigurations. First, the forwarding is performed synchronously in the
> DistributedHerder's main processing loop. If node A forwards a task
> reconfiguration and node B has started a rebalance process, we can end up
> with a distributed deadlock because node A will be blocking on the HTTP
> request in the thread that would otherwise handle heartbeating and rebalancing.
> Second, we currently just retry aggressively with no backoff. In some cases
> the node currently thought to be the leader will legitimately be down
> (it shut down and the node sending the request has not rebalanced yet), so we
> need some backoff to avoid unnecessarily hammering the network and generating
> the huge log files that result from constant errors.
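The fix described above can be sketched as follows. This is a minimal, hypothetical illustration in Java of the general pattern (submit the forwarded request to a dedicated thread and retry with exponential backoff); the names `AsyncForwarder`, `Request`, and the backoff constants are illustrative and are not Copycat's actual API.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: forward task reconfiguration requests on a separate thread with
// exponential backoff, so the herder's main loop (which handles heartbeating
// and rebalancing) is never blocked on the HTTP request to the leader.
public class AsyncForwarder {
    private static final long INITIAL_BACKOFF_MS = 250;
    private static final long MAX_BACKOFF_MS = 4000;

    // Daemon thread so the forwarder does not block JVM shutdown.
    private final ExecutorService forwardExecutor =
        Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "task-reconfiguration-forwarder");
            t.setDaemon(true);
            return t;
        });

    // Stand-in for the HTTP request forwarded to the current leader.
    public interface Request {
        void send() throws Exception;
    }

    // Submit asynchronously; retry with doubling backoff until success.
    public void forwardAsync(Request request) {
        forwardExecutor.submit(() -> {
            long backoffMs = INITIAL_BACKOFF_MS;
            while (true) {
                try {
                    request.send();
                    return; // forwarded successfully
                } catch (Exception e) {
                    // Leader may legitimately be down; back off before
                    // retrying instead of hammering the network.
                    try {
                        Thread.sleep(backoffMs);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    backoffMs = Math.min(backoffMs * 2, MAX_BACKOFF_MS);
                }
            }
        });
    }
}
```

Because the retry loop runs on its own thread, a rebalance started concurrently on another node can proceed without waiting on the in-flight HTTP request, which removes the deadlock scenario described above.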
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)