[
https://issues.apache.org/jira/browse/KAFKA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256255#comment-14256255
]
Parth Brahmbhatt commented on KAFKA-1788:
-----------------------------------------
[~nehanarkhede] [~junrao] Can you provide input on what you think needs to be
done here? There are 2 problems being discussed:
* No leader is actually available for a long time, which is the original issue
in this jira. This is the case where all replicas are in a single DC/AZ and that
DC/AZ faces an outage. In this case the record stays in RecordAccumulator
forever: no node is ever ready, so no retries are ever attempted, and because
the max retries are never exhausted the batch is never dropped. The only way I
see to solve this is by adding an expiry on batches and performing a cleanup of
expired batches.
* Stale metadata, because NetworkClient.leastLoadedNode() returns a bad node and
the client keeps retrying against that bad node. Unless I am missing something
here, I think this just indicates bad configuration. We could reduce the default
TCP connection-socket/read timeout so we can fail fast, but I am not entirely
sure we need to do anything in code to handle this case. The method already goes
through all the nodes in the bootstrap list, as leastLoadedNode() starts off
with this.metadata.fetch().nodes() and tries to find a good node with the fewest
outstanding requests.
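
To make the batch-expiry idea from the first point concrete, here is a minimal sketch. It is not Kafka's actual RecordAccumulator: the class BatchExpirySketch, its Batch type, and the expiryMs parameter are all hypothetical names chosen for illustration, and it assumes batches are appended in non-decreasing time order. The point is only that expiry is driven by batch age, so it works even when no leader is ever ready.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of age-based batch expiry; not Kafka's RecordAccumulator.
class BatchExpirySketch {

    /** A queued batch together with the time it was created. */
    static final class Batch {
        final String partition;
        final String payload;
        final long createdMs;

        Batch(String partition, String payload, long createdMs) {
            this.partition = partition;
            this.payload = payload;
            this.createdMs = createdMs;
        }
    }

    private final long expiryMs; // assumed config value: max age of a batch
    private final Map<String, Deque<Batch>> queues = new HashMap<>();

    BatchExpirySketch(long expiryMs) {
        this.expiryMs = expiryMs;
    }

    /** Queues a batch for a partition; callers pass non-decreasing timestamps. */
    void append(String partition, String payload, long nowMs) {
        queues.computeIfAbsent(partition, p -> new ArrayDeque<>())
              .addLast(new Batch(partition, payload, nowMs));
    }

    /**
     * Removes and returns every batch older than expiryMs. This does not
     * depend on any node being ready, so it also covers the no-leader case.
     */
    List<Batch> expireBatches(long nowMs) {
        List<Batch> expired = new ArrayList<>();
        for (Deque<Batch> queue : queues.values()) {
            // The oldest batch is at the head, so stop at the first fresh one.
            while (!queue.isEmpty() && nowMs - queue.peekFirst().createdMs > expiryMs) {
                expired.add(queue.pollFirst());
            }
        }
        return expired;
    }

    /** Number of batches still waiting to be sent. */
    int pending() {
        int total = 0;
        for (Deque<Batch> queue : queues.values()) {
            total += queue.size();
        }
        return total;
    }
}
```

In a real producer the sender loop would presumably call something like expireBatches() on each iteration, fail the callbacks of the returned batches with a timeout error, and release their buffer memory.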
> producer record can stay in RecordAccumulator forever if leader is not
> available
> -------------------------------------------------------------------------------
>
> Key: KAFKA-1788
> URL: https://issues.apache.org/jira/browse/KAFKA-1788
> Project: Kafka
> Issue Type: Bug
> Components: core, producer
> Affects Versions: 0.8.2
> Reporter: Jun Rao
> Assignee: Jun Rao
> Labels: newbie++
> Fix For: 0.8.3
>
>
> In the new producer, when a partition has no leader for a long time (e.g.,
> all replicas are down), the records for that partition will stay in the
> RecordAccumulator until the leader is available. This may cause the
> bufferpool to be full and the callback for the produced message to block for
> a long time.