[ https://issues.apache.org/jira/browse/KAFKA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234595#comment-14234595 ]
Bhavesh Mistry commented on KAFKA-1788:
---------------------------------------

We also need to fix Producer.close(), which hangs the JVM because the IO thread never exits and the join() on it never returns. Please refer to KAFKA-1642 for more details. The Kafka core devs need to give guidance on how to solve this problem. Please see the comments below from that linked issue.

1) The Producer.close() issue is not addressed by the patch. If the network connection is lost or some other failure occurs, the IO thread is not killed and the close method hangs. In the patch that I provided, I had a timeout on the join and interrupted the IO thread. I think we need a similar solution.

[~ewencp],

1. I'm specifically trying to address the CPU usage here. I realize that from your perspective they are closely related, since they can both be triggered by a loss of network connectivity, but internally they're really separate issues: the CPU usage has to do with incorrect timeouts, and the join() issue is due to the lack of timeouts on produce operations. That's why I pointed you toward KAFKA-1788. If a timeout is added for data in the producer, that would resolve the close issue as well, since any data waiting in the producer would eventually time out and the IO thread could exit. I think that's the cleanest solution since it solves both problems with a single setting (the amount of time you're willing to wait before discarding data). If you think a separate timeout specifically for Producer.close() is worthwhile, I'd suggest filing a separate JIRA for that.
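For illustration, the "timeout on join, then interrupt the IO thread" idea from point 1 could be sketched roughly as follows. This is not the patch attached to KAFKA-1642 and not the Kafka producer's code: the class BoundedCloseSketch, the method closeWithTimeout, and the timeout values are hypothetical names chosen for this example, and the sleeping thread only stands in for an IO thread that is stuck waiting.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch only; none of these names come from Kafka.
final class BoundedCloseSketch {

    /**
     * Waits up to timeoutMs for ioThread to exit. If it is still alive,
     * interrupts it and waits once more for the same period.
     * Returns true if the thread has terminated by the time we return.
     */
    static boolean closeWithTimeout(Thread ioThread, long timeoutMs)
            throws InterruptedException {
        if (timeoutMs <= 0) {
            // Thread.join(0) means "wait forever", which is the hang we want to avoid.
            throw new IllegalArgumentException("timeoutMs must be positive");
        }
        ioThread.join(timeoutMs);        // bounded wait for a graceful exit
        if (ioThread.isAlive()) {
            ioThread.interrupt();        // ask the stuck thread to give up
            ioThread.join(timeoutMs);    // bounded wait for it to react
        }
        return !ioThread.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for an IO thread blocked because no broker is reachable.
        Thread stuck = new Thread(() -> {
            try {
                Thread.sleep(TimeUnit.HOURS.toMillis(1));
            } catch (InterruptedException e) {
                // Interrupted by closeWithTimeout: fall through and exit.
            }
        }, "sketch-io-thread");
        stuck.setDaemon(true);
        stuck.start();

        boolean exited = closeWithTimeout(stuck, 200);
        System.out.println("IO thread exited: " + exited);
    }
}
```

Interrupting only helps if the IO loop actually responds to interruption, and any data still buffered at that point would be dropped. That is the same trade-off the record timeout discussed above makes explicit with a single setting.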
> producer record can stay in RecordAccumulator forever if leader is no available
> -------------------------------------------------------------------------------
>
> Key: KAFKA-1788
> URL: https://issues.apache.org/jira/browse/KAFKA-1788
> Project: Kafka
> Issue Type: Bug
> Components: core, producer
> Affects Versions: 0.8.2
> Reporter: Jun Rao
> Assignee: Jun Rao
> Labels: newbie++
> Fix For: 0.8.3
>
>
> In the new producer, when a partition has no leader for a long time (e.g.,
> all replicas are down), the records for that partition will stay in the
> RecordAccumulator until the leader is available. This may cause the
> bufferpool to be full and the callback for the produced message to block for
> a long time.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)