[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232061#comment-14232061 ]
Bhavesh Mistry commented on KAFKA-1642:
---------------------------------------

Hi [~ewencp],

I will not have time to validate this patch until next week. Here are my comments:

1) You still have not addressed the Producer.close() issue: if the network connection is lost (or a similar event happens), the IO thread is not killed and close() hangs. In the patch I provided, I had a timeout on the join call and interrupted the IO thread. I think we need something similar here.

2) Also, can we please add JMX monitoring for the IO thread so we know how fast it is running? It would be great to add this and have the run() method report its duration to the metric.

{code}
try {
    ThreadMXBean bean = ManagementFactory.getThreadMXBean();
    if (bean.isThreadCpuTimeSupported() && bean.isThreadCpuTimeEnabled()) {
        this.ioTheadCPUTime = metrics.sensor("iothread-cpu");
        this.ioTheadCPUTime.add("iothread-cpu-ms",
                "The Rate Of CPU Cycle used by iothead in NANOSECONDS",
                new Rate(TimeUnit.NANOSECONDS) {
                    public double measure(MetricConfig config, long now) {
                        return (now - metadata.lastUpdate()) / 1000.0;
                    }
                });
    }
} catch (Throwable th) {
    log.warn("Not able to set the CPU time... etc");
}
{code}

3) Please check the final timeout value in *pollTimeout*. If it is constantly zero, we need to slow the IO thread down.

4) Add a defensive back-off check in the run() method for when the IO thread is aggressive.

5) When all nodes are disconnected, do you still want to spin the IO thread?

6) Suppose a firewall rule says "you can only have 2 concurrent TCP connections from Client to Brokers", the client still has a live TCP connection to the same node (Broker), and a new TCP connection is rejected. Will the node state be marked as Disconnected in initiateConnect? Are you handling that gracefully?

By the way, thank you very much for the quick reply and the new patch. I appreciate your help.
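To make point 1 concrete, here is a minimal sketch of a bounded shutdown: join the IO thread with a timeout and, if it is still alive, interrupt it and wait once more. The names `BoundedClose` and `closeWithTimeout` are hypothetical and used only for illustration. This is not code from either patch or from the Kafka client, and it assumes the IO thread's loop exits when it is interrupted.

```java
// Hypothetical helper for illustration only; not part of the Kafka client.
final class BoundedClose {

    private BoundedClose() {
    }

    /**
     * Waits up to timeoutMs for the thread to exit. If it is still alive,
     * interrupts it and waits up to timeoutMs again.
     *
     * @return true if the thread has terminated by the time this returns
     */
    static boolean closeWithTimeout(Thread ioThread, long timeoutMs) {
        if (timeoutMs <= 0) {
            // Thread.join(0) waits forever, which is the hang we want to avoid.
            throw new IllegalArgumentException("timeoutMs must be positive");
        }
        try {
            ioThread.join(timeoutMs);
            if (ioThread.isAlive()) {
                // Assumption: the IO loop reacts to interruption by exiting.
                ioThread.interrupt();
                ioThread.join(timeoutMs);
            }
        } catch (InterruptedException e) {
            // Preserve the caller's interrupt status instead of swallowing it.
            Thread.currentThread().interrupt();
        }
        return !ioThread.isAlive();
    }
}
```

A caller would use something like `BoundedClose.closeWithTimeout(ioThread, 5000)` from close() and log a warning if it returns false, so that close() returns within roughly twice the timeout even when the network is down.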
Thanks,
Bhavesh

> [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
> ---------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1642
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1642
>             Project: Kafka
>          Issue Type: Bug
>          Components: producer
>    Affects Versions: 0.8.2
>            Reporter: Bhavesh Mistry
>            Assignee: Ewen Cheslack-Postava
>            Priority: Blocker
>             Fix For: 0.8.2
>
>         Attachments: 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch,
>                      KAFKA-1642.patch, KAFKA-1642.patch,
>                      KAFKA-1642_2014-10-20_17:33:57.patch,
>                      KAFKA-1642_2014-10-23_16:19:41.patch
>
>
> I see my CPU spike to 100% when network connection is lost for while. It seems network IO thread are very busy logging following error message. Is this expected behavior ?
>
> 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka producer I/O thread:
> java.lang.IllegalStateException: No entry found for node -2
>     at org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
>     at org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
>     at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
>     at org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
>     at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
>     at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
>     at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
>     at java.lang.Thread.run(Thread.java:744)
>
> Thanks,
> Bhavesh

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)