[ https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369257#comment-16369257 ]
Robin Tweedie commented on KAFKA-6199:
--------------------------------------

There's another piece of evidence I've noticed about this single broker: it builds up file descriptors in a way that the other brokers don't. I'm not sure if this narrows the potential causes down much. On the problem broker, note the high count of open {{sock}} files:
{noformat}
$ sudo lsof | awk '{print $5}' | sort | uniq -c | sort -rn
  12201 REG
   7229 IPv6
   1374 sock
    337 FIFO
    264 DIR
    163 CHR
    138 0000
     77 unknown
     54 unix
     13 IPv4
      1 TYPE
      1 pack
{noformat}
If you look at the {{lsof}} output directly, there are lots of lines like this (25305 is the Kafka pid):
{noformat}
java 25305 user *105u sock 0,6 0t0 351061533 can't identify protocol
java 25305 user *111u sock 0,6 0t0 351219556 can't identify protocol
java 25305 user *131u sock 0,6 0t0 350831689 can't identify protocol
java 25305 user *134u sock 0,6 0t0 351001514 can't identify protocol
java 25305 user *136u sock 0,6 0t0 351410956 can't identify protocol
{noformat}
Compare with a good broker that has an uptime of 76 days (only 65 open {{sock}} files):
{noformat}
  11729 REG
   7037 IPv6
    335 FIFO
    264 DIR
    164 CHR
    137 0000
     76 unknown
     65 sock
     54 unix
     14 IPv4
      1 TYPE
      1 pack
{noformat}

> Single broker with fast growing heap usage
> ------------------------------------------
>
>                 Key: KAFKA-6199
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6199
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.10.2.1
>         Environment: Amazon Linux
>            Reporter: Robin Tweedie
>            Priority: Major
>         Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, histo_live.txt, histo_live_20171206.txt, histo_live_80.txt, jstack-2017-12-08.scrubbed.out, merge_shortest_paths.png, path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast-growing heap usage which forces us to restart it every 12 hours. If we don't restart the broker, it becomes very slow from long GC pauses and eventually hits {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage percentage on the broker. A "normal" broker in the same cluster stays below 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously high, and there are a lot of retained {{NetworkSend}} objects referencing byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log message visualized across all the brokers (to show it happens a bit on other brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin down the problem better. We're currently trying to narrow it down to a particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source of this leak.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
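
A minimal sketch, assuming a Linux {{/proc}} filesystem and that the broker pid is passed on the command line, of one way to count socket descriptors that no longer map to any protocol-table entry, which is roughly what {{lsof}} reports above as "can't identify protocol". Only the common {{/proc/net}} tables are checked, so treat the numbers as approximate:
{code:python}
#!/usr/bin/env python
# Sketch: count a process's socket descriptors whose inode no longer appears
# in the kernel's protocol tables -- roughly what lsof prints as
# "can't identify protocol". Only the common /proc/net tables are checked
# (no netlink/packet), so the count is an approximation. Run as root or as
# the broker user; pass the broker pid as the only argument.
import os
import re
import sys

PROTO_TABLES = ("tcp", "tcp6", "udp", "udp6", "raw", "raw6", "unix")

def known_socket_inodes():
    """Socket inodes the kernel still associates with a protocol."""
    inodes = set()
    for name in PROTO_TABLES:
        try:
            with open("/proc/net/" + name) as f:
                next(f)  # skip the header line
                for line in f:
                    fields = line.split()
                    idx = 6 if name == "unix" else 9  # inode column differs
                    if len(fields) > idx and fields[idx].isdigit():
                        inodes.add(int(fields[idx]))
        except IOError:
            pass  # table not present on this kernel
    return inodes

def count_orphaned_sockets(pid):
    """Return (total socket fds, fds with no matching protocol entry)."""
    known = known_socket_inodes()
    fd_dir = "/proc/%d/fd" % pid
    total = orphaned = 0
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd closed between listdir and readlink
        m = re.match(r"socket:\[(\d+)\]", target)
        if m:
            total += 1
            if int(m.group(1)) not in known:
                orphaned += 1
    return total, orphaned

if __name__ == "__main__":
    pid = int(sys.argv[1])
    total, orphaned = count_orphaned_sockets(pid)
    print("pid %d: %d socket fds, %d with no identifiable protocol" % (pid, total, orphaned))
{code}
Sampling this every minute or so (e.g. from cron) would show whether the unidentified sockets grow monotonically on the bad broker or churn up and down.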
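
Along the same lines, a rough sketch (assuming readable {{server.log}} files and the stock log4j message layout) for tallying the "no open connection" warnings per connection id across brokers, to check whether they cluster around a single client connection:
{code:python}
#!/usr/bin/env python
# Sketch: count occurrences of the Processor warning per connection id.
# Usage: python count_no_open_connection.py /var/log/kafka/server.log*
# The log paths and exact message layout are assumptions; adjust as needed.
import re
import sys
from collections import Counter

PATTERN = re.compile(
    r"Attempting to send response via channel for which there is no open "
    r"connection, connection id (\S+)"
)

def tally(paths):
    counts = Counter()
    for path in paths:
        with open(path) as f:
            for line in f:
                m = PATTERN.search(line)
                if m:
                    counts[m.group(1)] += 1
    return counts

if __name__ == "__main__":
    for conn_id, n in tally(sys.argv[1:]).most_common(20):
        print("%8d  %s" % (n, conn_id))
{code}
If the top connection ids can be mapped back to client hosts, that might help with narrowing the leak down to a particular client, as mentioned in the description.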