[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369257#comment-16369257
 ] 

Robin Tweedie commented on KAFKA-6199:
--------------------------------------

There's another piece of evidence I've noticed about this single broker: it 
builds up file descriptors in a way that the other brokers don't. I'm not sure 
if this narrows the potential causes much.

On the problem broker, note the unusually high number of open {{sock}} file descriptors:
{noformat}
$ sudo lsof | awk '{print $5}' | sort | uniq -c | sort -rn
  12201 REG
   7229 IPv6
   1374 sock
    337 FIFO
    264 DIR
    163 CHR
    138 0000
     77 unknown
     54 unix
     13 IPv4
      1 TYPE
      1 pack
{noformat}
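
A rough way to watch whether that {{sock}} count keeps climbing (just a suggestion, not something I've run long-term; 25305 is the Kafka pid on this host, and the 10-minute interval is arbitrary) would be:
{noformat}
# sample the per-type FD counts for the Kafka process (pid 25305) every 10 minutes
while true; do
    date
    sudo lsof -p 25305 | awk '{print $5}' | sort | uniq -c | sort -rn
    sleep 600
done
{noformat}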

If you look at the {{lsof}} output directly, there are lots of lines like this 
(25305 is the Kafka pid):
{noformat}
java      25305 user *105u     sock                0,6       0t0  351061533 can't identify protocol
java      25305 user *111u     sock                0,6       0t0  351219556 can't identify protocol
java      25305 user *131u     sock                0,6       0t0  350831689 can't identify protocol
java      25305 user *134u     sock                0,6       0t0  351001514 can't identify protocol
java      25305 user *136u     sock                0,6       0t0  351410956 can't identify protocol
{noformat}
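
Something like the following should reproduce those counts on other brokers (a sketch only, not verified beyond this host; the grep string is simply the text {{lsof}} prints for these sockets):
{noformat}
# count the sock FDs lsof can't resolve to a protocol
sudo lsof -p 25305 | grep -c "can't identify protocol"

# cross-check against the raw open FD count from /proc
sudo ls /proc/25305/fd | wc -l
{noformat}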

Compare with a good broker that has an uptime of 76 days (only 65 open {{sock}} 
files):
{noformat}
  11729 REG
   7037 IPv6
    335 FIFO
    264 DIR
    164 CHR
    137 0000
     76 unknown
     65 sock
     54 unix
     14 IPv4
      1 TYPE
      1 pack
{noformat}
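
In case anyone wants to track this across the cluster without full {{lsof}} runs, here is a lightweight sketch (it assumes the broker pid can be found with {{pgrep -f kafka.Kafka}}; adjust the pattern and interval to your deployment):
{noformat}
# log the broker's open FD count once a minute
KAFKA_PID=$(pgrep -f kafka.Kafka | head -n 1)
while true; do
    echo "$(date -u +%FT%TZ) $(sudo ls /proc/${KAFKA_PID}/fd | wc -l)"
    sleep 60
done
{noformat}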

> Single broker with fast growing heap usage
> ------------------------------------------
>
>                 Key: KAFKA-6199
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6199
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.10.2.1
>         Environment: Amazon Linux
>            Reporter: Robin Tweedie
>            Priority: Major
>         Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, histo_live.txt, 
> histo_live_20171206.txt, histo_live_80.txt, jstack-2017-12-08.scrubbed.out, 
> merge_shortest_paths.png, path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast-growing heap usage 
> which forces us to restart it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually hits 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
