[ https://issues.apache.org/jira/browse/KAFKA-10518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302143#comment-17302143 ]
Tom Lee commented on KAFKA-10518:
---------------------------------

Attaching a contrived repro case. In case it's not obvious: if the repro case is run against a multi-broker cluster, the key to the repro is that the partitions assigned to the consumer are fetched from the same broker.

Sample output here: https://gist.githubusercontent.com/thomaslee/fa13c9a10466dc35792173c2485ad84b/raw/34c02bfc9f756eced8b952530b1b6378760fd7cd/bug-repro-output

Note the throughput drop from a ballpark ~2-3M records/sec to less than 200k/sec. This is the point at which the _disable_topic_2_ file is created and the producer stops writing to topic_2.

Imagine a scenario where a consumer of topic_2 is downstream of another system producing to topic_2: if conditions are right, an incident impacting the producer could also impact the consumer. Same deal if the producer is decommissioned.

> Consumer fetches could be inefficient when lags are unbalanced
> --------------------------------------------------------------
>
>                 Key: KAFKA-10518
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10518
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Dhruvil Shah
>            Priority: Major
>         Attachments: kafka-slow-consumer-repro.tar.gz
>
>
> Consumer fetches are inefficient when lags are imbalanced across partitions,
> due to head-of-line blocking and the behavior of blocking for
> `fetch.max.wait.ms` until data is available.
> When the consumer receives a fetch response, it prepares the next fetch
> request and sends it out. The caveat is that the subsequent fetch request
> explicitly excludes partitions for which the consumer received data in
> the previous round. This allows the consumer application to drain the
> data for those partitions while the consumer fetches the other partitions it
> is subscribed to.
> This behavior works poorly when lag is unbalanced across partitions, because
> the consumer receives data for the partitions it is lagging on and then
> sends a fetch request for partitions that have little or no data. That
> request ends up blocking for fetch.max.wait.ms on the broker before an empty
> response is sent back. This slows down the consumer’s overall consumption
> throughput, since fetch.max.wait.ms is 500ms by default.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
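
The fetch cycle described in the issue can be approximated with a back-of-envelope model: each data-bearing fetch is followed by one fetch that blocks on the broker for fetch.max.wait.ms and returns empty. The sketch below is only that model, not Kafka client code; the class name, the helper, and the per-fetch record count and latency are hypothetical values chosen for illustration, and only the 500 ms default comes from the issue text.

```java
public class FetchStallModel {

    /**
     * Records per second when every data-bearing fetch (taking fetchMs) is
     * followed by one fetch that blocks for waitMs before returning empty.
     * Assumes both fetches go to the same broker, so they cannot overlap.
     */
    static double recordsPerSecond(long recordsPerFetch, double fetchMs, double waitMs) {
        double cycleMs = fetchMs + waitMs;
        return recordsPerFetch * 1000.0 / cycleMs;
    }

    public static void main(String[] args) {
        long recordsPerFetch = 50_000; // hypothetical records returned per fetch
        double fetchMs = 20.0;         // hypothetical latency of a data-bearing fetch
        double maxWaitMs = 500.0;      // fetch.max.wait.ms default cited in the issue

        double balanced = recordsPerSecond(recordsPerFetch, fetchMs, 0.0);
        double stalled = recordsPerSecond(recordsPerFetch, fetchMs, maxWaitMs);

        System.out.printf("no empty fetches:   %.0f records/sec%n", balanced);
        System.out.printf("alternating stalls: %.0f records/sec%n", stalled);
    }
}
```

Under these assumed inputs the blocked fetch dominates the cycle time, which is the head-of-line effect the issue describes. The real ratio depends on partition counts, fetch sizes, and how often the idle partitions actually have data.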