[ 
https://issues.apache.org/jira/browse/KAFKA-10518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302143#comment-17302143
 ] 

Tom Lee commented on KAFKA-10518:
---------------------------------

Attaching a contrived repro case. If it is run against a multi-broker 
cluster, the key to reproducing the issue is that the partitions assigned to 
the consumer are all fetched from the same broker.

Sample output here: 
https://gist.githubusercontent.com/thomaslee/fa13c9a10466dc35792173c2485ad84b/raw/34c02bfc9f756eced8b952530b1b6378760fd7cd/bug-repro-output

Note the throughput drop from roughly 2-3M records/sec to less than 200k/sec. 
The drop occurs at the point where the _disable_topic_2_ file is created and 
the producer stops writing to topic_2.

Imagine a scenario where a consumer of topic_2 is downstream of another system 
producing to topic_2: if conditions are right, an incident impacting the 
producer could also impact the consumer. The same applies if the producer is 
decommissioned.

> Consumer fetches could be inefficient when lags are unbalanced
> --------------------------------------------------------------
>
>                 Key: KAFKA-10518
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10518
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Dhruvil Shah
>            Priority: Major
>         Attachments: kafka-slow-consumer-repro.tar.gz
>
>
> Consumer fetches are inefficient when lags are imbalanced across partitions, 
> due to head-of-line blocking and the behavior of blocking for 
> `fetch.max.wait.ms` until data is available.
> When the consumer receives a fetch response, it prepares the next fetch 
> request and sends it out. The caveat is that the subsequent fetch request 
> would explicitly exclude partitions for which the consumer received data in 
> the previous round. This is to allow the consumer application to drain the 
> data for those partitions, until the consumer fetches the other partitions it 
> is subscribed to.
> This behavior works poorly when lag is unbalanced across partitions: the 
> consumer receives data for the partitions it is lagging on, and then sends a 
> fetch request for partitions that have little or no data. That request ends 
> up blocking for fetch.max.wait.ms on the broker before an empty response is 
> sent back. This slows down the consumer’s overall consumption throughput, 
> since fetch.max.wait.ms is 500ms by default.
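
A rough back-of-envelope model of the fetch cycle described above may help 
show why the slowdown is so large. This is only a sketch under stated 
assumptions, not a model of the actual consumer internals: the function name 
and its parameters are hypothetical, and the records-per-fetch and round-trip 
figures are made-up illustrative values. Only the 500ms default for 
fetch.max.wait.ms comes from the issue description. The model assumes each 
productive fetch is followed by one fetch for idle partitions that blocks for 
the full wait before returning empty.

```python
def estimated_throughput(records_per_fetch, fetch_rtt_s, max_wait_s,
                         idle_fetch_in_cycle):
    """Estimate records/sec for one consumer under a simplified model.

    Assumed model (hypothetical, for illustration only): the consumer
    alternates between a fetch that returns data from the lagging
    partitions and, when idle partitions are present, a fetch for those
    partitions that blocks on the broker for max_wait_s and returns empty.
    """
    if records_per_fetch < 0 or fetch_rtt_s <= 0 or max_wait_s < 0:
        raise ValueError("invalid model parameters")

    # One cycle always pays for the productive fetch round trip.
    cycle_s = fetch_rtt_s
    # If the follow-up fetch targets only idle partitions, the cycle also
    # pays the full broker-side wait before the empty response arrives.
    if idle_fetch_in_cycle:
        cycle_s += max_wait_s
    return records_per_fetch / cycle_s


if __name__ == "__main__":
    RECORDS_PER_FETCH = 50_000  # hypothetical value
    FETCH_RTT_S = 0.02          # hypothetical value (20ms round trip)
    MAX_WAIT_S = 0.5            # fetch.max.wait.ms default of 500ms

    balanced = estimated_throughput(
        RECORDS_PER_FETCH, FETCH_RTT_S, MAX_WAIT_S, idle_fetch_in_cycle=False)
    unbalanced = estimated_throughput(
        RECORDS_PER_FETCH, FETCH_RTT_S, MAX_WAIT_S, idle_fetch_in_cycle=True)

    print(f"all partitions busy: {balanced:,.0f} records/sec")
    print(f"idle partition wait: {unbalanced:,.0f} records/sec")
```

With these illustrative inputs the model goes from about 2.5M records/sec to 
about 96k records/sec, because the 500ms wait dominates each cycle. The exact 
figures depend entirely on the assumed inputs; the point is only that the 
wait, not the data transfer, sets the cycle time.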



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
