Michal Turek created KAFKA-2978:
-----------------------------------

             Summary: Topic partition is sometimes not consumed after 
rebalancing of consumer group
                 Key: KAFKA-2978
                 URL: https://issues.apache.org/jira/browse/KAFKA-2978
             Project: Kafka
          Issue Type: Bug
          Components: consumer, core
    Affects Versions: 0.9.0.0
            Reporter: Michal Turek
            Assignee: Neha Narkhede
            Priority: Critical


Hi there, we are evaluating Kafka 0.9 to find out whether it is stable enough
and ready for production. We wrote a tool that verifies that each produced
message is also properly consumed. We found the issue described below while
stress-testing Kafka with this tool.
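
As an illustration only (this is not code from the tool itself), the kind of
check described above could be sketched as follows. The function name
find_unconsumed and the (partition, message id) tuples are assumptions made
for this example; the sketch simply reports, per partition, which produced
messages were never seen by any consumer.

```python
from collections import defaultdict


def find_unconsumed(produced, consumed):
    """Return {partition: sorted message ids} that were produced but never consumed.

    Both arguments are iterables of (partition, message_id) tuples
    (a hypothetical record format chosen for this sketch).
    """
    consumed_set = set(consumed)
    missing = defaultdict(list)
    for partition, msg_id in produced:
        if (partition, msg_id) not in consumed_set:
            missing[partition].append(msg_id)
    return {partition: sorted(ids) for partition, ids in missing.items()}


# Example: nothing from partition 1 was consumed.
produced = [(0, 1), (0, 2), (1, 1), (1, 2), (2, 1)]
consumed = [(0, 1), (0, 2), (2, 1)]
print(find_unconsumed(produced, consumed))  # {1: [1, 2]}
```

A partition that shows up in this report with every one of its message ids
would match the symptom described in this issue.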

Adding more and more consumers to a consumer group may result in unsuccessful
rebalancing. Data from one or more partitions is then not consumed and is
effectively unavailable to the client application (e.g. for 15 minutes). The
situation can be resolved externally by touching the consumer group again
(adding or removing a consumer), which forces another rebalance that may or
may not be successful.
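
A minimal sketch of how such a stalled partition might be flagged from the
outside, assuming a monitor already knows, for each partition, the log end
offset, the group's consumed offset, and the time of the last progress. The
function name stalled_partitions and the layout of the state dictionary are
hypothetical; the 900-second default mirrors the 15 minutes mentioned above.

```python
def stalled_partitions(state, now, max_idle_seconds=900):
    """Return sorted partitions that have unread data but made no recent progress.

    state maps partition -> (end_offset, consumed_offset, last_progress_ts),
    a hypothetical structure for this sketch; timestamps are in seconds.
    """
    return sorted(
        partition
        for partition, (end_offset, consumed_offset, last_progress_ts) in state.items()
        if end_offset > consumed_offset              # data is waiting to be read
        and now - last_progress_ts >= max_idle_seconds  # and nobody has read it lately
    )


state = {
    0: (100, 100, 0),   # fully consumed, so not stalled
    1: (100, 40, 0),    # lagging and idle for 1000 s
    2: (100, 90, 800),  # lagging, but progressed 200 s ago
}
print(stalled_partitions(state, now=1000))  # [1]
```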

Significantly higher CPU utilization was observed in such cases (rising from
about 3% to 17%). According to htop and profiling with jvisualvm, the increase
occurs in both the affected consumer and the Kafka broker.

Jvisualvm indicates the issue may be related to KAFKA-2936 (see the jvisualvm
screenshots in the GitHub repo below), but I'm very unsure. I also don't know
whether the issue is in the consumer or the broker, because both are affected
and I don't know Kafka internals.

The issue is not deterministic, but it can be easily reproduced within a few
minutes just by starting more and more consumers.
More parallelism with multiple CPUs probably gives the issue more chances to
appear.

The tool itself, together with very detailed instructions for fairly reliable
reproduction of the issue and an initial analysis, is available here:

- https://github.com/avast/kafka-tests
- https://github.com/avast/kafka-tests/tree/issue1/issues/1_rebalancing
- Prefer the fixed tag {{issue1}} to the branch {{master}}, which may change.
- Note there are also various jvisualvm screenshots together with full logs
from all components of the tool.

My colleague was able to independently reproduce the issue by following the
instructions above. If you have any questions or need any help with the
tool, just let us know. This issue is a blocker for us.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
