Gangadharan created KAFKA-18974:
-----------------------------------

             Summary: Uneven distribution of topic partitions across consumers 
while using Cooperative Sticky Assignor
                 Key: KAFKA-18974
                 URL: https://issues.apache.org/jira/browse/KAFKA-18974
             Project: Kafka
          Issue Type: Bug
          Components: clients, consumer
    Affects Versions: 3.8.1
            Reporter: Gangadharan


I came across a scenario where the spread of each topic's partitions across 
consumer threads is uneven. The topic with high TPS (for example, 85% of the 
traffic) had more partitions than the topics with low TPS (for example, 15% of 
the traffic). The consumer threads had subscribed to both sets of topics. 
After assignment, some consumer threads held mostly partitions of the low TPS 
topics. As a result, the pods whose consumer threads held more partitions of 
the high TPS topic had to do more work, resulting in higher lag. If we choose 
round robin instead, the distribution is even between threads and across pods, 
but we are then limited by its stop-the-world rebalance.

An issue was already raised and fixed in this context. However, it does not 
fix the whole problem. I suspect this is because, during the rebalance, only 
the partitions that are supposed to be moved away from existing consumers are 
sorted and distributed. There is no logic to also check whether the retained 
partitions should be moved to ensure an even spread of each topic across 
consumers. 
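
The suspicion above can be illustrated with a toy model. This is only a sketch 
of the suspected behavior, not the actual Kafka assignor code: the function 
sticky_like_assign, the topics "hot" and "cold", and the consumers "a" and "b" 
are all hypothetical. Each consumer keeps what it already owns, and only the 
unowned partitions are sorted and handed to whichever consumer has the fewest 
partitions overall, without looking at the topic.

```python
from collections import Counter


def sticky_like_assign(owned, all_partitions):
    """Toy model (assumption, not Kafka's implementation): retained
    partitions never move; only unowned partitions are distributed."""
    consumers = sorted(owned)
    quota = -(-len(all_partitions) // len(consumers))  # ceiling division
    # Every consumer keeps its previously owned partitions, up to the quota.
    assignment = {c: list(owned[c])[:quota] for c in consumers}
    retained = {p for parts in assignment.values() for p in parts}
    unowned = sorted(p for p in all_partitions if p not in retained)
    for partition in unowned:
        # Pick the consumer with the fewest partitions in total.
        # The topic of the partition is deliberately ignored.
        target = min(consumers, key=lambda c: (len(assignment[c]), c))
        assignment[target].append(partition)
    return assignment


def per_topic(assignment):
    """Count partitions per topic for each consumer."""
    return {c: Counter(topic for topic, _ in parts)
            for c, parts in assignment.items()}


# Hypothetical starting point: "a" already owns three "hot" partitions.
all_partitions = [(t, i) for t in ("hot", "cold") for i in range(4)]
owned = {"a": [("hot", 0), ("hot", 1), ("hot", 2)], "b": [("cold", 0)]}

result = sticky_like_assign(owned, all_partitions)
for consumer, counts in per_topic(result).items():
    print(consumer, len(result[consumer]), dict(counts))
```

In this toy run both consumers end up with 4 partitions, but "a" holds 3 of the 
4 "hot" partitions. The totals are balanced while the per-topic spread is not, 
which is the pattern described in this report.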

Related issue: KAFKA-16277, "CooperativeStickyAssignor does not spread topics 
evenly among consumer group"

 

Below is a sample test:

2 pods with 6 consumer threads each. Two topics with 18 partitions each 
(test_topic_1 has higher inflow than test_topicone_1). As shown below, with 
the cooperative sticky strategy test_topic_1 is concentrated in pod1 (12 of 
its 18 partitions), so pod1 starts to build up lag. With round robin, 
test_topic_1 is distributed evenly between the pods.

Note: The sample test uses the same partition count for both topics for ease 
of understanding. The behavior seems to be the same irrespective of the 
partition counts of the topics.
 

Cooperative Sticky:

Pod1

c--> consumer 1912486590767 [test_topic_1-1, test_topic_1-3, test_topicone_1-1]
c--> consumer 1922696734819 [test_topic_1-11, test_topic_1-6, test_topicone_1-6]
c--> consumer 1941340051228 [test_topic_1-12, test_topic_1-7, test_topicone_1-7]
c--> consumer 1940955938996 [test_topic_1-0, test_topic_1-8, test_topicone_1-0]
c--> consumer 1941837822481 [test_topic_1-2, test_topic_1-9, test_topicone_1-2]
c--> consumer 1942719746188 [test_topic_1-10, test_topic_1-4, test_topicone_1-4]

 
Pod2

c--> consumer 1941486742305 [test_topic_1-13, test_topicone_1-13, test_topicone_1-5]
c--> consumer 1941837974018 [test_topic_1-14, test_topicone_1-14, test_topicone_1-8]
c--> consumer 1942719897724 [test_topic_1-15, test_topicone_1-15, test_topicone_1-9]
c--> consumer 1942696886353 [test_topic_1-16, test_topicone_1-10, test_topicone_1-16]
c--> consumer 1941340202762 [test_topic_1-17, test_topicone_1-11, test_topicone_1-17]
c--> consumer 1940956090534 [test_topic_1-5, test_topicone_1-12, test_topicone_1-3]

-----------------------------------------------------------------------------------------

Round Robin:

Pod1

c--> consumer 1941408797822 [test_topic_1-0, test_topic_1-12, test_topicone_1-6]
c--> consumer 1941456423553 [test_topic_1-9, test_topicone_1-15, test_topicone_1-3]
c--> consumer 1942070859325 [test_topic_1-14, test_topic_1-2, test_topicone_1-8]
c--> consumer 1941385036886 [test_topic_1-16, test_topic_1-4, test_topicone_1-10]
c--> consumer 1941105638483 [test_topic_1-6, test_topicone_1-0, test_topicone_1-12]
c--> consumer 1941885698382 [test_topic_1-10, test_topicone_1-16, test_topicone_1-4]

Pod2

c--> consumer 1941456538287 [test_topic_1-8, test_topicone_1-14, test_topicone_1-2]
c--> consumer 1942070974058 [test_topic_1-15, test_topic_1-3, test_topicone_1-9]
c--> consumer 1941885813119 [test_topic_1-11, test_topicone_1-19, test_topicone_1-5]
c--> consumer 1941408912555 [test_topic_1-1, test_topic_1-13, test_topicone_1-7]
c--> consumer 1941385151618 [test_topic_1-17, test_topic_1-5, test_topicone_1-11]
c--> consumer 1941105753216 [test_topic_1-7, test_topicone_1-1, test_topicone_1-13]
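
To make the difference between the two listings easier to see, the partitions 
can be tallied per pod and per topic. The snippet below is a small sketch for 
that purpose. The partition names are copied as-is from the listings above, one 
line per consumer; the helper partitions_per_topic is a hypothetical name 
introduced only for this illustration.

```python
from collections import Counter

# One line per consumer, copied from the Cooperative Sticky listing.
COOPERATIVE_STICKY = {
    "Pod1": """
        test_topic_1-1 test_topic_1-3 test_topicone_1-1
        test_topic_1-11 test_topic_1-6 test_topicone_1-6
        test_topic_1-12 test_topic_1-7 test_topicone_1-7
        test_topic_1-0 test_topic_1-8 test_topicone_1-0
        test_topic_1-2 test_topic_1-9 test_topicone_1-2
        test_topic_1-10 test_topic_1-4 test_topicone_1-4
    """,
    "Pod2": """
        test_topic_1-13 test_topicone_1-13 test_topicone_1-5
        test_topic_1-14 test_topicone_1-14 test_topicone_1-8
        test_topic_1-15 test_topicone_1-15 test_topicone_1-9
        test_topic_1-16 test_topicone_1-10 test_topicone_1-16
        test_topic_1-17 test_topicone_1-11 test_topicone_1-17
        test_topic_1-5 test_topicone_1-12 test_topicone_1-3
    """,
}

# One line per consumer, copied from the Round Robin listing.
ROUND_ROBIN = {
    "Pod1": """
        test_topic_1-0 test_topic_1-12 test_topicone_1-6
        test_topic_1-9 test_topicone_1-15 test_topicone_1-3
        test_topic_1-14 test_topic_1-2 test_topicone_1-8
        test_topic_1-16 test_topic_1-4 test_topicone_1-10
        test_topic_1-6 test_topicone_1-0 test_topicone_1-12
        test_topic_1-10 test_topicone_1-16 test_topicone_1-4
    """,
    "Pod2": """
        test_topic_1-8 test_topicone_1-14 test_topicone_1-2
        test_topic_1-15 test_topic_1-3 test_topicone_1-9
        test_topic_1-11 test_topicone_1-19 test_topicone_1-5
        test_topic_1-1 test_topic_1-13 test_topicone_1-7
        test_topic_1-17 test_topic_1-5 test_topicone_1-11
        test_topic_1-7 test_topicone_1-1 test_topicone_1-13
    """,
}


def partitions_per_topic(pods):
    """Count partitions per topic for each pod ("<topic>-<partition>")."""
    return {pod: Counter(tp.rsplit("-", 1)[0] for tp in text.split())
            for pod, text in pods.items()}


for name, pods in (("Cooperative Sticky", COOPERATIVE_STICKY),
                   ("Round Robin", ROUND_ROBIN)):
    for pod, counts in partitions_per_topic(pods).items():
        print(name, pod, dict(counts))
```

For these listings the tally gives 12 test_topic_1 partitions on Pod1 and 6 on 
Pod2 with cooperative sticky, against 9 and 9 with round robin. Every pod holds 
18 partitions in total in both cases.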



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
