Sean Quah created KAFKA-19862:
---------------------------------

             Summary: Group coordinator loading may fail when there is 
concurrent compaction
                 Key: KAFKA-19862
                 URL: https://issues.apache.org/jira/browse/KAFKA-19862
             Project: Kafka
          Issue Type: Bug
          Components: group-coordinator
            Reporter: Sean Quah
            Assignee: Sean Quah


For consumer and streams groups, we reject replay of 
{{Consumer/StreamsGroupCurrentMemberAssignment}} records when we detect that a 
partition/task is already owned by another member.

During group coordinator load, we replay the records in 
{{__consumer_offsets}}. When compaction is running concurrently, we can 
load uncompacted data, followed by a newly swapped-in compacted segment, 
followed by the uncompacted head of the log. As a result, the record 
unassigning a partition/task can be missed during loading.

For example, we can load a record { Member A is assigned partition X },
then miss the record { Member A is unassigned partition X },
then load the record { Member B is assigned partition X }, which fails with an 
exception like

{{[GroupCoordinator id=2] Failed to load metadata from __consumer_offsets-4 
with epoch 10 due to java.lang.RuntimeException: Replaying record 
CoordinatorRecord(key=ConsumerGroupCurrentMemberAssignmentKey(groupId='...', 
memberId='ZxHk7W53S_aHFdpxYc-_Jw'), 
value=ApiMessageAndVersion(ConsumerGroupCurrentMemberAssignmentValue(memberEpoch=854659,
 previousMemberEpoch=854633, state=0, 
assignedPartitions=[TopicPartitions(topicId=9lL1aTMuSC22QAXsHgzhew, 
partitions=[1, 2]), TopicPartitions(topicId=RHKM682KQYyOfF1XsOSF1A, 
partitions=[0]), TopicPartitions(topicId=rKx9q1JmS1uP-ug_cj56ug, 
partitions=[0]), TopicPartitions(topicId=I7EtFwesTRubnj-VHClqbQ, 
partitions=[2]), TopicPartitions(topicId=ydAln6IUTZe-od9UUkn3rg, 
partitions=[2])], partitionsPendingRevocation=[]) at version 0)) from 
__consumer_offsets-4 at offset 3889549 with producer id -1 and producer epoch 
-1 failed..}}


{{java.lang.IllegalStateException: Cannot set the epoch of 
RHKM682KQYyOfF1XsOSF1A-0 to 854659 because the partition is still owned at 
epoch 853490}}
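
The sequence above can be illustrated with a toy model. This is only a sketch, 
not the actual group coordinator code: the class {{AssignmentReplaySketch}}, 
the {{Assignment}} record and the {{loads}} helper are hypothetical names, and 
the ownership check is a simplified assumption about how replay tracks the 
epoch at which each partition is owned.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of assignment replay. Hypothetical; not the Kafka implementation. */
class AssignmentReplaySketch {
    /** Simplified stand-in for a current-member-assignment record. */
    record Assignment(String memberId, int memberEpoch, List<String> partitions) {}

    // Epoch at which each partition is currently owned (absent = unowned).
    private final Map<String, Integer> epochByPartition = new HashMap<>();
    // Partitions last recorded for each member, so they can be released on update.
    private final Map<String, List<String>> partitionsByMember = new HashMap<>();

    void replay(Assignment record) {
        // Release whatever this member owned according to its previous record.
        for (String p : partitionsByMember.getOrDefault(record.memberId(), List.of())) {
            epochByPartition.remove(p);
        }
        // Reject the record if any partition is still owned by someone else.
        for (String p : record.partitions()) {
            Integer ownedAt = epochByPartition.get(p);
            if (ownedAt != null) {
                throw new IllegalStateException("Cannot set the epoch of " + p + " to "
                    + record.memberEpoch() + " because the partition is still owned at epoch "
                    + ownedAt);
            }
        }
        // Claim the partitions at the new member epoch.
        for (String p : record.partitions()) {
            epochByPartition.put(p, record.memberEpoch());
        }
        partitionsByMember.put(record.memberId(), List.copyOf(record.partitions()));
    }

    /** Replays the records in order; returns false if any record is rejected. */
    static boolean loads(List<Assignment> log) {
        AssignmentReplaySketch state = new AssignmentReplaySketch();
        try {
            for (Assignment record : log) {
                state.replay(record);
            }
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

In this model, replaying { A owns X }, { A owns nothing }, { B owns X } loads 
cleanly, while dropping the middle record makes the third record fail in the 
same way as the exception above.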



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
