dajac commented on code in PR #15364:
URL: https://github.com/apache/kafka/pull/15364#discussion_r1522869232
##########
group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupMetadataManager.java:
##########

@@ -1211,13 +1192,99 @@ private CoordinatorResult<ConsumerGroupHeartbeatResponseData, Record> consumerGr
         // 1. The member reported its owned partitions;
         // 2. The member just joined or rejoined to group (epoch equals to zero);
         // 3. The member's assignment has been updated.
-        if (ownedTopicPartitions != null || memberEpoch == 0 || assignmentUpdated) {
+        if (ownedTopicPartitions != null || memberEpoch == 0 || hasAssignedPartitionsChanged(member, updatedMember)) {
             response.setAssignment(createResponseAssignment(updatedMember));
         }

         return new CoordinatorResult<>(records, response);
     }

+    /**
+     * Reconciles the current assignment of the member if needed.
+     *
+     * @param groupId               The group id.
+     * @param member                The member to reconcile.
+     * @param currentPartitionEpoch The function returning the current epoch of
+     *                              a given partition.
+     * @param targetAssignmentEpoch The target assignment epoch.
+     * @param targetAssignment      The target assignment.
+     * @param ownedTopicPartitions  The list of partitions owned by the member. This
+     *                              is reported in the ConsumerGroupHeartbeat API and
+     *                              it could be null if not provided.
+     * @param records               The list to accumulate any new records.
+     * @return The received member if no changes have been made; or a new
+     *         member containing the new assignment.
+     */
+    private ConsumerGroupMember maybeReconcile(
+        String groupId,
+        ConsumerGroupMember member,
+        BiFunction<Uuid, Integer, Integer> currentPartitionEpoch,
+        int targetAssignmentEpoch,
+        Assignment targetAssignment,
+        List<ConsumerGroupHeartbeatRequestData.TopicPartitions> ownedTopicPartitions,
+        List<Record> records
+    ) {
+        if (member.isReconciledTo(targetAssignmentEpoch)) {
+            return member;
+        }
+
+        ConsumerGroupMember updatedMember = new CurrentAssignmentBuilder(member)
+            .withTargetAssignment(targetAssignmentEpoch, targetAssignment)
+            .withCurrentPartitionEpoch(currentPartitionEpoch)
+            .withOwnedTopicPartitions(ownedTopicPartitions)
+            .build();
+
+        if (!updatedMember.equals(member)) {
+            records.add(newCurrentAssignmentRecord(groupId, updatedMember));
+
+            log.info("[GroupId {}] Member {} new assignment state: epoch={}, previousEpoch={}, state={}, " +
+                "assignedPartitions={} and revokedPartitions={}.",
+                groupId, updatedMember.memberId(), updatedMember.memberEpoch(), updatedMember.previousMemberEpoch(), updatedMember.state(),
+                formatAssignment(updatedMember.assignedPartitions()), formatAssignment(updatedMember.revokedPartitions()));
+
+            if (updatedMember.state() == MemberState.UNREVOKED_PARTITIONS) {
+                scheduleConsumerGroupRebalanceTimeout(
+                    groupId,
+                    updatedMember.memberId(),
+                    updatedMember.memberEpoch(),
+                    updatedMember.rebalanceTimeoutMs()
+                );
+            } else {

Review Comment:
> So just so I get it straight -- we have a timeout only for the revocation of the partitions. If we hit the timeout, the member is fenced and I assume we can count those partitions as revoked. (Or we no longer have to wait for them to be revoked)

Yep. This is correct. As soon as the member is fenced, all of its partitions become available immediately.

> At the point all partitions are revoked, we can try to assign new partitions to members. In this case, are we relying on the heartbeat to kick members out if they aren't responding?
> And we expect if requests are going through, the assignment will occur?

Correct. We rely on the heartbeat to maintain the member's session. If the member does not heartbeat at least once within the session timeout, we fence it.

> Just trying to confirm the reason for the separate timeout here. Is it because revoking is more likely to fail even though the heartbeat still goes through, but not the same for assigning?

The reason is that there are actually two processes piggybacking on the heartbeat: 1) the session; and 2) the assignment reconciliation. The session is maintained by the regular heartbeat. The assignment reconciliation relies on the owned partitions reported by the member. It is entirely possible for a member to keep heartbeating to maintain its session while being stuck on the assignment part. This is the main reason why we need the second timeout: we want to ensure that a member cannot block all the others forever if it does not release its revoked partitions.

Regarding the assigning case, we don't apply the same timeout because we have discovered that the consumer can actually get stuck when a partition that no longer exists is assigned to it. In that case, the consumer cannot get the metadata for the newly assigned partition, so it retries forever. On the server, if we waited for the member to confirm that assignment, we could not revoke the deleted partitions. This is the main reason why we treat the revocation and the assignment differently: the revocation blocks the reconciliation process while the assignment does not.
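To make the two-timer distinction above concrete, here is a toy model of a coordinator that fences a member either when its session expires (no heartbeat) or when a pending revocation outlives the rebalance timeout even though heartbeats keep arriving. This is not Kafka's actual implementation; all names (`GroupCoordinatorModel`, `runTimers`, `startRevocation`, etc.) are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the two timers that piggyback on the ConsumerGroupHeartbeat:
// the session timer and the rebalance (revocation) timer. Illustrative only.
class GroupCoordinatorModel {
    enum MemberState { STABLE, UNREVOKED_PARTITIONS }

    static final class Member {
        MemberState state = MemberState.STABLE;
        long lastHeartbeatMs = 0;
        long revocationStartedMs = -1;
        boolean fenced = false;
    }

    final long sessionTimeoutMs;
    final long rebalanceTimeoutMs;
    final Map<String, Member> members = new HashMap<>();

    GroupCoordinatorModel(long sessionTimeoutMs, long rebalanceTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
        this.rebalanceTimeoutMs = rebalanceTimeoutMs;
    }

    // A heartbeat always refreshes the session, but it only completes the
    // revocation when the member reports that the revoked partitions are gone.
    void heartbeat(String memberId, long nowMs, boolean revokedPartitionsReleased) {
        Member m = members.get(memberId);
        m.lastHeartbeatMs = nowMs;
        if (m.state == MemberState.UNREVOKED_PARTITIONS && revokedPartitionsReleased) {
            m.state = MemberState.STABLE;
            m.revocationStartedMs = -1;
        }
    }

    void startRevocation(String memberId, long nowMs) {
        Member m = members.get(memberId);
        m.state = MemberState.UNREVOKED_PARTITIONS;
        m.revocationStartedMs = nowMs;
    }

    // Fencing frees all of the member's partitions immediately, whether the
    // session expired or the revocation timed out.
    void runTimers(long nowMs) {
        for (Member m : members.values()) {
            if (m.fenced) continue;
            boolean sessionExpired = nowMs - m.lastHeartbeatMs > sessionTimeoutMs;
            boolean revocationTimedOut = m.state == MemberState.UNREVOKED_PARTITIONS
                && nowMs - m.revocationStartedMs > rebalanceTimeoutMs;
            if (sessionExpired || revocationTimedOut) m.fenced = true;
        }
    }

    public static void main(String[] args) {
        GroupCoordinatorModel g = new GroupCoordinatorModel(30_000, 60_000);
        g.members.put("stuck", new Member());
        g.members.put("healthy", new Member());
        g.startRevocation("stuck", 0);
        // Both members heartbeat every 10s, but "stuck" never releases
        // its revoked partitions: its session stays alive anyway.
        for (long t = 10_000; t <= 70_000; t += 10_000) {
            g.heartbeat("stuck", t, false);
            g.heartbeat("healthy", t, false);
        }
        g.runTimers(61_000);
        System.out.println("stuck fenced: " + g.members.get("stuck").fenced);     // true
        System.out.println("healthy fenced: " + g.members.get("healthy").fenced); // false
    }
}
```

The point of the sketch is that `heartbeat` alone can never satisfy the revocation timer: "stuck" is fenced at 61s by the rebalance timeout despite a perfectly healthy session, which is exactly why the two timeouts exist separately.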