liangyepianzhou commented on code in PR #20948:
URL: https://github.com/apache/pulsar/pull/20948#discussion_r1306765177


##########
pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConsumerImpl.java:
##########
@@ -1449,6 +1450,23 @@ private ByteBuf processMessageChunk(ByteBuf compressedPayload, MessageMetadata m
         // discard message if chunk is out-of-order
         if (chunkedMsgCtx == null || chunkedMsgCtx.chunkedMsgBuffer == null
                 || msgMetadata.getChunkId() != (chunkedMsgCtx.lastChunkedMessageId + 1)) {
+            // Filter duplicated chunks instead of discard it.
+            if (chunkedMsgCtx == null || msgMetadata.getChunkId() <= chunkedMsgCtx.lastChunkedMessageId) {
+                log.warn("[{}] Receive a repeated chunk messageId {}, last-chunk-id{}, chunkId = {}",
+                        msgMetadata.getProducerName(), chunkedMsgCtx == null ? null
+                                : chunkedMsgCtx.lastChunkedMessageId, msgId, msgMetadata.getChunkId());
+                compressedPayload.release();
+                increaseAvailablePermits(cnx);
+                if (chunkedMsgCtx != null) {

Review Comment:
   >It seems to be inefficient to iterate all chunks every time. Can we optimize it? I believe all chunk message ids(ledger and entry) for the same message should be the same, aren't they? Can't we check the last chunk's messageId only?
   
   In fact, all the chunks of a chunked message have different message IDs. I also just learned this.
   
   >It seems like the processMessageChunk does not have the id check logic. Why are we introducing this check in this PR?
   
   Because we should check whether a repeated chunk is a duplicate persisted in the topic or a chunk received twice by the consumer.
   For example:
   **Case 1, duplicated chunks persisted in the topic:**
   1: uuid=p-0, mid:1:1, chunk 1 sequence ID: 0, chunk ID: 0
   2: uuid=p-0, mid:1:2, chunk 2 sequence ID: 0, chunk ID: 1
   3: uuid=p-0, mid:1:3, chunk 1 sequence ID: 0, chunk ID: 0 // should be acked
   4: uuid=p-0, mid:1:4, chunk 2 sequence ID: 0, chunk ID: 1 // should be acked
   **Case 2, chunks received twice by the consumer:**
   1: uuid=p-0, mid:1:1, chunk 1 sequence ID: 0, chunk ID: 0
   2: uuid=p-0, mid:1:2, chunk 2 sequence ID: 0, chunk ID: 1
   3: uuid=p-0, mid:1:1, chunk 1 sequence ID: 0, chunk ID: 0 // just ignore it
   4: uuid=p-0, mid:1:2, chunk 2 sequence ID: 0, chunk ID: 1 // just ignore it
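
   The two cases above can be sketched as a standalone check (class and field names below are illustrative stand-ins, not the actual `ConsumerImpl` internals): a repeated chunk whose ledger/entry pair was never recorded is a duplicate persisted in the topic and should be acked, while one whose ledger/entry pair is already recorded is a redelivery and can simply be ignored.
   ```java
   import java.util.Arrays;

   public class ChunkDedupSketch {
       // Minimal stand-in for MessageIdImpl: ledgerId + entryId identify a broker entry.
       record EntryId(long ledgerId, long entryId) {}

       /** True if this exact entry was already received, i.e. a redelivery (Case 2). */
       static boolean alreadyReceived(EntryId[] receivedChunks, EntryId incoming) {
           return Arrays.stream(receivedChunks)
                   .anyMatch(id -> id != null
                           && id.ledgerId() == incoming.ledgerId()
                           && id.entryId() == incoming.entryId());
       }

       public static void main(String[] args) {
           // Chunks 0 and 1 already received as entries 1:1 and 1:2.
           EntryId[] received = { new EntryId(1, 1), new EntryId(1, 2) };

           // Case 1: the same chunk persisted again as a new entry (mid 1:3)
           // -> never seen, so ack it individually to avoid redelivery.
           System.out.println(alreadyReceived(received, new EntryId(1, 3))); // false

           // Case 2: the broker redelivers the same entry (mid 1:1)
           // -> already counted, just ignore it.
           System.out.println(alreadyReceived(received, new EntryId(1, 1))); // true
       }
   }
   ```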



##########
pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConsumerImpl.java:
##########
@@ -1449,6 +1450,24 @@ private ByteBuf processMessageChunk(ByteBuf compressedPayload, MessageMetadata m
         // discard message if chunk is out-of-order
         if (chunkedMsgCtx == null || chunkedMsgCtx.chunkedMsgBuffer == null
                 || msgMetadata.getChunkId() != (chunkedMsgCtx.lastChunkedMessageId + 1)) {
+            // Filter duplicated chunks instead of discard it. (Only do this when exist duplication in a chunk message)
+            // For example:
+            //     Chunk-1 sequence ID: 0, chunk ID: 0
+            //     Chunk-2 sequence ID: 0, chunk ID: 0
+            //     Chunk-3 sequence ID: 0, chunk ID: 1
+            if (chunkedMsgCtx != null && msgMetadata.getChunkId() <= chunkedMsgCtx.lastChunkedMessageId) {
+                log.warn("[{}] Receive a repeated chunk messageId {}, last-chunk-id{}, chunkId = {}",
+                        msgMetadata.getProducerName(), chunkedMsgCtx.lastChunkedMessageId, msgId, msgMetadata.getChunkId());
+                compressedPayload.release();
+                increaseAvailablePermits(cnx);
+                boolean repeatedlyReceived = Arrays.stream(chunkedMsgCtx.chunkedMessageIds)
+                        .anyMatch(messageId1 -> messageId1 != null && messageId1.ledgerId == messageId.getLedgerId()
+                                && messageId1.entryId == messageId.getEntryId());
+                if (!repeatedlyReceived) {
+                    doAcknowledge(msgId, AckType.Individual, Collections.emptyMap(), null);

Review Comment:
   > 1: uuid=p-0-t1, mid:1:1, chunk 1 sequence ID: 0, chunk ID: 0
   > 2: uuid=p-0-t1, mid:1:2, chunk 2 sequence ID: 0, chunk ID: 1
   > 3: uuid=p-0-t1, mid:1:2, chunk 2 sequence ID: 0, chunk ID: 1 // ignored
   > // producer restarted
   > 4: uuid=p-0-t1, mid:1:3, chunk 3 sequence ID: 0, chunk ID: 0
   > 5: uuid=p-0-t2, mid:1:4, chunk 4 sequence ID: 0, chunk ID: 1
   > 6: uuid=p-0-t3, mid:1:5, chunk 5 sequence ID: 0, chunk ID: 2
   > 
   > So, msg 4, 5 and 6 will complete the chunked msg in this case, and msg 1 and 2 will eventually expire.
   
   Do you mean uuid = p-0-t2 for chunk 3,4,5? If so, it makes sense to me.



##########
pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConsumerImpl.java:
##########
@@ -1449,6 +1450,24 @@ private ByteBuf processMessageChunk(ByteBuf compressedPayload, MessageMetadata m
         // discard message if chunk is out-of-order
         if (chunkedMsgCtx == null || chunkedMsgCtx.chunkedMsgBuffer == null
                 || msgMetadata.getChunkId() != (chunkedMsgCtx.lastChunkedMessageId + 1)) {
+            // Filter duplicated chunks instead of discard it. (Only do this when exist duplication in a chunk message)
+            // For example:
+            //     Chunk-1 sequence ID: 0, chunk ID: 0
+            //     Chunk-2 sequence ID: 0, chunk ID: 0
+            //     Chunk-3 sequence ID: 0, chunk ID: 1
+            if (chunkedMsgCtx != null && msgMetadata.getChunkId() <= chunkedMsgCtx.lastChunkedMessageId) {
+                log.warn("[{}] Receive a repeated chunk messageId {}, last-chunk-id{}, chunkId = {}",
+                        msgMetadata.getProducerName(), chunkedMsgCtx.lastChunkedMessageId, msgId, msgMetadata.getChunkId());
+                compressedPayload.release();
+                increaseAvailablePermits(cnx);
+                boolean repeatedlyReceived = Arrays.stream(chunkedMsgCtx.chunkedMessageIds)
+                        .anyMatch(messageId1 -> messageId1 != null && messageId1.ledgerId == messageId.getLedgerId()
+                                && messageId1.entryId == messageId.getEntryId());
+                if (!repeatedlyReceived) {
+                    doAcknowledge(msgId, AckType.Individual, Collections.emptyMap(), null);

Review Comment:
   > Then, it seems like we don't need to iterate all chunkedMsgCtx.chunkedMessageIds.
   > I think we can check
   > ```
   > var prevChunkMsgId = chunkedMsgCtx.chunkedMessageIds[chunkId];
   > boolean repeatedlyReceived = prevChunkMsgId.ledgerId == messageId.getLedgerId()
   >         && prevChunkMsgId.entryId == messageId.getEntryId();
   > ```
   
   The retransmission of chunks by the producer might occur due to reconnection after a connection disruption. In this scenario, the producer doesn't re-split the chunked message but resends the chunks from the previously pending message. In such cases, the resent chunk and the previously sent chunk belong to the same chunked message and share the same UUID.
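
   For illustration, the reviewer's O(1) alternative could look like the following standalone sketch (all names here are hypothetical, and it assumes `chunkedMessageIds` is indexed by chunk ID and filled the first time each chunk arrives):
   ```java
   public class IndexedChunkCheck {
       // Minimal stand-in for MessageIdImpl.
       record EntryId(long ledgerId, long entryId) {}

       /** Compare only the slot for this chunkId instead of scanning all stored ids. */
       static boolean repeatedlyReceived(EntryId[] chunkedMessageIds, int chunkId, EntryId incoming) {
           EntryId prev = chunkedMessageIds[chunkId];
           return prev != null
                   && prev.ledgerId() == incoming.ledgerId()
                   && prev.entryId() == incoming.entryId();
       }

       public static void main(String[] args) {
           // Chunks 0 and 1 already received as entries 1:1 and 1:2.
           EntryId[] stored = { new EntryId(1, 1), new EntryId(1, 2) };

           // Redelivery of chunk 0 as the same entry 1:1 -> repeated, ignore it.
           System.out.println(repeatedlyReceived(stored, 0, new EntryId(1, 1))); // true

           // The same chunk 0 persisted again as a new entry 1:3 -> not repeated, ack it.
           System.out.println(repeatedlyReceived(stored, 0, new EntryId(1, 3))); // false
       }
   }
   ```
   Whether this indexed lookup is safe hinges on exactly the retransmission semantics discussed above, since resent chunks keep the same UUID and land in the same context.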



##########
pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConsumerImpl.java:
##########
@@ -1449,6 +1450,24 @@ private ByteBuf processMessageChunk(ByteBuf compressedPayload, MessageMetadata m
         // discard message if chunk is out-of-order
         if (chunkedMsgCtx == null || chunkedMsgCtx.chunkedMsgBuffer == null
                 || msgMetadata.getChunkId() != (chunkedMsgCtx.lastChunkedMessageId + 1)) {
+            // Filter duplicated chunks instead of discard it. (Only do this when exist duplication in a chunk message)
+            // For example:
+            //     Chunk-1 sequence ID: 0, chunk ID: 0
+            //     Chunk-2 sequence ID: 0, chunk ID: 0
+            //     Chunk-3 sequence ID: 0, chunk ID: 1
+            if (chunkedMsgCtx != null && msgMetadata.getChunkId() <= chunkedMsgCtx.lastChunkedMessageId) {
+                log.warn("[{}] Receive a repeated chunk messageId {}, last-chunk-id{}, chunkId = {}",
+                        msgMetadata.getProducerName(), chunkedMsgCtx.lastChunkedMessageId, msgId, msgMetadata.getChunkId());
+                compressedPayload.release();
+                increaseAvailablePermits(cnx);
+                boolean repeatedlyReceived = Arrays.stream(chunkedMsgCtx.chunkedMessageIds)
+                        .anyMatch(messageId1 -> messageId1 != null && messageId1.ledgerId == messageId.getLedgerId()
+                                && messageId1.entryId == messageId.getEntryId());
+                if (!repeatedlyReceived) {
+                    doAcknowledge(msgId, AckType.Individual, Collections.emptyMap(), null);

Review Comment:
   > This means we probably need to update the chunking uuid definition logic 
and add a suffix there(session id, maybe the producer start-time, or some other 
unique id to identify the producer session). Currently,
   > ```
   > String uuid = totalChunks > 1 ? String.format("%s-%d", producerName, 
sequenceId) : null;
   > ```
   
   Yeah, this is a good suggestion. I changed it as follows.
   ```
   String uuid = totalChunks > 1 ? String.format("%s-%d-%d", producerName, sequenceId,
           System.currentTimeMillis()) : null;
   ```
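
   A minimal standalone sketch of this changed uuid format (method and parameter names are illustrative, and it assumes the timestamp is captured once and reused for every chunk of the same message, as when the uuid is built once before splitting):
   ```java
   public class ChunkUuidSketch {
       /** Build the chunk uuid: producerName-sequenceId-sessionStartMillis, or null if not chunked. */
       static String chunkUuid(String producerName, long sequenceId, long sessionStartMillis, int totalChunks) {
           return totalChunks > 1
                   ? String.format("%s-%d-%d", producerName, sequenceId, sessionStartMillis)
                   : null;
       }

       public static void main(String[] args) {
           // Same producer name and sequence id, but two producer sessions
           // -> distinct uuids, so the consumer won't mix their chunks into one buffer.
           String t1 = chunkUuid("p-0", 0, 1000L, 2);
           String t2 = chunkUuid("p-0", 0, 2000L, 2);
           System.out.println(t1); // p-0-0-1000
           System.out.println(t2); // p-0-0-2000

           // Non-chunked messages get no uuid at all.
           System.out.println(chunkUuid("p-0", 0, 1000L, 1)); // null
       }
   }
   ```
   One caveat worth noting: a millisecond timestamp only distinguishes sessions that start in different milliseconds; a monotonic session counter or epoch, as the reviewer suggested, would be a stronger discriminator.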
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
