equanz opened a new pull request #7553:
URL: https://github.com/apache/pulsar/pull/7553


   ### Motivation
   In some cases, message ordering is broken for a Key_Shared consumer.
   Here is one way to reproduce it (I think it is one of several cases that trigger this issue).
   
   1. Connect Consumer1 to the Key_Shared subscription `sub` and hold off receiving
      - receiverQueueSize: 500
   2. Connect a Producer and publish 500 messages with key `(i % 10)`
   3. Connect Consumer2 to the same subscription and start receiving
      - receiverQueueSize: 1
      - since https://github.com/apache/pulsar/pull/7106 , Consumer2 can't receive (expected)
   4. The Producer publishes 500 more messages with the same key generation algorithm
   5. After that, Consumer1 starts receiving
   6. Check Consumer2's message ordering
      - sometimes message ordering is broken within the same key
   
   Consumer1:
   ```
   Connected: Tue Jul 14 09:36:39 JST 2020
   [pulsar-client-io-1-1] WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider
   [pulsar-timer-4-1] INFO org.apache.pulsar.client.impl.ConsumerStatsRecorderImpl - [persistent://public/default/key-shared-test] [sub0] [820f0] Prefetched messages: 499 --- Consume throughput received: 0.02 msgs/s --- 0.00 Mbit/s --- Ack sent rate: 0.00 ack/s --- Failed messages: 0 --- batch messages: 0 ---Failed acks: 0
   Received: my-message-0 PublishTime: 1594687006203 Date: Tue Jul 14 09:37:46 JST 2020
   Received: my-message-1 PublishTime: 1594687006243 Date: Tue Jul 14 09:37:46 JST 2020
   Received: my-message-2 PublishTime: 1594687006247 Date: Tue Jul 14 09:37:46 JST 2020
   ...
   Received: my-message-498 PublishTime: 1594687008727 Date: Tue Jul 14 09:37:46 JST 2020
   Received: my-message-499 PublishTime: 1594687008731 Date: Tue Jul 14 09:37:46 JST 2020
   Received: my-message-500 PublishTime: 1594687038742 Date: Tue Jul 14 09:37:46 JST 2020
   ...
   Received: my-message-990 PublishTime: 1594687040094 Date: Tue Jul 14 09:37:46 JST 2020
   Received: my-message-994 PublishTime: 1594687040103 Date: Tue Jul 14 09:37:46 JST 2020
   Received: my-message-995 PublishTime: 1594687040105 Date: Tue Jul 14 09:37:46 JST 2020
   Received: my-message-997 PublishTime: 1594687040113 Date: Tue Jul 14 09:37:46 JST 2020
   ```
   
   Consumer2:
   ```
   Connected: Tue Jul 14 09:37:03 JST 2020
   [pulsar-client-io-1-1] WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider
   Received: my-message-501 MessageId: 4:1501:-1 PublishTime: 1594687038753 Date: Tue Jul 14 09:37:46 JST 2020
   Received: my-message-502 MessageId: 4:1502:-1 PublishTime: 1594687038755 Date: Tue Jul 14 09:37:46 JST 2020
   Received: my-message-503 MessageId: 4:1503:-1 PublishTime: 1594687038759 Date: Tue Jul 14 09:37:46 JST 2020
   Received: my-message-506 MessageId: 4:1506:-1 PublishTime: 1594687038785 Date: Tue Jul 14 09:37:46 JST 2020
   Received: my-message-508 MessageId: 4:1508:-1 PublishTime: 1594687038812 Date: Tue Jul 14 09:37:46 JST 2020
   Received: my-message-901 MessageId: 4:1901:-1 PublishTime: 1594687039871 Date: Tue Jul 14 09:37:46 JST 2020
   Received: my-message-509 MessageId: 4:1509:-1 PublishTime: 1594687038815 Date: Tue Jul 14 09:37:46 JST 2020
   ordering was broken, key: 1 oldNum: 901 newNum: 511
   Received: my-message-511 MessageId: 4:1511:-1 PublishTime: 1594687038826 Date: Tue Jul 14 09:37:46 JST 2020
   Received: my-message-512 MessageId: 4:1512:-1 PublishTime: 1594687038830 Date: Tue Jul 14 09:37:46 JST 2020
   ...
   ```
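
The `ordering was broken` line in the Consumer2 log comes from a per-key check on the client side. The checker itself is not part of this description, so the following is only a minimal sketch of how such a check might look. It assumes the key is the message number modulo 10 (matching `(i % 10)` in the steps above); the class name `KeyOrderChecker` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: remembers the last message number seen per key and flags regressions. */
final class KeyOrderChecker {
    private final Map<Integer, Integer> lastSeenByKey = new HashMap<>();

    /** Assumed key generation, mirroring `(i % 10)` from the reproduction steps. */
    static int keyOf(int messageNumber) {
        return messageNumber % 10;
    }

    /**
     * Records a received message number.
     * Returns null when per-key order is preserved, otherwise a description of the violation.
     */
    String accept(int messageNumber) {
        int key = keyOf(messageNumber);
        Integer previous = lastSeenByKey.put(key, messageNumber);
        if (previous != null && previous > messageNumber) {
            return "ordering was broken, key: " + key
                    + " oldNum: " + previous + " newNum: " + messageNumber;
        }
        return null;
    }

    public static void main(String[] args) {
        KeyOrderChecker checker = new KeyOrderChecker();
        // Message numbers in the order Consumer2 received them in the log above.
        int[] received = {501, 502, 503, 506, 508, 901, 509, 511, 512};
        for (int n : received) {
            String violation = checker.accept(n);
            if (violation != null) {
                System.out.println(violation);
            }
        }
    }
}
```

Fed with the numbers from the Consumer2 log, the sketch reports a violation only at `my-message-511`, which shares key 1 with the earlier `my-message-901`.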
   
   I think this issue is caused by https://github.com/apache/pulsar/pull/7105.
   Here is an example sequence:
   1. dispatch messages
   2. Consumer2 is stuck and `totalMessagesSent=0`
      - Consumer2 availablePermits is 0
   3. redelivery messages are skipped temporarily
      - Consumer2 availablePermits goes back to 1
   4. dispatch new messages
      - a new message is dispatched to Consumer2
   5. go back to the redelivery messages
   6. dispatch messages
      - ordering is broken
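
The sequence above can be illustrated with a toy model. This is not Pulsar broker code: the class `StuckReplaySketch`, the two queues, and the message numbers are all hypothetical, and the permit handling is simplified. It only shows how delivering a new message while the redelivery messages are skipped puts a newer message ahead of older ones with the same key.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model of the interleaving in the steps above; NOT Pulsar broker code. */
final class StuckReplaySketch {

    /** Moves up to `permits` messages from `source` to `delivered`; returns how many were sent. */
    static int dispatch(Deque<Integer> source, List<Integer> delivered, int permits) {
        int sent = 0;
        while (sent < permits && !source.isEmpty()) {
            delivered.add(source.poll());
            sent++;
        }
        return sent;
    }

    /** Returns the order in which a single consumer receives messages of one key. */
    static List<Integer> run(boolean dispatchNewWhileReplaySkipped) {
        // Hypothetical message numbers that all share one key (n % 10 == 1).
        Deque<Integer> replay = new ArrayDeque<>(List.of(511, 521));       // older, awaiting redelivery
        Deque<Integer> newMessages = new ArrayDeque<>(List.of(901, 911));  // newer, not yet dispatched
        List<Integer> delivered = new ArrayList<>();

        // Steps 1-2: the consumer has 0 permits, so nothing is sent.
        int totalMessagesSent = dispatch(replay, delivered, 0);
        boolean replaySkipped = totalMessagesSent == 0;

        // Step 3: a permit comes back while the redelivery messages are skipped.
        int availablePermits = 1;

        // Step 4: new messages are dispatched although older ones are still pending.
        if (replaySkipped && dispatchNewWhileReplaySkipped) {
            dispatch(newMessages, delivered, availablePermits);
        }

        // Steps 5-6: back to the redelivery messages, then the remaining new ones.
        // Assumption: the consumer keeps granting permits from here on.
        dispatch(replay, delivered, Integer.MAX_VALUE);
        dispatch(newMessages, delivered, Integer.MAX_VALUE);
        return delivered;
    }

    public static void main(String[] args) {
        System.out.println(run(true));   // prints [901, 511, 521, 911]
        System.out.println(run(false));  // prints [511, 521, 901, 911]
    }
}
```

In this model, 901 overtakes 511 only when step 4 runs while the redelivery messages are still pending.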
   
   ### Modifications
   Stop dispatching manually when messages are skipped temporarily because a Key_Shared consumer is stuck on delivery.
   
   ### Verifying this change
   This issue should ideally be covered by test cases.
   However, I think it is a corner case and not easy to test. If there is an easy way to test it, please tell me.
   
   ### Does this pull request potentially affect one of the following parts:
     - Dependencies (does it add or upgrade a dependency): (no)
     - The public API: (no)
     - The schema: (no)
     - The default values of configurations: (no)
     - The wire protocol: (no)
     - The rest endpoints: (no)
     - The admin cli options: (no)
     - Anything that affects deployment: (no)
   
   ### Documentation
     - Does this pull request introduce a new feature? (no)
   

