[GitHub] [rocketmq] cserwen opened a new issue, #7355: [Bug] ConsumeQueue building exception caused by host downtime

via GitHub Wed, 13 Sep 2023 05:26:11 -0700


cserwen opened a new issue, #7355:
URL: https://github.com/apache/rocketmq/issues/7355


   ### Before Creating the Bug Report
   
   - [X] I found a bug, not just asking a question, which should be created in 
[GitHub Discussions](https://github.com/apache/rocketmq/discussions).
   
   - [X] I have searched the [GitHub 
Issues](https://github.com/apache/rocketmq/issues) and [GitHub 
Discussions](https://github.com/apache/rocketmq/discussions)  of this 
repository and believe that this is not a duplicate.
   
   - [X] I have confirmed that this bug belongs to the current repository, not 
other repositories of RocketMQ.
   
   
   ### Runtime platform environment
   
   OS: CentOS 7.3
   
   ### RocketMQ version
   
   4.8.x
   
   ### JDK Version
   
   openjdk version "1.8.0_202"
   
   ### Describe the Bug
   
   - A downtime occurred on the slave node
   - When the Broker process starts, the following error log appears while 
the consumeQueue index is being built:
   ```log
   [BUG]logic queue order maybe wrong, expectLogicOffset: 5762122360 
currentLogicOffset: 5762121440 Topic: hlth-center-data QID: 1 Diff: 920
   ```
   - Deleting all consumeQueue files restores normal operation.
   
   ### Steps to Reproduce
   
   - stop the slave node.
   - delete the last consumeQueue of a queue.
   - start the slave node.
   
   ### What Did You Expect to See?
   
   This log should not be printed.
   
   ### What Did You See Instead?
   
   ```log
   [BUG]logic queue order maybe wrong, expectLogicOffset: 5762122360 
currentLogicOffset: 5762121440 Topic: hlth-center-data QID: 1 Diff: 920
   ```
   
   ### Additional Context
   
   ### # Possible reason
   - A power outage on the host causes the loss of unpersisted consumeQueue 
records in the pagecache.
   
   ### # Process
   - ConsumeQueue files are flushed asynchronously, once every second.
   
![image](https://github.com/apache/rocketmq/assets/46882838/7320145a-e287-4642-ba7d-84b763194699)
   - The index is built in memory and flushed to disk once every second, so 
some index data (the white blocks) has not yet been written to disk.
   ![cq-destroy 
drawio](https://github.com/apache/rocketmq/assets/46882838/d1cdd010-eb85-4cd4-9f04-a9bc399f14ab)
   -  A power outage on the host causes the loss of unpersisted consumeQueue 
records in the pagecache.
   ![cq-destroy drawio 
(1)](https://github.com/apache/rocketmq/assets/46882838/18b1f92e-8821-4d72-a5f6-e10360383b5a)
   - After the Broker is started, the current maximum reputOffset is 1004. The 
index building starts from 1005, so 1002 and 1003 are skipped.
   ![cq-destroy drawio 
(2)](https://github.com/apache/rocketmq/assets/46882838/f2918a97-d067-4630-a88b-a4fe04bf55d8)
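
The sequence above can be reproduced in a toy model. This is a minimal sketch, not RocketMQ code: `ToyConsumeQueue`, its methods, and the queue offsets 0 to 9 are all made up for illustration. The only value taken from RocketMQ is the 20-byte size of one consumeQueue entry. The check in `append` mirrors the comparison of `expectLogicOffset` and `currentLogicOffset` reported in the log.

```java
class ToyConsumeQueue {
    // One consumeQueue entry is 20 bytes in RocketMQ.
    static final int CQ_STORE_UNIT_SIZE = 20;

    private long wrotePosition;   // bytes appended so far (page cache view)
    private long flushedPosition; // bytes known to be on disk

    /**
     * Appends the entry for the given queue offset and returns the byte gap
     * between where the entry belongs and the current write position.
     * 0 means in order; a positive value means entries are missing.
     */
    long append(long queueOffset) {
        long expectLogicOffset = queueOffset * CQ_STORE_UNIT_SIZE;
        long currentLogicOffset = wrotePosition;
        long diff = expectLogicOffset - currentLogicOffset;
        if (diff >= 0) {
            // Write the entry at its expected slot and advance.
            wrotePosition = expectLogicOffset + CQ_STORE_UNIT_SIZE;
        }
        return diff;
    }

    /** Periodic flush: everything appended so far reaches disk. */
    void flush() {
        flushedPosition = wrotePosition;
    }

    /** Power loss: whatever was not flushed is gone. */
    void crash() {
        wrotePosition = flushedPosition;
    }

    public static void main(String[] args) {
        ToyConsumeQueue cq = new ToyConsumeQueue();
        for (long q = 0; q <= 5; q++) {
            cq.append(q);
        }
        cq.flush();                 // entries 0..5 are on disk
        for (long q = 6; q <= 9; q++) {
            cq.append(q);           // entries 6..9 live only in the page cache
        }
        cq.crash();                 // entries 6..9 are lost
        // Assumption: after restart, index building resumes at queue offset 8,
        // so entries 6 and 7 are never rebuilt.
        long diff = cq.append(8);
        System.out.println("Diff: " + diff); // prints "Diff: 40"
    }
}
```

Two skipped entries give a gap of 2 × 20 = 40 bytes, the same kind of `Diff` value the Broker reports.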
   
   ### # Verify
   - Read the damaged consumeQueue file:
   ```
   last=287986464, new=287986511, 
file=/home/mi/bak_cq/hlth-center-data/1/00000000005754000000
   MessageExt [brokerName=null, queueId=1, storeSize=1914, 
queueOffset=287986464, sysFlag=0, bornTimestamp=1694089617516}]
   MessageExt [brokerName=null, queueId=1, storeSize=1218, 
queueOffset=287986511, sysFlag=0, bornTimestamp=1694089620422}]
   ```
   - There is a gap. The QueueOffset of the next message should be 287986465, 
but it is actually 287986511, a difference of 46 entries. Each consumeQueue 
entry is 20 bytes, so the missing index data is 46 × 20 = 920 bytes, which 
matches the Diff in the log.
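
The arithmetic can be checked with a few lines of Java. This is a small sketch: the class and method names are hypothetical, the queue offsets and logic offsets are copied from the output and log above, and 20 is the size in bytes of one consumeQueue entry.

```java
class CqGapCheck {
    // One consumeQueue entry is 20 bytes:
    // 8-byte commitLog offset + 4-byte message size + 8-byte tag hash code.
    static final int CQ_STORE_UNIT_SIZE = 20;

    /** Bytes of index missing between the last good entry and the next entry found. */
    static long missingBytes(long lastQueueOffset, long nextQueueOffset) {
        long missingEntries = nextQueueOffset - (lastQueueOffset + 1);
        return missingEntries * CQ_STORE_UNIT_SIZE;
    }

    public static void main(String[] args) {
        // Gap computed from the two messages read out of the consumeQueue file.
        long fromFile = missingBytes(287986464L, 287986511L);
        // Gap reported by the Broker: expectLogicOffset - currentLogicOffset.
        long fromLog = 5762122360L - 5762121440L;
        System.out.println(fromFile + " " + fromLog); // prints "920 920"
    }
}
```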
   
   ### # How to fix
   #### 1. Memory cache
   - Add a cache that maintains the flushed offset of each queue: map 
<Topic#QueueId, commitLogOffsetPhy>
   - After an abnormal exit, enter the CQ recovery logic and scan the 
commitLog for recovery, starting from the smallest persisted offset recorded 
in the map.
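
A minimal sketch of such a cache follows. The class `FlushedOffsetCache` and its methods are hypothetical, not existing RocketMQ APIs, and the offsets in `main` are made up. The sketch only covers the bookkeeping; how the map itself survives a crash is not specified in this proposal and is left open here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class FlushedOffsetCache {
    // Key: "Topic#QueueId". Value: highest commitLog physical offset whose
    // consumeQueue entry is known to have been flushed for that queue.
    private final Map<String, Long> flushedPhyOffsets = new ConcurrentHashMap<>();

    static String key(String topic, int queueId) {
        return topic + "#" + queueId;
    }

    /** Called after a consumeQueue flush; keeps the largest offset seen per queue. */
    void onFlushed(String topic, int queueId, long commitLogOffsetPhy) {
        flushedPhyOffsets.merge(key(topic, queueId), commitLogOffsetPhy, Long::max);
    }

    /**
     * Where a commitLog scan would have to start so that no queue misses
     * entries: the smallest flushed offset across all queues, or the
     * fallback when nothing has been recorded yet.
     */
    long recoverStartOffset(long fallback) {
        if (flushedPhyOffsets.isEmpty()) {
            return fallback;
        }
        long min = Long.MAX_VALUE;
        for (long offset : flushedPhyOffsets.values()) {
            min = Math.min(min, offset);
        }
        return min;
    }

    public static void main(String[] args) {
        FlushedOffsetCache cache = new FlushedOffsetCache();
        cache.onFlushed("hlth-center-data", 0, 5000L);
        cache.onFlushed("hlth-center-data", 1, 4200L);
        cache.onFlushed("hlth-center-data", 1, 4100L); // older value is ignored
        System.out.println(cache.recoverStartOffset(0L)); // prints "4200"
    }
}
```

Taking the minimum across queues is conservative: queues that flushed further ahead see some entries again, which the existing repeated-build check can skip.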
   
   #### 2. Scan partly commitLogs
   - Since the cq is flushed periodically, only the most recent commitLog 
files are affected.
   - Find the corresponding CommitLog A according to the error log. Then move 
back in time by a safety margin (3 minutes), find the first CommitLog that is 
more than 3 minutes older, and start rebuilding from that commitLog.
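
One way to read this selection rule is sketched below. Everything here is an assumption made for illustration: the `LogFile` type, the use of each file's first message store time as its timestamp, and the backwards walk are not taken from RocketMQ. The list is expected to be non-empty and sorted by start offset.

```java
import java.util.Arrays;
import java.util.List;

class PartialScanStart {
    /** Hypothetical view of one commitLog file. */
    static final class LogFile {
        final long startOffset;      // physical offset of the first byte
        final long firstStoreTimeMs; // store time of the first message

        LogFile(long startOffset, long firstStoreTimeMs) {
            this.startOffset = startOffset;
            this.firstStoreTimeMs = firstStoreTimeMs;
        }
    }

    /**
     * Walks backwards from the file that contains the error offset and
     * returns the index of the first file that starts at least marginMs
     * earlier. Returns 0 when there is not enough history.
     */
    static int startIndex(List<LogFile> files, int errorIndex, long marginMs) {
        long cutoff = files.get(errorIndex).firstStoreTimeMs - marginMs;
        for (int i = errorIndex; i >= 0; i--) {
            if (files.get(i).firstStoreTimeMs <= cutoff) {
                return i;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        long fileSize = 1073741824L; // 1 GiB per commitLog file
        List<LogFile> files = Arrays.asList(
            new LogFile(0 * fileSize, 0L),
            new LogFile(1 * fileSize, 100_000L),
            new LogFile(2 * fileSize, 200_000L),
            new LogFile(3 * fileSize, 300_000L),
            new LogFile(4 * fileSize, 400_000L)); // CommitLog A from the error log
        int start = startIndex(files, 4, 3 * 60 * 1000L);
        System.out.println(start); // prints "2"
    }
}
```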
   
   ### # Linked issue
   #1397
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
