cserwen opened a new issue, #7355: URL: https://github.com/apache/rocketmq/issues/7355
### Before Creating the Bug Report

- [X] I found a bug, not just asking a question, which should be created in [GitHub Discussions](https://github.com/apache/rocketmq/discussions).
- [X] I have searched the [GitHub Issues](https://github.com/apache/rocketmq/issues) and [GitHub Discussions](https://github.com/apache/rocketmq/discussions) of this repository and believe that this is not a duplicate.
- [X] I have confirmed that this bug belongs to the current repository, not other repositories of RocketMQ.

### Runtime platform environment

OS: CentOS 7.3

### RocketMQ version

4.8.x

### JDK Version

openjdk version "1.8.0_202"

### Describe the Bug

- The slave node went down unexpectedly.
- When the Broker process was started again, the following exception log appeared while the consumeQueue index was being built.

```log
[BUG]logic queue order maybe wrong, expectLogicOffset: 5762122360 currentLogicOffset: 5762121440 Topic: hlth-center-data QID: 1 Diff: 920
```

- After all consumeQueues were deleted, the Broker returned to normal.

### Steps to Reproduce

- Stop the slave node.
- Delete the last consumeQueue file of a queue.
- Start the slave node.

### What Did You Expect to See?

This log should not be printed.

### What Did You See Instead?

```log
[BUG]logic queue order maybe wrong, expectLogicOffset: 5762122360 currentLogicOffset: 5762121440 Topic: hlth-center-data QID: 1 Diff: 920
```

### Additional Context

### # Possible reason

- A power outage on the host causes the loss of unpersisted consumeQueue records in the page cache.

### # Process

- ConsumeQueue files are flushed asynchronously, once every second.
- The index is built in memory and flushed to disk once every second, so at any moment some index data (the white blocks in the issue's diagram) has not yet been written to disk.
- A power outage on the host causes the loss of the unpersisted consumeQueue records in the page cache.
- After the Broker is started, the current maximum reputOffset is 1004. Index building therefore resumes from 1005, so 1002 and 1003 are skipped.
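The failure sequence above can be reproduced in miniature. The following is a minimal Java sketch, not RocketMQ code: `ToyConsumeQueue`, `CqGapSketch`, and their methods are hypothetical names used only for illustration. It relies on each consumeQueue entry occupying 20 bytes, and it assumes the expected logical offset is the message's queue offset multiplied by that entry size, which is how the `Diff` value in the log is interpreted here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model; not part of RocketMQ.
class CqGapSketch {
    // One consumeQueue entry: 8-byte commitLog offset + 4-byte size + 8-byte tag hash.
    static final int CQ_UNIT_SIZE = 20;

    /** Toy consume queue: entries live in memory and become durable only on flush(). */
    static final class ToyConsumeQueue {
        private final List<Long> entries = new ArrayList<>(); // commitLog offsets
        private int flushedCount = 0;

        void append(long commitLogOffset) { entries.add(commitLogOffset); }

        void flush() { flushedCount = entries.size(); }

        /** Simulates a power loss: everything appended after the last flush is gone. */
        void crash() { entries.subList(flushedCount, entries.size()).clear(); }

        long currentLogicOffset() { return (long) entries.size() * CQ_UNIT_SIZE; }
    }

    /** Number of index entries missing between the expected and the actual write position. */
    static long missingEntries(long expectLogicOffset, long currentLogicOffset) {
        return (expectLogicOffset - currentLogicOffset) / CQ_UNIT_SIZE;
    }

    public static void main(String[] args) {
        ToyConsumeQueue cq = new ToyConsumeQueue();
        cq.append(1000);
        cq.append(1001);
        cq.flush();          // entries 0 and 1 are on disk
        cq.append(1002);
        cq.append(1003);     // entries 2 and 3 exist only in the page cache
        cq.crash();          // power loss before the next flush

        // After restart, dispatch resumes with the message whose queue offset is 4.
        long expect = 4L * CQ_UNIT_SIZE;
        long current = cq.currentLogicOffset();
        System.out.println("Diff: " + (expect - current)
                + " bytes, missing entries: " + missingEntries(expect, current));

        // The values from the log in this report.
        System.out.println(missingEntries(5762122360L, 5762121440L));
    }
}
```

Running `main` prints `Diff: 40 bytes, missing entries: 2` for the toy queue and `46` for the offsets taken from the log, that is, 920 bytes at 20 bytes per entry.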
### # Verify

- Read the wrong consumeQueue file:

```
last=287986464, new=287986511, file=/home/mi/bak_cq/hlth-center-data/1/00000000005754000000
MessageExt [brokerName=null, queueId=1, storeSize=1914, queueOffset=287986464, sysFlag=0, bornTimestamp=1694089617516}]
MessageExt [brokerName=null, queueId=1, storeSize=1218, queueOffset=287986511, sysFlag=0, bornTimestamp=1694089620422}]
```

- There is a gap. The queueOffset of the next message should be 287986465, but it is actually 287986511, a difference of 46 entries. At 20 bytes per consumeQueue entry, that is exactly 920 bytes of index data, which matches the `Diff` in the log.

### # How to fix

#### 1. Memory cache

- Add a cache that maintains the flushed offset of each queue: `map<Topic#QueueId, commitLogOffsetPhy>`.
- After an abnormal exit, enter the CQ recovery logic and start scanning the commitLog from the smallest persisted position recorded in the map.

#### 2. Scan part of the commitLogs

- Since the CQ is flushed periodically, only the most recent commitLog files are affected.
- Find the corresponding CommitLog A from the error log. Then move back in time by a fixed margin (3 minutes), find the first CommitLog that is more than 3 minutes older, and start rebuilding from that commitLog.

### # Linked issue

#1397
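As a rough illustration of fix option 1 (the memory cache), the sketch below keeps a `Topic#QueueId` to flushed commitLog offset map and derives the recovery start position as the smallest value in it. `FlushedOffsetTracker`, `onFlushed`, and `recoveryStartOffset` are hypothetical names, not RocketMQ APIs, and the sketch does not address how such a map would itself survive a crash, which the proposal leaves open.

```java
import java.util.Map;
import java.util.OptionalLong;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper illustrating fix option 1; not part of RocketMQ.
class FlushedOffsetTracker {
    // Key: "Topic#QueueId". Value: largest commitLog physical offset whose
    // consumeQueue entry is known to have been flushed for that queue.
    private final Map<String, Long> flushedPhyOffset = new ConcurrentHashMap<>();

    static String key(String topic, int queueId) {
        return topic + "#" + queueId;
    }

    /** Called after a queue's consumeQueue file has been flushed up to the given offset. */
    void onFlushed(String topic, int queueId, long commitLogOffsetPhy) {
        // Keep the maximum so that a late or stale report never moves the offset backwards.
        flushedPhyOffset.merge(key(topic, queueId), commitLogOffsetPhy, Long::max);
    }

    /** Smallest flushed offset across all queues, or empty if nothing was recorded. */
    OptionalLong recoveryStartOffset() {
        return flushedPhyOffset.values().stream().mapToLong(Long::longValue).min();
    }

    public static void main(String[] args) {
        FlushedOffsetTracker tracker = new FlushedOffsetTracker();
        tracker.onFlushed("hlth-center-data", 0, 1004L); // queue 0 is flushed up to 1004
        tracker.onFlushed("hlth-center-data", 1, 1001L); // queue 1 lost 1002 and 1003
        // Rescanning from the minimum guarantees that no queue skips an entry.
        System.out.println("Rescan commitLog from offset "
                + tracker.recoveryStartOffset().getAsLong());
    }
}
```

Starting from the minimum rather than the maximum means some queues re-read entries they already hold, which is the trade-off this option makes to avoid the gap.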
