devmadhuu opened a new pull request, #7960: URL: https://github.com/apache/ozone/pull/7960
## What changes were proposed in this pull request?

This PR improves error handling for OM background task processing when Recon crashes abruptly. If Recon has applied incremental DB updates and then crashes due to an unexpected error, or the CU restarts Recon, just before the tasks consume the corresponding events, Recon does not try to consume those events again after it restarts. In this edge case the OM DB updates are missed.

This PR closes the gap as follows: on restart, Recon checks whether the last sequence number of the incremental DB update task matches the last updated sequence number of each underlying task, and runs reprocess for every task where the two do not match.

## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-12377

## How was this patch tested?

This change was tested manually in a local docker cluster by abruptly stopping Recon. The steps were as follows:

**Created 24 keys in OM and waited for Recon OM sync:**

```
dd if=/dev/zero of=testfile bs=1024 count=10240
dd if=/dev/zero of=testfile1 bs=2048 count=20480
ozone fs -mkdir -p ofs://om/volume1/fso-bucket/dir1/dir2/dir3
ozone fs -put -f testfile ofs://om/volume1/fso-bucket/dir1
ozone fs -put -f testfile ofs://om/volume1/fso-bucket/dir1/file1
ozone fs -put -f testfile ofs://om/volume1/fso-bucket/dir1/dir2
ozone fs -put -f testfile ofs://om/volume1/fso-bucket/dir1/dir2/file1
ozone fs -put -f testfile ofs://om/volume1/fso-bucket/dir1/dir2/dir3
ozone fs -put -f testfile ofs://om/volume1/fso-bucket/dir1/dir2/dir3/file1
ozone sh bucket create --layout=OBJECT_STORE /volume1/obs-bucket
ozone sh key put /volume1/obs-bucket/key1 testfile
ozone sh key put /volume1/obs-bucket/key1/key2 testfile
ozone sh key put /volume1/obs-bucket/key1/key2/key3 testfile
ozone sh key put /volume1/obs-bucket/key4 testfile
ozone sh key put /volume1/obs-bucket/key5 testfile
ozone sh key put /volume1/obs-bucket/key6 testfile
ozone sh key put -r=rs-3-2-1024k -t=EC /volume1/obs-bucket/key7 testfile
ozone sh bucket create --layout=LEGACY /volume1/legacy-bucket
ozone sh key put /volume1/legacy-bucket/key1 testfile
ozone sh key put /volume1/legacy-bucket/key1/key2 testfile1
ozone sh key put /volume1/legacy-bucket/key1/key2/key3 testfile
ozone sh key put /volume1/legacy-bucket/key4 testfile1
ozone sh key put /volume1/legacy-bucket/key5 testfile
ozone sh key put /volume1/legacy-bucket/key6 testfile1
ozone sh key put /volume1/obs-bucket/key7 testfile1
ozone sh key put /volume1/obs-bucket/key8/key9 testfile1
ozone sh key put /volume1/obs-bucket/key10/key11/key12 testfile1
ozone sh key put /volume1/obs-bucket/key13 testfile1
ozone sh key put /volume1/obs-bucket/key14 testfile1
ozone sh key put /volume1/obs-bucket/key15 testfile1
```

<img width="1710" alt="Pasted Graphic" src="https://github.com/user-attachments/assets/758f4776-70e5-4c9e-a938-ac14507e3fd7" />

**Created 3 extra keys:**

```
ozone sh key put /volume1/obs-bucket/demo1 testfile1
ozone sh key put /volume1/obs-bucket/demo2 testfile1
ozone sh key put /volume1/obs-bucket/demo3 testfile1
```

**Crashed Recon after the DB updates were applied to the Recon OM DB snapshot, but before the 3 delta updates were processed by the tasks. The log below shows a sequence number lag of zero, yet the key count shown is still 24, and the lost events are not processed even in the next sync cycle:**

```
2025-02-24 17:38:23 2025-02-24 12:08:23,627 [Recon-SyncOM-0] INFO impl.OzoneManagerServiceProviderImpl: Last known sequence number before sync: 319
2025-02-24 17:38:23 2025-02-24 12:08:23,629 [Recon-SyncOM-0] INFO impl.OzoneManagerServiceProviderImpl: Seq number of Recon's OM DB : 319
2025-02-24 17:38:23 2025-02-24 12:08:23,812 [Recon-SyncOM-0] INFO impl.OzoneManagerServiceProviderImpl: From Sequence Number:319, Recon DB Sequence Number: 319, Number of updates received from OM : 0, SequenceNumber diff: 0, SequenceNumber Lag from OM 0, isDBUpdateSuccess: true
```

**With the fix, after a crash or abrupt restart Recon recovers on its own and processes the lost events:**

```
2025-02-24 17:37:23 2025-02-24 12:07:23,627 [main] INFO impl.OzoneManagerServiceProviderImpl: Task details of such tasks whose lastUpdatedSeqNumber number not matching with lastUpdatedSeqNumber of 'OmDeltaRequest' task::
2025-02-24 17:37:23
2025-02-24 17:37:23 2025-02-24 12:07:23,627 [main] INFO impl.OzoneManagerServiceProviderImpl: OmDeltaRequest->319
2025-02-24 17:37:23 2025-02-24 12:07:23,627 [main] INFO impl.OzoneManagerServiceProviderImpl: OmTableInsightTask->301
2025-02-24 17:37:23 2025-02-24 12:07:23,627 [main] INFO impl.OzoneManagerServiceProviderImpl: NSSummaryTask->301
2025-02-24 17:37:23 2025-02-24 12:07:23,628 [main] INFO impl.OzoneManagerServiceProviderImpl: ContainerKeyMapperTask->301
2025-02-24 17:37:23 2025-02-24 12:07:23,628 [main] INFO impl.OzoneManagerServiceProviderImpl: FileSizeCountTaskOBS->301
2025-02-24 17:37:23 2025-02-24 12:07:23,628 [main] INFO impl.OzoneManagerServiceProviderImpl: FileSizeCountTaskFSO->301
2025-02-24 17:49:06 2025-02-24 12:19:06,685 [ReconTaskThread-0] INFO impl.ReconContainerMetadataManagerImpl: KEY_CONTAINER Table is empty, initializing from CONTAINER_KEY Table ...
2025-02-24 17:49:06 2025-02-24 12:19:06,686 [ReconTaskThread-0] INFO impl.ReconContainerMetadataManagerImpl: It took 0.0 seconds to initialized 0 records to KEY_CONTAINER table
2025-02-24 17:49:06 2025-02-24 12:19:06,708 [ReconTaskThread-0] INFO tasks.FileSizeCountTaskHelper: Starting Reprocess for FileSizeCountTaskOBS
2025-02-24 17:49:06 2025-02-24 12:19:06,717 [ReconTaskThread-0] INFO tasks.FileSizeCountTaskHelper: Deleted 5 records from "FILE_COUNT_BY_SIZE"
2025-02-24 17:49:06 2025-02-24 12:19:06,719 [ReconTaskThread-0] INFO tasks.FileSizeCountTaskHelper: Reprocessed 21 keys for bucket layout OBJECT_STORE.
2025-02-24 17:49:06 2025-02-24 12:19:06,740 [ReconTaskThread-0] INFO tasks.FileSizeCountTaskHelper: FileSizeCountTaskOBS completed Reprocess in 31 ms.
2025-02-24 17:49:06 2025-02-24 12:19:06,742 [ReconTaskThread-0] INFO tasks.FileSizeCountTaskHelper: Starting Reprocess for FileSizeCountTaskFSO
2025-02-24 17:49:06 2025-02-24 12:19:06,742 [ReconTaskThread-0] INFO tasks.FileSizeCountTaskHelper: Table already truncated by another task; waiting for truncation to complete.
2025-02-24 17:49:06 2025-02-24 12:19:06,743 [ReconTaskThread-0] INFO tasks.FileSizeCountTaskHelper: Reprocessed 6 keys for bucket layout FILE_SYSTEM_OPTIMIZED.
2025-02-24 17:49:06 2025-02-24 12:19:06,749 [ReconTaskThread-0] INFO tasks.FileSizeCountTaskHelper: FileSizeCountTaskFSO completed Reprocess in 7 ms.
```

<img width="1728" alt="Pasted Graphic 1" src="https://github.com/user-attachments/assets/e946597d-6722-40f1-b37e-a87584337d7b" />
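
For readers who want the restart check in compact form, here is a minimal sketch of the comparison described in this PR. It is written in Python for brevity and is not Recon's actual Java code: the function name `tasks_needing_reprocess` and the dictionary layout are hypothetical, and only the task names and sequence numbers are taken from the startup log in this description.

```python
from typing import Dict, List

# Name of the incremental DB update task as it appears in the startup log.
DELTA_TASK = "OmDeltaRequest"


def tasks_needing_reprocess(last_seq_by_task: Dict[str, int],
                            delta_task: str = DELTA_TASK) -> List[str]:
    """Return the tasks whose last updated sequence number differs
    from that of the incremental DB update (delta) task.

    Hypothetical helper: a mismatch means the task never consumed some
    events that were already applied to the DB, so it should reprocess.
    """
    delta_seq = last_seq_by_task[delta_task]
    return [name for name, seq in last_seq_by_task.items()
            if name != delta_task and seq != delta_seq]


# Sequence numbers copied from the startup log after the crash.
status = {
    "OmDeltaRequest": 319,
    "OmTableInsightTask": 301,
    "NSSummaryTask": 301,
    "ContainerKeyMapperTask": 301,
    "FileSizeCountTaskOBS": 301,
    "FileSizeCountTaskFSO": 301,
}

stale = tasks_needing_reprocess(status)
for name in stale:
    print(f"reprocess {name}")
```

With the logged values, all five underlying tasks lag behind `OmDeltaRequest` (301 versus 319), so each is selected for reprocess. A task whose number already equals 319 would be skipped.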
