devmadhuu opened a new pull request, #7960:
URL: https://github.com/apache/ozone/pull/7960

   ## What changes were proposed in this pull request?
   This PR improves error handling for OM background task processing when 
Recon crashes abruptly.
   
   Suppose Recon has applied incremental DB updates and then, just before the 
tasks consume those events, it crashes due to an unexpected error or the CU 
restarts it. On restart, Recon does not try to consume those events again, so 
in this edge case the OM DB updates are missed. This PR implements the 
following solution to close the gap:
   
   On restart, check whether the last sequence number of the incremental DB 
update task matches the lastUpdatedSeq number of each underlying task. For any 
task where the two do not match, run reprocess.
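
The restart check described above can be sketched roughly as follows. This is a minimal illustration only, not the code in this PR: the class `StaleTaskDetector`, the method `findStaleTasks`, and the plain `Map` of task name to last updated sequence number are hypothetical stand-ins for however Recon actually stores task status. The task names and sequence numbers in `main` are taken from the log output later in this description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not a class from the Ozone code base. */
final class StaleTaskDetector {

  private StaleTaskDetector() { }

  /**
   * Returns the names of tasks whose last updated sequence number differs
   * from that of the delta (incremental DB update) task.
   * These are the tasks that would need a reprocess on restart.
   */
  static List<String> findStaleTasks(String deltaTaskName,
                                     Map<String, Long> lastUpdatedSeq) {
    List<String> stale = new ArrayList<>();
    Long deltaSeq = lastUpdatedSeq.get(deltaTaskName);
    if (deltaSeq == null) {
      // No record for the delta task, so there is nothing to compare against.
      return stale;
    }
    for (Map.Entry<String, Long> entry : lastUpdatedSeq.entrySet()) {
      boolean isDeltaTask = entry.getKey().equals(deltaTaskName);
      // A missing (null) sequence number is also treated as a mismatch.
      if (!isDeltaTask && !deltaSeq.equals(entry.getValue())) {
        stale.add(entry.getKey());
      }
    }
    return stale;
  }

  public static void main(String[] args) {
    // Sequence numbers as they appear in the restart log of this PR.
    Map<String, Long> seq = new LinkedHashMap<>();
    seq.put("OmDeltaRequest", 319L);
    seq.put("OmTableInsightTask", 301L);
    seq.put("NSSummaryTask", 301L);
    seq.put("ContainerKeyMapperTask", 301L);
    // Every task listed here lags behind OmDeltaRequest and would be reprocessed.
    System.out.println(findStaleTasks("OmDeltaRequest", seq));
  }
}
```

A full reprocess is heavier than replaying only the lost events, but it needs no record of which events were in flight at the time of the crash, which is presumably why the simpler comparison is enough here.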
   
   ## What is the link to the Apache JIRA
   https://issues.apache.org/jira/browse/HDDS-12377
   
   ## How was this patch tested?
   This change was tested manually on a local Docker cluster by stopping Recon 
abruptly. The steps followed were:
   
   **Created 24 keys in OM and waited for the Recon OM sync:**
   
   ```
   dd if=/dev/zero of=testfile bs=1024 count=10240
   dd if=/dev/zero of=testfile1 bs=2048 count=20480
   
   ozone fs -mkdir -p ofs://om/volume1/fso-bucket/dir1/dir2/dir3
   ozone fs -put -f testfile ofs://om/volume1/fso-bucket/dir1
   ozone fs -put -f testfile ofs://om/volume1/fso-bucket/dir1/file1
   ozone fs -put -f testfile ofs://om/volume1/fso-bucket/dir1/dir2
   ozone fs -put -f testfile ofs://om/volume1/fso-bucket/dir1/dir2/file1
   ozone fs -put -f testfile ofs://om/volume1/fso-bucket/dir1/dir2/dir3
   ozone fs -put -f testfile ofs://om/volume1/fso-bucket/dir1/dir2/dir3/file1
   
   
   ozone sh bucket create --layout=OBJECT_STORE /volume1/obs-bucket
   
   ozone sh key put /volume1/obs-bucket/key1 testfile
   ozone sh key put /volume1/obs-bucket/key1/key2 testfile
   ozone sh key put /volume1/obs-bucket/key1/key2/key3 testfile
   ozone sh key put /volume1/obs-bucket/key4 testfile
   ozone sh key put /volume1/obs-bucket/key5 testfile
   ozone sh key put /volume1/obs-bucket/key6 testfile
   
   ozone sh key put -r=rs-3-2-1024k -t=EC /volume1/obs-bucket/key7 testfile
   
   ozone sh bucket create --layout=LEGACY /volume1/legacy-bucket
   
   ozone sh key put /volume1/legacy-bucket/key1 testfile
   ozone sh key put /volume1/legacy-bucket/key1/key2 testfile1
   ozone sh key put /volume1/legacy-bucket/key1/key2/key3 testfile
   ozone sh key put /volume1/legacy-bucket/key4 testfile1
   ozone sh key put /volume1/legacy-bucket/key5 testfile
   ozone sh key put /volume1/legacy-bucket/key6 testfile1
   
   ozone sh key put /volume1/obs-bucket/key7 testfile1
   ozone sh key put /volume1/obs-bucket/key8/key9 testfile1
   ozone sh key put /volume1/obs-bucket/key10/key11/key12 testfile1
   ozone sh key put /volume1/obs-bucket/key13 testfile1
   ozone sh key put /volume1/obs-bucket/key14 testfile1
   ozone sh key put /volume1/obs-bucket/key15 testfile1
   ```
   <img width="1710" alt="Pasted Graphic" src="https://github.com/user-attachments/assets/758f4776-70e5-4c9e-a938-ac14507e3fd7" />
   
   **Created 3 additional keys:**
   
   ```
   ozone sh key put /volume1/obs-bucket/demo1 testfile1
   ozone sh key put /volume1/obs-bucket/demo2 testfile1
   ozone sh key put /volume1/obs-bucket/demo3 testfile1
   ```
   
   **Crashed Recon after the DB updates were applied to Recon's OM DB snapshot 
but before the tasks processed the 3 delta updates. The log below shows a 
sequence number lag of zero, yet the key count shown is still 24, and the lost 
events are not processed even in the next sync cycle:**
   
   ```
   2025-02-24 17:38:23 2025-02-24 12:08:23,627 [Recon-SyncOM-0] INFO 
impl.OzoneManagerServiceProviderImpl: Last known sequence number before sync: 
319
   2025-02-24 17:38:23 2025-02-24 12:08:23,629 [Recon-SyncOM-0] INFO 
impl.OzoneManagerServiceProviderImpl: Seq number of Recon's OM DB : 319
   2025-02-24 17:38:23 2025-02-24 12:08:23,812 [Recon-SyncOM-0] INFO 
impl.OzoneManagerServiceProviderImpl: 
   From Sequence Number:319, Recon DB Sequence Number: 319, Number of updates 
received from OM : 0, SequenceNumber diff: 0, 
   SequenceNumber Lag from OM 0, isDBUpdateSuccess: true
   ```
   
   **With the fix, after a crash or abrupt restart, Recon recovers on its own 
and processes the lost events:**
   
   ```
   2025-02-24 17:37:23 2025-02-24 12:07:23,627 [main] INFO 
impl.OzoneManagerServiceProviderImpl: Task details of such tasks whose 
lastUpdatedSeqNumber number not matching with lastUpdatedSeqNumber of 
'OmDeltaRequest' task::
   2025-02-24 17:37:23 
   2025-02-24 17:37:23 2025-02-24 12:07:23,627 [main] INFO 
impl.OzoneManagerServiceProviderImpl: OmDeltaRequest->319
   2025-02-24 17:37:23 2025-02-24 12:07:23,627 [main] INFO 
impl.OzoneManagerServiceProviderImpl: OmTableInsightTask->301
   2025-02-24 17:37:23 2025-02-24 12:07:23,627 [main] INFO 
impl.OzoneManagerServiceProviderImpl: NSSummaryTask->301
   2025-02-24 17:37:23 2025-02-24 12:07:23,628 [main] INFO 
impl.OzoneManagerServiceProviderImpl: ContainerKeyMapperTask->301
   2025-02-24 17:37:23 2025-02-24 12:07:23,628 [main] INFO 
impl.OzoneManagerServiceProviderImpl: FileSizeCountTaskOBS->301
   2025-02-24 17:37:23 2025-02-24 12:07:23,628 [main] INFO 
impl.OzoneManagerServiceProviderImpl: FileSizeCountTaskFSO->301
   2025-02-24 17:49:06 2025-02-24 12:19:06,685 [ReconTaskThread-0] INFO 
impl.ReconContainerMetadataManagerImpl: KEY_CONTAINER Table is empty, 
initializing from CONTAINER_KEY Table ...
   2025-02-24 17:49:06 2025-02-24 12:19:06,686 [ReconTaskThread-0] INFO 
impl.ReconContainerMetadataManagerImpl: It took 0.0 seconds to initialized 0 
records to KEY_CONTAINER table
   2025-02-24 17:49:06 2025-02-24 12:19:06,708 [ReconTaskThread-0] INFO 
tasks.FileSizeCountTaskHelper: Starting Reprocess for FileSizeCountTaskOBS
   2025-02-24 17:49:06 2025-02-24 12:19:06,717 [ReconTaskThread-0] INFO 
tasks.FileSizeCountTaskHelper: Deleted 5 records from "FILE_COUNT_BY_SIZE"
   2025-02-24 17:49:06 2025-02-24 12:19:06,719 [ReconTaskThread-0] INFO 
tasks.FileSizeCountTaskHelper: Reprocessed 21 keys for bucket layout 
OBJECT_STORE.
   2025-02-24 17:49:06 2025-02-24 12:19:06,740 [ReconTaskThread-0] INFO 
tasks.FileSizeCountTaskHelper: FileSizeCountTaskOBS completed Reprocess in 31 
ms.
   2025-02-24 17:49:06 2025-02-24 12:19:06,742 [ReconTaskThread-0] INFO 
tasks.FileSizeCountTaskHelper: Starting Reprocess for FileSizeCountTaskFSO
   2025-02-24 17:49:06 2025-02-24 12:19:06,742 [ReconTaskThread-0] INFO 
tasks.FileSizeCountTaskHelper: Table already truncated by another task; waiting 
for truncation to complete.
   2025-02-24 17:49:06 2025-02-24 12:19:06,743 [ReconTaskThread-0] INFO 
tasks.FileSizeCountTaskHelper: Reprocessed 6 keys for bucket layout 
FILE_SYSTEM_OPTIMIZED.
   2025-02-24 17:49:06 2025-02-24 12:19:06,749 [ReconTaskThread-0] INFO 
tasks.FileSizeCountTaskHelper: FileSizeCountTaskFSO completed Reprocess in 7 ms.
   ```
   
   <img width="1728" alt="Pasted Graphic 1" src="https://github.com/user-attachments/assets/e946597d-6722-40f1-b37e-a87584337d7b" />
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

