[ 
https://issues.apache.org/jira/browse/HDDS-10680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDDS-10680:
----------------------------------
    Labels: pull-request-available  (was: )

> Duplicate delete key blocks sent to SCM
> ---------------------------------------
>
>                 Key: HDDS-10680
>                 URL: https://issues.apache.org/jira/browse/HDDS-10680
>             Project: Apache Ozone
>          Issue Type: Bug
>    Affects Versions: 1.4.0
>            Reporter: Sammi Chen
>            Assignee: Sammi Chen
>            Priority: Critical
>              Labels: pull-request-available
>
> In one user's SCM audit log file, there are two DELETE_KEY_BLOCK records with 
> almost the same blocks info, here are the top a few lines of each,
>  
> {code:java}
> 024-04-11 08:42:37,811 | INFO  | SCMAudit | 
> user=om/master02-cdpt5.***.com@***.COM | ip=10.59.2.2 | op=DELETE_KEY_BLOCK 
> {KeyBlockToDelete=[BlockGroup[groupID='/icbigdata/wyg/1745367053961547777/2024-04-10/1777992236669792256',
>  blockIDs=[conID: 164130 locID: 111677748060195553 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/03677746b84d4e14aa8c2a9cafd30227.xlsx',
>  blockIDs=[conID: 164058 locID: 111677748060149956 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
>  blockIDs=[conID: 163882 locID: 111677748060153718 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
>  blockIDs=[conID: 164074 locID: 111677748060208039 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
>  blockIDs=[conID: 163515 locID: 111677748059929337 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
>  blockIDs=[conID: 164004 locID: 111677748060214180 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
>  blockIDs=[conID: 162904 locID: 111677748059401411 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
>  blockIDs=[conID: 164071 locID: 111677748060162845 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
>  blockIDs=[conID: 164059 locID: 111677748060162941 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
>  blockIDs=[conID: 164072 locID: 111677748060163107 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx'/,
>  blockIDs=[conID: 164074 locID: 111677748060163112 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
>  blockIDs=[conID: 163976 locID: 111677748060163154 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/2b308026f4a84f8ebbf2c669557532ab.xlsx',
>  blockIDs=[conID: 163983 locID: 111677748060075978 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
>  blockIDs=[conID: 163995 locID: 111677748060152877 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
>  blockIDs=[conID: 163907 locID: 111677748060159708 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/33b7b03beade4ba28a565385987294ad.xlsx',
>  blockIDs=[conID: 163887 locID: 111677748060148812 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/34131229d96d451788d012a23245d9fe.doc',
>  blockIDs=[conID: 163923 locID: 111677748060046291 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/3cffc12d5a8c451fb46337911609bbba.xlsx',
>  blockIDs=[conID: 164102 locID: 111677748060156005 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/3cffc12d5a8c451fb46337911609bbba.xlsx',
>  blockIDs=[conID: 164091 locID: 111677748060156058 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/
> 2024-04-11 08:43:35,862 | INFO  | SCMAudit | 
> user=om/master02-cdpt5.***.com@***.COM | ip=10.59.2.2 | op=DELETE_KEY_BLOCK 
> {KeyBlockToDelete=[BlockGroup[groupID='/icbigdata/wyg/1745367053961547777/2024-04-10/1777992236669792256',
>  blockIDs=[conID: 164130 locID: 111677748060195553 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/03677746b84d4e14aa8c2a9cafd30227.xlsx',
>  blockIDs=[conID: 164058 locID: 111677748060149956 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
>  blockIDs=[conID: 163882 locID: 111677748060153718 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
>  blockIDs=[conID: 164074 locID: 111677748060208039 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
>  blockIDs=[conID: 163515 locID: 111677748059929337 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
>  blockIDs=[conID: 164004 locID: 111677748060214180 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
>  blockIDs=[conID: 162904 locID: 111677748059401411 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
>  blockIDs=[conID: 164071 locID: 111677748060162845 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
>  blockIDs=[conID: 164059 locID: 111677748060162941 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
>  blockIDs=[conID: 164072 locID: 111677748060163107 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
>  blockIDs=[conID: 164074 locID: 111677748060163112 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
>  blockIDs=[conID: 163976 locID: 111677748060163154 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/2b308026f4a84f8ebbf2c669557532ab.xlsx',
>  blockIDs=[conID: 163983 locID: 111677748060075978 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
>  blockIDs=[conID: 163995 locID: 111677748060152877 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
>  blockIDs=[conID: 163907 locID: 111677748060159708 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/ {code}
> Here is the OM logs(not the same time window because no OM logs available for 
> that period,  but the logic is the same)
> {code:java}
> 2024-04-10 14:31:03,015 WARN org.apache.hadoop.hdds.utils.BackgroundService: 
> Background task execution failed
> java.lang.IllegalStateException: Volume and/or Bucket Name missing from Key 
> Name.
>         at 
> org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:45)
>         at 
> org.apache.hadoop.ozone.om.service.KeyDeletingService.addToMap(KeyDeletingService.java:316)
>         at 
> org.apache.hadoop.ozone.om.service.KeyDeletingService.access$8(KeyDeletingService.java:312)
>         at 
> org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.submitPurgeKeysRequest(KeyDeletingService.java:251)
>         at 
> org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:180)
>         at 
> org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
>         at 
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 2024-04-10 14:31:50,170 WARN org.apache.hadoop.ipc.Server: Large response 
> size 1051235 for call Call#118019916 Retry#0 
> org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 
> 10.59.2.1:37063
> 2024-04-10 14:31:50,272 WARN org.apache.hadoop.ipc.Server: Large response 
> size 1051989 for call Call#118019932 Retry#0 
> org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 
> 10.59.2.1:37063
> 2024-04-10 14:32:05,664 INFO 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> om2@group-9F198C4C3682-SegmentedRaftLogWorker: Rolling segment 
> log-204433475_204439100 to index:204439100
> 2024-04-10 14:32:05,665 INFO 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> om2@group-9F198C4C3682-SegmentedRaftLogWorker: Rolled log segment from 
> /metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_inprogress_204433475
>  to 
> /metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_204433475-204439100
> 2024-04-10 14:32:05,673 INFO 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> om2@group-9F198C4C3682-SegmentedRaftLogWorker: created new log segment 
> /metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_inprogress_204439101
> 2024-04-10 14:32:05,706 INFO 
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys 
> deleted: 20015, elapsed time: 62690ms
> 2024-04-10 14:32:08,939 INFO SecurityLogger.org.apache.hadoop.ipc.Server: 
> Auth successful for fdt.dfs@***.COM (auth:KERBEROS)
> 2024-04-10 14:32:08,946 INFO 
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
>  Authorization successful for fdt.dfs@***.COM (auth:KERBEROS) for 
> protocol=interface org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol
> 2024-04-10 14:32:11,483 INFO 
> org.apache.hadoop.ozone.om.request.key.OMKeyRequest: Detect allocated but 
> uncommitted blocks [{blockID={containerID=164049, 
> localID=111677748060155900}, length=268435456, offset=0, token=null, 
> pipeline=null, createVersion=0, partNumber=0}] in key 
> c61d3394d32145e4a9f23c6a3046cfc4.docx.
> ....
> 2024-04-10 14:33:08,236 INFO 
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys 
> deleted: 20019, elapsed time: 62530ms
> {code}
> After a further investigation, I believe the key which caused failure 
> "java.lang.IllegalStateException: Volume and/or Bucket Name missing from Key 
> Name." is "/s3v/gb-data/", the following info from SCM audit logs, 
> {code:java}
>  BlockGroup[groupID='/s3v/gb-data/', blockIDs=[conID: 80813 locID: 
> 111677748040965241 bcsId: 0]]
> {code}
> It's a bucket. I'm not sure why a bucket is in the OM deletedTable, as I 
> tested with "ozone fs -rm -r -f -skipTrash" with FSO, LEGACY and OBS, non of 
> them will result in a bucket in deletedTable. So I guess it could be caused 
> by history bugs in the past. 
> Anyway, this "java.lang.IllegalStateException" will abort the current round 
> of KeyDeletingTask#call, so the PurgeKeysRequest will not be sent to RATIS to 
> delete the keys in deletedTable, so keys will be processed again in next 
> KeyDeletingService iterator, and those blocks of those keys will be sent to 
> SCM again and again.  In a SCM log file, 256MB, holding all logs in 15s, 
> there is 27 repeated log for one random picked block ID, similar to this. 
> {code:java}
> 2024-03-12 18:28:55,727 WARN 
> org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: Container: #34680 was 
> deleted for the transaction: txID: 481479568
> containerID: 34680
> localID: 111677748037003018
> localID: 111677748037003295
> localID: 111677748037003639
> localID: 111677748037003811
> localID: 111677748037003880
> localID: 111677748037004088
> localID: 111677748037004282
> localID: 111677748037004521
> localID: 111677748037004658
> localID: 111677748037004818
> localID: 111677748037005071
> localID: 111677748037005223
> localID: 111677748037005365
> localID: 111677748037005641
> localID: 111677748037005752
> localID: 111677748037005934
> localID: 111677748037002486
> localID: 111677748037002740
> count: 0
> {code}
> The consequence is duplicates key block delete requests send to SCM again and 
> again. And SCM is busy to create new delete transactions for it, and process 
> those transactions.  The user's leader SCM terminated itself frequently 
> during this period, caused by long GC exceeds 60s. 
> One more issue observed in this case is it took almost 60s to finish one 
> round of key delete. Currently the key delete task runs every 
> "ozone.block.deleting.service.interval" 60s. Currently it's not clear that 
> this 60s, whether because SCM is slow or OM ratis is slow. Add more logs to 
> tell that. 
> {code:java}
> sammi@SAMMICHEN-MB0 master02 % grep "Number of keys deleted" ozone-om.log 
> 2024-03-31 16:39:35,197 INFO 
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys 
> deleted: 192, elapsed time: 200286ms
> 2024-03-31 16:39:36,256 INFO 
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys 
> deleted: 192, elapsed time: 1058ms
> 2024-03-31 22:22:54,103 INFO 
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys 
> deleted: 20000, elapsed time: 61224ms
> 2024-03-31 22:23:57,816 INFO 
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys 
> deleted: 20000, elapsed time: 63712ms
> 2024-04-01 22:03:32,881 INFO 
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys 
> deleted: 20000, elapsed time: 64292ms
> 2024-04-01 22:04:36,903 INFO 
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys 
> deleted: 20000, elapsed time: 64021ms
> 2024-04-01 22:52:54,213 INFO 
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys 
> deleted: 20000, elapsed time: 63099ms
> 2024-04-01 22:53:56,579 INFO 
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys 
> deleted: 20000, elapsed time: 62366ms
> 2024-04-04 10:35:48,089 INFO 
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys 
> deleted: 20007, elapsed time: 59047ms
> 2024-04-08 11:50:33,453 INFO 
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys 
> deleted: 20023, elapsed time: 61243ms
> 2024-04-08 11:51:38,612 INFO 
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys 
> deleted: 20080, elapsed time: 65159ms
> 2024-04-10 14:32:05,706 INFO 
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys 
> deleted: 20015, elapsed time: 62690ms
> 2024-04-10 14:33:08,236 INFO 
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys 
> deleted: 20019, elapsed time: 62530ms
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to