[ 
https://issues.apache.org/jira/browse/HDDS-10680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sammi Chen updated HDDS-10680:
------------------------------
    Description: 
In one user's SCM audit log file, there are two DELETE_KEY_BLOCK records with 
almost the same blocks info, here are the top a few lines of each,

 
{code:java}
024-04-11 08:42:37,811 | INFO  | SCMAudit | 
user=om/master02-cdpt5.***.com@***.COM | ip=10.59.2.2 | op=DELETE_KEY_BLOCK 
{KeyBlockToDelete=[BlockGroup[groupID='/icbigdata/wyg/1745367053961547777/2024-04-10/1777992236669792256',
 blockIDs=[conID: 164130 locID: 111677748060195553 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/03677746b84d4e14aa8c2a9cafd30227.xlsx',
 blockIDs=[conID: 164058 locID: 111677748060149956 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
 blockIDs=[conID: 163882 locID: 111677748060153718 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
 blockIDs=[conID: 164074 locID: 111677748060208039 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
 blockIDs=[conID: 163515 locID: 111677748059929337 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
 blockIDs=[conID: 164004 locID: 111677748060214180 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
 blockIDs=[conID: 162904 locID: 111677748059401411 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
 blockIDs=[conID: 164071 locID: 111677748060162845 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
 blockIDs=[conID: 164059 locID: 111677748060162941 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
 blockIDs=[conID: 164072 locID: 111677748060163107 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx'/,
 blockIDs=[conID: 164074 locID: 111677748060163112 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
 blockIDs=[conID: 163976 locID: 111677748060163154 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/2b308026f4a84f8ebbf2c669557532ab.xlsx',
 blockIDs=[conID: 163983 locID: 111677748060075978 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
 blockIDs=[conID: 163995 locID: 111677748060152877 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
 blockIDs=[conID: 163907 locID: 111677748060159708 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/33b7b03beade4ba28a565385987294ad.xlsx',
 blockIDs=[conID: 163887 locID: 111677748060148812 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/34131229d96d451788d012a23245d9fe.doc',
 blockIDs=[conID: 163923 locID: 111677748060046291 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/3cffc12d5a8c451fb46337911609bbba.xlsx',
 blockIDs=[conID: 164102 locID: 111677748060156005 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/3cffc12d5a8c451fb46337911609bbba.xlsx',
 blockIDs=[conID: 164091 locID: 111677748060156058 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/


2024-04-11 08:43:35,862 | INFO  | SCMAudit | 
user=om/master02-cdpt5.***.com@***.COM | ip=10.59.2.2 | op=DELETE_KEY_BLOCK 
{KeyBlockToDelete=[BlockGroup[groupID='/icbigdata/wyg/1745367053961547777/2024-04-10/1777992236669792256',
 blockIDs=[conID: 164130 locID: 111677748060195553 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/03677746b84d4e14aa8c2a9cafd30227.xlsx',
 blockIDs=[conID: 164058 locID: 111677748060149956 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
 blockIDs=[conID: 163882 locID: 111677748060153718 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
 blockIDs=[conID: 164074 locID: 111677748060208039 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
 blockIDs=[conID: 163515 locID: 111677748059929337 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
 blockIDs=[conID: 164004 locID: 111677748060214180 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
 blockIDs=[conID: 162904 locID: 111677748059401411 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
 blockIDs=[conID: 164071 locID: 111677748060162845 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
 blockIDs=[conID: 164059 locID: 111677748060162941 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
 blockIDs=[conID: 164072 locID: 111677748060163107 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
 blockIDs=[conID: 164074 locID: 111677748060163112 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
 blockIDs=[conID: 163976 locID: 111677748060163154 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/2b308026f4a84f8ebbf2c669557532ab.xlsx',
 blockIDs=[conID: 163983 locID: 111677748060075978 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
 blockIDs=[conID: 163995 locID: 111677748060152877 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
 blockIDs=[conID: 163907 locID: 111677748060159708 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/ {code}

Here is the OM logs(not the same time window because no OM logs available for 
that period,  but the logic is the same)
{code:java}
2024-04-10 14:31:03,015 WARN org.apache.hadoop.hdds.utils.BackgroundService: 
Background task execution failed
java.lang.IllegalStateException: Volume and/or Bucket Name missing from Key 
Name.
        at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:45)
        at 
org.apache.hadoop.ozone.om.service.KeyDeletingService.addToMap(KeyDeletingService.java:316)
        at 
org.apache.hadoop.ozone.om.service.KeyDeletingService.access$8(KeyDeletingService.java:312)
        at 
org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.submitPurgeKeysRequest(KeyDeletingService.java:251)
        at 
org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:180)
        at 
org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
        at 
java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2024-04-10 14:31:50,170 WARN org.apache.hadoop.ipc.Server: Large response size 
1051235 for call Call#118019916 Retry#0 
org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 
10.59.2.1:37063
2024-04-10 14:31:50,272 WARN org.apache.hadoop.ipc.Server: Large response size 
1051989 for call Call#118019932 Retry#0 
org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 
10.59.2.1:37063
2024-04-10 14:32:05,664 INFO 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
om2@group-9F198C4C3682-SegmentedRaftLogWorker: Rolling segment 
log-204433475_204439100 to index:204439100
2024-04-10 14:32:05,665 INFO 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
om2@group-9F198C4C3682-SegmentedRaftLogWorker: Rolled log segment from 
/metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_inprogress_204433475
 to 
/metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_204433475-204439100
2024-04-10 14:32:05,673 INFO 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
om2@group-9F198C4C3682-SegmentedRaftLogWorker: created new log segment 
/metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_inprogress_204439101
2024-04-10 14:32:05,706 INFO 
org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 
20015, elapsed time: 62690ms
2024-04-10 14:32:08,939 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for fdt.dfs@***.COM (auth:KERBEROS)
2024-04-10 14:32:08,946 INFO 
SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
 Authorization successful for fdt.dfs@***.COM (auth:KERBEROS) for 
protocol=interface org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol
2024-04-10 14:32:11,483 INFO 
org.apache.hadoop.ozone.om.request.key.OMKeyRequest: Detect allocated but 
uncommitted blocks [{blockID={containerID=164049, localID=111677748060155900}, 
length=268435456, offset=0, token=null, pipeline=null, createVersion=0, 
partNumber=0}] in key c61d3394d32145e4a9f23c6a3046cfc4.docx.
....
2024-04-10 14:33:08,236 INFO 
org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 
20019, elapsed time: 62530ms
{code}

After a further investigation, I believe the key which caused failure 
"java.lang.IllegalStateException: Volume and/or Bucket Name missing from Key 
Name." is "/s3v/gb-data/", the following info from SCM audit logs, 
{code:java}
 BlockGroup[groupID='/s3v/gb-data/', blockIDs=[conID: 80813 locID: 
111677748040965241 bcsId: 0]]
{code}
It's a bucket. I'm not sure why a bucket is in the OM deletedTable, as I tested 
with "ozone fs -rm -r -f -skipTrash" with FSO, LEGACY and OBS, non of them will 
result in a bucket in deletedTable. So I guess it could be caused by history 
bugs in the past. 
Anyway, this "java.lang.IllegalStateException" will abort the current round of 
KeyDeletingTask#call, so the PurgeKeysRequest will not be sent to RATIS to 
delete the keys in deletedTable, so keys will be processed again in next 
KeyDeletingService iterator, and those blocks of those keys will be sent to SCM 
again and again.  In a SCM log file, 256MB, holding all logs in 15s, there is 
27 repeated log for one random picked block ID, similar to this. 

{code:java}
2024-03-12 18:28:55,727 WARN 
org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: Container: #34680 was 
deleted for the transaction: txID: 481479568
containerID: 34680
localID: 111677748037003018
localID: 111677748037003295
localID: 111677748037003639
localID: 111677748037003811
localID: 111677748037003880
localID: 111677748037004088
localID: 111677748037004282
localID: 111677748037004521
localID: 111677748037004658
localID: 111677748037004818
localID: 111677748037005071
localID: 111677748037005223
localID: 111677748037005365
localID: 111677748037005641
localID: 111677748037005752
localID: 111677748037005934
localID: 111677748037002486
localID: 111677748037002740
count: 0
{code}

The consequence is duplicates key block delete requests send to SCM again and 
again. And SCM is busy to create new delete transactions for it, and process 
those transactions.  The user's leader SCM terminated itself frequently during 
this period, caused by long GC exceeds 60s. 


  was:
In one user's SCM audit log file, there are two DELETE_KEY_BLOCK records with 
almost the same blocks info, here are the top a few lines of each,

 
{code:java}
024-04-11 08:42:37,811 | INFO  | SCMAudit | 
user=om/master02-cdpt5.***.com@***.COM | ip=10.59.2.2 | op=DELETE_KEY_BLOCK 
{KeyBlockToDelete=[BlockGroup[groupID='/icbigdata/wyg/1745367053961547777/2024-04-10/1777992236669792256',
 blockIDs=[conID: 164130 locID: 111677748060195553 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/03677746b84d4e14aa8c2a9cafd30227.xlsx',
 blockIDs=[conID: 164058 locID: 111677748060149956 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
 blockIDs=[conID: 163882 locID: 111677748060153718 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
 blockIDs=[conID: 164074 locID: 111677748060208039 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
 blockIDs=[conID: 163515 locID: 111677748059929337 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
 blockIDs=[conID: 164004 locID: 111677748060214180 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
 blockIDs=[conID: 162904 locID: 111677748059401411 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
 blockIDs=[conID: 164071 locID: 111677748060162845 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
 blockIDs=[conID: 164059 locID: 111677748060162941 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
 blockIDs=[conID: 164072 locID: 111677748060163107 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx'/,
 blockIDs=[conID: 164074 locID: 111677748060163112 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
 blockIDs=[conID: 163976 locID: 111677748060163154 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/2b308026f4a84f8ebbf2c669557532ab.xlsx',
 blockIDs=[conID: 163983 locID: 111677748060075978 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
 blockIDs=[conID: 163995 locID: 111677748060152877 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
 blockIDs=[conID: 163907 locID: 111677748060159708 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/33b7b03beade4ba28a565385987294ad.xlsx',
 blockIDs=[conID: 163887 locID: 111677748060148812 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/34131229d96d451788d012a23245d9fe.doc',
 blockIDs=[conID: 163923 locID: 111677748060046291 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/3cffc12d5a8c451fb46337911609bbba.xlsx',
 blockIDs=[conID: 164102 locID: 111677748060156005 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/3cffc12d5a8c451fb46337911609bbba.xlsx',
 blockIDs=[conID: 164091 locID: 111677748060156058 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/


2024-04-11 08:43:35,862 | INFO  | SCMAudit | 
user=om/master02-cdpt5.***.com@***.COM | ip=10.59.2.2 | op=DELETE_KEY_BLOCK 
{KeyBlockToDelete=[BlockGroup[groupID='/icbigdata/wyg/1745367053961547777/2024-04-10/1777992236669792256',
 blockIDs=[conID: 164130 locID: 111677748060195553 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/03677746b84d4e14aa8c2a9cafd30227.xlsx',
 blockIDs=[conID: 164058 locID: 111677748060149956 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
 blockIDs=[conID: 163882 locID: 111677748060153718 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
 blockIDs=[conID: 164074 locID: 111677748060208039 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
 blockIDs=[conID: 163515 locID: 111677748059929337 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
 blockIDs=[conID: 164004 locID: 111677748060214180 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
 blockIDs=[conID: 162904 locID: 111677748059401411 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
 blockIDs=[conID: 164071 locID: 111677748060162845 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
 blockIDs=[conID: 164059 locID: 111677748060162941 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
 blockIDs=[conID: 164072 locID: 111677748060163107 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
 blockIDs=[conID: 164074 locID: 111677748060163112 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
 blockIDs=[conID: 163976 locID: 111677748060163154 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/2b308026f4a84f8ebbf2c669557532ab.xlsx',
 blockIDs=[conID: 163983 locID: 111677748060075978 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
 blockIDs=[conID: 163995 locID: 111677748060152877 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
 blockIDs=[conID: 163907 locID: 111677748060159708 bcsId: 0]], 
BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/ {code}

Here is the OM logs(not the same time window because no OM logs available for 
that period,  but the logic is the same)
{code:java}
2024-04-10 14:31:03,015 WARN org.apache.hadoop.hdds.utils.BackgroundService: 
Background task execution failed
java.lang.IllegalStateException: Volume and/or Bucket Name missing from Key 
Name.
        at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:45)
        at 
org.apache.hadoop.ozone.om.service.KeyDeletingService.addToMap(KeyDeletingService.java:316)
        at 
org.apache.hadoop.ozone.om.service.KeyDeletingService.access$8(KeyDeletingService.java:312)
        at 
org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.submitPurgeKeysRequest(KeyDeletingService.java:251)
        at 
org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:180)
        at 
org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
        at 
java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2024-04-10 14:31:50,170 WARN org.apache.hadoop.ipc.Server: Large response size 
1051235 for call Call#118019916 Retry#0 
org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 
10.59.2.1:37063
2024-04-10 14:31:50,272 WARN org.apache.hadoop.ipc.Server: Large response size 
1051989 for call Call#118019932 Retry#0 
org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 
10.59.2.1:37063
2024-04-10 14:32:05,664 INFO 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
om2@group-9F198C4C3682-SegmentedRaftLogWorker: Rolling segment 
log-204433475_204439100 to index:204439100
2024-04-10 14:32:05,665 INFO 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
om2@group-9F198C4C3682-SegmentedRaftLogWorker: Rolled log segment from 
/metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_inprogress_204433475
 to 
/metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_204433475-204439100
2024-04-10 14:32:05,673 INFO 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
om2@group-9F198C4C3682-SegmentedRaftLogWorker: created new log segment 
/metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_inprogress_204439101
2024-04-10 14:32:05,706 INFO 
org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 
20015, elapsed time: 62690ms
2024-04-10 14:32:08,939 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for fdt.dfs@***.COM (auth:KERBEROS)
2024-04-10 14:32:08,946 INFO 
SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
 Authorization successful for fdt.dfs@***.COM (auth:KERBEROS) for 
protocol=interface org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol
2024-04-10 14:32:11,483 INFO 
org.apache.hadoop.ozone.om.request.key.OMKeyRequest: Detect allocated but 
uncommitted blocks [{blockID={containerID=164049, localID=111677748060155900}, 
length=268435456, offset=0, token=null, pipeline=null, createVersion=0, 
partNumber=0}] in key c61d3394d32145e4a9f23c6a3046cfc4.docx.
....
2024-04-10 14:33:08,236 INFO 
org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 
20019, elapsed time: 62530ms
{code}

After a further investigation, I believe the key which caused failure 
"java.lang.IllegalStateException: Volume and/or Bucket Name missing from Key 
Name." is "/s3v/gb-data/", the following info from SCM audit logs, 
{code:java}
 BlockGroup[groupID='/s3v/gb-data/', blockIDs=[conID: 80813 locID: 
111677748040965241 bcsId: 0]]
{code}
It's a bucket. I'm not sure why a bucket is in the OM deletedTable, as I tested 
with "ozone fs -rm -r -f -skipTrash" with FSO, LEGACY and OBS, non of them will 
result in a bucket in deletedTable. So I guess it could be caused by history 
bugs in the past. 
Anyway, this "java.lang.IllegalStateException" will abort the current round of 
KeyDeletingTask#call, so the PurgeKeysRequest will not be sent to RATIS to 
delete the keys in deletedTable, so keys will be processed again in next 
KeyDeletingService iterator, and those blocks of those keys will be sent to SCM 
again and again.  In a SCM log file, 256MB, holding all logs in 15s, there is 
27 repeated log for one block ID, similar to this. 

{code:java}
2024-03-12 18:28:55,727 WARN 
org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: Container: #34680 was 
deleted for the transaction: txID: 481479568
containerID: 34680
localID: 111677748037003018
localID: 111677748037003295
localID: 111677748037003639
localID: 111677748037003811
localID: 111677748037003880
localID: 111677748037004088
localID: 111677748037004282
localID: 111677748037004521
localID: 111677748037004658
localID: 111677748037004818
localID: 111677748037005071
localID: 111677748037005223
localID: 111677748037005365
localID: 111677748037005641
localID: 111677748037005752
localID: 111677748037005934
localID: 111677748037002486
localID: 111677748037002740
count: 0
{code}

The consequence is duplicates key block delete requests send to SCM again and 
again. And SCM is busy to create new delete transactions for it, and process 
those transactions.  The user's leader SCM terminated itself frequently during 
this period, caused by long GC exceeds 60s. 



> Duplicate delete key blocks sent to SCM
> ---------------------------------------
>
>                 Key: HDDS-10680
>                 URL: https://issues.apache.org/jira/browse/HDDS-10680
>             Project: Apache Ozone
>          Issue Type: Bug
>    Affects Versions: 1.4.0
>            Reporter: Sammi Chen
>            Assignee: Sammi Chen
>            Priority: Critical
>
> In one user's SCM audit log file, there are two DELETE_KEY_BLOCK records with 
> almost the same blocks info, here are the top a few lines of each,
>  
> {code:java}
> 024-04-11 08:42:37,811 | INFO  | SCMAudit | 
> user=om/master02-cdpt5.***.com@***.COM | ip=10.59.2.2 | op=DELETE_KEY_BLOCK 
> {KeyBlockToDelete=[BlockGroup[groupID='/icbigdata/wyg/1745367053961547777/2024-04-10/1777992236669792256',
>  blockIDs=[conID: 164130 locID: 111677748060195553 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/03677746b84d4e14aa8c2a9cafd30227.xlsx',
>  blockIDs=[conID: 164058 locID: 111677748060149956 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
>  blockIDs=[conID: 163882 locID: 111677748060153718 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
>  blockIDs=[conID: 164074 locID: 111677748060208039 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
>  blockIDs=[conID: 163515 locID: 111677748059929337 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
>  blockIDs=[conID: 164004 locID: 111677748060214180 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
>  blockIDs=[conID: 162904 locID: 111677748059401411 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
>  blockIDs=[conID: 164071 locID: 111677748060162845 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
>  blockIDs=[conID: 164059 locID: 111677748060162941 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
>  blockIDs=[conID: 164072 locID: 111677748060163107 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx'/,
>  blockIDs=[conID: 164074 locID: 111677748060163112 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
>  blockIDs=[conID: 163976 locID: 111677748060163154 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/2b308026f4a84f8ebbf2c669557532ab.xlsx',
>  blockIDs=[conID: 163983 locID: 111677748060075978 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
>  blockIDs=[conID: 163995 locID: 111677748060152877 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
>  blockIDs=[conID: 163907 locID: 111677748060159708 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/33b7b03beade4ba28a565385987294ad.xlsx',
>  blockIDs=[conID: 163887 locID: 111677748060148812 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/34131229d96d451788d012a23245d9fe.doc',
>  blockIDs=[conID: 163923 locID: 111677748060046291 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/3cffc12d5a8c451fb46337911609bbba.xlsx',
>  blockIDs=[conID: 164102 locID: 111677748060156005 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/3cffc12d5a8c451fb46337911609bbba.xlsx',
>  blockIDs=[conID: 164091 locID: 111677748060156058 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/
> 2024-04-11 08:43:35,862 | INFO  | SCMAudit | 
> user=om/master02-cdpt5.***.com@***.COM | ip=10.59.2.2 | op=DELETE_KEY_BLOCK 
> {KeyBlockToDelete=[BlockGroup[groupID='/icbigdata/wyg/1745367053961547777/2024-04-10/1777992236669792256',
>  blockIDs=[conID: 164130 locID: 111677748060195553 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/03677746b84d4e14aa8c2a9cafd30227.xlsx',
>  blockIDs=[conID: 164058 locID: 111677748060149956 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
>  blockIDs=[conID: 163882 locID: 111677748060153718 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
>  blockIDs=[conID: 164074 locID: 111677748060208039 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
>  blockIDs=[conID: 163515 locID: 111677748059929337 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
>  blockIDs=[conID: 164004 locID: 111677748060214180 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
>  blockIDs=[conID: 162904 locID: 111677748059401411 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
>  blockIDs=[conID: 164071 locID: 111677748060162845 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
>  blockIDs=[conID: 164059 locID: 111677748060162941 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
>  blockIDs=[conID: 164072 locID: 111677748060163107 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
>  blockIDs=[conID: 164074 locID: 111677748060163112 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
>  blockIDs=[conID: 163976 locID: 111677748060163154 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/2b308026f4a84f8ebbf2c669557532ab.xlsx',
>  blockIDs=[conID: 163983 locID: 111677748060075978 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
>  blockIDs=[conID: 163995 locID: 111677748060152877 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
>  blockIDs=[conID: 163907 locID: 111677748060159708 bcsId: 0]], 
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/ {code}
> Here is the OM logs(not the same time window because no OM logs available for 
> that period,  but the logic is the same)
> {code:java}
> 2024-04-10 14:31:03,015 WARN org.apache.hadoop.hdds.utils.BackgroundService: 
> Background task execution failed
> java.lang.IllegalStateException: Volume and/or Bucket Name missing from Key 
> Name.
>         at 
> org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:45)
>         at 
> org.apache.hadoop.ozone.om.service.KeyDeletingService.addToMap(KeyDeletingService.java:316)
>         at 
> org.apache.hadoop.ozone.om.service.KeyDeletingService.access$8(KeyDeletingService.java:312)
>         at 
> org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.submitPurgeKeysRequest(KeyDeletingService.java:251)
>         at 
> org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:180)
>         at 
> org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
>         at 
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 2024-04-10 14:31:50,170 WARN org.apache.hadoop.ipc.Server: Large response 
> size 1051235 for call Call#118019916 Retry#0 
> org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 
> 10.59.2.1:37063
> 2024-04-10 14:31:50,272 WARN org.apache.hadoop.ipc.Server: Large response 
> size 1051989 for call Call#118019932 Retry#0 
> org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 
> 10.59.2.1:37063
> 2024-04-10 14:32:05,664 INFO 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> om2@group-9F198C4C3682-SegmentedRaftLogWorker: Rolling segment 
> log-204433475_204439100 to index:204439100
> 2024-04-10 14:32:05,665 INFO 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> om2@group-9F198C4C3682-SegmentedRaftLogWorker: Rolled log segment from 
> /metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_inprogress_204433475
>  to 
> /metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_204433475-204439100
> 2024-04-10 14:32:05,673 INFO 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> om2@group-9F198C4C3682-SegmentedRaftLogWorker: created new log segment 
> /metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_inprogress_204439101
> 2024-04-10 14:32:05,706 INFO 
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys 
> deleted: 20015, elapsed time: 62690ms
> 2024-04-10 14:32:08,939 INFO SecurityLogger.org.apache.hadoop.ipc.Server: 
> Auth successful for fdt.dfs@***.COM (auth:KERBEROS)
> 2024-04-10 14:32:08,946 INFO 
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
>  Authorization successful for fdt.dfs@***.COM (auth:KERBEROS) for 
> protocol=interface org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol
> 2024-04-10 14:32:11,483 INFO 
> org.apache.hadoop.ozone.om.request.key.OMKeyRequest: Detect allocated but 
> uncommitted blocks [{blockID={containerID=164049, 
> localID=111677748060155900}, length=268435456, offset=0, token=null, 
> pipeline=null, createVersion=0, partNumber=0}] in key 
> c61d3394d32145e4a9f23c6a3046cfc4.docx.
> ....
> 2024-04-10 14:33:08,236 INFO 
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys 
> deleted: 20019, elapsed time: 62530ms
> {code}
> After a further investigation, I believe the key which caused failure 
> "java.lang.IllegalStateException: Volume and/or Bucket Name missing from Key 
> Name." is "/s3v/gb-data/", the following info from SCM audit logs, 
> {code:java}
>  BlockGroup[groupID='/s3v/gb-data/', blockIDs=[conID: 80813 locID: 
> 111677748040965241 bcsId: 0]]
> {code}
> It's a bucket. I'm not sure why a bucket is in the OM deletedTable, as I 
> tested with "ozone fs -rm -r -f -skipTrash" with FSO, LEGACY and OBS, non of 
> them will result in a bucket in deletedTable. So I guess it could be caused 
> by history bugs in the past. 
> Anyway, this "java.lang.IllegalStateException" will abort the current round 
> of KeyDeletingTask#call, so the PurgeKeysRequest will not be sent to RATIS to 
> delete the keys in deletedTable, so keys will be processed again in next 
> KeyDeletingService iterator, and those blocks of those keys will be sent to 
> SCM again and again.  In a SCM log file, 256MB, holding all logs in 15s, 
> there is 27 repeated log for one random picked block ID, similar to this. 
> {code:java}
> 2024-03-12 18:28:55,727 WARN 
> org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: Container: #34680 was 
> deleted for the transaction: txID: 481479568
> containerID: 34680
> localID: 111677748037003018
> localID: 111677748037003295
> localID: 111677748037003639
> localID: 111677748037003811
> localID: 111677748037003880
> localID: 111677748037004088
> localID: 111677748037004282
> localID: 111677748037004521
> localID: 111677748037004658
> localID: 111677748037004818
> localID: 111677748037005071
> localID: 111677748037005223
> localID: 111677748037005365
> localID: 111677748037005641
> localID: 111677748037005752
> localID: 111677748037005934
> localID: 111677748037002486
> localID: 111677748037002740
> count: 0
> {code}
> The consequence is duplicates key block delete requests send to SCM again and 
> again. And SCM is busy to create new delete transactions for it, and process 
> those transactions.  The user's leader SCM terminated itself frequently 
> during this period, caused by long GC exceeds 60s. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


Reply via email to