[
https://issues.apache.org/jira/browse/HDDS-10680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HDDS-10680:
----------------------------------
Labels: pull-request-available (was: )
> Duplicate delete key blocks sent to SCM
> ---------------------------------------
>
> Key: HDDS-10680
> URL: https://issues.apache.org/jira/browse/HDDS-10680
> Project: Apache Ozone
> Issue Type: Bug
> Affects Versions: 1.4.0
> Reporter: Sammi Chen
> Assignee: Sammi Chen
> Priority: Critical
> Labels: pull-request-available
>
> In one user's SCM audit log file, there are two DELETE_KEY_BLOCK records with
> almost the same blocks info, here are the top a few lines of each,
>
> {code:java}
> 024-04-11 08:42:37,811 | INFO | SCMAudit |
> user=om/master02-cdpt5.***.com@***.COM | ip=10.59.2.2 | op=DELETE_KEY_BLOCK
> {KeyBlockToDelete=[BlockGroup[groupID='/icbigdata/wyg/1745367053961547777/2024-04-10/1777992236669792256',
> blockIDs=[conID: 164130 locID: 111677748060195553 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/03677746b84d4e14aa8c2a9cafd30227.xlsx',
> blockIDs=[conID: 164058 locID: 111677748060149956 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
> blockIDs=[conID: 163882 locID: 111677748060153718 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
> blockIDs=[conID: 164074 locID: 111677748060208039 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
> blockIDs=[conID: 163515 locID: 111677748059929337 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
> blockIDs=[conID: 164004 locID: 111677748060214180 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
> blockIDs=[conID: 162904 locID: 111677748059401411 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
> blockIDs=[conID: 164071 locID: 111677748060162845 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
> blockIDs=[conID: 164059 locID: 111677748060162941 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
> blockIDs=[conID: 164072 locID: 111677748060163107 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx'/,
> blockIDs=[conID: 164074 locID: 111677748060163112 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
> blockIDs=[conID: 163976 locID: 111677748060163154 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/2b308026f4a84f8ebbf2c669557532ab.xlsx',
> blockIDs=[conID: 163983 locID: 111677748060075978 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
> blockIDs=[conID: 163995 locID: 111677748060152877 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
> blockIDs=[conID: 163907 locID: 111677748060159708 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/33b7b03beade4ba28a565385987294ad.xlsx',
> blockIDs=[conID: 163887 locID: 111677748060148812 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/34131229d96d451788d012a23245d9fe.doc',
> blockIDs=[conID: 163923 locID: 111677748060046291 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/3cffc12d5a8c451fb46337911609bbba.xlsx',
> blockIDs=[conID: 164102 locID: 111677748060156005 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/3cffc12d5a8c451fb46337911609bbba.xlsx',
> blockIDs=[conID: 164091 locID: 111677748060156058 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/
> 2024-04-11 08:43:35,862 | INFO | SCMAudit |
> user=om/master02-cdpt5.***.com@***.COM | ip=10.59.2.2 | op=DELETE_KEY_BLOCK
> {KeyBlockToDelete=[BlockGroup[groupID='/icbigdata/wyg/1745367053961547777/2024-04-10/1777992236669792256',
> blockIDs=[conID: 164130 locID: 111677748060195553 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/03677746b84d4e14aa8c2a9cafd30227.xlsx',
> blockIDs=[conID: 164058 locID: 111677748060149956 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
> blockIDs=[conID: 163882 locID: 111677748060153718 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
> blockIDs=[conID: 164074 locID: 111677748060208039 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
> blockIDs=[conID: 163515 locID: 111677748059929337 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
> blockIDs=[conID: 164004 locID: 111677748060214180 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
> blockIDs=[conID: 162904 locID: 111677748059401411 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
> blockIDs=[conID: 164071 locID: 111677748060162845 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
> blockIDs=[conID: 164059 locID: 111677748060162941 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
> blockIDs=[conID: 164072 locID: 111677748060163107 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
> blockIDs=[conID: 164074 locID: 111677748060163112 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
> blockIDs=[conID: 163976 locID: 111677748060163154 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/2b308026f4a84f8ebbf2c669557532ab.xlsx',
> blockIDs=[conID: 163983 locID: 111677748060075978 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
> blockIDs=[conID: 163995 locID: 111677748060152877 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
> blockIDs=[conID: 163907 locID: 111677748060159708 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/ {code}
> Here is the OM logs(not the same time window because no OM logs available for
> that period, but the logic is the same)
> {code:java}
> 2024-04-10 14:31:03,015 WARN org.apache.hadoop.hdds.utils.BackgroundService:
> Background task execution failed
> java.lang.IllegalStateException: Volume and/or Bucket Name missing from Key
> Name.
> at
> org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:45)
> at
> org.apache.hadoop.ozone.om.service.KeyDeletingService.addToMap(KeyDeletingService.java:316)
> at
> org.apache.hadoop.ozone.om.service.KeyDeletingService.access$8(KeyDeletingService.java:312)
> at
> org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.submitPurgeKeysRequest(KeyDeletingService.java:251)
> at
> org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:180)
> at
> org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
> at
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2024-04-10 14:31:50,170 WARN org.apache.hadoop.ipc.Server: Large response
> size 1051235 for call Call#118019916 Retry#0
> org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from
> 10.59.2.1:37063
> 2024-04-10 14:31:50,272 WARN org.apache.hadoop.ipc.Server: Large response
> size 1051989 for call Call#118019932 Retry#0
> org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from
> 10.59.2.1:37063
> 2024-04-10 14:32:05,664 INFO
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
> om2@group-9F198C4C3682-SegmentedRaftLogWorker: Rolling segment
> log-204433475_204439100 to index:204439100
> 2024-04-10 14:32:05,665 INFO
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
> om2@group-9F198C4C3682-SegmentedRaftLogWorker: Rolled log segment from
> /metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_inprogress_204433475
> to
> /metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_204433475-204439100
> 2024-04-10 14:32:05,673 INFO
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
> om2@group-9F198C4C3682-SegmentedRaftLogWorker: created new log segment
> /metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_inprogress_204439101
> 2024-04-10 14:32:05,706 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20015, elapsed time: 62690ms
> 2024-04-10 14:32:08,939 INFO SecurityLogger.org.apache.hadoop.ipc.Server:
> Auth successful for fdt.dfs@***.COM (auth:KERBEROS)
> 2024-04-10 14:32:08,946 INFO
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
> Authorization successful for fdt.dfs@***.COM (auth:KERBEROS) for
> protocol=interface org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol
> 2024-04-10 14:32:11,483 INFO
> org.apache.hadoop.ozone.om.request.key.OMKeyRequest: Detect allocated but
> uncommitted blocks [{blockID={containerID=164049,
> localID=111677748060155900}, length=268435456, offset=0, token=null,
> pipeline=null, createVersion=0, partNumber=0}] in key
> c61d3394d32145e4a9f23c6a3046cfc4.docx.
> ....
> 2024-04-10 14:33:08,236 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20019, elapsed time: 62530ms
> {code}
> After a further investigation, I believe the key which caused failure
> "java.lang.IllegalStateException: Volume and/or Bucket Name missing from Key
> Name." is "/s3v/gb-data/", the following info from SCM audit logs,
> {code:java}
> BlockGroup[groupID='/s3v/gb-data/', blockIDs=[conID: 80813 locID:
> 111677748040965241 bcsId: 0]]
> {code}
> It's a bucket. I'm not sure why a bucket is in the OM deletedTable, as I
> tested with "ozone fs -rm -r -f -skipTrash" with FSO, LEGACY and OBS, non of
> them will result in a bucket in deletedTable. So I guess it could be caused
> by history bugs in the past.
> Anyway, this "java.lang.IllegalStateException" will abort the current round
> of KeyDeletingTask#call, so the PurgeKeysRequest will not be sent to RATIS to
> delete the keys in deletedTable, so keys will be processed again in next
> KeyDeletingService iterator, and those blocks of those keys will be sent to
> SCM again and again. In a SCM log file, 256MB, holding all logs in 15s,
> there is 27 repeated log for one random picked block ID, similar to this.
> {code:java}
> 2024-03-12 18:28:55,727 WARN
> org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: Container: #34680 was
> deleted for the transaction: txID: 481479568
> containerID: 34680
> localID: 111677748037003018
> localID: 111677748037003295
> localID: 111677748037003639
> localID: 111677748037003811
> localID: 111677748037003880
> localID: 111677748037004088
> localID: 111677748037004282
> localID: 111677748037004521
> localID: 111677748037004658
> localID: 111677748037004818
> localID: 111677748037005071
> localID: 111677748037005223
> localID: 111677748037005365
> localID: 111677748037005641
> localID: 111677748037005752
> localID: 111677748037005934
> localID: 111677748037002486
> localID: 111677748037002740
> count: 0
> {code}
> The consequence is duplicates key block delete requests send to SCM again and
> again. And SCM is busy to create new delete transactions for it, and process
> those transactions. The user's leader SCM terminated itself frequently
> during this period, caused by long GC exceeds 60s.
> One more issue observed in this case is it took almost 60s to finish one
> round of key delete. Currently the key delete task runs every
> "ozone.block.deleting.service.interval" 60s. Currently it's not clear that
> this 60s, whether because SCM is slow or OM ratis is slow. Add more logs to
> tell that.
> {code:java}
> sammi@SAMMICHEN-MB0 master02 % grep "Number of keys deleted" ozone-om.log
> 2024-03-31 16:39:35,197 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 192, elapsed time: 200286ms
> 2024-03-31 16:39:36,256 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 192, elapsed time: 1058ms
> 2024-03-31 22:22:54,103 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20000, elapsed time: 61224ms
> 2024-03-31 22:23:57,816 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20000, elapsed time: 63712ms
> 2024-04-01 22:03:32,881 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20000, elapsed time: 64292ms
> 2024-04-01 22:04:36,903 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20000, elapsed time: 64021ms
> 2024-04-01 22:52:54,213 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20000, elapsed time: 63099ms
> 2024-04-01 22:53:56,579 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20000, elapsed time: 62366ms
> 2024-04-04 10:35:48,089 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20007, elapsed time: 59047ms
> 2024-04-08 11:50:33,453 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20023, elapsed time: 61243ms
> 2024-04-08 11:51:38,612 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20080, elapsed time: 65159ms
> 2024-04-10 14:32:05,706 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20015, elapsed time: 62690ms
> 2024-04-10 14:33:08,236 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20019, elapsed time: 62530ms
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]