[
https://issues.apache.org/jira/browse/HDDS-10680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17838049#comment-17838049
]
Sammi Chen edited comment on HDDS-10680 at 4/17/24 8:56 AM:
------------------------------------------------------------
There is no enough log files to find how this "/s3v/gb-data/" is in the
deletedTable. The user is using an older version Ozone, and upgrade to a new
Ozone version. This patch will handle the "/s3v/gb-data/" gracefully instead of
abort the task. Also It doesn't look like related with Ratis.
was (Author: sammi):
There is no enough log files to find how this "/s3v/gb-data/" is in the
deletedTable. The user is using an older version Ozone, and upgarde The patch
will handle this case gracefully instead of abort the task. And It doesn't
look like related with Ratis.
> Duplicate delete key blocks sent to SCM
> ---------------------------------------
>
> Key: HDDS-10680
> URL: https://issues.apache.org/jira/browse/HDDS-10680
> Project: Apache Ozone
> Issue Type: Bug
> Affects Versions: 1.4.0
> Reporter: Sammi Chen
> Assignee: Sammi Chen
> Priority: Critical
> Labels: pull-request-available
>
> In one user's SCM audit log file, there are two DELETE_KEY_BLOCK records with
> almost the same blocks info, here are the top a few lines of each,
>
> {code:java}
> 024-04-11 08:42:37,811 | INFO | SCMAudit |
> user=om/master02-cdpt5.***.com@***.COM | ip=10.59.2.2 | op=DELETE_KEY_BLOCK
> {KeyBlockToDelete=[BlockGroup[groupID='/icbigdata/wyg/1745367053961547777/2024-04-10/1777992236669792256',
> blockIDs=[conID: 164130 locID: 111677748060195553 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/03677746b84d4e14aa8c2a9cafd30227.xlsx',
> blockIDs=[conID: 164058 locID: 111677748060149956 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
> blockIDs=[conID: 163882 locID: 111677748060153718 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
> blockIDs=[conID: 164074 locID: 111677748060208039 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
> blockIDs=[conID: 163515 locID: 111677748059929337 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
> blockIDs=[conID: 164004 locID: 111677748060214180 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
> blockIDs=[conID: 162904 locID: 111677748059401411 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
> blockIDs=[conID: 164071 locID: 111677748060162845 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
> blockIDs=[conID: 164059 locID: 111677748060162941 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
> blockIDs=[conID: 164072 locID: 111677748060163107 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx'/,
> blockIDs=[conID: 164074 locID: 111677748060163112 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
> blockIDs=[conID: 163976 locID: 111677748060163154 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/2b308026f4a84f8ebbf2c669557532ab.xlsx',
> blockIDs=[conID: 163983 locID: 111677748060075978 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
> blockIDs=[conID: 163995 locID: 111677748060152877 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
> blockIDs=[conID: 163907 locID: 111677748060159708 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/33b7b03beade4ba28a565385987294ad.xlsx',
> blockIDs=[conID: 163887 locID: 111677748060148812 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/34131229d96d451788d012a23245d9fe.doc',
> blockIDs=[conID: 163923 locID: 111677748060046291 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/3cffc12d5a8c451fb46337911609bbba.xlsx',
> blockIDs=[conID: 164102 locID: 111677748060156005 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/3cffc12d5a8c451fb46337911609bbba.xlsx',
> blockIDs=[conID: 164091 locID: 111677748060156058 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/
> 2024-04-11 08:43:35,862 | INFO | SCMAudit |
> user=om/master02-cdpt5.***.com@***.COM | ip=10.59.2.2 | op=DELETE_KEY_BLOCK
> {KeyBlockToDelete=[BlockGroup[groupID='/icbigdata/wyg/1745367053961547777/2024-04-10/1777992236669792256',
> blockIDs=[conID: 164130 locID: 111677748060195553 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/03677746b84d4e14aa8c2a9cafd30227.xlsx',
> blockIDs=[conID: 164058 locID: 111677748060149956 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
> blockIDs=[conID: 163882 locID: 111677748060153718 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx',
> blockIDs=[conID: 164074 locID: 111677748060208039 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
> blockIDs=[conID: 163515 locID: 111677748059929337 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx',
> blockIDs=[conID: 164004 locID: 111677748060214180 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
> blockIDs=[conID: 162904 locID: 111677748059401411 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
> blockIDs=[conID: 164071 locID: 111677748060162845 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
> blockIDs=[conID: 164059 locID: 111677748060162941 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
> blockIDs=[conID: 164072 locID: 111677748060163107 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
> blockIDs=[conID: 164074 locID: 111677748060163112 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx',
> blockIDs=[conID: 163976 locID: 111677748060163154 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/2b308026f4a84f8ebbf2c669557532ab.xlsx',
> blockIDs=[conID: 163983 locID: 111677748060075978 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
> blockIDs=[conID: 163995 locID: 111677748060152877 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx',
> blockIDs=[conID: 163907 locID: 111677748060159708 bcsId: 0]],
> BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/ {code}
> Here is the OM logs(not the same time window because no OM logs available for
> that period, but the logic is the same)
> {code:java}
> 2024-04-10 14:31:03,015 WARN org.apache.hadoop.hdds.utils.BackgroundService:
> Background task execution failed
> java.lang.IllegalStateException: Volume and/or Bucket Name missing from Key
> Name.
> at
> org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:45)
> at
> org.apache.hadoop.ozone.om.service.KeyDeletingService.addToMap(KeyDeletingService.java:316)
> at
> org.apache.hadoop.ozone.om.service.KeyDeletingService.access$8(KeyDeletingService.java:312)
> at
> org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.submitPurgeKeysRequest(KeyDeletingService.java:251)
> at
> org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:180)
> at
> org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
> at
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2024-04-10 14:31:50,170 WARN org.apache.hadoop.ipc.Server: Large response
> size 1051235 for call Call#118019916 Retry#0
> org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from
> 10.59.2.1:37063
> 2024-04-10 14:31:50,272 WARN org.apache.hadoop.ipc.Server: Large response
> size 1051989 for call Call#118019932 Retry#0
> org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from
> 10.59.2.1:37063
> 2024-04-10 14:32:05,664 INFO
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
> om2@group-9F198C4C3682-SegmentedRaftLogWorker: Rolling segment
> log-204433475_204439100 to index:204439100
> 2024-04-10 14:32:05,665 INFO
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
> om2@group-9F198C4C3682-SegmentedRaftLogWorker: Rolled log segment from
> /metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_inprogress_204433475
> to
> /metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_204433475-204439100
> 2024-04-10 14:32:05,673 INFO
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
> om2@group-9F198C4C3682-SegmentedRaftLogWorker: created new log segment
> /metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_inprogress_204439101
> 2024-04-10 14:32:05,706 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20015, elapsed time: 62690ms
> 2024-04-10 14:32:08,939 INFO SecurityLogger.org.apache.hadoop.ipc.Server:
> Auth successful for fdt.dfs@***.COM (auth:KERBEROS)
> 2024-04-10 14:32:08,946 INFO
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
> Authorization successful for fdt.dfs@***.COM (auth:KERBEROS) for
> protocol=interface org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol
> 2024-04-10 14:32:11,483 INFO
> org.apache.hadoop.ozone.om.request.key.OMKeyRequest: Detect allocated but
> uncommitted blocks [{blockID={containerID=164049,
> localID=111677748060155900}, length=268435456, offset=0, token=null,
> pipeline=null, createVersion=0, partNumber=0}] in key
> c61d3394d32145e4a9f23c6a3046cfc4.docx.
> ....
> 2024-04-10 14:33:08,236 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20019, elapsed time: 62530ms
> {code}
> After a further investigation, I believe the key which caused failure
> "java.lang.IllegalStateException: Volume and/or Bucket Name missing from Key
> Name." is "/s3v/gb-data/", the following info from SCM audit logs,
> {code:java}
> BlockGroup[groupID='/s3v/gb-data/', blockIDs=[conID: 80813 locID:
> 111677748040965241 bcsId: 0]]
> {code}
> It's a bucket. I'm not sure why a bucket is in the OM deletedTable, as I
> tested with "ozone fs -rm -r -f -skipTrash" with FSO, LEGACY and OBS, non of
> them will result in a bucket in deletedTable. So I guess it could be caused
> by history bugs in the past.
> Anyway, this "java.lang.IllegalStateException" will abort the current round
> of KeyDeletingTask#call, so the PurgeKeysRequest will not be sent to RATIS to
> delete the keys in deletedTable, so keys will be processed again in next
> KeyDeletingService iterator, and those blocks of those keys will be sent to
> SCM again and again. In a SCM log file, 256MB, holding all logs in 15s,
> there is 27 repeated log for one random picked block ID, similar to this.
> {code:java}
> 2024-03-12 18:28:55,727 WARN
> org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: Container: #34680 was
> deleted for the transaction: txID: 481479568
> containerID: 34680
> localID: 111677748037003018
> localID: 111677748037003295
> localID: 111677748037003639
> localID: 111677748037003811
> localID: 111677748037003880
> localID: 111677748037004088
> localID: 111677748037004282
> localID: 111677748037004521
> localID: 111677748037004658
> localID: 111677748037004818
> localID: 111677748037005071
> localID: 111677748037005223
> localID: 111677748037005365
> localID: 111677748037005641
> localID: 111677748037005752
> localID: 111677748037005934
> localID: 111677748037002486
> localID: 111677748037002740
> count: 0
> {code}
> The consequence is duplicates key block delete requests send to SCM again and
> again. And SCM is busy to create new delete transactions for it, and process
> those transactions. The user's leader SCM terminated itself frequently
> during this period, caused by long GC exceeds 60s.
> One more issue observed in this case is it took almost 60s to finish one
> round of key delete. Currently the key delete task runs every
> "ozone.block.deleting.service.interval" 60s. Currently it's not clear that
> this 60s, whether because SCM is slow or OM ratis is slow. Add more logs to
> tell that.
> {code:java}
> sammi@SAMMICHEN-MB0 master02 % grep "Number of keys deleted" ozone-om.log
> 2024-03-31 16:39:35,197 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 192, elapsed time: 200286ms
> 2024-03-31 16:39:36,256 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 192, elapsed time: 1058ms
> 2024-03-31 22:22:54,103 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20000, elapsed time: 61224ms
> 2024-03-31 22:23:57,816 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20000, elapsed time: 63712ms
> 2024-04-01 22:03:32,881 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20000, elapsed time: 64292ms
> 2024-04-01 22:04:36,903 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20000, elapsed time: 64021ms
> 2024-04-01 22:52:54,213 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20000, elapsed time: 63099ms
> 2024-04-01 22:53:56,579 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20000, elapsed time: 62366ms
> 2024-04-04 10:35:48,089 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20007, elapsed time: 59047ms
> 2024-04-08 11:50:33,453 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20023, elapsed time: 61243ms
> 2024-04-08 11:51:38,612 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20080, elapsed time: 65159ms
> 2024-04-10 14:32:05,706 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20015, elapsed time: 62690ms
> 2024-04-10 14:33:08,236 INFO
> org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys
> deleted: 20019, elapsed time: 62530ms
> {code}
> and
> {code:java}
> sammi@SAMMICHEN-MB0 master02 % grep "Background task execution took"
> ozone-om.log
> 2024-03-31 16:26:01,690 WARN org.apache.hadoop.hdds.utils.BackgroundService:
> KeyDeletingService Background task execution took 613583811660ns >
> 300000000000ns(timeout)
> 2024-03-31 16:36:14,911 WARN org.apache.hadoop.hdds.utils.BackgroundService:
> KeyDeletingService Background task execution took 613220813293ns >
> 300000000000ns(timeout)
> 2024-04-11 05:57:22,754 WARN org.apache.hadoop.hdds.utils.BackgroundService:
> KeyDeletingService Background task execution took 3459626398958ns >
> 300000000000ns(timeout)
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]