[
https://issues.apache.org/jira/browse/HDDS-13728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18024848#comment-18024848
]
Ivan Andika commented on HDDS-13728:
------------------------------------
Currently, the suspected root cause is contention inside
{{DeletedBlockLogImpl}} between {{addTransactions}} (which appends the
deleted-block transactions from OM) and {{getTransactions}} (which the SCM
block deleting service uses to fetch the transactions to send).
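The issue description quoted below suggests a read-write lock as one possible improvement. A minimal sketch of that idea follows, assuming a simplified in-memory log; the class {{RwLockedTxLogSketch}} and its fields are hypothetical and are not the actual {{DeletedBlockLogImpl}} code. Note that a read-write lock only lets readers run concurrently with each other; a writer in {{addTransactions}} still excludes readers in {{getTransactions}}, so this alone may not remove the contention.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical, simplified transaction log guarded by a read-write lock. */
final class RwLockedTxLogSketch {
  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  // Stand-in for the deleted-blocks table: transaction IDs in arrival order.
  private final List<Long> txIds = new ArrayList<>();

  /** Writer path: appends new transactions under the exclusive write lock. */
  void addTransactions(List<Long> newTxIds) {
    lock.writeLock().lock();
    try {
      txIds.addAll(newTxIds);
    } finally {
      lock.writeLock().unlock();
    }
  }

  /** Reader path: copies up to {@code limit} transactions under the shared read lock. */
  List<Long> getTransactions(int limit) {
    lock.readLock().lock();
    try {
      return new ArrayList<>(txIds.subList(0, Math.min(limit, txIds.size())));
    } finally {
      lock.readLock().unlock();
    }
  }
}
```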
From the metrics, we see that a lot of transactions are skipped, around 2
million. The reason might be either that some containers have replicas on
datanodes that are not {{NodeStatus#inServiceHealthy}} (i.e.
NodeOperationalState.IN_SERVICE & NodeState.HEALTHY), or that the container
replicas are on datanodes that already have a lot of queued commands (i.e. the
datanodes are not returned by
{{SCMBlockDeletingService#getDatanodesWithinCommandLimit}}). In both cases the
transaction fails {{DeletedBlockLogImpl#checkInadequateReplica}} and is skipped.
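The skip behaviour described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the Ozone implementation: datanodes are plain strings, the class {{InadequateReplicaSketch}} and both method names are hypothetical, and the "eligible" set stands in for the datanodes that are in-service, healthy and within the command limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of skipping transactions with inadequate replicas. */
final class InadequateReplicaSketch {

  /** True if at least one replica sits on a datanode that is not eligible for commands. */
  static boolean hasInadequateReplica(Set<String> replicaDatanodes,
                                      Set<String> eligibleDatanodes) {
    return !eligibleDatanodes.containsAll(replicaDatanodes);
  }

  /**
   * Returns the IDs of the transactions that would be sent. A transaction is
   * skipped as soon as one of its container replicas is on an ineligible node.
   */
  static List<Long> selectSendable(Map<Long, Set<String>> txToReplicaDatanodes,
                                   Set<String> eligibleDatanodes) {
    List<Long> sendable = new ArrayList<>();
    for (Map.Entry<Long, Set<String>> entry : txToReplicaDatanodes.entrySet()) {
      if (!hasInadequateReplica(entry.getValue(), eligibleDatanodes)) {
        sendable.add(entry.getKey());
      }
    }
    return sendable;
  }
}
```

In this model a single decommissioned or overloaded datanode is enough to make every transaction whose container has a replica on it ineligible, which would be consistent with the large number of skipped transactions.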
We will backport [HDDS-11712
|https://jira.shopee.io/browse/HDDS-11712?atlOrigin=eyJpIjoiYjM0MTA4MzUyYTYxNDVkY2IwMzVjOGQ3ZWQ3NzMwM2QiLCJwIjoianN3LWdpdGxhYlNNLWludCJ9],
which avoids retrying the same set of delete transactions over and over again;
instead, it starts from the last transaction ID. The hope is that the next
set of transactions will have adequate replicas and pass
{{DeletedBlockLogImpl#checkInadequateReplica}}.
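The idea of continuing from the last transaction ID, rather than rescanning from the start each time, can be illustrated with the sketch below. It is not the HDDS-11712 patch: the class {{ResumableTxScanSketch}}, the cursor field, the {{scanLimit}} parameter and the wrap-around to the beginning once the end of the table is reached are all assumptions made for illustration, as is the convention that transaction IDs are non-negative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.function.LongPredicate;

/** Hypothetical scanner that resumes after the last transaction ID it looked at. */
final class ResumableTxScanSketch {
  // Stand-in for the deleted-blocks table, ordered by transaction ID.
  private final NavigableMap<Long, String> table = new TreeMap<>();
  // Cursor; -1 means "start from the beginning" (IDs assumed non-negative).
  private long lastProcessedTxId = -1;

  void add(long txId, String payload) {
    table.put(txId, payload);
  }

  /**
   * Scans at most scanLimit transactions after the cursor and returns those
   * accepted by the predicate. Rejected transactions are skipped for now and
   * are only revisited after the scan has wrapped around.
   */
  List<Long> nextBatch(int scanLimit, LongPredicate hasAdequateReplicas) {
    List<Long> batch = new ArrayList<>();
    int scanned = 0;
    for (Long txId : table.tailMap(lastProcessedTxId, false).keySet()) {
      if (scanned == scanLimit) {
        break;
      }
      scanned++;
      lastProcessedTxId = txId;
      if (hasAdequateReplicas.test(txId)) {
        batch.add(txId);
      }
    }
    // Assumption: once the end of the table is reached, start over next time.
    if (table.isEmpty() || lastProcessedTxId >= table.lastKey()) {
      lastProcessedTxId = -1;
    }
    return batch;
  }
}
```

With a scanner that always starts from the first entry, a block of transactions with inadequate replicas at the head of the table would be rescanned and skipped on every run; with the cursor, later transactions get a chance to be sent first.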
> DECOMMISSIONED IN_SERVICE datanodes might cause deletion to be slow
> -------------------------------------------------------------------
>
> Key: HDDS-13728
> URL: https://issues.apache.org/jira/browse/HDDS-13728
> Project: Apache Ozone
> Issue Type: Bug
> Affects Versions: 1.4.1
> Reporter: Ivan Andika
> Assignee: Ashish Kumar
> Priority: Major
>
> We recently encountered an issue where we decommissioned four datanodes but
> did not turn them off immediately. We saw that this caused a large backlog
> of pending deletions in both the OM and SCM.
> From OM logs, we saw that deleting 80,000 keys took more than 5 minutes,
> whereas normally it takes only a few seconds. As a result, the OM
> deletedTable entries kept increasing (to hundreds of millions).
>
> {code:java}
> KeyDeletingService Background task execution took 303525629747ns >
> 300000000000ns(timeout)
> KeyDeletingService Background task execution took 399958041329ns >
> 300000000000ns(timeout){code}
> From SCM logs, we saw that a lot of deletion transactions timed out (logs
> are truncated for visibility).
> {code:java}
> SCM BlockDeletionCommand
> ScmTxStateMachine{dnId=72e9966d-428d-4d05-ab92-f11dccc14d92,
> scmTxID=1757003087471, deletedBlocksTxIds=[9800591406, ...],
> updateTime=2025-09-30T06:53:29.994Z, status=SENT} for Datanode:
> 72e9966d-428d-4d05-ab92-f11dccc14d92 was removed after 300000ms without
> update {code}
> DeletedBlockTransactionScanner also became very slow:
> {code:java}
> Totally added 406081 blocks to be deleted for 88 datanodes / REDACTED
> totalnodes: [REDACTED], task elapsed time: 381931ms {code}
> The SCM deletedBlocksTable also kept increasing.
> We suspect it is due to the DeletedBlockLogImpl lock contention between
> addTransactions and getTransactions. One improvement might be to use a
> read-write lock for DeletedBlockLogImpl.
> However, when we turned off the DECOMMISSIONED datanodes, the deletion
> performance improved significantly. This is odd since DECOMMISSIONED
> datanodes should not trigger any deletion commands (since the machine is
> going to be decommissioned anyway). This leads me to suspect that there might
> be some issues in our deletion implementation. I suspect that the
> SCMDeletedBlockTransactionStatusManager#commitTransactions usage of
> ContainerManager#getContainerReplicas (which returns Set<ContainerReplica>)
> might not be correct, since the returned set also includes the DECOMMISSIONED
> datanodes, which never receive any deletion commands. This might cause
> commitTransactions to never remove entries from the deletedBlocksTable. We
> might need to exclude DECOMMISSIONED nodes instead.
>
> Our version is based on 1.4.1, so there might be recent improvements (e.g.
> in HDDS-11506) that we have not incorporated.