[
https://issues.apache.org/jira/browse/HDDS-11121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shilun Fan updated HDDS-11121:
------------------------------
Summary: DeletedBlockLogImpl#onMessage Inter-process communication UUID
inconsistency. (was: Improve SCM deletion efficiency.)
> DeletedBlockLogImpl#onMessage Inter-process communication UUID inconsistency.
> -----------------------------------------------------------------------------
>
> Key: HDDS-11121
> URL: https://issues.apache.org/jira/browse/HDDS-11121
> Project: Apache Ozone
> Issue Type: Improvement
> Components: SCM
> Reporter: Shilun Fan
> Assignee: Shilun Fan
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2024-07-12-09-37-23-618.png, screenshot-1.png
>
>
> Our Ozone cluster has recently encountered some issues with data deletion. We
> found that the SCM was unable to automatically clean up the data in the
> deletion queue, preventing the completion of the entire deletion process.
> During our problem analysis, we discovered an issue with
> {{DeletedBlockLogImpl#onMessage}}. The UUID transmitted from the DN via
> RPC was not recognized by the SCM, resulting in an "Unknown Datanode"
> exception. We attempted to fix this issue and made some progress.
> {code:java}
> 2024-07-08 12:08:19,606
> [scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN
> org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager:
> Unknown Datanode: 9df75b64-d0e4-44ae-9bc0-9355371c8a5b Scm Command ID:
> 1720041450931 report status PENDING
> 2024-07-08 12:08:19,606
> [scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN
> org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager:
> Unknown Datanode: 9df75b64-d0e4-44ae-9bc0-9355371c8a5b Scm Command ID:
> 1719241427194 report status PENDING
> 2024-07-08 12:08:19,606
> [scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN
> org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager:
> Unknown Datanode: 9df75b64-d0e4-44ae-9bc0-9355371c8a5b Scm Command ID:
> 1720041450931 report status PENDING
> 2024-07-08 12:08:19,606
> [scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN
> org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager:
> Unknown Datanode: 9df75b64-d0e4-44ae-9bc0-9355371c8a5b Scm Command ID:
> 1719241427194 report status PENDING
> 2024-07-08 12:08:19,617
> [scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN
> org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager:
> Unknown Datanode: efadefd7-4d25-42fd-a6ef-fabd64c97d7f Scm Command ID:
> 1720041450023 report status PENDING
> 2024-07-08 12:08:19,664
> [scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN
> org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager:
> Unknown Datanode: 0c4b82eb-3856-4984-9b0d-d9670089921b Scm Command ID:
> 1720106401909 report status PENDING
> 2024-07-08 12:08:19,664
> [scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN
> org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager:
> Unknown Datanode: 0c4b82eb-3856-4984-9b0d-d9670089921b Scm Command ID:
> 1719241427294 report status PENDING {code}
> {code:java}
> 2024-07-12 08:35:37,032
> [scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG
> org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId =
> 888a550f-c59c-4dde-ba3e-3dcf8f9593e0, localDnId =
> 888a550f-c59c-4dde-ba3e-3dcf8f9593e0, remoteDnId == localDnId[false]
> 2024-07-12 08:35:37,032
> [scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG
> org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId =
> c7919796-18fa-4f00-af94-9b7ebc21a572, localDnId =
> c7919796-18fa-4f00-af94-9b7ebc21a572, remoteDnId == localDnId[false]
> 2024-07-12 08:35:37,032
> [scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG
> org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId =
> 596cd6c8-ecc7-48da-8039-75fe59d65846, localDnId =
> 596cd6c8-ecc7-48da-8039-75fe59d65846, remoteDnId == localDnId[false]
> 2024-07-12 08:35:37,033
> [scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG
> org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId =
> de559349-fd76-4a5a-9acb-007432ba1876, localDnId =
> de559349-fd76-4a5a-9acb-007432ba1876, remoteDnId == localDnId[false]
> 2024-07-12 08:35:37,033
> [scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG
> org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId =
> 6a750295-7e7c-4786-b28c-f78509c41a02, localDnId =
> 6a750295-7e7c-4786-b28c-f78509c41a02, remoteDnId == localDnId[false] {code}
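The DEBUG output above prints identical UUID strings for remoteDnId and localDnId, yet reports `remoteDnId == localDnId[false]`. One possible reading, which is an assumption and not something the issue text confirms, is that the two IDs are distinct `java.util.UUID` objects compared by reference. The minimal sketch below only illustrates that general Java behavior; the class `UuidComparisonSketch` and its helper methods are hypothetical and are not taken from the Ozone code base.

```java
import java.util.UUID;

// Hypothetical illustration, not Ozone code: two UUID objects parsed from the
// same string are equal by value but are not the same object.
class UuidComparisonSketch {

    // Reference comparison: true only when both variables point to one object.
    static boolean sameReference(UUID a, UUID b) {
        return a == b;
    }

    // Value comparison: true when both UUIDs hold the same 128-bit value.
    static boolean sameValue(UUID a, UUID b) {
        return a != null && a.equals(b);
    }

    public static void main(String[] args) {
        String id = "888a550f-c59c-4dde-ba3e-3dcf8f9593e0";

        // Assumption: the local ID and the ID rebuilt from an RPC message are
        // created independently, so they are separate objects.
        UUID localDnId = UUID.fromString(id);
        UUID remoteDnId = UUID.fromString(id);

        System.out.println("remoteDnId == localDnId      : "
            + sameReference(remoteDnId, localDnId));
        System.out.println("remoteDnId.equals(localDnId) : "
            + sameValue(remoteDnId, localDnId));
    }
}
```

Under this assumption, comparing the IDs with `equals` (or using them as `HashMap` keys, which relies on `equals` and `hashCode`) would treat the reported Datanode as known, whereas `==` would not.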
> On July 8th, we applied this PR in the production environment. Currently, SCM
> deletion can proceed normally, as shown in the Grafana screenshot below.
> !image-2024-07-12-09-37-23-618.png!
> !screenshot-1.png!