[
https://issues.apache.org/jira/browse/HDDS-11121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shilun Fan updated HDDS-11121:
------------------------------
Description:
Our Ozone cluster has recently encountered some issues with data deletion. We
found that the SCM was unable to automatically clean up the data in the
deletion queue, preventing the completion of the entire deletion process.
During our problem analysis, we discovered an issue with
{{DeletedBlockLogImpl#onMessage}}. The UUID transmitted from the DN via RPC
was not recognized by the SCM, resulting in "Unknown Datanode" warnings. We
attempted to fix this issue and made some progress.
{code:java}
2024-07-08 12:08:19,606
[scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN
org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager:
Unknown Datanode: 9df75b64-d0e4-44ae-9bc0-9355371c8a5b Scm Command ID:
1720041450931 report status PENDING
2024-07-08 12:08:19,606
[scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN
org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager:
Unknown Datanode: 9df75b64-d0e4-44ae-9bc0-9355371c8a5b Scm Command ID:
1719241427194 report status PENDING
2024-07-08 12:08:19,606
[scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN
org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager:
Unknown Datanode: 9df75b64-d0e4-44ae-9bc0-9355371c8a5b Scm Command ID:
1720041450931 report status PENDING
2024-07-08 12:08:19,606
[scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN
org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager:
Unknown Datanode: 9df75b64-d0e4-44ae-9bc0-9355371c8a5b Scm Command ID:
1719241427194 report status PENDING
2024-07-08 12:08:19,617
[scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN
org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager:
Unknown Datanode: efadefd7-4d25-42fd-a6ef-fabd64c97d7f Scm Command ID:
1720041450023 report status PENDING
2024-07-08 12:08:19,664
[scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN
org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager:
Unknown Datanode: 0c4b82eb-3856-4984-9b0d-d9670089921b Scm Command ID:
1720106401909 report status PENDING
2024-07-08 12:08:19,664
[scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN
org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager:
Unknown Datanode: 0c4b82eb-3856-4984-9b0d-d9670089921b Scm Command ID:
1719241427294 report status PENDING {code}
Debug output from DeletedBlockLogImpl shows remoteDnId and localDnId printing
identically while the comparison evaluates to false:
{code:java}
2024-07-12 08:35:37,032
[scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG
org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId =
888a550f-c59c-4dde-ba3e-3dcf8f9593e0, localDnId =
888a550f-c59c-4dde-ba3e-3dcf8f9593e0, remoteDnId == localDnId[false]
2024-07-12 08:35:37,032
[scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG
org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId =
c7919796-18fa-4f00-af94-9b7ebc21a572, localDnId =
c7919796-18fa-4f00-af94-9b7ebc21a572, remoteDnId == localDnId[false]
2024-07-12 08:35:37,032
[scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG
org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId =
596cd6c8-ecc7-48da-8039-75fe59d65846, localDnId =
596cd6c8-ecc7-48da-8039-75fe59d65846, remoteDnId == localDnId[false]
2024-07-12 08:35:37,033
[scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG
org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId =
de559349-fd76-4a5a-9acb-007432ba1876, localDnId =
de559349-fd76-4a5a-9acb-007432ba1876, remoteDnId == localDnId[false]
2024-07-12 08:35:37,033
[scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG
org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId =
6a750295-7e7c-4786-b28c-f78509c41a02, localDnId =
6a750295-7e7c-4786-b28c-f78509c41a02, remoteDnId == localDnId[false] {code}
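The debug output above prints identical strings for remoteDnId and localDnId,
yet reports the comparison as false. The description does not state the root
cause. One Java behaviour that would produce exactly this output is comparing
two distinct java.util.UUID instances with == (reference identity) instead of
equals (value equality). The following is a minimal sketch of that behaviour
only, not the Ozone code or the change made in the PR; the class
DnIdComparisonSketch and its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical illustration only; this is not taken from the Ozone source.
final class DnIdComparisonSketch {

  // Reference comparison: true only when both variables point to the same object.
  static boolean sameReference(UUID remoteDnId, UUID localDnId) {
    return remoteDnId == localDnId;
  }

  // Value comparison: true when both UUIDs hold the same 128-bit value.
  static boolean sameValue(UUID remoteDnId, UUID localDnId) {
    return remoteDnId != null && remoteDnId.equals(localDnId);
  }

  public static void main(String[] args) {
    String id = "888a550f-c59c-4dde-ba3e-3dcf8f9593e0";
    // Assumption: one instance is held locally, the other is rebuilt from a message.
    UUID localDnId = UUID.fromString(id);
    UUID remoteDnId = UUID.fromString(id);

    System.out.println("remoteDnId = " + remoteDnId + ", localDnId = " + localDnId);
    System.out.println("remoteDnId == localDnId[" + sameReference(remoteDnId, localDnId) + "]");
    System.out.println("remoteDnId.equals(localDnId)[" + sameValue(remoteDnId, localDnId) + "]");

    // A HashMap keyed by UUID relies on equals/hashCode, so a value-equal key is found.
    Map<UUID, String> statusByDn = new HashMap<>();
    statusByDn.put(localDnId, "PENDING");
    System.out.println("containsKey(remoteDnId) = " + statusByDn.containsKey(remoteDnId));
  }
}
```

If this is what happens, an ID rebuilt from an RPC message can never be the
same object as the one already held by the SCM, so an identity-based check
fails even though the printed values match.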
On July 8th, we applied this PR in the production environment. Currently, SCM
deletion can proceed normally, as shown in the Grafana screenshot below.
!image-2024-07-12-09-37-23-618.png!
was:
Our Ozone cluster has recently encountered some issues with data deletion. We
found that the SCM was unable to automatically clean up the data in the
deletion queue, preventing the completion of the entire deletion process.
During our problem analysis, we discovered an issue with
{{DeletedBlockLogImpl#onMessage}}. The UUID transmitted from the DN via RPC
was not recognized by the SCM, resulting in an "Unknown Datanode" exception. We
attempted to fix this issue and made some progress.
{code:java}
2024-07-08 12:08:19,606
[scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN
org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager:
Unknown Datanode: 9df75b64-d0e4-44ae-9bc0-9355371c8a5b Scm Command ID:
1720041450931 report status PENDING
2024-07-08 12:08:19,606
[scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN
org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager:
Unknown Datanode: 9df75b64-d0e4-44ae-9bc0-9355371c8a5b Scm Command ID:
1719241427194 report status PENDING
2024-07-08 12:08:19,606
[scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN
org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager:
Unknown Datanode: 9df75b64-d0e4-44ae-9bc0-9355371c8a5b Scm Command ID:
1720041450931 report status PENDING
2024-07-08 12:08:19,606
[scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN
org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager:
Unknown Datanode: 9df75b64-d0e4-44ae-9bc0-9355371c8a5b Scm Command ID:
1719241427194 report status PENDING
2024-07-08 12:08:19,617
[scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN
org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager:
Unknown Datanode: efadefd7-4d25-42fd-a6ef-fabd64c97d7f Scm Command ID:
1720041450023 report status PENDING
2024-07-08 12:08:19,664
[scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN
org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager:
Unknown Datanode: 0c4b82eb-3856-4984-9b0d-d9670089921b Scm Command ID:
1720106401909 report status PENDING
2024-07-08 12:08:19,664
[scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN
org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager:
Unknown Datanode: 0c4b82eb-3856-4984-9b0d-d9670089921b Scm Command ID:
1719241427294 report status PENDING {code}
{code:java}
2024-07-12 08:35:37,032
[scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG
org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId =
888a550f-c59c-4dde-ba3e-3dcf8f9593e0, localDnId =
888a550f-c59c-4dde-ba3e-3dcf8f9593e0, remoteDnId == localDnId[false]
2024-07-12 08:35:37,032
[scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG
org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId =
c7919796-18fa-4f00-af94-9b7ebc21a572, localDnId =
c7919796-18fa-4f00-af94-9b7ebc21a572, remoteDnId == localDnId[false]
2024-07-12 08:35:37,032
[scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG
org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId =
596cd6c8-ecc7-48da-8039-75fe59d65846, localDnId =
596cd6c8-ecc7-48da-8039-75fe59d65846, remoteDnId == localDnId[false]
2024-07-12 08:35:37,033
[scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG
org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId =
de559349-fd76-4a5a-9acb-007432ba1876, localDnId =
de559349-fd76-4a5a-9acb-007432ba1876, remoteDnId == localDnId[false]
2024-07-12 08:35:37,033
[scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG
org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId =
6a750295-7e7c-4786-b28c-f78509c41a02, localDnId =
6a750295-7e7c-4786-b28c-f78509c41a02, remoteDnId == localDnId[false] {code}
!image-2024-07-12-09-37-23-618.png!
> DeletedBlockLogImpl#onMessage Inter-process communication UUID inconsistency
> ----------------------------------------------------------------------------
>
> Key: HDDS-11121
> URL: https://issues.apache.org/jira/browse/HDDS-11121
> Project: Apache Ozone
> Issue Type: Bug
> Components: SCM
> Reporter: Shilun Fan
> Assignee: Shilun Fan
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2024-07-12-09-37-23-618.png
>
>