[
https://issues.apache.org/jira/browse/HDDS-11267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870252#comment-17870252
]
Arafat Khan commented on HDDS-11267:
------------------------------------
After investigation, we identified the root causes of the issue where datanodes
report negative container sizes. This problem was particularly noticeable when
attempting to delete containers that had already been marked for deletion. In
these cases, some containers returned negative values for used bytes and block
count metrics.
For example:
{code:java}
sh-4.2$ ozone admin container list | jq '. | {state: .state, containerID: .containerID, usedBytes: .usedBytes}'
{
  "state": "DELETED",
  "containerID": 1,
  "usedBytes": -100000000
}
{
  "state": "DELETED",
  "containerID": 2,
  "usedBytes": -95420416
}
{
  "state": "DELETED",
  "containerID": 3,
  "usedBytes": -97517568
} {code}
We examined the deletion process in detail:
# *Normal Flow:*
** The OM keeps track of the blocks and keys. When keys are deleted, OM
prepares a list of blocks associated with them and sends a deletion request to
SCM.
** SCM assigns a new transaction ID to the deletion request and sends it to
the datanodes holding the containers with those blocks.
** The datanode retrieves block information from its *{{blockInfo}}* table,
deletes the blocks, and updates the metrics for used bytes and block count
accordingly.
# *Issue with Duplicate Requests:*
** OM may resend a delete-block request if the same key is picked up again in
the next iteration before the previous transaction has been flushed to the
database. Retries after failures can also resend deletion requests for the same
key's blocks.
** SCM, unaware of the duplication, assigns a new transaction ID and forwards
the request to the datanode.
** When the datanode receives this duplicate request, it attempts to delete the
already-deleted blocks. It fails to find them, yet still decrements the
used-bytes and block-count metrics, driving them negative.
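A minimal sketch of the kind of guard that would avoid this: decrement the container's used-bytes and block-count metrics only for blocks that were actually present when the delete ran. The class and interface names below ({{SafeBlockDeleter}}, {{BlockStore}}, {{ContainerMetrics}}) are illustrative assumptions, not Ozone's real classes:
{code:java}
// Hypothetical sketch: update metrics only for blocks that really existed,
// so a duplicate delete request becomes a harmless no-op for the counters.
public final class SafeBlockDeleter {

  /** Minimal stand-in for the datanode's block store. */
  public interface BlockStore {
    /** Removes the block and returns its size in bytes, or -1 if absent. */
    long remove(long localId);
  }

  /** Minimal stand-in for per-container metrics. */
  public static final class ContainerMetrics {
    public long usedBytes;
    public long blockCount;
  }

  /** Deletes the given blocks, touching metrics only for real deletions. */
  public static void delete(BlockStore store, ContainerMetrics metrics,
                            long[] localIds) {
    for (long id : localIds) {
      long size = store.remove(id);
      if (size < 0) {
        // Block already gone (e.g. a duplicate delete request):
        // skip the metric update, otherwise the counters go negative.
        continue;
      }
      metrics.usedBytes -= size;
      metrics.blockCount -= 1;
    }
  }
}
{code}
With this guard, replaying the same delete request leaves the metrics at zero instead of pushing them below it.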
We confirmed this issue by adding extra logs. For example:
{code:java}
// The first valid request
2024-07-29 12:00:30 2024-07-29 06:30:30,815
[DeleteBlocksCommandHandlerThread-1] INFO
commandhandler.DeleteBlocksCommandHandler: isDuplicateTransaction called with
containerId: 2, containerDataDeleteTxnID: 0, delTX-ID: 2
2024-07-29 12:00:30 localID: 113750153625600011
2024-07-29 12:00:30 localID: 113750153625600014
2024-07-29 12:00:30 localID: 113750153625600017
// The second duplicate request
2024-07-29 12:00:30 2024-07-29 06:30:30,846
[DeleteBlocksCommandHandlerThread-2] INFO
commandhandler.DeleteBlocksCommandHandler: isDuplicateTransaction called with
containerId: 2, containerDataDeleteTxnID: 2, delTX-ID: 6
2024-07-29 12:00:30 localID: 113750153625600011
2024-07-29 12:00:30 localID: 113750153625600014
2024-07-29 12:00:30 localID: 113750153625600017{code}
Upon receiving the duplicate request, no blocks are found to delete, resulting
in the following log:
{code:java}
2024-07-31 13:32:28 2024-07-31 08:02:28,869 [BlockDeletingService#3] WARN
impl.FilePerBlockStrategy: Block file to be deleted does not exist:
/data/hdds/.../chunks/113750153625600011.block {code}
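The log above also shows why a per-container transaction-ID comparison alone cannot catch this duplicate: SCM assigns the resent request a fresh, higher ID (delTX-ID 6) than the one already recorded (containerDataDeleteTxnID 2). The snippet below is a simplified illustration of that naive check, not Ozone's actual {{isDuplicateTransaction}} implementation:
{code:java}
// Hypothetical sketch: a duplicate check based only on transaction IDs
// misses resent requests, because SCM renumbers them.
public class TxnIdCheckDemo {

  /** Naive check: treat a request as duplicate only if its ID is not newer. */
  public static boolean isDuplicate(long lastAppliedTxnId, long incomingTxnId) {
    return incomingTxnId <= lastAppliedTxnId;
  }

  public static void main(String[] args) {
    long containerDataDeleteTxnID = 0;

    // First request (delTX-ID 2 in the log): correctly not a duplicate.
    boolean first = isDuplicate(containerDataDeleteTxnID, 2);
    containerDataDeleteTxnID = 2;

    // Resent request for the same blocks, but SCM assigned delTX-ID 6:
    // the ID check still says "not a duplicate", so the request slips
    // through and the metrics are decremented a second time.
    boolean second = isDuplicate(containerDataDeleteTxnID, 6);

    System.out.println(first + " " + second); // prints "false false"
  }
}
{code}
This is why the datanode also needs to account for whether the blocks were actually found, rather than trusting the transaction ID alone.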
> Ozone Datanode Reporting Negative Container values for UsedBytes and
> BlockCount parameters
> ------------------------------------------------------------------------------------------
>
> Key: HDDS-11267
> URL: https://issues.apache.org/jira/browse/HDDS-11267
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Datanode, Ozone Recon
> Reporter: Arafat Khan
> Assignee: Arafat Khan
> Priority: Major
>
> The issue involves datanodes in Ozone reporting negative container sizes for
> the {{usedBytes}} and block count metrics. This occurs when the Ozone Manager
> sends duplicate block deletion requests to the Storage Container Manager. Due
> to a delay in processing the original request, OM may mistakenly send a
> duplicate request. The datanode, upon receiving the duplicate request,
> attempts to delete blocks that have already been deleted, but still updates
> the metrics, leading to negative values. The proposed solution is to modify
> the deletion process in the datanode to track and ignore duplicate block
> deletion requests, ensuring metrics are not updated incorrectly.
> Recon reported the following negative-sized containers:
> {code:java}
> sh-4.2$ ozone admin container list | jq '. | {state: .state, containerID: .containerID, usedBytes: .usedBytes}'
> {
>   "state": "DELETED",
>   "containerID": 1,
>   "usedBytes": -100000000
> }
> {
>   "state": "DELETED",
>   "containerID": 2,
>   "usedBytes": -95420416
> }
> {
>   "state": "DELETED",
>   "containerID": 3,
>   "usedBytes": -97517568
> }{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)