[jira] [Updated] (HDDS-11714) resetDeletedBlockRetryCount with --all may fail and can cause long db lock in large cluster

Ashish Kumar (Jira) Fri, 15 Nov 2024 02:51:22 -0800


     [ 
https://issues.apache.org/jira/browse/HDDS-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ashish Kumar updated HDDS-11714:
--------------------------------
    Description: 
When resetDeletedBlockRetryCount is run with the --all option, SCM takes a 
[lock|https://github.com/apache/ozone/blob/12419fae1f0418793d952227364b04f1d2c3583b/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/block/DeletedBlockLogImpl.java#L126]
, collects all the transactions with max retry count, and then updates the DB 
to set their count to 0. In a large-scale environment the number of such 
transactions can be huge, which can lead to multiple problems.

i) The lock is held for a long time and can block all other normal operations.

ii) The update is passed through Ratis as a message, which will fail because 
of its size.

Instead, we should do this operation in batches to avoid the long lock and the 
Ratis message size failure.

  was:
In case of resetDeletedBlockRetryCount with --all option, scm takes 
[lock|https://github.com/apache/ozone/blob/12419fae1f0418793d952227364b04f1d2c3583b/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/block/DeletedBlockLogImpl.java#L126]
 and tries to get all the transaction with max retry and then updates DB with 0 
count. In some large scale env this count can be huge which can lead to 
multiple problem.

i) Lock can lead to block all other normal operation.

ii) Since message is passed through ratis, which will fail because of size.

Instead of doing like above we should do this operation in batche to avoid long 
lock and ratis message size failure.


> resetDeletedBlockRetryCount with --all may fail and can cause long db lock in 
> large cluster
> -------------------------------------------------------------------------------------------
>
>                 Key: HDDS-11714
>                 URL: https://issues.apache.org/jira/browse/HDDS-11714
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: Ashish Kumar
>            Assignee: Aryan Gupta
>            Priority: Major
>
> When resetDeletedBlockRetryCount is run with the --all option, SCM takes a 
> [lock|https://github.com/apache/ozone/blob/12419fae1f0418793d952227364b04f1d2c3583b/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/block/DeletedBlockLogImpl.java#L126]
> , collects all the transactions with max retry count, and then updates the 
> DB to set their count to 0. In a large-scale environment the number of such 
> transactions can be huge, which can lead to multiple problems.
> i) The lock is held for a long time and can block all other normal operations.
> ii) The update is passed through Ratis as a message, which will fail because 
> of its size.
> Instead, we should do this operation in batches to avoid the long lock and 
> the Ratis message size failure.
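
The batching idea proposed in the description could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Ozone change: the class name BatchedRetryCountReset, the methods partition and resetInBatches, and the resetBatch callback (standing in for one bounded, Ratis-replicated DB update) are all hypothetical. The point it shows is that the lock is taken and released once per batch, and each message carries at most batchSize transaction IDs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.ToIntFunction;

// Hypothetical sketch; none of these names come from the Ozone code base.
final class BatchedRetryCountReset {
  private final Lock lock = new ReentrantLock();

  /** Splits txIds into consecutive batches of at most batchSize elements. */
  static List<List<Long>> partition(List<Long> txIds, int batchSize) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    List<List<Long>> batches = new ArrayList<>();
    for (int i = 0; i < txIds.size(); i += batchSize) {
      int end = Math.min(i + batchSize, txIds.size());
      batches.add(new ArrayList<>(txIds.subList(i, end)));
    }
    return batches;
  }

  /**
   * Resets retry counts one batch at a time. The lock is held only while a
   * single batch is applied, so other operations can run between batches.
   * resetBatch is an assumed callback that applies one bounded update and
   * returns how many transactions it reset.
   */
  int resetInBatches(List<Long> txIds, int batchSize,
      ToIntFunction<List<Long>> resetBatch) {
    int total = 0;
    for (List<Long> batch : partition(txIds, batchSize)) {
      lock.lock();
      try {
        total += resetBatch.applyAsInt(batch);
      } finally {
        lock.unlock();
      }
    }
    return total;
  }
}
```

This sketch assumes the full list of transaction IDs is already in memory. For the --all case a real fix would presumably also need to read the transactions from the DB in pages, so that the scan itself does not hold the lock for long.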



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

