[
https://issues.apache.org/jira/browse/HDDS-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ashish Kumar updated HDDS-11714:
--------------------------------
Description:
When resetDeletedBlockRetryCount is run with the --all option, SCM takes a
[lock|https://github.com/apache/ozone/blob/12419fae1f0418793d952227364b04f1d2c3583b/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/block/DeletedBlockLogImpl.java#L126]
and tries to get all the transactions with max retry, then updates the DB to
set their count to 0. In some large-scale environments the number of such
transactions can be huge, which can lead to multiple problems:
i) The lock is held for a long time and can block all other normal operations.
ii) The update is passed through Ratis, so it can fail because of the message
size.
Instead, we should do this operation in batches to avoid the long lock and the
Ratis message size failure.
was:
In case of resetDeletedBlockRetryCount with --all option, scm takes
[lock|https://github.com/apache/ozone/blob/12419fae1f0418793d952227364b04f1d2c3583b/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/block/DeletedBlockLogImpl.java#L126]
and tries to get all the transaction with max retry and then updates DB with 0
count. In some large scale env this count can be huge which can lead to
multiple problem.
i) Lock can lead to block all other normal operation.
ii) Since message is passed through ratis, which will fail because of size.
Instead of doing like above we should do this operation in batche to avoid long
lock and ratis message size failure.
> resetDeletedBlockRetryCount with --all may fail and can cause long db lock in
> large cluster
> -------------------------------------------------------------------------------------------
>
> Key: HDDS-11714
> URL: https://issues.apache.org/jira/browse/HDDS-11714
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Ashish Kumar
> Assignee: Aryan Gupta
> Priority: Major
>
> When resetDeletedBlockRetryCount is run with the --all option, SCM takes a
> [lock|https://github.com/apache/ozone/blob/12419fae1f0418793d952227364b04f1d2c3583b/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/block/DeletedBlockLogImpl.java#L126]
> and tries to get all the transactions with max retry, then updates the DB to
> set their count to 0. In some large-scale environments the number of such
> transactions can be huge, which can lead to multiple problems:
> i) The lock is held for a long time and can block all other normal operations.
> ii) The update is passed through Ratis, so it can fail because of the message
> size.
> Instead, we should do this operation in batches to avoid the long lock and
> the Ratis message size failure.
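
The batching idea proposed in the description could look roughly like the sketch below. This is only an illustration under stated assumptions, not the code in DeletedBlockLogImpl: the class BatchedRetryReset, the method resetInBatches, the batchSize parameter and the submit callback are all hypothetical names. The callback stands in for whatever would send one bounded update through Ratis, and the lock is taken per batch rather than once for the whole run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

// Hypothetical sketch: none of these names come from the Ozone code base.
public class BatchedRetryReset {
  // Stand-in for the lock that guards the deleted block log.
  private static final ReentrantLock LOCK = new ReentrantLock();

  /**
   * Splits txIds into chunks of at most batchSize and passes each chunk to
   * submit while holding the lock only for that chunk.
   * Returns the number of batches submitted.
   */
  public static int resetInBatches(List<Long> txIds, int batchSize,
      Consumer<List<Long>> submit) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    int batches = 0;
    for (int start = 0; start < txIds.size(); start += batchSize) {
      int end = Math.min(start + batchSize, txIds.size());
      // Copy the slice so the callback never sees a view of the full list.
      List<Long> batch = new ArrayList<>(txIds.subList(start, end));
      LOCK.lock();
      try {
        // Assumption: one call here corresponds to one bounded-size request.
        submit.accept(batch);
      } finally {
        // Release between batches so other operations can make progress.
        LOCK.unlock();
      }
      batches++;
    }
    return batches;
  }

  public static void main(String[] args) {
    List<Long> ids = new ArrayList<>();
    for (long i = 1; i <= 10; i++) {
      ids.add(i);
    }
    int n = resetInBatches(ids, 4, b -> System.out.println("reset " + b));
    System.out.println("batches=" + n);
  }
}
```

With 10 transaction ids and a batch size of 4, the callback is invoked three times (4, 4 and 2 ids). In a real change the batch size would need to be chosen so that a single request stays under the configured Ratis message size limit, and the scan for transactions to reset would likely need to be batched as well.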