[jira] [Updated] (HDDS-14725) CLI commands querying SCM appear to hang when SCM is down due to silent client retries

Gargi Jaiswal (Jira) Wed, 25 Feb 2026 20:35:07 -0800


     [ 
https://issues.apache.org/jira/browse/HDDS-14725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Gargi Jaiswal updated HDDS-14725:
---------------------------------
    Description: 
When all SCM instances are down or unreachable, CLI commands that query SCM 
(e.g. {color:#de350b}{{ozone admin datanode list}}, {{decommission}}, 
{{diskbalancer}}, {{usageinfo}}, {{maintenance}}{color}, etc.) appear 
to hang for up to *~10–15 minutes* before failing.

This is due to SCM client retry configuration:

 
{code:java}
hdds.scmclient.rpc.timeout = 15m
hdds.scmclient.max.retry.timeout = 10m
hdds.scmclient.retry.interval = 2s
hdds.scmclient.max.retry = 15
{code}
 

Retries happen internally, but the CLI shows no feedback during this period, 
so the command appears stuck and only reports an error after about 15 
minutes.

In contrast, in HDFS (e.g. when NameNodes are down), retry attempts are logged 
immediately via {{RetryInvocationHandler}}, providing clear user feedback.

*Proposed Fix:*

Surface the retry logs in the CLI output on 
{color:#de350b}{{stderr}}{color}, emitted from 
{color:#de350b}{{SCMFailoverProxyProviderBase}}{color}.

Example:
{code:java}
// Current behaviour when all SCMs are down
bash-5.1$ ozone admin datanode list
<----------- Appears stuck for ~15 minutes with no CLI error message ----------->
{code}
 
{code:java}
// Proposed fix for all commands querying scm

bash-5.1$ ozone admin datanode list
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm2 after 1 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm3 after 2 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm1 after 3 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm2 after 4 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm3 after 5 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm1 after 6 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm2 after 7 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm3 after 8 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm1 after 9 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm2 after 10 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm3 after 11 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm1 after 12 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm2 after 13 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm3 after 14 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm1 after 15 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
Invalid host name: local host is: "om1/172.18.0.4"; destination host is: 
"scm1":9860; java.net.UnknownHostException: Invalid host name: local host is: 
"om1/172.18.0.4"; destination host is: "scm1":9860; 
java.net.UnknownHostException; For more details see: 
http://wiki.apache.org/hadoop/UnknownHost; For more details see: 
http://wiki.apache.org/hadoop/UnknownHost {code}
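
The proposed behaviour could be sketched roughly as follows. This is not the actual {{SCMFailoverProxyProviderBase}} code: the class {{RetryFeedbackSketch}}, the {{Attempt}} interface, and both method names are hypothetical, and the message format simply mirrors the example output above. The sketch assumes round-robin failover over a fixed list of SCM node IDs and a fixed sleep between attempts.

```java
import java.io.IOException;
import java.io.PrintStream;
import java.net.UnknownHostException;
import java.util.List;

/** Hypothetical sketch only; not the real Ozone failover provider. */
class RetryFeedbackSketch {

  /** One RPC attempt against a named SCM node (hypothetical interface). */
  @FunctionalInterface
  interface Attempt<T> {
    T call(String nodeId) throws IOException;
  }

  /** Builds a retry message in the format shown in the example output. */
  static String formatRetryMessage(Exception e, String protocol,
      String nodeId, int attempt, long sleepMs) {
    return String.format(
        "%s: %s, while invoking %s over %s after %d failover attempt(s). "
            + "Trying to failover after sleeping for %dms.",
        e.getClass().getSimpleName(), e.getMessage(), protocol, nodeId,
        attempt, sleepMs);
  }

  /**
   * Tries the call on each node in round-robin order. Every failed attempt
   * that will be retried is reported on the given stream (stderr in the CLI)
   * before sleeping, so the user sees progress instead of a silent hang.
   */
  static <T> T callWithFailover(List<String> nodeIds, String protocol,
      int maxRetry, long sleepMs, PrintStream err, Attempt<T> attempt)
      throws IOException, InterruptedException {
    if (nodeIds.isEmpty() || maxRetry < 0) {
      throw new IllegalArgumentException("need >= 1 node and maxRetry >= 0");
    }
    IOException last = null;
    for (int i = 0; i <= maxRetry; i++) {
      String nodeId = nodeIds.get(i % nodeIds.size());
      try {
        return attempt.call(nodeId);
      } catch (IOException e) {
        last = e;
        if (i == maxRetry) {
          break; // retries exhausted; fall through and rethrow
        }
        err.println(formatRetryMessage(e, protocol, nodeId, i + 1, sleepMs));
        Thread.sleep(sleepMs);
      }
    }
    throw last;
  }

  public static void main(String[] args) throws Exception {
    try {
      // Simulate every SCM being unresolvable; 3 retries, 100 ms apart.
      callWithFailover(List.of("scm1", "scm2", "scm3"),
          "StorageContainerLocationProtocolPB", 3, 100L, System.err,
          nodeId -> { throw new UnknownHostException("Invalid host name"); });
    } catch (IOException e) {
      System.err.println("Giving up: " + e.getMessage());
    }
  }
}
```

Writing to a {{PrintStream}} passed in by the caller keeps the retry messages off stdout, so scripts that parse the command output are unaffected.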
 



> CLI commands querying SCM appear to hang when SCM is down due to silent 
> client retries
> --------------------------------------------------------------------------------------
>
>                 Key: HDDS-14725
>                 URL: https://issues.apache.org/jira/browse/HDDS-14725
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Gargi Jaiswal
>            Assignee: Gargi Jaiswal
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
