[
https://issues.apache.org/jira/browse/HDDS-14725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Attila Doroszlai updated HDDS-14725:
------------------------------------
Issue Type: Improvement (was: Bug)
> Display retry messages in cli when scm's are unavailable
> --------------------------------------------------------
>
> Key: HDDS-14725
> URL: https://issues.apache.org/jira/browse/HDDS-14725
> Project: Apache Ozone
> Issue Type: Improvement
> Reporter: Gargi Jaiswal
> Assignee: Gargi Jaiswal
> Priority: Major
> Labels: pull-request-available
>
> When all SCM instances are down or unreachable, CLI commands that query SCM
> (e.g. {{ozone admin datanode list}}, {{decommission}}, {{diskbalancer}},
> {{usageinfo}}, {{maintenance}}, etc.) appear to hang for up to
> *~10–15 minutes* before failing.
> This is due to the SCM client retry configuration:
>
> {code:java}
> hdds.scmclient.rpc.timeout = 15m
> hdds.scmclient.max.retry.timeout = 10m
> hdds.scmclient.retry.interval = 2s
> hdds.scmclient.max.retry = 15
> {code}
>
> Retries happen internally, but the CLI shows no feedback during this
> period, so the command appears to be stuck until it finally prints an error
> after about 15 minutes.
> In contrast, in HDFS (e.g. when NameNodes are down), retry attempts are
> logged immediately via {{RetryInvocationHandler}}, providing clear user
> feedback.
> *Proposed Fix:*
> Print the retry messages from {{SCMFailoverProxyProviderBase}} to {{stderr}}
> so that they show up in the CLI output.
> Example:
> {code:java}
> // Current behaviour with SCMs down
> bash-5.1$ ozone admin datanode list
> <----------- appears stuck for 15 minutes with no CLI error message ----------->
> {code}
>
> {code:java}
> // Proposed behaviour for all commands querying SCM
> bash-5.1$ ozone admin datanode list
> UnknownHostException: Invalid host name, while invoking
> StorageContainerLocationProtocolPB over scm2 after 1 failover attempt(s).
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking
> StorageContainerLocationProtocolPB over scm3 after 2 failover attempt(s).
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking
> StorageContainerLocationProtocolPB over scm1 after 3 failover attempt(s).
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking
> StorageContainerLocationProtocolPB over scm2 after 4 failover attempt(s).
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking
> StorageContainerLocationProtocolPB over scm3 after 5 failover attempt(s).
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking
> StorageContainerLocationProtocolPB over scm1 after 6 failover attempt(s).
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking
> StorageContainerLocationProtocolPB over scm2 after 7 failover attempt(s).
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking
> StorageContainerLocationProtocolPB over scm3 after 8 failover attempt(s).
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking
> StorageContainerLocationProtocolPB over scm1 after 9 failover attempt(s).
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking
> StorageContainerLocationProtocolPB over scm2 after 10 failover attempt(s).
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking
> StorageContainerLocationProtocolPB over scm3 after 11 failover attempt(s).
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking
> StorageContainerLocationProtocolPB over scm1 after 12 failover attempt(s).
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking
> StorageContainerLocationProtocolPB over scm2 after 13 failover attempt(s).
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking
> StorageContainerLocationProtocolPB over scm3 after 14 failover attempt(s).
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking
> StorageContainerLocationProtocolPB over scm1 after 15 failover attempt(s).
> Trying to failover after sleeping for 2000ms.
> Invalid host name: local host is: "om1/172.18.0.4"; destination host is:
> "scm1":9860; java.net.UnknownHostException: Invalid host name: local host is:
> "om1/172.18.0.4"; destination host is: "scm1":9860;
> java.net.UnknownHostException; For more details see:
> http://wiki.apache.org/hadoop/UnknownHost; For more details see:
> http://wiki.apache.org/hadoop/UnknownHost
> {code}
>
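The proposed behaviour quoted above can be illustrated with a minimal, self-contained sketch. It is not the Ozone implementation: the class RetryFeedbackSketch, the NodeCall interface, and both method names are hypothetical, and the real change would live in SCMFailoverProxyProviderBase, whose code is not shown in this message. The sketch only assumes what the sample output suggests: nodes are tried in round-robin order, and each failed attempt prints one line to stderr before the client sleeps and fails over.

```java
import java.io.PrintStream;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.List;

public class RetryFeedbackSketch {

  /** Hypothetical stand-in for one RPC against a single SCM node. */
  public interface NodeCall<T> {
    T call(String node) throws Exception;
  }

  /** Builds one retry line in the shape of the proposed CLI output. */
  public static String formatRetryMessage(Exception e, String protocol,
      String node, int attempt, long sleepMs) {
    return String.format(
        "%s: %s, while invoking %s over %s after %d failover attempt(s). "
            + "Trying to failover after sleeping for %dms.",
        e.getClass().getSimpleName(), e.getMessage(),
        protocol, node, attempt, sleepMs);
  }

  /**
   * Tries the nodes in round-robin order. Every failure is reported on
   * {@code err} before sleeping; once {@code maxRetries} failovers have
   * been used up, the last exception is rethrown to the caller.
   */
  public static <T> T callWithFailover(List<String> nodes, NodeCall<T> call,
      String protocol, int maxRetries, long sleepMs, PrintStream err)
      throws Exception {
    int attempt = 0;
    while (true) {
      String node = nodes.get(attempt % nodes.size());
      try {
        return call.call(node);
      } catch (Exception e) {
        attempt++;
        if (attempt > maxRetries) {
          throw e; // retries exhausted: surface the final error
        }
        // Immediate user feedback instead of a silent wait.
        err.println(formatRetryMessage(e, protocol, node, attempt, sleepMs));
        Thread.sleep(sleepMs);
      }
    }
  }

  public static void main(String[] args) {
    List<String> scms = Arrays.asList("scm2", "scm3", "scm1");
    try {
      // Simulates all SCMs being unreachable; short sleep for the demo.
      callWithFailover(scms, node -> {
        throw new UnknownHostException("Invalid host name");
      }, "StorageContainerLocationProtocolPB", 3, 100L, System.err);
    } catch (Exception e) {
      System.err.println(e.getMessage());
    }
  }
}
```

Writing to a PrintStream passed in by the caller keeps the messages off stdout, so scripts that parse the command output are not affected, and it lets the behaviour be checked without a real SCM.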