[ 
https://issues.apache.org/jira/browse/HDDS-14725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Doroszlai updated HDDS-14725:
------------------------------------
    Component/s: Ozone CLI

> Display retry messages in cli when scm's are unavailable
> --------------------------------------------------------
>
>                 Key: HDDS-14725
>                 URL: https://issues.apache.org/jira/browse/HDDS-14725
>             Project: Apache Ozone
>          Issue Type: Improvement
>          Components: Ozone CLI
>            Reporter: Gargi Jaiswal
>            Assignee: Gargi Jaiswal
>            Priority: Major
>              Labels: pull-request-available
>
> When all SCM instances are down or unreachable, CLI commands that query SCM 
> (e.g. {{ozone admin datanode list}}, {{decommission}}, {{diskbalancer}}, 
> {{usageinfo}}, {{maintenance}}) appear to hang for up to *~10–15 minutes* 
> before failing.
> This is due to SCM client retry configuration:
>  
> {code:java}
> hdds.scmclient.rpc.timeout = 15m
> hdds.scmclient.max.retry.timeout = 10m
> hdds.scmclient.retry.interval = 2s
> hdds.scmclient.max.retry = 15
> {code}
>  
> Retries happen internally, but the CLI shows no feedback during this period, 
> so the command appears to be stuck and only reports an error after about 
> 15 minutes.
> In contrast, in HDFS (e.g. when NameNodes are down), retry attempts are 
> logged immediately via {{RetryInvocationHandler}}, providing clear user 
> feedback.
> *Proposed Fix:*
> Print the retry logs from {{SCMFailoverProxyProviderBase}} to {{stderr}} so 
> that they show up in the CLI output.
> Example:
> {code:java}
> // Current behaviour with SCMs down
> bash-5.1$ ozone admin datanode list
> <----------- Appears stuck for 15 minutes with no CLI error message ----------->
> {code}
>  
> {code:java}
> // Proposed fix for all commands querying SCM
> bash-5.1$ ozone admin datanode list
> UnknownHostException: Invalid host name, while invoking 
> StorageContainerLocationProtocolPB over scm2 after 1 failover attempt(s). 
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking 
> StorageContainerLocationProtocolPB over scm3 after 2 failover attempt(s). 
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking 
> StorageContainerLocationProtocolPB over scm1 after 3 failover attempt(s). 
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking 
> StorageContainerLocationProtocolPB over scm2 after 4 failover attempt(s). 
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking 
> StorageContainerLocationProtocolPB over scm3 after 5 failover attempt(s). 
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking 
> StorageContainerLocationProtocolPB over scm1 after 6 failover attempt(s). 
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking 
> StorageContainerLocationProtocolPB over scm2 after 7 failover attempt(s). 
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking 
> StorageContainerLocationProtocolPB over scm3 after 8 failover attempt(s). 
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking 
> StorageContainerLocationProtocolPB over scm1 after 9 failover attempt(s). 
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking 
> StorageContainerLocationProtocolPB over scm2 after 10 failover attempt(s). 
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking 
> StorageContainerLocationProtocolPB over scm3 after 11 failover attempt(s). 
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking 
> StorageContainerLocationProtocolPB over scm1 after 12 failover attempt(s). 
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking 
> StorageContainerLocationProtocolPB over scm2 after 13 failover attempt(s). 
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking 
> StorageContainerLocationProtocolPB over scm3 after 14 failover attempt(s). 
> Trying to failover after sleeping for 2000ms.
> UnknownHostException: Invalid host name, while invoking 
> StorageContainerLocationProtocolPB over scm1 after 15 failover attempt(s). 
> Trying to failover after sleeping for 2000ms.
> Invalid host name: local host is: "om1/172.18.0.4"; destination host is: 
> "scm1":9860; java.net.UnknownHostException: Invalid host name: local host is: 
> "om1/172.18.0.4"; destination host is: "scm1":9860; 
> java.net.UnknownHostException; For more details see: 
> http://wiki.apache.org/hadoop/UnknownHost; For more details see: 
> http://wiki.apache.org/hadoop/UnknownHost {code}
>  
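The proposed behaviour can be illustrated with a minimal, self-contained sketch. It is not the code in SCMFailoverProxyProviderBase and does not use any Hadoop or Ozone API: the class RetryFeedbackSketch, the interface ScmCall and the method callWithFailover are hypothetical names introduced only for illustration. The sketch assumes a simple round-robin failover over the SCM nodes, with the retry count and interval passed in by the caller (the issue quotes 15 and 2s), and prints one line per failed attempt to the given stream, which a CLI would set to System.err.

```java
import java.io.PrintStream;
import java.util.List;

/** Hypothetical sketch: report every failover attempt instead of retrying silently. */
final class RetryFeedbackSketch {

  /** Hypothetical stand-in for one RPC call against a single SCM node. */
  interface ScmCall<T> {
    T invoke(String node) throws Exception;
  }

  /**
   * Tries the call on each node in round-robin order. After every failure
   * except the last one, a retry message is written to err before sleeping.
   * Once maxRetry failovers are used up, the last exception is rethrown.
   */
  static <T> T callWithFailover(List<String> nodes, int maxRetry,
      long retryIntervalMs, String protocol, ScmCall<T> call,
      PrintStream err) throws Exception {
    if (nodes.isEmpty() || maxRetry < 0) {
      throw new IllegalArgumentException("need at least one node and maxRetry >= 0");
    }
    Exception last = null;
    for (int attempt = 0; attempt <= maxRetry; attempt++) {
      String node = nodes.get(attempt % nodes.size());
      try {
        return call.invoke(node);
      } catch (Exception e) {
        last = e;
        if (attempt == maxRetry) {
          break; // no failovers left, give up
        }
        // Immediate user feedback, in the style of the messages shown in the issue.
        err.printf("%s: %s, while invoking %s over %s after %d failover attempt(s). "
                + "Trying to failover after sleeping for %dms.%n",
            e.getClass().getSimpleName(), e.getMessage(), protocol, node,
            attempt + 1, retryIntervalMs);
        Thread.sleep(retryIntervalMs);
      }
    }
    throw last;
  }
}
```

A CLI command could call it as `RetryFeedbackSketch.callWithFailover(List.of("scm1", "scm2", "scm3"), 15, 2000L, "StorageContainerLocationProtocolPB", node -> rpc(node), System.err)`, where `rpc` is a placeholder for the real request. Writing to stderr keeps the retry messages out of the command's regular output on stdout.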



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

