[ https://issues.apache.org/jira/browse/HDDS-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753940#comment-16753940 ]
Doroszlai, Attila commented on HDDS-776:
----------------------------------------
Hi [~elek],
What should the retry policy be? Should it retry indefinitely, or give up
after some time or a number of attempts? If it gives up, should the limit be
configurable? If so, should we reuse existing retry config properties, or
introduce new ones for the OM->SCM connection?
Thanks.
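The options raised in the question can be expressed as a small policy object. The sketch below is hypothetical: `ScmConnectRetryPolicy` is not part of Ozone, and it only covers the decision of whether to try again. It supports retrying forever, giving up after a number of attempts, and giving up after a total wait, with a fixed sleep between attempts. Where the limits come from (existing retry properties or new OM->SCM ones) is deliberately left open.

```java
import java.time.Duration;

/**
 * Hypothetical retry policy for the OM -> SCM connection made by 'ozone om --init'.
 * Not part of Ozone; it only illustrates the choices raised in the comment.
 */
final class ScmConnectRetryPolicy {
  private final int maxAttempts;        // total attempts allowed; <= 0 means no attempt limit
  private final Duration maxTotalWait;  // null means no time limit
  private final Duration sleep;         // fixed pause between attempts

  ScmConnectRetryPolicy(int maxAttempts, Duration maxTotalWait, Duration sleep) {
    if (sleep == null || sleep.isNegative()) {
      throw new IllegalArgumentException("sleep must be non-negative");
    }
    this.maxAttempts = maxAttempts;
    this.maxTotalWait = maxTotalWait;
    this.sleep = sleep;
  }

  /** Retry indefinitely, pausing {@code sleep} between attempts. */
  static ScmConnectRetryPolicy forever(Duration sleep) {
    return new ScmConnectRetryPolicy(0, null, sleep);
  }

  /**
   * Decides whether to try again after {@code failedAttempts} consecutive failures,
   * {@code elapsed} time after the first attempt started.
   */
  boolean shouldRetry(int failedAttempts, Duration elapsed) {
    if (maxAttempts > 0 && failedAttempts >= maxAttempts) {
      return false; // attempt budget used up
    }
    if (maxTotalWait != null && elapsed.plus(sleep).compareTo(maxTotalWait) > 0) {
      return false; // the next sleep would exceed the total wait budget
    }
    return true;
  }

  /** How long to pause before the next attempt. */
  Duration sleep() {
    return sleep;
  }
}
```

For example, `new ScmConnectRetryPolicy(10, Duration.ofMinutes(2), Duration.ofSeconds(5))` would stop after 10 attempts or roughly two minutes, whichever comes first, while `ScmConnectRetryPolicy.forever(...)` never gives up.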
> Make OM initialization resilient to dns failures
> ------------------------------------------------
>
> Key: HDDS-776
> URL: https://issues.apache.org/jira/browse/HDDS-776
> Project: Hadoop Distributed Data Store
> Issue Type: Improvement
> Components: OM
> Reporter: Elek, Marton
> Assignee: Doroszlai, Attila
> Priority: Critical
>
> Ozone Manager can be initialized by the 'ozone om --init' command, which
> connects to a running SCM.
> If SCM is unavailable because of a DNS issue, the initialization fails
> without any retry:
> {code}
> 2018-10-31 15:36:26 ERROR OzoneManager:376 - Could not initialize OM version file
> java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "releastest2-ozone-scm-0.releastest2-ozone-scm":9863; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:768)
> at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:449)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1552)
> at org.apache.hadoop.ipc.Client.call(Client.java:1403)
> at org.apache.hadoop.ipc.Client.call(Client.java:1367)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy9.getScmInfo(Unknown Source)
> at org.apache.hadoop.hdds.scm.protocolPB.ScmBlockLocationProtocolClientSideTranslatorPB.getScmInfo(ScmBlockLocationProtocolClientSideTranslatorPB.java:154)
> at org.apache.hadoop.ozone.om.OzoneManager.omInit(OzoneManager.java:358)
> at org.apache.hadoop.ozone.om.OzoneManager.createOm(OzoneManager.java:326)
> at org.apache.hadoop.ozone.om.OzoneManager.main(OzoneManager.java:265)
> Caused by: java.net.UnknownHostException
> at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:450)
> ... 10 more
> {code}
> This is a problem in all containerized environments. In Kubernetes, OM
> sometimes fails to start. In docker-compose environments we use a 15-second
> sleep to avoid this issue.
> It would be great to retry in case of a DNS problem.
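A minimal Java sketch of retrying only on DNS failures, assuming the failing call can be wrapped in a `Callable` (for example the `getScmInfo()` call seen in the stack trace). The class `DnsRetry`, its methods, and any parameter values are hypothetical, not Ozone code. In the trace the thrown exception is itself an `UnknownHostException` (with an `UnknownHostException` cause); the helper walks the whole cause chain in case the error arrives wrapped in another exception.

```java
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

/** Hypothetical helper: retries an action only while it fails with a DNS error. */
final class DnsRetry {
  private DnsRetry() {}

  /** Returns true if t or any of its causes is an UnknownHostException. */
  static boolean isDnsFailure(Throwable t) {
    for (Throwable c = t; c != null; c = c.getCause()) {
      if (c instanceof UnknownHostException) {
        return true;
      }
    }
    return false;
  }

  /**
   * Runs action, retrying DNS failures up to maxAttempts total attempts with a fixed
   * sleep in between. Non-DNS exceptions, and the last DNS failure, are rethrown as-is.
   */
  static <T> T callWithDnsRetry(Callable<T> action, int maxAttempts, long sleepMillis)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    for (int attempt = 1; ; attempt++) {
      try {
        return action.call();
      } catch (Exception e) {
        if (!isDnsFailure(e) || attempt >= maxAttempts) {
          throw e; // not a DNS problem, or out of attempts: fail as before
        }
        Thread.sleep(sleepMillis); // wait for DNS (e.g. a new pod's record) to appear
      }
    }
  }
}
```

Used as, say, `DnsRetry.callWithDnsRetry(() -> scmClient.getScmInfo(), 20, 3_000)` (with `scmClient` standing in for whatever SCM client OM holds), this waits only as long as the SCM name fails to resolve, up to the configured limit, instead of a fixed 15-second sleep, and other errors still fail immediately. Note that the JVM may cache failed lookups for a short time, so the sleep should not be shorter than that negative-cache TTL.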
--
This message was sent by Atlassian JIRA (v7.6.3#76005)