[
https://issues.apache.org/jira/browse/HDDS-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760911#comment-16760911
]
Elek, Marton commented on HDDS-776:
-----------------------------------
+1
With this approach we have multiple layers of retry, since the Hadoop RPC
client itself can also retry the connection. But Hadoop RPC can't do any DNS
re-resolution, so it's safer to add this external retry logic.
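To illustrate the idea, here is a minimal sketch of an external retry wrapper that re-resolves the host name on every attempt. It is not the code from HDDS-776.001.patch: the names RetrySketch, retry, and resolve are hypothetical, and the attempt count and delay are arbitrary parameters. How soon a new DNS record becomes visible also depends on the JVM's DNS cache settings.

```java
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

// Hypothetical helper, for illustration only (not the actual patch).
class RetrySketch {

    /**
     * Calls the task until it succeeds or maxAttempts is reached,
     * sleeping delayMillis after each failed attempt except the last.
     * Rethrows the last failure if every attempt fails.
     */
    static <T> T retry(String name, Callable<T> task,
                       int maxAttempts, long delayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    System.err.printf(
                        "Execution of task %s failed, will be retried in %d ms%n",
                        name, delayMillis);
                    Thread.sleep(delayMillis);
                }
            }
        }
        throw last;
    }

    /**
     * Builds a fresh InetSocketAddress, which attempts DNS resolution at
     * construction time, and fails if the name could not be resolved.
     */
    static InetSocketAddress resolve(String host, int port)
            throws UnknownHostException {
        InetSocketAddress addr = new InetSocketAddress(host, port);
        if (addr.isUnresolved()) {
            throw new UnknownHostException(host);
        }
        return addr;
    }

    public static void main(String[] args) throws Exception {
        // Simulated task: fails twice with a DNS error, then succeeds.
        int[] calls = {0};
        String result = retry("demo", () -> {
            if (++calls[0] < 3) {
                throw new UnknownHostException("scm");
            }
            return "connected";
        }, 5, 100L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In a real caller the task would do the lookup itself, for example `retry("OM#getScmInfo", () -> resolve("scm", 9863), 10, 5000L)`, so that each attempt sees the current DNS state instead of reusing an address that failed to resolve earlier.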
I tested it with a clean build:
* I removed the WAIT_FOR: scm line from the docker-compose file
* I started only the ozoneManager (docker-compose up -d ozoneManager)
* After a few seconds I started the scm
Without the patch the initialization fails (as the scm DNS name can't be
resolved at the time of the initialization). With the patch it works well:
{code}
ozoneManager_1 | ************************************************************/
ozoneManager_1 | 2019-02-05 15:02:42 INFO OzoneManager:51 - registered UNIX signal handlers for [TERM, HUP, INT]
ozoneManager_1 | 2019-02-05 15:02:42 WARN OmUtils:143 - ozone.om.db.dirs is not configured. We recommend adding this setting. Falling back to ozone.metadata.dirs instead.
ozoneManager_1 | 2019-02-05 15:02:42 INFO RetriableTask:62 - Execution of task OM#getScmInfo failed, will be retried in 5000 ms
ozoneManager_1 | 2019-02-05 15:02:47 INFO RetriableTask:62 - Execution of task OM#getScmInfo failed, will be retried in 5000 ms
ozoneManager_1 | 2019-02-05 15:02:52 INFO RetriableTask:62 - Execution of task OM#getScmInfo failed, will be retried in 5000 ms
ozoneManager_1 | 2019-02-05 15:02:57 INFO RetriableTask:62 - Execution of task OM#getScmInfo failed, will be retried in 5000 ms
ozoneManager_1 | OM initialization succeeded.Current cluster id for sd=/data/metadata/om;cid=CID-6123b8bf-de74-4020-9568-fa715c52c71c
ozoneManager_1 | 2019-02-05 15:03:02 INFO OzoneManager:51 - SHUTDOWN_MSG:
ozoneManager_1 | /************************************************************
ozoneManager_1 | SHUTDOWN_MSG: Shutting down OzoneManager at 82427eee2324/172.21.0.2
{code}
Will commit to trunk soon.
> Make OM initialization resilient to dns failures
> ------------------------------------------------
>
> Key: HDDS-776
> URL: https://issues.apache.org/jira/browse/HDDS-776
> Project: Hadoop Distributed Data Store
> Issue Type: Improvement
> Components: OM
> Reporter: Elek, Marton
> Assignee: Doroszlai, Attila
> Priority: Critical
> Attachments: HDDS-776.001.patch
>
>
> Ozone Manager can be initialized with the 'ozone om --init' command, which
> connects to a running scm.
> If scm is unavailable because of a DNS issue, the initialization fails
> without any retry:
> {code}
> 2018-10-31 15:36:26 ERROR OzoneManager:376 - Could not initialize OM version file
> java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "releastest2-ozone-scm-0.releastest2-ozone-scm":9863; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:768)
> at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:449)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1552)
> at org.apache.hadoop.ipc.Client.call(Client.java:1403)
> at org.apache.hadoop.ipc.Client.call(Client.java:1367)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy9.getScmInfo(Unknown Source)
> at org.apache.hadoop.hdds.scm.protocolPB.ScmBlockLocationProtocolClientSideTranslatorPB.getScmInfo(ScmBlockLocationProtocolClientSideTranslatorPB.java:154)
> at org.apache.hadoop.ozone.om.OzoneManager.omInit(OzoneManager.java:358)
> at org.apache.hadoop.ozone.om.OzoneManager.createOm(OzoneManager.java:326)
> at org.apache.hadoop.ozone.om.OzoneManager.main(OzoneManager.java:265)
> Caused by: java.net.UnknownHostException
> at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:450)
> ... 10 more
> {code}
> This is a problem for all containerized environments. In Kubernetes, om
> sometimes can't be started. For docker-compose environments we have a
> 15-second sleep to avoid this issue.
> It would be great to retry in case of a DNS problem.