I think you are hitting this issue:
https://issues.apache.org/jira/browse/SLIDER-1169
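
One note on the getTrimmed() call quoted below: Configuration.getTrimmed() only strips leading and trailing whitespace, so it would not reduce a three-host quorum to one host. The "AM configuration" log line already shows a single host, which suggests the value is shortened before CuratorService reads it. A minimal sketch in plain Java (no Hadoop dependency; the class and method names here are hypothetical and only mimic whitespace trimming) illustrates the point:

```java
import java.util.ArrayList;
import java.util.List;

public class QuorumCheck {
    /** Mimics whitespace-only trimming: strips leading/trailing whitespace, nothing else. */
    public static String trimmed(String raw) {
        return raw == null ? null : raw.trim();
    }

    /** Splits a connect string of the form host:port,host:port into its entries. */
    public static List<String> entries(String connectString) {
        List<String> out = new ArrayList<>();
        for (String part : connectString.split(",")) {
            String p = part.trim();
            if (!p.isEmpty()) {
                out.add(p);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Same value as in the quoted xxx-site.xml, padded with whitespace.
        String configured = "  zk1_host:2181,zk2_host:2181,zk3_host:2181  ";
        String connect = trimmed(configured);
        // Trimming removes the padding only; every host:port entry is kept.
        System.out.println(entries(connect).size() + " entries: " + connect);
    }
}
```

If a check like this against the value the AM actually receives reports one entry, the quorum was already cut down upstream of the registry client.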


On 9/29/16, 10:21 PM, "Manoj Samel" <manojsamelt...@gmail.com> wrote:

>Hi
>
>Slider version 0.80 on a secure cluster.
>
>In my xxx-site.xml files, the registry quorum is set to three hosts:
>    <property>
>      <name>hadoop.registry.zk.quorum</name>
>      <value>zk1_host:2181,zk2_host:2181,zk3_host:2181</value>
>   </property>
>
>However, it appears the Slider AM uses only the first ZK host to connect to
>the registry, and fails when that first ZK happens to be down.
>
>In the slider AM log
>
>2016-09-30 02:27:27,279 [main] INFO  appmaster.SliderAppMaster - Loading
>slider-server.xml at
>file:/foo/yarn/local/usercache/xx/appcache/application_1474675565244_3660/
>container_e80_1474675565244_3660_01_000001/confdir/slider-server.xml
>2016-09-30 02:27:27,285 [main] INFO  appmaster.SliderAppMaster - AM
>configuration:
>dfs.namenode.kerberos.principal=hdfs/_HOST@ABC
>hadoop.registry.zk.quorum=zk1_host:2181
>hadoop.registry.zk.root=/registry
>
>Note -- the log shows only the first host, not the quorum string of 3
>host:ports
>
>Later in the log, it tries to connect to ZK1, but since ZK1 is down, the
>connection fails. The AM fails to start any components as a result.
>
>
>2016-09-29 23:32:49,768 [main] INFO  appmaster.SliderAppMaster - Service
>YarnRegistry in state YarnRegistry: STARTED  Connection="fixed ZK quorum
>"zk1_host:2181" " root="/registry" security disabled
>2016-09-29 23:32:49,774 [main-SendThread(bds0211.svc.eng.pdx.wd:2181)]
>WARN
> zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error,
>closing socket connection and attempting reconnect
>java.net.ConnectException: Connection refused
>        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
>
>I would expect that if the connection to ZK1 failed, then ZK2, ZK3, etc.
>would be tried. That's what the ZK quorum is for.
>
>Looking into the code, I see this last "Connection" string is coming
>from org.apache.hadoop.registry.client.impl.zk.CuratorService.java
>
>In it, supplyBindingInformation() gets and prints the string in log
>message.
>
>public BindingInformation supplyBindingInformation() {
>    BindingInformation binding = new BindingInformation();
>    String connectString = buildConnectionString();
>    binding.ensembleProvider = new FixedEnsembleProvider(connectString);
>    binding.description =
>        "fixed ZK quorum \"" + connectString + "\"";
>    return binding;
>}
>
>protected String buildConnectionString() {
>    return getConfig().getTrimmed(KEY_REGISTRY_ZK_QUORUM,
>        DEFAULT_REGISTRY_ZK_QUORUM);
>}
>
>The getConfig() is from org.apache.hadoop.conf.Configuration.java
>
>It's not clear why the value of hadoop.registry.zk.quorum supplied in the
>config gets trimmed to the first host only. Is this the expected behavior,
>or a bug?
>
>There is no way to guarantee that the first ZooKeeper in the quorum will
>always be reachable, so I would expect multiple nodes in the quorum to be
>tried for the connection.
>
>
>Any thoughts would be appreciated.
