I think you are hitting this - https://issues.apache.org/jira/browse/SLIDER-1169
On 9/29/16, 10:21 PM, "Manoj Samel" <manojsamelt...@gmail.com> wrote:

>Hi
>
>Slider version .80 on secure cluster.
>
>In my xxx-site.xml files, the
>  <property>
>    <name>hadoop.registry.zk.quorum</name>
>    <value>zk1_host:2181,zk2_host:2181,zk3_host:2181</value>
>  </property>
>
>However, it appears slider AM uses only the first ZK to connect for
>registry - and fails when the first ZK happens to be down.
>
>In the slider AM log
>
>2016-09-30 02:27:27,279 [main] INFO appmaster.SliderAppMaster - Loading slider-server.xml at file:/foo/yarn/local/usercache/xx/appcache/application_1474675565244_3660/container_e80_1474675565244_3660_01_000001/confdir/slider-server.xml
>2016-09-30 02:27:27,285 [main] INFO appmaster.SliderAppMaster - AM configuration:
>dfs.namenode.kerberos.principal=hdfs/_HOST@ABC
>hadoop.registry.zk.quorum=zk1_host:2181
>hadoop.registry.zk.root=/registry
>
>Note -- the log shows only the first host, not the quorum string of 3
>host:ports
>
>Later in the log, it tries to connect to ZK1, but since ZK1 is down, the
>connection fails. The AM fails to start any components as a result.
>
>
>2016-09-29 23:32:49,768 [main] INFO appmaster.SliderAppMaster - Service YarnRegistry in state YarnRegistry: STARTED Connection="fixed ZK quorum "zk1_host:2181" " root="/registry" security disabled
>2016-09-29 23:32:49,774 [main-SendThread(bds0211.svc.eng.pdx.wd:2181)] WARN zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
>java.net.ConnectException: Connection refused
>  at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>  at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>  at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>  at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
>
>I would expect that if connection to ZK1 failed, then ZK2, 3 ... etc would
>be tried .. that's what the ZK quorum is for.
>
>Looking into the code, I see this last "Connection" string is coming
>from org.apache.hadoop.registry.client.impl.zk.CuratorService.java
>
>In it, supplyBindingInformation() gets the string and puts it in the
>description that is printed in the log message.
>
>  public BindingInformation supplyBindingInformation() {
>    BindingInformation binding = new BindingInformation();
>    String connectString = buildConnectionString();
>    binding.ensembleProvider = new FixedEnsembleProvider(connectString);
>    binding.description =
>        "fixed ZK quorum \"" + connectString + "\"";
>    return binding;
>  }
>
>  protected String buildConnectionString() {
>    return getConfig().getTrimmed(KEY_REGISTRY_ZK_QUORUM,
>        DEFAULT_REGISTRY_ZK_QUORUM);
>  }
>
>The getConfig() is from org.apache.hadoop.conf.Configuration.java
>
>It's not clear why the value of hadoop.registry.zk.quorum supplied in
>config gets trimmed to the first host only. Is this the expected
>behavior? Or a bug?
>
>It isn't possible to guarantee that the first zookeeper in the quorum will
>always be reachable .. I would expect multiple nodes in the quorum to be
>tried for connection
>
>
>Any thoughts would be appreciated ...
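
On the getTrimmed() question above: as far as I know, Configuration.getTrimmed() only strips leading and trailing whitespace from the property value, so by itself it should not cut a comma-separated quorum down to one host. The AM log already prints hadoop.registry.zk.quorum=zk1_host:2181 under "AM configuration", which suggests the value was short before buildConnectionString() ever read it. Below is a minimal stand-alone sketch of that distinction. It does not use Hadoop or Curator; the class QuorumCheck, its helper names, and the "localhost:2181" default are hypothetical placeholders for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone helper; not part of Hadoop, Slider, or Curator.
public class QuorumCheck {

    // Mimics whitespace-only trimming with a fallback default
    // (an assumption about how a "get trimmed" style lookup behaves).
    static String trimmedOrDefault(String raw, String defaultValue) {
        return raw == null ? defaultValue : raw.trim();
    }

    // Splits a comma-separated host:port list into its non-empty entries.
    static List<String> splitQuorum(String quorum) {
        List<String> hosts = new ArrayList<>();
        for (String entry : quorum.split(",")) {
            String host = entry.trim();
            if (!host.isEmpty()) {
                hosts.add(host);
            }
        }
        return hosts;
    }

    public static void main(String[] args) {
        // Value as it might appear in a site file, with stray whitespace.
        String configured = "  zk1_host:2181,zk2_host:2181,zk3_host:2181\n";
        String quorum = trimmedOrDefault(configured, "localhost:2181");
        List<String> hosts = splitQuorum(quorum);

        System.out.println("quorum=" + quorum);
        System.out.println("entries=" + hosts.size());
        if (hosts.size() < 2) {
            // A single entry means there is nothing to fail over to.
            System.out.println("WARNING: only one ZK host configured");
        }
    }
}
```

With the three-host value from the original mail, whitespace trimming leaves all three entries in place. Running the same check against the value the AM actually logs (zk1_host:2181) would print the warning, which is one way to confirm where in the chain the string gets shortened.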