Hi todd,
 I see a lot of

java.net.ConnectException: Connection refused
        at sun.nio.ch.Net.connect(Native Method)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
        at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:324)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:304)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:317)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:290)
        at java.lang.Thread.run(Thread.java:619)
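
The trace above comes from the election sender (QuorumCnxManager.connectOne), so one quick way to narrow this down is to probe the quorum and election ports from each server. What follows is only a minimal sketch, not something taken from this thread: the helper names can_connect and unreachable are made up for illustration, and the hostnames and ports are copied from the server.N entries in the config quoted further down.

```python
import socket

# Hostnames copied from the server.N lines quoted later in this thread.
HOSTS = [
    "dc1-zook01.dc01.revsci.net", "dc1-zook02.dc01.revsci.net",
    "dc1-zook03.dc01.revsci.net", "dc1-zook04.dc01.revsci.net",
    "dc1-zook05.dc01.revsci.net", "pd1-zook01.pd01.revsci.net",
    "pd1-zook02.pd01.revsci.net", "pd4-zook01.iad1.audsci.net",
    "pd4-zook02.iad1.audsci.net",
]
PORTS = [2888, 3888]  # quorum port and election port from the same config


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False


def unreachable(hosts, ports, probe=can_connect):
    """List the (host, port) pairs the probe could not reach from this machine."""
    return [(h, p) for h in hosts for p in ports if not probe(h, p)]


# Hypothetical usage, to be run on each server in turn:
#   for host, port in unreachable(HOSTS, PORTS):
#       print(f"cannot reach {host}:{port}")
```

As far as I know only the current leader listens on the first port (2888), so a refusal there is expected on followers, and on every server while no leader is elected. A refusal on the election port (3888) of a server that is up is the more telling symptom.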

Is it possible that there is some firewall? Can all the servers 1-9 connect
to all the others using ports that you specified in zoo.cfg i.e 2888/3888?

Thanks
mahadev

On 8/4/09 4:56 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:

> Looks like we're not getting *any* leader elected now.... Logs attached.
>
>> -----Original Message-----
>> From: Todd Greenwood [mailto:to...@audiencescience.com]
>> Sent: Tuesday, August 04, 2009 4:07 PM
>> To: zookeeper-dev@hadoop.apache.org
>> Subject: RE: Unending Leader Elections in WAN deploy
>>
>> Patrick, thanks! I'll forward on to IT and I'll report back to you
>> shortly...
>>
>>> -----Original Message-----
>>> From: Patrick Hunt [mailto:ph...@apache.org]
>>> Sent: Tuesday, August 04, 2009 3:55 PM
>>> To: zookeeper-dev@hadoop.apache.org
>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>
>>> Todd, Mahadev and I looked at this and it turns out to be a regression.
>>> Ironically a patch I created for 3.2 branch to add quorum tests actually
>>> broke the quorum config -- a default value for a config parameter was
>>> lost. I'm going to submit a patch asap to get the default back, but for
>>> the time being you can set:
>>>
>>> electionAlg=3
>>>
>>> in each of your config files.
>>>
>>> You should see reference to FastLeaderElection in your log files if this
>>> parameter is set correctly.
>>>
>>> Sorry for the trouble,
>>>
>>> Patrick
>>>
>>> Todd Greenwood wrote:
>>>> Mahadev,
>>>>
>>>> I just heard from IT that this build behaves in exactly the same way as
>>>> previous versions, e.g. we get continuous leader elections that
>>>> disconnect the followers and then get re-elected, and disconnect...etc.
>>>>
>>>> This is from a fresh sync to the 3.2 branch:
>>>>
>>>> svn co http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2 ./branch-3.2
>>>>
>>>> CHANGES.TXT show the various fixes included:
>>>>
>>>> to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/original$ head -n 50 branch-3.2/CHANGES.txt
>>>> Release 3.2.1
>>>>
>>>> Backward compatibile changes:
>>>>
>>>> BUGFIXES:
>>>>   ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris via
>>>>   flavio)
>>>>
>>>>   ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris via
>>>>   mahadev)
>>>>
>>>>   ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via mahadev)
>>>>
>>>>   ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris via
>>>>   mahadev)
>>>>
>>>>   ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
>>>>   (giri via mahadev)
>>>>
>>>>   ZOOKEEPER-467. Change log level in BookieHandle (flavio via mahadev)
>>>>
>>>>   ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent immediate
>>>>   failure. (chris via mahadev)
>>>>
>>>>   ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev via
>>>>   phunt)
>>>>
>>>>   ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and
>>>>   other) embedded clients (ryan rawson via phunt)
>>>>
>>>>   ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio via
>>>>   mahadev)
>>>>
>>>>   ZOOKEEPER-479. QuorumHierarchical does not count groups correctly
>>>>   (flavio via mahadev)
>>>>
>>>>   ZOOKEEPER-466. crash on zookeeper_close() when using auth with empty
>>>>   cert (Chris Darroch via phunt)
>>>>
>>>>   ZOOKEEPER-480. FLE should perform leader check when node is not
>>>>   leading and add vote of follower (flavio via mahadev)
>>>>
>>>>   ZOOKEEPER-491. Prevent zero-weight servers from being elected (flavio
>>>>   via mahadev)
>>>>
>>>> What can I do to assist you with this issue?
>>>>
>>>> -Todd
>>>>
>>>>> -----Original Message-----
>>>>> From: Mahadev Konar [mailto:maha...@yahoo-inc.com]
>>>>> Sent: Tuesday, August 04, 2009 12:43 PM
>>>>> To: zookeeper-dev@hadoop.apache.org
>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>
>>>>> Hi todd,
>>>>>  comments in line
>>>>>
>>>>> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
>>>>>> Mahadev,
>>>>>>
>>>>>> Some quick questions:
>>>>>>
>>>>>> 1. Version
>>>>>>
>>>>>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml is still
>>>>>> calling this 3.2.0. Should this be rev'd, and am I correct in calling
>>>>>> this release 3.2.1?
>>>>> Yes the release is 3.2.1. The build.xml will be fixed as soon as we tag
>>>>> the release.
>>>>>
>>>>>> 2. Build targets
>>>>>>
>>>>>> The package target fails b/c the create-cppunit-configure target fails
>>>>>> due to various problems w/ respect to autoconf. Are these dependencies
>>>>>> documented somewhere ? I'd like to have a fully building system.
>>>>>>
>>>>>> create-cppunit-configure:
>>>>>>      [exec] Can't exec "libtoolize": No such file or directory at /usr/bin/autoreconf line 188.
>>>>>>      [exec] Use of uninitialized value $libtoolize in pattern match (m//) at /usr/bin/autoreconf line 188.
>>>>>>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not found in library
>>>>>>      [exec] configure.ac:33: error: possibly undefined macro: AM_PATH_CPPUNIT
>>>>>>      [exec]       If this token and others are legitimate, please use m4_pattern_allow.
>>>>>>      [exec]       See the Autoconf documentation.
>>>>>>      [exec] configure.ac:53: error: possibly undefined macro: AC_PROG_LIBTOOL
>>>>>>      [exec] autoreconf: /usr/bin/autoconf failed with exit status: 1
>>>>>>
>>>>> You need auto tools to run this. Please read the README for building c
>>>>> client library at src/c/ for the installation requirements.
>>>>>> 3. Sync failure:
>>>>>>
>>>>>> This is still failing.
>>>>>>
>>>>>> svn: URL 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch' doesn't exist
>>>>>>
>>>>> Yes this hasn't been fixed yet!
>>>>>
>>>>> Thanks
>>>>> mahadev
>>>>>> -Todd
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Todd Greenwood
>>>>>>> Sent: Tuesday, August 04, 2009 11:26 AM
>>>>>>> To: 'zookeeper-u...@hadoop.apache.org'
>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>
>>>>>>> Great news. Thank you Mahadev. I'll report our findings later today.
>>>>>>> -Todd
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Mahadev Konar [mailto:maha...@yahoo-inc.com]
>>>>>>>> Sent: Tuesday, August 04, 2009 11:20 AM
>>>>>>>> To: zookeeper-u...@hadoop.apache.org
>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>
>>>>>>>> Hi Todd,
>>>>>>>>  I just committed 480 and 491. You can checkout the 3.2 branch now.
>>>>>>>> Thanks
>>>>>>>> mahadev
>>>>>>>>
>>>>>>>> On 8/3/09 4:29 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
>>>>>>>>> That'd be perfect. Thanks!
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Mahadev Konar [mailto:maha...@yahoo-inc.com]
>>>>>>>>>> Sent: Monday, August 03, 2009 4:24 PM
>>>>>>>>>> To: zookeeper-u...@hadoop.apache.org
>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>
>>>>>>>>>> Hi Todd,
>>>>>>>>>>  Most of the patches that you mention should be in the branch 3.2 by
>>>>>>>>>> tomm or so. 481, 479 are already in. 480 and 491 should be in by tomm.
>>>>>>>>>> Would that suffice for you?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> mahadev
>>>>>>>>>>
>>>>>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
>>>>>>>>>>> Another problem...I've reverted to the latest versions of the patches
>>>>>>>>>>> that are not specific to branch-3.2, and I'm getting two compilation
>>>>>>>>>>> errors:
>>>>>>>>>>>
>>>>>>>>>>> build-generated:
>>>>>>>>>>>     [javac] Compiling 44 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
>>>>>>>>>>>
>>>>>>>>>>> compile-main:
>>>>>>>>>>>     [javac] Compiling 2 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
>>>>>>>>>>>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:30: name clash: getQuorumPeers() and getQuorumPeers() have the same erasure
>>>>>>>>>>>     [javac]     public String[] getQuorumPeers();
>>>>>>>>>>>     [javac]     ^
>>>>>>>>>>>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:31: name clash: getServerState() and getServerState() have the same erasure
>>>>>>>>>>>     [javac]     public String getServerState();
>>>>>>>>>>>     [javac]     ^
>>>>>>>>>>>     [javac] 2 errors
>>>>>>>>>>>
>>>>>>>>>>> My build process is pretty simple:
>>>>>>>>>>>
>>>>>>>>>>> 1. copy the branch-3.2 source to a temp directory
>>>>>>>>>>>    (src/patched/branch-3.2)
>>>>>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
>>>>>>>>>>> 3. build zookeeper in the temp directory
>>>>>>>>>>>
>>>>>>>>>>> -Todd
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Todd Greenwood [mailto:to...@audiencescience.com]
>>>>>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
>>>>>>>>>>>> To: zookeeper-u...@hadoop.apache.org
>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>
>>>>>>>>>>>> Flavio,
>>>>>>>>>>>>  I notice that you've updated the patches referenced for the WAN
>>>>>>>>>>>> deployment. There appears to be an order dependency w/ respect to
>>>>>>>>>>>> these four patches...
>>>>>>>>>>>>
>>>>>>>>>>>> ZOOKEEPER-473.patch ZOOKEEPER-479-branch3.2.patch
>>>>>>>>>>>> ZOOKEEPER-481-branch3.2.patch ZOOKEEPER-491.patch
>>>>>>>>>>>>
>>>>>>>>>>>> 473 -> 479 (479 fails)
>>>>>>>>>>>>
>>>>>>>>>>>> to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
>>>>>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarchical.java
>>>>>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
>>>>>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier.java
>>>>>>>>>>>> patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
>>>>>>>>>>>> Hunk #1 FAILED at 93.
>>>>>>>>>>>> Hunk #2 FAILED at 145.
>>>>>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
>>>>>>>>>>>> to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ h ../patches/
>>>>>>>>>>>>
>>>>>>>>>>>> Could you advise as to which patches I need to apply, and in what
>>>>>>>>>>>> order?
>>>>>>>>>>>> -Todd
>>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
>>>>>>>>>>>> To: zookeeper-u...@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>
>>>>>>>>>>>> Perfect! Thanks for the update, Todd.
>>>>>>>>>>>>
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>
>>>>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks. You were right, I had a stale version of 479. Compilation
>>>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the latest patches
>>>>>>>>>>>> 473, 479, 481, and 491.
>>>>>>>>>>>>
>>>>>>>>>>>> -Todd
>>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
>>>>>>>>>>>> To: zookeeper-u...@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>
>>>>>>>>>>>> It should be in 479. Perhaps you have a stale version of the patch.
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>
>>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Flavio,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm getting a compilation error for patch 491:
>>>>>>>>>>>>
>>>>>>>>>>>> compile-main:
>>>>>>>>>>>>     [javac] Compiling 1 source file to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
>>>>>>>>>>>>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java:601: cannot find symbol
>>>>>>>>>>>>     [javac] symbol  : method getWeight(long)
>>>>>>>>>>>>     [javac] location: interface org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>>>>>>>>>>>>     [javac]     if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>     [javac]     ^
>>>>>>>>>>>>     [javac] 1 error
>>>>>>>>>>>>
>>>>>>>>>>>> I see a reference to getWeight in both FastLeaderElection.java in
>>>>>>>>>>>> patch 491:
>>>>>>>>>>>>
>>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+ if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java: if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>
>>>>>>>>>>>> However, I don't see a reference to this method in patches 473, 479,
>>>>>>>>>>>> or 481. I also don't see a reference to this method in the trunk...
>>>>>>>>>>>> -Todd
>>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Todd Greenwood [mailto:to...@audiencescience.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
>>>>>>>>>>>> To: zookeeper-u...@hadoop.apache.org
>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>
>>>>>>>>>>>> Ok, I'll apply that patch and report back.
>>>>>>>>>>>> -Todd
>>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
>>>>>>>>>>>> To: zookeeper-u...@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>
>>>>>>>>>>>> You're missing 491 from your set of patches.
>>>>>>>>>>>>
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>
>>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
>>>>>>>>>>>> 481).
>>>>>>>>>>>>
>>>>>>>>>>>> Basically, it seems like the nodes are electing pd4-zook02 to be the
>>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not supposed to be
>>>>>>>>>>>> and then disconnects everyone. Then they re-elect it again, and it
>>>>>>>>>>>> loops over and over.
>>>>>>>>>>>>
>>>>>>>>>>>> -------------
>>>>>>>>>>>> Server config
>>>>>>>>>>>> -------------
>>>>>>>>>>>>
>>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>>>>>>>>>
>>>>>>>>>>>> group.1:1:2:3:4:5
>>>>>>>>>>>> weight.1=1
>>>>>>>>>>>> weight.2=1
>>>>>>>>>>>> weight.3=1
>>>>>>>>>>>> weight.4=1
>>>>>>>>>>>> weight.5=1
>>>>>>>>>>>>
>>>>>>>>>>>> group.2:6:7:8:9
>>>>>>>>>>>> weight.6=0
>>>>>>>>>>>> weight.7=0
>>>>>>>>>>>> weight.8=0
>>>>>>>>>>>> weight.9=0
>>>>>>>>>>>>
>>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3 different
>>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only machines in dc1
>>>>>>>>>>>> have voting rights, and the ability to become a leader. The machines
>>>>>>>>>>>> in the pods all have a weight of zero, and are not expected to become
>>>>>>>>>>>> leaders, or to vote on transactions.
>>>>>>>>>>>>
>>>>>>>>>>>> Let me know what I can do to help resolve this issue.
>>>>>>>>>>>>
>>>>>>>>>>>> -Todd
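
To make the intent of the config above concrete, here is a minimal sketch that reads the server.N and weight.N lines and lists the servers that would be expected to vote and lead. It only illustrates the intent described in the message. It is not ZooKeeper's own parsing or election logic, the function name nonzero_weight_servers is made up, and treating a missing weight as 1 is an assumption about the default.

```python
def nonzero_weight_servers(cfg_text):
    """Map server id -> hostname for servers whose configured weight is not zero.

    Illustrative only: a server with no weight.N line is assumed to have
    weight 1, and lines without '=' (such as the group lines as quoted in
    this thread) are skipped.
    """
    servers, weights = {}, {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        kind, _, sid = key.partition(".")
        if kind == "server":
            # server.N=host:quorumPort:electionPort
            servers[int(sid)] = value.split(":")[0]
        elif kind == "weight":
            weights[int(sid)] = int(value)
    return {sid: host for sid, host in servers.items()
            if weights.get(sid, 1) != 0}
```

Fed the nine server lines and nine weight lines quoted above, this should return only servers 1 through 5 (the dc1 machines). That matches the stated expectation that pd4-zook02, which is server 9 with weight 0, should never end up as leader.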