That's great news, glad it worked out. Thanks for the update, Ted.

Patrick

On Tue, Mar 20, 2012 at 3:33 PM, Ted Yu wrote:
> We're using the trick Patrick proposed, see:
> https://issues.apache.org/jira/browse/HBASE-5603
>
> FYI
>
> On Tue, Mar 20, 2012 at 10:14 AM, Patrick Hunt wrote:
>
>> Great. Thanks Ted.
>>
>> On Tue, Mar 20, 2012 at 10:09 AM, Ted Yu wrote:
>> > Patrick:
>> > I logged https://issues.apache.org/jira/browse/ZOOKEEPER-1428
>> >
>> > If you feel there is anything missing in the JIRA, feel free to add it.
>> >
>> > Thanks for your help on this issue.
>> >
>> > Cheers
>> >
>> > On Tue, Mar 20, 2012 at 9:42 AM, Patrick Hunt wrote:
>> >
>> >> On Tue, Mar 20, 2012 at 9:32 AM, Ted Yu wrote:
>> >> > Near term, if we can find a way for a shell script to detect the absence of a particular ZooKeeper node, rolling-restart.sh can be restored. Otherwise we may need to remove it.
>> >>
>> >> I just tested this out with 3.4, and I see the following for statting a non-existent znode:
>> >>
>> >> [zk: (CONNECTED) 1] stat /foobar
>> >> Node does not exist: /foobar
>> >>
>> >> vs. statting one that does exist:
>> >>
>> >> [zk: (CONNECTED) 2] stat /
>> >> cZxid = 0x0
>> >> ctime = Wed Dec 31 16:00:00 PST 1969
>> >> mZxid = 0x0
>> >> mtime = Wed Dec 31 16:00:00 PST 1969
>> >> pZxid = 0x0
>> >> cversion = -1
>> >> dataVersion = 0
>> >> aclVersion = 0
>> >> ephemeralOwner = 0x0
>> >> dataLength = 0
>> >> numChildren = 1
>> >>
>> >> You can look for "^Node does not exist" in the stat output instead of checking the exit code. This would get around the problem until a more permanent solution could be found.
>> >>
>> >> I hear you re: being time bound (I'd love to work on this myself). In that case, would you mind creating a JIRA based on my suggestion of having a new command line tool, giving your HBase case as an example along with any requirements you might think of.
>> >> Perhaps Hartmut or one of the other contributors might be interested in working on this.
>> >> https://issues.apache.org/jira/browse/ZOOKEEPER
>> >>
>> >> Patrick
>> >>
>> >> >
>> >> > On Tue, Mar 20, 2012 at 9:16 AM, Patrick Hunt wrote:
>> >> >
>> >> >> On Tue, Mar 20, 2012 at 6:57 AM, Ted Yu wrote:
>> >> >> > I looked at the patch for ZOOKEEPER-1059, which should have converted the NPE to KeeperException.NoNodeException.
>> >> >> >
>> >> >> > Why would the 'zkcli stat' command return 0 when the HBase master znode expires?
>> >> >> >
>> >> >> > Advice is appreciated.
>> >> >>
>> >> >> Hi Ted, sorry to see you're having trouble. I think I see the disconnect. ZooKeeperMain is first and foremost a user shell. As such it should not exit unless the quit command is run (or it is killed explicitly, etc.). In this case ZOOKEEPER-1059 is fixing a bug in the shell. It does indeed convert the NPE into a NoNodeException, which the shell then converts into an error message to the user, and continues. Prior to this patch the shell was failing on the NPE, which then generated the non-zero exit from the process.
>> >> >>
>> >> >> Note that trunk has some further improvements along these lines that you might also run into at some point in the future (3.5+):
>> >> >>
>> >> >> https://issues.apache.org/jira/browse/ZOOKEEPER-271
>> >> >> https://issues.apache.org/jira/browse/ZOOKEEPER-1391
>> >> >> https://issues.apache.org/jira/browse/ZOOKEEPER-1307
>> >> >>
>> >> >> I think what we need is a tool that's intended for use both programmatically and by humans, with stricter requirements on input, output formatting, command handling, etc. Please see the work Hartmut has been doing as part of 271 on trunk (3.5.0). Perhaps we can augment these new classes to also support such a tool.
>> >> >> However, it should instead be a true command line tool rather than a shell. Would you be available to work on this?
>> >> >>
>> >> >> Patrick
>> >> >>
>> >> >> ps. Bigtop is now helping to verify cross-project compatibility; it would be great if you could introduce some HBase tests that would flag these breakages in the future. When Bigtop does its integration (i.e. runs the HBase tests using the corresponding version of ZK), it would find these problems, and we'd catch them much earlier. Thanks!
>> >> >>
>> >> >>
>> >> >> > FYI Jon filed a JIRA for the issue below, which is a blocker for HBase trunk.
>> >> >> >
>> >> >> > On Tue, Mar 20, 2012 at 12:36 AM, Jonathan Hsieh wrote:
>> >> >> >
>> >> >> >> I'm trying to test HBASE-5589 (to see if I can add an API call to HMasterInterface and do a rolling restart / upgrade on a live cluster), which led me down another rabbit hole.
>> >> >> >>
>> >> >> >> I'm wondering how the rolling-restart.sh script worked in the past (I can spend more time setting up an older version to test this, but figured I'd ask).
>> >> >> >>
>> >> >> >> I'm getting stuck when bin/rolling-restart.sh tries to wait until the Master ZNode expires. In this particular case, the script seems to hang there forever (even after the /hbase/master ephemeral node expires).
>> >> >> >>
>> >> >> >> Here's the code in the script:
>> >> >> >> ----
>> >> >> >> # make sure the master znode has been deleted before continuing
>> >> >> >> zparent=`$bin/hbase org.apache.hadoop.hbase.util.HBaseConfTool zookeeper.znode.parent`
>> >> >> >> if [ "$zparent" == "null" ]; then zparent="/hbase"; fi
>> >> >> >> zmaster=`$bin/hbase org.apache.hadoop.hbase.util.HBaseConfTool zookeeper.znode.master`
>> >> >> >> if [ "$zmaster" == "null" ]; then zmaster="master"; fi
>> >> >> >> zmaster=$zparent/$zmaster
>> >> >> >> echo -n "Waiting for Master ZNode ${zmaster} to expire"
>> >> >> >> while bin/hbase zkcli stat $zmaster >/dev/null 2>&1; do
>> >> >> >>   echo -n "."
>> >> >> >>   sleep 1
>> >> >> >> done
>> >> >> >> echo #force a newline
>> >> >> >> ----
>> >> >> >>
>> >> >> >> The problem is that 'bin/hbase zkcli stat /hbase/master ...' seems to always return with $? == 0, regardless of whether the znode is present or not! I've checked with Patrick Hunt (ZK committer) and this is the expected behavior. The only non-zero return codes are for abnormal exits (exceptions thrown).
>> >> >> >>
>> >> >> >> Here's the ZK code I was looking through:
>> >> >> >>
>> >> >> >> https://github.com/apache/zookeeper/blob/release-3.4.3/src/java/main/org/apache/zookeeper/ZooKeeperMain.java#L736
>> >> >> >>
>> >> >> >> https://github.com/apache/zookeeper/blob/release-3.4.3/src/java/main/org/apache/zookeeper/ZooKeeper.java#L980
>> >> >> >>
>> >> >> >> Thoughts?
>> >> >> >>
>> >> >> >> Jon.
>> >> >> >>
>> >> >> >> --
>> >> >> >> // Jonathan Hsieh (shay)
>> >> >> >> // Software Engineer, Cloudera
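
A minimal sketch of the workaround Patrick suggests above (grep the `stat` output for "Node does not exist" instead of trusting the exit code). It is an illustration only, not the change actually committed under HBASE-5603. It folds stderr into the pipe so the message is matched whichever stream it is printed on, and it introduces a hypothetical `ZKCLI` override plus hypothetical helper names (`znode_missing`, `wait_for_znode_expiry`) that do not exist in HBase.

```shell
# Sketch (not the committed HBASE-5603 change): wait for the master znode to
# go away by matching the text `zkcli stat` prints, because the exit status
# is 0 whether or not the znode exists.

# Hypothetical override hook; defaults to the command rolling-restart.sh uses.
# Left unquoted at the call site on purpose so it splits into command + args.
ZKCLI=${ZKCLI:-"bin/hbase zkcli"}

# Succeeds (returns 0) only if `stat` reports the znode as missing.
# stderr is folded into the pipe so the message is seen on either stream.
znode_missing() {
  $ZKCLI stat "$1" 2>&1 | grep -q '^Node does not exist'
}

# Polls once per second, printing a dot per attempt, until the znode is gone.
# Like the original loop, it never gives up (e.g. if ZooKeeper is unreachable).
wait_for_znode_expiry() {
  local znode=$1
  echo -n "Waiting for Master ZNode ${znode} to expire"
  until znode_missing "$znode"; do
    echo -n "."
    sleep 1
  done
  echo # force a newline
}
```

In rolling-restart.sh terms, the whole `while ... done` loop would collapse to `wait_for_znode_expiry "$zmaster"`. Matching on message text is brittle (it breaks if the wording changes), which is why a dedicated command line tool with well-defined exit codes, as discussed in ZOOKEEPER-1428, is the more durable fix; a timeout may also be worth adding.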
