I'm trying to test HBASE-5589, to see if I can add an API call to
HMasterInterface and do a rolling restart / upgrade on a live cluster, which
led me down another rabbit hole.
I'm wondering how the rolling-restart.sh script worked in the past (I can
spend more time setting up an older version to test this, but figured I'd ask).
I'm getting stuck when bin/rolling-restart.sh tries to wait until the
Master ZNode expires. In this particular case, the script seems to hang
there forever (even after the /hbase/master ephemeral node expires).
Here's the code in the script:
----
# make sure the master znode has been deleted before continuing
zparent=`$bin/hbase org.apache.hadoop.hbase.util.HBaseConfTool zookeeper.znode.parent`
if [ "$zparent" == "null" ]; then zparent="/hbase"; fi
zmaster=`$bin/hbase org.apache.hadoop.hbase.util.HBaseConfTool zookeeper.znode.master`
if [ "$zmaster" == "null" ]; then zmaster="master"; fi
zmaster=$zparent/$zmaster
echo -n "Waiting for Master ZNode ${zmaster} to expire"
while bin/hbase zkcli stat $zmaster >/dev/null 2>&1; do
echo -n "."
sleep 1
done
echo #force a newline
----
The problem is that 'bin/hbase zkcli stat /hbase/master ...' seems to
always return with $? == 0, regardless of whether the znode is present or
not! I've checked with Patrick Hunt (ZK committer) and this is the
expected behavior. The only non-zero retcodes are for abnormal exits
(exceptions thrown).
Here's the ZK code I was looking through:
https://github.com/apache/zookeeper/blob/release-3.4.3/src/java/main/org/apache/zookeeper/ZooKeeperMain.java#L736
https://github.com/apache/zookeeper/blob/release-3.4.3/src/java/main/org/apache/zookeeper/ZooKeeper.java#L980
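One possible workaround, as an untested sketch: since the exit code carries no
information, the loop could inspect what zkcli prints instead. This assumes
that 'zkcli stat' on a missing znode prints a line containing "Node does not
exist" (which is what I believe ZooKeeperMain does in 3.4.x, but I have not
verified it across versions). The function names znode_gone and
wait_for_znode_expiry are made up for illustration.

```shell
# Hypothetical helper: returns 0 when the captured zkcli output says the
# znode is missing, 1 otherwise.
# $1: combined stdout/stderr of 'bin/hbase zkcli stat <znode>'
# ASSUMPTION: zkcli reports a missing node with "Node does not exist".
znode_gone() {
  case "$1" in
    *"Node does not exist"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical replacement for the wait loop: poll until the output of
# 'stat' says the znode is gone, rather than trusting the exit code.
# $1: full znode path, e.g. /hbase/master
wait_for_znode_expiry() {
  printf 'Waiting for Master ZNode %s to expire' "$1"
  while ! znode_gone "$(bin/hbase zkcli stat "$1" 2>&1)"; do
    printf '.'
    sleep 1
  done
  echo # force a newline
}
```

In the script this would be called as 'wait_for_znode_expiry $zmaster' in place
of the existing while loop. Matching on a message string is brittle, so a
proper exit code from zkcli would still be the cleaner fix.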
Thoughts?
Jon.
--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [email protected]