IMPALA-5223: Add waiting for HBase Zookeeper nodes to retry loop

Occasionally we'd see HBase fail to start up properly on CentOS 7 clusters. The symptom was that HBase would not open the required nodes in Zookeeper that signal its readiness.
As a workaround, this change moves the wait for the Zookeeper nodes into the retry logic.

Change-Id: Id8dbdff4ad02cac1322e7d580e0a6971daf6ea28
Reviewed-on: http://gerrit.cloudera.org:8080/7159
Reviewed-by: Michael Brown <[email protected]>
Reviewed-by: anujphadke <[email protected]>
Reviewed-by: David Knupp <[email protected]>
Tested-by: Lars Volker <[email protected]>

Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/467ccd19
Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/467ccd19
Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/467ccd19

Branch: refs/heads/master
Commit: 467ccd19508eca0733cb061497a3c2ceca3ea849
Parents: 7a0ee68
Author: Lars Volker <[email protected]>
Authored: Mon Jun 12 15:46:25 2017 -0700
Committer: Lars Volker <[email protected]>
Committed: Tue Jun 13 05:57:49 2017 +0000

----------------------------------------------------------------------
 testdata/bin/run-hbase.sh | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/467ccd19/testdata/bin/run-hbase.sh
----------------------------------------------------------------------
diff --git a/testdata/bin/run-hbase.sh b/testdata/bin/run-hbase.sh
index 2a51105..f264b65 100755
--- a/testdata/bin/run-hbase.sh
+++ b/testdata/bin/run-hbase.sh
@@ -111,20 +111,27 @@ for ((i=1; i <= HBASE_START_RETRY_ATTEMPTS; ++i)); do
     if ! ${HBASE_HOME}/bin/start-hbase.sh 2>&1 | tee -a ${HBASE_LOGDIR}/hbase-startup.out
     then
       echo "HBase Master startup failed"
-    elif ! ${HBASE_HOME}/bin/local-regionservers.sh start 2 3 2>&1 | \
+      continue
+    fi
+    if ! ${HBASE_HOME}/bin/local-regionservers.sh start 2 3 2>&1 | \
         tee -a ${HBASE_LOGDIR}/hbase-rs-startup.out
     then
       echo "HBase regionserver startup failed"
-    else
-      break
+      continue
+    fi
+    if ! ${CLUSTER_BIN}/check-hbase-nodes.py; then
+      echo "HBase nodes did not come online"
+      continue
     fi
+    # If we made it to here, HBase started up correctly so we can stop the retry logic.
+    break
   else
     # In the last iteration, it's fine for errexit to do its thing.
     ${HBASE_HOME}/bin/start-hbase.sh 2>&1 | tee -a ${HBASE_LOGDIR}/hbase-startup.out
     ${HBASE_HOME}/bin/local-regionservers.sh start 2 3 2>&1 | \
       tee -a ${HBASE_LOGDIR}/hbase-rs-startup.out
+    ${CLUSTER_BIN}/check-hbase-nodes.py
   fi
 done
 
-${CLUSTER_BIN}/check-hbase-nodes.py
 echo "HBase startup scripts succeeded"
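
For readers who want to try the control flow in isolation, the following is a minimal sketch of the retry pattern the diff introduces, not code from run-hbase.sh. The functions start_master, start_regionservers and check_nodes, and the MAX_ATTEMPTS value, are hypothetical stand-ins for the real scripts and for HBASE_START_RETRY_ATTEMPTS, and the node check is rigged to fail on the first attempt. It uses a POSIX while loop rather than the C-style for loop of the original script.

```shell
#!/bin/sh
# Illustrative sketch only; not part of this commit or of run-hbase.sh.
set -eu

MAX_ATTEMPTS=3   # hypothetical; the real script uses HBASE_START_RETRY_ATTEMPTS
attempt=0

# Hypothetical stand-ins for start-hbase.sh, local-regionservers.sh and
# check-hbase-nodes.py. The node check is made to fail on the first attempt.
start_master() { return 0; }
start_regionservers() { return 0; }
check_nodes() { [ "$attempt" -ge 2 ]; }

while [ "$attempt" -lt "$MAX_ATTEMPTS" ]; do
  attempt=$((attempt + 1))
  if [ "$attempt" -lt "$MAX_ATTEMPTS" ]; then
    # Commands tested by "if" do not trigger errexit, so a failure just retries.
    if ! start_master; then
      echo "master startup failed"
      continue
    fi
    if ! start_regionservers; then
      echo "regionserver startup failed"
      continue
    fi
    if ! check_nodes; then
      echo "nodes did not come online"
      continue
    fi
    # All three steps succeeded, so stop retrying.
    break
  else
    # Last attempt: run unguarded so errexit aborts the script on failure.
    start_master
    start_regionservers
    check_nodes
  fi
done

echo "started after ${attempt} attempt(s)"
```

With the stand-ins above, the first attempt reports that the nodes did not come online and the second attempt succeeds. Putting the node check inside the loop means a missing node leads to another full startup attempt instead of failing the script after the loop has already ended.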
