[
https://issues.apache.org/jira/browse/HBASE-9563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack updated HBASE-9563:
-------------------------
Attachment: 9563.txt
Make it so we do not return non-zero if the file with the master znode is not
specified or not found.
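
A rough sketch of the intended behavior, not the contents of 9563.txt: treat "no znode file specified" and "znode file not found" as nothing to clean and exit zero, and keep a non-zero exit only for a genuine failure. The class name, the Result enum, and the ZnodeDeleter interface are made up for illustration; the real logic lives around ZNodeClearer and HMasterCommandLine.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Paths;

/** Hypothetical sketch; names here are assumptions, not HBase APIs. */
final class ZnodeCleanupSketch {

  /** Outcome of one cleanup attempt. */
  enum Result { NOTHING_TO_CLEAN, CLEANED, FAILED }

  /** Stand-in for whatever actually deletes the znode in ZooKeeper. */
  interface ZnodeDeleter {
    void delete(String znodePath) throws IOException;
  }

  static Result clear(String znodeFile, ZnodeDeleter deleter) {
    // No file configured: nothing was recorded, so there is nothing to clean.
    if (znodeFile == null || znodeFile.isEmpty()) {
      return Result.NOTHING_TO_CLEAN;
    }
    String znodePath;
    try {
      byte[] raw = Files.readAllBytes(Paths.get(znodeFile));
      znodePath = new String(raw, StandardCharsets.UTF_8).trim();
    } catch (NoSuchFileException e) {
      // File already gone (for example, another cleaner got there first).
      return Result.NOTHING_TO_CLEAN;
    } catch (IOException e) {
      return Result.FAILED;
    }
    try {
      deleter.delete(znodePath);
      return Result.CLEANED;
    } catch (IOException e) {
      return Result.FAILED;
    }
  }

  /** Only a real failure maps to a non-zero exit code. */
  static int exitCodeFor(Result result) {
    return result == Result.FAILED ? 1 : 0;
  }

  public static void main(String[] args) {
    String znodeFile = args.length > 0 ? args[0] : null;
    Result result = clear(znodeFile, path -> System.out.println("would delete " + path));
    System.out.println(result);
    System.exit(exitCodeFor(result));
  }
}
```

With this shape, a missing /tmp/hbase-hbase-master.znode would no longer by itself make the cleaner step report failure to the script that called it.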
Was going to try this first but it likely needs more. What we see is that the
master is killed but sticks around. 30 seconds later, we go to start a master
but it fails because the JMX port is occupied. On its way out, this new master
tries to clear the master znode. Its clearing of the znode seems to make the
stuck master fail, so we have two znode cleaners running at about the same
time. Here is the log:
{code}
2013-10-07 05:06:29,713 INFO [AM.ZK.Worker-pool2-t559] master.RegionStates: Offlined d2ff3ce62dacd333d98feee91f620f8a from a1806.halxg.cloudera.com,60020,1381147238190
2013-10-07 05:06:57,313 INFO [main] util.VersionInfo: HBase 0.96.0
2013-10-07 05:06:57,313 INFO [main] util.VersionInfo: Subversion git://hbase-jenkins.ent.cloudera.com/var/lib/jenkins/jobs/hbase-096/workspace -r 06a2800d3faf83aec482c210c61d453ce8e759bc
2013-10-07 05:06:57,313 INFO [main] util.VersionInfo: Compiled by jenkins on Mon Oct 7 00:11:57 PDT 2013
Mon Oct 7 05:06:57 PDT 2013 Starting master on a1805.halxg.cloudera.com
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 386225
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 32768
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 32768
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
2013-10-07 05:06:57,616 INFO [main] zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
2013-10-07 05:06:57,616 INFO [main] zookeeper.ZooKeeper: Client environment:host.name=a1805.halxg.cloudera.com
2013-10-07 05:06:57,616 INFO [main] zookeeper.ZooKeeper: Client environment:java.version=1.7.0_25
2013-10-07 05:06:57,616 INFO [main] zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
2013-10-07 05:06:57,616 INFO [main] zookeeper.ZooKeeper: Client environment:java.home=/opt/toolchain/sun-jdk-64bit-1.7.0.25/jre
2013-10-07 05:06:57,616 INFO [main] zookeeper.ZooKeeper: Client environment:java.class.path=/opt/hbase/current/bin/../conf:/opt/toolchain/sun-jdk-64bit-1.7.0.25/lib/tools.jar:/opt/hbase/current/bin/#
2013-10-07 05:06:57,617 INFO [main] zookeeper.ZooKeeper: Client environment:java.library.path=/opt/hadoop/hadoop-2.1.0-beta/lib/native
2013-10-07 05:06:57,617 INFO [main] zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
2013-10-07 05:06:57,617 INFO [main] zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
2013-10-07 05:06:57,617 INFO [main] zookeeper.ZooKeeper: Client environment:os.name=Linux
2013-10-07 05:06:57,617 INFO [main] zookeeper.ZooKeeper: Client environment:os.arch=amd64
2013-10-07 05:06:57,618 INFO [main] zookeeper.ZooKeeper: Client environment:os.version=3.2.0-43-generic
2013-10-07 05:06:57,618 INFO [main] zookeeper.ZooKeeper: Client environment:user.name=hbase
2013-10-07 05:06:57,618 INFO [main] zookeeper.ZooKeeper: Client environment:user.home=/home/hbase
2013-10-07 05:06:57,618 INFO [main] zookeeper.ZooKeeper: Client environment:user.dir=/home/hbase
2013-10-07 05:06:57,619 INFO [main] zookeeper.ZooKeeper: Initiating client connection, connectString=a1805.halxg.cloudera.com:2181 sessionTimeout=90000 watcher=clean znode for master
2013-10-07 05:06:57,651 INFO [main] zookeeper.RecoverableZooKeeper: Process identifier=clean znode for master connecting to ZooKeeper ensemble=a1805.halxg.cloudera.com:2181
2013-10-07 05:06:57,655 INFO [main-SendThread(a1805.halxg.cloudera.com:2181)] zookeeper.ClientCnxn: Opening socket connection to server a1805.halxg.cloudera.com/10.20.200.105:2181. Will not attempt #
2013-10-07 05:06:57,661 INFO [main-SendThread(a1805.halxg.cloudera.com:2181)] zookeeper.ClientCnxn: Socket connection established to a1805.halxg.cloudera.com/10.20.200.105:2181, initiating session
2013-10-07 05:06:57,685 INFO [main-SendThread(a1805.halxg.cloudera.com:2181)] zookeeper.ClientCnxn: Session establishment complete on server a1805.halxg.cloudera.com/10.20.200.105:2181, sessionid = #
2013-10-07 05:06:59,677 INFO [main] util.VersionInfo: HBase 0.96.0
2013-10-07 05:06:59,677 INFO [main] util.VersionInfo: Subversion git://hbase-jenkins.ent.cloudera.com/var/lib/jenkins/jobs/hbase-096/workspace -r 06a2800d3faf83aec482c210c61d453ce8e759bc
2013-10-07 05:06:59,678 INFO [main] util.VersionInfo: Compiled by jenkins on Mon Oct 7 00:11:57 PDT 2013
2013-10-07 05:06:59,971 INFO [main] zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
2013-10-07 05:06:59,971 INFO [main] zookeeper.ZooKeeper: Client environment:host.name=a1805.halxg.cloudera.com
2013-10-07 05:06:59,971 INFO [main] zookeeper.ZooKeeper: Client environment:java.version=1.7.0_25
2013-10-07 05:07:00,008 INFO [main] zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
2013-10-07 05:07:00,008 INFO [main] zookeeper.ZooKeeper: Client environment:java.home=/opt/toolchain/sun-jdk-64bit-1.7.0.25/jre
2013-10-07 05:07:00,008 INFO [main] zookeeper.ZooKeeper: Client environment:java.class.path=/opt/hbase/current/bin/../conf:/opt/toolchain/sun-jdk-64bit-1.7.0.25/lib/tools.jar:/opt/hbase/current/bin/#
2013-10-07 05:07:00,009 INFO [main] zookeeper.ZooKeeper: Client environment:java.library.path=/opt/hadoop/hadoop-2.1.0-beta/lib/native
2013-10-07 05:07:00,009 INFO [main] zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
2013-10-07 05:07:00,009 INFO [main] zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
2013-10-07 05:07:00,009 INFO [main] zookeeper.ZooKeeper: Client environment:os.name=Linux
2013-10-07 05:07:00,009 INFO [main] zookeeper.ZooKeeper: Client environment:os.arch=amd64
2013-10-07 05:07:00,009 INFO [main] zookeeper.ZooKeeper: Client environment:os.version=3.2.0-43-generic
2013-10-07 05:07:00,010 INFO [main] zookeeper.ZooKeeper: Client environment:user.name=hbase
2013-10-07 05:07:00,010 INFO [main] zookeeper.ZooKeeper: Client environment:user.home=/home/hbase
2013-10-07 05:07:00,010 INFO [main] zookeeper.ZooKeeper: Client environment:user.dir=/home/hbase
2013-10-07 05:07:00,011 INFO [main] zookeeper.ZooKeeper: Initiating client connection, connectString=a1805.halxg.cloudera.com:2181 sessionTimeout=90000 watcher=clean znode for master
2013-10-07 05:07:00,042 INFO [main] zookeeper.RecoverableZooKeeper: Process identifier=clean znode for master connecting to ZooKeeper ensemble=a1805.halxg.cloudera.com:2181
2013-10-07 05:07:00,043 WARN [main] hbase.ZNodeClearer: Can't read the content of the znode file
java.io.FileNotFoundException: /tmp/hbase-hbase-master.znode (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at java.io.FileInputStream.<init>(FileInputStream.java:97)
    at java.io.FileReader.<init>(FileReader.java:58)
    at org.apache.hadoop.hbase.ZNodeClearer.readMyEphemeralNodeOnDisk(ZNodeClearer.java:95)
    at org.apache.hadoop.hbase.ZNodeClearer.clear(ZNodeClearer.java:143)
    at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:138)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
    at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2787)
2013-10-07 05:07:00,046 INFO [main-SendThread(a1805.halxg.cloudera.com:2181)] zookeeper.ClientCnxn: Opening socket connection to server a1805.halxg.cloudera.com/10.20.200.105:2181. Will not attempt #
{code}
Elliott suggests setting the master in autorestart mode or, beyond that, having
the master restart retry (there doesn't seem to be an easy facility for this in
the ClusterManager interface at the moment).
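
To make the "restart retry" idea concrete, here is a minimal sketch of a bounded retry loop around a start action. It is an assumption about what such a facility could look like, not something ClusterManager offers today; RestartRetrySketch, startWithRetry, and the attempt and pause values are all hypothetical.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of a "start with retry" helper; not an HBase API. */
final class RestartRetrySketch {

  /**
   * Calls startAction up to maxAttempts times, pausing between attempts.
   * Returns the 1-based attempt that succeeded, or -1 if none did.
   */
  static int startWithRetry(BooleanSupplier startAction, int maxAttempts, long pauseMillis) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      boolean started;
      try {
        started = startAction.getAsBoolean();
      } catch (RuntimeException e) {
        // Treat a thrown error (say, a port still occupied) as a failed attempt.
        started = false;
      }
      if (started) {
        return attempt;
      }
      if (attempt < maxAttempts) {
        try {
          // Give the old process time to exit and release its ports.
          Thread.sleep(pauseMillis);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return -1;
        }
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    // Simulated start action: fails twice (old master still holds the port), then succeeds.
    int[] calls = {0};
    int attempt = startWithRetry(() -> ++calls[0] >= 3, 5, 100L);
    System.out.println(attempt > 0 ? "started on attempt " + attempt : "gave up");
  }
}
```

A pause long enough for the killed master to actually exit is the part that matters here; a retry that fires immediately would likely hit the same occupied JMX port.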
> Autorestart doesn't work if zkcleaner fails
> -------------------------------------------
>
> Key: HBASE-9563
> URL: https://issues.apache.org/jira/browse/HBASE-9563
> Project: HBase
> Issue Type: Bug
> Reporter: Elliott Clark
> Assignee: stack
> Fix For: 0.98.0, 0.96.1
>
> Attachments: 9563.txt
>
>
> I've seen this several times where a master didn't autorestart because zk
> cleaner failed. We should still restart the daemon even if it's not possible
> to clean the zk nodes.