Liang Xie created HDFS-7763:
-------------------------------
Summary: fix zkfc hung issue due to not catching exception in a
corner case
Key: HDFS-7763
URL: https://issues.apache.org/jira/browse/HDFS-7763
Project: Hadoop HDFS
Issue Type: Bug
Components: ha
Affects Versions: 2.6.0
Reporter: Liang Xie
Assignee: Liang Xie
In our production cluster, both zkfc processes hung after a ZooKeeper
network outage.
The zkfc log said:
{code}
2015-02-07,17:40:11,875 INFO org.apache.zookeeper.ClientCnxn: Client session
timed out, have not heard from server in 3334ms for sessionid 0x4a61bacdd9dfb2,
closing socket connection and attempting reconnect
2015-02-07,17:40:11,977 FATAL org.apache.hadoop.ha.ActiveStandbyElector:
Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further
znode monitoring connection errors.
2015-02-07,17:40:12,425 INFO org.apache.zookeeper.ZooKeeper: Session:
0x4a61bacdd9dfb2 closed
2015-02-07,17:40:12,425 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal
error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not
retrying further znode monitoring connection errors.
2015-02-07,17:40:12,425 INFO org.apache.hadoop.ipc.Server: Stopping server on
11300
2015-02-07,17:40:12,425 WARN org.apache.hadoop.ha.ActiveStandbyElector:
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector:
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector:
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector:
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector:
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector:
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector:
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector:
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector:
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector:
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector:
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector:
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 INFO org.apache.zookeeper.ClientCnxn: EventThread shut
down
2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Yielding from election
2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server
Responder
2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.HealthMonitor: Stopping
HealthMonitor thread
2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server
listener on 11300
{code}
The thread dump is also uploaded as an attachment.
From the dump, we can see that the process did not exit because of some
unknown non-daemon threads (pool-*-thread-*), while the critical threads,
such as the health monitor and RPC threads, had already been stopped. As a
result, our watchdog (supervisord) did not observe that the zkfc process was
down or abnormal, so the subsequent namenode failover could not be done as
expected.
There are two possible fixes here: 1) figure out where the unnamed threads,
like pool-7-thread-1, come from, and either shut them down or mark them as
daemon threads. I tried to search but have found nothing so far. 2) catch the
exception from ZKFailoverController.run() so that we can still go on to
execute System.exit. The attached patch implements 2).
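
To illustrate option 2), here is a minimal, self-contained sketch of the idea. It is not the attached patch: the class ZkfcExitSketch, the Controller interface and the runSafely helper are hypothetical names made up for this example, and the thread pool only stands in for the unidentified pool-*-thread-* threads seen in the dump. The point is that if run() throws and nothing catches it, the main thread dies but the non-daemon pool thread keeps the JVM alive; catching the Throwable lets the code still reach System.exit.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ZkfcExitSketch {
    /** Hypothetical stand-in for ZKFailoverController.run(). */
    interface Controller {
        int run() throws Exception;
    }

    /**
     * Runs the controller and turns any Throwable into a non-zero
     * return code, so the caller can always go on to System.exit.
     */
    static int runSafely(Controller controller) {
        try {
            return controller.run();
        } catch (Throwable t) {
            System.err.println("Got a fatal error, exiting now: " + t);
            return 1;
        }
    }

    public static void main(String[] args) {
        // Assumption: this pool mimics the unnamed non-daemon threads
        // (pool-N-thread-M) from the thread dump. Its worker thread stays
        // alive after the task finishes and would keep the JVM running.
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.submit(() -> { });

        // Simulate run() failing in the corner case.
        int rc = runSafely(() -> {
            throw new RuntimeException("simulated CONNECTIONLOSS");
        });

        // Reached even though run() threw; System.exit terminates the
        // JVM regardless of the remaining non-daemon thread.
        System.exit(rc);
    }
}
```

Without the try/catch in runSafely, the exception would propagate out of main, System.exit would never be called, and the process would stay up with only the pool thread left, which matches the hang described above.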