He Xiaoqiao created HDFS-13760:
----------------------------------
Summary: improve ZKFC fencing action when network of ZKFC interrupt
Key: HDFS-13760
URL: https://issues.apache.org/jira/browse/HDFS-13760
Project: Hadoop HDFS
Issue Type: Improvement
Components: ha
Reporter: He Xiaoqiao
When the host running the Active NameNode and its ZKFC suffers a prolonged
network fault, HDFS becomes unavailable: the ZKFC on the Standby NameNode can
never complete ssh fencing, because it cannot ssh to the Active NameNode. In
this situation a client cannot connect to the Active NameNode and fails over to
the Standby, but the Standby cannot serve READ/WRITE requests.
{code:xml}
2018-07-23 15:57:10,836 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: rz-data-hdp-nn14.rz.sankuai.com/10.16.70.34:8060. Already tried 40
time(s); maxRetries=45
2018-07-23 15:57:30,856 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: rz-data-hdp-nn14.rz.sankuai.com/10.16.70.34:8060. Already tried 41
time(s); maxRetries=45
2018-07-23 15:57:50,872 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: rz-data-hdp-nn14.rz.sankuai.com/10.16.70.34:8060. Already tried 42
time(s); maxRetries=45
2018-07-23 15:58:10,892 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: rz-data-hdp-nn14.rz.sankuai.com/10.16.70.34:8060. Already tried 43
time(s); maxRetries=45
2018-07-23 15:58:30,912 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: rz-data-hdp-nn14.rz.sankuai.com/10.16.70.34:8060. Already tried 44
time(s); maxRetries=45
2018-07-23 15:58:50,933 INFO org.apache.hadoop.ha.ZKFailoverController: get old
active state exception: org.apache.hadoop.net.ConnectTimeoutException: 20000
millis timeout while waiting for channel to be
ready for connect. ch : java.nio.channels.SocketChannel[connection-pending
local=/ip:port remote=hostname]
2018-07-23 15:58:50,933 INFO org.apache.hadoop.ha.ActiveStandbyElector: old
active is not healthy. need to create znode
2018-07-23 15:58:50,933 INFO org.apache.hadoop.ha.ActiveStandbyElector: Elector
callbacks for NameNode at standbynn start create node, now time:
45179010079342817
2018-07-23 15:58:50,936 INFO org.apache.hadoop.ha.ActiveStandbyElector:
CreateNode result: 0 code:OK for path: /hadoop-ha/ns/ActiveStandbyElectorLock
connectionState: CONNECTED for elector id=469098346
appData=0a07727a2d6e6e313312046e6e31331a1f727a2d646174612d6864702d6e6e31332e727a2e73616e6b7561692e636f6d20e83e28d33e
cb=Elector callbacks for NameNode at standbynamenode
2018-07-23 15:58:50,936 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Checking for any old active which needs to be fenced...
2018-07-23 15:58:50,938 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old
node exists:
0a07727a2d6e6e313312046e6e31341a1f727a2d646174612d6864702d6e6e31342e727a2e73616e6b7561692e636f6d20e83e28d33e
2018-07-23 15:58:50,939 INFO org.apache.hadoop.ha.ZKFailoverController: Should
fence: NameNode at activenamenode
2018-07-23 15:59:10,960 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: activenamenode. Already tried 0 time(s); maxRetries=1
2018-07-23 15:59:30,980 WARN org.apache.hadoop.ha.FailoverController: Unable to
gracefully make NameNode at activenamenode standby (unable to connect)
org.apache.hadoop.net.ConnectTimeoutException: Call From standbynamenode to
activenamenode failed on socket timeout exception:
org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while
waiting for channel to be ready for connect. ch :
java.nio.channels.SocketChannel[connection-pending local=ip:port
remote=activenamenode]; For more details see:
http://wiki.apache.org/hadoop/SocketTimeout
{code}
I propose that when the Active NameNode hits a network fault, its local ZKFC
force that NameNode to become Standby, so that another ZKFC can hold the ZNode
for the election and transition the other NameNode to Active even when ssh
fencing fails.
There is no patch available yet; suggestions are very welcome.
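To make the proposal concrete, here is a minimal sketch of the decision logic
only. It is not Hadoop code: the class SelfFencePolicy, its methods, and the
timeout and safety-margin parameters are all hypothetical names chosen for
illustration. It assumes the isolated ZKFC can measure how long it has been out
of contact with ZooKeeper, and that the remote ZKFC waits for that deadline plus
a margin before it proceeds without ssh fencing.

```java
import java.time.Duration;

/** Hypothetical policy sketch; none of these names exist in Hadoop. */
final class SelfFencePolicy {
    enum Action { KEEP_STATE, TRANSITION_LOCAL_TO_STANDBY }

    // Assumed tunable: how long a ZKFC tolerates losing contact with ZooKeeper.
    private final Duration maxIsolation;

    SelfFencePolicy(Duration maxIsolation) {
        if (maxIsolation.isNegative()) {
            throw new IllegalArgumentException("maxIsolation must not be negative");
        }
        this.maxIsolation = maxIsolation;
    }

    /** Local side: an isolated ZKFC demotes its own Active NameNode. */
    Action decide(boolean localIsActive, Duration sinceLastZkContact) {
        if (localIsActive && sinceLastZkContact.compareTo(maxIsolation) > 0) {
            return Action.TRANSITION_LOCAL_TO_STANDBY;
        }
        return Action.KEEP_STATE;
    }

    /**
     * Remote side (assumption): proceeding without ssh fencing is allowed only
     * after the old Active's self-fence deadline plus a safety margin has passed.
     */
    boolean remoteMaySkipFence(Duration sinceOldActiveLostLock, Duration safetyMargin) {
        Duration deadline = maxIsolation.plus(safetyMargin);
        return sinceOldActiveLostLock.compareTo(deadline) > 0;
    }
}
```

With a 30 second limit, decide(true, Duration.ofSeconds(45)) asks for the local
transition to Standby, while a Standby node or a shorter outage keeps its state.
Whether a time-based margin is safe enough to replace ssh fencing is exactly the
open question of this issue.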