[
https://issues.apache.org/jira/browse/HDFS-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Toshihiro Suzuki updated HDFS-14123:
------------------------------------
Attachment: HDFS-14123.01.patch
> NameNode failover doesn't happen when running fsfreeze for the NameNode dir (dfs.namenode.name.dir)
> ---------------------------------------------------------------------------------------------------
>
> Key: HDFS-14123
> URL: https://issues.apache.org/jira/browse/HDFS-14123
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ha
> Reporter: Toshihiro Suzuki
> Assignee: Toshihiro Suzuki
> Priority: Major
> Attachments: HDFS-14123.01.patch
>
>
> I ran fsfreeze on the NameNode dir (dfs.namenode.name.dir) in my cluster for
> testing purposes, but NameNode failover didn't happen:
> {code}
> fsfreeze -f /mnt
> {code}
> /mnt is a filesystem partition separate from /, and the NameNode dir
> (dfs.namenode.name.dir) is /mnt/hadoop/hdfs/namenode.
> I checked the source code and found that the monitorHealth RPC from ZKFC
> doesn't fail even when the NameNode dir is frozen, presumably because the
> health check only looks at available disk space, which a frozen filesystem
> still reports normally. I think that's why the failover doesn't happen.
> Also, when the NameNode dir is frozen, FSImage.rollEditLog() gets stuck as
> shown below. It keeps holding the write lock of FSNamesystem while it waits,
> which effectively brings the whole HDFS service down:
> {code}
> "IPC Server handler 5 on default port 8020" #53 daemon prio=5 os_prio=0
> tid=0x00007f56b96e2000 nid=0x5042 in Object.wait() [0x00007f56937bb000]
> java.lang.Thread.State: TIMED_WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync$SyncEdit.logSyncWait(FSEditLogAsync.java:317)
> - locked <0x00000000c58ca268> (a
> org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.logSyncAll(FSEditLogAsync.java:147)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.endCurrentLogSegment(FSEditLog.java:1422)
> - locked <0x00000000c58ca268> (a
> org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1316)
> - locked <0x00000000c58ca268> (a
> org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync)
> at
> org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1322)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4740)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1307)
> at
> org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:148)
> at
> org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:14726)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:898)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:844)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2727)
> Locked ownable synchronizers:
> - <0x00000000c5f4ca10> (a
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
> {code}
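>
> For illustration, this blocking behavior isn't specific to HDFS: once a
> filesystem is frozen, every operation that modifies it (create, write, sync)
> blocks in the kernel until the filesystem is thawed, which is why the
> edit-log sync above never returns. A minimal standalone Java sketch of the
> behavior (the probe path is just an example):
> {code}
> import java.io.FileOutputStream;
> import java.io.IOException;
>
> public class FrozenFsWriteProbe {
>     public static void main(String[] args) throws IOException {
>         // Assumes /mnt was frozen with "fsfreeze -f /mnt" beforehand. Each
>         // step below that modifies the filesystem (file creation, write,
>         // sync) blocks in the kernel until "fsfreeze -u /mnt" thaws it.
>         try (FileOutputStream out = new FileOutputStream("/mnt/probe.tmp")) {
>             out.write(1);
>             out.getFD().sync();
>         }
>         System.out.println("/mnt is writable");
>     }
> }
> {code}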
> I believe NameNode failover should happen in this case. One idea is to check
> whether the NameNode dir is still working when the NameNode receives the
> monitorHealth RPC from ZKFC.
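> A minimal sketch of the idea (not the attached patch; the helper name and
> the timeout are hypothetical): probe each configured NameNode dir with a
> bounded write-and-sync, so that a frozen or failed filesystem turns into a
> health-check failure instead of an indefinite hang:
> {code}
> import java.io.File;
> import java.io.FileOutputStream;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.Future;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.TimeoutException;
>
> public class NameDirHealthProbe {
>     // Hypothetical timeout; a real patch would make this configurable.
>     private static final long PROBE_TIMEOUT_MS = 5000;
>
>     // Returns normally if nameDir accepts a synced write within the timeout;
>     // throws otherwise, so monitorHealth could report the NameNode unhealthy
>     // (in the NameNode this would surface as a HealthCheckFailedException).
>     public static void checkNameDirIsWorking(File nameDir) throws Exception {
>         ExecutorService executor = Executors.newSingleThreadExecutor();
>         try {
>             Future<?> probe = executor.submit(() -> {
>                 File f = new File(nameDir, ".health-probe");
>                 try (FileOutputStream out = new FileOutputStream(f)) {
>                     out.write(0);
>                     out.getFD().sync(); // blocks indefinitely if frozen
>                 }
>                 f.delete();
>                 return null;
>             });
>             probe.get(PROBE_TIMEOUT_MS, TimeUnit.MILLISECONDS);
>         } catch (TimeoutException e) {
>             throw new Exception("NameNode dir " + nameDir
>                 + " did not accept a synced write in time", e);
>         } finally {
>             // The probe thread may stay blocked in the kernel on a frozen
>             // filesystem; interrupt it and move on rather than waiting.
>             executor.shutdownNow();
>         }
>     }
> }
> {code}
> The timeout is the important part: without it the probe itself would block
> forever on a frozen filesystem, exactly like the edit-log sync in the stack
> trace above.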
> I will attach a patch for this idea.