[ https://issues.apache.org/jira/browse/HDFS-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385951#comment-15385951 ]
Amit Anand commented on HDFS-10659:
-----------------------------------

Steps to reproduce
===================
1. Configure an HA cluster with at least 3 JNs.
2. Shut down the 1st JN and move the JN current directory to current.bak.
3. Recreate the current directory with correct permissions and copy the VERSION file from current.bak to current (do not create the paxos directory).
4. Shut down the 2nd JN and repeat steps 2 and 3.
5. Watch the NN logs and see how the NN crashes due to the missing paxos directory.

To recover your cluster
1. Create the "paxos" directory under the JN current directory (make sure permissions are set correctly).
2. Restart the JNs.
3. Restart the NNs.

> Namenode crashes after Journalnode re-installation in an HA cluster due to missing paxos directory
> --------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10659
>                 URL: https://issues.apache.org/jira/browse/HDFS-10659
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: ha, journal-node
>    Affects Versions: 2.7.1
>            Reporter: Amit Anand
>
> In my environment I am seeing {{Namenodes}} crash after {{Journalnodes}} are re-installed. We manage multiple clusters and do rolling upgrades followed by a rolling re-install of each node, including master (NN, JN, RM, ZK) nodes. When a journal node is re-installed or moved to a new disk/host, instead of running the {{"initializeSharedEdits"}} command, I copy the {{VERSION}} file from one of the other {{Journalnodes}}, and that allows my {{NN}} to start writing data to the newly installed {{Journalnode}}.
> To achieve quorum for the JNs and recover unfinalized segments, the NN creates NNNN.tmp files under the {{"<disk>/jn/current/paxos"}} directory during startup.
> In the current implementation the "paxos" directory is only created by the {{"initializeSharedEdits"}} command. If a JN is re-installed, the "paxos" directory is not created on JN startup or by the NN while writing NNNN.tmp files, which causes the NN to crash with the following error message:
> {code}
> 192.168.100.16:8485: /disk/1/dfs/jn/Test-Laptop/current/paxos/64044.tmp (No such file or directory)
>         at java.io.FileOutputStream.open(Native Method)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
>         at org.apache.hadoop.hdfs.util.AtomicFileOutputStream.<init>(AtomicFileOutputStream.java:58)
>         at org.apache.hadoop.hdfs.qjournal.server.Journal.persistPaxosData(Journal.java:971)
>         at org.apache.hadoop.hdfs.qjournal.server.Journal.acceptRecovery(Journal.java:846)
>         at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.acceptRecovery(JournalNodeRpcServer.java:205)
>         at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.acceptRecovery(QJournalProtocolServerSideTranslatorPB.java:249)
>         at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25435)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
> {code}
> The current [getPaxosFile|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JNStorage.java#L128-L130] method simply returns a path to a file under the "paxos" directory without verifying that the directory exists. Since the "paxos" directory holds files that are required for NN recovery and for achieving JN quorum, my proposed solution is to add a check to the "getPaxosFile" method and create the {{"paxos"}} directory if it is missing.
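
The proposed check can be sketched in plain Java as follows. This is a minimal, standalone illustration and not the actual JNStorage code: the class name PaxosDirSketch, the helper getOrCreatePaxosDir, the currentDir parameter, and the file-naming scheme are all assumptions made for the example. It only shows the idea of creating the "paxos" directory on demand before returning a path under it.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical sketch; names and signatures are illustrative, not Hadoop's.
class PaxosDirSketch {

    // Returns <currentDir>/paxos, creating it first if it is missing.
    static File getOrCreatePaxosDir(File currentDir) throws IOException {
        File paxosDir = new File(currentDir, "paxos");
        // createDirectories does nothing if the directory already exists,
        // and throws if the path exists as a regular file or cannot be created.
        Files.createDirectories(paxosDir.toPath());
        return paxosDir;
    }

    // Returns a path for the given segment under the paxos directory.
    // Assumption: the file is named after the segment's transaction id.
    static File getPaxosFile(File currentDir, long segmentTxId) throws IOException {
        return new File(getOrCreatePaxosDir(currentDir), String.valueOf(segmentTxId));
    }

    public static void main(String[] args) throws IOException {
        // Simulate a re-installed JN: a fresh "current" directory with no paxos subdirectory.
        File current = Files.createTempDirectory("jn-current").toFile();
        File paxosFile = getPaxosFile(current, 64044L);
        System.out.println("paxos dir exists: " + paxosFile.getParentFile().isDirectory());
    }
}
```

With a check like this, a write to the returned path would no longer fail only because the parent directory is absent. Permissions on the created directory would follow the process defaults, so the "make sure permissions are set correctly" caveat from the recovery steps still applies.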