[ https://issues.apache.org/jira/browse/HDFS-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15387632#comment-15387632 ]
Amit Anand commented on HDFS-10659:
-----------------------------------

[~jingzhao] I believe manual creation of the {{current}} directory is required to recover the {{JN}}. The {{current}} directory is not created on {{JN}} startup. Below is what I did:

{code}
root@bcpc-vm6:~# jps -m
21978 Jps -m
25015 Bootstrap start
26955 QuorumPeerMain /etc/zookeeper/conf/zoo.cfg
28040 jmxtrans-all.jar -e -j /opt/jmxtrans/json -s 15 -c false
21555 JournalNode
root@bcpc-vm6:~# ls -ltr /disk/1/dfs/jn/Test-Laptop/
total 64
-rw-rw-r-- 1 hdfs hdfs    31 Jul 21 08:47 in_use.lock
drwxr-xr-x 3 hdfs hdfs 40960 Jul 21 08:47 current
root@bcpc-vm6:~# service hadoop-hdfs-journalnode stop
 * Stopping Hadoop journalnode: stopping journalnode
root@bcpc-vm6:~# jps -m
22805 Jps -m
25015 Bootstrap start
26955 QuorumPeerMain /etc/zookeeper/conf/zoo.cfg
28040 jmxtrans-all.jar -e -j /opt/jmxtrans/json -s 15 -c false
root@bcpc-vm6:~# ls -ltr /disk/1/dfs/jn/Test-Laptop/
total 60
drwxr-xr-x 3 hdfs hdfs 40960 Jul 21 08:47 current
root@bcpc-vm6:~# mv /disk/1/dfs/jn/Test-Laptop/current /disk/1/dfs/jn/Test-Laptop/current.bak
root@bcpc-vm6:~# ls -ltr /disk/1/dfs/jn/Test-Laptop/
total 60
drwxr-xr-x 3 hdfs hdfs 40960 Jul 21 08:47 current.bak
root@bcpc-vm6:~#
root@bcpc-vm6:~# service hadoop-hdfs-journalnode start
 * Starting Hadoop journalnode: starting journalnode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-journalnode-bcpc-vm6.out
root@bcpc-vm6:~# jps -m
25015 Bootstrap start
26955 QuorumPeerMain /etc/zookeeper/conf/zoo.cfg
23527 JournalNode
28040 jmxtrans-all.jar -e -j /opt/jmxtrans/json -s 15 -c false
23758 Jps -m
root@bcpc-vm6:~# ls -ltr /disk/1/dfs/jn/Test-Laptop/
total 64
drwxr-xr-x 3 hdfs hdfs 40960 Jul 21 08:47 current.bak
-rw-rw-r-- 1 hdfs hdfs    31 Jul 21 08:50 in_use.lock
root@bcpc-vm1:~# sudo -u hdfs hdfs dfsadmin -rollEdits
Successfully rolled edit logs.
New segment starts at txid 82525
root@bcpc-vm1:~# ssh bcpc-vm6
root@bcpc-vm6:~# ls -ltr /disk/1/dfs/jn/Test-Laptop/
total 64
drwxr-xr-x 3 hdfs hdfs 40960 Jul 21 08:47 current.bak
-rw-rw-r-- 1 hdfs hdfs    31 Jul 21 08:50 in_use.lock
{code}

After bringing down one of the {{JN}}s, and both before and after running {{dfsadmin -rollEdits}}, I see the following error messages in the log file of the affected {{JN}} ({{bcpc-vm6}} in this example):

{code}
2016-07-21 08:50:22,471 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /disk/1/dfs/jn/Test-Laptop/in_use.lock acquired by nodename 23...@bcpc-vm6.bcpc.example.com
2016-07-21 08:50:22,494 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.startLogSegment from 192.168.100.11:42532 Call#10236 Retry#0
org.apache.hadoop.hdfs.qjournal.protocol.JournalNotFormattedException: Journal Storage Directory /disk/1/dfs/jn/Test-Laptop not formatted
	at org.apache.hadoop.hdfs.qjournal.server.Journal.checkFormatted(Journal.java:461)
	at org.apache.hadoop.hdfs.qjournal.server.Journal.startLogSegment(Journal.java:501)
	at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.startLogSegment(JournalNodeRpcServer.java:161)
	at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.startLogSegment(QJournalProtocolServerSideTranslatorPB.java:186)
	at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25425)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
2016-07-21 08:50:22,508 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Updating lastPromisedEpoch from 0 to 2 for client /192.168.100.11
2016-07-21 08:50:22,509 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.heartbeat from 192.168.100.11:42532 Call#10237 Retry#0
java.io.FileNotFoundException: /disk/1/dfs/jn/Test-Laptop/current/last-promised-epoch.tmp (No such file or directory)
	at java.io.FileOutputStream.open(Native Method)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
	at org.apache.hadoop.hdfs.util.AtomicFileOutputStream.<init>(AtomicFileOutputStream.java:58)
	at org.apache.hadoop.hdfs.util.PersistentLongFile.writeFile(PersistentLongFile.java:78)
	at org.apache.hadoop.hdfs.util.PersistentLongFile.set(PersistentLongFile.java:64)
	at org.apache.hadoop.hdfs.qjournal.server.Journal.updateLastPromisedEpoch(Journal.java:316)
	at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:424)
	at org.apache.hadoop.hdfs.qjournal.server.Journal.heartbeat(Journal.java:407)
	at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.heartbeat(JournalNodeRpcServer.java:154)
	at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.heartbeat(QJournalProtocolServerSideTranslatorPB.java:172)
	at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25423)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
2016-07-21 08:50:57,279 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for nn/f-bcpc-vm2.bcpc.example....@bcpc.example.com (auth:KERBEROS)
2016-07-21 08:50:57,285 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for nn/f-bcpc-vm2.bcpc.example....@bcpc.example.com (auth:KERBEROS) for protocol=interface org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol
2016-07-21 08:50:57,293 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.getEditLogManifest from 192.168.100.12:33500 Call#4821 Retry#0
org.apache.hadoop.hdfs.qjournal.protocol.JournalNotFormattedException: Journal Storage Directory /disk/1/dfs/jn/Test-Laptop not formatted
{code}

It looks like the {{current}} directory, the {{VERSION}} file, and the {{paxos}} directory are only created during {{namenode -initializeSharedEdits}}.

> Namenode crashes after Journalnode re-installation in an HA cluster due to missing paxos directory
> --------------------------------------------------------------------------------------------------
>
> Key: HDFS-10659
> URL: https://issues.apache.org/jira/browse/HDFS-10659
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: ha, journal-node
> Affects Versions: 2.7.1
> Reporter: Amit Anand
>
> In my environment I am seeing {{Namenodes}} crash after a majority of the {{Journalnodes}} are re-installed. We manage multiple clusters and do rolling upgrades followed by a rolling re-install of each node, including the master (NN, JN, RM, ZK) nodes.
> When a journal node is re-installed or moved to a new disk/host, instead of running the {{"initializeSharedEdits"}} command, I copy the {{VERSION}} file from one of the other {{Journalnodes}}, and that allows my {{NN}} to start writing data to the newly installed {{Journalnode}}.
> To achieve quorum for the JNs and recover unfinalized segments, the NN during startup creates NNNN.tmp files under the {{"<disk>/jn/current/paxos"}} directory. In the current implementation the "paxos" directory is only created during the {{"initializeSharedEdits"}} command, and if a JN is re-installed the "paxos" directory is not created upon JN startup or by the NN while writing the NNNN.tmp files, which causes the NN to crash with the following error message:
> {code}
> 192.168.100.16:8485: /disk/1/dfs/jn/Test-Laptop/current/paxos/64044.tmp (No such file or directory)
>         at java.io.FileOutputStream.open(Native Method)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
>         at org.apache.hadoop.hdfs.util.AtomicFileOutputStream.<init>(AtomicFileOutputStream.java:58)
>         at org.apache.hadoop.hdfs.qjournal.server.Journal.persistPaxosData(Journal.java:971)
>         at org.apache.hadoop.hdfs.qjournal.server.Journal.acceptRecovery(Journal.java:846)
>         at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.acceptRecovery(JournalNodeRpcServer.java:205)
>         at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.acceptRecovery(QJournalProtocolServerSideTranslatorPB.java:249)
>         at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25435)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
> {code}
> The current [getPaxosFile|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JNStorage.java#L128-L130] method simply returns a path to a file under the "paxos" directory without verifying that the directory exists. Since the "paxos" directory holds files that are required for NN recovery and for achieving JN quorum, my proposed solution is to add a check to the "getPaxosFile" method and create the {{"paxos"}} directory if it is missing.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
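The proposed check could look roughly like the following. This is a standalone sketch under assumptions, not the actual Hadoop code: the class name {{JNStorageSketch}} is made up for illustration, and the real change would live in {{JNStorage#getPaxosFile}}; the point is simply to create the {{paxos}} directory on demand rather than return a path whose parent may not exist.

{code}
import java.io.File;
import java.io.IOException;

// Illustrative sketch only: mimics the layout
// <storage dir>/current/paxos/<segmentTxId> used by the JournalNode.
class JNStorageSketch {
  private final File sd; // journal storage dir, e.g. /disk/1/dfs/jn/Test-Laptop

  JNStorageSketch(File storageDir) {
    this.sd = storageDir;
  }

  File getPaxosDir() {
    return new File(new File(sd, "current"), "paxos");
  }

  // Proposed behavior: ensure the paxos directory exists before handing
  // the caller a file path inside it, so writing NNNN.tmp cannot fail
  // with FileNotFoundException on a freshly re-installed JN.
  File getPaxosFile(long segmentTxId) throws IOException {
    File paxosDir = getPaxosDir();
    if (!paxosDir.isDirectory() && !paxosDir.mkdirs()) {
      throw new IOException("Could not create paxos dir: " + paxosDir);
    }
    return new File(paxosDir, String.valueOf(segmentTxId));
  }
}
{code}

With a check like this, a JN whose storage directory was wiped (but whose {{VERSION}} file was restored by hand) would no longer cause the NN recovery path to fail on the missing directory.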