> On Jan. 16, 2015, 12:07 a.m., Jonathan Hurley wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode_upgrade.py,
> >  line 50
> > <https://reviews.apache.org/r/29950/diff/1/?file=823094#file823094line50>
> >
> >     If not specified, this should be defaulted to HTTP_ONLY

Will fix this so that HTTP_ONLY is the default when the policy is not specified.
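For reference, the defaulting the reviewer asks for can be sketched as below; the key name "dfs.http.policy" and the plain-dict lookup are illustrative assumptions, not the actual code in journalnode_upgrade.py:

```python
# Hypothetical sketch of defaulting the HTTP policy; the key name
# "dfs.http.policy" and the plain-dict config are illustrative assumptions.
def get_http_policy(hdfs_site):
    # If not specified, fall back to HTTP_ONLY, which matches the HDFS default.
    policy = hdfs_site.get("dfs.http.policy", "HTTP_ONLY")
    return policy.upper()
```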


> On Jan. 16, 2015, 12:07 a.m., Jonathan Hurley wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode_upgrade.py,
> >  lines 55-56
> > <https://reviews.apache.org/r/29950/diff/1/?file=823094#file823094line55>
> >
> >     This will not work in HA mode. The NameNode is a combination of 
> > `dfs.namenode.http-address`, the HA cluster name, and the `nn` identifier. 
> > Such as:
> >     
> >     dfs.namenode.http-address.c1ha.nn2

With the current code, it returns a value like "c6408.ambari.apache.org:50070", 
and the function get_jmx_data converts it to something like 
"http://c6408.ambari.apache.org:50070/jmx", which does appear to work.
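A rough sketch of that conversion is below. The name get_jmx_data appears in the patch, but this body is an illustrative assumption rather than the actual utils.py code, and the qry filter parameter is just the standard Hadoop JMX servlet query string:

```python
import json
import urllib.request

def build_jmx_url(namenode_address):
    # "c6408.ambari.apache.org:50070" -> "http://c6408.ambari.apache.org:50070/jmx"
    if not namenode_address.startswith("http://"):
        namenode_address = "http://" + namenode_address
    return namenode_address.rstrip("/") + "/jmx"

def get_jmx_data(namenode_address, query=None):
    # Fetch the NameNode's JMX servlet as JSON, optionally filtered by ?qry=...
    # (illustrative sketch, not the real utils.py implementation)
    url = build_jmx_url(namenode_address)
    if query:
        url += "?qry=" + query
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```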


> On Jan. 16, 2015, 12:07 a.m., Jonathan Hurley wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode_upgrade.py,
> >  lines 87-88
> > <https://reviews.apache.org/r/29950/diff/1/?file=823094#file823094line87>
> >
> >     kinit needed here?

kinit happens just before, on line 83.
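For readers without the diff open, the pattern in question is a kinit issued before the secured HDFS call; the paths and principal below are placeholder assumptions, not the values from the patch:

```python
# Illustrative sketch of the kinit command string that runs before a
# secured HDFS command; keytab path and principal are placeholder assumptions.
def kinit_command(kinit_path, keytab, principal):
    return "{0} -kt {1} {2}".format(kinit_path, keytab, principal)
```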


- Alejandro


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29950/#review68366
-----------------------------------------------------------


On Jan. 15, 2015, 10:43 p.m., Alejandro Fernandez wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29950/
> -----------------------------------------------------------
> 
> (Updated Jan. 15, 2015, 10:43 p.m.)
> 
> 
> Review request for Ambari, Dmitro Lisnichenko, Jonathan Hurley, Nate Cole, 
> Srimanth Gunturi, Sid Wagle, Tom Beerbower, and Yurii Shylov.
> 
> 
> Bugs: AMBARI-9163
>     https://issues.apache.org/jira/browse/AMBARI-9163
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> The active namenode shuts down during the first call to get the safemode 
> status.
> `
> su - hdfs -c 'hdfs dfsadmin -safemode get'
> `
> 
> returned
> `
> failed on connection exception: java.net.ConnectException: Connection 
> refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
> `
> 
> The active namenode shows the following during the same time window:
> `
> 2015-01-15 00:35:04,233 WARN  client.QuorumJournalManager 
> (IPCLoggerChannel.java:call(388)) - Remote journal 192.168.64.106:8485 failed 
> to write txns 52-52. Will try to write to this JN again after the next log 
> roll.
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException):
>  Can't write, no segment open
>       at 
> org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:470)
>       at 
> org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:344)
>       at 
> org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
>       at 
> org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
>       at 
> org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
> 
>       at org.apache.hadoop.ipc.Client.call(Client.java:1468)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>       at com.sun.proxy.$Proxy12.journal(Unknown Source)
>       at 
> org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
>       at 
> org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
>       at 
> org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> `
> 
> This issue is intermittent because it depends on the behavior of the 
> JournalNodes, so fixing it requires more work in the scripts.
> 
> Today, our orchestration restarts one JournalNode at a time. However, the 
> current log segment is null because it has not yet rolled to a new one, which 
> can be forced by running "hdfs dfsadmin -rollEdits" and waiting until some 
> conditions are true.
> 
> The runbook has more details:
> `
> // Function to ensure all JNs are up and functional
> ensureJNsAreUp(Jn1, Jn2, Jn3) {
>   rollEdits at the namenode // hdfs dfsadmin -rollEdits
>   get "LastAppliedOrWrittenTxId" from NN jmx
>   wait until "LastWrittenTxId" from all JNs is >= the transaction ID from the 
> previous step; timeout after 3 mins
> }
> 
> // Before bringing down a JournalNode, ensure that the other two JournalNodes 
> are up
> ensureJNsAreUp
> for each JN {
>   do upgrade of one JN
>   ensureJNsAreUp
> }
> 
> `
> 
> Root caused to:
> https://github.com/apache/hadoop/blob/ae91b13a4b1896b893268253104f935c3078d345/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java
>  line 344
> 
> 
> Diffs
> -----
> 
>   
> ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/metainfo.xml 
> ce0ab297a8c8e665e8ffde79b9b36be2d29d117c 
>   
> ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode.py
>  15e068947307a321566385fb670232af7f78d71b 
>   
> ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode_upgrade.py
>  PRE-CREATION 
>   
> ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/namenode_upgrade.py
>  93efae35281e7d3d175ecc95b3af4e531cf69b64 
>   
> ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/utils.py
>  f185ea0d6b2e7dfe1cd8ce95287d2a2f1970e682 
> 
> Diff: https://reviews.apache.org/r/29950/diff/
> 
> 
> Testing
> -------
> 
> Copied the changed files to a 3-node HA cluster and verified that the upgrade 
> worked twice.
> Unit tests passed:
> 
> [INFO] 
> ------------------------------------------------------------------------
> [INFO] BUILD SUCCESS
> [INFO] 
> ------------------------------------------------------------------------
> [INFO] Total time: 30:23.410s
> [INFO] Finished at: Thu Jan 15 14:43:23 PST 2015
> [INFO] Final Memory: 61M/393M
> [INFO] 
> ------------------------------------------------------------------------
> 
> 
> Thanks,
> 
> Alejandro Fernandez
> 
>