[ https://issues.apache.org/jira/browse/HDFS-8298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936092#comment-14936092 ]
Chris mildebrandt commented on HDFS-8298:
-----------------------------------------
We were able to change the following properties in our hdfs-site.xml, which
fixed our very specific issue with controlled outages:
<property>
<name>dfs.qjournal.start-segment.timeout.ms</name>
<value>20000</value>
</property>
<property>
<name>dfs.qjournal.prepare-recovery.timeout.ms</name>
<value>120000</value>
</property>
<property>
<name>dfs.qjournal.accept-recovery.timeout.ms</name>
<value>120000</value>
</property>
<property>
<name>dfs.qjournal.finalize-segment.timeout.ms</name>
<value>120000</value>
</property>
<property>
<name>dfs.qjournal.select-input-streams.timeout.ms</name>
<value>20000</value>
</property>
<property>
<name>dfs.qjournal.get-journal-state.timeout.ms</name>
<value>120000</value>
</property>
<property>
<name>dfs.qjournal.new-epoch.timeout.ms</name>
<value>120000</value>
</property>
<property>
<name>dfs.qjournal.write-txns.timeout.ms</name>
<value>20000</value>
</property>
The values above are the defaults. I'll let others determine the values that
are most appropriate for them, if it's even appropriate for them to change
these at all. As usual, don't blindly change settings without understanding
the impact of your change.
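
For anyone who wants to see what their own hdfs-site.xml currently overrides
before touching anything, here is a minimal sketch. It is not part of Hadoop:
the function names, the DEFAULTS table (copied from the values listed above)
and the sample XML are my own assumptions, and the 60000 in the sample is an
arbitrary illustrative value, not a recommendation. It only reports which
dfs.qjournal.*.timeout.ms properties differ from the defaults listed above.

```python
import xml.etree.ElementTree as ET

# Defaults in milliseconds, copied from the property list in this comment.
DEFAULTS = {
    "dfs.qjournal.start-segment.timeout.ms": 20000,
    "dfs.qjournal.prepare-recovery.timeout.ms": 120000,
    "dfs.qjournal.accept-recovery.timeout.ms": 120000,
    "dfs.qjournal.finalize-segment.timeout.ms": 120000,
    "dfs.qjournal.select-input-streams.timeout.ms": 20000,
    "dfs.qjournal.get-journal-state.timeout.ms": 120000,
    "dfs.qjournal.new-epoch.timeout.ms": 120000,
    "dfs.qjournal.write-txns.timeout.ms": 20000,
}


def read_overrides(xml_text):
    """Return {name: value_ms} for the qjournal timeouts set in xml_text."""
    root = ET.fromstring(xml_text)
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name in DEFAULTS and value:
            found[name] = int(value)
    return found


def diff_from_defaults(xml_text):
    """Return {name: (default_ms, configured_ms)} for values that differ."""
    overrides = read_overrides(xml_text)
    return {
        name: (DEFAULTS[name], value)
        for name, value in overrides.items()
        if value != DEFAULTS[name]
    }


# Hypothetical hdfs-site.xml content, for illustration only.
SAMPLE = """<configuration>
  <property>
    <name>dfs.qjournal.write-txns.timeout.ms</name>
    <value>60000</value>
  </property>
  <property>
    <name>dfs.qjournal.new-epoch.timeout.ms</name>
    <value>120000</value>
  </property>
</configuration>"""

for name, (default_ms, configured_ms) in diff_from_defaults(SAMPLE).items():
    print(f"{name}: default {default_ms} ms, configured {configured_ms} ms")
```

To run it against a real file, read the file's text and pass it to
diff_from_defaults in place of SAMPLE. Properties that are absent from the
file are skipped, since Hadoop then falls back to its built-in default.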
> HA: NameNode should not shut down completely without quorum, doesn't recover
> from temporary network outages
> -----------------------------------------------------------------------------------------------------------
>
> Key: HDFS-8298
> URL: https://issues.apache.org/jira/browse/HDFS-8298
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: ha, HDFS, namenode, qjm
> Affects Versions: 2.6.0
> Environment: HDP 2.2
> Reporter: Hari Sekhon
>
> In an HDFS HA setup, if there is a temporary problem contacting the journal
> nodes (e.g. a network interruption), the NameNode shuts down entirely, when it
> should instead go into standby mode so that it can stay online and retry
> to achieve quorum later.
> If both NameNodes shut themselves off like this, then even after the temporary
> network outage is resolved, the entire cluster remains offline indefinitely
> until an operator intervenes, whereas it could have self-repaired after
> re-contacting the journal nodes and re-achieving quorum.
> {code}2015-04-15 15:59:26,900 FATAL namenode.FSEditLog
> (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for
> required journal (JournalAndStream(mgr=QJM to [<ip>:8485, <ip>:8485,
> <ip>:8485], stream=QuorumOutputStream starting at txid 54270281))
> java.io.IOException: Interrupted waiting 20000ms for a quorum of nodes to
> respond.
> at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:134)
> at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
> at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
> at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
> at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
> at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
> at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
> at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:639)
> at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:388)
> at java.lang.Thread.run(Thread.java:745)
> 2015-04-15 15:59:26,901 WARN client.QuorumJournalManager
> (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at
> txid 54270281
> 2015-04-15 15:59:26,904 INFO util.ExitUtil (ExitUtil.java:terminate(124)) -
> Exiting with status 1
> 2015-04-15 15:59:27,001 INFO namenode.NameNode (StringUtils.java:run(659)) -
> SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down NameNode at <custom_scrubbed>/<ip>
> ************************************************************/{code}
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon