[jira] [Commented] (HDFS-8298) HA: NameNode should not shut down completely without quorum, doesn't recover from temporary network outages

Hari Sekhon (JIRA) Tue, 01 Aug 2017 05:58:36 -0700

    [ https://issues.apache.org/jira/browse/HDFS-8298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108851#comment-16108851 ]


Hari Sekhon commented on HDFS-8298:
-----------------------------------

[~qwertymaniac] I understand that this is the current design, but that doesn't 
mean it couldn't be improved, which is why I filed this as an improvement and 
not a bug. In common-sense terms, though, it is a design bug: traditional High 
Availability solutions don't shut down permanently because of temporary 
network outages.

The specific improvement I'm proposing is simply to drop to Standby mode, 
accept no more edits, and then retry every 30 secs (configurable) to regain 
QJM quorum, re-promoting one of the NameNodes to Active once quorum is 
re-established. If that isn't possible because of the write-behind log, then 
at least kill only the Active NameNode and allow the Standby to stay online 
and be promoted to Active once quorum is re-established. Ideally, if 
necessary, improve the Active NameNode so that it can discard the 
transactions that cannot be committed to the edits log without quorum and 
then drop to read-only Standby mode without shutting down the whole process. 
I can't see any reason why this wouldn't be possible, even if it requires 
more code changes.
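
To make the proposed behaviour concrete, here is a minimal sketch of the 
"drop to Standby and retry" state machine described above. It is not Hadoop 
code and does not use any Hadoop API: the class QuorumRetrySketch, its 
methods, and the quorumReachable probe are all hypothetical names invented 
for illustration, and the direct STANDBY -> ACTIVE transition stands in for 
whatever re-election mechanism a real cluster would use.

```java
import java.util.function.BooleanSupplier;

/**
 * Hypothetical illustration only (not part of Hadoop): a NameNode-like
 * component that stops accepting edits when journal quorum is lost, stays
 * alive, and re-probes for quorum at a configurable interval.
 */
final class QuorumRetrySketch {
    enum State { ACTIVE, STANDBY }

    // Assumption: some way to ask "can a quorum of JournalNodes be reached?"
    private final BooleanSupplier quorumReachable;
    private final long retryIntervalMs; // e.g. 30_000 for the proposed 30 secs
    private State state = State.ACTIVE;
    private long nextRetryAtMs = 0L;

    QuorumRetrySketch(BooleanSupplier quorumReachable, long retryIntervalMs) {
        this.quorumReachable = quorumReachable;
        this.retryIntervalMs = retryIntervalMs;
    }

    /** A flush to the quorum failed: refuse further edits, keep the process up. */
    void onQuorumLost(long nowMs) {
        state = State.STANDBY;
        nextRetryAtMs = nowMs + retryIntervalMs;
    }

    /** Called periodically; probes for quorum once the retry interval has elapsed. */
    void tick(long nowMs) {
        if (state != State.STANDBY || nowMs < nextRetryAtMs) {
            return;
        }
        if (quorumReachable.getAsBoolean()) {
            // Simplification: a real cluster would re-elect an Active node here
            // rather than promote itself unconditionally.
            state = State.ACTIVE;
        } else {
            nextRetryAtMs = nowMs + retryIntervalMs;
        }
    }

    boolean acceptsEdits() {
        return state == State.ACTIVE;
    }

    State state() {
        return state;
    }

    public static void main(String[] args) {
        boolean[] networkUp = {false};
        QuorumRetrySketch nn = new QuorumRetrySketch(() -> networkUp[0], 30_000L);
        nn.onQuorumLost(0L);   // outage begins: drop to Standby, stay alive
        nn.tick(30_000L);      // first retry, network still down
        networkUp[0] = true;   // outage ends
        nn.tick(60_000L);      // second retry succeeds
        System.out.println("state after recovery: " + nn.state());
    }
}
```

The clock is passed in explicitly so the sketch stays deterministic; a real 
implementation would presumably run the probe on a scheduled background 
thread and would also have to deal with the uncommitted transactions 
mentioned above.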

In that case the edit logs would be protected from diverging, and there would 
be no reason not to keep the process alive. Keeping it alive makes it 
possible to re-elect an Active NameNode once quorum is re-established, which 
gives more availability, and that is really the point of HA.

Right now customers are working around a flawed design by restarting things 
whenever there is any minor temporary network interruption.

[~andrew.wang] I've just had another large customer encounter the same issue. 
Of course they just started the cluster again, carried on and lived with it; 
they don't even bother raising it with the vendors to debug, since it works 
again after a restart, but it's still broken behaviour. Even on site I only 
hear about these things in passing conversation. Temporary network problems 
are more common than you'd think, as anybody who has been an industrial-level 
networking specialist will know. The fact that customers today simply restart 
their clusters and live with it whenever this crops up doesn't make it 
uncommon; their system administrators simply don't understand the design well 
enough to realise that it could be improved. I've personally seen this more 
times than I've reported, and I know other people don't even take the time to 
report these things, either because they don't understand what could be 
improved or because they can't be bothered to spend their time helping 
vendors improve their product. It's quicker to just start the cluster again: 
it works, and they want to forget about it and move on.

Also consider planned weekend network maintenance outages. This has also 
happened to me before, and there is no reason I should be coming in on Monday 
mornings every few months to a cluster that is down because the design didn't 
get fixed. Yes, you could argue the network team should have notified us of 
quarterly maintenance windows; maybe they did and we missed the email or 
forgot. Perhaps everybody should script cluster shutdown and startup around 
maintenance windows and have monitoring that tries to auto-restart the 
cluster if it's down. But all of this is a plaster over the symptoms rather 
than a cure for the design, and at other times it's not planned maintenance 
but actual unpredictable network faults.

> HA: NameNode should not shut down completely without quorum, doesn't recover 
> from temporary network outages
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-8298
>                 URL: https://issues.apache.org/jira/browse/HDFS-8298
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: ha, namenode, qjm
>    Affects Versions: 2.6.0
>            Reporter: Hari Sekhon
>
> In an HDFS HA setup, if there is a temporary problem contacting the 
> JournalNodes (e.g. a network interruption), the NameNode shuts down 
> entirely, when it should instead go into a standby mode so that it can stay 
> online and retry to achieve quorum later.
> If both NameNodes shut themselves off like this, then even after the 
> temporary network outage is resolved, the entire cluster remains offline 
> indefinitely until an operator intervenes, whereas it could have 
> self-repaired after re-contacting the JournalNodes and re-achieving quorum.
> {code}2015-04-15 15:59:26,900 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [<ip>:8485, <ip>:8485, <ip>:8485], stream=QuorumOutputStream starting at txid 54270281))
> java.io.IOException: Interrupted waiting 20000ms for a quorum of nodes to respond.
>         at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:134)
>         at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
>         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
>         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:639)
>         at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:388)
>         at java.lang.Thread.run(Thread.java:745)
> 2015-04-15 15:59:26,901 WARN  client.QuorumJournalManager (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at txid 54270281
> 2015-04-15 15:59:26,904 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
> 2015-04-15 15:59:27,001 INFO  namenode.NameNode (StringUtils.java:run(659)) - SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down NameNode at <custom_scrubbed>/<ip>
> ************************************************************/{code}
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

