[ 
https://issues.apache.org/jira/browse/HDFS-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056951#comment-13056951
 ] 

Matt Foley commented on HDFS-2010:
----------------------------------

Hi Aaron and Todd, in FSEditLog:
{code}
+      if (badJournals.size() >= stillGoodJournals.size()) {
+        LOG.error("Could not sync any journal to persistent storage. " +
+            "Unsynced transactions: " + (txid - synctxid));
+        runtime.exit(1);
+      }
{code}

The test "if (badJournals.size() >= stillGoodJournals.size())" should probably 
be "if (badJournals.size() >= journals.size())". Here is why: suppose you start 
with 5 journals, and 3 of them fail in the block
{code}
        for (JournalAndStream jas : journals) {
          if (!jas.isActive()) continue;
          try {
            jas.getCurrentStream().setReadyToFlush();
            stillGoodJournals.add(jas);
          } catch (IOException ie) {
            LOG.error("Unable to get ready to flush.", ie);
            badJournals.add(jas);
          }
        }
{code}
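Then suppose both remaining candidate journals actually sync successfully in 
the "// do the sync" block.  At that point badJournals.size() is 3 and 
stillGoodJournals.size() is 2, so you'll still conclude that 
(badJournals.size() >= stillGoodJournals.size()), and wrongly call exit() even 
though two journals are healthy.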
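
To make the arithmetic concrete, here is a minimal standalone sketch of the 
two conditions. It is not the FSEditLog code: the class name JournalSyncCheck 
and both method names are made up for illustration, and the list sizes are 
passed in as plain ints.

```java
// Hypothetical illustration only; none of these names exist in FSEditLog.
class JournalSyncCheck {

    // Condition as written in the patch: failures are compared against the
    // list of journals that were candidates for syncing.
    static boolean exitsPerPatch(int badJournals, int candidateJournals) {
        return badJournals >= candidateJournals;
    }

    // Suggested condition: exit only when every configured journal is bad.
    static boolean exitsPerSuggestion(int badJournals, int totalJournals) {
        return badJournals >= totalJournals;
    }

    public static void main(String[] args) {
        int total = 5;
        int bad = 3;                  // failed in setReadyToFlush()
        int candidates = total - bad; // 2 journals left, both sync fine

        // 3 >= 2, so the patch's check would exit with 2 healthy journals.
        System.out.println("patch exits:      " + exitsPerPatch(bad, candidates));
        // 3 >= 5 is false, so the suggested check keeps running.
        System.out.println("suggestion exits: " + exitsPerSuggestion(bad, total));
    }
}
```

With these numbers the first check reports true and the second reports false; 
the second only becomes true once all 5 journals have failed.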

Also, I find the name "stillGoodJournals" confusing, because when a journal is 
found to be bad in the "// do the sync" block, it isn't removed from the 
"stillGoodJournals" list.  Perhaps "candidateJournalsToSync" would be more 
descriptive?

> Clean up and test behavior under failed edit streams
> ----------------------------------------------------
>
>                 Key: HDFS-2010
>                 URL: https://issues.apache.org/jira/browse/HDFS-2010
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: name-node
>    Affects Versions: Edit log branch (HDFS-1073)
>            Reporter: Todd Lipcon
>            Assignee: Aaron T. Myers
>             Fix For: Edit log branch (HDFS-1073)
>
>         Attachments: hdfs-2010.0.patch, hdfs-2010.1.patch
>
>
> Right now there is very little test coverage of situations where one or more 
> of the edits directories fails. In trunk, the behavior when all of the edits 
> directories are dead is that the NN prints a fatal level log message and 
> calls Runtime.exit(-1).
> I don't think this is really the behavior we want. Needs a bit of thought, 
> but I think something like the following would make more sense:
> - any calls currently waiting on logSync should end up throwing an exception
> - NN should probably enter safe mode
> - ops can restore edits directories and then ask the NN to restore storage, 
> at which point it could exit safemode
> - alternatively, ops could ask the NN to do saveNamespace and then shut 
> it down

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
