[ 
https://issues.apache.org/jira/browse/HDFS-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506606#comment-13506606
 ] 

Kihwal Lee commented on HDFS-4233:
----------------------------------

Here are more details: rollEditLog() was called via RPC from the SNN, and 
opening the new edit files failed. The exception was sent back to the caller, 
but no action was taken locally. From this point on, the edit log state was 
BETWEEN_LOG_SEGMENTS and no further rolling was allowed because 
endCurrentLogSegment() fails. But logSync() and logEdit() went on as if nothing 
were wrong.
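
The sequence above can be sketched with a toy model. This is only an 
illustration, not the FSEditLog code: the class ToyEditLog, the 
canOpenNewSegment switch, the IN_SEGMENT state name and the two counters are 
hypothetical, and the state check in endCurrentLogSegment() is an assumption 
about why further rolls are rejected.

```java
import java.io.IOException;

// Toy model only. Method names follow the comment; everything else is hypothetical.
class ToyEditLog {
    enum State { IN_SEGMENT, BETWEEN_LOG_SEGMENTS }

    State state = State.IN_SEGMENT;
    boolean canOpenNewSegment = true; // hypothetical switch simulating a storage failure
    int editsAccepted = 0;            // edits the namenode applied in memory
    int editsPersisted = 0;           // edits that actually reached an open segment

    void endCurrentLogSegment() {
        // Assumption: ending a segment is only legal while one is open.
        if (state != State.IN_SEGMENT) {
            throw new IllegalStateException("no open segment, state=" + state);
        }
        state = State.BETWEEN_LOG_SEGMENTS;
    }

    void startLogSegment() throws IOException {
        if (!canOpenNewSegment) {
            throw new IOException("could not open new edit file");
        }
        state = State.IN_SEGMENT;
    }

    void rollEditLog() throws IOException {
        endCurrentLogSegment();
        // If this throws, the exception goes back to the RPC caller and the
        // state stays BETWEEN_LOG_SEGMENTS; nothing is done locally.
        startLogSegment();
    }

    void logEdit() {
        // Models the reported behavior: the edit is accepted regardless of state.
        editsAccepted++;
        if (state == State.IN_SEGMENT) {
            editsPersisted++;
        }
    }
}
```

In this model, any edit accepted after the failed roll is counted in 
editsAccepted but not in editsPersisted, which corresponds to the data loss 
on restart described in the issue.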

Trunk does not have this issue. In {{mapJournalsAndReportErrors()}}, if a 
journal marked as required fails, the namenode will terminate. If none is 
marked required, it simply throws an exception, even if all journals fail. 
However, logSync() will log FATAL and terminate, since JournalSet#isEmpty() 
works differently in trunk.
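
A rough sketch of the trunk behavior described above, under stated 
assumptions. ToyJournalSet, ToyJournal, FatalExit (a stand-in for terminating 
the namenode) and the failing flag are hypothetical, and isEmpty() is modeled 
simply as "no usable journal remains", which may not match the real check.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Toy model only; not the real JournalSet.
class ToyJournalSet {
    // Stand-in for namenode termination so the sketch can be exercised.
    static class FatalExit extends RuntimeException {
        FatalExit(String msg) { super(msg); }
    }

    interface JournalAction {
        void apply(ToyJournal j) throws IOException;
    }

    static class ToyJournal {
        final String name;
        final boolean required;
        boolean disabled = false;
        boolean failing = false; // hypothetical failure switch

        ToyJournal(String name, boolean required) {
            this.name = name;
            this.required = required;
        }
    }

    final List<ToyJournal> journals = new ArrayList<>();

    // Sample action: fails for journals flagged as failing.
    static void write(ToyJournal j) throws IOException {
        if (j.failing) {
            throw new IOException("write failed on " + j.name);
        }
    }

    void mapJournalsAndReportErrors(JournalAction action, String status) throws IOException {
        for (ToyJournal j : journals) {
            if (j.disabled) {
                continue;
            }
            try {
                action.apply(j);
            } catch (IOException e) {
                if (j.required) {
                    // A failed required journal terminates the namenode.
                    throw new FatalExit(status + " failed for required journal " + j.name);
                }
                j.disabled = true;
            }
        }
        if (isEmpty()) {
            // No required journal failed, so this is only an exception.
            throw new IOException(status + " failed for all journals");
        }
    }

    // Assumption: "empty" means no usable journal is left.
    boolean isEmpty() {
        for (ToyJournal j : journals) {
            if (!j.disabled) {
                return false;
            }
        }
        return true;
    }

    void logSync() {
        if (isEmpty()) {
            throw new FatalExit("FATAL: no journals available to flush");
        }
        // Flushing to the usable journals would happen here.
    }
}
```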

In branch-0.23, FSEditLog maintains a list of journals. logSync() invokes 
isEmpty(), but it won't check the validity of journals in the list. Instead it 
checks them one by one in a loop. Although it already has logic for counting 
and disabling bad journals, there is nothing equivalent to the resource 
availability check in trunk/branch-2.  I think the best place to add this is 
{{disableAndReportErrorOnJournals()}}. This will make the failure behavior 
almost the same as what is already implemented in trunk/branch-2.
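
A minimal sketch of what such a check might look like. It is not the actual 
patch: ToyEditLog023, ToyJournal, the active flag and FatalExit (standing in 
for namenode termination) are hypothetical, and the real method signature in 
branch-0.23 may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; not the branch-0.23 FSEditLog.
class ToyEditLog023 {
    // Stand-in for namenode termination so the sketch can be exercised.
    static class FatalExit extends RuntimeException {
        FatalExit(String msg) { super(msg); }
    }

    static class ToyJournal {
        final String name;
        boolean active = true;

        ToyJournal(String name) { this.name = name; }
    }

    final List<ToyJournal> journals = new ArrayList<>();

    void disableAndReportErrorOnJournals(List<ToyJournal> badJournals) {
        if (badJournals == null || badJournals.isEmpty()) {
            return; // nothing failed
        }
        // Existing behavior in this model: mark the bad journals as disabled.
        for (ToyJournal j : badJournals) {
            j.active = false;
        }
        // Sketched addition: stop serving once no usable journal remains.
        boolean anyActive = false;
        for (ToyJournal j : journals) {
            if (j.active) {
                anyActive = true;
                break;
            }
        }
        if (!anyActive) {
            throw new FatalExit("FATAL: all journals failed");
        }
    }
}
```

Placing the check where journals are disabled means every caller that reports 
a bad journal gets the same failure behavior without repeating the test.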

This issue does not exist in branch-1, where rollEditLog() clears 
{{editStreams}} before creating new edit files. Since it calls 
{{exitIfNoStreams()}} before returning, the namenode will terminate if no edit 
stream was successfully created.
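
The branch-1 flow described above can be modeled roughly as follows. This is 
a toy illustration only: ToyBranch1EditLog, FatalExit (standing in for 
namenode termination), the string-based streams, the stream naming and the 
failingDirs parameter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model only; not the branch-1 FSEditLog.
class ToyBranch1EditLog {
    // Stand-in for namenode termination so the sketch can be exercised.
    static class FatalExit extends RuntimeException {
        FatalExit(String msg) { super(msg); }
    }

    final List<String> editStreams = new ArrayList<>();

    void rollEditLog(List<String> dirs, Set<String> failingDirs) {
        editStreams.clear(); // old streams are dropped first
        for (String dir : dirs) {
            if (!failingDirs.contains(dir)) {
                editStreams.add(dir + "/new-edits"); // illustrative stream name
            }
        }
        exitIfNoStreams(); // runs before returning to the caller
    }

    void exitIfNoStreams() {
        if (editStreams.isEmpty()) {
            throw new FatalExit("FATAL: no edit streams are accessible");
        }
    }
}
```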

As for test cases, trunk already has TestEditLogJournalFailures.  I will create 
a new patch for branch-0.23 and a test case.
                
> NN keeps serving even after no journals started while rolling edit
> ------------------------------------------------------------------
>
>                 Key: HDFS-4233
>                 URL: https://issues.apache.org/jira/browse/HDFS-4233
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 0.23.5
>            Reporter: Kihwal Lee
>            Priority: Blocker
>         Attachments: hdfs-4233-branch-0.23-quick-death.patch
>
>
> We've seen the namenode keep serving even after a rollEditLog() failure. 
> Instead of taking corrective action or treating this condition as FATAL, it 
> keeps on serving and modifying its file system state. No logs are written 
> from this point, so if the namenode is restarted, there will be data loss.
