[
https://issues.apache.org/jira/browse/HDFS-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506606#comment-13506606
]
Kihwal Lee commented on HDFS-4233:
----------------------------------
Here are more details: rollEditLog() was called via RPC from the SNN, and
opening the new edit files failed. The exception was sent back to the caller,
but no action was taken locally. From this point on, the edit log state is
BETWEEN_LOG_SEGMENTS and no further rolling is allowed because
endCurrentLogSegment() fails. But logSync() and logEdit() went on as if nothing
were wrong.
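The failure mode described above can be modeled in a few lines. This is a simplified, hypothetical sketch and not the branch-0.23 FSEditLog code: the class name, the {{failOpen}} flag, and the counters are made up for illustration, and only the state name and method names come from the description above.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the stuck edit log; not the real FSEditLog. */
public class EditLogStateSketch {
    enum State { IN_SEGMENT, BETWEEN_LOG_SEGMENTS }

    State state = State.IN_SEGMENT;
    final List<StringBuilder> streams = new ArrayList<>();
    boolean failOpen = false;   // assumption: simulates the failed open of new edit files
    int editsAccepted = 0;      // edits the callers believe were logged
    int editsWritten = 0;       // edits that reached at least one stream

    EditLogStateSketch() {
        streams.add(new StringBuilder());
    }

    void endCurrentLogSegment() {
        if (state != State.IN_SEGMENT) {
            throw new IllegalStateException("Bad state: " + state);
        }
        streams.clear();
        state = State.BETWEEN_LOG_SEGMENTS;
    }

    void startLogSegment() throws IOException {
        if (failOpen) {
            // The exception propagates to the RPC caller; nothing is done locally.
            throw new IOException("cannot open new edit file");
        }
        streams.add(new StringBuilder());
        state = State.IN_SEGMENT;
    }

    void rollEditLog() throws IOException {
        endCurrentLogSegment();
        startLogSegment();
    }

    /** With no streams left, the loop body never runs and the edit is silently dropped. */
    void logEdit(String op) {
        for (StringBuilder s : streams) {
            s.append(op).append('\n');
        }
        if (!streams.isEmpty()) {
            editsWritten++;
        }
        editsAccepted++;
    }
}
```

In this model, once rollEditLog() fails with {{failOpen}} set, every later rollEditLog() throws from endCurrentLogSegment(), while logEdit() keeps accepting edits that are never written anywhere.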
Trunk does not have this issue. In {{mapJournalsAndReportErrors()}}, if a
journal marked as required fails, the namenode will terminate. If none is
marked required, it will simply throw an exception, even if all journals fail.
However, logSync() will then log FATAL and terminate, since
JournalSet#isEmpty() works differently in trunk.
In branch-0.23, FSEditLog maintains a list of journals. logSync() invokes
isEmpty(), but isEmpty() does not check the validity of the journals in the
list. Instead, the journals are checked one by one in a loop. Although there is
already logic for counting and disabling bad journals, there is nothing
equivalent to the resource availability check in trunk/branch-2. I think the
best place to add this is {{disableAndReportErrorOnJournals()}}. This will make
the failure behavior almost the same as what is already implemented in
trunk/branch-2.
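A minimal sketch of the kind of check proposed here, assuming a much simplified journal list. Apart from the method name {{disableAndReportErrorOnJournals()}}, everything below (the {{Journal}} and {{Terminator}} types, {{activeJournalCount()}}, the messages) is hypothetical and is not taken from the actual branch-0.23 code or from the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the actual FSEditLog code from branch-0.23. */
public class JournalFailureSketch {

    /** Stand-in for a journal; the real class manages streams and storage. */
    static final class Journal {
        final String name;
        boolean active = true;
        Journal(String name) { this.name = name; }
    }

    /** Stand-in for process termination, so the sketch can be exercised safely. */
    interface Terminator {
        void terminate(String reason);
    }

    private final List<Journal> journals = new ArrayList<>();
    private final Terminator terminator;

    JournalFailureSketch(List<Journal> initial, Terminator terminator) {
        this.journals.addAll(initial);
        this.terminator = terminator;
    }

    /** Counts the journals that are still usable. */
    int activeJournalCount() {
        int n = 0;
        for (Journal j : journals) {
            if (j.active) {
                n++;
            }
        }
        return n;
    }

    /**
     * Disables the given journals, then verifies that at least one usable
     * journal remains. If none does, the condition is treated as fatal.
     */
    void disableAndReportErrorOnJournals(List<Journal> bad) {
        if (bad == null || bad.isEmpty()) {
            return;
        }
        for (Journal j : bad) {
            j.active = false;
            System.err.println("Disabling journal " + j.name);
        }
        if (activeJournalCount() == 0) {
            terminator.terminate("No journals available to log edits");
        }
    }
}
```

Putting the check at the point where journals are disabled means every caller that reports a bad journal gets the same fail-fast behavior, instead of each call site having to remember to test for an empty set. In a real namenode the terminator would log FATAL and exit the process.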
This issue does not exist in branch-1, where rollEditLog() clears
{{editStreams}} before creating new edit files. Since it calls
{{exitIfNoStreams()}} before returning, the namenode will terminate if no edit
stream was successfully created.
As for test cases, trunk already has TestEditLogJournalFailures. I will create
a new patch for branch-0.23 and a test case.
> NN keeps serving even after no journals started while rolling edit
> ------------------------------------------------------------------
>
> Key: HDFS-4233
> URL: https://issues.apache.org/jira/browse/HDFS-4233
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 0.23.5
> Reporter: Kihwal Lee
> Priority: Blocker
> Attachments: hdfs-4233-branch-0.23-quick-death.patch
>
>
> We've seen the namenode keep serving even after a rollEditLog() failure.
> Instead of taking corrective action or regarding this condition as FATAL, it
> keeps on serving and modifying its file system state. No logs are written
> from this point, so if the namenode is restarted, there will be data loss.