[ 
https://issues.apache.org/jira/browse/HDFS-3540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448966#comment-13448966
 ] 

Colin Patrick McCabe commented on HDFS-3540:
--------------------------------------------

Let me go into a little more detail here.

When we were originally talking about Recovery Mode, one big concern we had was 
that system administrators would overuse Recovery Mode to fix issues that might 
be better addressed in a different way.  Of course, it's impossible to prevent 
all misuse: human beings are not perfect, and any tool can be misused.  But we 
can make misuse harder.  That's the reason why we made Recovery Mode a startup 
option, rather than a configuration setting.  It would be too easy for people 
to set the configuration and then leave it set even after the problem was gone.  
That's also the reason why a NameNode in Recovery Mode exits as soon as it has 
loaded the edit log and written a new FSImage.  This was all discussed in 
HDFS-3004.

Obviously, edit log toleration goes against those assumptions, and in a way 
that, frankly, I think is very dangerous.

Recovery Mode is generally an extensible concept.  Since it has nothing to do 
with the physical structure of the edit log on-disk, it can be extended to 
handle arbitrary types of corruption.  For example, what if you encounter an 
edit that relies on a directory that doesn't exist (because of corruption 
earlier in the log)?  This is something that recovery mode could conceivably 
handle by displaying a prompt and asking "would you like to create the parent 
directory for the directory this edit references?"
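
To make the idea concrete, here is a rough sketch in Java.  It is not the 
actual Recovery Mode code: the class, the Choice enum, and the replay method 
are all made up for illustration, and the "edit log" is reduced to a list of 
mkdir paths.  The only point is that the decision is delegated to a prompt 
callback, so new kinds of corruption can be handled by adding new questions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

public class RecoverySketch {
    /** Choices an operator could be offered (hypothetical, not the real prompt). */
    enum Choice { CREATE_PARENT, SKIP_EDIT, STOP }

    /**
     * Replays a list of "mkdir" edits into a set of directory paths.
     * When an edit refers to a parent that does not exist, the prompt
     * callback decides how to proceed.
     */
    static Set<String> replay(List<String> mkdirEdits, Function<String, Choice> prompt) {
        Set<String> dirs = new TreeSet<>();
        dirs.add("/");
        for (String path : mkdirEdits) {
            String parent = parentOf(path);
            if (!dirs.contains(parent)) {
                Choice choice = prompt.apply(path);
                if (choice == Choice.STOP) {
                    break;      // give up; keep what was loaded so far
                }
                if (choice == Choice.SKIP_EDIT) {
                    continue;   // drop this edit and move on
                }
                // CREATE_PARENT: add every missing ancestor up to the root.
                for (String p = parent; !dirs.contains(p); p = parentOf(p)) {
                    dirs.add(p);
                }
            }
            dirs.add(path);
        }
        return dirs;
    }

    /** Returns the parent of an absolute path; the parent of "/" is "/". */
    static String parentOf(String path) {
        int i = path.lastIndexOf('/');
        return i <= 0 ? "/" : path.substring(0, i);
    }

    public static void main(String[] args) {
        // "/x/y" refers to "/x", which was never created (simulated corruption).
        List<String> edits = Arrays.asList("/a", "/a/b", "/x/y");
        System.out.println(replay(edits, path -> Choice.CREATE_PARENT));
        System.out.println(replay(edits, path -> Choice.SKIP_EDIT));
    }
}
```

In a real tool the callback would ask the administrator on the console; here 
it is a lambda so the behavior of each answer can be seen directly.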

Edit Log Toleration is not extensible.  It can only ever handle one type of 
corruption: tail corruption.  But we rarely see tail corruption any more, since 
FSEditLog preallocation was improved in branch-1 (HDFS-3596).  I can't think of 
a single case of tail corruption we've seen in the past few months.  Many of 
the cases of corruption we've seen have been instances of HDFS-3652, and edit 
log toleration is inherently useless against that kind of corruption.  Missing 
features can be fixed; inherent uselessness cannot.

And these are just the technical arguments.  There are many more convincing 
process-based arguments.  branch-1 is a stable branch.  We should be fixing 
bugs, not making major changes.  We should be trying to minimize the divergence 
between branch-1 and branch-2, not amplify it.  People already know how to use 
Recovery Mode.  We're not going to retrain people to use a system that does the 
same thing and is, in my opinion, more error-prone.

Let's just fix the bugs we have (I have pointed out some in this thread), get 
stuff working, and focus our efforts on the future, not the past.
                
> Further improvement on recovery mode and edit log toleration in branch-1
> ------------------------------------------------------------------------
>
>                 Key: HDFS-3540
>                 URL: https://issues.apache.org/jira/browse/HDFS-3540
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 1.2.0
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Tsz Wo (Nicholas), SZE
>
> *Recovery Mode*: HDFS-3479 backported HDFS-3335 to branch-1.  However, the 
> recovery mode feature in branch-1 is dramatically different from the recovery 
> mode in trunk since the edit log implementations in these two branches are 
> different.  For example, there is UNCHECKED_REGION_LENGTH in branch-1 but not 
> in trunk.
> *Edit Log Toleration*: HDFS-3521 added this feature to branch-1 to remedy 
> UNCHECKED_REGION_LENGTH and to tolerate edit log corruption.
> There are overlaps between these two features.  We study potential further 
> improvement in this issue.
