[ 
https://issues.apache.org/jira/browse/HDFS-3540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13444776#comment-13444776
 ] 

Colin Patrick McCabe commented on HDFS-3540:
--------------------------------------------

bq. If I have not missed anything, there are two risks in the branch-1 Recovery 
Mode feature:  If there is a stray OP_INVALID byte, it could be misinterpreted 
as an end-of-log and lead to silent data loss.

Recovery Mode will always prompt before doing anything that could lead to data 
loss.  So no, stray {{OP_INVALID}} bytes will not lead to silent data loss.

Actually, looking at change 1349086, which was introduced by HDFS-3521, I see 
that it broke end-of-file checking by default.  Since 
{{dfs.namenode.edits.toleration.length}} is -1 by default, 
{{FSEditLog#checkEndOfLog}} is never invoked.  However, this is not a problem 
with Recovery Mode; it's a problem with change 1349086.
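
A minimal sketch of the effect described above, not the actual branch-1 code.
The class and method names are hypothetical, and treating {{OP_INVALID}} (-1)
as the padding byte is an assumption.  With a negative toleration length the
end-of-log check is skipped, so bytes after the first {{OP_INVALID}} are never
examined.

```java
final class EndOfLogSketch {
    // Assumption: the terminator and the padding are both the byte -1.
    static final byte OP_INVALID = (byte) -1;

    /** Index of the first non-padding byte at or after 'from', or -1 if none. */
    static int firstNonPadding(byte[] log, int from) {
        for (int i = from; i < log.length; i++) {
            if (log[i] != OP_INVALID) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Hypothetical guard: the end of the log is only verified when the
     * toleration length is non-negative.
     */
    static boolean endLooksClean(byte[] log, int terminatorPos, int tolerationLength) {
        if (tolerationLength < 0) {
            return true; // check skipped: trailing data goes unnoticed
        }
        return firstNonPadding(log, terminatorPos) == -1;
    }

    public static void main(String[] args) {
        byte[] stray = {1, 2, -1, 7, 7, -1}; // data after a stray OP_INVALID
        System.out.println("toleration -1: " + endLooksClean(stray, 2, -1));
        System.out.println("toleration  0: " + endLooksClean(stray, 2, 0));
    }
}
```

In this model, a toleration length of -1 lets the stray bytes pass unnoticed,
while 0 reports them.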

bq. Recovery Mode does not consider the corruption length.

Recovery Mode does consider the corruption length.  The location at which the 
problem occurred is printed out.  This is the message "Failed to parse edit log 
(<file name>) at position <position>, edit log length is <length>..."  This 
information is provided to allow the system administrator to make an informed 
decision.
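
As a rough illustration of how an administrator might use that message, the
sketch below extracts the position and length and reports how many bytes lie
beyond the failure point.  The class name and the sample file path are
hypothetical, and the pattern is based only on the message text quoted above.
The result is an upper bound on the affected bytes, since it may include
padding.

```java
final class ParseFailureSketch {
    // Matches only the part of the message quoted in this comment.
    private static final java.util.regex.Pattern POSITION_AND_LENGTH =
        java.util.regex.Pattern.compile(
            "at position (\\d+), edit log length is (\\d+)");

    /** Bytes from the failure position to the end of the file, or -1 if no match. */
    static long unreadBytes(String message) {
        java.util.regex.Matcher m = POSITION_AND_LENGTH.matcher(message);
        if (!m.find()) {
            return -1;
        }
        long position = Long.parseLong(m.group(1));
        long length = Long.parseLong(m.group(2));
        return length - position;
    }

    public static void main(String[] args) {
        // Hypothetical sample line; the path and numbers are made up.
        String sample = "Failed to parse edit log (/hypothetical/edits) "
            + "at position 1024, edit log length is 4096";
        System.out.println(unreadBytes(sample));
    }
}
```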

bq. Therefore, I suggest to remove Recovery Mode from branch-1 and change the 
default toleration length to 0.

Recovery Mode has already proven itself useful in the field in code lines 
derived from branch-1.  I don't see any reason to remove it.

I agree that {{dfs.namenode.edits.toleration.length}} should be 0 by default.

At the end of the day, both edit log toleration and Recovery Mode can cause 
data loss.  The difference is that Recovery Mode will prompt the system 
administrator beforehand, and edit log toleration will not.  This is the 
reason why I opposed edit log toleration originally, and it's the reason why I 
believe it should be off by default now.  Silent data loss is not a feature-- 
not one that we want, anyway.
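
The contrast can be sketched as follows.  This is an illustrative model, not
the branch-1 implementation: the names are hypothetical, and the rule that
corruption is tolerated when its length does not exceed the toleration length
is my assumption about how the setting behaves.

```java
final class CorruptionPolicySketch {
    enum Outcome { CONTINUE, ABORT }

    /** Edit log toleration: the decision is made without asking anyone. */
    static Outcome toleration(long corruptionLength, long tolerationLength) {
        if (tolerationLength < 0) {
            return Outcome.CONTINUE; // feature off: end of log is not checked
        }
        return corruptionLength <= tolerationLength
            ? Outcome.CONTINUE
            : Outcome.ABORT;
    }

    /** Recovery Mode: the operator must confirm before anything is skipped. */
    static Outcome recoveryMode(long corruptionLength,
                                java.util.function.BooleanSupplier operatorConfirms) {
        if (corruptionLength == 0) {
            return Outcome.CONTINUE; // nothing to lose, nothing to ask
        }
        return operatorConfirms.getAsBoolean() ? Outcome.CONTINUE : Outcome.ABORT;
    }

    public static void main(String[] args) {
        System.out.println(toleration(10, -1));            // loads without a check
        System.out.println(toleration(10, 0));             // refuses to load
        System.out.println(recoveryMode(10, () -> false)); // operator declined
    }
}
```

With a default of 0, any corruption stops the load unless the administrator
explicitly opts in, either by raising the toleration length or by confirming
in Recovery Mode.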
                
> Further improvement on recovery mode and edit log toleration in branch-1
> ------------------------------------------------------------------------
>
>                 Key: HDFS-3540
>                 URL: https://issues.apache.org/jira/browse/HDFS-3540
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 1.2.0
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Tsz Wo (Nicholas), SZE
>
> *Recovery Mode*: HDFS-3479 backported HDFS-3335 to branch-1.  However, the 
> recovery mode feature in branch-1 is dramatically different from the recovery 
> mode in trunk since the edit log implementations in these two branches are 
> different.  For example, there is UNCHECKED_REGION_LENGTH in branch-1 but not 
> in trunk.
> *Edit Log Toleration*: HDFS-3521 added this feature to branch-1 to remedy 
> UNCHECKED_REGION_LENGTH and to tolerate edit log corruption.
> There are overlaps between these two features.  We study potential further 
> improvement in this issue.
