[ 
https://issues.apache.org/jira/browse/HDFS-3540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13444659#comment-13444659
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-3540:
----------------------------------------------

If I have not missed anything, there are two risks in the branch-1 Recovery 
Mode feature:
# If there is a stray OP_INVALID byte, it could be misinterpreted as an 
end-of-log and lead to silent data loss.
# Recovery Mode does not consider the corruption length.  If an edit log is 
corrupted near the beginning and the admin mistakenly selects "stop reading" 
in Recovery Mode, then a large portion of the edit log is ignored.  This could 
cause unnecessary data loss even if the edit log has been backed up, since 
datanodes will delete data.  In many cases, such data loss could be prevented 
or reduced because the edit log could possibly be recovered by other means.  
This case is arguably an operational mistake.  However, Recovery Mode makes 
such a mistake possible.
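
To illustrate risk #1, here is a minimal Python sketch.  It uses a made-up 
record format (one opcode byte, one length byte, then the payload) and assumes 
0xFF as the OP_INVALID marker; it is not the real branch-1 edit log layout or 
reader, and the function names are hypothetical.

```python
OP_INVALID = 0xFF  # assumed marker value for this toy format


def encode(ops):
    """Encode (opcode, payload) pairs as: opcode, 1-byte length, payload."""
    out = bytearray()
    for opcode, payload in ops:
        out += bytes([opcode, len(payload)]) + payload
    return bytes(out)


def read_until_invalid(log):
    """Naive reader: treats the first OP_INVALID opcode as end-of-log.

    Returns the decoded ops and the number of bytes that were never read.
    """
    ops, pos = [], 0
    while pos < len(log):
        opcode = log[pos]
        if opcode == OP_INVALID:
            break  # everything after this byte is silently ignored
        length = log[pos + 1]
        ops.append((opcode, log[pos + 2 : pos + 2 + length]))
        pos += 2 + length
    return ops, len(log) - pos


good = encode([(1, b"a"), (2, b"bc"), (3, b"d")])

# Flip the opcode of the second record into a stray OP_INVALID byte.
corrupted = bytearray(good)
corrupted[3] = OP_INVALID

ops, unread = read_until_invalid(bytes(corrupted))
print(ops)     # [(1, b'a')]
print(unread)  # 7
```

The reader reports no error: two of the three operations are dropped and 7 
bytes are left unread, which is the silent data loss described in #1.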

The Edit Log Toleration feature does not have these two risks if the toleration 
length is set to 0 (or a small number).  Edit Log Toleration always checks all 
bytes in the edit log, so #1 won't happen.  For #2, the length of corrupted 
data being tolerated is limited by the toleration length.  If an edit log is 
corrupted in the beginning and the corrupted length is large, then it will 
throw an exception.
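
The check described above can be sketched roughly as follows.  This is only an 
illustration under stated assumptions: trailing 0xFF bytes are treated as 
padding, `error_pos` is the offset where decoding first failed, and the names 
`corrupted_length`, `check_toleration` and `EditLogCorruptionError` are 
hypothetical, not the actual branch-1 code.

```python
OP_INVALID = 0xFF  # assumed padding / marker byte for this sketch


class EditLogCorruptionError(Exception):
    """Raised when the corrupted region exceeds the toleration length."""


def corrupted_length(log, error_pos):
    """Count the bytes from error_pos up to the last non-padding byte.

    Every byte after error_pos is examined, so a stray OP_INVALID followed
    by real data is counted as corruption rather than as end-of-log.
    """
    end = len(log)
    while end > error_pos and log[end - 1] == OP_INVALID:
        end -= 1
    return end - error_pos


def check_toleration(log, error_pos, toleration_length):
    """Return the corrupted length if tolerable, otherwise raise."""
    length = corrupted_length(log, error_pos)
    if length > toleration_length:
        raise EditLogCorruptionError(
            "corrupted length %d exceeds toleration length %d"
            % (length, toleration_length)
        )
    return length


# A stray OP_INVALID at offset 3, followed by 6 more data bytes and padding.
log = bytes([1, 1, 97, OP_INVALID, 2, 98, 99, 3, 1, 100]) + b"\xff" * 4

print(corrupted_length(log, 3))  # 7

try:
    check_toleration(log, 3, toleration_length=0)
except EditLogCorruptionError as e:
    print("refused to load:", e)
```

With a toleration length of 0, only a log whose remainder is pure padding is 
accepted; any other leftover bytes cause a failure instead of being dropped 
silently.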

Therefore, I suggest removing Recovery Mode from branch-1 and changing the 
default toleration length to 0.

                
> Further improvement on recovery mode and edit log toleration in branch-1
> ------------------------------------------------------------------------
>
>                 Key: HDFS-3540
>                 URL: https://issues.apache.org/jira/browse/HDFS-3540
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 1.2.0
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Tsz Wo (Nicholas), SZE
>
> *Recovery Mode*: HDFS-3479 backported HDFS-3335 to branch-1.  However, the 
> recovery mode feature in branch-1 is dramatically different from the recovery 
> mode in trunk since the edit log implementations in these two branches are 
> different.  For example, there is UNCHECKED_REGION_LENGTH in branch-1 but not 
> in trunk.
> *Edit Log Toleration*: HDFS-3521 added this feature to branch-1 to remedy 
> UNCHECKED_REGION_LENGTH and to tolerate edit log corruption.
> There are overlaps between these two features.  This issue studies potential 
> further improvements.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
