[ https://issues.apache.org/jira/browse/HADOOP-4663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668177#action_12668177 ]

Raghu Angadi commented on HADOOP-4663:
--------------------------------------

@Dhruba, could you check whether the following analysis of a possible 
corruption is correct?

Say a block is being written and the gen stamp is consistent across all three 
datanodes (the common case):

 * Say the block sizes after a cluster restart are x+5, x+10, and x+15 (on the 
three datanodes respectively). This does not mean the checksum files are 
correct, since a DataNode can be killed at any time and even the OS could 
restart.
 * When the datanodes rejoin the cluster, the blocks on D1 and D2 will be 
deleted since they are smaller than the block on D3. Later the block on D3 
will be reported as corrupt since its checksums don't match. This is hard 
corruption.
 * Note that we will lose any data on the block that was synced earlier (see 
the sketch after this list).
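To make the sequence concrete, here is a minimal sketch of the scenario above. The class and field names are made up, and the keep-longest rule is my reading of the recovery behavior, not the actual code path:

{code:java}
// Hypothetical simulation of the restart scenario: three replicas of
// different lengths, the shorter two get deleted, and the survivor later
// fails checksum verification, so all copies of the data are lost.
import java.util.*;

public class RestartRecoverySketch {

    /** A replica of one block, as seen after a cluster restart. */
    static class Replica {
        final String datanode;
        final long length;
        final boolean checksumValid; // false if the node died mid-write
        Replica(String dn, long len, boolean ok) {
            datanode = dn; length = len; checksumValid = ok;
        }
    }

    public static void main(String[] args) {
        long x = 1000;
        List<Replica> replicas = Arrays.asList(
            new Replica("D1", x + 5,  true),
            new Replica("D2", x + 10, true),
            new Replica("D3", x + 15, false)); // longest, but torn write

        // Step 1: keep only the longest replica; shorter ones are deleted.
        Replica longest = Collections.max(replicas,
            Comparator.comparingLong((Replica r) -> r.length));
        for (Replica r : replicas) {
            if (r != longest) {
                System.out.println("deleting shorter replica on " + r.datanode);
            }
        }

        // Step 2: block verification later finds the sole survivor corrupt.
        if (!longest.checksumValid) {
            System.out.println("replica on " + longest.datanode
                + " fails checksum -> hard corruption, all copies lost");
        }
    }
}
{code}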

> I am unable to visualize a scenario where this could cause a bug. 

That does not imply it is correct. The most important job of HDFS is to keep 
the data intact; that should take priority over new features or schedules. 
IMHO the current approach of _"it is correct until proven wrong"_ does not 
really suit critical parts. For example, a couple of months back there were no 
known corruption issues, but we later saw many such issues.

I am not saying we can prove everything. But we should be conservative and do 
only what is already known to be correct. In this case, the block files are 
valid only up to the last sync, so truncating to the last sync (or to some 
known-to-be-good length) is better.
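A rough sketch of what I mean by truncating to a known-to-be-good length. This assumes a simplified layout of one CRC32 per 512-byte chunk with no meta-file header; the real on-disk format differs:

{code:java}
// Scan the block file chunk by chunk against its checksum file and truncate
// everything after the last chunk whose CRC still matches.
import java.io.*;
import java.util.zip.CRC32;

public class TruncateToValidLength {
    static final int CHUNK = 512; // bytes covered by each CRC32 checksum

    /** Returns the length of the longest checksum-verified prefix. */
    static long lastValidOffset(File block, File meta) throws IOException {
        try (DataInputStream sums =
                 new DataInputStream(new FileInputStream(meta));
             FileInputStream data = new FileInputStream(block)) {
            byte[] buf = new byte[CHUNK];
            long good = 0;
            int n;
            while ((n = data.read(buf)) > 0) {
                CRC32 crc = new CRC32();
                crc.update(buf, 0, n);
                long expected;
                try {
                    expected = sums.readInt() & 0xFFFFFFFFL;
                } catch (EOFException e) {
                    break; // no checksum for this chunk: stop here
                }
                if (crc.getValue() != expected) break;
                good += n;
            }
            return good;
        }
    }

    public static void main(String[] args) throws IOException {
        File block = new File(args[0]), meta = new File(args[1]);
        long good = lastValidOffset(block, meta);
        try (RandomAccessFile raf = new RandomAccessFile(block, "rw")) {
            raf.setLength(good); // drop the unverified tail
        }
    }
}
{code}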



> Datanode should delete files under tmp when upgraded from 0.17
> --------------------------------------------------------------
>
>                 Key: HADOOP-4663
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4663
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.0
>            Reporter: Raghu Angadi
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.1
>
>         Attachments: deleteTmp.patch, deleteTmp2.patch, deleteTmp_0.18.patch, 
> handleTmp1.patch
>
>
> Before 0.18, when a Datanode restarts, it deletes files under the 
> data-dir/tmp directory since these files are not valid anymore. But in 0.18 
> it moves these files to the normal directory, incorrectly making them valid 
> blocks. One of the following would work:
> - remove the tmp files during upgrade, or
> - if the files under /tmp are in the pre-18 format (i.e. no generation 
> stamp), delete them.
> Currently the effect of this bug is that these files end up failing block 
> verification and eventually get deleted, but they cause incorrect 
> over-replication at the namenode before that.
> Also it looks like our policy regarding treating files under tmp needs to be 
> defined better. Right now there are probably one or two more bugs with it. 
> Dhruba, please file them if you remember.
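
For illustration, a minimal sketch of the second option in the quoted description above. The blk_<id>_<genstamp> naming check and the tmp path are assumptions for illustration, not the actual upgrade code:

{code:java}
// During upgrade, delete files under data-dir/tmp whose names are in the
// pre-0.18 format, i.e. carry no generation stamp.
import java.io.File;
import java.util.regex.Pattern;

public class CleanPre18Tmp {
    // Assumed 0.18+ naming: blk_<id>_<genstamp>, optionally with .meta
    private static final Pattern WITH_GENSTAMP =
        Pattern.compile("blk_-?\\d+_\\d+(\\.meta)?");

    public static void cleanTmp(File dataDir) {
        File tmp = new File(dataDir, "tmp");
        File[] files = tmp.listFiles();
        if (files == null) return; // no tmp dir: nothing to do
        for (File f : files) {
            if (!WITH_GENSTAMP.matcher(f.getName()).matches()) {
                System.out.println("deleting pre-0.18 tmp file: " + f);
                f.delete();
            }
        }
    }

    public static void main(String[] args) {
        cleanTmp(new File(args[0]));
    }
}
{code}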
