[ https://issues.apache.org/jira/browse/HADOOP-4663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663891#action_12663891 ]

Konstantin Shvachko commented on HADOOP-4663:
---------------------------------------------

I think the proposed approach does not solve the problem.
It improves the current state in a sense, but does not eliminate the problem
completely.

The idea of promoting blocks from "tmp" to the real storage is necessary to
support sync(). We are trying to cover the following sequence of events:
# a client starts writing to a block and calls sync();
# the data-node does the write and the sync() and then fails;
# the data-node restarts, and the sync-ed block should appear as a valid block
on this node, because the semantics of sync demand that sync-ed data survive
failures of clients, data-nodes and the name-node.
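
To make the three steps above concrete, here is a minimal client-side sketch, assuming the FSDataOutputStream sync() entry point this issue is concerned with; the path and record contents are illustrative only:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncDurabilityExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path("/tmp/sync-demo"));
    out.writeBytes("record 1\n"); // step 1: client writes into the current block
    out.sync();                   // step 1: client asks for the data to be made durable
    // steps 2-3: if the data-node crashes and restarts here, the bytes covered
    // by sync() must still be visible as part of a valid block on that node.
    out.writeBytes("record 2\n");
    out.close();
  }
}
{code}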

It seems natural to promote blocks from tmp to the real storage during startup,
but this caused problems because, together with the sync-ed blocks, the data-node
also promotes all other potentially incomplete blocks from tmp.
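
For illustration, a minimal sketch of that startup promotion, with hypothetical directory names ("tmp" and "current"); the real data-node startup logic is of course more involved:
{code:java}
import java.io.File;

public class TmpPromotion {
  /** Move every block file left in tmp/ into the real storage directory. */
  static void promoteTmpBlocks(File dataDir) {
    File tmp = new File(dataDir, "tmp");
    File current = new File(dataDir, "current");
    File[] leftovers = tmp.listFiles();
    if (leftovers == null) {
      return;                       // nothing was in flight when we went down
    }
    for (File f : leftovers) {
      // This is the problem being discussed: we cannot tell a sync-ed block
      // from a half-written one, so everything gets promoted.
      f.renameTo(new File(current, f.getName()));
    }
  }
}
{code}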

One source of incomplete blocks in tmp is the internal block replication 
initiated by the name-node but not completed on the data-node due to a failure. 

(I) The proposal is to divide tmp into two directories,
"blocks_being_replicated" and "blocks_being_written". This excludes partially
replicated blocks from being promoted to the real storage during data-node restarts.
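
A sketch of what (I) amounts to, using the two directory names from the proposal and otherwise hypothetical class and method names:
{code:java}
import java.io.File;

public class SplitTmpLayout {
  private final File beingReplicated;  // "blocks_being_replicated"
  private final File beingWritten;     // "blocks_being_written"

  SplitTmpLayout(File dataDir) {
    beingReplicated = new File(dataDir, "blocks_being_replicated");
    beingWritten = new File(dataDir, "blocks_being_written");
  }

  /** Pick the temporary directory for a new replica. */
  File tmpDirFor(boolean isReplicationRequest) {
    return isReplicationRequest ? beingReplicated : beingWritten;
  }

  /** On restart, partially replicated blocks are simply dropped ... */
  void cleanupOnStartup() {
    deleteContents(beingReplicated);
    // ... while blocks_being_written is still promoted wholesale, which is
    // exactly the remaining gap pointed out below: failed client writes get
    // promoted too.
  }

  private static void deleteContents(File dir) {
    File[] files = dir.listFiles();
    if (files == null) return;
    for (File f : files) f.delete();
  }
}
{code}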

But this does not cover another source of incomplete blocks: a regular block
write by a client. It can also fail, remain incomplete, and be promoted to the
main storage during data-node restart.
The question is, why do we not care about these blocks?
Why can't they cause the same problems as the ones that are being replicated?

Suppose I do not use sync at all. The incomplete blocks will still be promoted to
the real storage and reported to the name-node, and the name-node will have to
process them and eventually remove most of them. Isn't that a degradation in
performance?

(II) I'd rather consider dividing into two directories, one holding "transient"
and the other "persistent" blocks. The transient blocks should be removed
during startup, and the persistent ones promoted into the real storage. Once
sync-ed, a block should be moved into the persistent directory.
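
A sketch of (II), using the "transient" and "persistent" names from above and an assumed "current" directory for the real storage; the key point is that sync() is the event that moves a replica across the transient/persistent boundary:
{code:java}
import java.io.File;

public class TransientPersistentLayout {
  private final File transientDir;
  private final File persistentDir;
  private final File current;

  TransientPersistentLayout(File dataDir) {
    transientDir = new File(dataDir, "transient");
    persistentDir = new File(dataDir, "persistent");
    current = new File(dataDir, "current");
  }

  /** Called on the first sync() of a block: make it durable across restarts. */
  boolean markPersistent(String blockFileName) {
    return new File(transientDir, blockFileName)
        .renameTo(new File(persistentDir, blockFileName));
  }

  /** On startup: transient blocks are garbage, persistent ones are promoted. */
  void recover() {
    for (File f : listOrEmpty(transientDir)) f.delete();
    for (File f : listOrEmpty(persistentDir)) {
      f.renameTo(new File(current, f.getName()));
    }
  }

  private static File[] listOrEmpty(File dir) {
    File[] files = dir.listFiles();
    return files == null ? new File[0] : files;
  }
}
{code}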

(III) Another variation is to promote sync-ed (persistent) blocks directly to
the main storage. The problem here is dealing with blockReceived and
blockReports, but I feel it can be solved.

(IV) Yet another approach is to keep the storage structure unchanged and write
each sync-ed (finalized) portion of the block into the main storage by appending
that portion to the real block file. This will require extra disk I/O for
merging files (instead of renaming), but will let us discard everything that is
in tmp, as we do now.
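
A sketch of (IV), with hypothetical file and method names; the copy of the newly finalized byte range from the tmp replica onto the end of the real block file is the extra disk I/O mentioned above, and tmp can still be discarded wholesale on restart:
{code:java}
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;

public class SyncByAppend {
  /**
   * Copy bytes [alreadyMerged, syncedLength) of the tmp replica onto the end
   * of the real block file, e.g. on each sync().
   */
  static long appendSyncedPortion(String tmpBlock, String realBlock,
                                  long alreadyMerged, long syncedLength)
      throws IOException {
    try (FileChannel src = new FileInputStream(tmpBlock).getChannel();
         FileChannel dst = new FileOutputStream(realBlock, true).getChannel()) {
      long toCopy = syncedLength - alreadyMerged;
      long copied = 0;
      while (copied < toCopy) {
        copied += src.transferTo(alreadyMerged + copied, toCopy - copied, dst);
      }
      return syncedLength;  // caller remembers how much has been merged so far
    }
  }
}
{code}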

What we are trying to do with the directories is to assign properties to the
block replicas and make those properties survive crashes. Previously there were
just two properties, final and transient, so we had two directories: tmp and
main. Now we need a third property, which says the block is persistent but not
finalized yet, so we tend to introduce yet another directory. I think this is
too literal an approach, because there are plenty of other ways to implement
boolean properties on entities.
Why are we not considering them?
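
For example, one such alternative, sketched here with a hypothetical naming convention, is to record the "persistent but not yet finalized" property as a small marker file next to the replica instead of introducing a third directory:
{code:java}
import java.io.File;
import java.io.IOException;

public class ReplicaFlags {
  /** Mark a replica as persistent the first time it is sync-ed. */
  static void setPersistent(File blockFile) throws IOException {
    new File(blockFile.getParentFile(), blockFile.getName() + ".persist")
        .createNewFile();
  }

  /** Checked during startup to decide whether to promote or delete a replica. */
  static boolean isPersistent(File blockFile) {
    return new File(blockFile.getParentFile(),
        blockFile.getName() + ".persist").exists();
  }
}
{code}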

Also worth noting: this issue directly affects appends, because a replica being
appended is first copied to the tmp directory and then treated as described
above.


> Datanode should delete files under tmp when upgraded from 0.17
> --------------------------------------------------------------
>
>                 Key: HADOOP-4663
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4663
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.0
>            Reporter: Raghu Angadi
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.18.3
>
>         Attachments: deleteTmp.patch, deleteTmp2.patch, deleteTmp_0.18.patch
>
>
> Before 0.18, when the Datanode restarts, it deletes the files under the
> data-dir/tmp directory since these files are not valid anymore. But in 0.18 it
> moves these files to the normal directory, incorrectly making them valid
> blocks. One of the following would work:
> - remove the tmp files during upgrade, or
> - if the files under /tmp are in pre-18 format (i.e. no generation stamp),
> delete them.
> Currently the effect of this bug is that these files end up failing block
> verification and eventually get deleted, but they cause incorrect
> over-replication at the namenode before that.
> Also it looks like our policy regarding files under tmp needs to be defined
> better. Right now there are probably one or two more bugs with it. Dhruba,
> please file them if you remember.
