[ https://issues.apache.org/jira/browse/HADOOP-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12591080#action_12591080 ]
lohit vijayarenu commented on HADOOP-2065:
------------------------------------------
When a datanode reports a block as corrupt, instead of deleting it we mark the
(datanode, block) pair as corrupt and request replication of the block. Rough
Java sketches of each piece follow the list below.
- Add a new synchronized method, something like
FSNameSystem.markAsCorrupt(Block, DatanodeDescriptor), which marks the replica
on that particular datanode as corrupt. It would also add the block to the
neededReplication queue.
- Each DatanodeDescriptor holds a set of corrupt blocks and provides methods
to look up whether a given block is corrupt.
- Modify the NumReplicas class to filter out nodes holding such corrupt copies
and report them via corruptReplicas(), similar to decommissionedReplicas().
- While choosing the source node for replication in chooseSourceDatanode(),
use only the copies that are not yet marked as corrupt.
- Inside addStoredBlock(), whenever we add a new node (replica), we also check
whether we already have a corrupt copy. If we have reached the desired
replication for this block, the corrupt copy is in excess and we invalidate it
here.
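To make the first two points concrete, here is a minimal, self-contained
sketch; it is not the real FSNameSystem code. Block, DatanodeDescriptor, and
the neededReplication set below are simplified stand-ins, and markAsCorrupt()
shows only the proposed bookkeeping.

{code:java}
import java.util.HashSet;
import java.util.Set;

// Simplified stand-in for the real Block class; only identity matters here.
class Block {
    final long blockId;
    Block(long blockId) { this.blockId = blockId; }
    @Override public boolean equals(Object o) {
        return o instanceof Block && ((Block) o).blockId == blockId;
    }
    @Override public int hashCode() { return (int) (blockId ^ (blockId >>> 32)); }
}

// Proposed addition: each datanode tracks which of its replicas are corrupt.
class DatanodeDescriptor {
    private final Set<Block> corruptBlocks = new HashSet<Block>();

    void addCorruptBlock(Block b)    { corruptBlocks.add(b); }
    boolean isBlockCorrupt(Block b)  { return corruptBlocks.contains(b); }
    void removeCorruptBlock(Block b) { corruptBlocks.remove(b); }
}

class FSNameSystem {
    // Stand-in for the real neededReplication queue.
    private final Set<Block> neededReplication = new HashSet<Block>();

    // Mark the replica on this particular datanode as corrupt (do not
    // delete it) and schedule re-replication from a good copy.
    synchronized void markAsCorrupt(Block blk, DatanodeDescriptor node) {
        node.addCorruptBlock(blk);
        neededReplication.add(blk);
    }
}
{code}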
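The counting and source-selection changes could then look like this, reusing
the stand-in types above. NumReplicas here walks a plain list of nodes rather
than the real block map, and chooseSourceDatanode() is illustrative only.

{code:java}
import java.util.List;

// Simplified replica counter: buckets each copy as live or corrupt,
// mirroring the existing decommissioned bucket.
class NumReplicas {
    private int live;
    private int corrupt;

    NumReplicas(Block blk, List<DatanodeDescriptor> nodes) {
        for (DatanodeDescriptor node : nodes) {
            if (node.isBlockCorrupt(blk)) {
                corrupt++;  // counted and reported, but never used as a source
            } else {
                live++;
            }
        }
    }

    int liveReplicas()    { return live; }
    int corruptReplicas() { return corrupt; }
}

class ReplicationSourceChooser {
    // Pick any node whose copy is not marked corrupt; returns null when
    // every copy is corrupt and no good source exists.
    static DatanodeDescriptor chooseSourceDatanode(Block blk,
                                                   List<DatanodeDescriptor> nodes) {
        for (DatanodeDescriptor node : nodes) {
            if (!node.isBlockCorrupt(blk)) {
                return node;
            }
        }
        return null;
    }
}
{code}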
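And a sketch of the addStoredBlock() check, again on top of the stand-ins
above; the fixed desired replication and the list-based block map are
assumptions for illustration.

{code:java}
import java.util.ArrayList;
import java.util.List;

class AddStoredBlockSketch {
    static final int DESIRED_REPLICATION = 3; // assumed replication factor

    // Called when a datanode reports a newly stored (good) replica.
    static void addStoredBlock(Block blk, DatanodeDescriptor newNode,
                               List<DatanodeDescriptor> nodesHoldingBlock) {
        nodesHoldingBlock.add(newNode);
        NumReplicas num = new NumReplicas(blk, nodesHoldingBlock);
        if (num.liveReplicas() < DESIRED_REPLICATION) {
            return; // still under-replicated; keep the corrupt copies for now
        }
        // Enough good copies exist, so the corrupt replicas are in excess:
        // collect them first, then invalidate, to avoid mutating the list
        // while iterating over it.
        List<DatanodeDescriptor> excess = new ArrayList<DatanodeDescriptor>();
        for (DatanodeDescriptor node : nodesHoldingBlock) {
            if (node.isBlockCorrupt(blk)) {
                excess.add(node);
            }
        }
        for (DatanodeDescriptor node : excess) {
            node.removeCorruptBlock(blk); // stand-in for telling the node to delete
            nodesHoldingBlock.remove(node);
        }
    }
}
{code}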
I think this would take care of retaining all corrupt copies. The one case
where I see a problem is the pendingReplication thread, which would keep
looping trying to replicate corrupt blocks. We could add a check there: if the
number of pending replicas for a block equals the number of corrupt copies,
remove the block from pendingReplication. A sketch of that check follows.
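Here the pending bookkeeping is reduced to a map from block to in-flight
replica count; the real pendingReplication structure is more involved, so
treat the names below as assumptions.

{code:java}
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class PendingReplicationSketch {
    // Drop blocks whose pending replica count equals the number of corrupt
    // copies: there is no good source left, so stop retrying them.
    static void prune(Map<Block, Integer> pendingReplications,
                      Map<Block, List<DatanodeDescriptor>> locations) {
        Iterator<Map.Entry<Block, Integer>> it =
                pendingReplications.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Block, Integer> entry = it.next();
            List<DatanodeDescriptor> nodes = locations.get(entry.getKey());
            if (nodes == null) {
                continue; // block no longer tracked; leave it alone
            }
            NumReplicas num = new NumReplicas(entry.getKey(), nodes);
            if (entry.getValue().intValue() == num.corruptReplicas()) {
                it.remove();
            }
        }
    }
}
{code}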
> Replication policy for corrupted block
> ---------------------------------------
>
> Key: HADOOP-2065
> URL: https://issues.apache.org/jira/browse/HADOOP-2065
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.14.1
> Reporter: Koji Noguchi
> Assignee: lohit vijayarenu
> Fix For: 0.18.0
>
>
> Thanks to HADOOP-1955, even if one of the replicas is corrupted, the block
> should get replicated from a good replica relatively fast.
> Created this ticket to continue the discussion from
> http://issues.apache.org/jira/browse/HADOOP-1955#action_12531162.
> bq. 2. Delete corrupted source replica
> bq. 3. If all replicas are corrupt, stop replication.
> For (2), it'll be nice if the namenode can delete the corrupted block if
> there's a good replica on other nodes.
> For (3), I prefer if the namenode can still replicate the block.
> Before 0.14, if a file was corrupted, users were still able to pull the
> data and decide whether they wanted to delete those files. (HADOOP-2063)
> In 0.14 and later, we cannot/don't replicate these blocks, so they eventually
> get lost.
> To make matters worse, if the corrupted file is accessed, all the corrupted
> replicas would be deleted except for one, and the block would stay at a
> replication factor of 1 forever.
>
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.