[ 
https://issues.apache.org/jira/browse/HDFS-9381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001756#comment-15001756
 ] 

Walter Su commented on HDFS-9381:
---------------------------------

bq. Another thought is, how about when pendingReplication decrement called and 
if we have still entry left in pending Replications, then we can force timeout 
quickly for that block to move to neededReplications if block is in striped 
mode. Because in striped mode our decision is to allow only one block 
replication at once. Thoughts?
We allow one ErasureCodingWork at a time, but one work item can recover 
multiple internal blocks. These internal blocks can be reported at different 
times. When one internal block is reported, we still need to wait for the 
others to be reported instead of forcing a timeout.
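
To illustrate the point, here is a minimal sketch of the bookkeeping I have in 
mind. It is not the actual PendingReplicationBlocks code: the class name, the 
method names, and the use of a plain map keyed by a block group id are all 
assumptions made for the example. It only shows that the entry is released 
when the last internal block is reported, not on the first report.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for pending-recovery bookkeeping. */
class PendingEcRecoverySketch {
  // block group id -> number of internal blocks still awaited (assumed layout)
  private final Map<Long, Integer> pending = new HashMap<>();

  /** Record that one recovery work item will rebuild numInternalBlocks blocks. */
  void schedule(long blockGroupId, int numInternalBlocks) {
    pending.merge(blockGroupId, numInternalBlocks, Integer::sum);
  }

  /**
   * One recovered internal block was reported.
   * Returns true only when nothing is outstanding for the group any more.
   */
  boolean reportInternalBlock(long blockGroupId) {
    Integer left = pending.get(blockGroupId);
    if (left == null) {
      return true; // nothing was pending for this group
    }
    if (left <= 1) {
      pending.remove(blockGroupId); // last awaited internal block arrived
      return true;
    }
    pending.put(blockGroupId, left - 1);
    return false; // keep waiting for the others; do not force a timeout
  }

  /** Number of internal blocks still awaited for the group. */
  int outstanding(long blockGroupId) {
    return pending.getOrDefault(blockGroupId, 0);
  }
}
```

With two internal blocks scheduled, the first report leaves one outstanding 
and the group stays pending; only the second report clears it.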

bq. maybe we need to have an "inactive" pending replication list, and 
periodically promote a block to the active list when EC recovery is done?
Have a fake pending list instead of manipulating the old one. Interesting.

The purpose here is to reduce the time the lock is held for 
{{neededReplications}} and to remove the busy waiting on the previous ECWork. 
But I think {{replicationRecheckInterval}} is 3s by default, so a lock-unlock 
operation can't hurt NN performance very much unless there are many striped 
blocks of this kind. But I think they are rare? We get a block of this kind 
when we are short of DNs: we can't choose enough DNs to schedule the recovery 
at once, so we schedule twice.
And after HDFS-8966, the effect of this issue is smaller.
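
To make the cost concrete, here is a rough sketch of one recheck pass. It is 
not BlockManager code: {{schedulePass}}, the set of block ids, and the pending 
count map are hypothetical names for this example. A block with pending work 
is skipped and stays queued, so it is looked at again on the next pass, which 
is the busy waiting discussed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of one recheck pass over the needed set. */
class RecheckSketch {
  /**
   * Returns the blocks for which work can be scheduled in this pass.
   * Blocks that still have pending recovery are skipped and stay in needed.
   */
  static List<Long> schedulePass(Set<Long> needed, Map<Long, Integer> pendingNum) {
    List<Long> scheduled = new ArrayList<>();
    for (Long block : needed) {
      if (pendingNum.getOrDefault(block, 0) > 0) {
        continue; // wait for the previous recovery to finish
      }
      scheduled.add(block);
    }
    needed.removeAll(scheduled); // skipped blocks are re-examined next pass
    return scheduled;
  }
}
```

Each skipped block costs one extra check per pass under the lock, so the 
overhead only adds up when many such blocks are queued at the same time.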

> When same block came for replication for Striped mode, we can move that block 
> to PendingReplications
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-9381
>                 URL: https://issues.apache.org/jira/browse/HDFS-9381
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: erasure-coding, namenode
>    Affects Versions: 3.0.0
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>         Attachments: HDFS-9381.00.patch
>
>
> Currently, in the replication flow for striped blocks, we just return null 
> if the block already exists in pendingReplications.
> {code}
> if (block.isStriped()) {
>   if (pendingNum > 0) {
>     // Wait the previous recovery to finish.
>     return null;
>   }
> {code}
>  If we just return null here and neededReplications contains only a few 
> blocks (by default, fewer than numliveNodes*2), the same blocks can be 
> picked again from neededReplications in the next loop, because we do not 
> remove the element from neededReplications. Since this replication process 
> has to take the fsnamesystem lock, we may spend time unnecessarily in every 
> loop. 
> So my suggestion/improvement is:
>  Instead of just returning null, how about incrementing pendingReplications 
> for this block and removing it from neededReplications? Another point to 
> consider: to add a block to pendingReplications, we generally need a target, 
> which is simply the node to which we issued the replication command. Later, 
> after the replication succeeds and the DN reports it, the block is removed 
> from pendingReplications in the NN's addBlock. 
>  Since this is a newly picked block from neededReplications, we would not 
> have selected a target yet. So which target should be passed to 
> pendingReplications if we add this block? One option I am thinking of is to 
> pass srcNode itself as the target for this special condition. If the block 
> is really missing, srcNode will not report it, so the block will not be 
> removed from pendingReplications. When it times out, it will be considered 
> for replication again, and at that point it will find an actual target to 
> replicate to as part of the regular replication flow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
