[ 
https://issues.apache.org/jira/browse/HDFS-11674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005886#comment-16005886
 ] 

Vinayakumar B commented on HDFS-11674:
--------------------------------------

bq. Could you please clarify how this part works? getBlockLocations sorts the 
blocks by network distance from the caller, randomizing replicas at the same 
distance. So lastBlock.getLocations()\[2\] may be the first replica in the 
pipeline some times.

In the code below, the block locations were queried first and then explicitly set 
as the pipeline for the test's purposes. Also note that no 'sorting on distance' 
is done for append calls; it is currently done only for 'getBlockLocations()' 
calls. Maybe that could be done in a follow-up Jira.
{code:java}
/*
 * Reset the pipeline for the append in such a way that, datanode which is
 * down is one of the mirror, not the first datanode.
 */
HdfsBlockLocation blockLocation = (HdfsBlockLocation) fs.getClient()
    .getBlockLocations(file.toString(), 0, BLOCK_SIZE)[0];
LocatedBlock lastBlock = blockLocation.getLocatedBlock();
// ...
DFSTestUtil.setPipeline((DFSOutputStream) os.getWrappedStream(),
  lastBlock);{code}
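The reordering the test relies on can be modeled with a plain array swap. This is a minimal, self-contained sketch, not HDFS code: the {{reorder}} helper and the string datanode names are hypothetical, standing in for the {{DatanodeInfo[]}} carried by the located block. The idea is the same as in the snippet above: if the down datanode happens to be first, move it out of the first slot so it becomes a mirror.

```java
import java.util.Arrays;

public class PipelineReorder {
    /**
     * Hypothetical helper mirroring what the test arranges: ensure the
     * datanode at index downIdx is NOT first in the pipeline, swapping
     * it with the first entry if necessary.
     */
    static String[] reorder(String[] pipeline, int downIdx) {
        String[] out = Arrays.copyOf(pipeline, pipeline.length);
        if (downIdx == 0 && out.length > 1) {
            String tmp = out[0];
            out[0] = out[1];
            out[1] = tmp;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] dns = {"dn1", "dn2", "dn3"};
        // dn1 (index 0) is down: after reordering it is a mirror, not first.
        System.out.println(Arrays.toString(reorder(dns, 0)));
        // prints [dn2, dn1, dn3]
    }
}
```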

bq. I ran this test 5 times and it timed out once waiting for the file to be 
closed. I didn't debug it further though.
I will also check again; not sure what's wrong. But I am sure that it's not 
because of the current change or test. Could you paste the console logs if possible?

> reserveSpaceForReplicas is not released if append request failed due to 
> mirror down and replica recovered
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-11674
>                 URL: https://issues.apache.org/jira/browse/HDFS-11674
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: Vinayakumar B
>            Assignee: Vinayakumar B
>            Priority: Critical
>              Labels: release-blocker
>         Attachments: HDFS-11674-01.patch, HDFS-11674-02.patch
>
>
> Scenario:
> 1. 3 Node cluster with 
> "dfs.client.block.write.replace-datanode-on-failure.policy"  as DEFAULT
> Block is written with x data.
> 2. One of the Datanodes, NOT the first DN, is down
> 3. Client tries to append data to block and fails since one DN is down.
> 4. calls recoverLease() on the file.
> 5. Successful recovery happens.
> Issue:
> 1. DNs to which the client was connected before encountering the mirror-down 
> failure will have reservedSpaceForReplicas incremented, BUT never decremented. 
> 2. So in the long run, all of the DN's space will be held in 
> reservedSpaceForReplicas, resulting in OutOfSpace errors.
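The leak described in the issue follows a classic reserve/release asymmetry: space is reserved when the append pipeline is set up, but the release happens only on the normal finalize path, so the recovery path skips it. The class below is an illustrative model of that bug pattern, not actual DataNode code; all names here are hypothetical.

```java
public class ReservedSpaceModel {
    private long reserved;

    // Reservation taken when the append pipeline is set up.
    void reserveForReplica(long bytes) { reserved += bytes; }

    // Normal path: finalizing the replica returns the reservation.
    void finalizeReplica(long bytes) { reserved -= bytes; }

    // Buggy recovery path: the replica is recovered, but the reservation
    // taken at append setup is never returned.
    void recoverReplicaBuggy() { /* reservation leaks */ }

    // Fixed path: recovery also releases the reservation.
    void recoverReplicaFixed(long bytes) { reserved -= bytes; }

    long getReserved() { return reserved; }

    public static void main(String[] args) {
        ReservedSpaceModel dn = new ReservedSpaceModel();
        dn.reserveForReplica(1024);
        dn.recoverReplicaBuggy();
        System.out.println("leaked=" + dn.getReserved());     // prints leaked=1024
        dn.recoverReplicaFixed(1024);
        System.out.println("after fix=" + dn.getReserved());  // prints after fix=0
    }
}
```

Repeated over many failed appends, the buggy path accumulates reservations until all of the DN's space is tied up, which is the OutOfSpace symptom described above.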



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
