[ https://issues.apache.org/jira/browse/HDFS-9646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15105368#comment-15105368 ]

Kai Zheng commented on HDFS-9646:
---------------------------------

The change to {{reportCheckSumFailure}} clearly corrects a coding error: the 
corrupted-block map should be fully iterated so that every entry is processed. Good catch!
{code}
  /**
   * DFSInputStream reports checksum failure.
   * Case I : client has tried multiple data nodes and at least one of the
   * attempts has succeeded. We report the other failures as corrupted block to
   * namenode.
   * Case II: client has tried out all data nodes, but all failed. We
   * only report if the total number of replica is 1. We do not
   * report otherwise since this maybe due to the client is a handicapped client
   * (who can not read).
   * @param corruptedBlockMap map of corrupted blocks
   * @param dataNodeCount number of data nodes who contains the block replicas
   */
  void reportCheckSumFailure(
      Map<ExtendedBlock, Set<DatanodeInfo>> corruptedBlockMap,
      int dataNodeCount)
{code}
Looking at the comment and the method signature, it is a bit confusing. The map 
contains multiple blocks that may or may not need to be reported, and by the 
intended logic each block seems to need its own {{dataNodeCount}} value to decide 
whether to report it. However, only a single {{dataNodeCount}} is passed in, and 
at the call sites it is taken from the locations of the current block. I'm not 
sure whether this is related to the reported issue, or whether it is better 
handled here, though.
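To make the concern concrete, here is a rough sketch (not the actual patch; the 
standalone class, the explicit {{namenode}} parameter and the local flag names are 
placeholders of mine) of how the corrected logic might look once every entry of the 
map is iterated and each qualifying block is reported via 
{{ClientProtocol#reportBadBlocks}}:
{code}
import java.io.IOException;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.hdfs.protocol.ClientProtocol;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.hdfs.protocol.ExtendedBlock;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;

// Sketch only: the real logic lives inside DFSInputStream, not in a
// standalone class like this.
class ChecksumFailureReportSketch {

  static void reportCheckSumFailure(
      ClientProtocol namenode,
      Map<ExtendedBlock, Set<DatanodeInfo>> corruptedBlockMap,
      int dataNodeCount) throws IOException {
    if (corruptedBlockMap.isEmpty()) {
      return;
    }
    // Iterate ALL entries; processing only the first entry and then
    // clearing the whole map is the coding error the patch fixes.
    for (Map.Entry<ExtendedBlock, Set<DatanodeInfo>> entry
        : corruptedBlockMap.entrySet()) {
      Set<DatanodeInfo> dnSet = entry.getValue();
      // Case I: some datanodes failed but at least one read succeeded.
      boolean partialFailure =
          dnSet.size() > 0 && dnSet.size() < dataNodeCount;
      // Case II: all datanodes failed and there is only one replica.
      boolean singleReplicaFailed =
          dataNodeCount == 1 && dnSet.size() == dataNodeCount;
      if (partialFailure || singleReplicaFailed) {
        DatanodeInfo[] locs = dnSet.toArray(new DatanodeInfo[0]);
        LocatedBlock[] lblocks = { new LocatedBlock(entry.getKey(), locs) };
        namenode.reportBadBlocks(lblocks);
      }
    }
    corruptedBlockMap.clear();
  }
}
{code}
Note that the same {{dataNodeCount}} is applied to every block in the loop, while at 
the call sites it comes from the locations of the current block only. That is the 
part that looks questionable to me.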

> ErasureCodingWorker may fail when recovering data blocks with length less 
> than the first internal block
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-9646
>                 URL: https://issues.apache.org/jira/browse/HDFS-9646
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: erasure-coding
>    Affects Versions: 3.0.0
>            Reporter: Takuya Fukudome
>            Assignee: Jing Zhao
>            Priority: Critical
>         Attachments: HDFS-9646.000.patch, HDFS-9646.001.patch, 
> HDFS-9646.002.patch, test-reconstruct-stripe-file.patch
>
>
> This is reported by [~tfukudom]: ErasureCodingWorker may fail with the 
> following exception when recovering a non-full internal block.
> {code}
> 2016-01-06 11:14:44,740 WARN  datanode.DataNode (ErasureCodingWorker.java:run(467)) - Failed to recover striped block: BP-987302662-172.29.4.13-1450757377698:blk_-9223372036854322288_29751
> java.io.IOException: Transfer failed for all targets.
>         at org.apache.hadoop.hdfs.server.datanode.erasurecode.ErasureCodingWorker$ReconstructAndTransferBlock.run(ErasureCodingWorker.java:455)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
