[jira] [Commented] (HDFS-9646) ErasureCodingWorker may fail when recovering data blocks with length less than the first internal block

Kai Zheng (JIRA) Tue, 19 Jan 2016 18:19:36 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-9646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15107837#comment-15107837
 ]


Kai Zheng commented on HDFS-9646:
---------------------------------

Thanks [~jingzhao] for the great update!
1. A question:
bq. // stop at least one DN to trigger recovery
I wonder whether stopping a DN is really required here. Recovery can also be 
triggered by the corruption case, where the DN is still live (not stopped). I 
haven't checked the NN-side code, but I thought that when an internal block in 
a block group is reported bad/corrupt by a client or a DN, the NN will arrange 
and schedule an EC recovery task?

2. Suggestion. I wonder if we could share the following utility between the 
client and the datanode.
{code}
+    private void addCorruptedBlock(ExtendedBlock blk, DatanodeInfo node,
+        Map<ExtendedBlock, Set<DatanodeInfo>> corruptionMap) {
+      Set<DatanodeInfo> dnSet = corruptionMap.get(blk);
+      if (dnSet == null) {
+        dnSet = new HashSet<>();
+        corruptionMap.put(blk, dnSet);
+      }
+      if (!dnSet.contains(node)) {
+        dnSet.add(node);
+      }
+    }
{code}
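As a rough sketch of what such a shared helper might look like (the class name 
{{CorruptionMapUtil}} and the generic signature are only illustrative and not 
from the patch; in HDFS the type parameters would be {{ExtendedBlock}} and 
{{DatanodeInfo}}, and this assumes Java 8 for {{computeIfAbsent}}):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical shared helper; the name and location are assumptions, not part of the patch. */
public final class CorruptionMapUtil {
  private CorruptionMapUtil() {
  }

  /** Records that {@code node} reported or holds a corrupt copy of {@code blk}. */
  public static <B, N> void addCorruptedBlock(B blk, N node,
      Map<B, Set<N>> corruptionMap) {
    // computeIfAbsent creates the set on first use; Set.add already ignores
    // duplicates, so no contains() check is needed.
    corruptionMap.computeIfAbsent(blk, k -> new HashSet<>()).add(node);
  }
}
```

Both the client and the datanode could then call the same static method 
instead of each keeping a private copy.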
3. Minor. Maybe the {{isDebugEnabled}} condition isn't necessary here, since 
the message is a constant string. The same applies to some other similar 
places.
{code}
-      if (DFSClient.LOG.isDebugEnabled()) {
-        DFSClient.LOG.debug("Exception during striped read task", e);
+      if (LOG.isDebugEnabled()) {
+        LOG.debug("Exception during striped read task", e);
       }
{code}
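To illustrate the distinction, here is a small self-contained sketch. The 
{{Log}} interface below is only a stand-in with the two methods used here, and 
the class and method names are hypothetical. The guard only pays off when the 
message has to be built before the call:

```java
public final class DebugGuardSketch {
  private DebugGuardSketch() {
  }

  /** Hypothetical stand-in for the logger; only the two methods used below. */
  public interface Log {
    boolean isDebugEnabled();

    void debug(Object message, Throwable t);
  }

  /** Constant message: nothing is built before the call, so no guard is needed. */
  public static void logReadFailure(Log log, Throwable e) {
    log.debug("Exception during striped read task", e);
  }

  /** Concatenated message: the guard skips building the string when debug is off. */
  public static void logReadFailure(Log log, Object block, Throwable e) {
    if (log.isDebugEnabled()) {
      log.debug("Exception during striped read task for " + block, e);
    }
  }
}
```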

> ErasureCodingWorker may fail when recovering data blocks with length less 
> than the first internal block
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-9646
>                 URL: https://issues.apache.org/jira/browse/HDFS-9646
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: erasure-coding
>    Affects Versions: 3.0.0
>            Reporter: Takuya Fukudome
>            Assignee: Jing Zhao
>            Priority: Critical
>         Attachments: HDFS-9646.000.patch, HDFS-9646.001.patch, 
> HDFS-9646.002.patch, HDFS-9646.003.patch, test-reconstruct-stripe-file.patch
>
>
> This is reported by [~tfukudom]: ErasureCodingWorker may fail with the 
> following exception when recovering a non-full internal block.
> {code}
> 2016-01-06 11:14:44,740 WARN  datanode.DataNode (ErasureCodingWorker.java:run(467)) - Failed to recover striped block: BP-987302662-172.29.4.13-1450757377698:blk_-9223372036854322288_29751
> java.io.IOException: Transfer failed for all targets.
>         at org.apache.hadoop.hdfs.server.datanode.erasurecode.ErasureCodingWorker$ReconstructAndTransferBlock.run(ErasureCodingWorker.java:455)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
