[
https://issues.apache.org/jira/browse/HDFS-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299652#comment-15299652
]
Kai Zheng commented on HDFS-9833:
---------------------------------
1. It would be good to give the following two methods consistent names (a
sketch follows the code block):
{code}
+ static public byte[] convertBlockIndices(List<Integer> blockIndices) {
+ @SuppressWarnings("unchecked")
+ byte[] blkIndices = new byte[blockIndices.size()];
+ for (int i = 0; i < blockIndices.size(); i++) {
+ blkIndices[i] = (byte) blockIndices.get(i).intValue();
+ }
+ return blkIndices;
+ }
+
+ public static List<Integer> convert(byte[] blockIndices) {
+ List<Integer> results = new ArrayList<>(blockIndices.length);
+ for (byte bt : blockIndices) {
+ results.add(Integer.valueOf(bt));
+ }
+ return results;
+ }
+
{code}
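For example, the two could simply be overloads sharing one name. Just a sketch
(the holder class name is made up here; the {{@SuppressWarnings("unchecked")}}
also looks unnecessary on a plain byte array, by the way):
{code}
import java.util.ArrayList;
import java.util.List;

public class BlockIndicesHelper { // hypothetical holder class, name is made up
  public static byte[] convertBlockIndices(List<Integer> blockIndices) {
    byte[] blkIndices = new byte[blockIndices.size()];
    for (int i = 0; i < blockIndices.size(); i++) {
      blkIndices[i] = (byte) blockIndices.get(i).intValue();
    }
    return blkIndices;
  }

  public static List<Integer> convertBlockIndices(byte[] blockIndices) {
    List<Integer> results = new ArrayList<>(blockIndices.length);
    for (byte bt : blockIndices) {
      results.add(Integer.valueOf(bt));
    }
    return results;
  }
}
{code}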
2. The HashMap here might be a little heavyweight; a plain array should work
instead (see the sketch after the code block).
{code}
HashMap<Byte, ECBlockInfo> liveDns = new HashMap<>(datanodes.length);
{code}
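Something like this, roughly (sketch only; I'm assuming the map key is the
internal block index and that the array can be sized the same way the map was):
{code}
// Sketch only: internal block indices are small non-negative bytes, so a
// plain array indexed by block index can replace the HashMap.
ECBlockInfo[] liveDns = new ECBlockInfo[datanodes.length];
...
liveDns[blockIndex] = ecBlkInfo;      // instead of liveDns.put(blockIndex, ecBlkInfo)
...
ECBlockInfo ecBlkInfo = liveDns[idx]; // null still means the block is missing
{code}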
3. The logic below appears to recompute only one block checksum at a time. If
multiple blocks are in question, the reconstruction could be done in a single
pass instead. Sure, it can be updated in follow-on tasks; a rough sketch of the
idea follows the code block.
{code}
+ for (int idx = 0; idx < numDataUnits && idx < blkIndxLen; idx++) {
+ try {
+ ECBlockInfo ecBlkInfo = liveDns.get((byte) idx);
+ if (null == ecBlkInfo) {
+ // reconstruct block and calculate checksum for missing node
+ recalculateChecksum(idx);
+ } else {
+ try {
+ ExtendedBlock block = StripedBlockUtil.constructInternalBlock(
+ blockGroup, ecPolicy.getCellSize(), numDataUnits, idx);
+ checksumBlock(block, idx, ecBlkInfo.getToken(),
+ ecBlkInfo.getDn());
+ } catch (IOException ioe) {
+ LOG.warn("Exception while reading checksum", ioe);
+ // reconstruct block and calculate checksum for the failed node
+ recalculateChecksum(idx);
+ }
+ }
+ } catch (IOException e) {
+ LOG.warn("Failed to get the checksum", e);
+ }
{code}
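For the follow-on, the idea would be roughly the following (sketch only; a
batched {{recalculateChecksum(List<Integer>)}} is hypothetical, the current
method takes a single index):
{code}
// Sketch only: gather all the missing/failed indices first, then decode
// them together in one reconstruction pass.
List<Integer> toReconstruct = new ArrayList<>();
for (int idx = 0; idx < numDataUnits && idx < blkIndxLen; idx++) {
  ECBlockInfo ecBlkInfo = liveDns.get((byte) idx);
  if (ecBlkInfo == null) {
    toReconstruct.add(idx);        // missing block
  } else {
    try {
      ExtendedBlock block = StripedBlockUtil.constructInternalBlock(
          blockGroup, ecPolicy.getCellSize(), numDataUnits, idx);
      checksumBlock(block, idx, ecBlkInfo.getToken(), ecBlkInfo.getDn());
    } catch (IOException ioe) {
      LOG.warn("Exception while reading checksum", ioe);
      toReconstruct.add(idx);      // failed read, reconstruct instead
    }
  }
}
if (!toReconstruct.isEmpty()) {
  // one erasure decode can recover all the missed blocks together
  recalculateChecksum(toReconstruct); // hypothetical batched variant
}
{code}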
4. Could we have a wrapper like *ReconstructionInfo* to hold the relevant
parameters? I'm afraid we may need more of them in the future ... (a sketch
follows the code block)
{code}
StripedReader(StripedReconstructor reconstructor, DataNode datanode,
- Configuration conf,
- BlockECReconstructionInfo reconstructionInfo) {
+ Configuration conf, ErasureCodingPolicy ecPolicy,
+ ExtendedBlock blockGroup, byte[] liveIndices, DatanodeInfo[] sources) {
{code}
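Something along these lines, perhaps (sketch only; the field set is just an
illustration of what might go in):
{code}
// Sketch only: a simple immutable holder so the StripedReader constructor
// stays stable as the parameter list grows.
public class ReconstructionInfo {
  private final ErasureCodingPolicy ecPolicy;
  private final ExtendedBlock blockGroup;
  private final byte[] liveIndices;
  private final DatanodeInfo[] sources;

  public ReconstructionInfo(ErasureCodingPolicy ecPolicy,
      ExtendedBlock blockGroup, byte[] liveIndices, DatanodeInfo[] sources) {
    this.ecPolicy = ecPolicy;
    this.blockGroup = blockGroup;
    this.liveIndices = liveIndices;
    this.sources = sources;
  }
  // getters omitted
}

StripedReader(StripedReconstructor reconstructor, DataNode datanode,
    Configuration conf, ReconstructionInfo reconstructionInfo) {
{code}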
5. The following *TODO* should be resolved, since you have added two tests
covering datanode failures.
{code}
// TODO: allow datanode failure, HDFS-9833
{code}
> Erasure coding: recomputing block checksum on the fly by reconstructing the
> missed/corrupt block data
> -----------------------------------------------------------------------------------------------------
>
> Key: HDFS-9833
> URL: https://issues.apache.org/jira/browse/HDFS-9833
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Kai Zheng
> Assignee: Rakesh R
> Labels: hdfs-ec-3.0-must-do
> Attachments: HDFS-9833-00-draft.patch, HDFS-9833-01.patch,
> HDFS-9833-02.patch, HDFS-9833-03.patch, HDFS-9833-04.patch
>
>
> As discussed in HDFS-8430 and HDFS-9694, to compute a striped file checksum
> even when some of the striped blocks are missing, we need to consider
> recomputing the block checksum on the fly for the missed/corrupt blocks. To
> recompute a block checksum, the block data needs to be reconstructed by
> erasure decoding, and most of the code needed for the block reconstruction
> can be borrowed from HDFS-9719, the refactoring of the existing
> {{ErasureCodingWorker}}. In the EC worker, reconstructed blocks need to be
> written out to target datanodes, but in this case the remote write isn't
> necessary, as the reconstructed block data is only used to recompute the
> checksum.