[ 
https://issues.apache.org/jira/browse/HDFS-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156467#comment-15156467
 ] 

Rakesh R commented on HDFS-8430:
--------------------------------

It's really interesting work! Thank you [~drankye] and all the others for the 
detailed thoughts. I'm trying to understand the work better; it would be great 
if you could clarify a few of the following points. Please excuse me if I'm 
asking questions that are already answered in the jira. Thanks!

{code}
For example, consider a striped file "file_1" with two block groups: bg0 and bg1.
Each block group below holds three stripes of six cells
(cell name = bg<group>_cell_<stripe><position>).

blockgroup0 => bg0_cell_00, bg0_cell_01, bg0_cell_02, bg0_cell_03, bg0_cell_04, bg0_cell_05
blockgroup0 => bg0_cell_10, bg0_cell_11, bg0_cell_12, bg0_cell_13, bg0_cell_14, bg0_cell_15
blockgroup0 => bg0_cell_20, bg0_cell_21, bg0_cell_22, bg0_cell_23, bg0_cell_24, bg0_cell_25

blockgroup1 => bg1_cell_00, bg1_cell_01, bg1_cell_02, bg1_cell_03, bg1_cell_04, bg1_cell_05
blockgroup1 => bg1_cell_10, bg1_cell_11, bg1_cell_12, bg1_cell_13, bg1_cell_14, bg1_cell_15
blockgroup1 => bg1_cell_20, bg1_cell_21, bg1_cell_22, bg1_cell_23, bg1_cell_24, bg1_cell_25
{code}
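To make sure I'm reading the layout right, here is a small sketch (not HDFS code; the 64KB cell size, 6 data cells per stripe, and 3 stripes per block group are just the illustrative numbers from the example above) of how a logical file offset would map to a (block group, stripe, cell) position:

```java
// Hypothetical offset-to-cell mapping for the striped layout sketched
// above. All constants are illustrative, not the actual HDFS-7285 values.
class StripedOffsetSketch {
    static final long CELL_SIZE = 64 * 1024;       // 64KB cell (assumed)
    static final int DATA_CELLS_PER_STRIPE = 6;    // bg0_cell_x0 .. bg0_cell_x5
    static final int STRIPES_PER_BLOCK_GROUP = 3;  // three rows per group above

    // Returns {blockGroup, stripeInGroup, cellInStripe} for a file offset.
    static long[] locate(long offset) {
        long cellNumber = offset / CELL_SIZE;              // global cell index
        long stripeNumber = cellNumber / DATA_CELLS_PER_STRIPE;
        long blockGroup = stripeNumber / STRIPES_PER_BLOCK_GROUP;
        long stripeInGroup = stripeNumber % STRIPES_PER_BLOCK_GROUP;
        long cellInStripe = cellNumber % DATA_CELLS_PER_STRIPE;
        return new long[] { blockGroup, stripeInGroup, cellInStripe };
    }

    public static void main(String[] args) {
        // Offset 0 lands in bg0_cell_00 -> [0, 0, 0]
        System.out.println(java.util.Arrays.toString(locate(0)));
        // One full block group = 3 stripes * 6 cells * 64KB; the next
        // byte after it is the first cell of bg1 -> [1, 0, 0]
        long bgSize = STRIPES_PER_BLOCK_GROUP * DATA_CELLS_PER_STRIPE * CELL_SIZE;
        System.out.println(java.util.Arrays.toString(locate(bgSize)));
    }
}
```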

*Query1)* Does the proposal use the pre-computed block checksums present in the 
block metadata instead of recalculating them?

*Query2)* This question is a continuation of the first one. Could you tell me 
the finalized/agreed approach for the checksum computation? I can see two 
approaches for a striped file:
- Approach1:- do it ROW wise, stripe by stripe: get the pre-computed checksum 
values of bg0_cell_00, bg0_cell_01, bg0_cell_02, bg0_cell_03, bg0_cell_04, 
bg0_cell_05, and so on; or
- Approach2:- do it COLUMN wise, clubbing the cells of each internal block 
across the block group and then doing computeChecksum(bg0_cell_00, bg0_cell_10, 
bg0_cell_20).

*Query3)* How does the new algorithm apply to a contiguous file? Does it split 
the file into smaller cells, each of 64KB?

*Query4)* I assume you are planning to use the current MD5MD5CRC32 algorithm, 
is that right?
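For reference, my understanding of the MD5MD5CRC32 composition is roughly the following sketch (not the actual HDFS code; the chunk/block sizes passed in are toy values, not the HDFS defaults): per-chunk CRC32s are digested into a per-block MD5, and the per-block MD5s are digested again into the file-level MD5.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.util.zip.CRC32;

// Rough sketch of an MD5-of-MD5-of-CRC32 composition.
class Md5Md5Crc32Sketch {
    static MessageDigest md5() {
        try { return MessageDigest.getInstance("MD5"); }
        catch (Exception e) { throw new RuntimeException(e); }
    }

    // MD5 over the 4-byte CRC32 of each chunk in one block.
    static byte[] blockMd5(byte[] block, int chunkSize) {
        MessageDigest digest = md5();
        for (int off = 0; off < block.length; off += chunkSize) {
            CRC32 crc = new CRC32();
            int len = Math.min(chunkSize, block.length - off);
            crc.update(block, off, len);
            digest.update(ByteBuffer.allocate(4).putInt((int) crc.getValue()).array());
        }
        return digest.digest();
    }

    // File-level checksum: MD5 over the per-block MD5s.
    static byte[] fileChecksum(byte[][] blocks, int chunkSize) {
        MessageDigest digest = md5();
        for (byte[] block : blocks)
            digest.update(blockMd5(block, chunkSize));
        return digest.digest();
    }
}
```

If that understanding is right, the open question for striping is just which bytes count as a "block" here: each internal block of the group, or the whole block group.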

> Erasure coding: compute file checksum for stripe files
> ------------------------------------------------------
>
>                 Key: HDFS-8430
>                 URL: https://issues.apache.org/jira/browse/HDFS-8430
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: HDFS-7285
>            Reporter: Walter Su
>            Assignee: Kai Zheng
>         Attachments: HDFS-8430-poc1.patch
>
>
> HADOOP-3981 introduced a distributed file checksum algorithm. It's designed 
> for replicated blocks.
> {{DFSClient.getFileChecksum()}} needs some updates so it can work for striped 
> block groups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
