[ 
https://issues.apache.org/jira/browse/HDFS-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15268251#comment-15268251
 ] 

Rakesh R commented on HDFS-9833:
--------------------------------

Following is a brief outline of the proposed approach. Kindly go through it; 
it would be great to see feedback on this. Thanks!

In the existing striped checksum logic, the client connects to the first 
datanode in the block locations and sends the {{Op.BLOCK_GROUP_CHECKSUM}} 
command. That datanode then iterates over {{ecPolicy.getNumDataUnits()}} 
datanodes and invokes the {{Op.BLOCK_CHECKSUM}} command on each one. Any of 
these operations can hit an {{IOException}} and fail the whole checksum call.
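The flow above can be sketched as follows. This is a minimal illustration, not 
the actual DataTransferProtocol code; the method and parameter names here are 
assumptions for the sketch:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.function.IntFunction;

public class StripedChecksumFlow {
    // Simplified stand-in for issuing Op.BLOCK_CHECKSUM for internal block i
    // of the block group and concatenating the per-block checksum bytes.
    static byte[] blockChecksums(int numDataUnits, IntFunction<byte[]> fetch)
            throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < numDataUnits; i++) {
            try {
                out.write(fetch.apply(i)); // one Op.BLOCK_CHECKSUM per data unit
            } catch (RuntimeException e) {
                // In the current logic, any failure on one internal block
                // aborts the whole Op.BLOCK_GROUP_CHECKSUM request.
                throw new IOException(
                        "checksum of internal block " + i + " failed", e);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Happy path: all numDataUnits blocks respond.
        byte[] ok = blockChecksums(3, i -> new byte[]{(byte) i});
        System.out.println("combined checksum length = " + ok.length);
        // One bad datanode fails the entire call.
        try {
            blockChecksums(3, i -> { throw new RuntimeException("disk error"); });
        } catch (IOException e) {
            System.out.println("whole call failed: " + e.getMessage());
        }
    }
}
```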

To begin with, I think we will catch the generic {{IOException}} while 
performing the operation on a datanode. The block corresponding to the failed 
datanode will be chosen for reconstruction, and the checksum will then be 
recomputed from the reconstructed block data.
# Datanode side changes:
If an {{IOException}} occurs while performing the {{Op.BLOCK_CHECKSUM}} 
command, the datanode will consider that block for reconstruction and 
calculate its checksum from the reconstructed data. Errors during the 
reconstruction itself will still fail the checksum call.
# Client side changes:
Presently the {{FileChecksumHelper#checksumBlockGroup()}} function throws the 
{{IOException}} back to the client if the first datanode has errors; instead, 
the client will try connecting to up to {{#getNumParityUnits()}} datanodes 
before failing the checksum operation.
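The client-side fallback described in the two items above could look roughly 
like this. It is only an illustrative sketch; the helper name, the 
{{Function}}-based datanode call, and the exception wrapping are assumptions, 
not the actual {{FileChecksumHelper}} API:

```java
import java.io.IOException;
import java.util.List;
import java.util.function.Function;

public class ChecksumFallback {
    // Instead of failing on the first datanode's error, try up to
    // numParityUnits candidate datanodes before giving up on the
    // checksum operation. blockGroupChecksum stands in for sending
    // Op.BLOCK_GROUP_CHECKSUM to one datanode.
    static byte[] checksumWithFallback(List<String> datanodes,
            int numParityUnits,
            Function<String, byte[]> blockGroupChecksum) throws IOException {
        IOException last = new IOException("no datanode available");
        int attempts = Math.min(datanodes.size(), numParityUnits);
        for (int i = 0; i < attempts; i++) {
            try {
                return blockGroupChecksum.apply(datanodes.get(i));
            } catch (RuntimeException e) {
                // Remember the failure and fall through to the next candidate.
                last = new IOException(
                        "datanode " + datanodes.get(i) + " failed", e);
            }
        }
        throw last; // all candidates failed: surface the last error
    }

    public static void main(String[] args) throws IOException {
        // dn1 is unreachable; the fallback succeeds on dn2.
        byte[] r = checksumWithFallback(List.of("dn1", "dn2"), 2,
                dn -> {
                    if (dn.equals("dn1")) {
                        throw new RuntimeException("connection refused");
                    }
                    return new byte[]{42};
                });
        System.out.println("got checksum from fallback datanode, length "
                + r.length);
    }
}
```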

Thanks [~umamaheswararao] for the offline discussions.

> Erasure coding: recomputing block checksum on the fly by reconstructing the 
> missed/corrupt block data
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-9833
>                 URL: https://issues.apache.org/jira/browse/HDFS-9833
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Kai Zheng
>            Assignee: Rakesh R
>              Labels: hdfs-ec-3.0-must-do
>
> As discussed in HDFS-8430 and HDFS-9694, to compute a striped file checksum 
> even when some of the striped blocks are missing, we need to consider 
> recomputing block checksums on the fly for the missed/corrupt blocks. To 
> recompute the block 
> checksum, the block data needs to be reconstructed by erasure decoding, and 
> the main needed codes for the block reconstruction could be borrowed from 
> HDFS-9719, the refactoring of the existing {{ErasureCodingWorker}}. In EC 
> worker, reconstructed blocks need to be written out to target datanodes, but 
> here in this case, the remote writing isn't necessary, as the reconstructed 
> block data is only used to recompute the checksum.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
