[
https://issues.apache.org/jira/browse/HDFS-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15268251#comment-15268251
]
Rakesh R commented on HDFS-9833:
--------------------------------
Following is a brief outline of the proposed approach. Kindly go through it;
it would be great to get feedback on this. Thanks!
In our existing striped checksum logic, the client connects to the first
datanode in the block locations and sends the {{Op.BLOCK_GROUP_CHECKSUM}}
command. That datanode then iterates over {{ecPolicy.getNumDataUnits()}}
datanodes and invokes the {{Op.BLOCK_CHECKSUM}} command on each one in turn.
Any of these operations can hit an {{IOException}} and fail the whole
checksum call.
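To make the current flow concrete, here is a minimal, hypothetical Java sketch of the behavior described above. {{DatanodeStub}}, {{checksumBlock()}}, and the combine step are illustrative stand-ins, not the real HDFS APIs; the point is only that a single per-block failure aborts the whole block-group checksum.

```java
import java.io.IOException;
import java.util.List;

// Simplified stand-in for the current striped checksum flow: walk the
// data-unit datanodes, request a per-block checksum from each, and let any
// IOException propagate up, failing the whole Op.BLOCK_GROUP_CHECKSUM call.
public class StripedChecksumSketch {
    interface DatanodeStub {
        // stand-in for sending Op.BLOCK_CHECKSUM to one datanode
        long checksumBlock(int blockIndex) throws IOException;
    }

    static long blockGroupChecksum(List<DatanodeStub> dataUnits) throws IOException {
        long combined = 0;
        for (int i = 0; i < dataUnits.size(); i++) {
            // In the current logic a failure here propagates up and fails
            // the entire block-group checksum call.
            combined = combined * 31 + dataUnits.get(i).checksumBlock(i);
        }
        return combined;
    }
}
```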
To begin with, I think we will catch the generic {{IOException}} raised while
performing the operation on a datanode. The block corresponding to the failed
datanode will be chosen for reconstruction, and the checksum will then be
recomputed from the reconstructed block data.
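A minimal sketch of that fallback idea: catch the {{IOException}} from the normal per-block checksum path and recompute the checksum from reconstructed block data instead. {{BlockChecksummer}} and both implementations here are hypothetical stand-ins; in the actual proposal the reconstruction would reuse the erasure-decoding code from HDFS-9719.

```java
import java.io.IOException;

// Hypothetical sketch: try the normal per-block checksum first; on failure,
// fall back to checksumming reconstructed block data.
public class ChecksumFallbackSketch {
    interface BlockChecksummer {
        long checksum(int blockIndex) throws IOException;
    }

    static long checksumWithFallback(int blockIndex,
                                     BlockChecksummer primary,
                                     BlockChecksummer reconstructed) throws IOException {
        try {
            // normal path: Op.BLOCK_CHECKSUM against the live block
            return primary.checksum(blockIndex);
        } catch (IOException e) {
            // proposed path: reconstruct the missed/corrupt block by erasure
            // decoding and checksum the reconstructed data; if reconstruction
            // also fails, its IOException still fails the overall call.
            return reconstructed.checksum(blockIndex);
        }
    }
}
```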
# Datanode side changes:
If there is an {{IOException}} while performing the {{Op.BLOCK_CHECKSUM}}
command, the datanode will consider this block for reconstruction and
calculate its checksum from the reconstructed data. Errors during the
reconstruction itself will still fail the checksum call.
# Client side changes:
Presently the {{FileChecksumHelper#checksumBlockGroup()}} function throws the
IOException back to the client if the first datanode has errors; instead, the
client will try connecting to up to {{#getNumParityUnits()}} datanodes before
failing the checksum operation.
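The client-side retry above can be sketched as follows. {{Candidate}}, the candidate list, and {{requestBlockGroupChecksum()}} are illustrative stand-ins for the {{FileChecksumHelper}} internals; the bound on attempts follows the {{#getNumParityUnits()}} limit proposed above.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical sketch: instead of failing on the first datanode's error,
// try up to numParityUnits candidate datanodes before giving up.
public class ChecksumRetrySketch {
    interface Candidate {
        long requestBlockGroupChecksum() throws IOException;
    }

    static long checksumBlockGroup(List<Candidate> candidates, int numParityUnits)
            throws IOException {
        IOException last = null;
        int attempts = Math.min(candidates.size(), numParityUnits);
        for (int i = 0; i < attempts; i++) {
            try {
                return candidates.get(i).requestBlockGroupChecksum();
            } catch (IOException e) {
                last = e; // remember the failure, fall through to the next datanode
            }
        }
        throw (last != null) ? last : new IOException("no candidate datanode succeeded");
    }
}
```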
Thanks [~umamaheswararao] for the offline discussions.
> Erasure coding: recomputing block checksum on the fly by reconstructing the
> missed/corrupt block data
> -----------------------------------------------------------------------------------------------------
>
> Key: HDFS-9833
> URL: https://issues.apache.org/jira/browse/HDFS-9833
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Kai Zheng
> Assignee: Rakesh R
> Labels: hdfs-ec-3.0-must-do
>
> As discussed in HDFS-8430 and HDFS-9694, to compute a striped file checksum
> even when some of the striped blocks are missing, we need to consider
> recomputing the block checksum on the fly for the missed/corrupt blocks. To
> recompute the block checksum, the block data needs to be reconstructed by
> erasure decoding; the main code needed for the block reconstruction could be
> borrowed from HDFS-9719, the refactoring of the existing
> {{ErasureCodingWorker}}. In the EC worker, reconstructed blocks need to be
> written out to target datanodes, but in this case the remote write isn't
> necessary, as the reconstructed block data is only used to recompute the
> checksum.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)