[https://issues.apache.org/jira/browse/ARROW-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918694#comment-16918694]

Max Risuhin commented on ARROW-5995:
------------------------------------

[~efiop]

> Checksum (MD5 of block CRCs) is always computed on request and is not
> stored anywhere, right?

> Are CRCs stored by HDFS somewhere?

According to
[https://hadoop.apache.org/docs/r3.1.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html]:

"The HDFS client software implements checksum checking on the contents of HDFS 
files. When a client creates an HDFS file, it computes a checksum of each block 
of the file and stores these checksums in a separate hidden file in the same 
HDFS namespace. When a client retrieves file contents it verifies that the data 
it received from each DataNode matches the checksum stored in the associated 
checksum file. If not, then the client can opt to retrieve that block from 
another DataNode that has a replica of that block."

Regarding the final file MD5 checksum derived from all of the block
checksums: so far I can't find any place where the result of such a
request is cached and reused later.
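
To illustrate what "MD5 of block CRCs" means in practice, here is a rough
Python sketch of the shape of that computation. The names are mine, and
I'm assuming the default 512-byte chunks, plain CRC32, and simple
concatenation of the per-block digests; the real DataNode/DFSClient logic
has more cases (e.g. CRC32C, partial blocks):

{code:python}
import hashlib
import zlib

BYTES_PER_CHECKSUM = 512        # assumed dfs.bytes-per-checksum default
BLOCK_SIZE = 128 * 1024 * 1024  # assumed dfs.blocksize default

def md5_of_block_crcs(block):
    # Roughly what a DataNode contributes for one block: an MD5 over
    # the big-endian CRC32 of every 512-byte chunk of that block.
    md5 = hashlib.md5()
    for off in range(0, len(block), BYTES_PER_CHECKSUM):
        crc = zlib.crc32(block[off:off + BYTES_PER_CHECKSUM])
        md5.update(crc.to_bytes(4, "big"))
    return md5.digest()

def composite_checksum(path):
    # Roughly what the client then does: an MD5 over the concatenated
    # per-block digests, i.e. an "MD5-of-xMD5-of-yCRC32"-style value.
    top = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            top.update(md5_of_block_crcs(block))
    return top.hexdigest()
{code}

Whether this matches byte-for-byte what `hadoop fs -checksum` reports
depends on cluster settings (dfs.checksum.type, dfs.bytes-per-checksum,
block size); the point is only that everything here is recomputed per
request, nothing is read back from a cache.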

 

> If they are, is there already a way to retrieve them in pyarrow, or do
> we need libhdfs support first? If they are not, then they are also
> computed on request, and so we could compute them in pyarrow itself,
> without libhdfs support, right?

PyArrow/Arrow C++ communicates with HDFS only through the libhdfs API. It
seems we can't compute the file checksum on the Arrow side, because
[hdfs.h|https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/include/hdfs/hdfs.h]
doesn't expose an API to get the checksum of a file or of its individual
blocks.
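
One possible workaround for anyone who needs the checksum from Python
before libhdfs/Arrow grows such an API: WebHDFS exposes it through the
GETFILECHECKSUM operation, so it can be fetched over HTTP directly. A
minimal sketch, assuming simple (user.name) auth; host, path, and user
are placeholders, and 9870 is the Hadoop 3 NameNode HTTP default (older
clusters use 50070):

{code:python}
import requests  # third-party HTTP client, assumed to be installed

def webhdfs_file_checksum(namenode, path, user, port=9870):
    # GETFILECHECKSUM makes the cluster compute the composite
    # MD5-of-CRCs checksum for the file; the NameNode redirects the
    # request to a DataNode, which requests.get follows automatically.
    url = "http://{}:{}/webhdfs/v1{}".format(namenode, port, path)
    resp = requests.get(
        url,
        params={"op": "GETFILECHECKSUM", "user.name": user},
        allow_redirects=True,
    )
    resp.raise_for_status()
    # Returns e.g. {"algorithm": "MD5-of-0MD5-of-512CRC32C",
    #               "bytes": "...", "length": 28}
    return resp.json()["FileChecksum"]

# Hypothetical usage:
# webhdfs_file_checksum("namenode.example.com", "/user/foo/x.parquet", "hdfs")
{code}

This sidesteps libhdfs entirely, at the cost of requiring WebHDFS to be
enabled on the cluster.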

 

 

> [Python] pyarrow: hdfs: support file checksum
> ---------------------------------------------
>
>                 Key: ARROW-5995
>                 URL: https://issues.apache.org/jira/browse/ARROW-5995
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Ruslan Kuprieiev
>            Priority: Minor
>
> I was not able to find how to retrieve the checksum (`getFileChecksum`
> or `hadoop fs/dfs -checksum`) for a file on HDFS. Judging by how it is
> implemented in the Hadoop CLI [1], it looks like we will also need to
> implement it manually in pyarrow. Please correct me if I'm missing
> something. Is this feature desirable? Or was there a good reason why it
> wasn't implemented already?
>  [1] 
> [https://github.com/hanborq/hadoop/blob/hadoop-hdh3u2.1/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java#L719]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
