[ 
https://issues.apache.org/jira/browse/ARROW-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918903#comment-16918903
 ] 

Max Risuhin commented on ARROW-5995:
------------------------------------

I think that relying on internal, undocumented behavior (so far I have not 
been able to find any docs) will not work well here.

I would prefer a workaround that lets Arrow access the Hadoop Java API method 
getFileChecksum 
[https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#getFileChecksum(org.apache.hadoop.fs.Path)]
 from the underlying C++ implementation. For example, we could build our own C 
library specifically to wire up `getFileChecksum`, and Arrow would load it 
along with the official libhdfs.so. This could serve as a solution until 
`getFileChecksum` is supported in the official libhdfs, possibly through our 
own contribution.

The above sounds like a cleaner solution to be accepted into Arrow than relying 
on Hadoop HDFS internals. But I'm not sure whether it is technically possible 
to load both libhdfs.so and our own library into memory without creating 
conflicts with Java runtimes, etc.
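
For comparison, the `hadoop fs -checksum` command mentioned in the issue 
description can be wrapped from Python, outside Arrow, as a stopgap. The sketch 
below is only an illustration, not a proposed Arrow API: the helper names 
`parse_checksum_line` and `hdfs_checksum` are hypothetical, and it assumes a 
`hadoop` binary on PATH whose output is one tab-separated line of path, 
algorithm name and hex digest (with `NONE` as the algorithm when the file 
system has no checksum).

```python
import subprocess


def parse_checksum_line(line):
    """Split one line of `hadoop fs -checksum` output.

    Assumed layout: "<path>\t<algorithm>\t<hex digest>".
    Returns (path, algorithm, digest); algorithm and digest are None
    when the command reports "NONE".
    """
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) != 3:
        raise ValueError(f"unexpected checksum output: {line!r}")
    path, algorithm, digest = parts
    if algorithm == "NONE":
        return path, None, None
    return path, algorithm, digest


def hdfs_checksum(path, hadoop_bin="hadoop"):
    """Run the Hadoop CLI for a single file and parse its first output line."""
    result = subprocess.run(
        [hadoop_bin, "fs", "-checksum", path],
        check=True,            # raise CalledProcessError on a non-zero exit
        capture_output=True,
        text=True,
    )
    lines = result.stdout.splitlines()
    if not lines:
        raise ValueError("hadoop fs -checksum produced no output")
    return parse_checksum_line(lines[0])
```

Calling `hdfs_checksum("/some/file")` starts a JVM for every file, so this is 
slow compared with a call through libhdfs, but it does not depend on Hadoop 
internals.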

> [Python] pyarrow: hdfs: support file checksum
> ---------------------------------------------
>
>                 Key: ARROW-5995
>                 URL: https://issues.apache.org/jira/browse/ARROW-5995
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Ruslan Kuprieiev
>            Priority: Minor
>
> I was not able to find how to retrieve checksum (`getFileChecksum` or `hadoop 
> fs/dfs -checksum`) for a file on hdfs. Judging by how it is implemented in 
> hadoop CLI [1], looks like we will also need to implement it manually in 
> pyarrow. Please correct me if I'm missing something. Is this feature 
> desirable? Or was there a good reason why it wasn't implemented already?
>  [1] 
> [https://github.com/hanborq/hadoop/blob/hadoop-hdh3u2.1/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java#L719]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
