[
https://issues.apache.org/jira/browse/ARROW-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919309#comment-16919309
]
Ruslan Kuprieiev commented on ARROW-5995:
-----------------------------------------
You are right, such a hackish approach would probably not pass review. But it
might be a good temporary pure-Python workaround if parsing those metafiles is
comparatively simple, so we wouldn't have to mess around with our own C
library, for which we would have to ship wheels (which is a hassle). With that
workaround in place, we could submit proper patches and patiently wait for
them to get merged into libhdfs and pyarrow. If the workaround turns out to be
hard to implement, we could skip it and keep using the hadoop CLI as we do
right now, focusing on the proper patches to libhdfs and pyarrow instead. What
do you think? :)
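A rough idea of what the pure-Python workaround could look like, assuming we
can read a datanode's block metadata files (blk_*.meta) directly and that they
follow the BlockMetadataHeader layout (2-byte version, 1-byte checksum type,
4-byte bytesPerChecksum, all big-endian, then one 4-byte CRC per chunk). The
helper names and the combining step are just a sketch of what
DFSClient.getFileChecksum appears to do, not a tested implementation:

{code:python}
import hashlib
import struct

def md5_of_block_crcs(meta_path):
    """MD5 over the raw per-chunk CRC bytes of one block's .meta file.

    Assumes BlockMetadataHeader version 1: 2-byte version, 1-byte
    checksum type (1=CRC32, 2=CRC32C), 4-byte bytesPerChecksum,
    all big-endian, followed by 4 bytes of CRC per chunk until EOF.
    """
    with open(meta_path, "rb") as f:
        version, ctype, bytes_per_checksum = struct.unpack(">HBI", f.read(7))
        if version != 1:
            raise ValueError("unexpected .meta version: %d" % version)
        # The rest of the file is the per-chunk CRCs.
        return hashlib.md5(f.read()).digest()

def md5_md5_crc32(meta_paths):
    """Combine per-block MD5s the way DFSClient.getFileChecksum seems to:
    MD5 over the concatenation of the per-block MD5 digests.
    meta_paths must be the file's blocks, in order."""
    combined = hashlib.md5()
    for path in meta_paths:
        combined.update(md5_of_block_crcs(path))
    return combined.hexdigest()
{code}

The hackish part is exactly that this needs local access to the block files
and a way to enumerate a file's blocks in order, which is why it could only
ever be a temporary workaround.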
> [Python] pyarrow: hdfs: support file checksum
> ---------------------------------------------
>
> Key: ARROW-5995
> URL: https://issues.apache.org/jira/browse/ARROW-5995
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Ruslan Kuprieiev
> Priority: Minor
>
> I was not able to find a way to retrieve the checksum (`getFileChecksum` or
> `hadoop fs/dfs -checksum`) for a file on hdfs. Judging by how it is
> implemented in the hadoop CLI [1], it looks like we will also need to
> implement it manually in pyarrow. Please correct me if I'm missing something.
> Is this feature desirable? Or was there a good reason why it wasn't
> implemented already?
> [1]
> [https://github.com/hanborq/hadoop/blob/hadoop-hdh3u2.1/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java#L719]
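For reference, the hadoop CLI route mentioned above boils down to something
like the sketch below. It assumes `hadoop` is on PATH and that `hadoop fs
-checksum` prints a single tab-separated "path, algorithm, hex digest" line;
the `hdfs_file_checksum` helper name is made up for the example:

{code:python}
import subprocess

def hdfs_file_checksum(path):
    """Fetch a file's HDFS checksum by shelling out to the hadoop CLI.

    Assumes `hadoop` is on PATH and that `hadoop fs -checksum` prints
    one tab-separated line: <path> <algorithm> <hex digest>.
    """
    out = subprocess.check_output(["hadoop", "fs", "-checksum", path],
                                  universal_newlines=True)
    _, algorithm, digest = out.strip().split("\t")
    return algorithm, digest

# e.g. ('MD5-of-0MD5-of-512CRC32C', '000002000000...')
print(hdfs_file_checksum("/path/to/file"))
{code}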