[
https://issues.apache.org/jira/browse/ARROW-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919309#comment-16919309
]
Ruslan Kuprieiev commented on ARROW-5995:
-----------------------------------------
You are right, such a hackish approach would probably not pass review. But it
might be a good temporary pure-Python workaround if parsing those metafiles is
comparatively simple, so we wouldn't have to mess around with our own C
library, for which we would have to ship wheels (which is a hassle). With that
workaround in place, we could submit proper patches and patiently wait for
them to get merged into libhdfs and pyarrow. If the workaround turns out to be
hard to implement, we could skip it and keep using the hadoop CLI as we do
right now, focusing on the proper patches to libhdfs and pyarrow instead. What
do you think? :)
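A rough idea of what the pure-Python workaround could look like, assuming we
can read a datanode's block metadata files (blk_*.meta) directly and that they
follow the BlockMetadataHeader layout (2-byte version, 1-byte checksum type,
4-byte bytesPerChecksum, all big-endian, then one 4-byte CRC per chunk). The
helper names and the combining step are just a sketch of what
DFSClient.getFileChecksum appears to do, not a tested implementation:

{code:python}
import hashlib
import struct

def md5_of_block_crcs(meta_path):
    """MD5 over the raw per-chunk CRC bytes of one block's .meta file.

    Assumes BlockMetadataHeader version 1: 2-byte version, 1-byte
    checksum type (1=CRC32, 2=CRC32C), 4-byte bytesPerChecksum,
    all big-endian, followed by 4 bytes of CRC per chunk until EOF.
    """
    with open(meta_path, "rb") as f:
        version, ctype, bytes_per_checksum = struct.unpack(">HBI", f.read(7))
        if version != 1:
            raise ValueError("unexpected .meta version: %d" % version)
        # The rest of the file is the per-chunk CRCs.
        return hashlib.md5(f.read()).digest()

def md5_md5_crc32(meta_paths):
    """Combine per-block MD5s the way DFSClient.getFileChecksum seems to:
    MD5 over the concatenation of the per-block MD5 digests.
    meta_paths must be the file's blocks, in order."""
    combined = hashlib.md5()
    for path in meta_paths:
        combined.update(md5_of_block_crcs(path))
    return combined.hexdigest()
{code}

The hackish part is exactly that this needs local access to the block files
and a way to enumerate a file's blocks in order, which is why it could only
ever be a temporary workaround.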
> [Python] pyarrow: hdfs: support file checksum
> ---------------------------------------------
>
> Key: ARROW-5995
> URL: https://issues.apache.org/jira/browse/ARROW-5995
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Ruslan Kuprieiev
> Priority: Minor
>
> I was not able to find a way to retrieve the checksum (`getFileChecksum` or
> `hadoop fs/dfs -checksum`) for a file on hdfs. Judging by how it is
> implemented in the hadoop CLI [1], it looks like we will also need to
> implement it manually in pyarrow. Please correct me if I'm missing something.
> Is this feature desirable? Or was there a good reason why it wasn't
> implemented already?
> [1]
> [https://github.com/hanborq/hadoop/blob/hadoop-hdh3u2.1/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java#L719]
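For reference, the hadoop CLI route mentioned above boils down to something
like the sketch below. It assumes `hadoop` is on PATH and that `hadoop fs
-checksum` prints a single tab-separated "path, algorithm, hex digest" line;
the `hdfs_file_checksum` helper name is made up for the example:

{code:python}
import subprocess

def hdfs_file_checksum(path):
    """Fetch a file's HDFS checksum by shelling out to the hadoop CLI.

    Assumes `hadoop` is on PATH and that `hadoop fs -checksum` prints
    one tab-separated line: <path> <algorithm> <hex digest>.
    """
    out = subprocess.check_output(["hadoop", "fs", "-checksum", path],
                                  universal_newlines=True)
    _, algorithm, digest = out.strip().split("\t")
    return algorithm, digest

# e.g. ('MD5-of-0MD5-of-512CRC32C', '000002000000...')
print(hdfs_file_checksum("/path/to/file"))
{code}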