[ https://issues.apache.org/jira/browse/ARROW-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890490#comment-16890490 ]

Ruslan Kuprieiev commented on ARROW-5995:
-----------------------------------------

Got it :) Sure, I would love to contribute a patch, but I'm not quite sure I 
fully understand what is going on there, so I'll need to do some research 
first. A short briefing on how the checksum is computed would be much 
appreciated (e.g. some sources describe it as an md5 of block CRCs, but I'm 
not quite sure where those CRCs are stored or how they are computed in the 
first place. Also, is it possible to retrieve that crc using pyarrow? I didn't 
see it in the `info` call or anywhere else.) Or if you have some doc you could 
point me to, that would also be very helpful. I did some initial googling 
before I opened this ticket, but it clearly wasn't quite enough. I'll 
certainly do additional research, but getting some info from the actual 
project developers is always invaluable :)
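For reference, my current reading is that HDFS's default file checksum (the
MD5-of-MD5-of-CRC scheme) works roughly like this: each block is split into
fixed-size chunks, each chunk gets a CRC, the per-block checksum is the md5 of
the concatenated chunk CRCs, and the file checksum is the md5 of the
concatenated per-block md5s. A minimal sketch of that structure, assuming the
default 512-byte chunk and 128 MB block sizes (note: `zlib.crc32` here is
plain CRC32, not the CRC32C HDFS uses by default, so the digest won't match
`hadoop fs -checksum` output; this only illustrates the shape of the
computation):

```python
import hashlib
import struct
import zlib  # stand-in: HDFS defaults to CRC32C, not zlib's CRC32

BYTES_PER_CHECKSUM = 512          # dfs.bytes-per-checksum (assumed default)
BLOCK_SIZE = 128 * 1024 * 1024    # dfs.blocksize (assumed default)

def md5_md5_crc32(data: bytes,
                  bytes_per_checksum: int = BYTES_PER_CHECKSUM,
                  block_size: int = BLOCK_SIZE) -> bytes:
    """Sketch of the MD5-of-MD5-of-CRC file checksum structure."""
    block_md5s = []
    for b in range(0, len(data), block_size):
        block = data[b:b + block_size]
        # CRC each chunk of the block, serialized as big-endian uint32.
        crcs = b"".join(
            struct.pack(">I", zlib.crc32(block[i:i + bytes_per_checksum]) & 0xFFFFFFFF)
            for i in range(0, len(block), bytes_per_checksum)
        )
        # Per-block checksum: md5 of the concatenated chunk CRCs.
        block_md5s.append(hashlib.md5(crcs).digest())
    # File checksum: md5 of the concatenated per-block md5 digests.
    return hashlib.md5(b"".join(block_md5s)).digest()
```

If that matches what the DataNodes actually do, the missing piece on the
pyarrow side would just be exposing the RPC that fetches those per-block CRCs.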

> [Python] pyarrow: hdfs: support file checksum
> ---------------------------------------------
>
>                 Key: ARROW-5995
>                 URL: https://issues.apache.org/jira/browse/ARROW-5995
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Ruslan Kuprieiev
>            Priority: Minor
>
> I was not able to find a way to retrieve the checksum (`getFileChecksum` or 
> `hadoop fs/dfs -checksum`) for a file on hdfs. Judging by how it is 
> implemented in the hadoop CLI [1], it looks like we will also need to 
> implement it manually in pyarrow. Please correct me if I'm missing 
> something. Is this feature desirable? Or was there a good reason why it 
> wasn't implemented already?
>  [1] 
> [https://github.com/hanborq/hadoop/blob/hadoop-hdh3u2.1/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java#L719]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)