[jira] [Commented] (ARROW-2800) [Python] Unavailable Parquet column statistics from Spark-generated file

Robert Gruener (JIRA) Thu, 02 Aug 2018 15:12:10 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16567531#comment-16567531
 ]


Robert Gruener commented on ARROW-2800:
---------------------------------------

OK, so I was wondering why parquet-mr 1.10.0 can read the old corrupt statistics 
but parquet-cpp cannot, and I found the relevant Java code here: 
[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L633]
Basically, parquet-cpp does not consider the case where min and max are the 
same (which would actually be nice to have for our use case).

 

This is something that can be fixed in parquet-cpp. I should be able to 
implement it so that it has parity with the Java implementation.
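
The min == max case can be sketched roughly as follows. This is a minimal 
illustration in Python, not the parquet-mr or parquet-cpp code: the function 
name and its parameters are hypothetical, and it assumes the old statistics 
were computed with a signed byte-wise comparison, so they can only be trusted 
when the column's sort order is signed or when min and max are equal (in 
which case the ordering used to compute them cannot matter).

```python
from typing import Optional


def legacy_min_max_usable(min_value: Optional[bytes],
                          max_value: Optional[bytes],
                          signed_sort_order: bool) -> bool:
    """Hypothetical helper: can legacy min/max statistics be trusted?

    Assumption: the writer aggregated min/max with a signed byte-wise
    comparison, which is wrong for types such as UTF8 strings.
    """
    # Statistics are unusable if either bound is missing.
    if min_value is None or max_value is None:
        return False
    # Trust the bounds when the type really sorts as signed, or when
    # min == max, since any ordering yields the same single value.
    return signed_sort_order or min_value == max_value


# The string column from the report: min and max are both "hello".
print(legacy_min_max_usable(b"hello", b"hello", signed_sort_order=False))
# A string column with distinct bounds would still be ignored.
print(legacy_min_max_usable(b"a", b"z", signed_sort_order=False))
```

Under these assumptions the first call accepts the statistics and the second 
rejects them, which matches the behavior described for the Java reader.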

> [Python] Unavailable Parquet column statistics from Spark-generated file
> ------------------------------------------------------------------------
>
>                 Key: ARROW-2800
>                 URL: https://issues.apache.org/jira/browse/ARROW-2800
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.9.0
>            Reporter: Robert Gruener
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.11.0
>
>
> I have a dataset generated by Spark that has statistics for the 
> string column according to the Java parquet-mr code (shown by 
> `parquet-tools meta`); however, reading it from pyarrow shows that the 
> statistics for that column are not set. I should note that the column has 
> only a single value, though it still seems like a problem that pyarrow can't 
> recognize it (it can recognize statistics set for the long and double types).
> See https://github.com/apache/arrow/files/2161147/metadata.zip for an 
> example file.
> Pyarrow Code To Check Statistics:
> {code}
> from pyarrow import parquet as pq
> meta = pq.read_metadata('/tmp/metadata.parquet')
> # No statistics for the string column: prints False, and the statistics object is None
> print(meta.row_group(0).column(1).is_stats_set)
> print(meta.row_group(0).column(1).statistics)
> {code}
> Example `parquet-tools meta` output:
> {code}
> file schema: spark_schema 
> --------------------------------------------------------------------------------
> int:         REQUIRED INT64 R:0 D:0
> string:      OPTIONAL BINARY O:UTF8 R:0 D:1
> float:       REQUIRED DOUBLE R:0 D:0
> row group 1: RC:8333 TS:76031 OFFSET:4 
> --------------------------------------------------------------------------------
> int:          INT64 SNAPPY DO:0 FPO:4 SZ:7793/8181/1.05 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 100, num_nulls: 0]
> string:       BINARY SNAPPY DO:0 FPO:7797 SZ:1146/1139/0.99 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: hello, max: hello, num_nulls: 
> 4192]
> float:        DOUBLE SNAPPY DO:0 FPO:8943 SZ:66720/66711/1.00 VC:8333 
> ENC:PLAIN,BIT_PACKED ST:[min: 0.0057611096964338415, max: 99.99811053829232, 
> num_nulls: 0]
> {code}
> I realize the column has only a single value, though it still seems like 
> pyarrow should be able to read the statistics that are set. I opened this 
> here rather than as a JIRA issue since I wanted to be sure this is actually 
> an issue and that there wasn't a ticket already open there (I couldn't find 
> one, but I wanted to be sure). Either way, I would like to understand why 
> this happens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
