[
https://issues.apache.org/jira/browse/PARQUET-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584436#comment-16584436
]
ASF GitHub Bot commented on PARQUET-1369:
-----------------------------------------
rgruener opened a new pull request #491: PARQUET-1369: Disregard column sort
order if statistics max/min are equal
URL: https://github.com/apache/parquet-cpp/pull/491
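
The idea in the pull request title can be illustrated with a minimal sketch. This is not the parquet-cpp implementation; the function `min_max_usable` and its `sort_order_known` parameter are hypothetical names chosen for illustration. The assumption is that min/max statistics are normally ignored when the reader cannot tell which ordering the writer used, but that the ordering is irrelevant when min and max are identical, as with the `hello`/`hello` string column in the quoted report.

```python
from typing import Optional


def min_max_usable(
    sort_order_known: bool,
    min_value: Optional[bytes],
    max_value: Optional[bytes],
) -> bool:
    """Decide whether min/max statistics can be trusted (illustrative only).

    sort_order_known is a hypothetical flag meaning "the reader knows which
    ordering the writer used to compute min and max".
    """
    # Without both bounds there is nothing to use.
    if min_value is None or max_value is None:
        return False
    # A known sort order makes the bounds meaningful as written.
    if sort_order_known:
        return True
    # Unknown sort order: the bounds are only safe when they coincide,
    # because then every non-null value equals that single value under
    # any ordering.
    return min_value == max_value
```

For the reported file, `min_max_usable(False, b"hello", b"hello")` returns `True`, while differing bounds with an unknown sort order would still be rejected.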
> [Python] Unavailable Parquet column statistics from Spark-generated file
> ------------------------------------------------------------------------
>
> Key: PARQUET-1369
> URL: https://issues.apache.org/jira/browse/PARQUET-1369
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Affects Versions: cpp-1.4.0
> Reporter: Robert Gruener
> Assignee: Robert Gruener
> Priority: Major
> Labels: parquet, pull-request-available
> Fix For: cpp-1.5.0
>
>
> I have a dataset generated by Spark. The Java parquet-mr code (via
> `parquet-tools meta`) shows statistics for the string column, but reading the
> same file with pyarrow shows that the statistics for that column are not set.
> I should note that the column has only a single value, though it still seems
> like a problem that pyarrow can't recognize the statistics (it does
> recognize statistics set for the long and double types).
> See https://github.com/apache/arrow/files/2161147/metadata.zip for file
> example.
> Pyarrow Code To Check Statistics:
> {code}
> from pyarrow import parquet as pq
> meta = pq.read_metadata('/tmp/metadata.parquet')
> # No statistics for the string column: prints False, and the statistics object is None
> print(meta.row_group(0).column(1).is_stats_set)
> {code}
> Example `parquet-tools meta` output:
> {code}
> file schema: spark_schema
> --------------------------------------------------------------------------------
> int: REQUIRED INT64 R:0 D:0
> string: OPTIONAL BINARY O:UTF8 R:0 D:1
> float: REQUIRED DOUBLE R:0 D:0
> row group 1: RC:8333 TS:76031 OFFSET:4
> --------------------------------------------------------------------------------
> int: INT64 SNAPPY DO:0 FPO:4 SZ:7793/8181/1.05 VC:8333 ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 100, num_nulls: 0]
> string: BINARY SNAPPY DO:0 FPO:7797 SZ:1146/1139/0.99 VC:8333 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: hello, max: hello, num_nulls: 4192]
> float: DOUBLE SNAPPY DO:0 FPO:8943 SZ:66720/66711/1.00 VC:8333 ENC:PLAIN,BIT_PACKED ST:[min: 0.0057611096964338415, max: 99.99811053829232, num_nulls: 0]
> {code}
> I realize the column has only a single value, though it still seems like
> pyarrow should be able to read the statistics that are set. I opened this
> here rather than as a JIRA since I wanted to be sure this is actually an
> issue and that there wasn't a ticket already filed there (I couldn't find
> one, but I wanted to be sure).
> Either way I would like to understand why this is
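
The check in the report inspects a single column chunk. A minimal sketch that lists the statistics status of every column chunk may make it easier to compare against the `parquet-tools meta` output. The helper name `summarize_stats` is hypothetical; it relies only on attributes of pyarrow's Parquet metadata objects (`num_row_groups`, `row_group`, `num_columns`, `column`, `path_in_schema`, `is_stats_set`, `statistics`, and the statistics fields `has_min_max`, `min`, `max`, `null_count`).

```python
def summarize_stats(meta):
    """Return one tuple per column chunk:
    (row_group_index, column_path, is_stats_set, min, max, null_count).

    `meta` is expected to behave like the object returned by
    pyarrow.parquet.read_metadata().
    """
    rows = []
    for rg_index in range(meta.num_row_groups):
        row_group = meta.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            column = row_group.column(col_index)
            # statistics is None when the reader did not accept the stats.
            stats = column.statistics if column.is_stats_set else None
            has_bounds = stats is not None and stats.has_min_max
            rows.append((
                rg_index,
                column.path_in_schema,
                column.is_stats_set,
                stats.min if has_bounds else None,
                stats.max if has_bounds else None,
                stats.null_count if stats is not None else None,
            ))
    return rows


# Usage, assuming pyarrow is installed and the file from the report exists:
#   from pyarrow import parquet as pq
#   for row in summarize_stats(pq.read_metadata('/tmp/metadata.parquet')):
#       print(row)
```

With the affected version, the row for the `string` column would be expected to show `is_stats_set` as `False`, matching the behavior described in the report.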