Hello Dev-Parquet,

I recently filed an issue, PARQUET-686
<https://issues.apache.org/jira/browse/PARQUET-686>, to fix the
unexpected sort order for Binary types in Parquet. The goal is to allow
statistics to be calculated using an *unsigned* interpretation of binary
byte strings, which is the ordering you want for UTF8 columns, for
example. The current behavior causes a correctness issue in Spark (see
SPARK-17213 <https://issues.apache.org/jira/browse/SPARK-17213> for
details), so it is likely also broken in other query engines that push
String filters down to Parquet.
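
To illustrate why the byte interpretation matters, here is a minimal
sketch in Python. It is not code from parquet-mr or Spark. It assumes
that min/max statistics were computed by comparing bytes as signed 8-bit
values, and that the reader compares them as unsigned. The helper names
(signed_key, unsigned_key, can_skip_greater_than) are hypothetical.

```python
def signed_key(b: bytes) -> list:
    # Interpret each byte as a signed 8-bit value (-128..127).
    return [x - 256 if x >= 128 else x for x in b]


def unsigned_key(b: bytes) -> list:
    # Interpret each byte as an unsigned value (0..255).
    return list(b)


# Three UTF-8 strings; "é" encodes to the bytes 0xC3 0xA9, both >= 0x80.
values = [s.encode("utf-8") for s in ["a", "z", "é"]]

signed_min = min(values, key=signed_key)
signed_max = max(values, key=signed_key)
unsigned_min = min(values, key=unsigned_key)
unsigned_max = max(values, key=unsigned_key)


def can_skip_greater_than(max_stat: bytes, literal: bytes) -> bool:
    # Hypothetical pushdown check for a filter "col > literal":
    # the row group can be skipped if its max is not greater than the
    # literal. Python compares bytes objects as unsigned values.
    return not (max_stat > literal)


# With a signed max ("z"), the row group is skipped even though it
# contains "é", which is greater than "z" in unsigned order.
skip_with_signed_stats = can_skip_greater_than(signed_max, b"z")
skip_with_unsigned_stats = can_skip_greater_than(unsigned_max, b"z")
```

Under these assumptions, the signed statistics report "é" as the minimum
and "z" as the maximum, so a filter such as col > 'z' would wrongly skip
a row group that contains "é".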

The fix requires a change to both the Parquet implementations
(parquet-mr, parquet-cpp) and the format: a new set of optional fields
on the statistics that allows explicit signed and unsigned statistics to
be specified. The PR to parquet-format can be seen at
https://github.com/apache/parquet-format/pull/42.

I wanted to circulate this change to the community for comment.

-Andrew
