The appropriate comparison here strikes me as dependent on the
ConvertedType of the column. Adding explicit signed/unsigned min/max
of course gives you both options after the fact. Another option (if
I'm understanding correctly) would be to change the BYTE_ARRAY
comparison that parquet-mr uses for the UTF8 ConvertedType.
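
To make the signed/unsigned distinction concrete, here is a minimal
sketch (hypothetical code, not taken from parquet-mr; the class and
method names are made up) of how the two interpretations order the
same UTF-8 bytes differently:

    // Hypothetical illustration (not actual parquet-mr code): Java bytes
    // are signed, so any UTF-8 byte >= 0x80 compares as negative and
    // multi-byte characters sort before ASCII under a signed comparison.
    import java.nio.charset.StandardCharsets;

    public class SignedVsUnsignedCompare {
        // Lexicographic compare treating each byte as signed (Java default).
        static int compareSigned(byte[] a, byte[] b) {
            int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++) {
                if (a[i] != b[i]) return Integer.compare(a[i], b[i]);
            }
            return Integer.compare(a.length, b.length);
        }

        // Lexicographic compare treating each byte as unsigned
        // (the interpretation UTF8 columns need).
        static int compareUnsigned(byte[] a, byte[] b) {
            int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++) {
                if (a[i] != b[i]) return Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            }
            return Integer.compare(a.length, b.length);
        }

        public static void main(String[] args) {
            byte[] ascii = "abc".getBytes(StandardCharsets.UTF_8);
            byte[] accented = "\u00e9".getBytes(StandardCharsets.UTF_8); // "é" -> 0xC3 0xA9
            // Signed: 0xC3 reads as -61, so "é" sorts before "abc".
            // Unsigned: 0xC3 reads as 195, so "é" sorts after "abc",
            // matching Unicode code point order.
            System.out.println(compareSigned(ascii, accented));   // positive
            System.out.println(compareUnsigned(ascii, accented)); // negative
        }
    }

The practical consequence is that min/max computed with the signed
comparison are not valid bounds under string (unsigned) ordering, so a
pushed-down String filter can skip row groups that actually contain
matching values.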

As an aside, are decimal statistics (e.g. for 12-byte or 16-byte
decimals) valid under a signed binary comparison?
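
For context on that question: Parquet stores a decimal's unscaled
value as a big-endian two's-complement byte array, so whether min/max
are numerically meaningful depends on how those bytes are interpreted.
A small hypothetical sketch (not from either codebase):

    // Hypothetical illustration of the decimal question above. Parquet
    // stores a decimal's unscaled value as a big-endian two's-complement
    // byte array (e.g. 12 or 16 bytes for fixed-length decimals).
    import java.math.BigInteger;
    import java.util.Arrays;

    public class DecimalBytes {
        // Encode an unscaled value into a fixed-width big-endian
        // two's-complement array, sign-extending into the leading bytes.
        static byte[] encode(long unscaled, int width) {
            byte[] out = new byte[width];
            Arrays.fill(out, (byte) (unscaled < 0 ? 0xFF : 0x00));
            byte[] src = BigInteger.valueOf(unscaled).toByteArray();
            System.arraycopy(src, 0, out, width - src.length, src.length);
            return out;
        }

        public static void main(String[] args) {
            byte[] minusOne = encode(-1, 12); // FF FF ... FF
            byte[] plusOne  = encode(1, 12);  // 00 00 ... 01
            // Unsigned lexicographic order: 0xFF > 0x00, so -1 sorts after
            // +1, which is numerically wrong. Treating the leading byte as
            // signed puts -1 first, which is why the sign of the comparison
            // matters for decimal min/max.
            System.out.printf("minusOne[0]=%02X, plusOne[0]=%02X%n",
                              minusOne[0] & 0xFF, plusOne[0] & 0xFF);
        }
    }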

Since parquet-cpp doesn't yet have production users who depend heavily
on it, we'll be happy to implement whatever solution works for
everyone.

- Wes

On Thu, Aug 25, 2016 at 6:08 PM, Andrew Duffy <[email protected]> wrote:
> Hello Dev-Parquet,
>
> I recently filed an issue, PARQUET-686
> <https://issues.apache.org/jira/browse/PARQUET-686>, to fix the
> incorrect sort order for Binary types in Parquet. The goal is to allow
> statistics to be calculated from an *unsigned* interpretation of binary
> byte strings, which is what you want for UTF8 columns, for example. The
> current behavior causes a correctness issue in Spark (see SPARK-17213
> <https://issues.apache.org/jira/browse/SPARK-17213> for details), and it
> is likely also broken in other query engines that push down String
> filters to Parquet.
>
> The fix requires changes both to the Parquet implementations
> (parquet-mr, parquet-cpp) and to the format itself, which gains a new
> set of optional fields on the statistics for specifying explicit signed
> and unsigned statistics. The PR to parquet-format can be seen at
> https://github.com/apache/parquet-format/pull/42.
>
> I wanted to circulate this change back to the community for comment.
>
> -Andrew
