The appropriate comparison here strikes me as dependent on the ConvertedType of the column. Adding explicit signed/unsigned min/max of course gives you both options after the fact. Another option (if I'm understanding correctly) would be to change the BYTE_ARRAY comparison parquet-mr uses for the UTF8 ConvertedType to be unsigned.
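To make the UTF8 case concrete, here is a minimal, self-contained Java sketch (the class and method names are mine, not parquet-mr's actual comparator) of how signed versus unsigned lexicographic byte comparison orders a non-ASCII UTF-8 string relative to an ASCII one:

import java.nio.charset.StandardCharsets;

// Minimal illustration (not parquet-mr's actual comparator): signed vs.
// unsigned lexicographic comparison of the UTF-8 bytes of two strings.
public class ByteOrderDemo {

  // Signed comparison: each byte is treated as a Java byte (-128..127).
  static int compareSigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int cmp = Byte.compare(a[i], b[i]);
      if (cmp != 0) return cmp;
    }
    return Integer.compare(a.length, b.length);
  }

  // Unsigned comparison: each byte is treated as 0..255, which matches
  // UTF-8 code point order and is what string predicates expect.
  static int compareUnsigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
      if (cmp != 0) return cmp;
    }
    return Integer.compare(a.length, b.length);
  }

  public static void main(String[] args) {
    byte[] ascii = "z".getBytes(StandardCharsets.UTF_8);     // 0x7A
    byte[] nonAscii = "é".getBytes(StandardCharsets.UTF_8);  // 0xC3 0xA9

    // Signed: 0xC3 is negative (-61), so "é" sorts before "z" -- the wrong
    // order for UTF-8, and min/max statistics computed this way can lead
    // predicate pushdown to skip row groups that actually contain matches.
    System.out.println(compareSigned(nonAscii, ascii));    // negative
    // Unsigned: 0xC3 (195) > 0x7A (122), so "é" > "z" -- the expected order.
    System.out.println(compareUnsigned(nonAscii, ascii));  // positive
  }
}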
As an aside, are decimal statistics (e.g. 12-byte or 16-byte decimals) valid based on a signed binary comparison? (A small sketch illustrating the question follows the quoted message below.)

Since we don't have any heavily dependent production users of parquet-cpp yet, we'll be happy to implement whatever solution works for everyone.

- Wes

On Thu, Aug 25, 2016 at 6:08 PM, Andrew Duffy <[email protected]> wrote:
> Hello Dev-Parquet,
>
> I recently filed an issue, PARQUET-686
> <https://issues.apache.org/jira/browse/PARQUET-686>, to attempt to fix
> the abnormal sort order for Binary types in Parquet. This is to allow
> statistics to be calculated based on an *unsigned* interpretation of
> binary byte strings, which is what you want for UTF8 columns, for
> example. This is currently causing a correctness issue in Spark (see
> SPARK-17213 <https://issues.apache.org/jira/browse/SPARK-17213> for
> details), and it is likely also broken in other query engines that push
> down String filters to Parquet.
>
> The fix requires changes both to the Parquet implementations
> (parquet-mr, parquet-cpp) and to the format itself, adding a new set of
> optional fields to the statistics that allow specifying explicit signed
> and unsigned statistics. The PR to parquet-format can be seen at
> https://github.com/apache/parquet-format/pull/42.
>
> I wanted to distribute this change back out to the community for
> comment.
>
> -Andrew
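Regarding the decimal aside above: a minimal, hypothetical Java sketch (the class and helper names are mine, not from parquet-mr or parquet-cpp) of why a plain signed per-byte comparison of big-endian two's-complement decimal bytes does not always match numeric order:

import java.math.BigInteger;
import java.util.Arrays;

// Parquet stores DECIMAL unscaled values as big-endian two's-complement
// bytes; this sketch asks whether a plain signed per-byte comparison of
// those bytes matches numeric order.
public class DecimalOrderDemo {

  // Pad a two's-complement big-endian value to a fixed width,
  // sign-extending on the left (as a fixed-length decimal would be stored).
  static byte[] toFixed(BigInteger v, int width) {
    byte[] raw = v.toByteArray();
    byte[] out = new byte[width];
    byte pad = (byte) (v.signum() < 0 ? 0xFF : 0x00);
    Arrays.fill(out, pad);
    System.arraycopy(raw, 0, out, width - raw.length, raw.length);
    return out;
  }

  // Plain signed lexicographic comparison, byte by byte (equal widths assumed).
  static int compareSigned(byte[] a, byte[] b) {
    for (int i = 0; i < a.length; i++) {
      int cmp = Byte.compare(a[i], b[i]);
      if (cmp != 0) return cmp;
    }
    return 0;
  }

  public static void main(String[] args) {
    // Two unscaled decimal values stored as 16-byte two's complement.
    byte[] minusOne = toFixed(BigInteger.valueOf(-1), 16);    // FF FF ... FF
    byte[] minus256 = toFixed(BigInteger.valueOf(-256), 16);  // FF ... FF 00

    // Numerically, -256 < -1. But a signed per-byte comparison sees the
    // trailing 0xFF of -1 as -1 and the trailing 0x00 of -256 as 0, so it
    // reports -1 < -256: the trailing bytes would have to be compared as
    // unsigned (with only the leading sign byte signed) to match numeric order.
    System.out.println(compareSigned(minusOne, minus256));  // negative (wrong order)
  }
}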
