Hello Dev-Parquet,

I recently filed an issue, PARQUET-686 <https://issues.apache.org/jira/browse/PARQUET-686>, to fix the unexpected sort order used for Binary types in Parquet. The goal is to allow statistics to be calculated from an *unsigned* interpretation of binary bytestrings, which is what you want for UTF8 columns, for example. The current behavior causes a correctness issue in Spark (see SPARK-17213 <https://issues.apache.org/jira/browse/SPARK-17213> for details), so it is likely also broken in other query engines that push down String filters to Parquet.
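To make the problem concrete, here is a small Python sketch of how signed and unsigned byte comparisons give different min/max statistics for the same UTF8 values. This is not the parquet-mr code; the helper names (signed_key, unsigned_key, can_skip_greater_than) are hypothetical, and the pruning rule is a simplified stand-in for what a filter-pushdown reader might do.

```python
def signed_key(b):
    # Interpret each byte as a signed 8-bit value (-128..127),
    # the way a signed byte type would compare it.
    return [x - 256 if x >= 128 else x for x in b]


def unsigned_key(b):
    # Interpret each byte as an unsigned value (0..255).
    return list(b)


# Three UTF8-encoded strings; "é" encodes to the bytes 0xC3 0xA9.
values = [s.encode("utf-8") for s in ["a", "z", "é"]]

signed_min = min(values, key=signed_key)
signed_max = max(values, key=signed_key)
unsigned_min = min(values, key=unsigned_key)
unsigned_max = max(values, key=unsigned_key)


def can_skip_greater_than(stat_max, literal):
    # Hypothetical pruning rule for the predicate `col > literal`:
    # skip the row group if its max is not greater than the literal,
    # comparing bytes as unsigned.
    return unsigned_key(stat_max) <= unsigned_key(literal)


# With signed statistics the max is "z", so the row group is skipped
# for `col > "z"` even though "é" matches under unsigned ordering.
wrongly_skipped = can_skip_greater_than(signed_max, b"z")

# With unsigned statistics the max is "é", so the row group is kept.
correctly_kept = not can_skip_greater_than(unsigned_max, b"z")
```

Here the signed ordering puts "é" below "a" because its leading byte 0xC3 is negative as a signed byte, so the statistics written under one interpretation mislead a reader that assumes the other.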
The fix requires changes both to every implementation of Parquet (parquet-mr, parquet-cpp) and to the format itself, which gains a new set of optional fields on the statistics that allow signed and unsigned statistics to be specified explicitly. The PR to parquet-format can be seen at https://github.com/apache/parquet-format/pull/42. I wanted to share this change with the community for comment.

-Andrew
