Hey Julien,

Thanks for the pointer! I took a look, and I think I found a way to fix this that doesn't require a format change. It required adding one new static comparison method to Binary that is UTF-8 aware, and as I mention in the PR, it performs UTF-8 comparison the same way that Avro and Spark both perform it:

Avro: https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/io/BinaryData.java#L184
Spark: https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L835
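For anyone skimming the thread, here's a minimal standalone sketch of the idea (class and method names are mine, not the actual parquet-mr patch): lexicographic comparison of UTF-8 encoded byte arrays treating each byte as unsigned, contrasted with the signed comparison that produces the broken min/max:

    import java.nio.charset.StandardCharsets;

    public class Utf8CompareSketch {

      // Compare byte arrays lexicographically, byte-by-byte as UNSIGNED values.
      // For valid UTF-8 this matches Unicode code point ordering.
      static int compareUtf8(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
          // Mask with 0xFF so bytes >= 0x80 (the lead byte of any non-ASCII
          // UTF-8 sequence) sort AFTER the ASCII range instead of before it.
          int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
          if (cmp != 0) {
            return cmp;
          }
        }
        return a.length - b.length;
      }

      // Naive signed comparison, for contrast: bytes 0x80..0xFF are negative,
      // so non-ASCII content sorts before ASCII.
      static int compareSigned(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
          int cmp = a[i] - b[i];
          if (cmp != 0) {
            return cmp;
          }
        }
        return a.length - b.length;
      }

      public static void main(String[] args) {
        byte[] ascii = "z".getBytes(StandardCharsets.UTF_8);         // 0x7A
        byte[] nonAscii = "\u00e9".getBytes(StandardCharsets.UTF_8); // é -> 0xC3 0xA9

        // Unsigned/UTF-8 ordering: é > z, matching Spark's string ordering.
        System.out.println(compareUtf8(nonAscii, ascii) > 0);        // true
        // Signed byte ordering: é < z, which is what corrupts min/max stats.
        System.out.println(compareSigned(nonAscii, ascii) < 0);      // true
      }
    }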
This doesn't break any compatibility, as it only affects the write path, and only affects generation of BinaryStatistics when the Binary is a FromStringBinary, so existing Parquet datasets are unaffected. I pushed a PR and tagged you in it; it's visible here: https://github.com/apache/parquet-mr/pull/362. Please let me know what you think.

-Andrew

On 8/20/16, 9:02 PM, "Julien Le Dem" <[email protected]> wrote:

>This sounds like Parquet could be improved in that regard.
>One way to evolve this in a backward-compatible manner is to add optional
>fields in the Statistics struct that would have the min_utf8, max_utf8
>semantics you describe. These would be added to binary fields labelled with
>the logical type UTF8 [1] (which is the true String type in Parquet).
>
>[1]
>https://github.com/apache/parquet-format/blob/66a5a7b982e291e06afb1da7ffe9da211318caba/src/main/thrift/parquet.thrift#L50
>
>On Fri, Aug 19, 2016 at 3:37 AM, Andrew Duffy <[email protected]> wrote:
>
>> Hello Parquet-Dev,
>>
>> I wanted to get some feedback on something we've been running into that
>> revolves around the difference between Parquet and Spark sort ordering
>> for UTF-8 strings.
>>
>> Spark has a special type for strings, so when it performs a sort (e.g.
>> before writing out to Parquet) it uses string-wise ordering, under which
>> all non-ASCII Unicode characters are "greater than" anything in the
>> ASCII range.
>>
>> However, when Spark pushes down to Parquet, strings are treated as Binary,
>> and the Binary comparison of two strings is done on byte[], which is a
>> *signed* byte type, so anything starting with a non-ASCII UTF-8 character
>> is seen as being "less than" anything in the ASCII range. The way this
>> manifests itself is that Spark sorts the records using its comparison,
>> and then Parquet calculates the min and max for Statistics using
>> signed-byte comparison, so when you push down filters in Spark you're
>> basically required to read data you shouldn't have to, because your
>> statistics are broken for what you're trying to do.
>>
>> I was wondering if anyone had strong opinions about the best way to fix
>> this. Perhaps adding a true "String" type in Parquet that has a
>> well-defined ordering would be the way to go, or does anyone have
>> recommendations for Spark-side fixes? Another thing we could do is force
>> binary comparisons to assume that bytes are supposed to be unsigned,
>> which would be a breaking change but might be the thing we want to
>> actually be doing when comparing bytes.
>>
>> -Andrew
>>
>
>
>--
>Julien