It sounds like Parquet could be improved in that regard. One way to evolve this in a backward-compatible manner is to add optional fields to the Statistics struct that carry the min_utf8 and max_utf8 semantics you describe. These would be added for binary fields labelled with the logical type UTF8 [1] (which is the true string type in Parquet).
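
To make the mismatch in the quoted message below concrete, here is a minimal Python sketch. It is not Parquet or Spark code: the helpers signed_key, unsigned_key and might_contain are hypothetical names made up for illustration, and the three sample strings are arbitrary. It only shows that min/max computed over signed bytes differ from min/max computed over unsigned bytes as soon as a non-ASCII character is involved.

```python
def signed_key(data: bytes):
    # Reinterpret each byte as a signed 8-bit value (as Java's byte type does).
    return [b - 256 if b > 127 else b for b in data]


def unsigned_key(data: bytes):
    # Python bytes already iterate as unsigned values in the range 0..255.
    return list(data)


def might_contain(target: bytes, lo: bytes, hi: bytes, key) -> bool:
    # Hypothetical min/max range check, as a statistics-based filter might do.
    return key(lo) <= key(target) <= key(hi)


# Arbitrary sample values; "éclair" starts with a multi-byte UTF-8 sequence.
values = [s.encode("utf-8") for s in ["apple", "zebra", "éclair"]]

# Statistics as a writer using signed byte comparison would compute them.
signed_min = min(values, key=signed_key)
signed_max = max(values, key=signed_key)

# Statistics under unsigned byte comparison (code point order for UTF-8).
unsigned_min = min(values, key=unsigned_key)
unsigned_max = max(values, key=unsigned_key)

print("signed:  ", signed_min, signed_max)
print("unsigned:", unsigned_min, unsigned_max)

# Mixing the two orderings makes the range check unreliable in this toy case.
print(might_contain(b"apple", signed_min, signed_max, unsigned_key))
print(might_contain(b"apple", unsigned_min, unsigned_max, unsigned_key))
```

In this toy example the signed ordering picks "éclair" as the minimum, while the unsigned ordering picks it as the maximum, so a range check done in one ordering against statistics written in the other cannot be trusted. Separate optional fields with a well-defined unsigned ordering would avoid that ambiguity without changing the meaning of the existing min and max.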

[1] https://github.com/apache/parquet-format/blob/66a5a7b982e291e06afb1da7ffe9da211318caba/src/main/thrift/parquet.thrift#L50

On Fri, Aug 19, 2016 at 3:37 AM, Andrew Duffy <[email protected]> wrote:

> Hello Parquet-Dev,
>
> I wanted to get some feedback on something we’ve been running into that
> revolves around the difference between Parquet and Spark sorted ordering
> for UTF8 strings.
>
> Spark has a special type for Strings, so when it performs sort ex. before
> writing out to Parquet it will perform string-wise ordering, so all Unicode
> characters are “greater than” anything in the ASCII range.
>
> However, when Spark pushes down to Parquet strings are treated as Binary,
> and the Binary comparison of two strings is byte[], which is *signed*
> byte type, so anything starting with a UTF8 character is seen as been “less
> than” anything in ASCII range. The way this manifests itself is that Spark
> sorts the records using its comparison, and then Parquet calculates the min
> and max for Statistics using signed bytes comparison, so when you pushdown
> in Spark you’re basically required to look at things that you shouldn’t
> have to look at because your statistics are broken for what you’re trying
> to do.
>
> I was wondering if anyone had strong opinions about the best way to fix
> this, perhaps adding a true “String” type in Parquet that has a
> well-defined ordering would be the way to go, or does anyone have
> recommendations for Spark-side fixes? Another thing we could do is force
> binary comparisons to assume that bytes are supposed to be unsigned, which
> would be a breaking change but might be the thing we want to actually be
> doing when comparing bytes?
>
> -Andrew

--
Julien
