Hello Parquet-Dev,
I wanted to get some feedback on an issue we’ve been running into that revolves around the difference between Parquet’s and Spark’s sort ordering for UTF-8 strings. Spark has a dedicated type for strings, so when it performs a sort, e.g. before writing out to Parquet, it uses string-wise ordering in which all non-ASCII characters compare “greater than” anything in the ASCII range. However, when Spark pushes down to Parquet, strings are treated as Binary, and the Binary comparison of two strings is a byte[] comparison using Java’s signed byte type, so any string starting with a non-ASCII UTF-8 byte (high bit set, hence negative as a signed byte) is seen as being “less than” anything in the ASCII range.

The way this manifests itself is that Spark sorts the records using its comparison, and Parquet then calculates the min and max for Statistics using signed-byte comparison. When you push a filter down from Spark, you’re basically required to read data you shouldn’t have to, because the statistics are broken for the ordering you’re filtering with. (There’s a quick demo of the discrepancy in the P.S. below.)

I was wondering if anyone has strong opinions about the best way to fix this. Perhaps adding a true “String” type in Parquet that has a well-defined ordering would be the way to go, or does anyone have recommendations for Spark-side fixes? Another thing we could do is force binary comparisons to assume that bytes are unsigned, which would be a breaking change but might be what we actually want to be doing when comparing bytes.

-Andrew
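
P.S. In case it helps, here is a minimal, dependency-free sketch of the discrepancy. This is plain Java of my own, not code from either project; the two comparators below just mimic the orderings as I understand them:

import java.nio.charset.StandardCharsets;

public class SignedVsUnsigned {

    // Lexicographic comparison treating each byte as SIGNED (Java byte
    // semantics), i.e. the ordering Parquet's Binary min/max effectively
    // uses for UTF-8 data today.
    static int compareSigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return Byte.compare(a[i], b[i]);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // Lexicographic comparison treating each byte as UNSIGNED, which matches
    // Spark's string ordering (memcmp-style) and is what the "assume bytes
    // are unsigned" fix would amount to.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return (a[i] & 0xFF) - (b[i] & 0xFF);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] ascii = "a".getBytes(StandardCharsets.UTF_8);         // 0x61
        byte[] nonAscii = "\u00e9".getBytes(StandardCharsets.UTF_8); // 0xC3 0xA9

        System.out.println("signed:   " + compareSigned(nonAscii, ascii));   // < 0: non-ASCII sorts below ASCII
        System.out.println("unsigned: " + compareUnsigned(nonAscii, ascii)); // > 0: non-ASCII sorts above ASCII
    }
}

The signed comparison returns a negative value and the unsigned one a positive value, so the two orderings disagree on exactly the strings that straddle the ASCII boundary, which is how Spark-sorted output ends up with min/max Statistics that don’t match Spark’s ordering.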
