Hello Parquet-Dev,

I wanted to get some feedback on an issue we’ve been running into around the
difference between Parquet’s and Spark’s sort ordering for UTF-8 strings.

Spark has a special type for strings, so when it performs a sort, e.g. before
writing out to Parquet, it uses string-wise ordering, under which all
non-ASCII Unicode characters compare “greater than” anything in the ASCII
range.

However, when Spark pushes filters down to Parquet, strings are treated as
Binary, and Binary comparison works on the underlying byte[] using Java’s
signed byte type, so any string starting with a multi-byte UTF-8 character
(whose lead byte is >= 0x80, i.e. negative as a signed byte) is seen as being
“less than” anything in the ASCII range. The way this manifests itself is
that Spark sorts the records using its comparison, and then Parquet
calculates the min and max for Statistics using signed-byte comparison, so
when you push a filter down in Spark you end up reading data the statistics
should have let you skip, because the min/max are wrong for the ordering
you’re actually filtering on.
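
To make this concrete, here is a minimal, self-contained Java sketch
(illustration only, not Parquet code; the class and method names are made
up) showing how signed-byte comparison inverts the ordering for a non-ASCII
string:

    import java.nio.charset.StandardCharsets;

    public class SignedByteOrdering {
        // Lexicographic comparison on signed bytes, mimicking how the
        // byte[] backing a Binary value gets ordered today.
        static int compareSigned(byte[] a, byte[] b) {
            int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++) {
                if (a[i] != b[i]) return a[i] - b[i]; // signed subtraction
            }
            return a.length - b.length;
        }

        public static void main(String[] args) {
            byte[] z = "z".getBytes(StandardCharsets.UTF_8); // 0x7A
            byte[] e = "é".getBytes(StandardCharsets.UTF_8); // 0xC3 0xA9

            // String-wise (what Spark’s sort sees): "é" > "z"
            System.out.println("é".compareTo("z"));   // positive
            // Signed byte[] (what the Statistics see): "é" < "z"
            System.out.println(compareSigned(e, z));  // negative
        }
    }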

I was wondering if anyone had strong opinions about the best way to fix
this. Perhaps adding a true “String” type in Parquet that has a well-defined
ordering would be the way to go, or does anyone have recommendations for
Spark-side fixes? Another thing we could do is force binary comparisons to
treat the bytes as unsigned, which would be a breaking change but might be
what we actually want to be doing when comparing bytes.
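
For reference, making the comparison unsigned is a small change (mask each
byte to 0..255 before comparing), and for UTF-8 encoded strings unsigned
byte order agrees with Unicode code point order. A sketch (hypothetical
helper, not an existing Parquet API):

    // Unsigned lexicographic byte[] comparison; for UTF-8 bytes this
    // matches Unicode code point ordering.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF); // treat as 0..255
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;
    }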

-Andrew