Hey Julien,

Thanks for the pointer! I actually went and took a look and I think I found a 
way to fix this that doesn’t require a format change.
It only required adding one new static, UTF-8-aware comparison method to 
Binary, and as I mention in the PR, it performs UTF-8 comparison the same way 
that both Avro and Spark do:

Avro: 
https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/io/BinaryData.java#L184
Spark: 
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L835
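
For reference, here is a minimal sketch (my own illustration, not the actual 
patch) of the unsigned, lexicographic byte comparison that both of those 
implementations boil down to; the class and method names are placeholders:

    // Sketch only: compares UTF-8 encoded strings byte-by-byte, treating
    // each byte as unsigned so non-ASCII sequences sort after ASCII.
    public final class Utf8ComparisonSketch {
      public static int compareUtf8(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
          // Mask to 0-255; bytes >= 0x80 (anything outside ASCII) then
          // compare greater than any ASCII byte.
          int ai = a[i] & 0xFF;
          int bi = b[i] & 0xFF;
          if (ai != bi) {
            return ai - bi;
          }
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return a.length - b.length;
      }
    }

The 0xFF mask is the whole trick; it matches the ordering Spark uses when it 
sorts, so the min/max that end up in the statistics agree with the sort order 
of the data in the file.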

This doesn’t break any compatibility, as it only affects the write path, and 
only the generation of BinaryStatistics when the Binary is a FromStringBinary, 
so it won’t affect existing Parquet datasets.
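
As a concrete illustration of the signed/unsigned mismatch that was breaking 
the statistics, here is a small, hypothetical demo (not part of the PR; it 
needs Java 9+ for Arrays.compare/compareUnsigned) showing how the two 
comparisons disagree as soon as a string contains a non-ASCII character:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class Utf8OrderingDemo {
      public static void main(String[] args) {
        byte[] ascii = "z".getBytes(StandardCharsets.UTF_8);    // 0x7A
        byte[] accented = "é".getBytes(StandardCharsets.UTF_8); // 0xC3 0xA9

        // Signed comparison (Java's default view of byte): 0xC3 is -61,
        // so "é" sorts before "z".
        System.out.println(Arrays.compare(accented, ascii));         // negative
        // Unsigned comparison (how Spark orders strings): 0xC3 is 195,
        // so "é" sorts after "z".
        System.out.println(Arrays.compareUnsigned(accented, ascii)); // positive
      }
    }

The old statistics were computed with the first ordering while Spark sorted 
and filtered with the second, which is exactly the mismatch from the original 
thread; the new method makes the write path use the second.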

I pushed a PR and tagged you in it; it’s visible here: 
https://github.com/apache/parquet-mr/pull/362. Please let me know what you 
think.
-Andrew

On 8/20/16, 9:02 PM, "Julien Le Dem" <[email protected]> wrote:

>This sounds like Parquet could be improved in that regard.
>One way to evolve this in a backward compatible manner is to add optional
>fields in the Statistics struct that would have the min_utf8, max_utf8
>semantics you describe. These would be added to binary fields labelled with
>the logical type UTF8 [1] (which is the true String type in Parquet).
>
>[1]
>https://github.com/apache/parquet-format/blob/66a5a7b982e291e06afb1da7ffe9da211318caba/src/main/thrift/parquet.thrift#L50
>
>On Fri, Aug 19, 2016 at 3:37 AM, Andrew Duffy <[email protected]> wrote:
>
>> Hello Parquet-Dev,
>>
>>
>>
>> I wanted to get some feedback on something we’ve been running into that
>> revolves around the difference between Parquet and Spark sorted ordering
>> for UTF8 strings.
>>
>>
>>
>> Spark has a special type for strings, so when it performs a sort (e.g.
>> before writing out to Parquet) it uses string-wise ordering, in which all
>> characters outside the ASCII range compare “greater than” anything inside it.
>>
>>
>>
>> However, when Spark pushes down to Parquet, strings are treated as Binary,
>> and Binary compares two strings as byte[], using Java’s *signed* byte type,
>> so anything starting with a non-ASCII UTF-8 byte is seen as being “less
>> than” anything in the ASCII range. The way this manifests itself is that
>> Spark sorts the records using its comparison, and then Parquet calculates
>> the min and max for Statistics using signed-byte comparison, so when you
>> push down filters in Spark you end up reading data you shouldn’t have to,
>> because the statistics are broken for what you’re trying to do.
>>
>>
>>
>> I was wondering if anyone has strong opinions about the best way to fix
>> this. Perhaps adding a true “String” type in Parquet with a well-defined
>> ordering would be the way to go, or does anyone have recommendations for
>> Spark-side fixes? Another thing we could do is force binary comparisons to
>> treat bytes as unsigned, which would be a breaking change but might be what
>> we actually want to be doing when comparing bytes.
>>
>>
>>
>> -Andrew
>>
>>
>>
>
>
>
>-- 
>Julien
