This sounds like an area where Parquet could be improved.
One way to evolve this in a backward-compatible manner is to add optional
fields to the Statistics struct that would have the min_utf8, max_utf8
semantics you describe. These would be populated for binary fields labelled
with the logical type UTF8 [1] (which is the true String type in Parquet).
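
To make the mismatch described in the quoted message concrete, here is a
minimal Java sketch of the two orderings. The class and method names are
made up for illustration and are not Parquet or Spark APIs. It assumes
lexicographic comparison of UTF-8 encoded bytes, once with signed bytes and
once with unsigned bytes (unsigned byte order matches Unicode code point
order for valid UTF-8).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not code from Parquet or Spark.
public class ByteOrderingSketch {

    // Lexicographic comparison treating each byte as signed (-128..127),
    // which is what comparing raw Java byte values does.
    public static int compareSigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return Byte.compare(a[i], b[i]);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // Lexicographic comparison treating each byte as unsigned (0..255).
    public static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] ascii = "z".getBytes(StandardCharsets.UTF_8);         // 0x7A
        byte[] nonAscii = "\u00e9".getBytes(StandardCharsets.UTF_8); // 0xC3 0xA9

        // Signed: 0xC3 is -61, so the non-ASCII string sorts before "z".
        System.out.println(compareSigned(ascii, nonAscii) > 0);   // prints "true"
        // Unsigned: 0xC3 is 195, so the non-ASCII string sorts after "z".
        System.out.println(compareUnsigned(ascii, nonAscii) < 0); // prints "true"
    }
}
```

With signed comparison the min and max computed for a column containing
both values come out swapped relative to the unsigned ordering, which is
why statistics computed one way cannot safely prune data sorted the other
way.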

[1]
https://github.com/apache/parquet-format/blob/66a5a7b982e291e06afb1da7ffe9da211318caba/src/main/thrift/parquet.thrift#L50

On Fri, Aug 19, 2016 at 3:37 AM, Andrew Duffy <[email protected]> wrote:

> Hello Parquet-Dev,
>
>
>
> I wanted to get some feedback on something we’ve been running into that
> revolves around the difference between Parquet and Spark sorted ordering
> for UTF8 strings.
>
>
>
> Spark has a special type for Strings, so when it performs a sort, e.g.
> before writing out to Parquet, it uses string-wise ordering, so all
> non-ASCII characters are “greater than” anything in the ASCII range.
>
>
>
> However, when Spark pushes down to Parquet, strings are treated as Binary,
> and the Binary comparison of two strings operates on byte[], whose elements
> are *signed* bytes, so anything starting with a non-ASCII character is seen
> as being “less than” anything in the ASCII range. The way this manifests
> itself is that Spark sorts the records using its comparison, and then
> Parquet calculates the min and max for Statistics using signed-byte
> comparison, so when you push down a filter in Spark you are forced to read
> data that you shouldn’t have to read, because the statistics are wrong for
> the ordering you are filtering on.
>
>
>
> I was wondering if anyone had strong opinions about the best way to fix
> this. Perhaps adding a true “String” type in Parquet that has a
> well-defined ordering would be the way to go, or does anyone have
> recommendations for Spark-side fixes? Another thing we could do is force
> binary comparisons to treat bytes as unsigned, which would be a breaking
> change but might be what we actually want when comparing bytes.
>
>
>
> -Andrew
>
>
>



-- 
Julien
