Cheng Lian reassigned SPARK-17213:

    Assignee: Cheng Lian

> Parquet String Pushdown for Non-Eq Comparisons Broken
> -----------------------------------------------------
>                 Key: SPARK-17213
>                 URL: https://issues.apache.org/jira/browse/SPARK-17213
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>            Reporter: Andrew Duffy
>            Assignee: Cheng Lian
> Spark defines ordering over strings based on comparison of UTF8 byte arrays, 
> which compare bytes as unsigned integers. Currently however Parquet does not 
> respect this ordering. This is currently in the process of being fixed in 
> Parquet, JIRA and PR link below, but currently all filters are broken over 
> strings, with there actually being a correctness issue for {{>}} and {{<}}.
> *Repro:*
> Querying directly from in-memory DataFrame:
> {code}
>     > Seq("a", "é").toDF("name").where("name > 'a'").count
>     1
> {code}
> Querying from a parquet dataset:
> {code}
>     > Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
>     > spark.read.parquet("/tmp/bad").where("name > 'a'").count
>     0
> {code}
> This happens because Spark sorts the rows to be {{[a, é]}}, but Parquet's 
> implementation of comparison of strings is based on signed byte array 
> comparison, so it will actually create 1 row group with statistics 
> {{min=é,max=a}}, and so the row group will be dropped by the query.
> Based on the way Parquet pushes down Eq, it will not be affecting correctness 
> but it will force you to read row groups you should be able to skip.
> Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
> Link to PR: https://github.com/apache/parquet-mr/pull/362

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to