[ https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712707#comment-15712707 ]
Cheng Lian commented on SPARK-17213: ------------------------------------ Agree that we should disable string and binary filter push down for now until PARQUET-686 gets fixed. We turned off Parquet filter pushdown for string and binary columns in 1.6 due to PARQUET-251 (see SPARK-11153). In Spark 2.1, we upgraded to Parquet 1.8.1 to get PARQUET-251 fixed, then this issue pops up due to PARQUET-686. I think this also affects Spark 1.5.1 and prior versions. > Parquet String Pushdown for Non-Eq Comparisons Broken > ----------------------------------------------------- > > Key: SPARK-17213 > URL: https://issues.apache.org/jira/browse/SPARK-17213 > Project: Spark > Issue Type: Bug > Affects Versions: 2.1.0 > Reporter: Andrew Duffy > > Spark defines ordering over strings based on comparison of UTF8 byte arrays, > which compare bytes as unsigned integers. Currently however Parquet does not > respect this ordering. This is currently in the process of being fixed in > Parquet, JIRA and PR link below, but currently all filters are broken over > strings, with there actually being a correctness issue for {{>}} and {{<}}. > *Repro:* > Querying directly from in-memory DataFrame: > {code} > > Seq("a", "é").toDF("name").where("name > 'a'").count > 1 > {code} > Querying from a parquet dataset: > {code} > > Seq("a", "é").toDF("name").write.parquet("/tmp/bad") > > spark.read.parquet("/tmp/bad").where("name > 'a'").count > 0 > {code} > This happens because Spark sorts the rows to be {{[a, é]}}, but Parquet's > implementation of comparison of strings is based on signed byte array > comparison, so it will actually create 1 row group with statistics > {{min=é,max=a}}, and so the row group will be dropped by the query. > Based on the way Parquet pushes down Eq, it will not be affecting correctness > but it will force you to read row groups you should be able to skip. > Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686 > Link to PR: https://github.com/apache/parquet-mr/pull/362 -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org