[
https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ryan Blue updated PARQUET-1510:
-------------------------------
    Affects Version/s: 1.9.1
                       1.9.0
                       1.10.0
> Dictionary filter skips null values when evaluating not-equals.
> ---------------------------------------------------------------
>
> Key: PARQUET-1510
> URL: https://issues.apache.org/jira/browse/PARQUET-1510
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.9.0, 1.10.0, 1.9.1
> Reporter: Ryan Blue
> Priority: Major
> Labels: correctness, pull-request-available
>
> This was discovered in Spark; see SPARK-26677. From the Spark PR:
> {code}
> // Repeat the values to get dictionary encoding.
> Seq(Some("A"), Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> +-----+
> {code}
> {code}
> // Use plain encoding.
> Seq(Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar")
> spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
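> This is a correctness issue: with dictionary encoding the null row is silently dropped, while the same query on plain-encoded data returns it.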
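
One way to read the difference between the two outputs above is that a dictionary holds only the non-null values of a column chunk, so a not-equals check that consults the dictionary alone never sees the nulls. The sketch below is a simplified model of that reasoning, not the parquet-mr implementation. `ChunkSummary`, `NotEqDictionaryFilter`, and both method names are hypothetical, and the sketch assumes that a null counts as "not equal" to a non-null literal, as in `NOT (value <=> 'A')`.

```scala
// Hypothetical summary of one dictionary-encoded column chunk.
// The dictionary contains only the distinct non-null values;
// nulls are not dictionary entries, so they are counted separately.
final case class ChunkSummary(dictionary: Set[String], nullCount: Long)

object NotEqDictionaryFilter {
  // Decides whether a chunk can be skipped for the predicate `column != value`.
  // This variant looks only at the dictionary: if the sole entry is `value`,
  // it concludes that no row can match, and so ignores any null rows.
  def canDropIgnoringNulls(chunk: ChunkSummary, value: String): Boolean =
    chunk.dictionary == Set(value)

  // Null-aware variant (assumption: nulls satisfy the not-equals predicate).
  // The chunk is skipped only when it also contains no nulls.
  def canDropNullAware(chunk: ChunkSummary, value: String): Boolean =
    chunk.dictionary == Set(value) && chunk.nullCount == 0
}
```

For the `/tmp/foo` data, modeled as `ChunkSummary(Set("A"), nullCount = 1)`, the first variant would skip the chunk and lose the null row, while the null-aware variant would keep it. In the `/tmp/bar` case no dictionary check applies, which is consistent with the null row being returned there.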