rdblue commented on a change in pull request #23622: [SPARK-26677][SQL] Disable
dictionary filtering by default at Parquet
URL: https://github.com/apache/spark/pull/23622#discussion_r251081026
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
##########
@@ -314,6 +314,17 @@ class ParquetFileFormat
SQLConf.CASE_SENSITIVE.key,
sparkSession.sessionState.conf.caseSensitiveAnalysis)
+ // There are two things to note here.
+ //
+ // 1. Dictionary filtering has an issue about the predication on null. For this reason,
+ // This filtering is disabled. See SPARK-26677.
+ //
+ // 2. We should disable 'parquet.filter.dictionary.enabled' but
+ // the 'parquet.filter.stats.enabled' and 'parquet.filter.dictionary.enabled' were
+ // swapped mistakenly in Parquet side. It should use 'parquet.filter.dictionary.enabled'
+ // when Spark upgrades Parquet. See PARQUET-1309.
+ hadoopConf.setIfUnset(ParquetInputFormat.STATS_FILTERING_ENABLED, "false")
Review comment:
I'll make sure the fix for this is in Parquet 1.10.1.
As for fixing this problem, I think that Spark should avoid pushing down
notEquals expressions or rewrite them to `isNull(col) or notEquals(col, "A")`.
That's going to be much better for performance than disabling dictionary
filtering.
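
A rough sketch of what the rewritten predicate could look like when built with
Parquet's FilterApi, assuming parquet-column is on the classpath (as it is in
Spark). The helper name `nullOrNotEquals` and the column name used in the usage
line are hypothetical and only for illustration; this is not the code in
ParquetFilters. Parquet expresses isNull(col) as eq(col, null).

```scala
import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}
import org.apache.parquet.io.api.Binary

object NotEqualsRewrite {
  // Hypothetical helper (not part of Spark or Parquet): builds
  // `isNull(col) or notEquals(col, value)` for a string column
  // instead of pushing down a bare notEquals.
  def nullOrNotEquals(column: String, value: String): FilterPredicate = {
    val col = FilterApi.binaryColumn(column)
    FilterApi.or(
      // isNull(col): Parquet represents this as eq(col, null).
      FilterApi.eq(col, null.asInstanceOf[Binary]),
      // notEquals(col, value)
      FilterApi.notEq(col, Binary.fromString(value)))
  }
}
```

For example, `NotEqualsRewrite.nullOrNotEquals("name", "A")` builds the
predicate for a hypothetical string column `name`. The intent is that the
explicit isNull branch keeps row groups that contain nulls from being dropped,
while dictionary filtering stays enabled for everything else.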