HyukjinKwon commented on a change in pull request #23622: [SPARK-26677][SQL]
Disable dictionary filtering by default at Parquet
URL: https://github.com/apache/spark/pull/23622#discussion_r251221877
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
##########
@@ -314,6 +314,17 @@ class ParquetFileFormat
SQLConf.CASE_SENSITIVE.key,
sparkSession.sessionState.conf.caseSensitiveAnalysis)
+ // There are two things to note here.
+ //
+ // 1. Dictionary filtering has an issue about the predication on null. For this reason,
+ // This filtering is disabled. See SPARK-26677.
+ //
+ // 2. We should disable 'parquet.filter.dictionary.enabled' but
+ // the 'parquet.filter.stats.enabled' and 'parquet.filter.dictionary.enabled' were
+ // swapped mistakenly in Parquet side. It should use 'parquet.filter.dictionary.enabled'
+ // when Spark upgrades Parquet. See PARQUET-1309.
+ hadoopConf.setIfUnset(ParquetInputFormat.STATS_FILTERING_ENABLED, "false")
Review comment:
Ah, so we're targeting the upgrade to Parquet 1.10.1? Yeah, that sounds okay to
me. That way, users can also disable
`parquet.filter.dictionary.enabled` explicitly, I guess.
BTW, is this something we should enable by default on the Parquet side, @rdblue? I
can see the potential performance improvement, but I was wondering how stable
dictionary filtering is.
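
To illustrate why `setIfUnset` still leaves room for an explicit user setting, here is a minimal sketch. It assumes only that Hadoop's `Configuration` class is on the classpath; the object name `SetIfUnsetSketch` and the helper `applyDefault` are hypothetical and not part of the PR, and the key strings are the ones quoted in the diff comment.

```scala
import org.apache.hadoop.conf.Configuration

// Hypothetical sketch, not the PR's code: shows that setIfUnset only
// applies a default when the user has not set the key explicitly.
object SetIfUnsetSketch {
  // Key names as quoted in the diff comment.
  val StatsKey = "parquet.filter.stats.enabled"
  val DictionaryKey = "parquet.filter.dictionary.enabled"

  // Mirrors the default applied in the diff: takes effect only if the key is unset.
  def applyDefault(conf: Configuration): Configuration = {
    conf.setIfUnset(StatsKey, "false")
    conf
  }

  def main(args: Array[String]): Unit = {
    // Case 1: no explicit setting, so the default "false" is applied.
    val untouched = applyDefault(new Configuration(false))
    println(untouched.get(StatsKey)) // prints "false"

    // Case 2: the user set the key first, so the default does not override it.
    val overridden = new Configuration(false)
    overridden.set(StatsKey, "true")
    applyDefault(overridden)
    println(overridden.get(StatsKey)) // prints "true"
  }
}
```

The same holds for `DictionaryKey`: since the default above never touches it, an explicit user value for `parquet.filter.dictionary.enabled` is passed through unchanged.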