rdblue commented on a change in pull request #23622: [SPARK-26677][SQL] Disable
dictionary filtering by default at Parquet
URL: https://github.com/apache/spark/pull/23622#discussion_r251248759
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
##########
@@ -314,6 +314,17 @@ class ParquetFileFormat
SQLConf.CASE_SENSITIVE.key,
sparkSession.sessionState.conf.caseSensitiveAnalysis)
+ // There are two things to note here.
+ //
+ // 1. Dictionary filtering has an issue about the predication on null. For this reason,
+ // This filtering is disabled. See SPARK-26677.
+ //
+ // 2. We should disable 'parquet.filter.dictionary.enabled' but
+ // the 'parquet.filter.stats.enabled' and 'parquet.filter.dictionary.enabled' were
+ // swapped mistakenly in Parquet side. It should use 'parquet.filter.dictionary.enabled'
+ // when Spark upgrades Parquet. See PARQUET-1309.
+ hadoopConf.setIfUnset(ParquetInputFormat.STATS_FILTERING_ENABLED, "false")
Review comment:
Dictionary filtering has been available for more than 2 years and Netflix
has been using it by default for almost 3 years without problems. The feature
is stable.
Keep in mind what a narrow use case hits this bug. First, an entire row
group in a file needs to have just one value and nulls: either the column has
just one non-null value for all rows, or a sort or write pattern has clustered
null rows with one other value (enough for an entire row group). Next, a query
must use null-safe equality *and* negate it. In my experience, most people
don't know what null-safe equality is. Going back to the use cases that produce
data like this, the first use case -- only one non-null value -- would probably
result in a filter like `col IS NOT NULL` instead. The second write pattern is the
problematic one, but how likely is the intersection of data with that write
pattern and searching for all values except one with null-safe equality? I
don't think it is likely and I think that's why we haven't found this until now.
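
To make the narrow case concrete, here is a minimal sketch in plain Scala. It is
a simplified model and not Parquet's actual dictionary filter: the object name,
the `Option`-based cells, and both `canDrop` helpers are hypothetical and exist
only for illustration. It assumes a row group holding one non-null value plus
nulls, and a query of the form `NOT (col <=> 'A')`.

```scala
// Simplified model of one column chunk in a row group.
// Illustrative only: none of these names come from Parquet or Spark.
object DictionaryFilterSketch {
  type Cell = Option[String] // None models SQL NULL

  // Null-safe equality (`<=>`): NULL <=> 'A' evaluates to false, not NULL.
  def nullSafeEq(cell: Cell, literal: String): Boolean = cell.contains(literal)

  // The query predicate NOT (col <=> literal). NULL rows satisfy it.
  def predicate(cell: Cell, literal: String): Boolean = !nullSafeEq(cell, literal)

  // A dictionary records only the distinct non-null values of the chunk.
  def dictionary(rowGroup: Seq[Cell]): Set[String] = rowGroup.flatten.toSet

  // Naive check: drop the row group when every dictionary entry equals the
  // literal. It never looks at NULLs, so it can drop rows that match.
  def naiveCanDrop(rowGroup: Seq[Cell], literal: String): Boolean =
    dictionary(rowGroup).forall(_ == literal)

  // Null-aware check: a row group that contains NULLs must be kept.
  def nullAwareCanDrop(rowGroup: Seq[Cell], literal: String): Boolean =
    naiveCanDrop(rowGroup, literal) && !rowGroup.contains(None)

  def main(args: Array[String]): Unit = {
    // One non-null value clustered with NULLs across the whole row group.
    val rowGroup: Seq[Cell] = Seq(Some("A"), None, Some("A"))
    println(s"rows matching NOT (col <=> 'A'): ${rowGroup.count(predicate(_, "A"))}")
    println(s"naive filter drops row group: ${naiveCanDrop(rowGroup, "A")}")
    println(s"null-aware filter drops row group: ${nullAwareCanDrop(rowGroup, "A")}")
  }
}
```

In this model the naive check drops a row group that still holds a matching
null row. It takes all three conditions at once: a single dictionary value,
nulls in the same row group, and a negated null-safe comparison against that
value.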