HyukjinKwon commented on a change in pull request #23622: [SPARK-26677][SQL] Disable dictionary filtering by default at Parquet
URL: https://github.com/apache/spark/pull/23622#discussion_r251221877
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ##########
 @@ -314,6 +314,17 @@ class ParquetFileFormat
       SQLConf.CASE_SENSITIVE.key,
       sparkSession.sessionState.conf.caseSensitiveAnalysis)
 
+    // There are two things to note here.
+    //
+    // 1. Dictionary filtering has an issue about the predication on null. For this reason,
+    //   This filtering is disabled. See SPARK-26677.
+    //
+    // 2. We should disable 'parquet.filter.dictionary.enabled' but
+    //   the 'parquet.filter.stats.enabled' and 'parquet.filter.dictionary.enabled' were
+    //   swapped mistakenly in Parquet side. It should use 'parquet.filter.dictionary.enabled'
+    //   when Spark upgrades Parquet. See PARQUET-1309.
+    hadoopConf.setIfUnset(ParquetInputFormat.STATS_FILTERING_ENABLED, "false")
 
 Review comment:
   Ah, so we're targeting the upgrade to Parquet 1.10.1? Yeah, sounds okay to me. That way, users can also disable `parquet.filter.dictionary.enabled` explicitly, I guess.
   
   BTW, is this something we should enable by default on the Parquet side, @rdblue? I see there can be a performance improvement, but I was wondering how stable dictionary filtering is.
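   
   To illustrate the point about explicit user settings, here is a minimal sketch of how `setIfUnset` behaves on a Hadoop `Configuration`. It assumes `hadoop-common` is on the classpath; the object and the `effectiveValue` helper are hypothetical names used only for illustration, and the key is the literal string behind `ParquetInputFormat.STATS_FILTERING_ENABLED` from the diff above.

```scala
import org.apache.hadoop.conf.Configuration

object SetIfUnsetSketch {
  // Key applied in the diff above (swapped with the dictionary key on the
  // Parquet side, per the code comment referencing PARQUET-1309).
  val Key = "parquet.filter.stats.enabled"

  // Hypothetical helper: returns the value that ends up in the configuration
  // after the "false" default is applied on top of an optional user setting.
  def effectiveValue(userValue: Option[String]): String = {
    // loadDefaults = false: start from an empty configuration, no XML resources.
    val hadoopConf = new Configuration(false)

    // An explicit user setting, if any, is written first.
    userValue.foreach(v => hadoopConf.set(Key, v))

    // setIfUnset only writes the value when the key has no value yet,
    // so an explicit user setting is preserved.
    hadoopConf.setIfUnset(Key, "false")
    hadoopConf.get(Key)
  }
}
```

   With no user setting, the "false" default takes effect; with an explicit user setting, that value is kept.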
