[GitHub] rdblue commented on a change in pull request #23622: [SPARK-26677][SQL] Disable dictionary filtering by default at Parquet

GitBox Fri, 25 Jan 2019 10:07:51 -0800

rdblue commented on a change in pull request #23622: [SPARK-26677][SQL] Disable dictionary filtering by default at Parquet
URL: https://github.com/apache/spark/pull/23622#discussion_r251081026


 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ##########
 @@ -314,6 +314,17 @@ class ParquetFileFormat
       SQLConf.CASE_SENSITIVE.key,
       sparkSession.sessionState.conf.caseSensitiveAnalysis)
 
+    // There are two things to note here.
+    //
+    // 1. Dictionary filtering has an issue about the predication on null. For this reason,
+    //   This filtering is disabled. See SPARK-26677.
+    //
+    // 2. We should disable 'parquet.filter.dictionary.enabled' but
+    //   the 'parquet.filter.stats.enabled' and 'parquet.filter.dictionary.enabled' were
+    //   swapped mistakenly in Parquet side. It should use 'parquet.filter.dictionary.enabled'
+    //   when Spark upgrades Parquet. See PARQUET-1309.
+    hadoopConf.setIfUnset(ParquetInputFormat.STATS_FILTERING_ENABLED, "false")
 
 Review comment:
   I'll make sure the fix for this is in Parquet 1.10.1.
   
   As for fixing this problem, I think that Spark should either avoid pushing down notEquals expressions or rewrite them to `isNull(col) or notEquals(col, "A")`. Either option is going to be much better for performance than disabling dictionary filtering.
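
   The rewrite to `isNull(col) or notEquals(col, "A")` could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Filter`, `NotEq`, `IsNull`, `Or`, and `And` types are a hypothetical, simplified AST, not Spark's data source filters or Parquet's `FilterPredicate` classes, and `rewrite` is a made-up helper name. Where such a rewrite would actually live in Spark's pushdown code is not shown here.

```scala
// Hypothetical, simplified filter AST for illustration only.
// These are NOT Spark or Parquet classes.
object NotEqRewrite {
  sealed trait Filter
  final case class NotEq(column: String, value: String) extends Filter
  final case class IsNull(column: String) extends Filter
  final case class Or(left: Filter, right: Filter) extends Filter
  final case class And(left: Filter, right: Filter) extends Filter

  // Replace every NotEq(c, v) with Or(IsNull(c), NotEq(c, v)),
  // recursing through And/Or so nested predicates are rewritten too.
  // Intended to be applied once, just before the filter is pushed down.
  def rewrite(filter: Filter): Filter = filter match {
    case NotEq(c, v) => Or(IsNull(c), NotEq(c, v))
    case Or(l, r)    => Or(rewrite(l), rewrite(r))
    case And(l, r)   => And(rewrite(l), rewrite(r))
    case other       => other // IsNull and anything else pass through unchanged
  }
}
```

   With this sketch, `NotEqRewrite.rewrite(NotEq("col", "A"))` yields `Or(IsNull("col"), NotEq("col", "A"))`, so the null case is stated explicitly in the pushed-down predicate instead of being left to the notEquals handling alone.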

