HyukjinKwon opened a new pull request #23622: [SPARK-26677][SQL] Disable dictionary filtering by default at Parquet
URL: https://github.com/apache/spark/pull/23622

## What changes were proposed in this pull request?

### Problem

This is a correctness issue and should be backported as well.

With dictionary encoding, the following query returns a wrong result:

```scala
// Repeat the values for dictionary encoding. PLAIN_DICTIONARY.
Seq(Some("A"), Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo")
spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show()
```

```
+-----+
|value|
+-----+
+-----+
```

Without dictionary encoding the result is correct, which made the issue difficult to find.

```scala
// It becomes PLAIN encoding.
Seq(Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar")
spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show()
```

```
+-----+
|value|
+-----+
| null|
+-----+
```

### How did it happen?

This happens because the dictionary filter on the Parquet side fails to handle `null`. The former case hits here:

```java
Set<T> dictSet = expandDictionary(meta);
if (dictSet != null && dictSet.size() == 1 && dictSet.contains(value)) {
  return BLOCK_CANNOT_MATCH;
}
```

https://github.com/apache/parquet-mr/blob/apache-parquet-1.10.0/parquet-hadoop/src/main/java/org/apache/parquet/filter2/dictionarylevel/DictionaryFilter.java#L182

Here the dictionary set does not contain `null`, and `value` is `'A'`. The dictionary therefore holds only `'A'`, and the row group is filtered out even though its `null` row satisfies the predicate.

The latter case above works fine because it hits here:

```java
// if the chunk has non-dictionary pages, don't bother decoding the
// dictionary because the row group can't be eliminated.
if (hasNonDictionaryPages(meta)) {
  return BLOCK_MIGHT_MATCH;
}
```

https://github.com/apache/parquet-mr/blob/apache-parquet-1.10.0/parquet-hadoop/src/main/java/org/apache/parquet/filter2/dictionarylevel/DictionaryFilter.java#L176

So, it does not filter the row group out.
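
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the parquet-mr implementation: the class and method names are hypothetical, and the only assumption carried over from the quoted code is that a dictionary holds non-null values only and that the row group is dropped when the dictionary is exactly the compared literal.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Simplified model of the quoted check. All names here are hypothetical.
final class DictionaryNullSketch {

    // Assumption: dictionary pages store only non-null values,
    // so a null row never shows up in the dictionary set.
    static Set<String> dictionaryOf(List<String> column) {
        Set<String> dict = new HashSet<>();
        for (String v : column) {
            if (v != null) {
                dict.add(v);
            }
        }
        return dict;
    }

    // Mirrors the quoted condition: drop the row group when the
    // dictionary contains exactly the compared literal.
    static boolean canDropAsQuoted(List<String> column, String literal) {
        Set<String> dict = dictionaryOf(column);
        return dict.size() == 1 && dict.contains(literal);
    }

    // Ground truth for NOT (value <=> literal): null-safe inequality per row.
    static boolean anyRowMatches(List<String> column, String literal) {
        for (String v : column) {
            if (!Objects.equals(v, literal)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> column = Arrays.asList("A", "A", null);
        // prints "dropped by dictionary check: true"
        System.out.println("dropped by dictionary check: " + canDropAsQuoted(column, "A"));
        // prints "a row actually matches: true" (the null row)
        System.out.println("a row actually matches: " + anyRowMatches(column, "A"));
    }
}
```

Both values being `true` at once is the bug: the row group is discarded although one of its rows satisfies the predicate.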
Parquet predicates handle `null` too (see also https://github.com/apache/parquet-mr/blob/apache-parquet-1.10.0/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java#L188). To my knowledge, Parquet predicates are null-safe.

### How does this PR fix?

This PR explicitly disables dictionary filtering. However, there is another problem: we should disable `parquet.filter.dictionary.enabled`, but Parquet 1.10.x has a mistake. The `parquet.filter.stats.enabled` and `parquet.filter.dictionary.enabled` keys were mistakenly swapped on the Parquet side:

```java
useDictionaryFilter(conf.getBoolean(STATS_FILTERING_ENABLED, true));
useStatsFilter(conf.getBoolean(DICTIONARY_FILTERING_ENABLED, true));
```

https://github.com/apache/parquet-mr/blob/apache-parquet-1.10.0/parquet-hadoop/src/main/java/org/apache/parquet/HadoopReadOptions.java#L83-L84

This is fixed after 1.11. See PARQUET-1309.

Therefore, this PR explicitly sets `parquet.filter.stats.enabled` to false, which disables dictionary filtering, unless the user has already set it.

### User side workaround

Users can explicitly disable `parquet.filter.stats.enabled` to disable dictionary filtering.

### ETC

- This is quite a conservative fix. This should be backported to Spark 2.4.
- I only tested null-safe equality comparison, but plain equality comparison appears to have related issues.

## How was this patch tested?

Unit tests were added.
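
To make the swapped keys under "How does this PR fix?" concrete, here is a minimal sketch that models the configuration lookup with a plain `Map`. It is not the parquet-mr `HadoopReadOptions` class: the class and helper names are hypothetical, and only the two property keys and the swapped wiring are taken from the snippet quoted above.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the swapped lookup. Names are hypothetical
// except for the two property keys.
final class SwappedFilterKeysSketch {
    static final String STATS_KEY = "parquet.filter.stats.enabled";
    static final String DICTIONARY_KEY = "parquet.filter.dictionary.enabled";

    // Stand-in for a boolean config lookup with a default value.
    static boolean getBoolean(Map<String, String> conf, String key, boolean defaultValue) {
        String raw = conf.get(key);
        return raw == null ? defaultValue : Boolean.parseBoolean(raw);
    }

    // Swapped as in the quoted 1.10.0 code: the dictionary filter reads the stats key.
    static boolean dictionaryFilterEnabled(Map<String, String> conf) {
        return getBoolean(conf, STATS_KEY, true);
    }

    // Swapped as in the quoted 1.10.0 code: the stats filter reads the dictionary key.
    static boolean statsFilterEnabled(Map<String, String> conf) {
        return getBoolean(conf, DICTIONARY_KEY, true);
    }

    public static void main(String[] args) {
        Map<String, String> intuitive = new HashMap<>();
        intuitive.put(DICTIONARY_KEY, "false");
        // prints "true": the dictionary filter is still on
        System.out.println(dictionaryFilterEnabled(intuitive));

        Map<String, String> workaround = new HashMap<>();
        workaround.put(STATS_KEY, "false");
        // prints "false": the dictionary filter is off
        System.out.println(dictionaryFilterEnabled(workaround));
    }
}
```

Under this model, setting the dictionary key has no effect on dictionary filtering, which is why the workaround targets `parquet.filter.stats.enabled` instead.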
