[GitHub] HyukjinKwon opened a new pull request #23622: [SPARK-26677][SQL] Disable dictionary filtering by default at Parquet

GitBox Tue, 22 Jan 2019 21:47:42 -0800

HyukjinKwon opened a new pull request #23622: [SPARK-26677][SQL] Disable 
dictionary filtering by default at Parquet
URL: https://github.com/apache/spark/pull/23622
 
 
   ## What changes were proposed in this pull request?
   
   
   ### Problem
   
   This is a correctness issue and should be backported as well. When the column is dictionary-encoded, as in the example below, the query returns a wrong (empty) result:
   
   ```scala
   // Repeat the values for dictionary encoding. PLAIN_DICTIONARY.
   Seq(Some("A"), Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo")
   spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show()
   ```
   ```
   +-----+
   |value|
   +-----+
   +-----+
   ```
   
   Note that the result is correct when dictionary encoding is not used, which made the issue difficult to find.
   
   ```scala
   // It becomes PLAIN encoding.
   Seq(Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar")
   spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show()
   ```
   ```
   +-----+
   |value|
   +-----+
   | null|
   +-----+
   ```
   
   ### How did it happen?
   
   This happens because the dictionary filter on the Parquet side fails to handle `null`. The former case hits this code:
   
   ```java
         Set<T> dictSet = expandDictionary(meta);
         if (dictSet != null && dictSet.size() == 1 && dictSet.contains(value)) {
           return BLOCK_CANNOT_MATCH;
         }
   ```
   
   
https://github.com/apache/parquet-mr/blob/apache-parquet-1.10.0/parquet-hadoop/src/main/java/org/apache/parquet/filter2/dictionarylevel/DictionaryFilter.java#L182
   
   Here the dictionary set does not contain `null` (it holds only `'A'`) and `value` is `'A'`, so the check passes and the row group is filtered out, even though it contains a `null` row that should match.
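
   The check quoted above can be modeled outside Parquet to see why the `null` row is lost. The following is only a minimal sketch: the class and method names are hypothetical and do not exist in parquet-mr, and the null-aware variant is shown for contrast, not as the fix this PR applies.

   ```java
   import java.util.Arrays;
   import java.util.HashSet;
   import java.util.Objects;
   import java.util.Set;

   // Hypothetical, simplified model of one string column in a row group.
   // None of these names come from parquet-mr.
   class DictionaryFilterSketch {

     // The dictionary holds only the distinct non-null values of the column.
     static Set<String> dictionaryOf(String... columnValues) {
       Set<String> dict = new HashSet<>();
       for (String v : columnValues) {
         if (v != null) {
           dict.add(v);
         }
       }
       return dict;
     }

     // Mirrors the quoted check: for a "not equal to value" predicate, drop the
     // row group when the dictionary holds exactly that one value.
     static boolean canDropIgnoringNulls(String value, String... columnValues) {
       Set<String> dict = dictionaryOf(columnValues);
       return dict.size() == 1 && dict.contains(value);
     }

     // Null-aware variant (illustration only): a null row still satisfies
     // NOT (column <=> value), so the row group is dropped only without nulls.
     static boolean canDropNullAware(String value, String... columnValues) {
       boolean hasNull = Arrays.stream(columnValues).anyMatch(Objects::isNull);
       return !hasNull && canDropIgnoringNulls(value, columnValues);
     }

     public static void main(String[] args) {
       String[] column = {"A", "A", null};
       System.out.println(canDropIgnoringNulls("A", column)); // true: the null row is lost
       System.out.println(canDropNullAware("A", column));     // false: the row group is kept
     }
   }
   ```

   With the column `"A", "A", null` from the first example, the first method reports that the row group can be dropped, which matches the empty result shown earlier.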
   
   The latter case above works fine because it hits here:
   
   ```java
       // if the chunk has non-dictionary pages, don't bother decoding the
       // dictionary because the row group can't be eliminated.
       if (hasNonDictionaryPages(meta)) {
         return BLOCK_MIGHT_MATCH;
       }
   ```
   
   
https://github.com/apache/parquet-mr/blob/apache-parquet-1.10.0/parquet-hadoop/src/main/java/org/apache/parquet/filter2/dictionarylevel/DictionaryFilter.java#L176
   
   So, it does not filter the row group out.
   
   Parquet predicates handle `null` too (see also https://github.com/apache/parquet-mr/blob/apache-parquet-1.10.0/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java#L188). To my knowledge, Parquet predicates are null-safe.
   
   
   ### How does this PR fix?
   
   This PR explicitly disables dictionary filtering. However, there is another problem:
   
   We should disable `parquet.filter.dictionary.enabled`, but Parquet 1.10.x has a bug: `parquet.filter.stats.enabled` and `parquet.filter.dictionary.enabled` were mistakenly swapped on the Parquet side:
   
   ```java
         useDictionaryFilter(conf.getBoolean(STATS_FILTERING_ENABLED, true));
         useStatsFilter(conf.getBoolean(DICTIONARY_FILTERING_ENABLED, true));
   ```
   
   
https://github.com/apache/parquet-mr/blob/apache-parquet-1.10.0/parquet-hadoop/src/main/java/org/apache/parquet/HadoopReadOptions.java#L83-L84
   
   This is fixed after 1.11. See PARQUET-1309.
   
   Therefore, when the user has not set `parquet.filter.stats.enabled`, this PR explicitly sets it to `false` in order to disable dictionary filtering.
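
   The effect of the swap can be sketched with a plain map standing in for the configuration. This is an illustrative model only: the class and helper names are hypothetical, and only the two key names and the swapped reads are taken from the Parquet lines quoted above.

   ```java
   import java.util.HashMap;
   import java.util.Map;

   // Hypothetical stand-in for the read options; not parquet-mr code.
   class SwappedFilterKeysSketch {
     static final String STATS_FILTERING_ENABLED = "parquet.filter.stats.enabled";
     static final String DICTIONARY_FILTERING_ENABLED = "parquet.filter.dictionary.enabled";

     // Returns the boolean stored under key, or defaultValue when it is unset.
     static boolean getBoolean(Map<String, String> conf, String key, boolean defaultValue) {
       String raw = conf.get(key);
       return raw == null ? defaultValue : Boolean.parseBoolean(raw);
     }

     // Same swap as in the quoted lines: the dictionary filter reads the stats key.
     static boolean useDictionaryFilter(Map<String, String> conf) {
       return getBoolean(conf, STATS_FILTERING_ENABLED, true);
     }

     // And the stats filter reads the dictionary key.
     static boolean useStatsFilter(Map<String, String> conf) {
       return getBoolean(conf, DICTIONARY_FILTERING_ENABLED, true);
     }

     public static void main(String[] args) {
       Map<String, String> conf = new HashMap<>();
       conf.put(STATS_FILTERING_ENABLED, "false");
       System.out.println(useDictionaryFilter(conf)); // false: dictionary filtering is off
       System.out.println(useStatsFilter(conf));      // true: stats filtering stays on
     }
   }
   ```

   Under this model, setting the stats key to `false` turns off dictionary filtering and leaves stats filtering at its default, which is why the PR sets that key.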
   
   
   ### User side workaround
   
   Users can explicitly set `parquet.filter.stats.enabled` to `false` to disable dictionary filtering.
   
   
   ### ETC
   
   - This is quite a conservative fix. It should be backported to Spark 2.4.
   - I only tested null-safe equality comparison, but plain equality comparison appears to have related issues.
   
   ## How was this patch tested?
   
   Unit tests were added.


