Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/22313#discussion_r214775155
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala ---
@@ -71,12 +71,24 @@ private[orc] object OrcFilters {
     for {
       // Combines all convertible filters using `And` to produce a single conjunction
-      conjunction <- convertibleFilters.reduceOption(org.apache.spark.sql.sources.And)
+      conjunction <- buildTree(convertibleFilters)
--- End diff ---
For the first question, I don't think Parquet has the same issue because
Parquet uses `canMakeFilterOn`, while ORC tries to build a full result (with
a fresh builder) to check whether a filter is convertible.
For the second question, in ORC we already do the first half (the `flatMap`)
to compute `convertibleFilters`, but we can change it to use `filters.filter`.
```scala
val convertibleFilters = for {
  filter <- filters
  _ <- buildSearchArgument(dataTypeMap, filter, SearchArgumentFactory.newBuilder())
} yield filter
```
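To illustrate the equivalence, here is a minimal, self-contained sketch: `tryConvert` is a hypothetical stand-in for `buildSearchArgument` that returns `Some` when a filter is convertible, and the `for`/`flatMap` form selects exactly the same filters as a plain `filters.filter`:

```scala
// Sketch only: `tryConvert` is a hypothetical placeholder for
// buildSearchArgument, returning Some(filter) when conversion succeeds.
object FilterSelection {
  def tryConvert(filter: String): Option[String] =
    if (filter.startsWith("ok")) Some(filter) else None

  // The for-comprehension form used in OrcFilters: flatMap over the
  // Option result, keeping the original filter on success.
  def viaFor(filters: Seq[String]): Seq[String] =
    for {
      f <- filters
      _ <- tryConvert(f)
    } yield f

  // The equivalent, more direct filter-based form.
  def viaFilter(filters: Seq[String]): Seq[String] =
    filters.filter(f => tryConvert(f).isDefined)

  def main(args: Array[String]): Unit = {
    val filters = Seq("ok1", "bad", "ok2")
    assert(viaFor(filters) == viaFilter(filters))
    println(viaFor(filters)) // both keep only the convertible filters
  }
}
```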
And the second half, `reduceOption(FilterApi.and)`, was the original ORC
code, which generated a skewed tree with exponential time complexity. We
need to use `buildTree` instead.
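The difference between the two reductions can be sketched with a simplified model (not Spark's actual `buildTree`; `Leaf` and `And` are hypothetical predicate nodes): splitting the sequence in half at each level yields a tree of logarithmic depth, while a left fold yields a chain of linear depth.

```scala
// Hypothetical predicate nodes standing in for ORC filter expressions.
sealed trait Pred
case class Leaf(name: String) extends Pred
case class And(left: Pred, right: Pred) extends Pred

object BuildTree {
  // Balanced reduction in the spirit of buildTree: split at the midpoint
  // and combine the halves, giving depth O(log n).
  def buildTree(preds: Seq[Pred]): Option[Pred] = preds match {
    case Seq()  => None
    case Seq(p) => Some(p)
    case _ =>
      val (l, r) = preds.splitAt(preds.length / 2)
      Some(And(buildTree(l).get, buildTree(r).get))
  }

  def depth(p: Pred): Int = p match {
    case Leaf(_)   => 1
    case And(l, r) => 1 + math.max(depth(l), depth(r))
  }

  def main(args: Array[String]): Unit = {
    val leaves = (1 to 1024).map(i => Leaf(s"p$i"))
    val balanced = buildTree(leaves).get
    // reduceLeft(And(_, _)) mirrors reduceOption: a left-skewed chain.
    val skewed = leaves.reduceLeft[Pred](And(_, _))
    assert(depth(balanced) == 11)   // log2(1024) + 1
    assert(depth(skewed) == 1024)   // one level per filter
    println(s"balanced=${depth(balanced)} skewed=${depth(skewed)}")
  }
}
```

The skewed chain is what made repeated re-building of the search argument blow up; the balanced tree keeps each rebuild cheap.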
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]