GitHub user dbtsai opened a pull request:

    https://github.com/apache/spark/pull/21442

    [SPARK-24402] [SQL] Optimize `In` expression when only one element in the collection or collection is empty

    ## What changes were proposed in this pull request?
    
    Two new rules are added to the logical plan optimizer.
    
    1. When there is only one element in the **`Collection`**, the
    **`In`** expression is rewritten to **`EqualTo`** in the physical
    plan, so predicate pushdown can be used.
    
    ```scala
        profileDF.filter($"profileID".isInCollection(Set(6))).explain(true)
        """
          |== Physical Plan ==
          |*(1) Project [profileID#0]
          |+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))
          |   +- *(1) FileScan parquet [profileID#0] Batched: true, Format: Parquet,
          |     PartitionFilters: [],
          |     PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],
          |     ReadSchema: struct<profileID:int>
        """.stripMargin
    ```
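    The single-element rewrite can be sketched outside Spark with a
    minimal expression ADT. The `Expr` types below are illustrative
    stand-ins for Catalyst's `Expression` hierarchy, not the actual API:

    ```scala
    // Minimal stand-ins for Catalyst expressions (illustrative only).
    sealed trait Expr
    case class Attr(name: String) extends Expr
    case class Lit(value: Any) extends Expr
    case class In(value: Expr, list: Seq[Expr]) extends Expr
    case class EqualTo(left: Expr, right: Expr) extends Expr

    // Rewrite `In(a, Seq(v))` to `EqualTo(a, v)` so the data-source
    // predicate-pushdown path, which understands EqualTo, can apply.
    def optimizeIn(e: Expr): Expr = e match {
      case In(v, Seq(single)) => EqualTo(v, single)
      case other              => other
    }

    println(optimizeIn(In(Attr("profileID"), Seq(Lit(6)))))
    // EqualTo(Attr(profileID),Lit(6))
    ```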
    
    2. When the **`Collection`** is empty and the input is nullable, the
    logical plan is simplified to:
    
    ```scala
        profileDF.filter($"profileID".isInCollection(Set())).explain(true)
        """
          |== Optimized Logical Plan ==
          |Filter if (isnull(profileID#0)) null else false
          |+- Relation[profileID#0] parquet
        """.stripMargin
    ```
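    The three-valued logic behind this rewrite can be sketched in plain
    Scala, with `None` standing in for SQL `NULL` (the helper below is
    illustrative, not Spark API):

    ```scala
    // SQL semantics of `x IN ()`: NULL IN () evaluates to NULL,
    // and any non-NULL value IN () evaluates to FALSE.
    def emptyIn(x: Option[Int]): Option[Boolean] =
      x.map(_ => false) // None => None (NULL), Some(v) => Some(false)

    // A Filter keeps a row only when the predicate is TRUE, so both
    // NULL and FALSE drop the row and no rows survive an empty IN.
    val rows = Seq(Some(6), None, Some(7))
    val kept = rows.filter(r => emptyIn(r).contains(true))
    println(kept) // List()
    ```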
    
    TODO:
    
    1. When the collection contains multiple elements but fewer than a
    certain threshold, we should still allow predicate pushdown.
    2. Optimize **`In`** using **`tableswitch`** or **`lookupswitch`**
    when the number of elements is small and they are **`Int`** or
    **`Long`**.
    3. The default immutable hash trie set is slow for lookups; we
    should benchmark different set implementations to find one with
    faster lookups.
    4. **`filter(if (condition) null else false)`** can be optimized to `false`.
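    The switch idea in TODO item 2 corresponds to how scalac compiles a
    `match` on `Int` literals: dense keys become a `tableswitch` and
    sparse keys a `lookupswitch`, replacing a chain of equality tests
    with a single jump. A hedged sketch of the shape such generated
    code could take (not the proposed codegen itself):

    ```scala
    // A match on small Int literals lowers to a tableswitch or
    // lookupswitch JVM instruction instead of repeated comparisons.
    def inSmallIntSet(profileID: Int): Boolean = profileID match {
      case 1 | 2 | 3 | 4 => true
      case _             => false
    }

    println(inSmallIntSet(3)) // true
    println(inSmallIntSet(9)) // false
    ```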
    
    ## How was this patch tested?
    
    A couple of new tests are added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dbtsai/spark optimize-in

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21442.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21442
    
----
commit 7c44c70fe664e73b36a49a974ece93a0c83d7f8e
Author: DB Tsai <d_tsai@...>
Date:   2018-05-28T07:27:09Z

    Optimize `In`

----


---
