[ 
https://issues.apache.org/jira/browse/SPARK-24371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-24371:
----------------------------
    Description: 
Implemented *{{isinSet}}* in DataFrame API for both Scala and Java, so users 
can do
{code}
 val profileDF = Seq(
 Some(1), Some(2), Some(3), Some(4),
 Some(5), Some(6), Some(7), None
 ).toDF("profileID")

val validUsers: Set[Any] = Set(6, 7.toShort, 8L, "3")

val result = profileDF.withColumn("isValid", $"profileID".isinSet(validUsers))

result.show(10)
 """
 +----------+------+
|profileID|isValid|

+----------+------+
|1|false|
|2|false|
|3|true|
|4|false|
|5|false|
|6|true|
|7|true|
|null|null|

+----------+------+
 """.stripMargin
{code}


  was:
Implemented *{{isinSet}}* in DataFrame API for both Scala and Java, so users 
can do
{code}
 val profileDF = Seq(
 Some(1), Some(2), Some(3), Some(4),
 Some(5), Some(6), Some(7), None
 ).toDF("profileID")

val validUsers: Set[Any] = Set(6, 7.toShort, 8L, "3")

val result = profileDF.withColumn("isValid", $"profileID".isinSet(validUsers))

result.show(10)
 """
 +----------+------+
|profileID|isValid|

+----------+------+
|1|false|
|2|false|
|3|true|
|4|false|
|5|false|
|6|true|
|7|true|
|null|null|

+----------+------+
 """.stripMargin
{code}
Two new rules in the logical plan optimizers are added.

# When there is only one element in the *{{Set}}*, the physical plan will be 
optimized to *{{EqualTo}}*, so predicate pushdown can be used.
{code}
 profileDF.filter( $"profileID".isinSet(Set(6))).explain(true)
 """
|== Physical Plan ==|
|*(1) Project [profileID#0|#0]|
|+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))|
|+- *(1) FileScan parquet [profileID#0|#0] Batched: true, Format: Parquet,|
|PartitionFilters: [],|
|PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],|
|ReadSchema: struct<profileID:int>
 """.stripMargin
{code}

# When the *{{Set}}* is empty, and the input is nullable, the logical plan will 
be simplified to
{code}
 profileDF.filter( $"profileID".isinSet(Set())).explain(true)
 """
|== Optimized Logical Plan ==|
|Filter if (isnull(profileID#0)) null else false|
|+- Relation[profileID#0|#0] parquet
 """.stripMargin
{code}
 

TODO:
 # For multiple conditions with numbers less than certain thresholds, we should 
still allow predicate pushdown.
 # Optimize the `In` using tableswitch or lookupswitch when the numbers of the 
categories are low, and they are `Int`, `Long`.
 # The default immutable hash trees set is slow for query, and we should do 
benchmark for using different set implementation for faster query.
 # `filter(if (condition) null else false)` can be optimized to false.


> Added isinSet in DataFrame API for Scala and Java.
> --------------------------------------------------
>
>                 Key: SPARK-24371
>                 URL: https://issues.apache.org/jira/browse/SPARK-24371
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: DB Tsai
>            Assignee: DB Tsai
>            Priority: Major
>             Fix For: 2.4.0
>
>
> Implemented *{{isinSet}}* in DataFrame API for both Scala and Java, so users 
> can do
> {code}
>  val profileDF = Seq(
>  Some(1), Some(2), Some(3), Some(4),
>  Some(5), Some(6), Some(7), None
>  ).toDF("profileID")
> val validUsers: Set[Any] = Set(6, 7.toShort, 8L, "3")
> val result = profileDF.withColumn("isValid", $"profileID".isinSet(validUsers))
> result.show(10)
>  """
>  +----------+------+
> |profileID|isValid|
> +----------+------+
> |1|false|
> |2|false|
> |3|true|
> |4|false|
> |5|false|
> |6|true|
> |7|true|
> |null|null|
> +----------+------+
>  """.stripMargin
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to