Nikolas Vanderhoof created SPARK-27359:
------------------------------------------

             Summary: Joins on some array functions can be optimized
                 Key: SPARK-27359
                 URL: https://issues.apache.org/jira/browse/SPARK-27359
             Project: Spark
          Issue Type: Improvement
          Components: Optimizer, SQL
    Affects Versions: 3.0.0
            Reporter: Nikolas Vanderhoof


I encounter these cases frequently, and implemented the optimization manually 
(as shown here). If others experience this as well, perhaps it would be good to 
add appropriate tree transformations into catalyst. I can create some rough 
draft implementations but expect I will need assistance when it comes to 
resolving the generating expressions in the logical plan.

h1. Case 1
A join like this:
{code:scala}
left.join(
  right,
  arrays_overlap(left("a"), right("b"))     // Creates a cartesian product in 
the logical plan
)
{code}

will produce the same results as:
{code:scala}
{
  val leftPrime = left.withColumn("exploded_a", explode(col("a")))
  val rightPrime = right.withColumn("exploded_b", explode(col("b")))

  leftPrime.join(
    rightPrime,
    leftPrime("exploded_a") === rightPrime("exploded_b")
      // Equijoin doesn't produce cartesian product
  ).drop("exploded_a", "exploded_b").distinct
}
{code}

h1. Case 2
A join like this:
{code:scala}
left.join(
  right,
  array_contains(left("arr"), right("value")) // Cartesian product in logical 
plan
)
{code}

will produce the same results as:
{code:scala}
{
  val leftPrime = left.withColumn("exploded_arr", explode(col("arr")))

  leftPrime.join(
    right,
    leftPrime("exploded_arr") === right("value") // Fast equijoin
  ).drop("exploded_arr").distinct
}
{code}

h1. Case 3
A join like this:
{code:scala}
left.join(
  right,
  array_contains(right("arr"), left("value")) // Cartesian product in logical 
plan
)
{code}

will produce the same results as:
{code:scala}
{
  val rightPrime = right.withColumn("exploded_arr", explode(col("arr")))

  left.join(
    rightPrime,
    left("value") === rightPrime("exploded_arr") // Fast equijoin
  ).drop("exploded_arr").distinct
}
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to