AngersZhuuuu commented on issue #26437: [SPARK-29800][SQL] Plan non-correlated 
Exists 's subquery in PlanSubqueries
URL: https://github.com/apache/spark/pull/26437#issuecomment-557636283
 
 
   cc @cloud-fan 
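   Looking only at the execution itself, the non-correlated EXISTS subquery is 
evaluated very quickly, and this change also removes one shuffle. In the runs 
below, `df.show(5)` takes 13820 ms with this PR versus 21240 ms on current 
master. I will try this in our environment with a real production case.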
   **With this PR:**
   ```
   scala> (1 to 10000).toDF("id").createOrReplaceTempView("s1")
   scala> (0 to 50000).toDF("id").createOrReplaceTempView("s2")
   scala> (0 to 1000000).map(_ * 2).toDF("id").createOrReplaceTempView("s3")
   scala>       val df = sql(
        |         """
        |           | SELECT s1.id  FROM s1
        |           | WHERE EXISTS (SELECT * from s3)
        |         """.stripMargin)
   df: org.apache.spark.sql.DataFrame = [id: int]
   scala>       var start = System.currentTimeMillis()
   start: Long = 1574445595283
   scala>       df.show(5)
   +---+
   | id|
   +---+
   |  1|
   |  2|
   |  3|
   |  4|
   |  5|
   +---+
   only showing top 5 rows
   scala>       var end = System.currentTimeMillis()
   end: Long = 1574445609103
   scala>       println(s"duration = ${end - start}")
   duration = 13820
   ```
   
   
![image](https://user-images.githubusercontent.com/46485123/69449609-46a9a580-0d96-11ea-9755-847e4b75c99c.png)
   
![image](https://user-images.githubusercontent.com/46485123/69449578-32fe3f00-0d96-11ea-8126-1e06d0353851.png)
   
   **Without this PR (current master):**
   ```
   scala> (1 to 10000).toDF("id").createOrReplaceTempView("s1")
   scala> (0 to 50000).toDF("id").createOrReplaceTempView("s2")
   scala> (0 to 1000000).map(_ * 2).toDF("id").createOrReplaceTempView("s3")
   scala>       val df = sql(
        |         """
        |           | SELECT s1.id  FROM s1
        |           | WHERE EXISTS (SELECT * from s3)
        |         """.stripMargin)
   df: org.apache.spark.sql.DataFrame = [id: int]
   scala>       var start = System.currentTimeMillis()
   start: Long = 1574445708886
   scala>       df.show(5)
   +---+
   | id|
   +---+
   |  1|
   |  2|
   |  3|
   |  4|
   |  5|
   +---+
   only showing top 5 rows
   scala>       var end = System.currentTimeMillis()
   end: Long = 1574445730126
   scala>       println(s"duration = ${end - start}")
   duration = 21240
   ```
   
   
![image](https://user-images.githubusercontent.com/46485123/69449638-4f9a7700-0d96-11ea-96ce-61a3bab87a0e.png)
   
![image](https://user-images.githubusercontent.com/46485123/69449559-2a0d6d80-0d96-11ea-9b61-9c0e30310d71.png)
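
   Both transcripts time the query by hand with `System.currentTimeMillis()`. A minimal sketch of a reusable helper for the same measurement is below. The names `Timing` and `timed` are hypothetical and not part of Spark, and it uses `System.nanoTime()` instead, which is better suited to measuring elapsed time.

   ```scala
   // Hypothetical helper, not part of Spark: runs a block once and reports
   // how long it took in wall-clock milliseconds.
   object Timing {
     /** Evaluates `body` exactly once and returns (result, elapsed milliseconds). */
     def timed[A](body: => A): (A, Long) = {
       val start = System.nanoTime()
       val result = body                                   // the block runs here
       val elapsedMs = (System.nanoTime() - start) / 1000000L
       (result, elapsedMs)
     }
   }
   ```

   In a spark-shell session like the ones above, this could be used as `val (_, ms) = Timing.timed(df.show(5))` followed by `println(s"duration = $ms")`. A single run still includes planning and JVM warm-up, so repeating the measurement a few times would give a fairer comparison.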
   
