Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/22518

@cloud-fan this is the benchmark:

```
(1 to 1000000).toSeq.toDF("a").write.save("/tmp/t1")
spark.read.load("/tmp/t1").createTempView("t1")
(1 to 2000).toSeq.toDF("b").write.save("/tmp/t2")
spark.read.load("/tmp/t2").createTempView("t2")

val plan = sql("select * from t2 where b > (select avg(a + 1) from t1)")

val t0 = System.nanoTime()
plan.show()
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")
```

The result is:

```
Before PR: Elapsed time: 862499689ns
After PR:  Elapsed time: 914728641ns
```

The difference is very small because all the subqueries run in parallel. The execution time would be affected much more if there were several subqueries: our thread pool has 16 threads, so a query like this one but with 9 filters each containing a subquery would see a big performance gain after this PR.
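To make the last point concrete, here is a hedged sketch of how the multi-subquery variant of the benchmark could be built. It only constructs the SQL string with the 9 filters mentioned above; `numSubqueries`, `filters`, and `query` are names introduced here for illustration, and the snippet assumes the same `t1`/`t2` temp views registered in the benchmark:

```scala
// Hypothetical variant of the benchmark above: one filter per scalar
// subquery, so that many subqueries compete for the shared 16-thread pool.
val numSubqueries = 9
// Build 9 filters of the form "b > (select avg(a + i) from t1)",
// joined with AND, mirroring the single-filter query in the benchmark.
val filters = (1 to numSubqueries)
  .map(i => s"b > (select avg(a + $i) from t1)")
  .mkString(" and ")
val query = s"select * from t2 where $filters"
// sql(query).show() would then be timed with the same
// System.nanoTime() bracketing used in the benchmark above.
```

With only 16 threads available, this shape of query is where avoiding redundant subquery work should show up most clearly in the elapsed time.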