Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/22518

@cloud-fan this is the benchmark:

```
(1 to 1000000).toSeq.toDF("a").write.save("/tmp/t1")
spark.read.load("/tmp/t1").createTempView("t1")
(1 to 2000).toSeq.toDF("b").write.save("/tmp/t2")
spark.read.load("/tmp/t2").createTempView("t2")

val plan = sql("select * from t2 where b > (select avg(a + 1) from t1)")

val t0 = System.nanoTime()
plan.show()
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")
```

The result is:

```
Before PR: Elapsed time: 862499689ns
After PR:  Elapsed time: 914728641ns
```

The difference is very small because all the subqueries run in parallel. The execution time would be affected much more if there were several subqueries: our thread pool has 16 threads, so a query like this one but with 9 filters each containing a subquery would see a big performance gain after this PR.
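To make the last point concrete, here is a hedged sketch of how the multi-subquery variant of the benchmark could be built. It only constructs the SQL string with the 9 filters mentioned above; `numSubqueries`, `filters`, and `query` are names introduced here for illustration, and the snippet assumes the same `t1`/`t2` temp views registered in the benchmark:

```scala
// Hypothetical variant of the benchmark above: one filter per scalar
// subquery, so that many subqueries compete for the shared 16-thread pool.
val numSubqueries = 9
// Build 9 filters of the form "b > (select avg(a + i) from t1)",
// joined with AND, mirroring the single-filter query in the benchmark.
val filters = (1 to numSubqueries)
  .map(i => s"b > (select avg(a + $i) from t1)")
  .mkString(" and ")
val query = s"select * from t2 where $filters"
// sql(query).show() would then be timed with the same
// System.nanoTime() bracketing used in the benchmark above.
```

With only 16 threads available, this shape of query is where avoiding redundant subquery work should show up most clearly in the elapsed time.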