Repository: spark
Updated Branches:
  refs/heads/master bbd038d24 -> 2c6f4d61b


[SPARK-25610][SQL][TEST] Improve execution time of DatasetCacheSuite: cache UDF result correctly

## What changes were proposed in this pull request?
In this test case, we verify that the result of a UDF is cached when the
underlying DataFrame is cached, and that the UDF is not evaluated again when
the cached DataFrame is reused.

To reduce the runtime we:
1) Use a single-partition DataFrame, so the total execution time of the UDF is
more deterministic.
2) Cut the size of the DataFrame down from 10 rows to 2.
3) Reduce the sleep time in the UDF from 5 seconds to 2 seconds.
4) Reduce the failAfter limit from 3 seconds to 2 seconds.
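The changes above can be seen together in a condensed sketch of the updated
test body (taken from the diff in this commit; it assumes a SparkSession
`spark`, the Spark SQL `udf`/`sum` functions, and ScalaTest's TimeLimits and
SpanSugar in scope, as the suite already provides):

```scala
// The UDF now sleeps 2 seconds per row instead of 5.
val expensiveUDF = udf({x: Int => Thread.sleep(2000); x})

// 2 rows in a single partition keep the UDF's total cost small and deterministic.
val df = spark.range(0, 2).toDF("a").repartition(1).withColumn("b", expensiveUDF($"a"))
val df2 = df.agg(sum(df("b")))

df.cache()
df.count()  // materializes the cache, evaluating the UDF once per row

// If the UDF were re-evaluated here, df2.collect() would sleep again and
// exceed the 2-second limit; reading from the cache finishes well within it.
failAfter(2 seconds) {
  df2.collect()
}
```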

With the above changes, caching the first DataFrame takes about 4 seconds, and
the subsequent check takes a few hundred milliseconds.
The runtimes for 5 consecutive runs of this test are as follows:
```
[info] - cache UDF result correctly (4 seconds, 906 milliseconds)
[info] - cache UDF result correctly (4 seconds, 281 milliseconds)
[info] - cache UDF result correctly (4 seconds, 288 milliseconds)
[info] - cache UDF result correctly (4 seconds, 355 milliseconds)
[info] - cache UDF result correctly (4 seconds, 280 milliseconds)
```
## How was this patch tested?
This is a test-only fix.

Closes #22638 from dilipbiswal/SPARK-25610.

Authored-by: Dilip Biswal <dbis...@us.ibm.com>
Signed-off-by: gatorsmile <gatorsm...@gmail.com>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2c6f4d61
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2c6f4d61
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2c6f4d61

Branch: refs/heads/master
Commit: 2c6f4d61bbf7f0267a7309b4a236047f830bd6ee
Parents: bbd038d
Author: Dilip Biswal <dbis...@us.ibm.com>
Authored: Fri Oct 5 17:25:28 2018 -0700
Committer: gatorsmile <gatorsm...@gmail.com>
Committed: Fri Oct 5 17:25:28 2018 -0700

----------------------------------------------------------------------
 .../test/scala/org/apache/spark/sql/DatasetCacheSuite.scala    | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/2c6f4d61/sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala
index 5c6a021..fef6ddd 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala
@@ -127,8 +127,8 @@ class DatasetCacheSuite extends QueryTest with SharedSQLContext with TimeLimits
   }
 
   test("cache UDF result correctly") {
-    val expensiveUDF = udf({x: Int => Thread.sleep(5000); x})
-    val df = spark.range(0, 10).toDF("a").withColumn("b", expensiveUDF($"a"))
+    val expensiveUDF = udf({x: Int => Thread.sleep(2000); x})
+    val df = spark.range(0, 2).toDF("a").repartition(1).withColumn("b", expensiveUDF($"a"))
     val df2 = df.agg(sum(df("b")))
 
     df.cache()
@@ -136,7 +136,7 @@ class DatasetCacheSuite extends QueryTest with SharedSQLContext with TimeLimits
     assertCached(df2)
 
     // udf has been evaluated during caching, and thus should not be re-evaluated here
-    failAfter(3 seconds) {
+    failAfter(2 seconds) {
       df2.collect()
     }
 


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
