spark git commit: [SPARK-25862][SQL] Remove rangeBetween APIs introduced in SPARK-21608

2018-10-30 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/master 243ce319a -> 9cf9a83af


[SPARK-25862][SQL] Remove rangeBetween APIs introduced in SPARK-21608

## What changes were proposed in this pull request?
This patch removes the rangeBetween functions introduced in SPARK-21608. As 
explained in SPARK-25841, these functions are confusing and don't quite work. 
We will redesign them and introduce better ones in SPARK-25843.
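
For context, the Long-based `rangeBetween` overload is unaffected by this removal. A minimal sketch of the API that remains, assuming an input DataFrame `df` with a "category" column and a numeric "price" column:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Frame boundaries expressed as Longs; this overload stays after SPARK-25862.
val w = Window
  .partitionBy("category")
  .orderBy("price")
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)

// Running sum over the range frame; `df` is an assumed input DataFrame.
val withRunningSum = df.withColumn("running_sum", sum("price").over(w))
```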

## How was this patch tested?
Removed relevant test cases as well. These test cases will need to be added 
back in SPARK-25843.

Closes #22870 from rxin/SPARK-25862.

Lead-authored-by: Reynold Xin 
Co-authored-by: hyukjinkwon 
Signed-off-by: gatorsmile 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9cf9a83a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9cf9a83a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9cf9a83a

Branch: refs/heads/master
Commit: 9cf9a83afafb88668c95ca704a1f65a91b5e591c
Parents: 243ce31
Author: Reynold Xin 
Authored: Tue Oct 30 21:27:17 2018 -0700
Committer: gatorsmile 
Committed: Tue Oct 30 21:27:17 2018 -0700

--
 .../expressions/windowExpressions.scala |  2 +-
 .../apache/spark/sql/expressions/Window.scala   |  9 ---
 .../spark/sql/expressions/WindowSpec.scala  | 12 
 .../scala/org/apache/spark/sql/functions.scala  | 26 
 .../resources/sql-tests/results/window.sql.out  |  2 +-
 .../spark/sql/DataFrameWindowFramesSuite.scala  | 68 +---
 6 files changed, 3 insertions(+), 116 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9cf9a83a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
index 7de6ddd..0b674d0 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
@@ -206,7 +206,7 @@ case class SpecifiedWindowFrame(
 // Check combination (of expressions).
 (lower, upper) match {
   case (l: Expression, u: Expression) if !isValidFrameBoundary(l, u) =>
-TypeCheckFailure(s"Window frame upper bound '$upper' does not followes 
the lower bound " +
+TypeCheckFailure(s"Window frame upper bound '$upper' does not follow 
the lower bound " +
   s"'$lower'.")
   case (l: SpecialFrameBoundary, _) => TypeCheckSuccess
   case (_, u: SpecialFrameBoundary) => TypeCheckSuccess

http://git-wip-us.apache.org/repos/asf/spark/blob/9cf9a83a/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala
index 14dec8f..d50031b 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala
@@ -214,15 +214,6 @@ object Window {
 spec.rangeBetween(start, end)
   }
 
-  /**
-   * This function has been deprecated in Spark 2.4. See SPARK-25842 for more 
information.
-   * @since 2.3.0
-   */
-  @deprecated("Use the version with Long parameter types", "2.4.0")
-  def rangeBetween(start: Column, end: Column): WindowSpec = {
-spec.rangeBetween(start, end)
-  }
-
   private[sql] def spec: WindowSpec = {
 new WindowSpec(Seq.empty, Seq.empty, UnspecifiedFrame)
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/9cf9a83a/sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala
index 0cc43a5..b7f3000 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala
@@ -210,18 +210,6 @@ class WindowSpec private[sql](
   }
 
   /**
-   * This function has been deprecated in Spark 2.4. See SPARK-25842 for more 
information.
-   * @since 2.3.0
-   */
-  @deprecated("Use the version with Long parameter types", "2.4.0")
-  def rangeBetween(start: Column, end: Column): WindowSpec = {
-new WindowSpec(
-  partitionSpec,
-  orderSpec,
-  SpecifiedWindowFrame(RangeFrame, start.expr, end.expr))
-  }
-
-  /**
* Converts this 

svn commit: r30541 - in /dev/spark/3.0.0-SNAPSHOT-2018_10_30_20_02-243ce31-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-10-30 Thread pwendell
Author: pwendell
Date: Wed Oct 31 03:17:00 2018
New Revision: 30541

Log:
Apache Spark 3.0.0-SNAPSHOT-2018_10_30_20_02-243ce31 docs


[This commit notification would consist of 1472 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]




spark git commit: [SPARKR] found some extra whitespace in the R tests

2018-10-30 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master f6ff6329e -> 243ce319a


[SPARKR] found some extra whitespace in the R tests

## What changes were proposed in this pull request?

During my Ubuntu-port testing, I found some extra whitespace that, for some 
reason, wasn't getting caught by the CentOS lint-r build step.

## How was this patch tested?

The build system will test this! I used one of my Ubuntu testing builds and 
scp'ed over the modified file.

before my fix:
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-ubuntu-testing/22/console

after my fix:
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-ubuntu-testing/23/console

Closes #22896 from shaneknapp/remove-extra-whitespace.

Authored-by: shane knapp 
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/243ce319
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/243ce319
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/243ce319

Branch: refs/heads/master
Commit: 243ce319a06f20365d5b08d479642d75748645d9
Parents: f6ff632
Author: shane knapp 
Authored: Wed Oct 31 10:32:26 2018 +0800
Committer: hyukjinkwon 
Committed: Wed Oct 31 10:32:26 2018 +0800

--
 R/pkg/tests/fulltests/test_sparkSQL_eager.R | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/243ce319/R/pkg/tests/fulltests/test_sparkSQL_eager.R
--
diff --git a/R/pkg/tests/fulltests/test_sparkSQL_eager.R 
b/R/pkg/tests/fulltests/test_sparkSQL_eager.R
index df7354f..9b4489a 100644
--- a/R/pkg/tests/fulltests/test_sparkSQL_eager.R
+++ b/R/pkg/tests/fulltests/test_sparkSQL_eager.R
@@ -22,12 +22,12 @@ context("test show SparkDataFrame when eager execution is 
enabled.")
 test_that("eager execution is not enabled", {
   # Start Spark session without eager execution enabled
   sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE)
-  
+
   df <- createDataFrame(faithful)
   expect_is(df, "SparkDataFrame")
   expected <- "eruptions:double, waiting:double"
   expect_output(show(df), expected)
-  
+
   # Stop Spark session
   sparkR.session.stop()
 })
@@ -35,9 +35,9 @@ test_that("eager execution is not enabled", {
 test_that("eager execution is enabled", {
   # Start Spark session with eager execution enabled
   sparkConfig <- list(spark.sql.repl.eagerEval.enabled = "true")
-  
+
   sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
sparkConfig = sparkConfig)
-  
+
   df <- createDataFrame(faithful)
   expect_is(df, "SparkDataFrame")
   expected <- paste0("(+-+---+\n",
@@ -45,7 +45,7 @@ test_that("eager execution is enabled", {
  "+-+---+\n)*",
  "(only showing top 20 rows)")
   expect_output(show(df), expected)
-  
+
   # Stop Spark session
   sparkR.session.stop()
 })
@@ -55,9 +55,9 @@ test_that("eager execution is enabled with maxNumRows and 
truncate set", {
   sparkConfig <- list(spark.sql.repl.eagerEval.enabled = "true",
   spark.sql.repl.eagerEval.maxNumRows = as.integer(5),
   spark.sql.repl.eagerEval.truncate = as.integer(2))
-  
+
   sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
sparkConfig = sparkConfig)
-  
+
   df <- arrange(createDataFrame(faithful), "waiting")
   expect_is(df, "SparkDataFrame")
   expected <- paste0("(+-+---+\n",
@@ -66,7 +66,7 @@ test_that("eager execution is enabled with maxNumRows and 
truncate set", {
  "|   1.| 43|\n)*",
  "(only showing top 5 rows)")
   expect_output(show(df), expected)
-  
+
   # Stop Spark session
   sparkR.session.stop()
 })





spark git commit: [SPARK-25847][SQL][TEST] Refactor JSONBenchmarks to use main method

2018-10-30 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 891032da6 -> f6ff6329e


[SPARK-25847][SQL][TEST] Refactor JSONBenchmarks to use main method

## What changes were proposed in this pull request?

Refactor JSONBenchmark to use a main method.

Run it with spark-submit:
`bin/spark-submit --class 
org.apache.spark.sql.execution.datasources.json.JSONBenchmark --jars 
./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar,./sql/catalyst/target/spark-catalyst_2.11-3.0.0-SNAPSHOT-tests.jar
 ./sql/core/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar`

Generate the benchmark results file:
`SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
org.apache.spark.sql.execution.datasources.json.JSONBenchmark"`
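
This is not the actual `JSONBenchmark` from this PR, just a minimal sketch of the main-method pattern being adopted; all names and sizes below are illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch of a benchmark driven by a plain main method.
object JsonParsingBenchmarkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("json-benchmark-sketch")
      .getOrCreate()
    import spark.implicits._

    // Generate a synthetic JSON dataset in memory.
    val jsonLines = spark.range(1000000).map(i => s"""{"id": $i, "value": "v$i"}""")

    // Time a full parse + count; a real benchmark would average several runs.
    val start = System.nanoTime()
    val rows = spark.read.json(jsonLines).count()
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(f"Parsed $rows rows in $elapsedMs%.1f ms")

    spark.stop()
  }
}
```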

## How was this patch tested?

manual tests

Closes #22844 from heary-cao/JSONBenchmarks.

Lead-authored-by: caoxuewen 
Co-authored-by: heary 
Co-authored-by: Dongjoon Hyun 
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f6ff6329
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f6ff6329
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f6ff6329

Branch: refs/heads/master
Commit: f6ff6329eee720e19a56b90c0ffda9da5cecca5b
Parents: 891032d
Author: caoxuewen 
Authored: Wed Oct 31 10:28:17 2018 +0800
Committer: hyukjinkwon 
Committed: Wed Oct 31 10:28:17 2018 +0800

--
 sql/core/benchmarks/JSONBenchmark-results.txt   |  37 
 .../datasources/json/JsonBenchmark.scala| 183 
 .../datasources/json/JsonBenchmarks.scala   | 217 ---
 3 files changed, 220 insertions(+), 217 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f6ff6329/sql/core/benchmarks/JSONBenchmark-results.txt
--
diff --git a/sql/core/benchmarks/JSONBenchmark-results.txt 
b/sql/core/benchmarks/JSONBenchmark-results.txt
new file mode 100644
index 000..9993730
--- /dev/null
+++ b/sql/core/benchmarks/JSONBenchmark-results.txt
@@ -0,0 +1,37 @@
+
+Benchmark for performance of JSON parsing
+
+
+Preparing data for benchmarking ...
+OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64
+Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+JSON schema inferring:   Best/Avg Time(ms)Rate(M/s)   Per 
Row(ns)   Relative
+
+No encoding 62946 / 63310  1.6 
629.5   1.0X
+UTF-8 is set  112814 / 112866  0.9
1128.1   0.6X
+
+Preparing data for benchmarking ...
+OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64
+Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+JSON per-line parsing:   Best/Avg Time(ms)Rate(M/s)   Per 
Row(ns)   Relative
+
+No encoding 16468 / 16553  6.1 
164.7   1.0X
+UTF-8 is set16420 / 16441  6.1 
164.2   1.0X
+
+Preparing data for benchmarking ...
+OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64
+Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+JSON parsing of wide lines:  Best/Avg Time(ms)Rate(M/s)   Per 
Row(ns)   Relative
+
+No encoding 39789 / 40053  0.3
3978.9   1.0X
+UTF-8 is set39505 / 39584  0.3
3950.5   1.0X
+
+OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64
+Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Count a dataset with 10 columns: Best/Avg Time(ms)Rate(M/s)   Per 
Row(ns)   Relative
+
+Select 10 columns + count() 15997 / 16015  0.6
1599.7   1.0X
+Select 1 column + count()   13280 / 13326  0.8
1328.0   1.2X
+count()   3006 / 3021  3.3 
300.6   5.3X
+
+

http://git-wip-us.apache.org/repos/asf/spark/blob/f6ff6329/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala
--
diff --git 

spark git commit: [SPARK-25691][SQL] Use semantic equality in AliasViewChild in order to compare attributes

2018-10-30 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master f6cc354d8 -> 891032da6


[SPARK-25691][SQL] Use semantic equality in AliasViewChild in order to compare 
attributes

## What changes were proposed in this pull request?

When we compare attributes, we should in general use semantic equality: the 
default `equals` method can return `false` when two attributes differ only in 
"cosmetic" ways (for example, name or qualifier) yet refer to the same thing, 
and the analyzer/optimizer has to treat them as the same.

The PR focuses on how `AliasViewChild` uses and compares the `output` of a 
`LogicalPlan`, which is a `Seq[Attribute]`. Using plain equality there misses 
semantic equality, so the rule keeps firing on plans that are already correct 
and the operator fails to stabilize.
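
An illustrative snippet of the difference (Catalyst internals, shown only as a sketch):

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.IntegerType

// Two references to the same underlying attribute (same ExprId) that differ only cosmetically.
val a = AttributeReference("a", IntegerType)() // gets a fresh ExprId
val b = a.withName("A")                        // same ExprId, different display name

a == b               // false: default equality also compares cosmetic fields such as the name
a.semanticEquals(b)  // true: the canonicalized forms match, so they are the same attribute
```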

## How was this patch tested?

Running the tests with the patch provided by maryannxue in 
https://github.com/apache/spark/pull/22060

Closes #22713 from mgaido91/SPARK-25691.

Authored-by: Marco Gaido 
Signed-off-by: Wenchen Fan 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/891032da
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/891032da
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/891032da

Branch: refs/heads/master
Commit: 891032da6f5b3c6a690e2ae44396873aa6a6b91d
Parents: f6cc354
Author: Marco Gaido 
Authored: Wed Oct 31 09:18:53 2018 +0800
Committer: Wenchen Fan 
Committed: Wed Oct 31 09:18:53 2018 +0800

--
 .../spark/sql/catalyst/analysis/view.scala  |  8 +++
 .../sql/catalyst/optimizer/Optimizer.scala  |  5 +---
 .../catalyst/plans/logical/LogicalPlan.scala| 14 +++
 .../sql/catalyst/analysis/AnalysisSuite.scala   | 25 +++-
 4 files changed, 43 insertions(+), 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/891032da/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala
index af74693..6134d54 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala
@@ -49,7 +49,7 @@ import org.apache.spark.sql.internal.SQLConf
  */
 case class AliasViewChild(conf: SQLConf) extends Rule[LogicalPlan] with 
CastSupport {
   override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperatorsUp 
{
-case v @ View(desc, output, child) if child.resolved && output != 
child.output =>
+case v @ View(desc, output, child) if child.resolved && 
!v.sameOutput(child) =>
   val resolver = conf.resolver
   val queryColumnNames = desc.viewQueryColumnNames
   val queryOutput = if (queryColumnNames.nonEmpty) {
@@ -70,7 +70,7 @@ case class AliasViewChild(conf: SQLConf) extends 
Rule[LogicalPlan] with CastSupp
   }
   // Map the attributes in the query output to the attributes in the view 
output by index.
   val newOutput = output.zip(queryOutput).map {
-case (attr, originAttr) if attr != originAttr =>
+case (attr, originAttr) if !attr.semanticEquals(originAttr) =>
   // The dataType of the output attributes may be not the same with 
that of the view
   // output, so we should cast the attribute to the dataType of the 
view output attribute.
   // Will throw an AnalysisException if the cast can't perform or 
might truncate.
@@ -112,8 +112,8 @@ object EliminateView extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
 // The child should have the same output attributes with the View 
operator, so we simply
 // remove the View operator.
-case View(_, output, child) =>
-  assert(output == child.output,
+case v @ View(_, output, child) =>
+  assert(v.sameOutput(child),
 s"The output of the child ${child.output.mkString("[", ",", "]")} is 
different from the " +
   s"view output ${output.mkString("[", ",", "]")}")
   child

http://git-wip-us.apache.org/repos/asf/spark/blob/891032da/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
index da8009d..95455ff 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
@@ -530,9 +530,6 @@ 

svn commit: r30540 - in /dev/spark/3.0.0-SNAPSHOT-2018_10_30_16_06-f6cc354-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-10-30 Thread pwendell
Author: pwendell
Date: Tue Oct 30 23:20:56 2018
New Revision: 30540

Log:
Apache Spark 3.0.0-SNAPSHOT-2018_10_30_16_06-f6cc354 docs


[This commit notification would consist of 1472 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]




svn commit: r30538 - in /dev/spark/3.0.0-SNAPSHOT-2018_10_30_12_06-c36537f-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-10-30 Thread pwendell
Author: pwendell
Date: Tue Oct 30 19:21:20 2018
New Revision: 30538

Log:
Apache Spark 3.0.0-SNAPSHOT-2018_10_30_12_06-c36537f docs


[This commit notification would consist of 1472 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]




spark git commit: [SPARK-25773][CORE] Cancel zombie tasks in a result stage when the job finishes

2018-10-30 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 94de5609b -> c36537fcf


[SPARK-25773][CORE] Cancel zombie tasks in a result stage when the job finishes

## What changes were proposed in this pull request?

When a job finishes, there may be some zombie tasks still running due to stage 
retry. Since a result stage will never be used by other jobs, running these 
tasks just wastes cluster resources. This PR asks the TaskScheduler to cancel 
the running tasks of a result stage once the stage has finished. Credits go to 
srinathshankar, who suggested this idea to me.

This PR also fixes two minor issues noticed while touching DAGScheduler:
- An invalid spark.job.interruptOnCancel value should not crash DAGScheduler (a 
sketch of how that property gets set follows this list).
- Non-fatal errors should not crash DAGScheduler.
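
A short sketch of how the interrupt flag is set: users opt in per job group via the public `SparkContext.setJobGroup` API, which populates the `spark.job.interruptOnCancel` job property that DAGScheduler now parses defensively. Names and sizes here are illustrative:

```scala
import org.apache.spark.SparkContext

// Illustrative helper: run a job whose tasks may be interrupted when the job is cancelled/killed.
def runCancellableJob(sc: SparkContext): Double = {
  // interruptOnCancel = true asks executors to Thread.interrupt() running tasks on cancellation.
  sc.setJobGroup("reporting", "nightly report", interruptOnCancel = true)
  try {
    sc.parallelize(1 to 1000000, numSlices = 100).map(_ * 2.0).sum()
  } finally {
    sc.clearJobGroup()
  }
}
```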

## How was this patch tested?

The new unit tests.

Closes #22771 from zsxwing/SPARK-25773.

Lead-authored-by: Shixiong Zhu 
Co-authored-by: Shixiong Zhu 
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c36537fc
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c36537fc
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c36537fc

Branch: refs/heads/master
Commit: c36537fcfddc1eae1581b1b84d9d4384c5985c26
Parents: 94de560
Author: Shixiong Zhu 
Authored: Tue Oct 30 10:48:04 2018 -0700
Committer: Shixiong Zhu 
Committed: Tue Oct 30 10:48:04 2018 -0700

--
 .../apache/spark/scheduler/DAGScheduler.scala   | 48 +++---
 .../org/apache/spark/SparkContextSuite.scala| 53 +++-
 .../spark/scheduler/DAGSchedulerSuite.scala | 51 +--
 3 files changed, 129 insertions(+), 23 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c36537fc/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
--
diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
index 34b1160..06966e7 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
@@ -1296,6 +1296,27 @@ private[spark] class DAGScheduler(
   }
 
   /**
+   * Check [[SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL]] in job properties to 
see if we should
+   * interrupt running tasks. Returns `false` if the property value is not a 
boolean value
+   */
+  private def shouldInterruptTaskThread(job: ActiveJob): Boolean = {
+if (job.properties == null) {
+  false
+} else {
+  val shouldInterruptThread =
+job.properties.getProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, 
"false")
+  try {
+shouldInterruptThread.toBoolean
+  } catch {
+case e: IllegalArgumentException =>
+  logWarning(s"${SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL} in Job 
${job.jobId} " +
+s"is invalid: $shouldInterruptThread. Using 'false' instead", e)
+  false
+  }
+}
+  }
+
+  /**
* Responds to a task finishing. This is called inside the event loop so it 
assumes that it can
* modify the scheduler's internal state. Use taskEnded() to post a task end 
event from outside.
*/
@@ -1364,6 +1385,21 @@ private[spark] class DAGScheduler(
   if (job.numFinished == job.numPartitions) {
 markStageAsFinished(resultStage)
 cleanupStateForJobAndIndependentStages(job)
+try {
+  // killAllTaskAttempts will fail if a SchedulerBackend 
does not implement
+  // killTask.
+  logInfo(s"Job ${job.jobId} is finished. Cancelling 
potential speculative " +
+"or zombie tasks for this job")
+  // ResultStage is only used by this job. It's safe to 
kill speculative or
+  // zombie tasks in this stage.
+  taskScheduler.killAllTaskAttempts(
+stageId,
+shouldInterruptTaskThread(job),
+reason = "Stage finished")
+} catch {
+  case e: UnsupportedOperationException =>
+logWarning(s"Could not cancel tasks for stage 
$stageId", e)
+}
 listenerBus.post(
   SparkListenerJobEnd(job.jobId, clock.getTimeMillis(), 
JobSucceeded))
   }
@@ -1373,7 +1409,7 @@ private[spark] class DAGScheduler(
   try {
 job.listener.taskSucceeded(rt.outputId, event.result)
   } catch {
-case e: Exception =>
+case e: Throwable if 

spark git commit: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks to use main method

2018-10-30 Thread dongjoon
Repository: spark
Updated Branches:
  refs/heads/master a129f0795 -> 94de5609b


[SPARK-25848][SQL][TEST] Refactor CSVBenchmarks to use main method

## What changes were proposed in this pull request?

Run it with spark-submit:
`bin/spark-submit --class 
org.apache.spark.sql.execution.datasources.csv.CSVBenchmark --jars 
./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar,./sql/catalyst/target/spark-catalyst_2.11-3.0.0-SNAPSHOT-tests.jar
 ./sql/core/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar`

Generate the benchmark results file:
`SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
org.apache.spark.sql.execution.datasources.csv.CSVBenchmark"`

## How was this patch tested?

manual tests

Closes #22845 from heary-cao/CSVBenchmarks.

Authored-by: caoxuewen 
Signed-off-by: Dongjoon Hyun 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/94de5609
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/94de5609
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/94de5609

Branch: refs/heads/master
Commit: 94de5609be27e2618d6d241ec9aa032fbc601b6e
Parents: a129f07
Author: caoxuewen 
Authored: Tue Oct 30 09:18:55 2018 -0700
Committer: Dongjoon Hyun 
Committed: Tue Oct 30 09:18:55 2018 -0700

--
 sql/core/benchmarks/CSVBenchmark-results.txt|  27 
 .../datasources/csv/CSVBenchmark.scala  | 136 
 .../datasources/csv/CSVBenchmarks.scala | 158 ---
 3 files changed, 163 insertions(+), 158 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/94de5609/sql/core/benchmarks/CSVBenchmark-results.txt
--
diff --git a/sql/core/benchmarks/CSVBenchmark-results.txt 
b/sql/core/benchmarks/CSVBenchmark-results.txt
new file mode 100644
index 000..865575b
--- /dev/null
+++ b/sql/core/benchmarks/CSVBenchmark-results.txt
@@ -0,0 +1,27 @@
+
+Benchmark to measure CSV read/write performance
+
+
+OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64
+Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Parsing quoted values:   Best/Avg Time(ms)Rate(M/s)   Per 
Row(ns)   Relative
+
+One quoted string   64733 / 64839  0.0 
1294653.1   1.0X
+
+OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64
+Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Wide rows with 1000 columns: Best/Avg Time(ms)Rate(M/s)   Per 
Row(ns)   Relative
+
+Select 1000 columns   185609 / 189735  0.0  
185608.6   1.0X
+Select 100 columns  50195 / 51808  0.0   
50194.8   3.7X
+Select one column   39266 / 39293  0.0   
39265.6   4.7X
+count() 10959 / 11000  0.1   
10958.5  16.9X
+
+OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64
+Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Count a dataset with 10 columns: Best/Avg Time(ms)Rate(M/s)   Per 
Row(ns)   Relative
+
+Select 10 columns + count() 24637 / 24768  0.4
2463.7   1.0X
+Select 1 column + count()   20026 / 20076  0.5
2002.6   1.2X
+count()   3754 / 3877  2.7 
375.4   6.6X
+

http://git-wip-us.apache.org/repos/asf/spark/blob/94de5609/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVBenchmark.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVBenchmark.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVBenchmark.scala
new file mode 100644
index 000..ce38b08
--- /dev/null
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVBenchmark.scala
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use 

spark git commit: [SPARK-23429][CORE][FOLLOWUP] MetricGetter should rename to ExecutorMetricType in comments

2018-10-30 Thread irashid
Repository: spark
Updated Branches:
  refs/heads/master ce40efa20 -> a129f0795


[SPARK-23429][CORE][FOLLOWUP] MetricGetter should rename to ExecutorMetricType 
in comments

## What changes were proposed in this pull request?

MetricGetter should be renamed to ExecutorMetricType in the comments.

## How was this patch tested?

Just comments, no need to test.

Closes #22884 from LantaoJin/SPARK-23429_FOLLOWUP.

Authored-by: LantaoJin 
Signed-off-by: Imran Rashid 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a129f079
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a129f079
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a129f079

Branch: refs/heads/master
Commit: a129f079557204e3694754a5f9184c7f178cdf2a
Parents: ce40efa
Author: LantaoJin 
Authored: Tue Oct 30 11:01:55 2018 -0500
Committer: Imran Rashid 
Committed: Tue Oct 30 11:01:55 2018 -0500

--
 core/src/main/scala/org/apache/spark/Heartbeater.scala | 2 +-
 .../src/main/scala/org/apache/spark/executor/ExecutorMetrics.scala | 2 +-
 core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala  | 2 +-
 core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a129f079/core/src/main/scala/org/apache/spark/Heartbeater.scala
--
diff --git a/core/src/main/scala/org/apache/spark/Heartbeater.scala 
b/core/src/main/scala/org/apache/spark/Heartbeater.scala
index 5ba1b9b..84091ee 100644
--- a/core/src/main/scala/org/apache/spark/Heartbeater.scala
+++ b/core/src/main/scala/org/apache/spark/Heartbeater.scala
@@ -61,7 +61,7 @@ private[spark] class Heartbeater(
 
   /**
* Get the current executor level metrics. These are returned as an array, 
with the index
-   * determined by MetricGetter.values
+   * determined by ExecutorMetricType.values
*/
   def getCurrentMetrics(): ExecutorMetrics = {
 val metrics = 
ExecutorMetricType.values.map(_.getMetricValue(memoryManager)).toArray

http://git-wip-us.apache.org/repos/asf/spark/blob/a129f079/core/src/main/scala/org/apache/spark/executor/ExecutorMetrics.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/executor/ExecutorMetrics.scala 
b/core/src/main/scala/org/apache/spark/executor/ExecutorMetrics.scala
index 2933f3b..1befd27 100644
--- a/core/src/main/scala/org/apache/spark/executor/ExecutorMetrics.scala
+++ b/core/src/main/scala/org/apache/spark/executor/ExecutorMetrics.scala
@@ -28,7 +28,7 @@ import org.apache.spark.metrics.ExecutorMetricType
 @DeveloperApi
 class ExecutorMetrics private[spark] extends Serializable {
 
-  // Metrics are indexed by MetricGetter.values
+  // Metrics are indexed by ExecutorMetricType.values
   private val metrics = new Array[Long](ExecutorMetricType.values.length)
 
   // the first element is initialized to -1, indicating that the values for 
the array

http://git-wip-us.apache.org/repos/asf/spark/blob/a129f079/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
--
diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
index f93d8a8..34b1160 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
@@ -265,7 +265,7 @@ private[spark] class DAGScheduler(
   // (taskId, stageId, stageAttemptId, accumUpdates)
   accumUpdates: Array[(Long, Int, Int, Seq[AccumulableInfo])],
   blockManagerId: BlockManagerId,
-  // executor metrics indexed by MetricGetter.values
+  // executor metrics indexed by ExecutorMetricType.values
   executorUpdates: ExecutorMetrics): Boolean = {
 listenerBus.post(SparkListenerExecutorMetricsUpdate(execId, accumUpdates,
   Some(executorUpdates)))

http://git-wip-us.apache.org/repos/asf/spark/blob/a129f079/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala
--
diff --git a/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala 
b/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala
index 293e836..e92b8a2 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala
@@ -175,7 +175,7 @@ case class SparkListenerExecutorMetricsUpdate(
  * @param execId executor id
  * @param stageId stage id
  * @param stageAttemptId stage attempt
- * @param executorMetrics executor level metrics, 

svn commit: r30533 - in /dev/spark/3.0.0-SNAPSHOT-2018_10_30_08_04-ce40efa-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-10-30 Thread pwendell
Author: pwendell
Date: Tue Oct 30 15:19:32 2018
New Revision: 30533

Log:
Apache Spark 3.0.0-SNAPSHOT-2018_10_30_08_04-ce40efa docs


[This commit notification would consist of 1472 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]




spark git commit: [SPARK-25790][MLLIB] PCA: Support more than 65535 column matrix

2018-10-30 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 327456b48 -> ce40efa20


[SPARK-25790][MLLIB] PCA: Support more than 65535 column matrix

## What changes were proposed in this pull request?
Spark PCA currently supports matrices with at most ~65,535 columns. This is 
because it computes the covariance matrix first and then the principal 
components, and computing the **covariance matrix** is the main bottleneck. The 
limit comes from the integer size limit: we pass an array of size n*(n+1)/2 to 
the Breeze library, and that size cannot exceed Int.MaxValue, so the maximum 
number of columns is about 65,535.

Currently we don't have such a limitation when computing an SVD in Spark, so we 
can use Spark's SVD to compute the PCA when the number of columns exceeds the 
limit.

PCA can be computed directly from the SVD of the (mean-centered) matrix, 
instead of forming the covariance matrix; a short sketch of this relationship 
follows the references below.
The following papers/links are given for reference:

https://arxiv.org/pdf/1404.1100.pdf
https://en.wikipedia.org/wiki/Principal_component_analysis#Singular_value_decomposition
http://www.ifis.uni-luebeck.de/~moeller/Lectures/WS-16-17/Web-Mining-Agents/PCA-SVD.pdf
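
A short sketch of the relationship (standard linear algebra, not quoted from the patch), for a mean-centered row matrix X with m rows and n columns:

```latex
% Why the covariance path is capped: the upper triangle of the n x n covariance matrix is packed
% into a flat array of length n(n+1)/2, which must fit in a Java Int:
%   n(n+1)/2 <= 2^31 - 1  =>  n <= 65535   (65535 * 65536 / 2 = 2,147,450,880).
%
% Why the SVD route avoids it: for a mean-centered row matrix X with m rows,
\[
  X = U \Sigma V^{\top}
  \quad\Longrightarrow\quad
  \frac{1}{m-1}\, X^{\top} X \;=\; V \,\frac{\Sigma^{2}}{m-1}\, V^{\top},
\]
% so the principal components are the columns of V and the variance explained by the i-th
% component is \sigma_i^2 / (m-1), without ever materializing the n x n covariance matrix.
```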

## How was this patch tested?
Added a unit test; also manually verified against the existing PCA test by 
removing the limit condition in the fit method.

Closes #22784 from shahidki31/PCA.

Authored-by: Shahid 
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ce40efa2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ce40efa2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ce40efa2

Branch: refs/heads/master
Commit: ce40efa200e6cb6f10a289e2ab00f711b0ebd379
Parents: 327456b
Author: Shahid 
Authored: Tue Oct 30 08:39:30 2018 -0500
Committer: Sean Owen 
Committed: Tue Oct 30 08:39:30 2018 -0500

--
 .../org/apache/spark/mllib/feature/PCA.scala| 20 +---
 .../mllib/linalg/distributed/RowMatrix.scala| 33 +---
 .../apache/spark/mllib/feature/PCASuite.scala   | 23 +-
 3 files changed, 58 insertions(+), 18 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ce40efa2/mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala
index a01503f..2fc517c 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala
@@ -21,6 +21,7 @@ import org.apache.spark.annotation.Since
 import org.apache.spark.api.java.JavaRDD
 import org.apache.spark.mllib.linalg._
 import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.mllib.stat.Statistics
 import org.apache.spark.rdd.RDD
 
 /**
@@ -44,12 +45,21 @@ class PCA @Since("1.4.0") (@Since("1.4.0") val k: Int) {
 require(k <= numFeatures,
   s"source vector size $numFeatures must be no less than k=$k")
 
-require(PCAUtil.memoryCost(k, numFeatures) < Int.MaxValue,
-  "The param k and numFeatures is too large for SVD computation. " +
-  "Try reducing the parameter k for PCA, or reduce the input feature " +
-  "vector dimension to make this tractable.")
+val mat = if (numFeatures > 65535) {
+  val meanVector = Statistics.colStats(sources).mean.asBreeze
+  val meanCentredRdd = sources.map { rowVector =>
+Vectors.fromBreeze(rowVector.asBreeze - meanVector)
+  }
+  new RowMatrix(meanCentredRdd)
+} else {
+  require(PCAUtil.memoryCost(k, numFeatures) < Int.MaxValue,
+"The param k and numFeatures is too large for SVD computation. " +
+  "Try reducing the parameter k for PCA, or reduce the input feature " 
+
+  "vector dimension to make this tractable.")
+
+  new RowMatrix(sources)
+}
 
-val mat = new RowMatrix(sources)
 val (pc, explainedVariance) = 
mat.computePrincipalComponentsAndExplainedVariance(k)
 val densePC = pc match {
   case dm: DenseMatrix =>

http://git-wip-us.apache.org/repos/asf/spark/blob/ce40efa2/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
 
b/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
index 78a8810..82ab716 100644
--- 
a/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
@@ -370,32 +370,41 @@ class RowMatrix @Since("1.0.0") (
* Each column corresponds for 

spark git commit: [BUILD][MINOR] release script should not interrupt by svn

2018-10-30 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master eab39f79e -> 327456b48


[BUILD][MINOR] release script should not interrupt by svn

## What changes were proposed in this pull request?

When running the release script, you will be interrupted unexpectedly:
```
ATTENTION!  Your password for authentication realm:

    ASF Committers

can only be stored to disk unencrypted!  You are advised to configure
your system so that Subversion can store passwords encrypted, if
possible.  See the documentation for details.

You can avoid future appearances of this warning by setting the value
of the 'store-plaintext-passwords' option to either 'yes' or 'no' in
'/home/spark-rm/.subversion/servers'.
---
Store password unencrypted (yes/no)?
```

We can avoid this by adding `--no-auth-cache` when running the svn commands.

## How was this patch tested?

manually verified with 2.4.0 RC5

Closes #22885 from cloud-fan/svn.

Authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/327456b4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/327456b4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/327456b4

Branch: refs/heads/master
Commit: 327456b482dec38a19bdc65061b3c2271f86819a
Parents: eab39f7
Author: Wenchen Fan 
Authored: Tue Oct 30 21:17:40 2018 +0800
Committer: Wenchen Fan 
Committed: Tue Oct 30 21:17:40 2018 +0800

--
 dev/create-release/release-build.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/327456b4/dev/create-release/release-build.sh
--
diff --git a/dev/create-release/release-build.sh 
b/dev/create-release/release-build.sh
index 26e0868..2fdb5c8 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -326,7 +326,7 @@ if [[ "$1" == "package" ]]; then
 svn add "svn-spark/${DEST_DIR_NAME}-bin"
 
 cd svn-spark
-svn ci --username $ASF_USERNAME --password "$ASF_PASSWORD" -m"Apache Spark 
$SPARK_PACKAGE_VERSION"
+svn ci --username $ASF_USERNAME --password "$ASF_PASSWORD" -m"Apache Spark 
$SPARK_PACKAGE_VERSION" --no-auth-cache
 cd ..
 rm -rf svn-spark
   fi
@@ -354,7 +354,7 @@ if [[ "$1" == "docs" ]]; then
 svn add "svn-spark/${DEST_DIR_NAME}-docs"
 
 cd svn-spark
-svn ci --username $ASF_USERNAME --password "$ASF_PASSWORD" -m"Apache Spark 
$SPARK_PACKAGE_VERSION docs"
+svn ci --username $ASF_USERNAME --password "$ASF_PASSWORD" -m"Apache Spark 
$SPARK_PACKAGE_VERSION docs" --no-auth-cache
 cd ..
 rm -rf svn-spark
   fi





spark git commit: [SPARK-25755][SQL][TEST] Supplementation of non-CodeGen unit tested for BroadcastHashJoinExec

2018-10-30 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 5bd5e1b9c -> eab39f79e


[SPARK-25755][SQL][TEST] Supplementation of non-CodeGen unit tested for 
BroadcastHashJoinExec

## What changes were proposed in this pull request?

Currently, the BroadcastHashJoinExec physical plan supports both codegen and 
non-codegen execution, but only the codegen path is exercised by the unit tests 
in InnerJoinSuite, OuterJoinSuite, and ExistenceJoinSuite; the non-codegen path 
is not tested. This PR supplements that part of the tests; a sketch of the 
on/off helper pattern follows below.
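
A hedged sketch of what such an on/off test helper might look like (the real helper added by this PR lives in `SQLTestUtils`; the class and test names below are illustrative):

```scala
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.functions.broadcast
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.SharedSQLContext

class CodegenToggleExampleSuite extends QueryTest with SharedSQLContext {

  // Run the same test body twice: once with whole-stage codegen disabled, once enabled.
  protected def testWithWholeStageCodegenOnAndOff(name: String)(f: String => Unit): Unit = {
    Seq("false", "true").foreach { enabled =>
      test(s"$name (whole-stage-codegen enabled: $enabled)") {
        withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> enabled) {
          f(enabled)
        }
      }
    }
  }

  testWithWholeStageCodegenOnAndOff("broadcast hash join returns the same result") { _ =>
    // A broadcast hash join small enough to check exact results on both code paths.
    val joined = spark.range(10).join(broadcast(spark.range(5)), "id")
    checkAnswer(joined, (0 until 5).map(i => Row(i.toLong)))
  }
}
```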

## How was this patch tested?

Added new unit tests.

Closes #22755 from heary-cao/AddTestToBroadcastHashJoinExec.

Authored-by: caoxuewen 
Signed-off-by: Wenchen Fan 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eab39f79
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eab39f79
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eab39f79

Branch: refs/heads/master
Commit: eab39f79e4c2fb51266ff5844114ee56b8ec2d91
Parents: 5bd5e1b
Author: caoxuewen 
Authored: Tue Oct 30 20:13:18 2018 +0800
Committer: Wenchen Fan 
Committed: Tue Oct 30 20:13:18 2018 +0800

--
 .../spark/sql/DataFrameAggregateSuite.scala | 30 
 .../spark/sql/DataFrameFunctionsSuite.scala | 15 ++--
 .../apache/spark/sql/DataFrameRangeSuite.scala  | 76 +---
 .../columnar/InMemoryColumnarQuerySuite.scala   | 39 +-
 .../execution/joins/ExistenceJoinSuite.scala|  2 +-
 .../sql/execution/joins/InnerJoinSuite.scala|  6 +-
 .../sql/execution/joins/OuterJoinSuite.scala|  2 +-
 .../apache/spark/sql/test/SQLTestUtils.scala| 15 
 8 files changed, 90 insertions(+), 95 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/eab39f79/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
index d0106c4..d9ba6e2 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
@@ -669,23 +669,19 @@ class DataFrameAggregateSuite extends QueryTest with 
SharedSQLContext {
 }
   }
 
-  Seq(true, false).foreach { codegen =>
-test("SPARK-22951: dropDuplicates on empty dataFrames should produce 
correct aggregate " +
-  s"results when codegen is enabled: $codegen") {
-  withSQLConf((SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, codegen.toString)) {
-// explicit global aggregations
-val emptyAgg = Map.empty[String, String]
-checkAnswer(spark.emptyDataFrame.agg(emptyAgg), Seq(Row()))
-checkAnswer(spark.emptyDataFrame.groupBy().agg(emptyAgg), Seq(Row()))
-checkAnswer(spark.emptyDataFrame.groupBy().agg(count("*")), 
Seq(Row(0)))
-checkAnswer(spark.emptyDataFrame.dropDuplicates().agg(emptyAgg), 
Seq(Row()))
-
checkAnswer(spark.emptyDataFrame.dropDuplicates().groupBy().agg(emptyAgg), 
Seq(Row()))
-
checkAnswer(spark.emptyDataFrame.dropDuplicates().groupBy().agg(count("*")), 
Seq(Row(0)))
-
-// global aggregation is converted to grouping aggregation:
-assert(spark.emptyDataFrame.dropDuplicates().count() == 0)
-  }
-}
+  testWithWholeStageCodegenOnAndOff("SPARK-22951: dropDuplicates on empty 
dataFrames " +
+"should produce correct aggregate") { _ =>
+// explicit global aggregations
+val emptyAgg = Map.empty[String, String]
+checkAnswer(spark.emptyDataFrame.agg(emptyAgg), Seq(Row()))
+checkAnswer(spark.emptyDataFrame.groupBy().agg(emptyAgg), Seq(Row()))
+checkAnswer(spark.emptyDataFrame.groupBy().agg(count("*")), Seq(Row(0)))
+checkAnswer(spark.emptyDataFrame.dropDuplicates().agg(emptyAgg), 
Seq(Row()))
+checkAnswer(spark.emptyDataFrame.dropDuplicates().groupBy().agg(emptyAgg), 
Seq(Row()))
+
checkAnswer(spark.emptyDataFrame.dropDuplicates().groupBy().agg(count("*")), 
Seq(Row(0)))
+
+// global aggregation is converted to grouping aggregation:
+assert(spark.emptyDataFrame.dropDuplicates().count() == 0)
   }
 
   test("SPARK-21896: Window functions inside aggregate functions") {

http://git-wip-us.apache.org/repos/asf/spark/blob/eab39f79/sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala
index 60ebc5e..666ba35 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala
+++