spark git commit: [SPARK-15792][SQL] Allows operator to change the verbosity in explain output
Repository: spark Updated Branches: refs/heads/branch-2.0 a5bec5b81 -> 57dd4efcd [SPARK-15792][SQL] Allows operator to change the verbosity in explain output ## What changes were proposed in this pull request? This PR allows customization of verbosity in explain output. After change, `dataframe.explain()` and `dataframe.explain(true)` has different verbosity output for physical plan. Currently, this PR only enables verbosity string for operator `HashAggregateExec` and `SortAggregateExec`. We will gradually enable verbosity string for more operators in future. **Less verbose mode:** dataframe.explain(extended = false) `output=[count(a)#85L]` is **NOT** displayed for HashAggregate. ``` scala> Seq((1,2,3)).toDF("a", "b", "c").createTempView("df2") scala> spark.sql("select count(a) from df2").explain() == Physical Plan == *HashAggregate(key=[], functions=[count(1)]) +- Exchange SinglePartition +- *HashAggregate(key=[], functions=[partial_count(1)]) +- LocalTableScan ``` **Verbose mode:** dataframe.explain(extended = true) `output=[count(a)#85L]` is displayed for HashAggregate. ``` scala> spark.sql("select count(a) from df2").explain(true) // "output=[count(a)#85L]" is added ... == Physical Plan == *HashAggregate(key=[], functions=[count(1)], output=[count(a)#85L]) +- Exchange SinglePartition +- *HashAggregate(key=[], functions=[partial_count(1)], output=[count#87L]) +- LocalTableScan ``` ## How was this patch tested? Manual test. Author: Sean ZhongCloses #13535 from clockfly/verbose_breakdown_2. (cherry picked from commit 5f731d6859c4516941e5f90c99c966ef76268864) Signed-off-by: Cheng Lian Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/57dd4efc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/57dd4efc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/57dd4efc Branch: refs/heads/branch-2.0 Commit: 57dd4efcda9158646df41ea8d70754dc110ecd6f Parents: a5bec5b Author: Sean Zhong Authored: Mon Jun 6 22:59:25 2016 -0700 Committer: Cheng Lian Committed: Mon Jun 6 22:59:34 2016 -0700 -- .../sql/catalyst/expressions/Expression.scala | 4 .../spark/sql/catalyst/plans/QueryPlan.scala| 2 ++ .../spark/sql/catalyst/trees/TreeNode.scala | 23 +++- .../spark/sql/execution/QueryExecution.scala| 14 +++- .../sql/execution/WholeStageCodegenExec.scala | 6 +++-- .../execution/aggregate/HashAggregateExec.scala | 12 -- .../execution/aggregate/SortAggregateExec.scala | 12 -- 7 files changed, 55 insertions(+), 18 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/57dd4efc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala index 2ec4621..efe592d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala @@ -190,6 +190,10 @@ abstract class Expression extends TreeNode[Expression] { case single => single :: Nil } + // Marks this as final, Expression.verboseString should never be called, and thus shouldn't be + // overridden by concrete classes. 
+ final override def verboseString: String = simpleString + override def simpleString: String = toString override def toString: String = prettyName + flatArguments.mkString("(", ", ", ")") http://git-wip-us.apache.org/repos/asf/spark/blob/57dd4efc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala index 19a66cf..cf34f4b 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala @@ -257,6 +257,8 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanT override def simpleString: String = statePrefix + super.simpleString + override def verboseString: String = simpleString + /** * All the subqueries of current plan. */ http://git-wip-us.apache.org/repos/asf/spark/blob/57dd4efc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
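The mechanism behind the two output modes above is a second rendering method on plan nodes: `verboseString` defaults to `simpleString`, and individual operators override it to append extra detail. A simplified, standalone Scala sketch of that pattern (illustrative only; the class names below are not Spark's actual `TreeNode`/`QueryPlan` API):

```scala
// Standalone sketch of the simpleString/verboseString split -- illustrative names,
// not Spark's actual plan-node classes.
abstract class PlanNodeSketch {
  def simpleString: String                   // rendered by dataframe.explain()
  def verboseString: String = simpleString   // rendered by dataframe.explain(true); same text by default
}

class HashAggregateSketch(functions: Seq[String], output: Seq[String]) extends PlanNodeSketch {
  override def simpleString: String =
    s"HashAggregate(functions=[${functions.mkString(", ")}])"
  // An operator opts in to extra verbosity by overriding verboseString.
  override def verboseString: String =
    s"HashAggregate(functions=[${functions.mkString(", ")}], output=[${output.mkString(", ")}])"
}
```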
spark git commit: [SPARK-15792][SQL] Allows operator to change the verbosity in explain output
Repository: spark Updated Branches: refs/heads/master 0e0904a2f -> 5f731d685 [SPARK-15792][SQL] Allows operator to change the verbosity in explain output ## What changes were proposed in this pull request? This PR allows customization of verbosity in explain output. After change, `dataframe.explain()` and `dataframe.explain(true)` has different verbosity output for physical plan. Currently, this PR only enables verbosity string for operator `HashAggregateExec` and `SortAggregateExec`. We will gradually enable verbosity string for more operators in future. **Less verbose mode:** dataframe.explain(extended = false) `output=[count(a)#85L]` is **NOT** displayed for HashAggregate. ``` scala> Seq((1,2,3)).toDF("a", "b", "c").createTempView("df2") scala> spark.sql("select count(a) from df2").explain() == Physical Plan == *HashAggregate(key=[], functions=[count(1)]) +- Exchange SinglePartition +- *HashAggregate(key=[], functions=[partial_count(1)]) +- LocalTableScan ``` **Verbose mode:** dataframe.explain(extended = true) `output=[count(a)#85L]` is displayed for HashAggregate. ``` scala> spark.sql("select count(a) from df2").explain(true) // "output=[count(a)#85L]" is added ... == Physical Plan == *HashAggregate(key=[], functions=[count(1)], output=[count(a)#85L]) +- Exchange SinglePartition +- *HashAggregate(key=[], functions=[partial_count(1)], output=[count#87L]) +- LocalTableScan ``` ## How was this patch tested? Manual test. Author: Sean ZhongCloses #13535 from clockfly/verbose_breakdown_2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5f731d68 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5f731d68 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5f731d68 Branch: refs/heads/master Commit: 5f731d6859c4516941e5f90c99c966ef76268864 Parents: 0e0904a Author: Sean Zhong Authored: Mon Jun 6 22:59:25 2016 -0700 Committer: Cheng Lian Committed: Mon Jun 6 22:59:25 2016 -0700 -- .../sql/catalyst/expressions/Expression.scala | 4 .../spark/sql/catalyst/plans/QueryPlan.scala| 2 ++ .../spark/sql/catalyst/trees/TreeNode.scala | 23 +++- .../spark/sql/execution/QueryExecution.scala| 14 +++- .../sql/execution/WholeStageCodegenExec.scala | 6 +++-- .../execution/aggregate/HashAggregateExec.scala | 12 -- .../execution/aggregate/SortAggregateExec.scala | 12 -- 7 files changed, 55 insertions(+), 18 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5f731d68/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala index 2ec4621..efe592d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala @@ -190,6 +190,10 @@ abstract class Expression extends TreeNode[Expression] { case single => single :: Nil } + // Marks this as final, Expression.verboseString should never be called, and thus shouldn't be + // overridden by concrete classes. 
+ final override def verboseString: String = simpleString + override def simpleString: String = toString override def toString: String = prettyName + flatArguments.mkString("(", ", ", ")") http://git-wip-us.apache.org/repos/asf/spark/blob/5f731d68/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala index 19a66cf..cf34f4b 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala @@ -257,6 +257,8 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanT override def simpleString: String = statePrefix + super.simpleString + override def verboseString: String = simpleString + /** * All the subqueries of current plan. */ http://git-wip-us.apache.org/repos/asf/spark/blob/5f731d68/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala -- diff --git
spark git commit: [SPARK-15632][SQL] Typed Filter should NOT change the Dataset schema
Repository: spark Updated Branches: refs/heads/branch-2.0 62765cbeb -> a5bec5b81 [SPARK-15632][SQL] Typed Filter should NOT change the Dataset schema ## What changes were proposed in this pull request? This PR makes sure the typed Filter doesn't change the Dataset schema. **Before the change:** ``` scala> val df = spark.range(0,9) scala> df.schema res12: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false)) scala> val afterFilter = df.filter(_=>true) scala> afterFilter.schema // !!! schema is CHANGED!!! Column name is changed from id to value, nullable is changed from false to true. res13: org.apache.spark.sql.types.StructType = StructType(StructField(value,LongType,true)) ``` SerializeFromObject and DeserializeToObject are inserted to wrap the Filter, and these two can possibly change the schema of Dataset. **After the change:** ``` scala> afterFilter.schema // schema is NOT changed. res47: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false)) ``` ## How was this patch tested? Unit test. Author: Sean ZhongCloses #13529 from clockfly/spark-15632. (cherry picked from commit 0e0904a2fce3c4447c24f1752307b6d01ffbd0ad) Signed-off-by: Cheng Lian Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a5bec5b8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a5bec5b8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a5bec5b8 Branch: refs/heads/branch-2.0 Commit: a5bec5b81d9e8ce17f1ce509731b030f0f3538e3 Parents: 62765cb Author: Sean Zhong Authored: Mon Jun 6 22:40:21 2016 -0700 Committer: Cheng Lian Committed: Mon Jun 6 22:40:29 2016 -0700 -- .../optimizer/TypedFilterOptimizationSuite.scala| 4 +++- .../main/scala/org/apache/spark/sql/Dataset.scala | 16 .../test/org/apache/spark/sql/JavaDatasetSuite.java | 13 + .../scala/org/apache/spark/sql/DatasetSuite.scala | 6 ++ .../sql/execution/WholeStageCodegenSuite.scala | 2 +- 5 files changed, 31 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a5bec5b8/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala index 289c16a..63d87bf 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala @@ -57,7 +57,9 @@ class TypedFilterOptimizationSuite extends PlanTest { comparePlans(optimized, expected) } - test("embed deserializer in filter condition if there is only one filter") { + // TODO: Remove this after we completely fix SPARK-15632 by adding optimization rules + // for typed filters. 
+ ignore("embed deserializer in typed filter condition if there is only one filter") { val input = LocalRelation('_1.int, '_2.int) val f = (i: (Int, Int)) => i._1 > 0 http://git-wip-us.apache.org/repos/asf/spark/blob/a5bec5b8/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala index 96c871d..6cbc27d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala @@ -1944,11 +1944,11 @@ class Dataset[T] private[sql]( */ @Experimental def filter(func: T => Boolean): Dataset[T] = { -val deserialized = CatalystSerde.deserialize[T](logicalPlan) +val deserializer = UnresolvedDeserializer(encoderFor[T].deserializer) val function = Literal.create(func, ObjectType(classOf[T => Boolean])) -val condition = Invoke(function, "apply", BooleanType, deserialized.output) -val filter = Filter(condition, deserialized) -withTypedPlan(CatalystSerde.serialize[T](filter)) +val condition = Invoke(function, "apply", BooleanType, deserializer :: Nil) +val filter = Filter(condition, logicalPlan) +withTypedPlan(filter) } /** @@ -1961,11 +1961,11 @@ class Dataset[T] private[sql]( */ @Experimental def filter(func: FilterFunction[T]): Dataset[T] = { -val deserialized = CatalystSerde.deserialize[T](logicalPlan) +val deserializer =
spark git commit: [SPARK-15632][SQL] Typed Filter should NOT change the Dataset schema
Repository: spark Updated Branches: refs/heads/master c409e23ab -> 0e0904a2f [SPARK-15632][SQL] Typed Filter should NOT change the Dataset schema ## What changes were proposed in this pull request? This PR makes sure the typed Filter doesn't change the Dataset schema. **Before the change:** ``` scala> val df = spark.range(0,9) scala> df.schema res12: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false)) scala> val afterFilter = df.filter(_=>true) scala> afterFilter.schema // !!! schema is CHANGED!!! Column name is changed from id to value, nullable is changed from false to true. res13: org.apache.spark.sql.types.StructType = StructType(StructField(value,LongType,true)) ``` SerializeFromObject and DeserializeToObject are inserted to wrap the Filter, and these two can possibly change the schema of Dataset. **After the change:** ``` scala> afterFilter.schema // schema is NOT changed. res47: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false)) ``` ## How was this patch tested? Unit test. Author: Sean ZhongCloses #13529 from clockfly/spark-15632. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0e0904a2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0e0904a2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0e0904a2 Branch: refs/heads/master Commit: 0e0904a2fce3c4447c24f1752307b6d01ffbd0ad Parents: c409e23 Author: Sean Zhong Authored: Mon Jun 6 22:40:21 2016 -0700 Committer: Cheng Lian Committed: Mon Jun 6 22:40:21 2016 -0700 -- .../optimizer/TypedFilterOptimizationSuite.scala| 4 +++- .../main/scala/org/apache/spark/sql/Dataset.scala | 16 .../test/org/apache/spark/sql/JavaDatasetSuite.java | 13 + .../scala/org/apache/spark/sql/DatasetSuite.scala | 6 ++ .../sql/execution/WholeStageCodegenSuite.scala | 2 +- 5 files changed, 31 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0e0904a2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala index 289c16a..63d87bf 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala @@ -57,7 +57,9 @@ class TypedFilterOptimizationSuite extends PlanTest { comparePlans(optimized, expected) } - test("embed deserializer in filter condition if there is only one filter") { + // TODO: Remove this after we completely fix SPARK-15632 by adding optimization rules + // for typed filters. 
+ ignore("embed deserializer in typed filter condition if there is only one filter") { val input = LocalRelation('_1.int, '_2.int) val f = (i: (Int, Int)) => i._1 > 0 http://git-wip-us.apache.org/repos/asf/spark/blob/0e0904a2/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala index 96c871d..6cbc27d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala @@ -1944,11 +1944,11 @@ class Dataset[T] private[sql]( */ @Experimental def filter(func: T => Boolean): Dataset[T] = { -val deserialized = CatalystSerde.deserialize[T](logicalPlan) +val deserializer = UnresolvedDeserializer(encoderFor[T].deserializer) val function = Literal.create(func, ObjectType(classOf[T => Boolean])) -val condition = Invoke(function, "apply", BooleanType, deserialized.output) -val filter = Filter(condition, deserialized) -withTypedPlan(CatalystSerde.serialize[T](filter)) +val condition = Invoke(function, "apply", BooleanType, deserializer :: Nil) +val filter = Filter(condition, logicalPlan) +withTypedPlan(filter) } /** @@ -1961,11 +1961,11 @@ class Dataset[T] private[sql]( */ @Experimental def filter(func: FilterFunction[T]): Dataset[T] = { -val deserialized = CatalystSerde.deserialize[T](logicalPlan) +val deserializer = UnresolvedDeserializer(encoderFor[T].deserializer) val function = Literal.create(func, ObjectType(classOf[FilterFunction[T]])) -val condition =
spark git commit: [SPARK-15652][LAUNCHER] Added a new State (LOST) for the listeners of SparkLauncher
Repository: spark Updated Branches: refs/heads/branch-2.0 5363783af -> 62765cbeb [SPARK-15652][LAUNCHER] Added a new State (LOST) for the listeners of SparkLauncher ## What changes were proposed in this pull request? This situation can happen when the LauncherConnection gets an exception while reading through the socket and terminating silently without notifying making the client/listener think that the job is still in previous state. The fix force sends a notification to client that the job finished with unknown status and let client handle it accordingly. ## How was this patch tested? Added a unit test. Author: Subroto SanyalCloses #13497 from subrotosanyal/SPARK-15652-handle-spark-submit-jvm-crash. (cherry picked from commit c409e23abd128dad33557025f1e824ef47e6222f) Signed-off-by: Marcelo Vanzin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/62765cbe Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/62765cbe Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/62765cbe Branch: refs/heads/branch-2.0 Commit: 62765cbebe0cb41b0c4fdc344828337ee15e1fd2 Parents: 5363783 Author: Subroto Sanyal Authored: Mon Jun 6 16:05:40 2016 -0700 Committer: Marcelo Vanzin Committed: Mon Jun 6 16:05:52 2016 -0700 -- .../apache/spark/launcher/LauncherServer.java | 4 +++ .../apache/spark/launcher/SparkAppHandle.java | 4 ++- .../spark/launcher/LauncherServerSuite.java | 31 3 files changed, 38 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/62765cbe/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java -- diff --git a/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java b/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java index e3413fd..28e9420 100644 --- a/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java +++ b/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java @@ -337,6 +337,10 @@ class LauncherServer implements Closeable { } super.close(); if (handle != null) { + if (!handle.getState().isFinal()) { + LOG.log(Level.WARNING, "Lost connection to spark application."); + handle.setState(SparkAppHandle.State.LOST); + } handle.disconnect(); } } http://git-wip-us.apache.org/repos/asf/spark/blob/62765cbe/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java -- diff --git a/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java b/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java index 625d026..0aa7bd1 100644 --- a/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java +++ b/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java @@ -46,7 +46,9 @@ public interface SparkAppHandle { /** The application finished with a failed status. */ FAILED(true), /** The application was killed. */ -KILLED(true); +KILLED(true), +/** The Spark Submit JVM exited with a unknown status. 
*/ +LOST(true); private final boolean isFinal; http://git-wip-us.apache.org/repos/asf/spark/blob/62765cbe/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java -- diff --git a/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java b/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java index bfe1fcc..12f1a0c 100644 --- a/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java +++ b/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java @@ -152,6 +152,37 @@ public class LauncherServerSuite extends BaseSuite { } } + @Test + public void testSparkSubmitVmShutsDown() throws Exception { +ChildProcAppHandle handle = LauncherServer.newAppHandle(); +TestClient client = null; +final Semaphore semaphore = new Semaphore(0); +try { + Socket s = new Socket(InetAddress.getLoopbackAddress(), +LauncherServer.getServerInstance().getPort()); + handle.addListener(new SparkAppHandle.Listener() { +public void stateChanged(SparkAppHandle handle) { + semaphore.release(); +} +public void infoChanged(SparkAppHandle handle) { + semaphore.release(); +} + }); + client = new TestClient(s); + client.send(new Hello(handle.getSecret(), "1.4.0")); + assertTrue(semaphore.tryAcquire(30, TimeUnit.SECONDS)); + // Make sure the server
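On the client side, the new state arrives through the usual `SparkAppHandle.Listener` callback. A hedged Scala sketch of handling it (the application resource and main class are placeholders, not values from this patch):

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

val listener = new SparkAppHandle.Listener {
  override def stateChanged(handle: SparkAppHandle): Unit = {
    if (handle.getState == SparkAppHandle.State.LOST) {
      // The launcher lost its connection to the spark-submit JVM (e.g. the JVM crashed)
      // before a terminal state was reported; treat the job as failed with unknown status.
      System.err.println("Connection to the Spark application was lost; final status unknown.")
    }
  }
  override def infoChanged(handle: SparkAppHandle): Unit = ()
}

val handle: SparkAppHandle = new SparkLauncher()
  .setAppResource("/path/to/app.jar")   // placeholder
  .setMainClass("com.example.Main")     // placeholder
  .startApplication(listener)
```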
spark git commit: [SPARK-15652][LAUNCHER] Added a new State (LOST) for the listeners of SparkLauncher
Repository: spark Updated Branches: refs/heads/master 36d3dfa59 -> c409e23ab [SPARK-15652][LAUNCHER] Added a new State (LOST) for the listeners of SparkLauncher ## What changes were proposed in this pull request? This situation can happen when the LauncherConnection gets an exception while reading through the socket and terminating silently without notifying making the client/listener think that the job is still in previous state. The fix force sends a notification to client that the job finished with unknown status and let client handle it accordingly. ## How was this patch tested? Added a unit test. Author: Subroto SanyalCloses #13497 from subrotosanyal/SPARK-15652-handle-spark-submit-jvm-crash. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c409e23a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c409e23a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c409e23a Branch: refs/heads/master Commit: c409e23abd128dad33557025f1e824ef47e6222f Parents: 36d3dfa Author: Subroto Sanyal Authored: Mon Jun 6 16:05:40 2016 -0700 Committer: Marcelo Vanzin Committed: Mon Jun 6 16:05:40 2016 -0700 -- .../apache/spark/launcher/LauncherServer.java | 4 +++ .../apache/spark/launcher/SparkAppHandle.java | 4 ++- .../spark/launcher/LauncherServerSuite.java | 31 3 files changed, 38 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c409e23a/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java -- diff --git a/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java b/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java index e3413fd..28e9420 100644 --- a/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java +++ b/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java @@ -337,6 +337,10 @@ class LauncherServer implements Closeable { } super.close(); if (handle != null) { + if (!handle.getState().isFinal()) { + LOG.log(Level.WARNING, "Lost connection to spark application."); + handle.setState(SparkAppHandle.State.LOST); + } handle.disconnect(); } } http://git-wip-us.apache.org/repos/asf/spark/blob/c409e23a/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java -- diff --git a/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java b/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java index 625d026..0aa7bd1 100644 --- a/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java +++ b/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java @@ -46,7 +46,9 @@ public interface SparkAppHandle { /** The application finished with a failed status. */ FAILED(true), /** The application was killed. */ -KILLED(true); +KILLED(true), +/** The Spark Submit JVM exited with a unknown status. 
*/ +LOST(true); private final boolean isFinal; http://git-wip-us.apache.org/repos/asf/spark/blob/c409e23a/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java -- diff --git a/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java b/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java index bfe1fcc..12f1a0c 100644 --- a/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java +++ b/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java @@ -152,6 +152,37 @@ public class LauncherServerSuite extends BaseSuite { } } + @Test + public void testSparkSubmitVmShutsDown() throws Exception { +ChildProcAppHandle handle = LauncherServer.newAppHandle(); +TestClient client = null; +final Semaphore semaphore = new Semaphore(0); +try { + Socket s = new Socket(InetAddress.getLoopbackAddress(), +LauncherServer.getServerInstance().getPort()); + handle.addListener(new SparkAppHandle.Listener() { +public void stateChanged(SparkAppHandle handle) { + semaphore.release(); +} +public void infoChanged(SparkAppHandle handle) { + semaphore.release(); +} + }); + client = new TestClient(s); + client.send(new Hello(handle.getSecret(), "1.4.0")); + assertTrue(semaphore.tryAcquire(30, TimeUnit.SECONDS)); + // Make sure the server matched the client to the handle. + assertNotNull(handle.getConnection()); + close(client); +
svn commit: r1747076 - in /spark: js/downloads.js site/js/downloads.js
Author: srowen Date: Mon Jun 6 20:59:54 2016 New Revision: 1747076 URL: http://svn.apache.org/viewvc?rev=1747076=rev Log: SPARK-15778 part 2: group preview/stable releases in download version dropdown Modified: spark/js/downloads.js spark/site/js/downloads.js Modified: spark/js/downloads.js URL: http://svn.apache.org/viewvc/spark/js/downloads.js?rev=1747076=1747075=1747076=diff == --- spark/js/downloads.js (original) +++ spark/js/downloads.js Mon Jun 6 20:59:54 2016 @@ -53,18 +53,18 @@ addRelease("1.1.0", new Date("9/11/2014" addRelease("1.0.2", new Date("8/5/2014"), sources.concat(packagesV3), true, true); addRelease("1.0.1", new Date("7/11/2014"), sources.concat(packagesV3), false, true); addRelease("1.0.0", new Date("5/30/2014"), sources.concat(packagesV2), false, true); -addRelease("0.9.2", new Date("7/23/2014"), sources.concat(packagesV2), true, false); -addRelease("0.9.1", new Date("4/9/2014"), sources.concat(packagesV2), false, false); -addRelease("0.9.0-incubating", new Date("2/2/2014"), sources.concat(packagesV2), false, false); -addRelease("0.8.1-incubating", new Date("12/19/2013"), sources.concat(packagesV2), true, false); -addRelease("0.8.0-incubating", new Date("9/25/2013"), sources.concat(packagesV1), true, false); -addRelease("0.7.3", new Date("7/16/2013"), sources.concat(packagesV1), true, false); -addRelease("0.7.2", new Date("2/6/2013"), sources.concat(packagesV1), false, false); -addRelease("0.7.0", new Date("2/27/2013"), sources, false, false); +addRelease("0.9.2", new Date("7/23/2014"), sources.concat(packagesV2), true, true); +addRelease("0.9.1", new Date("4/9/2014"), sources.concat(packagesV2), false, true); +addRelease("0.9.0-incubating", new Date("2/2/2014"), sources.concat(packagesV2), false, true); +addRelease("0.8.1-incubating", new Date("12/19/2013"), sources.concat(packagesV2), true, true); +addRelease("0.8.0-incubating", new Date("9/25/2013"), sources.concat(packagesV1), true, true); +addRelease("0.7.3", new Date("7/16/2013"), sources.concat(packagesV1), true, true); +addRelease("0.7.2", new Date("2/6/2013"), sources.concat(packagesV1), false, true); +addRelease("0.7.0", new Date("2/27/2013"), sources, false, true); function append(el, contents) { - el.innerHTML = el.innerHTML + contents; -}; + el.innerHTML += contents; +} function empty(el) { el.innerHTML = ""; @@ -79,27 +79,25 @@ function versionShort(version) { return function initDownloads() { var versionSelect = document.getElementById("sparkVersionSelect"); - // Populate versions - var markedDefault = false; + // Populate stable versions + append(versionSelect, ""); for (var version in releases) { +if (!releases[version].downloadable || !releases[version].stable) { continue; } var releaseDate = releases[version].released; -var downloadable = releases[version].downloadable; -var stable = releases[version].stable; - -if (!downloadable) { continue; } - -var selected = false; -if (!markedDefault && stable) { - selected = true; - markedDefault = true; -} +var title = versionShort(version) + " (" + releaseDate.toDateString().slice(4) + ")"; +append(versionSelect, "" + title + ""); + } + append(versionSelect, ""); -// Don't display incubation status here + // Populate other versions + append(versionSelect, ""); + for (var version in releases) { +if (!releases[version].downloadable || releases[version].stable) { continue; } +var releaseDate = releases[version].released; var title = versionShort(version) + " (" + releaseDate.toDateString().slice(4) + ")"; -append(versionSelect, - "" + - title + ""); 
+append(versionSelect, "" + title + ""); } + append(versionSelect, ""); // Populate packages and (transitively) releases onVersionSelect(); Modified: spark/site/js/downloads.js URL: http://svn.apache.org/viewvc/spark/site/js/downloads.js?rev=1747076=1747075=1747076=diff == --- spark/site/js/downloads.js (original) +++ spark/site/js/downloads.js Mon Jun 6 20:59:54 2016 @@ -53,18 +53,18 @@ addRelease("1.1.0", new Date("9/11/2014" addRelease("1.0.2", new Date("8/5/2014"), sources.concat(packagesV3), true, true); addRelease("1.0.1", new Date("7/11/2014"), sources.concat(packagesV3), false, true); addRelease("1.0.0", new Date("5/30/2014"), sources.concat(packagesV2), false, true); -addRelease("0.9.2", new Date("7/23/2014"), sources.concat(packagesV2), true, false); -addRelease("0.9.1", new Date("4/9/2014"), sources.concat(packagesV2), false, false); -addRelease("0.9.0-incubating", new Date("2/2/2014"), sources.concat(packagesV2), false, false); -addRelease("0.8.1-incubating", new Date("12/19/2013"), sources.concat(packagesV2), true, false); -addRelease("0.8.0-incubating", new Date("9/25/2013"),
svn commit: r1747061 - in /spark: downloads.md js/downloads.js site/downloads.html site/js/downloads.js
Author: srowen Date: Mon Jun 6 19:56:07 2016 New Revision: 1747061 URL: http://svn.apache.org/viewvc?rev=1747061=rev Log: SPARK-15778 add spark-2.0.0-preview release to options and other minor related updates Modified: spark/downloads.md spark/js/downloads.js spark/site/downloads.html spark/site/js/downloads.js Modified: spark/downloads.md URL: http://svn.apache.org/viewvc/spark/downloads.md?rev=1747061=1747060=1747061=diff == --- spark/downloads.md (original) +++ spark/downloads.md Mon Jun 6 19:56:07 2016 @@ -16,7 +16,7 @@ $(document).ready(function() { ## Download Apache Spark -Our latest version is Apache Spark 1.6.1, released on March 9, 2016 +Our latest stable version is Apache Spark 1.6.1, released on March 9, 2016 (release notes) https://github.com/apache/spark/releases/tag/v1.6.1;>(git tag) @@ -36,6 +36,17 @@ Our latest version is Apache Spark 1.6.1 _Note: Scala 2.11 users should download the Spark source package and build [with Scala 2.11 support](http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211)._ +### Latest Preview Release + +Preview releases, as the name suggests, are releases for previewing upcoming features. +Unlike nightly packages, preview releases have been audited by the project's management committee +to satisfy the legal requirements of Apache Software Foundation's release policy. +Preview releases are not meant to be functional, i.e. they can and highly likely will contain +critical bugs or documentation errors. + +The latest preview release is Spark 2.0.0-preview, published on May 24, 2016. +You can select and download it above. + ### Link with Spark Spark artifacts are [hosted in Maven Central](http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.spark%22). You can add a Maven dependency with the following coordinates: @@ -54,14 +65,9 @@ If you are interested in working with th Once you've downloaded Spark, you can find instructions for installing and building it on the documentation page. -Stable Releases - - -### Latest Preview Release (Spark 2.0.0-preview) -Preview releases, as the name suggests, are releases for previewing upcoming features. Unlike nightly packages, preview releases have been audited by the project's management committee to satisfy the legal requirements of Apache Software Foundation's release policy.Preview releases are not meant to be functional, i.e. they can and highly likely will contain critical bugs or documentation errors. - -The latest preview release is Spark 2.0.0-preview, published on May 24, 2016. You can https://dist.apache.org/repos/dist/release/spark/spark-2.0.0-preview/;>download it here. +### Release Notes for Stable Releases + ### Nightly Packages and Artifacts For developers, Spark maintains nightly builds and SNAPSHOT artifacts. More information is available on the [Spark developer Wiki](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-NightlyBuilds). 
Modified: spark/js/downloads.js URL: http://svn.apache.org/viewvc/spark/js/downloads.js?rev=1747061=1747060=1747061=diff == --- spark/js/downloads.js (original) +++ spark/js/downloads.js Mon Jun 6 19:56:07 2016 @@ -3,8 +3,8 @@ releases = {}; -function addRelease(version, releaseDate, packages, downloadable) { - releases[version] = {released: releaseDate, packages: packages, downloadable: downloadable}; +function addRelease(version, releaseDate, packages, downloadable, stable) { + releases[version] = {released: releaseDate, packages: packages, downloadable: downloadable, stable: stable}; } var sources = {pretty: "Source Code [can build several Hadoop versions]", tag: "sources"}; @@ -13,8 +13,9 @@ var hadoop1 = {pretty: "Pre-built for Ha var cdh4 = {pretty: "Pre-built for CDH 4", tag: "cdh4"}; var hadoop2 = {pretty: "Pre-built for Hadoop 2.2", tag: "hadoop2"}; var hadoop2p3 = {pretty: "Pre-built for Hadoop 2.3", tag: "hadoop2.3"}; -var hadoop2p4 = {pretty: "Pre-built for Hadoop 2.4 and later", tag: "hadoop2.4"}; -var hadoop2p6 = {pretty: "Pre-built for Hadoop 2.6 and later", tag: "hadoop2.6"}; +var hadoop2p4 = {pretty: "Pre-built for Hadoop 2.4", tag: "hadoop2.4"}; +var hadoop2p6 = {pretty: "Pre-built for Hadoop 2.6", tag: "hadoop2.6"}; +var hadoop2p7 = {pretty: "Pre-built for Hadoop 2.7 and later", tag: "hadoop2.7"}; var mapr3 = {pretty: "Pre-built for MapR 3.X", tag: "mapr3"}; var mapr4 = {pretty: "Pre-built for MapR 4.X", tag: "mapr4"}; @@ -31,32 +32,35 @@ var packagesV4 = [hadoop2p4, hadoop2p3, var packagesV5 = [hadoop2p6].concat(packagesV4); // 1.4.0+ var packagesV6 = [hadoopFree, hadoop2p6, hadoop2p4, hadoop2p3].concat(packagesV1); +// 2.0.0+ +var packagesV7 = [hadoopFree, hadoop2p7, hadoop2p6, hadoop2p4, hadoop2p3]; -addRelease("1.6.1", new
spark git commit: [SPARK-15783][CORE] still some flakiness in these blacklist tests so ignore for now
Repository: spark Updated Branches: refs/heads/master 0b8d69499 -> 36d3dfa59 [SPARK-15783][CORE] still some flakiness in these blacklist tests so ignore for now ## What changes were proposed in this pull request? There is still some flakiness in BlacklistIntegrationSuite, so turning it off for the moment to avoid breaking more builds -- will turn it back with more fixes. ## How was this patch tested? jenkins. Author: Imran RashidCloses #13528 from squito/ignore_blacklist. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/36d3dfa5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/36d3dfa5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/36d3dfa5 Branch: refs/heads/master Commit: 36d3dfa59a1ec0af6118e0667b80e9b7628e2cb6 Parents: 0b8d694 Author: Imran Rashid Authored: Mon Jun 6 12:53:11 2016 -0700 Committer: Imran Rashid Committed: Mon Jun 6 12:53:11 2016 -0700 -- .../org/apache/spark/scheduler/BlacklistIntegrationSuite.scala | 6 +++--- .../org/apache/spark/scheduler/SchedulerIntegrationSuite.scala | 5 + 2 files changed, 8 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/36d3dfa5/core/src/test/scala/org/apache/spark/scheduler/BlacklistIntegrationSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/scheduler/BlacklistIntegrationSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/BlacklistIntegrationSuite.scala index 3a4b7af..d8a4b19 100644 --- a/core/src/test/scala/org/apache/spark/scheduler/BlacklistIntegrationSuite.scala +++ b/core/src/test/scala/org/apache/spark/scheduler/BlacklistIntegrationSuite.scala @@ -41,7 +41,7 @@ class BlacklistIntegrationSuite extends SchedulerIntegrationSuite[MultiExecutorM // Test demonstrating the issue -- without a config change, the scheduler keeps scheduling // according to locality preferences, and so the job fails - testScheduler("If preferred node is bad, without blacklist job will fail") { + ignore("If preferred node is bad, without blacklist job will fail") { val rdd = new MockRDDWithLocalityPrefs(sc, 10, Nil, badHost) withBackend(badHostBackend _) { val jobFuture = submit(rdd, (0 until 10).toArray) @@ -54,7 +54,7 @@ class BlacklistIntegrationSuite extends SchedulerIntegrationSuite[MultiExecutorM // even with the blacklist turned on, if maxTaskFailures is not more than the number // of executors on the bad node, then locality preferences will lead to us cycling through // the executors on the bad node, and still failing the job - testScheduler( + ignoreScheduler( "With blacklist on, job will still fail if there are too many bad executors on bad host", extraConfs = Seq( // just set this to something much longer than the test duration @@ -72,7 +72,7 @@ class BlacklistIntegrationSuite extends SchedulerIntegrationSuite[MultiExecutorM // Here we run with the blacklist on, and maxTaskFailures high enough that we'll eventually // schedule on a good node and succeed the job - testScheduler( + ignoreScheduler( "Bad node with multiple executors, job will still succeed with the right confs", extraConfs = Seq( // just set this to something much longer than the test duration http://git-wip-us.apache.org/repos/asf/spark/blob/36d3dfa5/core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala index 92bd765..5271a56 100644 --- 
a/core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala +++ b/core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala @@ -89,6 +89,11 @@ abstract class SchedulerIntegrationSuite[T <: MockBackend: ClassTag] extends Spa } } + // still a few races to work out in the blacklist tests, so ignore some tests + def ignoreScheduler(name: String, extraConfs: Seq[(String, String)])(testBody: => Unit): Unit = { +ignore(name)(testBody) + } + /** * A map from partition -> results for all tasks of a job when you call this test framework's * [[submit]] method. Two important considerations:
spark git commit: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Repository: spark Updated Branches: refs/heads/master 4c74ee8d8 -> 0b8d69499 [SPARK-15764][SQL] Replace N^2 loop in BindReferences BindReferences contains a n^2 loop which causes performance issues when operating over large schemas: to determine the ordinal of an attribute reference, we perform a linear scan over the `input` array. Because input can sometimes be a `List`, the call to `input(ordinal).nullable` can also be O(n). Instead of performing a linear scan, we can convert the input into an array and build a hash map to map from expression ids to ordinals. The greater up-front cost of the map construction is offset by the fact that an expression can contain multiple attribute references, so the cost of the map construction is amortized across a number of lookups. Perf. benchmarks to follow. /cc ericl Author: Josh RosenCloses #13505 from JoshRosen/bind-references-improvement. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0b8d6949 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0b8d6949 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0b8d6949 Branch: refs/heads/master Commit: 0b8d694999b43ada4833388cad6c285c7757cbf7 Parents: 4c74ee8 Author: Josh Rosen Authored: Mon Jun 6 11:44:51 2016 -0700 Committer: Josh Rosen Committed: Mon Jun 6 11:44:51 2016 -0700 -- .../sql/catalyst/expressions/AttributeMap.scala | 7 .../catalyst/expressions/BoundAttribute.scala | 6 ++-- .../sql/catalyst/expressions/package.scala | 34 +++- .../spark/sql/catalyst/plans/QueryPlan.scala| 2 +- .../execution/aggregate/HashAggregateExec.scala | 2 +- .../columnar/InMemoryTableScanExec.scala| 4 +-- 6 files changed, 40 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0b8d6949/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala index ef3cc55..96a11e3 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala @@ -26,13 +26,6 @@ object AttributeMap { def apply[A](kvs: Seq[(Attribute, A)]): AttributeMap[A] = { new AttributeMap(kvs.map(kv => (kv._1.exprId, kv)).toMap) } - - /** Given a schema, constructs an [[AttributeMap]] from [[Attribute]] to ordinal */ - def byIndex(schema: Seq[Attribute]): AttributeMap[Int] = apply(schema.zipWithIndex) - - /** Given a schema, constructs a map from ordinal to Attribute. 
*/ - def toIndex(schema: Seq[Attribute]): Map[Int, Attribute] = -schema.zipWithIndex.map { case (a, i) => i -> a }.toMap } class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)]) http://git-wip-us.apache.org/repos/asf/spark/blob/0b8d6949/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala index a38f1ec..7d16118 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala @@ -82,16 +82,16 @@ object BindReferences extends Logging { def bindReference[A <: Expression]( expression: A, - input: Seq[Attribute], + input: AttributeSeq, allowFailures: Boolean = false): A = { expression.transform { case a: AttributeReference => attachTree(a, "Binding attribute") { -val ordinal = input.indexWhere(_.exprId == a.exprId) +val ordinal = input.indexOf(a.exprId) if (ordinal == -1) { if (allowFailures) { a } else { -sys.error(s"Couldn't find $a in ${input.mkString("[", ",", "]")}") +sys.error(s"Couldn't find $a in ${input.attrs.mkString("[", ",", "]")}") } } else { BoundReference(ordinal, a.dataType, input(ordinal).nullable) http://git-wip-us.apache.org/repos/asf/spark/blob/0b8d6949/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala -- diff --git
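The heart of the change is paying the map-construction cost once per bind and amortizing it over every attribute reference in the expression. A simplified, standalone sketch of that lookup structure (illustrative only; not Spark's actual `AttributeSeq`):

```scala
// Standalone sketch -- hash map from expression id to ordinal instead of a linear scan.
case class Attr(exprId: Long, name: String, nullable: Boolean)

class AttrSeq(input: Seq[Attr]) {
  // Materialize once so positional access is O(1) even if the input was a List.
  val attrs: Array[Attr] = input.toArray
  // Built lazily; its cost is amortized across all references bound against this input.
  private lazy val exprIdToOrdinal: Map[Long, Int] =
    attrs.iterator.map(_.exprId).zipWithIndex.toMap

  def indexOf(exprId: Long): Int = exprIdToOrdinal.getOrElse(exprId, -1)
  def apply(ordinal: Int): Attr = attrs(ordinal)
}

val input = new AttrSeq(Seq(Attr(1L, "a", nullable = false), Attr(2L, "b", nullable = true)))
assert(input.indexOf(2L) == 1)    // O(1) lookup
assert(input.indexOf(99L) == -1)  // unresolved reference, mirroring bindReference's error path
```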
spark git commit: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Repository: spark Updated Branches: refs/heads/branch-2.0 d07bce49f -> 5363783af [SPARK-15764][SQL] Replace N^2 loop in BindReferences BindReferences contains a n^2 loop which causes performance issues when operating over large schemas: to determine the ordinal of an attribute reference, we perform a linear scan over the `input` array. Because input can sometimes be a `List`, the call to `input(ordinal).nullable` can also be O(n). Instead of performing a linear scan, we can convert the input into an array and build a hash map to map from expression ids to ordinals. The greater up-front cost of the map construction is offset by the fact that an expression can contain multiple attribute references, so the cost of the map construction is amortized across a number of lookups. Perf. benchmarks to follow. /cc ericl Author: Josh RosenCloses #13505 from JoshRosen/bind-references-improvement. (cherry picked from commit 0b8d694999b43ada4833388cad6c285c7757cbf7) Signed-off-by: Josh Rosen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5363783a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5363783a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5363783a Branch: refs/heads/branch-2.0 Commit: 5363783af7d93b7597181c8b39b83800fa059543 Parents: d07bce4 Author: Josh Rosen Authored: Mon Jun 6 11:44:51 2016 -0700 Committer: Josh Rosen Committed: Mon Jun 6 11:45:35 2016 -0700 -- .../sql/catalyst/expressions/AttributeMap.scala | 7 .../catalyst/expressions/BoundAttribute.scala | 6 ++-- .../sql/catalyst/expressions/package.scala | 34 +++- .../spark/sql/catalyst/plans/QueryPlan.scala| 2 +- .../execution/aggregate/HashAggregateExec.scala | 2 +- .../columnar/InMemoryTableScanExec.scala| 4 +-- 6 files changed, 40 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5363783a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala index ef3cc55..96a11e3 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala @@ -26,13 +26,6 @@ object AttributeMap { def apply[A](kvs: Seq[(Attribute, A)]): AttributeMap[A] = { new AttributeMap(kvs.map(kv => (kv._1.exprId, kv)).toMap) } - - /** Given a schema, constructs an [[AttributeMap]] from [[Attribute]] to ordinal */ - def byIndex(schema: Seq[Attribute]): AttributeMap[Int] = apply(schema.zipWithIndex) - - /** Given a schema, constructs a map from ordinal to Attribute. 
*/ - def toIndex(schema: Seq[Attribute]): Map[Int, Attribute] = -schema.zipWithIndex.map { case (a, i) => i -> a }.toMap } class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)]) http://git-wip-us.apache.org/repos/asf/spark/blob/5363783a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala index a38f1ec..7d16118 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala @@ -82,16 +82,16 @@ object BindReferences extends Logging { def bindReference[A <: Expression]( expression: A, - input: Seq[Attribute], + input: AttributeSeq, allowFailures: Boolean = false): A = { expression.transform { case a: AttributeReference => attachTree(a, "Binding attribute") { -val ordinal = input.indexWhere(_.exprId == a.exprId) +val ordinal = input.indexOf(a.exprId) if (ordinal == -1) { if (allowFailures) { a } else { -sys.error(s"Couldn't find $a in ${input.mkString("[", ",", "]")}") +sys.error(s"Couldn't find $a in ${input.attrs.mkString("[", ",", "]")}") } } else { BoundReference(ordinal, a.dataType, input(ordinal).nullable)
spark git commit: [SPARK-15721][ML] Make DefaultParamsReadable, DefaultParamsWritable public
Repository: spark Updated Branches: refs/heads/master fa4bc8ea8 -> 4c74ee8d8 [SPARK-15721][ML] Make DefaultParamsReadable, DefaultParamsWritable public ## What changes were proposed in this pull request? Made DefaultParamsReadable, DefaultParamsWritable public. Also added relevant doc and annotations. Added UnaryTransformerExample to demonstrate use of UnaryTransformer and DefaultParamsReadable,Writable. ## How was this patch tested? Wrote example making use of the now-public APIs. Compiled and ran locally Author: Joseph K. BradleyCloses #13461 from jkbradley/defaultparamswritable. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4c74ee8d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4c74ee8d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4c74ee8d Branch: refs/heads/master Commit: 4c74ee8d8e1c3139d3d322ae68977f2ab53295df Parents: fa4bc8e Author: Joseph K. Bradley Authored: Mon Jun 6 09:49:45 2016 -0700 Committer: Joseph K. Bradley Committed: Mon Jun 6 09:49:45 2016 -0700 -- .../examples/ml/UnaryTransformerExample.scala | 122 +++ .../org/apache/spark/ml/util/ReadWrite.scala| 44 ++- 2 files changed, 163 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4c74ee8d/examples/src/main/scala/org/apache/spark/examples/ml/UnaryTransformerExample.scala -- diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/UnaryTransformerExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/UnaryTransformerExample.scala new file mode 100644 index 000..13c72f8 --- /dev/null +++ b/examples/src/main/scala/org/apache/spark/examples/ml/UnaryTransformerExample.scala @@ -0,0 +1,122 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +// scalastyle:off println +package org.apache.spark.examples.ml + +// $example on$ +import org.apache.spark.ml.UnaryTransformer +import org.apache.spark.ml.param.DoubleParam +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable} +import org.apache.spark.sql.functions.col +// $example off$ +import org.apache.spark.sql.SparkSession +// $example on$ +import org.apache.spark.sql.types.{DataType, DataTypes} +import org.apache.spark.util.Utils +// $example off$ + +/** + * An example demonstrating creating a custom [[org.apache.spark.ml.Transformer]] using + * the [[UnaryTransformer]] abstraction. + * + * Run with + * {{{ + * bin/run-example ml.UnaryTransformerExample + * }}} + */ +object UnaryTransformerExample { + + // $example on$ + /** + * Simple Transformer which adds a constant value to input Doubles. + * + * [[UnaryTransformer]] can be used to create a stage usable within Pipelines. 
+ * It defines parameters for specifying input and output columns: + * [[UnaryTransformer.inputCol]] and [[UnaryTransformer.outputCol]]. + * It can optionally handle schema validation. + * + * [[DefaultParamsWritable]] provides a default implementation for persisting instances + * of this Transformer. + */ + class MyTransformer(override val uid: String) +extends UnaryTransformer[Double, Double, MyTransformer] with DefaultParamsWritable { + +final val shift: DoubleParam = new DoubleParam(this, "shift", "Value added to input") + +def getShift: Double = $(shift) + +def setShift(value: Double): this.type = set(shift, value) + +def this() = this(Identifiable.randomUID("myT")) + +override protected def createTransformFunc: Double => Double = (input: Double) => { + input + $(shift) +} + +override protected def outputDataType: DataType = DataTypes.DoubleType + +override protected def validateInputType(inputType: DataType): Unit = { + require(inputType == DataTypes.DoubleType, s"Bad input type: $inputType. Requires Double.") +} + } + + /** + * Companion object for our simple
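With `DefaultParamsWritable` mixed in and a companion object extending `DefaultParamsReadable`, persistence of such a transformer needs no extra code. A hedged usage sketch (assumes an active `SparkSession` and the `MyTransformer` class from the example file above; the save path is a placeholder):

```scala
// Hedged usage sketch for the now-public read/write mixins.
val myT = new MyTransformer().setShift(2.0)
myT.write.overwrite().save("/tmp/my-transformer")           // provided by DefaultParamsWritable
val restored = MyTransformer.load("/tmp/my-transformer")    // provided by DefaultParamsReadable
assert(restored.getShift == 2.0)
```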
spark git commit: [SPARK-15721][ML] Make DefaultParamsReadable, DefaultParamsWritable public
Repository: spark Updated Branches: refs/heads/branch-2.0 e38ff70e6 -> d07bce49f [SPARK-15721][ML] Make DefaultParamsReadable, DefaultParamsWritable public ## What changes were proposed in this pull request? Made DefaultParamsReadable, DefaultParamsWritable public. Also added relevant doc and annotations. Added UnaryTransformerExample to demonstrate use of UnaryTransformer and DefaultParamsReadable,Writable. ## How was this patch tested? Wrote example making use of the now-public APIs. Compiled and ran locally Author: Joseph K. BradleyCloses #13461 from jkbradley/defaultparamswritable. (cherry picked from commit 4c74ee8d8e1c3139d3d322ae68977f2ab53295df) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d07bce49 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d07bce49 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d07bce49 Branch: refs/heads/branch-2.0 Commit: d07bce49fc77aff25330920cc55b7079a3a2995c Parents: e38ff70 Author: Joseph K. Bradley Authored: Mon Jun 6 09:49:45 2016 -0700 Committer: Joseph K. Bradley Committed: Mon Jun 6 09:49:56 2016 -0700 -- .../examples/ml/UnaryTransformerExample.scala | 122 +++ .../org/apache/spark/ml/util/ReadWrite.scala| 44 ++- 2 files changed, 163 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d07bce49/examples/src/main/scala/org/apache/spark/examples/ml/UnaryTransformerExample.scala -- diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/UnaryTransformerExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/UnaryTransformerExample.scala new file mode 100644 index 000..13c72f8 --- /dev/null +++ b/examples/src/main/scala/org/apache/spark/examples/ml/UnaryTransformerExample.scala @@ -0,0 +1,122 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +// scalastyle:off println +package org.apache.spark.examples.ml + +// $example on$ +import org.apache.spark.ml.UnaryTransformer +import org.apache.spark.ml.param.DoubleParam +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable} +import org.apache.spark.sql.functions.col +// $example off$ +import org.apache.spark.sql.SparkSession +// $example on$ +import org.apache.spark.sql.types.{DataType, DataTypes} +import org.apache.spark.util.Utils +// $example off$ + +/** + * An example demonstrating creating a custom [[org.apache.spark.ml.Transformer]] using + * the [[UnaryTransformer]] abstraction. + * + * Run with + * {{{ + * bin/run-example ml.UnaryTransformerExample + * }}} + */ +object UnaryTransformerExample { + + // $example on$ + /** + * Simple Transformer which adds a constant value to input Doubles. 
+ * + * [[UnaryTransformer]] can be used to create a stage usable within Pipelines. + * It defines parameters for specifying input and output columns: + * [[UnaryTransformer.inputCol]] and [[UnaryTransformer.outputCol]]. + * It can optionally handle schema validation. + * + * [[DefaultParamsWritable]] provides a default implementation for persisting instances + * of this Transformer. + */ + class MyTransformer(override val uid: String) +extends UnaryTransformer[Double, Double, MyTransformer] with DefaultParamsWritable { + +final val shift: DoubleParam = new DoubleParam(this, "shift", "Value added to input") + +def getShift: Double = $(shift) + +def setShift(value: Double): this.type = set(shift, value) + +def this() = this(Identifiable.randomUID("myT")) + +override protected def createTransformFunc: Double => Double = (input: Double) => { + input + $(shift) +} + +override protected def outputDataType: DataType = DataTypes.DoubleType + +override protected def validateInputType(inputType: DataType): Unit = { + require(inputType ==
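The example diff above is truncated in this digest. For context, a minimal Scala sketch of how a Transformer like `MyTransformer` is typically paired with a `DefaultParamsReadable` companion object so it can be saved and reloaded; this is a hedged illustration (paths and the shift value are made up), not necessarily the exact code elided from the diff:

```scala
import org.apache.spark.ml.util.DefaultParamsReadable

// Companion object: DefaultParamsReadable supplies the default MLReader, so
// MyTransformer.load(path) round-trips an instance persisted via DefaultParamsWritable.
object MyTransformer extends DefaultParamsReadable[MyTransformer] {
  override def load(path: String): MyTransformer = super.load(path)
}

// Illustrative usage (hypothetical path and shift value):
// val t = new MyTransformer().setShift(0.5)
// t.write.overwrite().save("/tmp/my-transformer")
// val reloaded = MyTransformer.load("/tmp/my-transformer")
```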
spark git commit: [SPARK-14279][BUILD] Pick the spark version from pom
Repository: spark Updated Branches: refs/heads/master 00ad4f054 -> fa4bc8ea8 [SPARK-14279][BUILD] Pick the spark version from pom ## What changes were proposed in this pull request? Change the way spark picks up version information. Also embed the build information to better identify the spark version running. More context can be found here : https://github.com/apache/spark/pull/12152 ## How was this patch tested? Ran the mvn and sbt builds to verify the version information was being displayed correctly on executing spark-submit --version ![image](https://cloud.githubusercontent.com/assets/7732317/15197251/f7c673a2-1795-11e6-8b2f-88f2a70cf1c1.png) Author: Dhruve AsharCloses #13061 from dhruve/impr/SPARK-14279. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fa4bc8ea Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fa4bc8ea Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fa4bc8ea Branch: refs/heads/master Commit: fa4bc8ea8bab1277d1482da370dac79947cac719 Parents: 00ad4f0 Author: Dhruve Ashar Authored: Mon Jun 6 09:42:50 2016 -0700 Committer: Marcelo Vanzin Committed: Mon Jun 6 09:42:50 2016 -0700 -- build/spark-build-info | 38 ++ core/pom.xml| 31 +++ .../org/apache/spark/deploy/SparkSubmit.scala | 7 ++- .../main/scala/org/apache/spark/package.scala | 55 +++- pom.xml | 6 ++- project/SparkBuild.scala| 21 ++-- 6 files changed, 150 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fa4bc8ea/build/spark-build-info -- diff --git a/build/spark-build-info b/build/spark-build-info new file mode 100755 index 000..ad0ec67 --- /dev/null +++ b/build/spark-build-info @@ -0,0 +1,38 @@ +#!/usr/bin/env bash + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# This script generates the build info for spark and places it into the spark-version-info.properties file. +# Arguments: +# build_tgt_directory - The target directory where properties file would be created. 
[./core/target/extra-resources] +# spark_version - The current version of spark + +RESOURCE_DIR="$1" +mkdir -p "$RESOURCE_DIR" +SPARK_BUILD_INFO="${RESOURCE_DIR}"/spark-version-info.properties + +echo_build_properties() { + echo version=$1 + echo user=$USER + echo revision=$(git rev-parse HEAD) + echo branch=$(git rev-parse --abbrev-ref HEAD) + echo date=$(date -u +%Y-%m-%dT%H:%M:%SZ) + echo url=$(git config --get remote.origin.url) +} + +echo_build_properties $2 > "$SPARK_BUILD_INFO" http://git-wip-us.apache.org/repos/asf/spark/blob/fa4bc8ea/core/pom.xml -- diff --git a/core/pom.xml b/core/pom.xml index 45f8bfc..f5fdb40 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -337,9 +337,40 @@ target/scala-${scala.binary.version}/classes target/scala-${scala.binary.version}/test-classes + + +${project.basedir}/src/main/resources + + + +${project.build.directory}/extra-resources +true + + org.apache.maven.plugins +maven-antrun-plugin + + +generate-resources + + + + + + + + + + + run + + + + + +org.apache.maven.plugins maven-dependency-plugin
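As a rough illustration of how the generated file is consumed, here is a hedged Scala sketch of loading `spark-version-info.properties` from the classpath. The property keys follow the script above; the object name and defaults are assumptions, not the actual code added to `core/src/main/scala/org/apache/spark/package.scala`:

```scala
import java.util.Properties

// Hypothetical reader for the spark-version-info.properties resource produced by
// build/spark-build-info; the file carries keys such as version, user, revision,
// branch, date and url.
object SparkBuildInfo {
  private val props = new Properties()
  private val in = Thread.currentThread().getContextClassLoader
    .getResourceAsStream("spark-version-info.properties")
  if (in != null) {
    try props.load(in) finally in.close()
  }

  val version: String   = props.getProperty("version", "<unknown>")
  val revision: String  = props.getProperty("revision", "<unknown>")
  val branch: String    = props.getProperty("branch", "<unknown>")
  val buildDate: String = props.getProperty("date", "<unknown>")
}
```

Per the commit description, `spark-submit --version` then surfaces this build information alongside the Spark version.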
spark git commit: [SPARK-14900][ML][PYSPARK] Add accuracy and deprecate precision, recall, f1
Repository: spark Updated Branches: refs/heads/branch-2.0 86a35a229 -> e38ff70e6 [SPARK-14900][ML][PYSPARK] Add accuracy and deprecate precison,recall,f1 ## What changes were proposed in this pull request? 1, add accuracy for MulticlassMetrics 2, deprecate overall precision,recall,f1 and recommend accuracy usage ## How was this patch tested? manual tests in pyspark shell Author: Zheng RuiFengCloses #13511 from zhengruifeng/deprecate_py_precisonrecall. (cherry picked from commit 00ad4f054cd044e17d29b7c2c62efd8616462619) Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e38ff70e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e38ff70e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e38ff70e Branch: refs/heads/branch-2.0 Commit: e38ff70e6bacf1c85edc390d28f8a8d5ecc6cbc3 Parents: 86a35a2 Author: Zheng RuiFeng Authored: Mon Jun 6 15:19:22 2016 +0100 Committer: Sean Owen Committed: Mon Jun 6 15:19:38 2016 +0100 -- python/pyspark/mllib/evaluation.py | 18 ++ 1 file changed, 18 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e38ff70e/python/pyspark/mllib/evaluation.py -- diff --git a/python/pyspark/mllib/evaluation.py b/python/pyspark/mllib/evaluation.py index 5f32f09..2eaac87 100644 --- a/python/pyspark/mllib/evaluation.py +++ b/python/pyspark/mllib/evaluation.py @@ -15,6 +15,8 @@ # limitations under the License. # +import warnings + from pyspark import since from pyspark.mllib.common import JavaModelWrapper, callMLlibFunc from pyspark.sql import SQLContext @@ -181,6 +183,8 @@ class MulticlassMetrics(JavaModelWrapper): 0.66... >>> metrics.recall() 0.66... +>>> metrics.accuracy() +0.66... >>> metrics.weightedFalsePositiveRate 0.19... >>> metrics.weightedPrecision @@ -233,6 +237,8 @@ class MulticlassMetrics(JavaModelWrapper): Returns precision or precision for a given label (category) if specified. """ if label is None: +# note:: Deprecated in 2.0.0. Use accuracy. +warnings.warn("Deprecated in 2.0.0. Use accuracy.") return self.call("precision") else: return self.call("precision", float(label)) @@ -243,6 +249,8 @@ class MulticlassMetrics(JavaModelWrapper): Returns recall or recall for a given label (category) if specified. """ if label is None: +# note:: Deprecated in 2.0.0. Use accuracy. +warnings.warn("Deprecated in 2.0.0. Use accuracy.") return self.call("recall") else: return self.call("recall", float(label)) @@ -254,6 +262,8 @@ class MulticlassMetrics(JavaModelWrapper): """ if beta is None: if label is None: +# note:: Deprecated in 2.0.0. Use accuracy. +warnings.warn("Deprecated in 2.0.0. Use accuracy.") return self.call("fMeasure") else: return self.call("fMeasure", label) @@ -263,6 +273,14 @@ class MulticlassMetrics(JavaModelWrapper): else: return self.call("fMeasure", label, beta) +@since('2.0.0') +def accuracy(self): +""" +Returns accuracy (equals to the total number of correctly classified instances +out of the total number of instances). +""" +return self.call("accuracy") + @property @since('1.4.0') def weightedTruePositiveRate(self): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
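The change above is on the PySpark side; the metric it exposes is the one already available on the underlying Scala `MulticlassMetrics`. A minimal Scala sketch of the recommended usage (the input RDD is illustrative):

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.rdd.RDD

// predictionAndLabels: RDD of (prediction, label) pairs.
def summarize(predictionAndLabels: RDD[(Double, Double)]): Unit = {
  val metrics = new MulticlassMetrics(predictionAndLabels)
  // Preferred from 2.0.0 on: overall accuracy instead of the deprecated
  // overall precision / recall / fMeasure.
  println(s"Accuracy = ${metrics.accuracy}")
  // Per-label variants remain available:
  println(s"Precision(label = 1.0) = ${metrics.precision(1.0)}")
}
```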
spark git commit: [SPARK-14900][ML][PYSPARK] Add accuracy and deprecate precision, recall, f1
Repository: spark Updated Branches: refs/heads/master a95252823 -> 00ad4f054 [SPARK-14900][ML][PYSPARK] Add accuracy and deprecate precison,recall,f1 ## What changes were proposed in this pull request? 1, add accuracy for MulticlassMetrics 2, deprecate overall precision,recall,f1 and recommend accuracy usage ## How was this patch tested? manual tests in pyspark shell Author: Zheng RuiFengCloses #13511 from zhengruifeng/deprecate_py_precisonrecall. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/00ad4f05 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/00ad4f05 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/00ad4f05 Branch: refs/heads/master Commit: 00ad4f054cd044e17d29b7c2c62efd8616462619 Parents: a952528 Author: Zheng RuiFeng Authored: Mon Jun 6 15:19:22 2016 +0100 Committer: Sean Owen Committed: Mon Jun 6 15:19:22 2016 +0100 -- python/pyspark/mllib/evaluation.py | 18 ++ 1 file changed, 18 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/00ad4f05/python/pyspark/mllib/evaluation.py -- diff --git a/python/pyspark/mllib/evaluation.py b/python/pyspark/mllib/evaluation.py index 5f32f09..2eaac87 100644 --- a/python/pyspark/mllib/evaluation.py +++ b/python/pyspark/mllib/evaluation.py @@ -15,6 +15,8 @@ # limitations under the License. # +import warnings + from pyspark import since from pyspark.mllib.common import JavaModelWrapper, callMLlibFunc from pyspark.sql import SQLContext @@ -181,6 +183,8 @@ class MulticlassMetrics(JavaModelWrapper): 0.66... >>> metrics.recall() 0.66... +>>> metrics.accuracy() +0.66... >>> metrics.weightedFalsePositiveRate 0.19... >>> metrics.weightedPrecision @@ -233,6 +237,8 @@ class MulticlassMetrics(JavaModelWrapper): Returns precision or precision for a given label (category) if specified. """ if label is None: +# note:: Deprecated in 2.0.0. Use accuracy. +warnings.warn("Deprecated in 2.0.0. Use accuracy.") return self.call("precision") else: return self.call("precision", float(label)) @@ -243,6 +249,8 @@ class MulticlassMetrics(JavaModelWrapper): Returns recall or recall for a given label (category) if specified. """ if label is None: +# note:: Deprecated in 2.0.0. Use accuracy. +warnings.warn("Deprecated in 2.0.0. Use accuracy.") return self.call("recall") else: return self.call("recall", float(label)) @@ -254,6 +262,8 @@ class MulticlassMetrics(JavaModelWrapper): """ if beta is None: if label is None: +# note:: Deprecated in 2.0.0. Use accuracy. +warnings.warn("Deprecated in 2.0.0. Use accuracy.") return self.call("fMeasure") else: return self.call("fMeasure", label) @@ -263,6 +273,14 @@ class MulticlassMetrics(JavaModelWrapper): else: return self.call("fMeasure", label, beta) +@since('2.0.0') +def accuracy(self): +""" +Returns accuracy (equals to the total number of correctly classified instances +out of the total number of instances). +""" +return self.call("accuracy") + @property @since('1.4.0') def weightedTruePositiveRate(self): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-15771][ML][EXAMPLES] Use 'accuracy' rather than 'precision' in many ML examples
Repository: spark Updated Branches: refs/heads/branch-2.0 90e94b826 -> 86a35a229 [SPARK-15771][ML][EXAMPLES] Use 'accuracy' rather than 'precision' in many ML examples ## What changes were proposed in this pull request? Since [SPARK-15617](https://issues.apache.org/jira/browse/SPARK-15617) deprecated ```precision``` in ```MulticlassClassificationEvaluator```, many ML examples broken. ```python pyspark.sql.utils.IllegalArgumentException: u'MulticlassClassificationEvaluator_4c3bb1d73d8cc0cedae6 parameter metricName given invalid value precision.' ``` We should use ```accuracy``` to replace ```precision``` in these examples. ## How was this patch tested? Offline tests. Author: Yanbo LiangCloses #13519 from yanboliang/spark-15771. (cherry picked from commit a95252823e09939b654dd425db38dadc4100bc87) Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/86a35a22 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/86a35a22 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/86a35a22 Branch: refs/heads/branch-2.0 Commit: 86a35a22985b9e592744e6ef31453995f2322a31 Parents: 90e94b8 Author: Yanbo Liang Authored: Mon Jun 6 09:36:34 2016 +0100 Committer: Sean Owen Committed: Mon Jun 6 09:36:43 2016 +0100 -- .../examples/ml/JavaDecisionTreeClassificationExample.java | 2 +- .../examples/ml/JavaGradientBoostedTreeClassifierExample.java | 2 +- .../examples/ml/JavaMultilayerPerceptronClassifierExample.java | 6 +++--- .../org/apache/spark/examples/ml/JavaNaiveBayesExample.java| 6 +++--- .../org/apache/spark/examples/ml/JavaOneVsRestExample.java | 6 +++--- .../spark/examples/ml/JavaRandomForestClassifierExample.java | 2 +- .../src/main/python/ml/decision_tree_classification_example.py | 2 +- .../main/python/ml/gradient_boosted_tree_classifier_example.py | 2 +- .../src/main/python/ml/multilayer_perceptron_classification.py | 6 +++--- examples/src/main/python/ml/naive_bayes_example.py | 6 +++--- examples/src/main/python/ml/one_vs_rest_example.py | 6 +++--- .../src/main/python/ml/random_forest_classifier_example.py | 2 +- .../spark/examples/ml/DecisionTreeClassificationExample.scala | 2 +- .../examples/ml/GradientBoostedTreeClassifierExample.scala | 2 +- .../examples/ml/MultilayerPerceptronClassifierExample.scala| 6 +++--- .../scala/org/apache/spark/examples/ml/NaiveBayesExample.scala | 6 +++--- .../scala/org/apache/spark/examples/ml/OneVsRestExample.scala | 6 +++--- .../spark/examples/ml/RandomForestClassifierExample.scala | 2 +- python/pyspark/ml/evaluation.py| 2 +- 19 files changed, 37 insertions(+), 37 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/86a35a22/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java -- diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java index bdb76f0..a9c6e7f 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java @@ -90,7 +90,7 @@ public class JavaDecisionTreeClassificationExample { MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator() .setLabelCol("indexedLabel") .setPredictionCol("prediction") - .setMetricName("precision"); + .setMetricName("accuracy"); double accuracy = 
evaluator.evaluate(predictions); System.out.println("Test Error = " + (1.0 - accuracy)); http://git-wip-us.apache.org/repos/asf/spark/blob/86a35a22/examples/src/main/java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java -- diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java index 5c2e03e..3e9eb99 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java @@ -92,7 +92,7 @@ public class JavaGradientBoostedTreeClassifierExample { MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator() .setLabelCol("indexedLabel")
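Consolidated, the corrected evaluator setup from these examples looks like the following Scala sketch; `predictions` is assumed to be a DataFrame with `indexedLabel` and `prediction` columns, as in the examples being patched:

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// "precision" as an overall metric was deprecated by SPARK-15617; use "accuracy".
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${1.0 - accuracy}")
```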
spark git commit: [MINOR] Fix Typos 'an -> a'
Repository: spark Updated Branches: refs/heads/branch-2.0 7d10e4bdd -> 90e94b826 [MINOR] Fix Typos 'an -> a' ## What changes were proposed in this pull request? `an -> a` Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one. ## How was this patch tested? manual tests Author: Zheng RuiFengCloses #13515 from zhengruifeng/an_a. (cherry picked from commit fd8af397132fa1415a4c19d7f5cb5a41aa6ddb27) Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/90e94b82 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/90e94b82 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/90e94b82 Branch: refs/heads/branch-2.0 Commit: 90e94b82649d9816cd4065549678b82751238552 Parents: 7d10e4b Author: Zheng RuiFeng Authored: Mon Jun 6 09:35:47 2016 +0100 Committer: Sean Owen Committed: Mon Jun 6 09:35:57 2016 +0100 -- R/pkg/R/utils.R | 2 +- .../src/main/scala/org/apache/spark/Accumulable.scala | 2 +- .../org/apache/spark/api/java/JavaSparkContext.scala | 2 +- .../scala/org/apache/spark/api/python/PythonRDD.scala | 2 +- .../scala/org/apache/spark/deploy/SparkSubmit.scala | 6 +++--- .../src/main/scala/org/apache/spark/rdd/JdbcRDD.scala | 6 +++--- .../main/scala/org/apache/spark/scheduler/Pool.scala | 2 +- .../org/apache/spark/broadcast/BroadcastSuite.scala | 2 +- .../spark/deploy/rest/StandaloneRestSubmitSuite.scala | 2 +- .../test/scala/org/apache/spark/rpc/RpcEnvSuite.scala | 2 +- .../apache/spark/scheduler/DAGSchedulerSuite.scala| 4 ++-- .../org/apache/spark/util/JsonProtocolSuite.scala | 2 +- .../spark/streaming/flume/FlumeBatchFetcher.scala | 2 +- .../spark/graphx/impl/VertexPartitionBaseOps.scala| 2 +- .../scala/org/apache/spark/ml/linalg/Vectors.scala| 2 +- .../src/main/scala/org/apache/spark/ml/Pipeline.scala | 2 +- .../spark/ml/classification/LogisticRegression.scala | 4 ++-- .../org/apache/spark/ml/tree/impl/RandomForest.scala | 2 +- .../mllib/classification/LogisticRegression.scala | 2 +- .../org/apache/spark/mllib/classification/SVM.scala | 2 +- .../spark/mllib/feature/VectorTransformer.scala | 2 +- .../scala/org/apache/spark/mllib/linalg/Vectors.scala | 2 +- .../mllib/linalg/distributed/CoordinateMatrix.scala | 2 +- .../apache/spark/mllib/rdd/MLPairRDDFunctions.scala | 2 +- python/pyspark/ml/classification.py | 4 ++-- python/pyspark/ml/pipeline.py | 2 +- python/pyspark/mllib/classification.py| 2 +- python/pyspark/mllib/common.py| 2 +- python/pyspark/rdd.py | 4 ++-- python/pyspark/sql/session.py | 2 +- python/pyspark/sql/streaming.py | 2 +- python/pyspark/sql/types.py | 2 +- python/pyspark/streaming/dstream.py | 4 ++-- .../src/main/scala/org/apache/spark/sql/Row.scala | 2 +- .../apache/spark/sql/catalyst/analysis/Analyzer.scala | 4 ++-- .../sql/catalyst/analysis/FunctionRegistry.scala | 2 +- .../sql/catalyst/analysis/MultiInstanceRelation.scala | 2 +- .../spark/sql/catalyst/catalog/SessionCatalog.scala | 6 +++--- .../sql/catalyst/catalog/functionResources.scala | 2 +- .../sql/catalyst/expressions/ExpectsInputTypes.scala | 2 +- .../spark/sql/catalyst/expressions/Projection.scala | 4 ++-- .../sql/catalyst/expressions/complexTypeCreator.scala | 2 +- .../org/apache/spark/sql/types/AbstractDataType.scala | 2 +- .../scala/org/apache/spark/sql/DataFrameReader.scala | 2 +- .../main/scala/org/apache/spark/sql/SQLContext.scala | 14 +++--- .../scala/org/apache/spark/sql/SQLImplicits.scala | 2 +- 
.../scala/org/apache/spark/sql/SparkSession.scala | 14 +++--- .../org/apache/spark/sql/catalyst/SQLBuilder.scala| 2 +- .../aggregate/SortBasedAggregationIterator.scala | 2 +- .../apache/spark/sql/execution/aggregate/udaf.scala | 2 +- .../execution/columnar/GenerateColumnAccessor.scala | 2 +- .../execution/datasources/FileSourceStrategy.scala| 2 +- .../execution/datasources/json/JacksonParser.scala| 2 +- .../datasources/parquet/CatalystRowConverter.scala| 2 +- .../sql/execution/exchange/ExchangeCoordinator.scala | 10 +- .../spark/sql/execution/joins/SortMergeJoinExec.scala | 2 +- .../spark/sql/execution/r/MapPartitionsRWrapper.scala | 2 +- .../scala/org/apache/spark/sql/expressions/udaf.scala | 2 +- .../org/apache/spark/sql/internal/SharedState.scala | 2 +- .../apache/spark/sql/streaming/ContinuousQuery.scala |
spark git commit: [SPARK-15771][ML][EXAMPLES] Use 'accuracy' rather than 'precision' in many ML examples
Repository: spark Updated Branches: refs/heads/master fd8af3971 -> a95252823 [SPARK-15771][ML][EXAMPLES] Use 'accuracy' rather than 'precision' in many ML examples ## What changes were proposed in this pull request? Since [SPARK-15617](https://issues.apache.org/jira/browse/SPARK-15617) deprecated ```precision``` in ```MulticlassClassificationEvaluator```, many ML examples broken. ```python pyspark.sql.utils.IllegalArgumentException: u'MulticlassClassificationEvaluator_4c3bb1d73d8cc0cedae6 parameter metricName given invalid value precision.' ``` We should use ```accuracy``` to replace ```precision``` in these examples. ## How was this patch tested? Offline tests. Author: Yanbo LiangCloses #13519 from yanboliang/spark-15771. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a9525282 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a9525282 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a9525282 Branch: refs/heads/master Commit: a95252823e09939b654dd425db38dadc4100bc87 Parents: fd8af39 Author: Yanbo Liang Authored: Mon Jun 6 09:36:34 2016 +0100 Committer: Sean Owen Committed: Mon Jun 6 09:36:34 2016 +0100 -- .../examples/ml/JavaDecisionTreeClassificationExample.java | 2 +- .../examples/ml/JavaGradientBoostedTreeClassifierExample.java | 2 +- .../examples/ml/JavaMultilayerPerceptronClassifierExample.java | 6 +++--- .../org/apache/spark/examples/ml/JavaNaiveBayesExample.java| 6 +++--- .../org/apache/spark/examples/ml/JavaOneVsRestExample.java | 6 +++--- .../spark/examples/ml/JavaRandomForestClassifierExample.java | 2 +- .../src/main/python/ml/decision_tree_classification_example.py | 2 +- .../main/python/ml/gradient_boosted_tree_classifier_example.py | 2 +- .../src/main/python/ml/multilayer_perceptron_classification.py | 6 +++--- examples/src/main/python/ml/naive_bayes_example.py | 6 +++--- examples/src/main/python/ml/one_vs_rest_example.py | 6 +++--- .../src/main/python/ml/random_forest_classifier_example.py | 2 +- .../spark/examples/ml/DecisionTreeClassificationExample.scala | 2 +- .../examples/ml/GradientBoostedTreeClassifierExample.scala | 2 +- .../examples/ml/MultilayerPerceptronClassifierExample.scala| 6 +++--- .../scala/org/apache/spark/examples/ml/NaiveBayesExample.scala | 6 +++--- .../scala/org/apache/spark/examples/ml/OneVsRestExample.scala | 6 +++--- .../spark/examples/ml/RandomForestClassifierExample.scala | 2 +- python/pyspark/ml/evaluation.py| 2 +- 19 files changed, 37 insertions(+), 37 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a9525282/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java -- diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java index bdb76f0..a9c6e7f 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java @@ -90,7 +90,7 @@ public class JavaDecisionTreeClassificationExample { MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator() .setLabelCol("indexedLabel") .setPredictionCol("prediction") - .setMetricName("precision"); + .setMetricName("accuracy"); double accuracy = evaluator.evaluate(predictions); System.out.println("Test Error = " + (1.0 - accuracy)); 
http://git-wip-us.apache.org/repos/asf/spark/blob/a9525282/examples/src/main/java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java -- diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java index 5c2e03e..3e9eb99 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java @@ -92,7 +92,7 @@ public class JavaGradientBoostedTreeClassifierExample { MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator() .setLabelCol("indexedLabel") .setPredictionCol("prediction") - .setMetricName("precision"); + .setMetricName("accuracy"); double accuracy =
spark git commit: [MINOR] Fix Typos 'an -> a'
Repository: spark Updated Branches: refs/heads/master 32f2f95db -> fd8af3971 [MINOR] Fix Typos 'an -> a' ## What changes were proposed in this pull request? `an -> a` Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one. ## How was this patch tested? manual tests Author: Zheng RuiFengCloses #13515 from zhengruifeng/an_a. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fd8af397 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fd8af397 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fd8af397 Branch: refs/heads/master Commit: fd8af397132fa1415a4c19d7f5cb5a41aa6ddb27 Parents: 32f2f95 Author: Zheng RuiFeng Authored: Mon Jun 6 09:35:47 2016 +0100 Committer: Sean Owen Committed: Mon Jun 6 09:35:47 2016 +0100 -- R/pkg/R/utils.R | 2 +- .../src/main/scala/org/apache/spark/Accumulable.scala | 2 +- .../org/apache/spark/api/java/JavaSparkContext.scala | 2 +- .../scala/org/apache/spark/api/python/PythonRDD.scala | 2 +- .../scala/org/apache/spark/deploy/SparkSubmit.scala | 6 +++--- .../src/main/scala/org/apache/spark/rdd/JdbcRDD.scala | 6 +++--- .../main/scala/org/apache/spark/scheduler/Pool.scala | 2 +- .../org/apache/spark/broadcast/BroadcastSuite.scala | 2 +- .../spark/deploy/rest/StandaloneRestSubmitSuite.scala | 2 +- .../test/scala/org/apache/spark/rpc/RpcEnvSuite.scala | 2 +- .../apache/spark/scheduler/DAGSchedulerSuite.scala| 4 ++-- .../org/apache/spark/util/JsonProtocolSuite.scala | 2 +- .../spark/streaming/flume/FlumeBatchFetcher.scala | 2 +- .../spark/graphx/impl/VertexPartitionBaseOps.scala| 2 +- .../scala/org/apache/spark/ml/linalg/Vectors.scala| 2 +- .../src/main/scala/org/apache/spark/ml/Pipeline.scala | 2 +- .../spark/ml/classification/LogisticRegression.scala | 4 ++-- .../org/apache/spark/ml/tree/impl/RandomForest.scala | 2 +- .../mllib/classification/LogisticRegression.scala | 2 +- .../org/apache/spark/mllib/classification/SVM.scala | 2 +- .../spark/mllib/feature/VectorTransformer.scala | 2 +- .../scala/org/apache/spark/mllib/linalg/Vectors.scala | 2 +- .../mllib/linalg/distributed/CoordinateMatrix.scala | 2 +- .../apache/spark/mllib/rdd/MLPairRDDFunctions.scala | 2 +- python/pyspark/ml/classification.py | 4 ++-- python/pyspark/ml/pipeline.py | 2 +- python/pyspark/mllib/classification.py| 2 +- python/pyspark/mllib/common.py| 2 +- python/pyspark/rdd.py | 4 ++-- python/pyspark/sql/session.py | 2 +- python/pyspark/sql/streaming.py | 2 +- python/pyspark/sql/types.py | 2 +- python/pyspark/streaming/dstream.py | 4 ++-- .../src/main/scala/org/apache/spark/sql/Row.scala | 2 +- .../apache/spark/sql/catalyst/analysis/Analyzer.scala | 4 ++-- .../sql/catalyst/analysis/FunctionRegistry.scala | 2 +- .../sql/catalyst/analysis/MultiInstanceRelation.scala | 2 +- .../spark/sql/catalyst/catalog/SessionCatalog.scala | 6 +++--- .../sql/catalyst/catalog/functionResources.scala | 2 +- .../sql/catalyst/expressions/ExpectsInputTypes.scala | 2 +- .../spark/sql/catalyst/expressions/Projection.scala | 4 ++-- .../sql/catalyst/expressions/complexTypeCreator.scala | 2 +- .../org/apache/spark/sql/types/AbstractDataType.scala | 2 +- .../scala/org/apache/spark/sql/DataFrameReader.scala | 2 +- .../main/scala/org/apache/spark/sql/SQLContext.scala | 14 +++--- .../scala/org/apache/spark/sql/SQLImplicits.scala | 2 +- .../scala/org/apache/spark/sql/SparkSession.scala | 14 +++--- .../org/apache/spark/sql/catalyst/SQLBuilder.scala| 2 +- 
.../aggregate/SortBasedAggregationIterator.scala | 2 +- .../apache/spark/sql/execution/aggregate/udaf.scala | 2 +- .../execution/columnar/GenerateColumnAccessor.scala | 2 +- .../execution/datasources/FileSourceStrategy.scala| 2 +- .../execution/datasources/json/JacksonParser.scala| 2 +- .../datasources/parquet/CatalystRowConverter.scala| 2 +- .../sql/execution/exchange/ExchangeCoordinator.scala | 10 +- .../spark/sql/execution/joins/SortMergeJoinExec.scala | 2 +- .../spark/sql/execution/r/MapPartitionsRWrapper.scala | 2 +- .../scala/org/apache/spark/sql/expressions/udaf.scala | 2 +- .../org/apache/spark/sql/internal/SharedState.scala | 2 +- .../apache/spark/sql/streaming/ContinuousQuery.scala | 2 +- .../org/apache/spark/sql/hive/client/HiveClient.scala | 2 +- .../apache/spark/sql/hive/orc/OrcFileOperator.scala |
spark git commit: Revert "[SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaviour"
Repository: spark Updated Branches: refs/heads/branch-2.0 9e7e2f916 -> 7d10e4bdd Revert "[SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaivour" This reverts commit 9e7e2f9164e0b3bd555e795b871626057b4fed31. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7d10e4bd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7d10e4bd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7d10e4bd Branch: refs/heads/branch-2.0 Commit: 7d10e4bdd2adbeb10904665536e4949381f19cf5 Parents: 9e7e2f9 Author: Reynold XinAuthored: Sun Jun 5 23:40:35 2016 -0700 Committer: Reynold Xin Committed: Sun Jun 5 23:40:35 2016 -0700 -- python/pyspark/sql/readwriter.py| 81 ++-- .../execution/datasources/csv/CSVOptions.scala | 11 +-- .../execution/datasources/csv/CSVSuite.scala| 11 --- 3 files changed, 48 insertions(+), 55 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7d10e4bd/python/pyspark/sql/readwriter.py -- diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index 19aa8dd..9208a52 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -303,11 +303,10 @@ class DataFrameReader(object): return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(path))) @since(2.0) -def csv(self, path, schema=None, sep=u',', encoding=u'UTF-8', quote=u'\"', escape=u'\\', -comment=None, header='false', ignoreLeadingWhiteSpace='false', -ignoreTrailingWhiteSpace='false', nullValue='', nanValue='NaN', positiveInf='Inf', -negativeInf='Inf', dateFormat=None, maxColumns='20480', maxCharsPerColumn='100', -mode='PERMISSIVE'): +def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=None, +comment=None, header=None, ignoreLeadingWhiteSpace=None, ignoreTrailingWhiteSpace=None, +nullValue=None, nanValue=None, positiveInf=None, negativeInf=None, dateFormat=None, +maxColumns=None, maxCharsPerColumn=None, mode=None): """Loads a CSV file and returns the result as a [[DataFrame]]. This function goes through the input once to determine the input schema. To avoid going @@ -316,41 +315,44 @@ class DataFrameReader(object): :param path: string, or list of strings, for input path(s). :param schema: an optional :class:`StructType` for the input schema. :param sep: sets the single character as a separator for each field and value. -The default value is ``,``. -:param encoding: decodes the CSV files by the given encoding type. -The default value is ``UTF-8``. +If None is set, it uses the default value, ``,``. +:param encoding: decodes the CSV files by the given encoding type. If None is set, + it uses the default value, ``UTF-8``. :param quote: sets the single character used for escaping quoted values where the - separator can be part of the value. The default value is ``"``. + separator can be part of the value. If None is set, it uses the default + value, ``"``. :param escape: sets the single character used for escaping quotes inside an already - quoted value. The default value is ``\``. + quoted value. If None is set, it uses the default value, ``\``. :param comment: sets the single character used for skipping lines beginning with this character. By default (None), it is disabled. -:param header: uses the first line as names of columns. The default value is ``false``. +:param header: uses the first line as names of columns. If None is set, it uses the + default value, ``false``. 
:param ignoreLeadingWhiteSpace: defines whether or not leading whitespaces from values -being read should be skipped. The default value is -``false``. +being read should be skipped. If None is set, it uses +the default value, ``false``. :param ignoreTrailingWhiteSpace: defines whether or not trailing whitespaces from values - being read should be skipped. The default value is - ``false``. -:param nullValue: sets the string representation of a null value. The default value
spark git commit: Revert "[SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaviour"
Repository: spark Updated Branches: refs/heads/master b7e8d1cb3 -> 32f2f95db Revert "[SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaivour" This reverts commit b7e8d1cb3ce932ba4a784be59744af8a8ef027ce. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/32f2f95d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/32f2f95d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/32f2f95d Branch: refs/heads/master Commit: 32f2f95dbdfb21491e46d4b608fd4e8ac7ab8973 Parents: b7e8d1c Author: Reynold XinAuthored: Sun Jun 5 23:40:13 2016 -0700 Committer: Reynold Xin Committed: Sun Jun 5 23:40:13 2016 -0700 -- python/pyspark/sql/readwriter.py| 81 ++-- .../execution/datasources/csv/CSVOptions.scala | 11 +-- .../execution/datasources/csv/CSVSuite.scala| 11 --- 3 files changed, 48 insertions(+), 55 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/32f2f95d/python/pyspark/sql/readwriter.py -- diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index 19aa8dd..9208a52 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -303,11 +303,10 @@ class DataFrameReader(object): return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(path))) @since(2.0) -def csv(self, path, schema=None, sep=u',', encoding=u'UTF-8', quote=u'\"', escape=u'\\', -comment=None, header='false', ignoreLeadingWhiteSpace='false', -ignoreTrailingWhiteSpace='false', nullValue='', nanValue='NaN', positiveInf='Inf', -negativeInf='Inf', dateFormat=None, maxColumns='20480', maxCharsPerColumn='100', -mode='PERMISSIVE'): +def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=None, +comment=None, header=None, ignoreLeadingWhiteSpace=None, ignoreTrailingWhiteSpace=None, +nullValue=None, nanValue=None, positiveInf=None, negativeInf=None, dateFormat=None, +maxColumns=None, maxCharsPerColumn=None, mode=None): """Loads a CSV file and returns the result as a [[DataFrame]]. This function goes through the input once to determine the input schema. To avoid going @@ -316,41 +315,44 @@ class DataFrameReader(object): :param path: string, or list of strings, for input path(s). :param schema: an optional :class:`StructType` for the input schema. :param sep: sets the single character as a separator for each field and value. -The default value is ``,``. -:param encoding: decodes the CSV files by the given encoding type. -The default value is ``UTF-8``. +If None is set, it uses the default value, ``,``. +:param encoding: decodes the CSV files by the given encoding type. If None is set, + it uses the default value, ``UTF-8``. :param quote: sets the single character used for escaping quoted values where the - separator can be part of the value. The default value is ``"``. + separator can be part of the value. If None is set, it uses the default + value, ``"``. :param escape: sets the single character used for escaping quotes inside an already - quoted value. The default value is ``\``. + quoted value. If None is set, it uses the default value, ``\``. :param comment: sets the single character used for skipping lines beginning with this character. By default (None), it is disabled. -:param header: uses the first line as names of columns. The default value is ``false``. +:param header: uses the first line as names of columns. If None is set, it uses the + default value, ``false``. 
:param ignoreLeadingWhiteSpace: defines whether or not leading whitespaces from values -being read should be skipped. The default value is -``false``. +being read should be skipped. If None is set, it uses +the default value, ``false``. :param ignoreTrailingWhiteSpace: defines whether or not trailing whitespaces from values - being read should be skipped. The default value is - ``false``. -:param nullValue: sets the string representation of a null value. The default value is a -
spark git commit: [SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaviour
Repository: spark Updated Branches: refs/heads/master 79268aa46 -> b7e8d1cb3 [SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaivour ## What changes were proposed in this pull request? This pr fixes the behaviour of `format("csv").option("quote", null)` along with one of spark-csv. Also, it explicitly sets default values for CSV options in python. ## How was this patch tested? Added tests in CSVSuite. Author: Takeshi YAMAMUROCloses #13372 from maropu/SPARK-15585. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b7e8d1cb Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b7e8d1cb Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b7e8d1cb Branch: refs/heads/master Commit: b7e8d1cb3ce932ba4a784be59744af8a8ef027ce Parents: 79268aa Author: Takeshi YAMAMURO Authored: Sun Jun 5 23:35:04 2016 -0700 Committer: Reynold Xin Committed: Sun Jun 5 23:35:04 2016 -0700 -- python/pyspark/sql/readwriter.py| 81 ++-- .../execution/datasources/csv/CSVOptions.scala | 11 ++- .../execution/datasources/csv/CSVSuite.scala| 11 +++ 3 files changed, 55 insertions(+), 48 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b7e8d1cb/python/pyspark/sql/readwriter.py -- diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index 9208a52..19aa8dd 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -303,10 +303,11 @@ class DataFrameReader(object): return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(path))) @since(2.0) -def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=None, -comment=None, header=None, ignoreLeadingWhiteSpace=None, ignoreTrailingWhiteSpace=None, -nullValue=None, nanValue=None, positiveInf=None, negativeInf=None, dateFormat=None, -maxColumns=None, maxCharsPerColumn=None, mode=None): +def csv(self, path, schema=None, sep=u',', encoding=u'UTF-8', quote=u'\"', escape=u'\\', +comment=None, header='false', ignoreLeadingWhiteSpace='false', +ignoreTrailingWhiteSpace='false', nullValue='', nanValue='NaN', positiveInf='Inf', +negativeInf='Inf', dateFormat=None, maxColumns='20480', maxCharsPerColumn='100', +mode='PERMISSIVE'): """Loads a CSV file and returns the result as a [[DataFrame]]. This function goes through the input once to determine the input schema. To avoid going @@ -315,44 +316,41 @@ class DataFrameReader(object): :param path: string, or list of strings, for input path(s). :param schema: an optional :class:`StructType` for the input schema. :param sep: sets the single character as a separator for each field and value. -If None is set, it uses the default value, ``,``. -:param encoding: decodes the CSV files by the given encoding type. If None is set, - it uses the default value, ``UTF-8``. +The default value is ``,``. +:param encoding: decodes the CSV files by the given encoding type. +The default value is ``UTF-8``. :param quote: sets the single character used for escaping quoted values where the - separator can be part of the value. If None is set, it uses the default - value, ``"``. + separator can be part of the value. The default value is ``"``. :param escape: sets the single character used for escaping quotes inside an already - quoted value. If None is set, it uses the default value, ``\``. + quoted value. The default value is ``\``. :param comment: sets the single character used for skipping lines beginning with this character. By default (None), it is disabled. 
-:param header: uses the first line as names of columns. If None is set, it uses the - default value, ``false``. +:param header: uses the first line as names of columns. The default value is ``false``. :param ignoreLeadingWhiteSpace: defines whether or not leading whitespaces from values -being read should be skipped. If None is set, it uses -the default value, ``false``. +being read should be skipped. The default value is +``false``. :param ignoreTrailingWhiteSpace:
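For reference, a hedged Scala sketch of the call this patch targets, assuming a `SparkSession` named `spark`; the path is hypothetical, and the only claim made here is the one from the description above, namely that passing a null quote is meant to behave consistently with spark-csv:

```scala
// Illustrative only: exercising the quote option this patch aligns with spark-csv.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("quote", null)        // behaviour addressed by this patch
  .load("/path/to/data.csv")    // hypothetical path
```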
spark git commit: [SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaviour
Repository: spark Updated Branches: refs/heads/branch-2.0 790de600b -> 9e7e2f916 [SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaivour ## What changes were proposed in this pull request? This pr fixes the behaviour of `format("csv").option("quote", null)` along with one of spark-csv. Also, it explicitly sets default values for CSV options in python. ## How was this patch tested? Added tests in CSVSuite. Author: Takeshi YAMAMUROCloses #13372 from maropu/SPARK-15585. (cherry picked from commit b7e8d1cb3ce932ba4a784be59744af8a8ef027ce) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9e7e2f91 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9e7e2f91 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9e7e2f91 Branch: refs/heads/branch-2.0 Commit: 9e7e2f9164e0b3bd555e795b871626057b4fed31 Parents: 790de60 Author: Takeshi YAMAMURO Authored: Sun Jun 5 23:35:04 2016 -0700 Committer: Reynold Xin Committed: Sun Jun 5 23:35:10 2016 -0700 -- python/pyspark/sql/readwriter.py| 81 ++-- .../execution/datasources/csv/CSVOptions.scala | 11 ++- .../execution/datasources/csv/CSVSuite.scala| 11 +++ 3 files changed, 55 insertions(+), 48 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9e7e2f91/python/pyspark/sql/readwriter.py -- diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index 9208a52..19aa8dd 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -303,10 +303,11 @@ class DataFrameReader(object): return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(path))) @since(2.0) -def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=None, -comment=None, header=None, ignoreLeadingWhiteSpace=None, ignoreTrailingWhiteSpace=None, -nullValue=None, nanValue=None, positiveInf=None, negativeInf=None, dateFormat=None, -maxColumns=None, maxCharsPerColumn=None, mode=None): +def csv(self, path, schema=None, sep=u',', encoding=u'UTF-8', quote=u'\"', escape=u'\\', +comment=None, header='false', ignoreLeadingWhiteSpace='false', +ignoreTrailingWhiteSpace='false', nullValue='', nanValue='NaN', positiveInf='Inf', +negativeInf='Inf', dateFormat=None, maxColumns='20480', maxCharsPerColumn='100', +mode='PERMISSIVE'): """Loads a CSV file and returns the result as a [[DataFrame]]. This function goes through the input once to determine the input schema. To avoid going @@ -315,44 +316,41 @@ class DataFrameReader(object): :param path: string, or list of strings, for input path(s). :param schema: an optional :class:`StructType` for the input schema. :param sep: sets the single character as a separator for each field and value. -If None is set, it uses the default value, ``,``. -:param encoding: decodes the CSV files by the given encoding type. If None is set, - it uses the default value, ``UTF-8``. +The default value is ``,``. +:param encoding: decodes the CSV files by the given encoding type. +The default value is ``UTF-8``. :param quote: sets the single character used for escaping quoted values where the - separator can be part of the value. If None is set, it uses the default - value, ``"``. + separator can be part of the value. The default value is ``"``. :param escape: sets the single character used for escaping quotes inside an already - quoted value. If None is set, it uses the default value, ``\``. + quoted value. The default value is ``\``. 
:param comment: sets the single character used for skipping lines beginning with this character. By default (None), it is disabled. -:param header: uses the first line as names of columns. If None is set, it uses the - default value, ``false``. +:param header: uses the first line as names of columns. The default value is ``false``. :param ignoreLeadingWhiteSpace: defines whether or not leading whitespaces from values -being read should be skipped. If None is set, it uses -the default value, ``false``. +being read should be