spark git commit: [SPARK-15792][SQL] Allows operator to change the verbosity in explain output

2016-06-06 Thread lian
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 a5bec5b81 -> 57dd4efcd


[SPARK-15792][SQL] Allows operator to change the verbosity in explain output

## What changes were proposed in this pull request?

This PR allows operators to customize the verbosity of explain output. After this change, 
`dataframe.explain()` and `dataframe.explain(true)` produce physical-plan output at 
different verbosity levels.

Currently, this PR only enables the verbose string for the operators 
`HashAggregateExec` and `SortAggregateExec`. We will gradually enable it 
for more operators in the future.

**Less verbose mode:** dataframe.explain(extended = false)

`output=[count(a)#85L]` is **NOT** displayed for HashAggregate.

```
scala> Seq((1,2,3)).toDF("a", "b", "c").createTempView("df2")
scala> spark.sql("select count(a) from df2").explain()
== Physical Plan ==
*HashAggregate(key=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *HashAggregate(key=[], functions=[partial_count(1)])
      +- LocalTableScan
```

**Verbose mode:** dataframe.explain(extended = true)

`output=[count(a)#85L]` is displayed for HashAggregate.

```
scala> spark.sql("select count(a) from df2").explain(true)  // 
"output=[count(a)#85L]" is added
...
== Physical Plan ==
*HashAggregate(key=[], functions=[count(1)], output=[count(a)#85L])
+- Exchange SinglePartition
   +- *HashAggregate(key=[], functions=[partial_count(1)], output=[count#87L])
      +- LocalTableScan
```
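
For context, a minimal sketch of how a physical operator can opt into the verbose
output by overriding `verboseString` while keeping `simpleString` terse. The
operator name below is made up; the real patch only overrides `verboseString` in
`HashAggregateExec` and `SortAggregateExec`.

```
// Sketch only: MyOperatorExec is a hypothetical operator used for illustration.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.execution.{SparkPlan, UnaryExecNode}

case class MyOperatorExec(child: SparkPlan) extends UnaryExecNode {
  override def output: Seq[Attribute] = child.output

  override protected def doExecute(): RDD[InternalRow] = child.execute()

  // Rendered by dataframe.explain()
  override def simpleString: String = "MyOperator"

  // Rendered only by dataframe.explain(true) after this change
  override def verboseString: String =
    s"MyOperator(output=[${output.mkString(", ")}])"
}
```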

## How was this patch tested?

Manual test.

Author: Sean Zhong 

Closes #13535 from clockfly/verbose_breakdown_2.

(cherry picked from commit 5f731d6859c4516941e5f90c99c966ef76268864)
Signed-off-by: Cheng Lian 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/57dd4efc
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/57dd4efc
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/57dd4efc

Branch: refs/heads/branch-2.0
Commit: 57dd4efcda9158646df41ea8d70754dc110ecd6f
Parents: a5bec5b
Author: Sean Zhong 
Authored: Mon Jun 6 22:59:25 2016 -0700
Committer: Cheng Lian 
Committed: Mon Jun 6 22:59:34 2016 -0700

--
 .../sql/catalyst/expressions/Expression.scala   |  4 
 .../spark/sql/catalyst/plans/QueryPlan.scala|  2 ++
 .../spark/sql/catalyst/trees/TreeNode.scala | 23 +++-
 .../spark/sql/execution/QueryExecution.scala| 14 +++-
 .../sql/execution/WholeStageCodegenExec.scala   |  6 +++--
 .../execution/aggregate/HashAggregateExec.scala | 12 --
 .../execution/aggregate/SortAggregateExec.scala | 12 --
 7 files changed, 55 insertions(+), 18 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/57dd4efc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
index 2ec4621..efe592d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
@@ -190,6 +190,10 @@ abstract class Expression extends TreeNode[Expression] {
 case single => single :: Nil
   }
 
+  // Marks this as final, Expression.verboseString should never be called, and 
thus shouldn't be
+  // overridden by concrete classes.
+  final override def verboseString: String = simpleString
+
   override def simpleString: String = toString
 
   override def toString: String = prettyName + flatArguments.mkString("(", ", 
", ")")

http://git-wip-us.apache.org/repos/asf/spark/blob/57dd4efc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
index 19a66cf..cf34f4b 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
@@ -257,6 +257,8 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] 
extends TreeNode[PlanT
 
   override def simpleString: String = statePrefix + super.simpleString
 
+  override def verboseString: String = simpleString
+
   /**
* All the subqueries of current plan.
*/

http://git-wip-us.apache.org/repos/asf/spark/blob/57dd4efc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala

spark git commit: [SPARK-15792][SQL] Allows operator to change the verbosity in explain output

2016-06-06 Thread lian
Repository: spark
Updated Branches:
  refs/heads/master 0e0904a2f -> 5f731d685


[SPARK-15792][SQL] Allows operator to change the verbosity in explain output

## What changes were proposed in this pull request?

This PR allows operators to customize the verbosity of explain output. After this change, 
`dataframe.explain()` and `dataframe.explain(true)` produce physical-plan output at 
different verbosity levels.

Currently, this PR only enables the verbose string for the operators 
`HashAggregateExec` and `SortAggregateExec`. We will gradually enable it 
for more operators in the future.

**Less verbose mode:** dataframe.explain(extended = false)

`output=[count(a)#85L]` is **NOT** displayed for HashAggregate.

```
scala> Seq((1,2,3)).toDF("a", "b", "c").createTempView("df2")
scala> spark.sql("select count(a) from df2").explain()
== Physical Plan ==
*HashAggregate(key=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *HashAggregate(key=[], functions=[partial_count(1)])
      +- LocalTableScan
```

**Verbose mode:** dataframe.explain(extended = true)

`output=[count(a)#85L]` is displayed for HashAggregate.

```
scala> spark.sql("select count(a) from df2").explain(true)  // 
"output=[count(a)#85L]" is added
...
== Physical Plan ==
*HashAggregate(key=[], functions=[count(1)], output=[count(a)#85L])
+- Exchange SinglePartition
   +- *HashAggregate(key=[], functions=[partial_count(1)], output=[count#87L])
      +- LocalTableScan
```

## How was this patch tested?

Manual test.

Author: Sean Zhong 

Closes #13535 from clockfly/verbose_breakdown_2.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5f731d68
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5f731d68
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5f731d68

Branch: refs/heads/master
Commit: 5f731d6859c4516941e5f90c99c966ef76268864
Parents: 0e0904a
Author: Sean Zhong 
Authored: Mon Jun 6 22:59:25 2016 -0700
Committer: Cheng Lian 
Committed: Mon Jun 6 22:59:25 2016 -0700

--
 .../sql/catalyst/expressions/Expression.scala   |  4 
 .../spark/sql/catalyst/plans/QueryPlan.scala|  2 ++
 .../spark/sql/catalyst/trees/TreeNode.scala | 23 +++-
 .../spark/sql/execution/QueryExecution.scala| 14 +++-
 .../sql/execution/WholeStageCodegenExec.scala   |  6 +++--
 .../execution/aggregate/HashAggregateExec.scala | 12 --
 .../execution/aggregate/SortAggregateExec.scala | 12 --
 7 files changed, 55 insertions(+), 18 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5f731d68/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
index 2ec4621..efe592d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
@@ -190,6 +190,10 @@ abstract class Expression extends TreeNode[Expression] {
 case single => single :: Nil
   }
 
+  // Marks this as final, Expression.verboseString should never be called, and 
thus shouldn't be
+  // overridden by concrete classes.
+  final override def verboseString: String = simpleString
+
   override def simpleString: String = toString
 
   override def toString: String = prettyName + flatArguments.mkString("(", ", 
", ")")

http://git-wip-us.apache.org/repos/asf/spark/blob/5f731d68/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
index 19a66cf..cf34f4b 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
@@ -257,6 +257,8 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] 
extends TreeNode[PlanT
 
   override def simpleString: String = statePrefix + super.simpleString
 
+  override def verboseString: String = simpleString
+
   /**
* All the subqueries of current plan.
*/

http://git-wip-us.apache.org/repos/asf/spark/blob/5f731d68/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
--
diff --git 

spark git commit: [SPARK-15632][SQL] Typed Filter should NOT change the Dataset schema

2016-06-06 Thread lian
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 62765cbeb -> a5bec5b81


[SPARK-15632][SQL] Typed Filter should NOT change the Dataset schema

## What changes were proposed in this pull request?

This PR makes sure the typed Filter doesn't change the Dataset schema.

**Before the change:**

```
scala> val df = spark.range(0,9)
scala> df.schema
res12: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))
scala> val afterFilter = df.filter(_=>true)
scala> afterFilter.schema   // !!! schema is CHANGED!!! Column name is changed from id to value, nullable is changed from false to true.
res13: org.apache.spark.sql.types.StructType = StructType(StructField(value,LongType,true))

```

Previously, `SerializeFromObject` and `DeserializeToObject` nodes were inserted to 
wrap the Filter, and these two operators can change the schema of the Dataset.

**After the change:**

```
scala> afterFilter.schema   // schema is NOT changed.
res47: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))
```
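
As a standalone illustration (an assumed check, not the exact test this patch adds
to `DatasetSuite`), the fix can be verified like this:

```
import org.apache.spark.sql.SparkSession

object TypedFilterSchemaCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("typed-filter-schema").getOrCreate()
    val ds = spark.range(0, 9)
    val afterFilter = ds.filter(_ => true)   // typed filter: T => Boolean
    // Before this patch the schema became StructType(StructField(value,LongType,true));
    // after it, the filter keeps the original StructType(StructField(id,LongType,false)).
    assert(afterFilter.schema == ds.schema, s"schema changed: ${ds.schema} -> ${afterFilter.schema}")
    spark.stop()
  }
}
```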

## How was this patch tested?

Unit test.

Author: Sean Zhong 

Closes #13529 from clockfly/spark-15632.

(cherry picked from commit 0e0904a2fce3c4447c24f1752307b6d01ffbd0ad)
Signed-off-by: Cheng Lian 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a5bec5b8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a5bec5b8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a5bec5b8

Branch: refs/heads/branch-2.0
Commit: a5bec5b81d9e8ce17f1ce509731b030f0f3538e3
Parents: 62765cb
Author: Sean Zhong 
Authored: Mon Jun 6 22:40:21 2016 -0700
Committer: Cheng Lian 
Committed: Mon Jun 6 22:40:29 2016 -0700

--
 .../optimizer/TypedFilterOptimizationSuite.scala|  4 +++-
 .../main/scala/org/apache/spark/sql/Dataset.scala   | 16 
 .../test/org/apache/spark/sql/JavaDatasetSuite.java | 13 +
 .../scala/org/apache/spark/sql/DatasetSuite.scala   |  6 ++
 .../sql/execution/WholeStageCodegenSuite.scala  |  2 +-
 5 files changed, 31 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a5bec5b8/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala
--
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala
index 289c16a..63d87bf 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala
@@ -57,7 +57,9 @@ class TypedFilterOptimizationSuite extends PlanTest {
 comparePlans(optimized, expected)
   }
 
-  test("embed deserializer in filter condition if there is only one filter") {
+  // TODO: Remove this after we completely fix SPARK-15632 by adding 
optimization rules
+  // for typed filters.
+  ignore("embed deserializer in typed filter condition if there is only one 
filter") {
 val input = LocalRelation('_1.int, '_2.int)
 val f = (i: (Int, Int)) => i._1 > 0
 

http://git-wip-us.apache.org/repos/asf/spark/blob/a5bec5b8/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index 96c871d..6cbc27d 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -1944,11 +1944,11 @@ class Dataset[T] private[sql](
*/
   @Experimental
   def filter(func: T => Boolean): Dataset[T] = {
-val deserialized = CatalystSerde.deserialize[T](logicalPlan)
+val deserializer = UnresolvedDeserializer(encoderFor[T].deserializer)
 val function = Literal.create(func, ObjectType(classOf[T => Boolean]))
-val condition = Invoke(function, "apply", BooleanType, deserialized.output)
-val filter = Filter(condition, deserialized)
-withTypedPlan(CatalystSerde.serialize[T](filter))
+val condition = Invoke(function, "apply", BooleanType, deserializer :: Nil)
+val filter = Filter(condition, logicalPlan)
+withTypedPlan(filter)
   }
 
   /**
@@ -1961,11 +1961,11 @@ class Dataset[T] private[sql](
*/
   @Experimental
   def filter(func: FilterFunction[T]): Dataset[T] = {
-val deserialized = CatalystSerde.deserialize[T](logicalPlan)
+val deserializer = 

spark git commit: [SPARK-15632][SQL] Typed Filter should NOT change the Dataset schema

2016-06-06 Thread lian
Repository: spark
Updated Branches:
  refs/heads/master c409e23ab -> 0e0904a2f


[SPARK-15632][SQL] Typed Filter should NOT change the Dataset schema

## What changes were proposed in this pull request?

This PR makes sure the typed Filter doesn't change the Dataset schema.

**Before the change:**

```
scala> val df = spark.range(0,9)
scala> df.schema
res12: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))
scala> val afterFilter = df.filter(_=>true)
scala> afterFilter.schema   // !!! schema is CHANGED!!! Column name is changed from id to value, nullable is changed from false to true.
res13: org.apache.spark.sql.types.StructType = StructType(StructField(value,LongType,true))

```

Previously, `SerializeFromObject` and `DeserializeToObject` nodes were inserted to 
wrap the Filter, and these two operators can change the schema of the Dataset.

**After the change:**

```
scala> afterFilter.schema   // schema is NOT changed.
res47: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))
```

## How was this patch tested?

Unit test.

Author: Sean Zhong 

Closes #13529 from clockfly/spark-15632.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0e0904a2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0e0904a2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0e0904a2

Branch: refs/heads/master
Commit: 0e0904a2fce3c4447c24f1752307b6d01ffbd0ad
Parents: c409e23
Author: Sean Zhong 
Authored: Mon Jun 6 22:40:21 2016 -0700
Committer: Cheng Lian 
Committed: Mon Jun 6 22:40:21 2016 -0700

--
 .../optimizer/TypedFilterOptimizationSuite.scala|  4 +++-
 .../main/scala/org/apache/spark/sql/Dataset.scala   | 16 
 .../test/org/apache/spark/sql/JavaDatasetSuite.java | 13 +
 .../scala/org/apache/spark/sql/DatasetSuite.scala   |  6 ++
 .../sql/execution/WholeStageCodegenSuite.scala  |  2 +-
 5 files changed, 31 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/0e0904a2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala
--
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala
index 289c16a..63d87bf 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala
@@ -57,7 +57,9 @@ class TypedFilterOptimizationSuite extends PlanTest {
 comparePlans(optimized, expected)
   }
 
-  test("embed deserializer in filter condition if there is only one filter") {
+  // TODO: Remove this after we completely fix SPARK-15632 by adding 
optimization rules
+  // for typed filters.
+  ignore("embed deserializer in typed filter condition if there is only one 
filter") {
 val input = LocalRelation('_1.int, '_2.int)
 val f = (i: (Int, Int)) => i._1 > 0
 

http://git-wip-us.apache.org/repos/asf/spark/blob/0e0904a2/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index 96c871d..6cbc27d 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -1944,11 +1944,11 @@ class Dataset[T] private[sql](
*/
   @Experimental
   def filter(func: T => Boolean): Dataset[T] = {
-val deserialized = CatalystSerde.deserialize[T](logicalPlan)
+val deserializer = UnresolvedDeserializer(encoderFor[T].deserializer)
 val function = Literal.create(func, ObjectType(classOf[T => Boolean]))
-val condition = Invoke(function, "apply", BooleanType, deserialized.output)
-val filter = Filter(condition, deserialized)
-withTypedPlan(CatalystSerde.serialize[T](filter))
+val condition = Invoke(function, "apply", BooleanType, deserializer :: Nil)
+val filter = Filter(condition, logicalPlan)
+withTypedPlan(filter)
   }
 
   /**
@@ -1961,11 +1961,11 @@ class Dataset[T] private[sql](
*/
   @Experimental
   def filter(func: FilterFunction[T]): Dataset[T] = {
-val deserialized = CatalystSerde.deserialize[T](logicalPlan)
+val deserializer = UnresolvedDeserializer(encoderFor[T].deserializer)
 val function = Literal.create(func, ObjectType(classOf[FilterFunction[T]]))
-val condition = 

spark git commit: [SPARK-15652][LAUNCHER] Added a new State (LOST) for the listeners of SparkLauncher

2016-06-06 Thread vanzin
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 5363783af -> 62765cbeb


[SPARK-15652][LAUNCHER] Added a new State (LOST) for the listeners of 
SparkLauncher

## What changes were proposed in this pull request?
This situation can happen when the LauncherConnection gets an exception while 
reading from the socket and terminates silently, without notifying the 
client/listener, which is left thinking that the job is still in its previous state.
The fix sends a notification to the client that the job finished with an unknown 
status and lets the client handle it accordingly.
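
For illustration, a minimal sketch of a client reacting to the new state through the
launcher API; the jar path and main class below are placeholders, while
`SparkLauncher`, `SparkAppHandle`, and the listener callbacks are the existing API.

```
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object LostStateClient {
  def main(args: Array[String]): Unit = {
    val handle = new SparkLauncher()
      .setAppResource("/path/to/app.jar")   // placeholder
      .setMainClass("com.example.MyApp")    // placeholder
      .setMaster("local[*]")
      .startApplication(new SparkAppHandle.Listener {
        override def stateChanged(h: SparkAppHandle): Unit = {
          // LOST is a final state: the connection to the spark-submit JVM was
          // lost (e.g. it crashed) before a normal terminal state was reported.
          if (h.getState == SparkAppHandle.State.LOST) {
            System.err.println("Lost connection to the application; treating it as failed.")
          }
        }
        override def infoChanged(h: SparkAppHandle): Unit = ()
      })
    // Wait for any final state (FINISHED, FAILED, KILLED, or the new LOST).
    while (!handle.getState.isFinal) Thread.sleep(1000)
  }
}
```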

## How was this patch tested?
Added a unit test.

Author: Subroto Sanyal 

Closes #13497 from subrotosanyal/SPARK-15652-handle-spark-submit-jvm-crash.

(cherry picked from commit c409e23abd128dad33557025f1e824ef47e6222f)
Signed-off-by: Marcelo Vanzin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/62765cbe
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/62765cbe
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/62765cbe

Branch: refs/heads/branch-2.0
Commit: 62765cbebe0cb41b0c4fdc344828337ee15e1fd2
Parents: 5363783
Author: Subroto Sanyal 
Authored: Mon Jun 6 16:05:40 2016 -0700
Committer: Marcelo Vanzin 
Committed: Mon Jun 6 16:05:52 2016 -0700

--
 .../apache/spark/launcher/LauncherServer.java   |  4 +++
 .../apache/spark/launcher/SparkAppHandle.java   |  4 ++-
 .../spark/launcher/LauncherServerSuite.java | 31 
 3 files changed, 38 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/62765cbe/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java
--
diff --git 
a/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java 
b/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java
index e3413fd..28e9420 100644
--- a/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java
+++ b/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java
@@ -337,6 +337,10 @@ class LauncherServer implements Closeable {
   }
   super.close();
   if (handle != null) {
+   if (!handle.getState().isFinal()) {
+ LOG.log(Level.WARNING, "Lost connection to spark application.");
+ handle.setState(SparkAppHandle.State.LOST);
+   }
 handle.disconnect();
   }
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/62765cbe/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java
--
diff --git 
a/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java 
b/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java
index 625d026..0aa7bd1 100644
--- a/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java
+++ b/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java
@@ -46,7 +46,9 @@ public interface SparkAppHandle {
 /** The application finished with a failed status. */
 FAILED(true),
 /** The application was killed. */
-KILLED(true);
+KILLED(true),
+/** The Spark Submit JVM exited with a unknown status. */
+LOST(true);
 
 private final boolean isFinal;
 

http://git-wip-us.apache.org/repos/asf/spark/blob/62765cbe/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java
--
diff --git 
a/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java 
b/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java
index bfe1fcc..12f1a0c 100644
--- a/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java
+++ b/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java
@@ -152,6 +152,37 @@ public class LauncherServerSuite extends BaseSuite {
 }
   }
 
+  @Test
+  public void testSparkSubmitVmShutsDown() throws Exception {
+ChildProcAppHandle handle = LauncherServer.newAppHandle();
+TestClient client = null;
+final Semaphore semaphore = new Semaphore(0);
+try {
+  Socket s = new Socket(InetAddress.getLoopbackAddress(),
+LauncherServer.getServerInstance().getPort());
+  handle.addListener(new SparkAppHandle.Listener() {
+public void stateChanged(SparkAppHandle handle) {
+  semaphore.release();
+}
+public void infoChanged(SparkAppHandle handle) {
+  semaphore.release();
+}
+  });
+  client = new TestClient(s);
+  client.send(new Hello(handle.getSecret(), "1.4.0"));
+  assertTrue(semaphore.tryAcquire(30, TimeUnit.SECONDS));
+  // Make sure the server 

spark git commit: [SPARK-15652][LAUNCHER] Added a new State (LOST) for the listeners of SparkLauncher

2016-06-06 Thread vanzin
Repository: spark
Updated Branches:
  refs/heads/master 36d3dfa59 -> c409e23ab


[SPARK-15652][LAUNCHER] Added a new State (LOST) for the listeners of 
SparkLauncher

## What changes were proposed in this pull request?
This situation can happen when the LauncherConnection gets an exception while 
reading from the socket and terminates silently, without notifying the 
client/listener, which is left thinking that the job is still in its previous state.
The fix sends a notification to the client that the job finished with an unknown 
status and lets the client handle it accordingly.

## How was this patch tested?
Added a unit test.

Author: Subroto Sanyal 

Closes #13497 from subrotosanyal/SPARK-15652-handle-spark-submit-jvm-crash.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c409e23a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c409e23a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c409e23a

Branch: refs/heads/master
Commit: c409e23abd128dad33557025f1e824ef47e6222f
Parents: 36d3dfa
Author: Subroto Sanyal 
Authored: Mon Jun 6 16:05:40 2016 -0700
Committer: Marcelo Vanzin 
Committed: Mon Jun 6 16:05:40 2016 -0700

--
 .../apache/spark/launcher/LauncherServer.java   |  4 +++
 .../apache/spark/launcher/SparkAppHandle.java   |  4 ++-
 .../spark/launcher/LauncherServerSuite.java | 31 
 3 files changed, 38 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c409e23a/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java
--
diff --git 
a/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java 
b/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java
index e3413fd..28e9420 100644
--- a/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java
+++ b/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java
@@ -337,6 +337,10 @@ class LauncherServer implements Closeable {
   }
   super.close();
   if (handle != null) {
+   if (!handle.getState().isFinal()) {
+ LOG.log(Level.WARNING, "Lost connection to spark application.");
+ handle.setState(SparkAppHandle.State.LOST);
+   }
 handle.disconnect();
   }
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/c409e23a/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java
--
diff --git 
a/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java 
b/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java
index 625d026..0aa7bd1 100644
--- a/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java
+++ b/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java
@@ -46,7 +46,9 @@ public interface SparkAppHandle {
 /** The application finished with a failed status. */
 FAILED(true),
 /** The application was killed. */
-KILLED(true);
+KILLED(true),
+/** The Spark Submit JVM exited with a unknown status. */
+LOST(true);
 
 private final boolean isFinal;
 

http://git-wip-us.apache.org/repos/asf/spark/blob/c409e23a/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java
--
diff --git 
a/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java 
b/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java
index bfe1fcc..12f1a0c 100644
--- a/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java
+++ b/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java
@@ -152,6 +152,37 @@ public class LauncherServerSuite extends BaseSuite {
 }
   }
 
+  @Test
+  public void testSparkSubmitVmShutsDown() throws Exception {
+ChildProcAppHandle handle = LauncherServer.newAppHandle();
+TestClient client = null;
+final Semaphore semaphore = new Semaphore(0);
+try {
+  Socket s = new Socket(InetAddress.getLoopbackAddress(),
+LauncherServer.getServerInstance().getPort());
+  handle.addListener(new SparkAppHandle.Listener() {
+public void stateChanged(SparkAppHandle handle) {
+  semaphore.release();
+}
+public void infoChanged(SparkAppHandle handle) {
+  semaphore.release();
+}
+  });
+  client = new TestClient(s);
+  client.send(new Hello(handle.getSecret(), "1.4.0"));
+  assertTrue(semaphore.tryAcquire(30, TimeUnit.SECONDS));
+  // Make sure the server matched the client to the handle.
+  assertNotNull(handle.getConnection());
+  close(client);
+  

svn commit: r1747076 - in /spark: js/downloads.js site/js/downloads.js

2016-06-06 Thread srowen
Author: srowen
Date: Mon Jun  6 20:59:54 2016
New Revision: 1747076

URL: http://svn.apache.org/viewvc?rev=1747076=rev
Log:
SPARK-15778 part 2: group preview/stable releases in download version dropdown

Modified:
spark/js/downloads.js
spark/site/js/downloads.js

Modified: spark/js/downloads.js
URL: 
http://svn.apache.org/viewvc/spark/js/downloads.js?rev=1747076=1747075=1747076=diff
==
--- spark/js/downloads.js (original)
+++ spark/js/downloads.js Mon Jun  6 20:59:54 2016
@@ -53,18 +53,18 @@ addRelease("1.1.0", new Date("9/11/2014"
 addRelease("1.0.2", new Date("8/5/2014"), sources.concat(packagesV3), true, 
true);
 addRelease("1.0.1", new Date("7/11/2014"), sources.concat(packagesV3), false, 
true);
 addRelease("1.0.0", new Date("5/30/2014"), sources.concat(packagesV2), false, 
true);
-addRelease("0.9.2", new Date("7/23/2014"), sources.concat(packagesV2), true, 
false);
-addRelease("0.9.1", new Date("4/9/2014"), sources.concat(packagesV2), false, 
false);
-addRelease("0.9.0-incubating", new Date("2/2/2014"), 
sources.concat(packagesV2), false, false);
-addRelease("0.8.1-incubating", new Date("12/19/2013"), 
sources.concat(packagesV2), true, false);
-addRelease("0.8.0-incubating", new Date("9/25/2013"), 
sources.concat(packagesV1), true, false);
-addRelease("0.7.3", new Date("7/16/2013"), sources.concat(packagesV1), true, 
false);
-addRelease("0.7.2", new Date("2/6/2013"), sources.concat(packagesV1), false, 
false);
-addRelease("0.7.0", new Date("2/27/2013"), sources, false, false);
+addRelease("0.9.2", new Date("7/23/2014"), sources.concat(packagesV2), true, 
true);
+addRelease("0.9.1", new Date("4/9/2014"), sources.concat(packagesV2), false, 
true);
+addRelease("0.9.0-incubating", new Date("2/2/2014"), 
sources.concat(packagesV2), false, true);
+addRelease("0.8.1-incubating", new Date("12/19/2013"), 
sources.concat(packagesV2), true, true);
+addRelease("0.8.0-incubating", new Date("9/25/2013"), 
sources.concat(packagesV1), true, true);
+addRelease("0.7.3", new Date("7/16/2013"), sources.concat(packagesV1), true, 
true);
+addRelease("0.7.2", new Date("2/6/2013"), sources.concat(packagesV1), false, 
true);
+addRelease("0.7.0", new Date("2/27/2013"), sources, false, true);
 
 function append(el, contents) {
-  el.innerHTML = el.innerHTML + contents;
-};
+  el.innerHTML += contents;
+}
 
 function empty(el) {
   el.innerHTML = "";
@@ -79,27 +79,25 @@ function versionShort(version) { return
 function initDownloads() {
   var versionSelect = document.getElementById("sparkVersionSelect");
 
-  // Populate versions
-  var markedDefault = false;
+  // Populate stable versions
+  append(versionSelect, "");
   for (var version in releases) {
+if (!releases[version].downloadable || !releases[version].stable) { 
continue; }
 var releaseDate = releases[version].released;
-var downloadable = releases[version].downloadable;
-var stable = releases[version].stable;
-
-if (!downloadable) { continue; }
-
-var selected = false;
-if (!markedDefault && stable) {
-  selected = true;
-  markedDefault = true;
-}
+var title = versionShort(version) + " (" + 
releaseDate.toDateString().slice(4) + ")";
+append(versionSelect, "" + title + 
"");
+  }
+  append(versionSelect, "");
 
-// Don't display incubation status here
+  // Populate other versions
+  append(versionSelect, "");
+  for (var version in releases) {
+if (!releases[version].downloadable || releases[version].stable) { 
continue; }
+var releaseDate = releases[version].released;
 var title = versionShort(version) + " (" + 
releaseDate.toDateString().slice(4) + ")";
-append(versionSelect, 
-  "" +
-  title + "");
+append(versionSelect, "" + title + 
"");
   }
+  append(versionSelect, "");
 
   // Populate packages and (transitively) releases
   onVersionSelect();

Modified: spark/site/js/downloads.js
URL: 
http://svn.apache.org/viewvc/spark/site/js/downloads.js?rev=1747076=1747075=1747076=diff
==
--- spark/site/js/downloads.js (original)
+++ spark/site/js/downloads.js Mon Jun  6 20:59:54 2016
@@ -53,18 +53,18 @@ addRelease("1.1.0", new Date("9/11/2014"
 addRelease("1.0.2", new Date("8/5/2014"), sources.concat(packagesV3), true, 
true);
 addRelease("1.0.1", new Date("7/11/2014"), sources.concat(packagesV3), false, 
true);
 addRelease("1.0.0", new Date("5/30/2014"), sources.concat(packagesV2), false, 
true);
-addRelease("0.9.2", new Date("7/23/2014"), sources.concat(packagesV2), true, 
false);
-addRelease("0.9.1", new Date("4/9/2014"), sources.concat(packagesV2), false, 
false);
-addRelease("0.9.0-incubating", new Date("2/2/2014"), 
sources.concat(packagesV2), false, false);
-addRelease("0.8.1-incubating", new Date("12/19/2013"), 
sources.concat(packagesV2), true, false);
-addRelease("0.8.0-incubating", new Date("9/25/2013"), 

svn commit: r1747061 - in /spark: downloads.md js/downloads.js site/downloads.html site/js/downloads.js

2016-06-06 Thread srowen
Author: srowen
Date: Mon Jun  6 19:56:07 2016
New Revision: 1747061

URL: http://svn.apache.org/viewvc?rev=1747061=rev
Log:
SPARK-15778 add spark-2.0.0-preview release to options and other minor related 
updates

Modified:
spark/downloads.md
spark/js/downloads.js
spark/site/downloads.html
spark/site/js/downloads.js

Modified: spark/downloads.md
URL: 
http://svn.apache.org/viewvc/spark/downloads.md?rev=1747061=1747060=1747061=diff
==
--- spark/downloads.md (original)
+++ spark/downloads.md Mon Jun  6 19:56:07 2016
@@ -16,7 +16,7 @@ $(document).ready(function() {
 
 ## Download Apache Spark
 
-Our latest version is Apache Spark 1.6.1, released on March 9, 2016
+Our latest stable version is Apache Spark 1.6.1, released on March 9, 2016
 (release notes)
 https://github.com/apache/spark/releases/tag/v1.6.1;>(git 
tag)
 
@@ -36,6 +36,17 @@ Our latest version is Apache Spark 1.6.1
 _Note: Scala 2.11 users should download the Spark source package and build
 [with Scala 2.11 
support](http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211)._
 
+### Latest Preview Release
+
+Preview releases, as the name suggests, are releases for previewing upcoming 
features.
+Unlike nightly packages, preview releases have been audited by the project's 
management committee
+to satisfy the legal requirements of Apache Software Foundation's release 
policy.
+Preview releases are not meant to be functional, i.e. they can and highly 
likely will contain
+critical bugs or documentation errors.
+
+The latest preview release is Spark 2.0.0-preview, published on May 24, 2016.
+You can select and download it above.
+
 ### Link with Spark
 Spark artifacts are [hosted in Maven 
Central](http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.spark%22).
 You can add a Maven dependency with the following coordinates:
 
@@ -54,14 +65,9 @@ If you are interested in working with th
 
 Once you've downloaded Spark, you can find instructions for installing and 
building it on the documentation 
page.
 
-Stable Releases
-
-
-### Latest Preview Release (Spark 2.0.0-preview)
-Preview releases, as the name suggests, are releases for previewing upcoming 
features. Unlike nightly packages, preview releases have been audited by the 
project's management committee to satisfy the legal requirements of Apache 
Software Foundation's release policy.Preview releases are not meant to be 
functional, i.e. they can and highly likely will contain critical bugs or 
documentation errors.
-
-The latest preview release is Spark 2.0.0-preview, published on May 24, 2016. 
You can https://dist.apache.org/repos/dist/release/spark/spark-2.0.0-preview/;>download
 it here.
+### Release Notes for Stable Releases
 
+
 
 ### Nightly Packages and Artifacts
 For developers, Spark maintains nightly builds and SNAPSHOT artifacts. More 
information is available on the [Spark developer 
Wiki](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-NightlyBuilds).

Modified: spark/js/downloads.js
URL: 
http://svn.apache.org/viewvc/spark/js/downloads.js?rev=1747061=1747060=1747061=diff
==
--- spark/js/downloads.js (original)
+++ spark/js/downloads.js Mon Jun  6 19:56:07 2016
@@ -3,8 +3,8 @@
 
 releases = {};
 
-function addRelease(version, releaseDate, packages, downloadable) {
-  releases[version] = {released: releaseDate, packages: packages, 
downloadable: downloadable};
+function addRelease(version, releaseDate, packages, downloadable, stable) {
+  releases[version] = {released: releaseDate, packages: packages, 
downloadable: downloadable, stable: stable};
 }
 
 var sources = {pretty: "Source Code [can build several Hadoop versions]", tag: 
"sources"};
@@ -13,8 +13,9 @@ var hadoop1 = {pretty: "Pre-built for Ha
 var cdh4 = {pretty: "Pre-built for CDH 4", tag: "cdh4"};
 var hadoop2 = {pretty: "Pre-built for Hadoop 2.2", tag: "hadoop2"};
 var hadoop2p3 = {pretty: "Pre-built for Hadoop 2.3", tag: "hadoop2.3"};
-var hadoop2p4 = {pretty: "Pre-built for Hadoop 2.4 and later", tag: 
"hadoop2.4"};
-var hadoop2p6 = {pretty: "Pre-built for Hadoop 2.6 and later", tag: 
"hadoop2.6"};
+var hadoop2p4 = {pretty: "Pre-built for Hadoop 2.4", tag: "hadoop2.4"};
+var hadoop2p6 = {pretty: "Pre-built for Hadoop 2.6", tag: "hadoop2.6"};
+var hadoop2p7 = {pretty: "Pre-built for Hadoop 2.7 and later", tag: 
"hadoop2.7"};
 var mapr3 = {pretty: "Pre-built for MapR 3.X", tag: "mapr3"};
 var mapr4 = {pretty: "Pre-built for MapR 4.X", tag: "mapr4"};
 
@@ -31,32 +32,35 @@ var packagesV4 = [hadoop2p4, hadoop2p3,
 var packagesV5 = [hadoop2p6].concat(packagesV4);
 // 1.4.0+
 var packagesV6 = [hadoopFree, hadoop2p6, hadoop2p4, 
hadoop2p3].concat(packagesV1);
+// 2.0.0+
+var packagesV7 = [hadoopFree, hadoop2p7, hadoop2p6, hadoop2p4, hadoop2p3];
 
-addRelease("1.6.1", new 

spark git commit: [SPARK-15783][CORE] still some flakiness in these blacklist tests so ignore for now

2016-06-06 Thread irashid
Repository: spark
Updated Branches:
  refs/heads/master 0b8d69499 -> 36d3dfa59


[SPARK-15783][CORE] still some flakiness in these blacklist tests so ignore for 
now

## What changes were proposed in this pull request?

There is still some flakiness in BlacklistIntegrationSuite, so we are turning it off 
for the moment to avoid breaking more builds -- we will turn it back on with more 
fixes.

## How was this patch tested?

Jenkins.

Author: Imran Rashid 

Closes #13528 from squito/ignore_blacklist.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/36d3dfa5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/36d3dfa5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/36d3dfa5

Branch: refs/heads/master
Commit: 36d3dfa59a1ec0af6118e0667b80e9b7628e2cb6
Parents: 0b8d694
Author: Imran Rashid 
Authored: Mon Jun 6 12:53:11 2016 -0700
Committer: Imran Rashid 
Committed: Mon Jun 6 12:53:11 2016 -0700

--
 .../org/apache/spark/scheduler/BlacklistIntegrationSuite.scala | 6 +++---
 .../org/apache/spark/scheduler/SchedulerIntegrationSuite.scala | 5 +
 2 files changed, 8 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/36d3dfa5/core/src/test/scala/org/apache/spark/scheduler/BlacklistIntegrationSuite.scala
--
diff --git 
a/core/src/test/scala/org/apache/spark/scheduler/BlacklistIntegrationSuite.scala
 
b/core/src/test/scala/org/apache/spark/scheduler/BlacklistIntegrationSuite.scala
index 3a4b7af..d8a4b19 100644
--- 
a/core/src/test/scala/org/apache/spark/scheduler/BlacklistIntegrationSuite.scala
+++ 
b/core/src/test/scala/org/apache/spark/scheduler/BlacklistIntegrationSuite.scala
@@ -41,7 +41,7 @@ class BlacklistIntegrationSuite extends 
SchedulerIntegrationSuite[MultiExecutorM
 
   // Test demonstrating the issue -- without a config change, the scheduler 
keeps scheduling
   // according to locality preferences, and so the job fails
-  testScheduler("If preferred node is bad, without blacklist job will fail") {
+  ignore("If preferred node is bad, without blacklist job will fail") {
 val rdd = new MockRDDWithLocalityPrefs(sc, 10, Nil, badHost)
 withBackend(badHostBackend _) {
   val jobFuture = submit(rdd, (0 until 10).toArray)
@@ -54,7 +54,7 @@ class BlacklistIntegrationSuite extends 
SchedulerIntegrationSuite[MultiExecutorM
   // even with the blacklist turned on, if maxTaskFailures is not more than 
the number
   // of executors on the bad node, then locality preferences will lead to us 
cycling through
   // the executors on the bad node, and still failing the job
-  testScheduler(
+  ignoreScheduler(
 "With blacklist on, job will still fail if there are too many bad 
executors on bad host",
 extraConfs = Seq(
   // just set this to something much longer than the test duration
@@ -72,7 +72,7 @@ class BlacklistIntegrationSuite extends 
SchedulerIntegrationSuite[MultiExecutorM
 
   // Here we run with the blacklist on, and maxTaskFailures high enough that 
we'll eventually
   // schedule on a good node and succeed the job
-  testScheduler(
+  ignoreScheduler(
 "Bad node with multiple executors, job will still succeed with the right 
confs",
 extraConfs = Seq(
   // just set this to something much longer than the test duration

http://git-wip-us.apache.org/repos/asf/spark/blob/36d3dfa5/core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala
--
diff --git 
a/core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala
 
b/core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala
index 92bd765..5271a56 100644
--- 
a/core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala
+++ 
b/core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala
@@ -89,6 +89,11 @@ abstract class SchedulerIntegrationSuite[T <: MockBackend: 
ClassTag] extends Spa
 }
   }
 
+  // still a few races to work out in the blacklist tests, so ignore some tests
+  def ignoreScheduler(name: String, extraConfs: Seq[(String, 
String)])(testBody: => Unit): Unit = {
+ignore(name)(testBody)
+  }
+
   /**
* A map from partition -> results for all tasks of a job when you call this 
test framework's
* [[submit]] method.  Two important considerations:





spark git commit: [SPARK-15764][SQL] Replace N^2 loop in BindReferences

2016-06-06 Thread joshrosen
Repository: spark
Updated Branches:
  refs/heads/master 4c74ee8d8 -> 0b8d69499


[SPARK-15764][SQL] Replace N^2 loop in BindReferences

BindReferences contains an N^2 loop which causes performance issues when 
operating over large schemas: to determine the ordinal of an attribute 
reference, we perform a linear scan over the `input` array. Because input can 
sometimes be a `List`, the call to `input(ordinal).nullable` can also be O(n).

Instead of performing a linear scan, we can convert the input into an array and 
build a hash map to map from expression ids to ordinals. The greater up-front 
cost of the map construction is offset by the fact that an expression can 
contain multiple attribute references, so the cost of the map construction is 
amortized across a number of lookups.
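
The idea is sketched below; the actual change extends `AttributeSeq` in
`expressions/package.scala`, and the class name here is illustrative only.

```
import org.apache.spark.sql.catalyst.expressions.{Attribute, ExprId}

// Sketch of the lookup structure: one O(n) pass up front, O(1) per binding.
class IndexedAttributes(input: Seq[Attribute]) {
  // Materialize to an array so attrs(ordinal) is O(1) even if a List was passed in.
  val attrs: Array[Attribute] = input.toArray

  private val ordinalByExprId: Map[ExprId, Int] =
    attrs.iterator.map(_.exprId).zipWithIndex.toMap

  // Returns -1 when the attribute is not part of the input, mirroring indexWhere.
  def indexOf(exprId: ExprId): Int = ordinalByExprId.getOrElse(exprId, -1)
}
```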

Perf. benchmarks to follow. /cc ericl

Author: Josh Rosen 

Closes #13505 from JoshRosen/bind-references-improvement.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0b8d6949
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0b8d6949
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0b8d6949

Branch: refs/heads/master
Commit: 0b8d694999b43ada4833388cad6c285c7757cbf7
Parents: 4c74ee8
Author: Josh Rosen 
Authored: Mon Jun 6 11:44:51 2016 -0700
Committer: Josh Rosen 
Committed: Mon Jun 6 11:44:51 2016 -0700

--
 .../sql/catalyst/expressions/AttributeMap.scala |  7 
 .../catalyst/expressions/BoundAttribute.scala   |  6 ++--
 .../sql/catalyst/expressions/package.scala  | 34 +++-
 .../spark/sql/catalyst/plans/QueryPlan.scala|  2 +-
 .../execution/aggregate/HashAggregateExec.scala |  2 +-
 .../columnar/InMemoryTableScanExec.scala|  4 +--
 6 files changed, 40 insertions(+), 15 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/0b8d6949/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala
index ef3cc55..96a11e3 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala
@@ -26,13 +26,6 @@ object AttributeMap {
   def apply[A](kvs: Seq[(Attribute, A)]): AttributeMap[A] = {
 new AttributeMap(kvs.map(kv => (kv._1.exprId, kv)).toMap)
   }
-
-  /** Given a schema, constructs an [[AttributeMap]] from [[Attribute]] to 
ordinal */
-  def byIndex(schema: Seq[Attribute]): AttributeMap[Int] = 
apply(schema.zipWithIndex)
-
-  /** Given a schema, constructs a map from ordinal to Attribute. */
-  def toIndex(schema: Seq[Attribute]): Map[Int, Attribute] =
-schema.zipWithIndex.map { case (a, i) => i -> a }.toMap
 }
 
 class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])

http://git-wip-us.apache.org/repos/asf/spark/blob/0b8d6949/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala
index a38f1ec..7d16118 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala
@@ -82,16 +82,16 @@ object BindReferences extends Logging {
 
   def bindReference[A <: Expression](
   expression: A,
-  input: Seq[Attribute],
+  input: AttributeSeq,
   allowFailures: Boolean = false): A = {
 expression.transform { case a: AttributeReference =>
   attachTree(a, "Binding attribute") {
-val ordinal = input.indexWhere(_.exprId == a.exprId)
+val ordinal = input.indexOf(a.exprId)
 if (ordinal == -1) {
   if (allowFailures) {
 a
   } else {
-sys.error(s"Couldn't find $a in ${input.mkString("[", ",", "]")}")
+sys.error(s"Couldn't find $a in ${input.attrs.mkString("[", ",", 
"]")}")
   }
 } else {
   BoundReference(ordinal, a.dataType, input(ordinal).nullable)

http://git-wip-us.apache.org/repos/asf/spark/blob/0b8d6949/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
--
diff --git 

spark git commit: [SPARK-15764][SQL] Replace N^2 loop in BindReferences

2016-06-06 Thread joshrosen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 d07bce49f -> 5363783af


[SPARK-15764][SQL] Replace N^2 loop in BindReferences

BindReferences contains an N^2 loop which causes performance issues when 
operating over large schemas: to determine the ordinal of an attribute 
reference, we perform a linear scan over the `input` array. Because input can 
sometimes be a `List`, the call to `input(ordinal).nullable` can also be O(n).

Instead of performing a linear scan, we can convert the input into an array and 
build a hash map to map from expression ids to ordinals. The greater up-front 
cost of the map construction is offset by the fact that an expression can 
contain multiple attribute references, so the cost of the map construction is 
amortized across a number of lookups.

Perf. benchmarks to follow. /cc ericl

Author: Josh Rosen 

Closes #13505 from JoshRosen/bind-references-improvement.

(cherry picked from commit 0b8d694999b43ada4833388cad6c285c7757cbf7)
Signed-off-by: Josh Rosen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5363783a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5363783a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5363783a

Branch: refs/heads/branch-2.0
Commit: 5363783af7d93b7597181c8b39b83800fa059543
Parents: d07bce4
Author: Josh Rosen 
Authored: Mon Jun 6 11:44:51 2016 -0700
Committer: Josh Rosen 
Committed: Mon Jun 6 11:45:35 2016 -0700

--
 .../sql/catalyst/expressions/AttributeMap.scala |  7 
 .../catalyst/expressions/BoundAttribute.scala   |  6 ++--
 .../sql/catalyst/expressions/package.scala  | 34 +++-
 .../spark/sql/catalyst/plans/QueryPlan.scala|  2 +-
 .../execution/aggregate/HashAggregateExec.scala |  2 +-
 .../columnar/InMemoryTableScanExec.scala|  4 +--
 6 files changed, 40 insertions(+), 15 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5363783a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala
index ef3cc55..96a11e3 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala
@@ -26,13 +26,6 @@ object AttributeMap {
   def apply[A](kvs: Seq[(Attribute, A)]): AttributeMap[A] = {
 new AttributeMap(kvs.map(kv => (kv._1.exprId, kv)).toMap)
   }
-
-  /** Given a schema, constructs an [[AttributeMap]] from [[Attribute]] to 
ordinal */
-  def byIndex(schema: Seq[Attribute]): AttributeMap[Int] = 
apply(schema.zipWithIndex)
-
-  /** Given a schema, constructs a map from ordinal to Attribute. */
-  def toIndex(schema: Seq[Attribute]): Map[Int, Attribute] =
-schema.zipWithIndex.map { case (a, i) => i -> a }.toMap
 }
 
 class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])

http://git-wip-us.apache.org/repos/asf/spark/blob/5363783a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala
index a38f1ec..7d16118 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala
@@ -82,16 +82,16 @@ object BindReferences extends Logging {
 
   def bindReference[A <: Expression](
   expression: A,
-  input: Seq[Attribute],
+  input: AttributeSeq,
   allowFailures: Boolean = false): A = {
 expression.transform { case a: AttributeReference =>
   attachTree(a, "Binding attribute") {
-val ordinal = input.indexWhere(_.exprId == a.exprId)
+val ordinal = input.indexOf(a.exprId)
 if (ordinal == -1) {
   if (allowFailures) {
 a
   } else {
-sys.error(s"Couldn't find $a in ${input.mkString("[", ",", "]")}")
+sys.error(s"Couldn't find $a in ${input.attrs.mkString("[", ",", 
"]")}")
   }
 } else {
   BoundReference(ordinal, a.dataType, input(ordinal).nullable)


spark git commit: [SPARK-15721][ML] Make DefaultParamsReadable, DefaultParamsWritable public

2016-06-06 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master fa4bc8ea8 -> 4c74ee8d8


[SPARK-15721][ML] Make DefaultParamsReadable, DefaultParamsWritable public

## What changes were proposed in this pull request?

Made DefaultParamsReadable and DefaultParamsWritable public. Also added relevant 
docs and annotations. Added UnaryTransformerExample to demonstrate use of 
UnaryTransformer with DefaultParamsReadable and DefaultParamsWritable.
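
For context, a minimal sketch of what the now-public traits enable. The `MyShifter`
class and the save path are illustrative; `UnaryTransformer`, the two traits, and
the `write.save`/`load` calls are the existing ML persistence APIs.

```
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.types.{DataType, DataTypes}

// A trivial custom Transformer that is persistable out of the box.
class MyShifter(override val uid: String)
  extends UnaryTransformer[Double, Double, MyShifter] with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("myShifter"))

  override protected def createTransformFunc: Double => Double = _ + 1.0

  override protected def outputDataType: DataType = DataTypes.DoubleType
}

// The companion object supplies `load`, via DefaultParamsReadable.
object MyShifter extends DefaultParamsReadable[MyShifter]

// Usage:
//   new MyShifter().setInputCol("in").setOutputCol("out").write.save("/tmp/myShifter")
//   val restored = MyShifter.load("/tmp/myShifter")
```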

## How was this patch tested?

Wrote an example making use of the now-public APIs. Compiled and ran it locally.

Author: Joseph K. Bradley 

Closes #13461 from jkbradley/defaultparamswritable.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4c74ee8d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4c74ee8d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4c74ee8d

Branch: refs/heads/master
Commit: 4c74ee8d8e1c3139d3d322ae68977f2ab53295df
Parents: fa4bc8e
Author: Joseph K. Bradley 
Authored: Mon Jun 6 09:49:45 2016 -0700
Committer: Joseph K. Bradley 
Committed: Mon Jun 6 09:49:45 2016 -0700

--
 .../examples/ml/UnaryTransformerExample.scala   | 122 +++
 .../org/apache/spark/ml/util/ReadWrite.scala|  44 ++-
 2 files changed, 163 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4c74ee8d/examples/src/main/scala/org/apache/spark/examples/ml/UnaryTransformerExample.scala
--
diff --git 
a/examples/src/main/scala/org/apache/spark/examples/ml/UnaryTransformerExample.scala
 
b/examples/src/main/scala/org/apache/spark/examples/ml/UnaryTransformerExample.scala
new file mode 100644
index 000..13c72f8
--- /dev/null
+++ 
b/examples/src/main/scala/org/apache/spark/examples/ml/UnaryTransformerExample.scala
@@ -0,0 +1,122 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// scalastyle:off println
+package org.apache.spark.examples.ml
+
+// $example on$
+import org.apache.spark.ml.UnaryTransformer
+import org.apache.spark.ml.param.DoubleParam
+import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, 
Identifiable}
+import org.apache.spark.sql.functions.col
+// $example off$
+import org.apache.spark.sql.SparkSession
+// $example on$
+import org.apache.spark.sql.types.{DataType, DataTypes}
+import org.apache.spark.util.Utils
+// $example off$
+
+/**
+ * An example demonstrating creating a custom 
[[org.apache.spark.ml.Transformer]] using
+ * the [[UnaryTransformer]] abstraction.
+ *
+ * Run with
+ * {{{
+ * bin/run-example ml.UnaryTransformerExample
+ * }}}
+ */
+object UnaryTransformerExample {
+
+  // $example on$
+  /**
+   * Simple Transformer which adds a constant value to input Doubles.
+   *
+   * [[UnaryTransformer]] can be used to create a stage usable within 
Pipelines.
+   * It defines parameters for specifying input and output columns:
+   * [[UnaryTransformer.inputCol]] and [[UnaryTransformer.outputCol]].
+   * It can optionally handle schema validation.
+   *
+   * [[DefaultParamsWritable]] provides a default implementation for 
persisting instances
+   * of this Transformer.
+   */
+  class MyTransformer(override val uid: String)
+extends UnaryTransformer[Double, Double, MyTransformer] with 
DefaultParamsWritable {
+
+final val shift: DoubleParam = new DoubleParam(this, "shift", "Value added 
to input")
+
+def getShift: Double = $(shift)
+
+def setShift(value: Double): this.type = set(shift, value)
+
+def this() = this(Identifiable.randomUID("myT"))
+
+override protected def createTransformFunc: Double => Double = (input: 
Double) => {
+  input + $(shift)
+}
+
+override protected def outputDataType: DataType = DataTypes.DoubleType
+
+override protected def validateInputType(inputType: DataType): Unit = {
+  require(inputType == DataTypes.DoubleType, s"Bad input type: $inputType. 
Requires Double.")
+}
+  }
+
+  /**
+   * Companion object for our simple 

spark git commit: [SPARK-15721][ML] Make DefaultParamsReadable, DefaultParamsWritable public

2016-06-06 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 e38ff70e6 -> d07bce49f


[SPARK-15721][ML] Make DefaultParamsReadable, DefaultParamsWritable public

## What changes were proposed in this pull request?

Made DefaultParamsReadable and DefaultParamsWritable public. Also added relevant 
docs and annotations. Added UnaryTransformerExample to demonstrate use of 
UnaryTransformer with DefaultParamsReadable and DefaultParamsWritable.

## How was this patch tested?

Wrote an example making use of the now-public APIs. Compiled and ran it locally.

Author: Joseph K. Bradley 

Closes #13461 from jkbradley/defaultparamswritable.

(cherry picked from commit 4c74ee8d8e1c3139d3d322ae68977f2ab53295df)
Signed-off-by: Joseph K. Bradley 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d07bce49
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d07bce49
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d07bce49

Branch: refs/heads/branch-2.0
Commit: d07bce49fc77aff25330920cc55b7079a3a2995c
Parents: e38ff70
Author: Joseph K. Bradley 
Authored: Mon Jun 6 09:49:45 2016 -0700
Committer: Joseph K. Bradley 
Committed: Mon Jun 6 09:49:56 2016 -0700

--
 .../examples/ml/UnaryTransformerExample.scala   | 122 +++
 .../org/apache/spark/ml/util/ReadWrite.scala|  44 ++-
 2 files changed, 163 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d07bce49/examples/src/main/scala/org/apache/spark/examples/ml/UnaryTransformerExample.scala
--
diff --git 
a/examples/src/main/scala/org/apache/spark/examples/ml/UnaryTransformerExample.scala
 
b/examples/src/main/scala/org/apache/spark/examples/ml/UnaryTransformerExample.scala
new file mode 100644
index 000..13c72f8
--- /dev/null
+++ 
b/examples/src/main/scala/org/apache/spark/examples/ml/UnaryTransformerExample.scala
@@ -0,0 +1,122 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// scalastyle:off println
+package org.apache.spark.examples.ml
+
+// $example on$
+import org.apache.spark.ml.UnaryTransformer
+import org.apache.spark.ml.param.DoubleParam
+import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
+import org.apache.spark.sql.functions.col
+// $example off$
+import org.apache.spark.sql.SparkSession
+// $example on$
+import org.apache.spark.sql.types.{DataType, DataTypes}
+import org.apache.spark.util.Utils
+// $example off$
+
+/**
+ * An example demonstrating creating a custom [[org.apache.spark.ml.Transformer]] using
+ * the [[UnaryTransformer]] abstraction.
+ *
+ * Run with
+ * {{{
+ * bin/run-example ml.UnaryTransformerExample
+ * }}}
+ */
+object UnaryTransformerExample {
+
+  // $example on$
+  /**
+   * Simple Transformer which adds a constant value to input Doubles.
+   *
+   * [[UnaryTransformer]] can be used to create a stage usable within Pipelines.
+   * It defines parameters for specifying input and output columns:
+   * [[UnaryTransformer.inputCol]] and [[UnaryTransformer.outputCol]].
+   * It can optionally handle schema validation.
+   *
+   * [[DefaultParamsWritable]] provides a default implementation for persisting instances
+   * of this Transformer.
+   */
+  class MyTransformer(override val uid: String)
+extends UnaryTransformer[Double, Double, MyTransformer] with DefaultParamsWritable {
+
+final val shift: DoubleParam = new DoubleParam(this, "shift", "Value added to input")
+
+def getShift: Double = $(shift)
+
+def setShift(value: Double): this.type = set(shift, value)
+
+def this() = this(Identifiable.randomUID("myT"))
+
+override protected def createTransformFunc: Double => Double = (input: Double) => {
+  input + $(shift)
+}
+
+override protected def outputDataType: DataType = DataTypes.DoubleType
+
+override protected def validateInputType(inputType: DataType): Unit = {
+  require(inputType == 

spark git commit: [SPARK-14279][BUILD] Pick the spark version from pom

2016-06-06 Thread vanzin
Repository: spark
Updated Branches:
  refs/heads/master 00ad4f054 -> fa4bc8ea8


[SPARK-14279][BUILD] Pick the spark version from pom

## What changes were proposed in this pull request?
Change the way spark picks up version information. Also embed the build 
information to better identify the spark version running.

More context can be found here : https://github.com/apache/spark/pull/12152

## How was this patch tested?
Ran the mvn and sbt builds to verify the version information was being 
displayed correctly on executing spark-submit --version 

![image](https://cloud.githubusercontent.com/assets/7732317/15197251/f7c673a2-1795-11e6-8b2f-88f2a70cf1c1.png)

Author: Dhruve Ashar 

Closes #13061 from dhruve/impr/SPARK-14279.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fa4bc8ea
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fa4bc8ea
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fa4bc8ea

Branch: refs/heads/master
Commit: fa4bc8ea8bab1277d1482da370dac79947cac719
Parents: 00ad4f0
Author: Dhruve Ashar 
Authored: Mon Jun 6 09:42:50 2016 -0700
Committer: Marcelo Vanzin 
Committed: Mon Jun 6 09:42:50 2016 -0700

--
 build/spark-build-info  | 38 ++
 core/pom.xml| 31 +++
 .../org/apache/spark/deploy/SparkSubmit.scala   |  7 ++-
 .../main/scala/org/apache/spark/package.scala   | 55 +++-
 pom.xml |  6 ++-
 project/SparkBuild.scala| 21 ++--
 6 files changed, 150 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/fa4bc8ea/build/spark-build-info
--
diff --git a/build/spark-build-info b/build/spark-build-info
new file mode 100755
index 000..ad0ec67
--- /dev/null
+++ b/build/spark-build-info
@@ -0,0 +1,38 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This script generates the build info for spark and places it into the spark-version-info.properties file.
+# Arguments:
+#   build_tgt_directory - The target directory where properties file would be created. [./core/target/extra-resources]
+#   spark_version - The current version of spark
+
+RESOURCE_DIR="$1"
+mkdir -p "$RESOURCE_DIR"
+SPARK_BUILD_INFO="${RESOURCE_DIR}"/spark-version-info.properties
+
+echo_build_properties() {
+  echo version=$1
+  echo user=$USER
+  echo revision=$(git rev-parse HEAD)
+  echo branch=$(git rev-parse --abbrev-ref HEAD)
+  echo date=$(date -u +%Y-%m-%dT%H:%M:%SZ)
+  echo url=$(git config --get remote.origin.url)
+}
+
+echo_build_properties $2 > "$SPARK_BUILD_INFO"
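
The generated properties file can then be read from the classpath at runtime. Below is a minimal sketch of that lookup for illustration only; it is not the committed `org.apache.spark` package implementation, and it assumes only the property keys written by the script above and that the build places the file on the classpath as a filtered resource:

```scala
import java.util.Properties

object SparkBuildInfoSketch {
  // The script above writes these keys into spark-version-info.properties.
  private val resourceName = "spark-version-info.properties"

  lazy val props: Properties = {
    val p = new Properties()
    val in = Thread.currentThread().getContextClassLoader.getResourceAsStream(resourceName)
    if (in != null) {
      try p.load(in) finally in.close()
    }
    p
  }

  def version: String   = props.getProperty("version", "<unknown>")
  def revision: String  = props.getProperty("revision", "<unknown>")
  def branch: String    = props.getProperty("branch", "<unknown>")
  def buildDate: String = props.getProperty("date", "<unknown>")
}
```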

http://git-wip-us.apache.org/repos/asf/spark/blob/fa4bc8ea/core/pom.xml
--
diff --git a/core/pom.xml b/core/pom.xml
index 45f8bfc..f5fdb40 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -337,9 +337,40 @@
   
 
target/scala-${scala.binary.version}/classes
 
target/scala-${scala.binary.version}/test-classes
+
+  
+${project.basedir}/src/main/resources
+  
+  
+
+${project.build.directory}/extra-resources
+true
+  
+
 
   
 org.apache.maven.plugins
+maven-antrun-plugin
+
+  
+generate-resources
+
+  
+  
+
+  
+  
+
+  
+
+
+  run
+
+  
+
+  
+  
+org.apache.maven.plugins
 maven-dependency-plugin
 
   

spark git commit: [SPARK-14900][ML][PYSPARK] Add accuracy and deprecate precison, recall, f1

2016-06-06 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 86a35a229 -> e38ff70e6


[SPARK-14900][ML][PYSPARK] Add accuracy and deprecate precison,recall,f1

## What changes were proposed in this pull request?
1. Add accuracy for MulticlassMetrics.
2. Deprecate the overall precision, recall, and f1, and recommend using accuracy instead.
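
The change itself is in the PySpark wrapper; for reference, here is a minimal Scala sketch of the underlying MLlib API, with made-up prediction/label pairs:

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("AccuracySketch").master("local[*]").getOrCreate()

// (prediction, label) pairs; the values are illustrative only.
val predictionAndLabels = spark.sparkContext.parallelize(Seq(
  (0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (2.0, 2.0), (2.0, 1.0), (0.0, 0.0)))

val metrics = new MulticlassMetrics(predictionAndLabels)
println(s"accuracy = ${metrics.accuracy}")   // preferred over the deprecated overall precision/recall/f1
```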

## How was this patch tested?
manual tests in pyspark shell

Author: Zheng RuiFeng 

Closes #13511 from zhengruifeng/deprecate_py_precisonrecall.

(cherry picked from commit 00ad4f054cd044e17d29b7c2c62efd8616462619)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e38ff70e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e38ff70e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e38ff70e

Branch: refs/heads/branch-2.0
Commit: e38ff70e6bacf1c85edc390d28f8a8d5ecc6cbc3
Parents: 86a35a2
Author: Zheng RuiFeng 
Authored: Mon Jun 6 15:19:22 2016 +0100
Committer: Sean Owen 
Committed: Mon Jun 6 15:19:38 2016 +0100

--
 python/pyspark/mllib/evaluation.py | 18 ++
 1 file changed, 18 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e38ff70e/python/pyspark/mllib/evaluation.py
--
diff --git a/python/pyspark/mllib/evaluation.py 
b/python/pyspark/mllib/evaluation.py
index 5f32f09..2eaac87 100644
--- a/python/pyspark/mllib/evaluation.py
+++ b/python/pyspark/mllib/evaluation.py
@@ -15,6 +15,8 @@
 # limitations under the License.
 #
 
+import warnings
+
 from pyspark import since
 from pyspark.mllib.common import JavaModelWrapper, callMLlibFunc
 from pyspark.sql import SQLContext
@@ -181,6 +183,8 @@ class MulticlassMetrics(JavaModelWrapper):
 0.66...
 >>> metrics.recall()
 0.66...
+>>> metrics.accuracy()
+0.66...
 >>> metrics.weightedFalsePositiveRate
 0.19...
 >>> metrics.weightedPrecision
@@ -233,6 +237,8 @@ class MulticlassMetrics(JavaModelWrapper):
 Returns precision or precision for a given label (category) if 
specified.
 """
 if label is None:
+# note:: Deprecated in 2.0.0. Use accuracy.
+warnings.warn("Deprecated in 2.0.0. Use accuracy.")
 return self.call("precision")
 else:
 return self.call("precision", float(label))
@@ -243,6 +249,8 @@ class MulticlassMetrics(JavaModelWrapper):
 Returns recall or recall for a given label (category) if specified.
 """
 if label is None:
+# note:: Deprecated in 2.0.0. Use accuracy.
+warnings.warn("Deprecated in 2.0.0. Use accuracy.")
 return self.call("recall")
 else:
 return self.call("recall", float(label))
@@ -254,6 +262,8 @@ class MulticlassMetrics(JavaModelWrapper):
 """
 if beta is None:
 if label is None:
+# note:: Deprecated in 2.0.0. Use accuracy.
+warnings.warn("Deprecated in 2.0.0. Use accuracy.")
 return self.call("fMeasure")
 else:
 return self.call("fMeasure", label)
@@ -263,6 +273,14 @@ class MulticlassMetrics(JavaModelWrapper):
 else:
 return self.call("fMeasure", label, beta)
 
+@since('2.0.0')
+def accuracy(self):
+"""
+Returns accuracy (equals to the total number of correctly classified instances out of the total number of instances).
+"""
+return self.call("accuracy")
+
 @property
 @since('1.4.0')
 def weightedTruePositiveRate(self):


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-14900][ML][PYSPARK] Add accuracy and deprecate precison, recall, f1

2016-06-06 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master a95252823 -> 00ad4f054


[SPARK-14900][ML][PYSPARK] Add accuracy and deprecate precison,recall,f1

## What changes were proposed in this pull request?
1. Add accuracy for MulticlassMetrics.
2. Deprecate the overall precision, recall, and f1, and recommend using accuracy instead.

## How was this patch tested?
manual tests in pyspark shell

Author: Zheng RuiFeng 

Closes #13511 from zhengruifeng/deprecate_py_precisonrecall.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/00ad4f05
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/00ad4f05
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/00ad4f05

Branch: refs/heads/master
Commit: 00ad4f054cd044e17d29b7c2c62efd8616462619
Parents: a952528
Author: Zheng RuiFeng 
Authored: Mon Jun 6 15:19:22 2016 +0100
Committer: Sean Owen 
Committed: Mon Jun 6 15:19:22 2016 +0100

--
 python/pyspark/mllib/evaluation.py | 18 ++
 1 file changed, 18 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/00ad4f05/python/pyspark/mllib/evaluation.py
--
diff --git a/python/pyspark/mllib/evaluation.py 
b/python/pyspark/mllib/evaluation.py
index 5f32f09..2eaac87 100644
--- a/python/pyspark/mllib/evaluation.py
+++ b/python/pyspark/mllib/evaluation.py
@@ -15,6 +15,8 @@
 # limitations under the License.
 #
 
+import warnings
+
 from pyspark import since
 from pyspark.mllib.common import JavaModelWrapper, callMLlibFunc
 from pyspark.sql import SQLContext
@@ -181,6 +183,8 @@ class MulticlassMetrics(JavaModelWrapper):
 0.66...
 >>> metrics.recall()
 0.66...
+>>> metrics.accuracy()
+0.66...
 >>> metrics.weightedFalsePositiveRate
 0.19...
 >>> metrics.weightedPrecision
@@ -233,6 +237,8 @@ class MulticlassMetrics(JavaModelWrapper):
 Returns precision or precision for a given label (category) if 
specified.
 """
 if label is None:
+# note:: Deprecated in 2.0.0. Use accuracy.
+warnings.warn("Deprecated in 2.0.0. Use accuracy.")
 return self.call("precision")
 else:
 return self.call("precision", float(label))
@@ -243,6 +249,8 @@ class MulticlassMetrics(JavaModelWrapper):
 Returns recall or recall for a given label (category) if specified.
 """
 if label is None:
+# note:: Deprecated in 2.0.0. Use accuracy.
+warnings.warn("Deprecated in 2.0.0. Use accuracy.")
 return self.call("recall")
 else:
 return self.call("recall", float(label))
@@ -254,6 +262,8 @@ class MulticlassMetrics(JavaModelWrapper):
 """
 if beta is None:
 if label is None:
+# note:: Deprecated in 2.0.0. Use accuracy.
+warnings.warn("Deprecated in 2.0.0. Use accuracy.")
 return self.call("fMeasure")
 else:
 return self.call("fMeasure", label)
@@ -263,6 +273,14 @@ class MulticlassMetrics(JavaModelWrapper):
 else:
 return self.call("fMeasure", label, beta)
 
+@since('2.0.0')
+def accuracy(self):
+"""
+Returns accuracy (equals to the total number of correctly classified instances out of the total number of instances).
+"""
+return self.call("accuracy")
+
 @property
 @since('1.4.0')
 def weightedTruePositiveRate(self):


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-15771][ML][EXAMPLES] Use 'accuracy' rather than 'precision' in many ML examples

2016-06-06 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 90e94b826 -> 86a35a229


[SPARK-15771][ML][EXAMPLES] Use 'accuracy' rather than 'precision' in many ML 
examples

## What changes were proposed in this pull request?
Since [SPARK-15617](https://issues.apache.org/jira/browse/SPARK-15617) deprecated ```precision``` in ```MulticlassClassificationEvaluator```, many ML examples are broken.
```python
pyspark.sql.utils.IllegalArgumentException: 
u'MulticlassClassificationEvaluator_4c3bb1d73d8cc0cedae6 parameter metricName 
given invalid value precision.'
```
We should use ```accuracy``` to replace ```precision``` in these examples.
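
A minimal Scala sketch of the replacement pattern the examples move to; the toy DataFrame below stands in for a fitted model's output:

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("EvaluatorSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Toy (prediction, label) pairs standing in for real predictions.
val predictions = Seq((0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (2.0, 2.0)).toDF("prediction", "indexedLabel")

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")   // "precision" is deprecated as a metric name

val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${1.0 - accuracy}")
```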

## How was this patch tested?
Offline tests.

Author: Yanbo Liang 

Closes #13519 from yanboliang/spark-15771.

(cherry picked from commit a95252823e09939b654dd425db38dadc4100bc87)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/86a35a22
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/86a35a22
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/86a35a22

Branch: refs/heads/branch-2.0
Commit: 86a35a22985b9e592744e6ef31453995f2322a31
Parents: 90e94b8
Author: Yanbo Liang 
Authored: Mon Jun 6 09:36:34 2016 +0100
Committer: Sean Owen 
Committed: Mon Jun 6 09:36:43 2016 +0100

--
 .../examples/ml/JavaDecisionTreeClassificationExample.java | 2 +-
 .../examples/ml/JavaGradientBoostedTreeClassifierExample.java  | 2 +-
 .../examples/ml/JavaMultilayerPerceptronClassifierExample.java | 6 +++---
 .../org/apache/spark/examples/ml/JavaNaiveBayesExample.java| 6 +++---
 .../org/apache/spark/examples/ml/JavaOneVsRestExample.java | 6 +++---
 .../spark/examples/ml/JavaRandomForestClassifierExample.java   | 2 +-
 .../src/main/python/ml/decision_tree_classification_example.py | 2 +-
 .../main/python/ml/gradient_boosted_tree_classifier_example.py | 2 +-
 .../src/main/python/ml/multilayer_perceptron_classification.py | 6 +++---
 examples/src/main/python/ml/naive_bayes_example.py | 6 +++---
 examples/src/main/python/ml/one_vs_rest_example.py | 6 +++---
 .../src/main/python/ml/random_forest_classifier_example.py | 2 +-
 .../spark/examples/ml/DecisionTreeClassificationExample.scala  | 2 +-
 .../examples/ml/GradientBoostedTreeClassifierExample.scala | 2 +-
 .../examples/ml/MultilayerPerceptronClassifierExample.scala| 6 +++---
 .../scala/org/apache/spark/examples/ml/NaiveBayesExample.scala | 6 +++---
 .../scala/org/apache/spark/examples/ml/OneVsRestExample.scala  | 6 +++---
 .../spark/examples/ml/RandomForestClassifierExample.scala  | 2 +-
 python/pyspark/ml/evaluation.py| 2 +-
 19 files changed, 37 insertions(+), 37 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/86a35a22/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java
--
diff --git 
a/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java
 
b/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java
index bdb76f0..a9c6e7f 100644
--- 
a/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java
+++ 
b/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java
@@ -90,7 +90,7 @@ public class JavaDecisionTreeClassificationExample {
 MulticlassClassificationEvaluator evaluator = new 
MulticlassClassificationEvaluator()
   .setLabelCol("indexedLabel")
   .setPredictionCol("prediction")
-  .setMetricName("precision");
+  .setMetricName("accuracy");
 double accuracy = evaluator.evaluate(predictions);
 System.out.println("Test Error = " + (1.0 - accuracy));
 

http://git-wip-us.apache.org/repos/asf/spark/blob/86a35a22/examples/src/main/java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java
--
diff --git 
a/examples/src/main/java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java
 
b/examples/src/main/java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java
index 5c2e03e..3e9eb99 100644
--- 
a/examples/src/main/java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java
+++ 
b/examples/src/main/java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java
@@ -92,7 +92,7 @@ public class JavaGradientBoostedTreeClassifierExample {
 MulticlassClassificationEvaluator evaluator = new 
MulticlassClassificationEvaluator()
   .setLabelCol("indexedLabel")
   

spark git commit: [MINOR] Fix Typos 'an -> a'

2016-06-06 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 7d10e4bdd -> 90e94b826


[MINOR] Fix Typos 'an -> a'

## What changes were proposed in this pull request?

`an -> a`

Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} 
&& echo {}"` to generate candidates, and review them one by one.

## How was this patch tested?
manual tests

Author: Zheng RuiFeng 

Closes #13515 from zhengruifeng/an_a.

(cherry picked from commit fd8af397132fa1415a4c19d7f5cb5a41aa6ddb27)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/90e94b82
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/90e94b82
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/90e94b82

Branch: refs/heads/branch-2.0
Commit: 90e94b82649d9816cd4065549678b82751238552
Parents: 7d10e4b
Author: Zheng RuiFeng 
Authored: Mon Jun 6 09:35:47 2016 +0100
Committer: Sean Owen 
Committed: Mon Jun 6 09:35:57 2016 +0100

--
 R/pkg/R/utils.R   |  2 +-
 .../src/main/scala/org/apache/spark/Accumulable.scala |  2 +-
 .../org/apache/spark/api/java/JavaSparkContext.scala  |  2 +-
 .../scala/org/apache/spark/api/python/PythonRDD.scala |  2 +-
 .../scala/org/apache/spark/deploy/SparkSubmit.scala   |  6 +++---
 .../src/main/scala/org/apache/spark/rdd/JdbcRDD.scala |  6 +++---
 .../main/scala/org/apache/spark/scheduler/Pool.scala  |  2 +-
 .../org/apache/spark/broadcast/BroadcastSuite.scala   |  2 +-
 .../spark/deploy/rest/StandaloneRestSubmitSuite.scala |  2 +-
 .../test/scala/org/apache/spark/rpc/RpcEnvSuite.scala |  2 +-
 .../apache/spark/scheduler/DAGSchedulerSuite.scala|  4 ++--
 .../org/apache/spark/util/JsonProtocolSuite.scala |  2 +-
 .../spark/streaming/flume/FlumeBatchFetcher.scala |  2 +-
 .../spark/graphx/impl/VertexPartitionBaseOps.scala|  2 +-
 .../scala/org/apache/spark/ml/linalg/Vectors.scala|  2 +-
 .../src/main/scala/org/apache/spark/ml/Pipeline.scala |  2 +-
 .../spark/ml/classification/LogisticRegression.scala  |  4 ++--
 .../org/apache/spark/ml/tree/impl/RandomForest.scala  |  2 +-
 .../mllib/classification/LogisticRegression.scala |  2 +-
 .../org/apache/spark/mllib/classification/SVM.scala   |  2 +-
 .../spark/mllib/feature/VectorTransformer.scala   |  2 +-
 .../scala/org/apache/spark/mllib/linalg/Vectors.scala |  2 +-
 .../mllib/linalg/distributed/CoordinateMatrix.scala   |  2 +-
 .../apache/spark/mllib/rdd/MLPairRDDFunctions.scala   |  2 +-
 python/pyspark/ml/classification.py   |  4 ++--
 python/pyspark/ml/pipeline.py |  2 +-
 python/pyspark/mllib/classification.py|  2 +-
 python/pyspark/mllib/common.py|  2 +-
 python/pyspark/rdd.py |  4 ++--
 python/pyspark/sql/session.py |  2 +-
 python/pyspark/sql/streaming.py   |  2 +-
 python/pyspark/sql/types.py   |  2 +-
 python/pyspark/streaming/dstream.py   |  4 ++--
 .../src/main/scala/org/apache/spark/sql/Row.scala |  2 +-
 .../apache/spark/sql/catalyst/analysis/Analyzer.scala |  4 ++--
 .../sql/catalyst/analysis/FunctionRegistry.scala  |  2 +-
 .../sql/catalyst/analysis/MultiInstanceRelation.scala |  2 +-
 .../spark/sql/catalyst/catalog/SessionCatalog.scala   |  6 +++---
 .../sql/catalyst/catalog/functionResources.scala  |  2 +-
 .../sql/catalyst/expressions/ExpectsInputTypes.scala  |  2 +-
 .../spark/sql/catalyst/expressions/Projection.scala   |  4 ++--
 .../sql/catalyst/expressions/complexTypeCreator.scala |  2 +-
 .../org/apache/spark/sql/types/AbstractDataType.scala |  2 +-
 .../scala/org/apache/spark/sql/DataFrameReader.scala  |  2 +-
 .../main/scala/org/apache/spark/sql/SQLContext.scala  | 14 +++---
 .../scala/org/apache/spark/sql/SQLImplicits.scala |  2 +-
 .../scala/org/apache/spark/sql/SparkSession.scala | 14 +++---
 .../org/apache/spark/sql/catalyst/SQLBuilder.scala|  2 +-
 .../aggregate/SortBasedAggregationIterator.scala  |  2 +-
 .../apache/spark/sql/execution/aggregate/udaf.scala   |  2 +-
 .../execution/columnar/GenerateColumnAccessor.scala   |  2 +-
 .../execution/datasources/FileSourceStrategy.scala|  2 +-
 .../execution/datasources/json/JacksonParser.scala|  2 +-
 .../datasources/parquet/CatalystRowConverter.scala|  2 +-
 .../sql/execution/exchange/ExchangeCoordinator.scala  | 10 +-
 .../spark/sql/execution/joins/SortMergeJoinExec.scala |  2 +-
 .../spark/sql/execution/r/MapPartitionsRWrapper.scala |  2 +-
 .../scala/org/apache/spark/sql/expressions/udaf.scala |  2 +-
 .../org/apache/spark/sql/internal/SharedState.scala   |  2 +-
 .../apache/spark/sql/streaming/ContinuousQuery.scala  |  

spark git commit: [SPARK-15771][ML][EXAMPLES] Use 'accuracy' rather than 'precision' in many ML examples

2016-06-06 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master fd8af3971 -> a95252823


[SPARK-15771][ML][EXAMPLES] Use 'accuracy' rather than 'precision' in many ML 
examples

## What changes were proposed in this pull request?
Since [SPARK-15617](https://issues.apache.org/jira/browse/SPARK-15617) deprecated ```precision``` in ```MulticlassClassificationEvaluator```, many ML examples are broken.
```python
pyspark.sql.utils.IllegalArgumentException: 
u'MulticlassClassificationEvaluator_4c3bb1d73d8cc0cedae6 parameter metricName 
given invalid value precision.'
```
We should use ```accuracy``` to replace ```precision``` in these examples.

## How was this patch tested?
Offline tests.

Author: Yanbo Liang 

Closes #13519 from yanboliang/spark-15771.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a9525282
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a9525282
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a9525282

Branch: refs/heads/master
Commit: a95252823e09939b654dd425db38dadc4100bc87
Parents: fd8af39
Author: Yanbo Liang 
Authored: Mon Jun 6 09:36:34 2016 +0100
Committer: Sean Owen 
Committed: Mon Jun 6 09:36:34 2016 +0100

--
 .../examples/ml/JavaDecisionTreeClassificationExample.java | 2 +-
 .../examples/ml/JavaGradientBoostedTreeClassifierExample.java  | 2 +-
 .../examples/ml/JavaMultilayerPerceptronClassifierExample.java | 6 +++---
 .../org/apache/spark/examples/ml/JavaNaiveBayesExample.java| 6 +++---
 .../org/apache/spark/examples/ml/JavaOneVsRestExample.java | 6 +++---
 .../spark/examples/ml/JavaRandomForestClassifierExample.java   | 2 +-
 .../src/main/python/ml/decision_tree_classification_example.py | 2 +-
 .../main/python/ml/gradient_boosted_tree_classifier_example.py | 2 +-
 .../src/main/python/ml/multilayer_perceptron_classification.py | 6 +++---
 examples/src/main/python/ml/naive_bayes_example.py | 6 +++---
 examples/src/main/python/ml/one_vs_rest_example.py | 6 +++---
 .../src/main/python/ml/random_forest_classifier_example.py | 2 +-
 .../spark/examples/ml/DecisionTreeClassificationExample.scala  | 2 +-
 .../examples/ml/GradientBoostedTreeClassifierExample.scala | 2 +-
 .../examples/ml/MultilayerPerceptronClassifierExample.scala| 6 +++---
 .../scala/org/apache/spark/examples/ml/NaiveBayesExample.scala | 6 +++---
 .../scala/org/apache/spark/examples/ml/OneVsRestExample.scala  | 6 +++---
 .../spark/examples/ml/RandomForestClassifierExample.scala  | 2 +-
 python/pyspark/ml/evaluation.py| 2 +-
 19 files changed, 37 insertions(+), 37 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a9525282/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java
--
diff --git 
a/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java
 
b/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java
index bdb76f0..a9c6e7f 100644
--- 
a/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java
+++ 
b/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java
@@ -90,7 +90,7 @@ public class JavaDecisionTreeClassificationExample {
 MulticlassClassificationEvaluator evaluator = new 
MulticlassClassificationEvaluator()
   .setLabelCol("indexedLabel")
   .setPredictionCol("prediction")
-  .setMetricName("precision");
+  .setMetricName("accuracy");
 double accuracy = evaluator.evaluate(predictions);
 System.out.println("Test Error = " + (1.0 - accuracy));
 

http://git-wip-us.apache.org/repos/asf/spark/blob/a9525282/examples/src/main/java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java
--
diff --git 
a/examples/src/main/java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java
 
b/examples/src/main/java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java
index 5c2e03e..3e9eb99 100644
--- 
a/examples/src/main/java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java
+++ 
b/examples/src/main/java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java
@@ -92,7 +92,7 @@ public class JavaGradientBoostedTreeClassifierExample {
 MulticlassClassificationEvaluator evaluator = new 
MulticlassClassificationEvaluator()
   .setLabelCol("indexedLabel")
   .setPredictionCol("prediction")
-  .setMetricName("precision");
+  .setMetricName("accuracy");
 double accuracy = 

spark git commit: [MINOR] Fix Typos 'an -> a'

2016-06-06 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 32f2f95db -> fd8af3971


[MINOR] Fix Typos 'an -> a'

## What changes were proposed in this pull request?

`an -> a`

Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} 
&& echo {}"` to generate candidates, and review them one by one.

## How was this patch tested?
manual tests

Author: Zheng RuiFeng 

Closes #13515 from zhengruifeng/an_a.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fd8af397
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fd8af397
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fd8af397

Branch: refs/heads/master
Commit: fd8af397132fa1415a4c19d7f5cb5a41aa6ddb27
Parents: 32f2f95
Author: Zheng RuiFeng 
Authored: Mon Jun 6 09:35:47 2016 +0100
Committer: Sean Owen 
Committed: Mon Jun 6 09:35:47 2016 +0100

--
 R/pkg/R/utils.R   |  2 +-
 .../src/main/scala/org/apache/spark/Accumulable.scala |  2 +-
 .../org/apache/spark/api/java/JavaSparkContext.scala  |  2 +-
 .../scala/org/apache/spark/api/python/PythonRDD.scala |  2 +-
 .../scala/org/apache/spark/deploy/SparkSubmit.scala   |  6 +++---
 .../src/main/scala/org/apache/spark/rdd/JdbcRDD.scala |  6 +++---
 .../main/scala/org/apache/spark/scheduler/Pool.scala  |  2 +-
 .../org/apache/spark/broadcast/BroadcastSuite.scala   |  2 +-
 .../spark/deploy/rest/StandaloneRestSubmitSuite.scala |  2 +-
 .../test/scala/org/apache/spark/rpc/RpcEnvSuite.scala |  2 +-
 .../apache/spark/scheduler/DAGSchedulerSuite.scala|  4 ++--
 .../org/apache/spark/util/JsonProtocolSuite.scala |  2 +-
 .../spark/streaming/flume/FlumeBatchFetcher.scala |  2 +-
 .../spark/graphx/impl/VertexPartitionBaseOps.scala|  2 +-
 .../scala/org/apache/spark/ml/linalg/Vectors.scala|  2 +-
 .../src/main/scala/org/apache/spark/ml/Pipeline.scala |  2 +-
 .../spark/ml/classification/LogisticRegression.scala  |  4 ++--
 .../org/apache/spark/ml/tree/impl/RandomForest.scala  |  2 +-
 .../mllib/classification/LogisticRegression.scala |  2 +-
 .../org/apache/spark/mllib/classification/SVM.scala   |  2 +-
 .../spark/mllib/feature/VectorTransformer.scala   |  2 +-
 .../scala/org/apache/spark/mllib/linalg/Vectors.scala |  2 +-
 .../mllib/linalg/distributed/CoordinateMatrix.scala   |  2 +-
 .../apache/spark/mllib/rdd/MLPairRDDFunctions.scala   |  2 +-
 python/pyspark/ml/classification.py   |  4 ++--
 python/pyspark/ml/pipeline.py |  2 +-
 python/pyspark/mllib/classification.py|  2 +-
 python/pyspark/mllib/common.py|  2 +-
 python/pyspark/rdd.py |  4 ++--
 python/pyspark/sql/session.py |  2 +-
 python/pyspark/sql/streaming.py   |  2 +-
 python/pyspark/sql/types.py   |  2 +-
 python/pyspark/streaming/dstream.py   |  4 ++--
 .../src/main/scala/org/apache/spark/sql/Row.scala |  2 +-
 .../apache/spark/sql/catalyst/analysis/Analyzer.scala |  4 ++--
 .../sql/catalyst/analysis/FunctionRegistry.scala  |  2 +-
 .../sql/catalyst/analysis/MultiInstanceRelation.scala |  2 +-
 .../spark/sql/catalyst/catalog/SessionCatalog.scala   |  6 +++---
 .../sql/catalyst/catalog/functionResources.scala  |  2 +-
 .../sql/catalyst/expressions/ExpectsInputTypes.scala  |  2 +-
 .../spark/sql/catalyst/expressions/Projection.scala   |  4 ++--
 .../sql/catalyst/expressions/complexTypeCreator.scala |  2 +-
 .../org/apache/spark/sql/types/AbstractDataType.scala |  2 +-
 .../scala/org/apache/spark/sql/DataFrameReader.scala  |  2 +-
 .../main/scala/org/apache/spark/sql/SQLContext.scala  | 14 +++---
 .../scala/org/apache/spark/sql/SQLImplicits.scala |  2 +-
 .../scala/org/apache/spark/sql/SparkSession.scala | 14 +++---
 .../org/apache/spark/sql/catalyst/SQLBuilder.scala|  2 +-
 .../aggregate/SortBasedAggregationIterator.scala  |  2 +-
 .../apache/spark/sql/execution/aggregate/udaf.scala   |  2 +-
 .../execution/columnar/GenerateColumnAccessor.scala   |  2 +-
 .../execution/datasources/FileSourceStrategy.scala|  2 +-
 .../execution/datasources/json/JacksonParser.scala|  2 +-
 .../datasources/parquet/CatalystRowConverter.scala|  2 +-
 .../sql/execution/exchange/ExchangeCoordinator.scala  | 10 +-
 .../spark/sql/execution/joins/SortMergeJoinExec.scala |  2 +-
 .../spark/sql/execution/r/MapPartitionsRWrapper.scala |  2 +-
 .../scala/org/apache/spark/sql/expressions/udaf.scala |  2 +-
 .../org/apache/spark/sql/internal/SharedState.scala   |  2 +-
 .../apache/spark/sql/streaming/ContinuousQuery.scala  |  2 +-
 .../org/apache/spark/sql/hive/client/HiveClient.scala |  2 +-
 .../apache/spark/sql/hive/orc/OrcFileOperator.scala   | 

spark git commit: Revert "[SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaivour"

2016-06-06 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 9e7e2f916 -> 7d10e4bdd


Revert "[SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaivour"

This reverts commit 9e7e2f9164e0b3bd555e795b871626057b4fed31.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7d10e4bd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7d10e4bd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7d10e4bd

Branch: refs/heads/branch-2.0
Commit: 7d10e4bdd2adbeb10904665536e4949381f19cf5
Parents: 9e7e2f9
Author: Reynold Xin 
Authored: Sun Jun 5 23:40:35 2016 -0700
Committer: Reynold Xin 
Committed: Sun Jun 5 23:40:35 2016 -0700

--
 python/pyspark/sql/readwriter.py| 81 ++--
 .../execution/datasources/csv/CSVOptions.scala  | 11 +--
 .../execution/datasources/csv/CSVSuite.scala| 11 ---
 3 files changed, 48 insertions(+), 55 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7d10e4bd/python/pyspark/sql/readwriter.py
--
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 19aa8dd..9208a52 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -303,11 +303,10 @@ class DataFrameReader(object):
 return 
self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(path)))
 
 @since(2.0)
-def csv(self, path, schema=None, sep=u',', encoding=u'UTF-8', quote=u'\"', 
escape=u'\\',
-comment=None, header='false', ignoreLeadingWhiteSpace='false',
-ignoreTrailingWhiteSpace='false', nullValue='', nanValue='NaN', 
positiveInf='Inf',
-negativeInf='Inf', dateFormat=None, maxColumns='20480', 
maxCharsPerColumn='100',
-mode='PERMISSIVE'):
+def csv(self, path, schema=None, sep=None, encoding=None, quote=None, 
escape=None,
+comment=None, header=None, ignoreLeadingWhiteSpace=None, 
ignoreTrailingWhiteSpace=None,
+nullValue=None, nanValue=None, positiveInf=None, negativeInf=None, 
dateFormat=None,
+maxColumns=None, maxCharsPerColumn=None, mode=None):
 """Loads a CSV file and returns the result as a [[DataFrame]].
 
 This function goes through the input once to determine the input 
schema. To avoid going
@@ -316,41 +315,44 @@ class DataFrameReader(object):
 :param path: string, or list of strings, for input path(s).
 :param schema: an optional :class:`StructType` for the input schema.
 :param sep: sets the single character as a separator for each field 
and value.
-The default value is ``,``.
-:param encoding: decodes the CSV files by the given encoding type.
-The default value is ``UTF-8``.
+If None is set, it uses the default value, ``,``.
+:param encoding: decodes the CSV files by the given encoding type. If 
None is set,
+ it uses the default value, ``UTF-8``.
 :param quote: sets the single character used for escaping quoted 
values where the
-  separator can be part of the value. The default value is 
``"``.
+  separator can be part of the value. If None is set, it 
uses the default
+  value, ``"``.
 :param escape: sets the single character used for escaping quotes 
inside an already
-   quoted value. The default value is ``\``.
+   quoted value. If None is set, it uses the default 
value, ``\``.
 :param comment: sets the single character used for skipping lines 
beginning with this
 character. By default (None), it is disabled.
-:param header: uses the first line as names of columns. The default 
value is ``false``.
+:param header: uses the first line as names of columns. If None is 
set, it uses the
+   default value, ``false``.
 :param ignoreLeadingWhiteSpace: defines whether or not leading 
whitespaces from values
-being read should be skipped. The 
default value is
-``false``.
+being read should be skipped. If None 
is set, it uses
+the default value, ``false``.
 :param ignoreTrailingWhiteSpace: defines whether or not trailing 
whitespaces from values
- being read should be skipped. The 
default value is
- ``false``.
-:param nullValue: sets the string representation of a null value. The 
default value 

spark git commit: Revert "[SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaivour"

2016-06-06 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master b7e8d1cb3 -> 32f2f95db


Revert "[SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaivour"

This reverts commit b7e8d1cb3ce932ba4a784be59744af8a8ef027ce.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/32f2f95d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/32f2f95d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/32f2f95d

Branch: refs/heads/master
Commit: 32f2f95dbdfb21491e46d4b608fd4e8ac7ab8973
Parents: b7e8d1c
Author: Reynold Xin 
Authored: Sun Jun 5 23:40:13 2016 -0700
Committer: Reynold Xin 
Committed: Sun Jun 5 23:40:13 2016 -0700

--
 python/pyspark/sql/readwriter.py| 81 ++--
 .../execution/datasources/csv/CSVOptions.scala  | 11 +--
 .../execution/datasources/csv/CSVSuite.scala| 11 ---
 3 files changed, 48 insertions(+), 55 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/32f2f95d/python/pyspark/sql/readwriter.py
--
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 19aa8dd..9208a52 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -303,11 +303,10 @@ class DataFrameReader(object):
 return 
self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(path)))
 
 @since(2.0)
-def csv(self, path, schema=None, sep=u',', encoding=u'UTF-8', quote=u'\"', 
escape=u'\\',
-comment=None, header='false', ignoreLeadingWhiteSpace='false',
-ignoreTrailingWhiteSpace='false', nullValue='', nanValue='NaN', 
positiveInf='Inf',
-negativeInf='Inf', dateFormat=None, maxColumns='20480', 
maxCharsPerColumn='100',
-mode='PERMISSIVE'):
+def csv(self, path, schema=None, sep=None, encoding=None, quote=None, 
escape=None,
+comment=None, header=None, ignoreLeadingWhiteSpace=None, 
ignoreTrailingWhiteSpace=None,
+nullValue=None, nanValue=None, positiveInf=None, negativeInf=None, 
dateFormat=None,
+maxColumns=None, maxCharsPerColumn=None, mode=None):
 """Loads a CSV file and returns the result as a [[DataFrame]].
 
 This function goes through the input once to determine the input 
schema. To avoid going
@@ -316,41 +315,44 @@ class DataFrameReader(object):
 :param path: string, or list of strings, for input path(s).
 :param schema: an optional :class:`StructType` for the input schema.
 :param sep: sets the single character as a separator for each field 
and value.
-The default value is ``,``.
-:param encoding: decodes the CSV files by the given encoding type.
-The default value is ``UTF-8``.
+If None is set, it uses the default value, ``,``.
+:param encoding: decodes the CSV files by the given encoding type. If 
None is set,
+ it uses the default value, ``UTF-8``.
 :param quote: sets the single character used for escaping quoted 
values where the
-  separator can be part of the value. The default value is 
``"``.
+  separator can be part of the value. If None is set, it 
uses the default
+  value, ``"``.
 :param escape: sets the single character used for escaping quotes 
inside an already
-   quoted value. The default value is ``\``.
+   quoted value. If None is set, it uses the default 
value, ``\``.
 :param comment: sets the single character used for skipping lines 
beginning with this
 character. By default (None), it is disabled.
-:param header: uses the first line as names of columns. The default 
value is ``false``.
+:param header: uses the first line as names of columns. If None is 
set, it uses the
+   default value, ``false``.
 :param ignoreLeadingWhiteSpace: defines whether or not leading 
whitespaces from values
-being read should be skipped. The 
default value is
-``false``.
+being read should be skipped. If None 
is set, it uses
+the default value, ``false``.
 :param ignoreTrailingWhiteSpace: defines whether or not trailing 
whitespaces from values
- being read should be skipped. The 
default value is
- ``false``.
-:param nullValue: sets the string representation of a null value. The 
default value is a
-   

spark git commit: [SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaivour

2016-06-06 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 79268aa46 -> b7e8d1cb3


[SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaivour

## What changes were proposed in this pull request?
This PR fixes the behaviour of `format("csv").option("quote", null)` to match that of spark-csv.
It also explicitly sets default values for the CSV options in Python.
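
For orientation, a minimal Scala sketch of the reader options this touches; the path is a placeholder, and the explicit values shown are the stock defaults rather than behaviour introduced by this PR:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CsvOptionsSketch").master("local[*]").getOrCreate()

// Setting quote/escape explicitly rather than relying on the defaults.
val df = spark.read
  .option("header", "true")
  .option("quote", "\"")    // default quote character
  .option("escape", "\\")   // default escape character
  .csv("/tmp/people.csv")   // placeholder path

df.printSchema()
```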

## How was this patch tested?
Added tests in CSVSuite.

Author: Takeshi YAMAMURO 

Closes #13372 from maropu/SPARK-15585.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b7e8d1cb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b7e8d1cb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b7e8d1cb

Branch: refs/heads/master
Commit: b7e8d1cb3ce932ba4a784be59744af8a8ef027ce
Parents: 79268aa
Author: Takeshi YAMAMURO 
Authored: Sun Jun 5 23:35:04 2016 -0700
Committer: Reynold Xin 
Committed: Sun Jun 5 23:35:04 2016 -0700

--
 python/pyspark/sql/readwriter.py| 81 ++--
 .../execution/datasources/csv/CSVOptions.scala  | 11 ++-
 .../execution/datasources/csv/CSVSuite.scala| 11 +++
 3 files changed, 55 insertions(+), 48 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b7e8d1cb/python/pyspark/sql/readwriter.py
--
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 9208a52..19aa8dd 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -303,10 +303,11 @@ class DataFrameReader(object):
 return 
self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(path)))
 
 @since(2.0)
-def csv(self, path, schema=None, sep=None, encoding=None, quote=None, 
escape=None,
-comment=None, header=None, ignoreLeadingWhiteSpace=None, 
ignoreTrailingWhiteSpace=None,
-nullValue=None, nanValue=None, positiveInf=None, negativeInf=None, 
dateFormat=None,
-maxColumns=None, maxCharsPerColumn=None, mode=None):
+def csv(self, path, schema=None, sep=u',', encoding=u'UTF-8', quote=u'\"', 
escape=u'\\',
+comment=None, header='false', ignoreLeadingWhiteSpace='false',
+ignoreTrailingWhiteSpace='false', nullValue='', nanValue='NaN', 
positiveInf='Inf',
+negativeInf='Inf', dateFormat=None, maxColumns='20480', 
maxCharsPerColumn='100',
+mode='PERMISSIVE'):
 """Loads a CSV file and returns the result as a [[DataFrame]].
 
 This function goes through the input once to determine the input 
schema. To avoid going
@@ -315,44 +316,41 @@ class DataFrameReader(object):
 :param path: string, or list of strings, for input path(s).
 :param schema: an optional :class:`StructType` for the input schema.
 :param sep: sets the single character as a separator for each field 
and value.
-If None is set, it uses the default value, ``,``.
-:param encoding: decodes the CSV files by the given encoding type. If 
None is set,
- it uses the default value, ``UTF-8``.
+The default value is ``,``.
+:param encoding: decodes the CSV files by the given encoding type.
+The default value is ``UTF-8``.
 :param quote: sets the single character used for escaping quoted 
values where the
-  separator can be part of the value. If None is set, it 
uses the default
-  value, ``"``.
+  separator can be part of the value. The default value is 
``"``.
 :param escape: sets the single character used for escaping quotes 
inside an already
-   quoted value. If None is set, it uses the default 
value, ``\``.
+   quoted value. The default value is ``\``.
 :param comment: sets the single character used for skipping lines 
beginning with this
 character. By default (None), it is disabled.
-:param header: uses the first line as names of columns. If None is 
set, it uses the
-   default value, ``false``.
+:param header: uses the first line as names of columns. The default 
value is ``false``.
 :param ignoreLeadingWhiteSpace: defines whether or not leading 
whitespaces from values
-being read should be skipped. If None 
is set, it uses
-the default value, ``false``.
+being read should be skipped. The 
default value is
+``false``.
 :param ignoreTrailingWhiteSpace: 

spark git commit: [SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaivour

2016-06-06 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 790de600b -> 9e7e2f916


[SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaivour

## What changes were proposed in this pull request?
This PR fixes the behaviour of `format("csv").option("quote", null)` to match that of spark-csv.
It also explicitly sets default values for the CSV options in Python.

## How was this patch tested?
Added tests in CSVSuite.

Author: Takeshi YAMAMURO 

Closes #13372 from maropu/SPARK-15585.

(cherry picked from commit b7e8d1cb3ce932ba4a784be59744af8a8ef027ce)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9e7e2f91
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9e7e2f91
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9e7e2f91

Branch: refs/heads/branch-2.0
Commit: 9e7e2f9164e0b3bd555e795b871626057b4fed31
Parents: 790de60
Author: Takeshi YAMAMURO 
Authored: Sun Jun 5 23:35:04 2016 -0700
Committer: Reynold Xin 
Committed: Sun Jun 5 23:35:10 2016 -0700

--
 python/pyspark/sql/readwriter.py| 81 ++--
 .../execution/datasources/csv/CSVOptions.scala  | 11 ++-
 .../execution/datasources/csv/CSVSuite.scala| 11 +++
 3 files changed, 55 insertions(+), 48 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9e7e2f91/python/pyspark/sql/readwriter.py
--
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 9208a52..19aa8dd 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -303,10 +303,11 @@ class DataFrameReader(object):
 return 
self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(path)))
 
 @since(2.0)
-def csv(self, path, schema=None, sep=None, encoding=None, quote=None, 
escape=None,
-comment=None, header=None, ignoreLeadingWhiteSpace=None, 
ignoreTrailingWhiteSpace=None,
-nullValue=None, nanValue=None, positiveInf=None, negativeInf=None, 
dateFormat=None,
-maxColumns=None, maxCharsPerColumn=None, mode=None):
+def csv(self, path, schema=None, sep=u',', encoding=u'UTF-8', quote=u'\"', 
escape=u'\\',
+comment=None, header='false', ignoreLeadingWhiteSpace='false',
+ignoreTrailingWhiteSpace='false', nullValue='', nanValue='NaN', 
positiveInf='Inf',
+negativeInf='Inf', dateFormat=None, maxColumns='20480', 
maxCharsPerColumn='100',
+mode='PERMISSIVE'):
 """Loads a CSV file and returns the result as a [[DataFrame]].
 
 This function goes through the input once to determine the input 
schema. To avoid going
@@ -315,44 +316,41 @@ class DataFrameReader(object):
 :param path: string, or list of strings, for input path(s).
 :param schema: an optional :class:`StructType` for the input schema.
 :param sep: sets the single character as a separator for each field 
and value.
-If None is set, it uses the default value, ``,``.
-:param encoding: decodes the CSV files by the given encoding type. If 
None is set,
- it uses the default value, ``UTF-8``.
+The default value is ``,``.
+:param encoding: decodes the CSV files by the given encoding type.
+The default value is ``UTF-8``.
 :param quote: sets the single character used for escaping quoted 
values where the
-  separator can be part of the value. If None is set, it 
uses the default
-  value, ``"``.
+  separator can be part of the value. The default value is 
``"``.
 :param escape: sets the single character used for escaping quotes 
inside an already
-   quoted value. If None is set, it uses the default 
value, ``\``.
+   quoted value. The default value is ``\``.
 :param comment: sets the single character used for skipping lines 
beginning with this
 character. By default (None), it is disabled.
-:param header: uses the first line as names of columns. If None is 
set, it uses the
-   default value, ``false``.
+:param header: uses the first line as names of columns. The default 
value is ``false``.
 :param ignoreLeadingWhiteSpace: defines whether or not leading 
whitespaces from values
-being read should be skipped. If None 
is set, it uses
-the default value, ``false``.
+being read should be