spark git commit: [SPARK-11608][MLLIB][DOC] Added migration guide for MLlib 1.6
Repository: spark Updated Branches: refs/heads/master 6a880afa8 -> 8148cc7a5 [SPARK-11608][MLLIB][DOC] Added migration guide for MLlib 1.6 No known breaking changes, but some deprecations and changes of behavior. CC: mengxr Author: Joseph K. BradleyCloses #10235 from jkbradley/mllib-guide-update-1.6. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8148cc7a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8148cc7a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8148cc7a Branch: refs/heads/master Commit: 8148cc7a5c9f52c82c2eb7652d9aeba85e72d406 Parents: 6a880af Author: Joseph K. Bradley Authored: Wed Dec 16 11:53:04 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 11:53:04 2015 -0800 -- docs/mllib-guide.md| 38 ++--- docs/mllib-migration-guides.md | 19 +++ 2 files changed, 42 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8148cc7a/docs/mllib-guide.md -- diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index 680ed48..7ef91a1 100644 --- a/docs/mllib-guide.md +++ b/docs/mllib-guide.md @@ -74,7 +74,7 @@ We list major functionality from both below, with links to detailed guides. * [Advanced topics](ml-advanced.html) Some techniques are not available yet in spark.ml, most notably dimensionality reduction -Users can seemlessly combine the implementation of these techniques found in `spark.mllib` with the rest of the algorithms found in `spark.ml`. +Users can seamlessly combine the implementation of these techniques found in `spark.mllib` with the rest of the algorithms found in `spark.ml`. # Dependencies @@ -101,24 +101,32 @@ MLlib is under active development. The APIs marked `Experimental`/`DeveloperApi` may change in future releases, and the migration guide below will explain all changes between releases. -## From 1.4 to 1.5 +## From 1.5 to 1.6 -In the `spark.mllib` package, there are no break API changes but several behavior changes: +There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are +deprecations and changes of behavior. -* [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005): - `RegressionMetrics.explainedVariance` returns the average regression sum of squares. -* [SPARK-8600](https://issues.apache.org/jira/browse/SPARK-8600): `NaiveBayesModel.labels` become - sorted. -* [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382): `GradientDescent` has a default - convergence tolerance `1e-3`, and hence iterations might end earlier than 1.4. +Deprecations: -In the `spark.ml` package, there exists one break API change and one behavior change: +* [SPARK-11358](https://issues.apache.org/jira/browse/SPARK-11358): + In `spark.mllib.clustering.KMeans`, the `runs` parameter has been deprecated. +* [SPARK-10592](https://issues.apache.org/jira/browse/SPARK-10592): + In `spark.ml.classification.LogisticRegressionModel` and + `spark.ml.regression.LinearRegressionModel`, the `weights` field has been deprecated in favor of + the new name `coefficients`. This helps disambiguate from instance (row) "weights" given to + algorithms. -* [SPARK-9268](https://issues.apache.org/jira/browse/SPARK-9268): Java's varargs support is removed - from `Params.setDefault` due to a - [Scala compiler bug](https://issues.scala-lang.org/browse/SI-9013). -* [SPARK-10097](https://issues.apache.org/jira/browse/SPARK-10097): `Evaluator.isLargerBetter` is - added to indicate metric ordering. 
Metrics like RMSE no longer flip signs as in 1.4. +Changes of behavior: + +* [SPARK-7770](https://issues.apache.org/jira/browse/SPARK-7770): + `spark.mllib.tree.GradientBoostedTrees`: `validationTol` has changed semantics in 1.6. + Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of + `GradientDescent`'s `convergenceTol`: For large errors, it uses relative error (relative to the + previous error); for small errors (`< 0.01`), it uses absolute error. +* [SPARK-11069](https://issues.apache.org/jira/browse/SPARK-11069): + `spark.ml.feature.RegexTokenizer`: Previously, it did not convert strings to lowercase before + tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the + behavior of the simpler `Tokenizer` transformer. ## Previous Spark versions http://git-wip-us.apache.org/repos/asf/spark/blob/8148cc7a/docs/mllib-migration-guides.md -- diff --git a/docs/mllib-migration-guides.md b/docs/mllib-migration-guides.md index
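For code that relied on the old `RegexTokenizer` behavior, the new lowercasing default can be switched off explicitly. A minimal Scala sketch, assuming a DataFrame with an illustrative `sentence` column:

```scala
import org.apache.spark.ml.feature.RegexTokenizer

// As of 1.6, RegexTokenizer lowercases its input by default, matching Tokenizer.
// setToLowercase(false) restores the pre-1.6 case-preserving behavior.
val tokenizer = new RegexTokenizer()
  .setInputCol("sentence")   // illustrative column name
  .setOutputCol("words")
  .setToLowercase(false)
// tokenizer.transform(df) then yields case-preserving tokens, as before 1.6.
```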
spark git commit: [SPARK-12310][SPARKR] Add write.json and write.parquet for SparkR
Repository: spark Updated Branches: refs/heads/branch-1.6 a2d584ed9 -> ac0e2ea7c [SPARK-12310][SPARKR] Add write.json and write.parquet for SparkR Add ```write.json``` and ```write.parquet``` for SparkR, and deprecated ```saveAsParquetFile```. Author: Yanbo LiangCloses #10281 from yanboliang/spark-12310. (cherry picked from commit 22f6cd86fc2e2d6f6ad2c3aae416732c46ebf1b1) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ac0e2ea7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ac0e2ea7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ac0e2ea7 Branch: refs/heads/branch-1.6 Commit: ac0e2ea7c712e91503b02ae3c12fa2fcf5079886 Parents: a2d584e Author: Yanbo Liang Authored: Wed Dec 16 10:34:30 2015 -0800 Committer: Shivaram Venkataraman Committed: Wed Dec 16 10:34:54 2015 -0800 -- R/pkg/NAMESPACE | 4 +- R/pkg/R/DataFrame.R | 51 ++-- R/pkg/R/generics.R| 16 +++- R/pkg/inst/tests/testthat/test_sparkSQL.R | 104 ++--- 4 files changed, 119 insertions(+), 56 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ac0e2ea7/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index cab39d6..ccc01fe 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -92,7 +92,9 @@ exportMethods("arrange", "with", "withColumn", "withColumnRenamed", - "write.df") + "write.df", + "write.json", + "write.parquet") exportClasses("Column") http://git-wip-us.apache.org/repos/asf/spark/blob/ac0e2ea7/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 764597d..7292433 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -596,17 +596,44 @@ setMethod("toJSON", RDD(jrdd, serializedMode = "string") }) -#' saveAsParquetFile +#' write.json +#' +#' Save the contents of a DataFrame as a JSON file (one object per line). Files written out +#' with this method can be read back in as a DataFrame using read.json(). +#' +#' @param x A SparkSQL DataFrame +#' @param path The directory where the file is saved +#' +#' @family DataFrame functions +#' @rdname write.json +#' @name write.json +#' @export +#' @examples +#'\dontrun{ +#' sc <- sparkR.init() +#' sqlContext <- sparkRSQL.init(sc) +#' path <- "path/to/file.json" +#' df <- read.json(sqlContext, path) +#' write.json(df, "/tmp/sparkr-tmp/") +#'} +setMethod("write.json", + signature(x = "DataFrame", path = "character"), + function(x, path) { +write <- callJMethod(x@sdf, "write") +invisible(callJMethod(write, "json", path)) + }) + +#' write.parquet #' #' Save the contents of a DataFrame as a Parquet file, preserving the schema. Files written out -#' with this method can be read back in as a DataFrame using parquetFile(). +#' with this method can be read back in as a DataFrame using read.parquet(). 
#' #' @param x A SparkSQL DataFrame #' @param path The directory where the file is saved #' #' @family DataFrame functions -#' @rdname saveAsParquetFile -#' @name saveAsParquetFile +#' @rdname write.parquet +#' @name write.parquet #' @export #' @examples #'\dontrun{ @@ -614,12 +641,24 @@ setMethod("toJSON", #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" #' df <- read.json(sqlContext, path) -#' saveAsParquetFile(df, "/tmp/sparkr-tmp/") +#' write.parquet(df, "/tmp/sparkr-tmp1/") +#' saveAsParquetFile(df, "/tmp/sparkr-tmp2/") #'} +setMethod("write.parquet", + signature(x = "DataFrame", path = "character"), + function(x, path) { +write <- callJMethod(x@sdf, "write") +invisible(callJMethod(write, "parquet", path)) + }) + +#' @rdname write.parquet +#' @name saveAsParquetFile +#' @export setMethod("saveAsParquetFile", signature(x = "DataFrame", path = "character"), function(x, path) { -invisible(callJMethod(x@sdf, "saveAsParquetFile", path)) +.Deprecated("write.parquet") +write.parquet(x, path) }) #' Distinct http://git-wip-us.apache.org/repos/asf/spark/blob/ac0e2ea7/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index c383e6e..62be2dd 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -519,10 +519,6 @@
spark git commit: [SPARK-12310][SPARKR] Add write.json and write.parquet for SparkR
Repository: spark Updated Branches: refs/heads/master 2eb5af5f0 -> 22f6cd86f [SPARK-12310][SPARKR] Add write.json and write.parquet for SparkR Add ```write.json``` and ```write.parquet``` for SparkR, and deprecated ```saveAsParquetFile```. Author: Yanbo LiangCloses #10281 from yanboliang/spark-12310. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/22f6cd86 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/22f6cd86 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/22f6cd86 Branch: refs/heads/master Commit: 22f6cd86fc2e2d6f6ad2c3aae416732c46ebf1b1 Parents: 2eb5af5 Author: Yanbo Liang Authored: Wed Dec 16 10:34:30 2015 -0800 Committer: Shivaram Venkataraman Committed: Wed Dec 16 10:34:30 2015 -0800 -- R/pkg/NAMESPACE | 4 +- R/pkg/R/DataFrame.R | 51 ++-- R/pkg/R/generics.R| 16 +++- R/pkg/inst/tests/testthat/test_sparkSQL.R | 104 ++--- 4 files changed, 119 insertions(+), 56 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/22f6cd86/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index cab39d6..ccc01fe 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -92,7 +92,9 @@ exportMethods("arrange", "with", "withColumn", "withColumnRenamed", - "write.df") + "write.df", + "write.json", + "write.parquet") exportClasses("Column") http://git-wip-us.apache.org/repos/asf/spark/blob/22f6cd86/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 380a13f..0cfa12b9 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -596,17 +596,44 @@ setMethod("toJSON", RDD(jrdd, serializedMode = "string") }) -#' saveAsParquetFile +#' write.json +#' +#' Save the contents of a DataFrame as a JSON file (one object per line). Files written out +#' with this method can be read back in as a DataFrame using read.json(). +#' +#' @param x A SparkSQL DataFrame +#' @param path The directory where the file is saved +#' +#' @family DataFrame functions +#' @rdname write.json +#' @name write.json +#' @export +#' @examples +#'\dontrun{ +#' sc <- sparkR.init() +#' sqlContext <- sparkRSQL.init(sc) +#' path <- "path/to/file.json" +#' df <- read.json(sqlContext, path) +#' write.json(df, "/tmp/sparkr-tmp/") +#'} +setMethod("write.json", + signature(x = "DataFrame", path = "character"), + function(x, path) { +write <- callJMethod(x@sdf, "write") +invisible(callJMethod(write, "json", path)) + }) + +#' write.parquet #' #' Save the contents of a DataFrame as a Parquet file, preserving the schema. Files written out -#' with this method can be read back in as a DataFrame using parquetFile(). +#' with this method can be read back in as a DataFrame using read.parquet(). 
#' #' @param x A SparkSQL DataFrame #' @param path The directory where the file is saved #' #' @family DataFrame functions -#' @rdname saveAsParquetFile -#' @name saveAsParquetFile +#' @rdname write.parquet +#' @name write.parquet #' @export #' @examples #'\dontrun{ @@ -614,12 +641,24 @@ setMethod("toJSON", #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" #' df <- read.json(sqlContext, path) -#' saveAsParquetFile(df, "/tmp/sparkr-tmp/") +#' write.parquet(df, "/tmp/sparkr-tmp1/") +#' saveAsParquetFile(df, "/tmp/sparkr-tmp2/") #'} +setMethod("write.parquet", + signature(x = "DataFrame", path = "character"), + function(x, path) { +write <- callJMethod(x@sdf, "write") +invisible(callJMethod(write, "parquet", path)) + }) + +#' @rdname write.parquet +#' @name saveAsParquetFile +#' @export setMethod("saveAsParquetFile", signature(x = "DataFrame", path = "character"), function(x, path) { -invisible(callJMethod(x@sdf, "saveAsParquetFile", path)) +.Deprecated("write.parquet") +write.parquet(x, path) }) #' Distinct http://git-wip-us.apache.org/repos/asf/spark/blob/22f6cd86/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index c383e6e..62be2dd 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -519,10 +519,6 @@ setGeneric("sample_frac", #' @export setGeneric("sampleBy", function(x, col, fractions, seed) { standardGeneric("sampleBy") }) -#' @rdname saveAsParquetFile
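Both R wrappers delegate to the JVM `DataFrameWriter` through `callJMethod`, as the `setMethod` bodies above show. A minimal sketch of the equivalent direct Scala calls, assuming an active `sqlContext` and using placeholder paths:

```scala
// Scala equivalents of the new SparkR wrappers (paths are placeholders):
val df = sqlContext.read.json("path/to/file.json")
df.write.json("/tmp/sparkr-tmp/")      // what write.json(df, path) invokes
df.write.parquet("/tmp/sparkr-tmp1/")  // what write.parquet(df, path) invokes
```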
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v1.6.0-rc3 [created] 168c89e07
[1/2] spark git commit: Preparing Spark release v1.6.0-rc3
Repository: spark Updated Branches: refs/heads/branch-1.6 e1adf6d7d -> aee88eb55 Preparing Spark release v1.6.0-rc3 Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/168c89e0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/168c89e0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/168c89e0 Branch: refs/heads/branch-1.6 Commit: 168c89e07c51fa24b0bb88582c739cec0acb44d7 Parents: e1adf6d Author: Patrick WendellAuthored: Wed Dec 16 11:23:41 2015 -0800 Committer: Patrick Wendell Committed: Wed Dec 16 11:23:41 2015 -0800 -- assembly/pom.xml| 2 +- bagel/pom.xml | 2 +- core/pom.xml| 2 +- docker-integration-tests/pom.xml| 2 +- examples/pom.xml| 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml | 2 +- external/kafka-assembly/pom.xml | 2 +- external/kafka/pom.xml | 2 +- external/mqtt-assembly/pom.xml | 2 +- external/mqtt/pom.xml | 2 +- external/twitter/pom.xml| 2 +- external/zeromq/pom.xml | 2 +- extras/java8-tests/pom.xml | 2 +- extras/kinesis-asl-assembly/pom.xml | 2 +- extras/kinesis-asl/pom.xml | 2 +- extras/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml | 2 +- launcher/pom.xml| 2 +- mllib/pom.xml | 2 +- network/common/pom.xml | 2 +- network/shuffle/pom.xml | 2 +- network/yarn/pom.xml| 2 +- pom.xml | 2 +- repl/pom.xml| 2 +- sql/catalyst/pom.xml| 2 +- sql/core/pom.xml| 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml| 2 +- streaming/pom.xml | 2 +- tags/pom.xml| 2 +- tools/pom.xml | 2 +- unsafe/pom.xml | 2 +- yarn/pom.xml| 2 +- 35 files changed, 35 insertions(+), 35 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/168c89e0/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 4b60ee0..fbabaa5 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/168c89e0/bagel/pom.xml -- diff --git a/bagel/pom.xml b/bagel/pom.xml index 672e946..1b3e417 100644 --- a/bagel/pom.xml +++ b/bagel/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/168c89e0/core/pom.xml -- diff --git a/core/pom.xml b/core/pom.xml index 61744bb..15b8d75 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/168c89e0/docker-integration-tests/pom.xml -- diff --git a/docker-integration-tests/pom.xml b/docker-integration-tests/pom.xml index 39d3f34..d579879 100644 --- a/docker-integration-tests/pom.xml +++ b/docker-integration-tests/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/168c89e0/examples/pom.xml -- diff --git a/examples/pom.xml b/examples/pom.xml index f5ab2a7..37b15bb 100644 --- a/examples/pom.xml +++ b/examples/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/168c89e0/external/flume-assembly/pom.xml -- diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml index dceedcf..295455a 100644 --- a/external/flume-assembly/pom.xml +++ b/external/flume-assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../../pom.xml 
http://git-wip-us.apache.org/repos/asf/spark/blob/168c89e0/external/flume-sink/pom.xml
spark git commit: [SPARK-12364][ML][SPARKR] Add ML example for SparkR
Repository: spark Updated Branches: refs/heads/branch-1.6 dffa6100d -> 04e868b63 [SPARK-12364][ML][SPARKR] Add ML example for SparkR We have a DataFrame example for SparkR; we also need to add an ML example under ```examples/src/main/r```. cc mengxr jkbradley shivaram Author: Yanbo Liang. Closes #10324 from yanboliang/spark-12364. (cherry picked from commit 1a8b2a17db7ab7a213d553079b83274aeebba86f) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/04e868b6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/04e868b6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/04e868b6 Branch: refs/heads/branch-1.6 Commit: 04e868b63bfda5afe5cb1a0d6387fb873ad393ba Parents: dffa610 Author: Yanbo Liang Authored: Wed Dec 16 12:59:22 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 12:59:33 2015 -0800 -- examples/src/main/r/ml.R | 54 +++ 1 file changed, 54 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/04e868b6/examples/src/main/r/ml.R -- diff --git a/examples/src/main/r/ml.R b/examples/src/main/r/ml.R new file mode 100644 index 000..a0c9039 --- /dev/null +++ b/examples/src/main/r/ml.R @@ -0,0 +1,54 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# To run this example use +# ./bin/sparkR examples/src/main/r/ml.R + +# Load SparkR library into your R session +library(SparkR) + +# Initialize SparkContext and SQLContext +sc <- sparkR.init(appName="SparkR-ML-example") +sqlContext <- sparkRSQL.init(sc) + +# Train GLM of family 'gaussian' +training1 <- suppressWarnings(createDataFrame(sqlContext, iris)) +test1 <- training1 +model1 <- glm(Sepal_Length ~ Sepal_Width + Species, training1, family = "gaussian") + +# Model summary +summary(model1) + +# Prediction +predictions1 <- predict(model1, test1) +head(select(predictions1, "Sepal_Length", "prediction")) + +# Train GLM of family 'binomial' +training2 <- filter(training1, training1$Species != "setosa") +test2 <- training2 +model2 <- glm(Species ~ Sepal_Length + Sepal_Width, data = training2, family = "binomial") + +# Model summary +summary(model2) + +# Prediction (Currently the output of prediction for binomial GLM is the indexed label, +# we need to transform back to the original string label later) +predictions2 <- predict(model2, test2) +head(select(predictions2, "Species", "prediction")) + +# Stop the SparkContext now +sparkR.stop()
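SparkR's `glm` is implemented on top of `spark.ml`'s `RFormula` feature transformer. A rough Scala analogue of the gaussian-family model in `ml.R`, assuming the iris data is already loaded as a DataFrame named `irisDF`:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.regression.LinearRegression

// Sketch of glm(Sepal_Length ~ Sepal_Width + Species, ..., family = "gaussian");
// `irisDF` is an assumed DataFrame with the iris columns.
val formula = new RFormula().setFormula("Sepal_Length ~ Sepal_Width + Species")
val model = new Pipeline().setStages(Array(formula, new LinearRegression())).fit(irisDF)
model.transform(irisDF).select("Sepal_Length", "prediction").show(5)
```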
spark git commit: [SPARK-11608][MLLIB][DOC] Added migration guide for MLlib 1.6
Repository: spark Updated Branches: refs/heads/branch-1.6 aee88eb55 -> dffa6100d [SPARK-11608][MLLIB][DOC] Added migration guide for MLlib 1.6 No known breaking changes, but some deprecations and changes of behavior. CC: mengxr Author: Joseph K. BradleyCloses #10235 from jkbradley/mllib-guide-update-1.6. (cherry picked from commit 8148cc7a5c9f52c82c2eb7652d9aeba85e72d406) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dffa6100 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dffa6100 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dffa6100 Branch: refs/heads/branch-1.6 Commit: dffa6100d7d96eb38bf8a56f546d66f7a884b03f Parents: aee88eb Author: Joseph K. Bradley Authored: Wed Dec 16 11:53:04 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 11:53:15 2015 -0800 -- docs/mllib-guide.md| 38 ++--- docs/mllib-migration-guides.md | 19 +++ 2 files changed, 42 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dffa6100/docs/mllib-guide.md -- diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index 680ed48..7ef91a1 100644 --- a/docs/mllib-guide.md +++ b/docs/mllib-guide.md @@ -74,7 +74,7 @@ We list major functionality from both below, with links to detailed guides. * [Advanced topics](ml-advanced.html) Some techniques are not available yet in spark.ml, most notably dimensionality reduction -Users can seemlessly combine the implementation of these techniques found in `spark.mllib` with the rest of the algorithms found in `spark.ml`. +Users can seamlessly combine the implementation of these techniques found in `spark.mllib` with the rest of the algorithms found in `spark.ml`. # Dependencies @@ -101,24 +101,32 @@ MLlib is under active development. The APIs marked `Experimental`/`DeveloperApi` may change in future releases, and the migration guide below will explain all changes between releases. -## From 1.4 to 1.5 +## From 1.5 to 1.6 -In the `spark.mllib` package, there are no break API changes but several behavior changes: +There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are +deprecations and changes of behavior. -* [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005): - `RegressionMetrics.explainedVariance` returns the average regression sum of squares. -* [SPARK-8600](https://issues.apache.org/jira/browse/SPARK-8600): `NaiveBayesModel.labels` become - sorted. -* [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382): `GradientDescent` has a default - convergence tolerance `1e-3`, and hence iterations might end earlier than 1.4. +Deprecations: -In the `spark.ml` package, there exists one break API change and one behavior change: +* [SPARK-11358](https://issues.apache.org/jira/browse/SPARK-11358): + In `spark.mllib.clustering.KMeans`, the `runs` parameter has been deprecated. +* [SPARK-10592](https://issues.apache.org/jira/browse/SPARK-10592): + In `spark.ml.classification.LogisticRegressionModel` and + `spark.ml.regression.LinearRegressionModel`, the `weights` field has been deprecated in favor of + the new name `coefficients`. This helps disambiguate from instance (row) "weights" given to + algorithms. -* [SPARK-9268](https://issues.apache.org/jira/browse/SPARK-9268): Java's varargs support is removed - from `Params.setDefault` due to a - [Scala compiler bug](https://issues.scala-lang.org/browse/SI-9013). 
-* [SPARK-10097](https://issues.apache.org/jira/browse/SPARK-10097): `Evaluator.isLargerBetter` is - added to indicate metric ordering. Metrics like RMSE no longer flip signs as in 1.4. +Changes of behavior: + +* [SPARK-7770](https://issues.apache.org/jira/browse/SPARK-7770): + `spark.mllib.tree.GradientBoostedTrees`: `validationTol` has changed semantics in 1.6. + Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of + `GradientDescent`'s `convergenceTol`: For large errors, it uses relative error (relative to the + previous error); for small errors (`< 0.01`), it uses absolute error. +* [SPARK-11069](https://issues.apache.org/jira/browse/SPARK-11069): + `spark.ml.feature.RegexTokenizer`: Previously, it did not convert strings to lowercase before + tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the + behavior of the simpler `Tokenizer` transformer. ## Previous Spark versions http://git-wip-us.apache.org/repos/asf/spark/blob/dffa6100/docs/mllib-migration-guides.md
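The `validationTol` change above matters when early stopping is used via `runWithValidation`. A minimal sketch of where the tolerance is set, assuming `train` and `validation` are existing `RDD[LabeledPoint]`s:

```scala
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

val boostingStrategy = BoostingStrategy.defaultParams("Regression")
// In 1.6 this is a relative tolerance for large errors and an absolute
// threshold once the error drops below 0.01, mirroring GradientDescent's
// convergenceTol; before 1.6 it was always an absolute change in error.
boostingStrategy.validationTol = 1e-3
val model = new GradientBoostedTrees(boostingStrategy).runWithValidation(train, validation)
```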
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v1.6.0-rc3 [deleted] 00a39d9c0
spark git commit: [SPARK-12361][PYSPARK][TESTS] Should set PYSPARK_DRIVER_PYTHON before Python tests
Repository: spark Updated Branches: refs/heads/master d252b2d54 -> 6a880afa8 [SPARK-12361][PYSPARK][TESTS] Should set PYSPARK_DRIVER_PYTHON before Python tests Although this patch still doesn't solve the issue of why the return code is 0 (see the JIRA description), it resolves the Python version mismatch. Author: Jeff Zhang. Closes #10322 from zjffdu/SPARK-12361. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6a880afa Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6a880afa Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6a880afa Branch: refs/heads/master Commit: 6a880afa831348b413ba95b98ff089377b950666 Parents: d252b2d Author: Jeff Zhang Authored: Wed Dec 16 11:29:47 2015 -0800 Committer: Josh Rosen Committed: Wed Dec 16 11:29:51 2015 -0800 -- python/run-tests.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6a880afa/python/run-tests.py -- diff --git a/python/run-tests.py b/python/run-tests.py index f5857f8..ee73eb1 100755 --- a/python/run-tests.py +++ b/python/run-tests.py @@ -56,7 +56,8 @@ LOGGER = logging.getLogger() def run_individual_python_test(test_name, pyspark_python): env = dict(os.environ) -env.update({'SPARK_TESTING': '1', 'PYSPARK_PYTHON': which(pyspark_python)}) +env.update({'SPARK_TESTING': '1', 'PYSPARK_PYTHON': which(pyspark_python), +'PYSPARK_DRIVER_PYTHON': which(pyspark_python)}) LOGGER.debug("Starting test(%s): %s", pyspark_python, test_name) start_time = time.time() try:
spark git commit: [SPARK-12320][SQL] throw exception if the number of fields does not line up for Tuple encoder
Repository: spark Updated Branches: refs/heads/master 1a8b2a17d -> a783a8ed4 [SPARK-12320][SQL] throw exception if the number of fields does not line up for Tuple encoder Author: Wenchen FanCloses #10293 from cloud-fan/err-msg. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a783a8ed Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a783a8ed Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a783a8ed Branch: refs/heads/master Commit: a783a8ed49814a09fde653433a3d6de398ddf888 Parents: 1a8b2a1 Author: Wenchen Fan Authored: Wed Dec 16 13:18:56 2015 -0800 Committer: Michael Armbrust Committed: Wed Dec 16 13:20:12 2015 -0800 -- .../apache/spark/sql/catalyst/dsl/package.scala | 3 +- .../catalyst/encoders/ExpressionEncoder.scala | 36 +++- .../expressions/complexTypeExtractors.scala | 10 ++-- .../encoders/EncoderResolutionSuite.scala | 60 +--- .../catalyst/expressions/ComplexTypeSuite.scala | 2 +- 5 files changed, 93 insertions(+), 18 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a783a8ed/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala index e509711..8102c93 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala @@ -227,9 +227,10 @@ package object dsl { AttributeReference(s, mapType, nullable = true)() /** Creates a new AttributeReference of type struct */ - def struct(fields: StructField*): AttributeReference = struct(StructType(fields)) def struct(structType: StructType): AttributeReference = AttributeReference(s, structType, nullable = true)() + def struct(attrs: AttributeReference*): AttributeReference = +struct(StructType.fromAttributes(attrs)) } implicit class DslAttribute(a: AttributeReference) { http://git-wip-us.apache.org/repos/asf/spark/blob/a783a8ed/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala index 363178b..7a4401c 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala @@ -244,9 +244,41 @@ case class ExpressionEncoder[T]( def resolve( schema: Seq[Attribute], outerScopes: ConcurrentMap[String, AnyRef]): ExpressionEncoder[T] = { -val positionToAttribute = AttributeMap.toIndex(schema) +def fail(st: StructType, maxOrdinal: Int): Unit = { + throw new AnalysisException(s"Try to map ${st.simpleString} to Tuple${maxOrdinal + 1}, " + +"but failed as the number of fields does not line up.\n" + +" - Input schema: " + StructType.fromAttributes(schema).simpleString + "\n" + +" - Target schema: " + this.schema.simpleString) +} + +var maxOrdinal = -1 +fromRowExpression.foreach { + case b: BoundReference => if (b.ordinal > maxOrdinal) maxOrdinal = b.ordinal + case _ => +} +if (maxOrdinal >= 0 && maxOrdinal != schema.length - 1) { + fail(StructType.fromAttributes(schema), maxOrdinal) +} + val unbound = fromRowExpression transform { - case b: BoundReference => positionToAttribute(b.ordinal) + case b: 
BoundReference => schema(b.ordinal) +} + +val exprToMaxOrdinal = scala.collection.mutable.HashMap.empty[Expression, Int] +unbound.foreach { + case g: GetStructField => +val maxOrdinal = exprToMaxOrdinal.getOrElse(g.child, -1) +if (maxOrdinal < g.ordinal) { + exprToMaxOrdinal.update(g.child, g.ordinal) +} + case _ => +} +exprToMaxOrdinal.foreach { + case (expr, maxOrdinal) => +val schema = expr.dataType.asInstanceOf[StructType] +if (maxOrdinal != schema.length - 1) { + fail(schema, maxOrdinal) +} } val plan = Project(Alias(unbound, "")() :: Nil, LocalRelation(schema)) http://git-wip-us.apache.org/repos/asf/spark/blob/a783a8ed/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala
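The new check surfaces at analysis time when a Dataset's target tuple arity disagrees with the input schema. A small sketch of the resulting error, with invented column names:

```scala
import sqlContext.implicits._

// Three columns, but a two-field target type (names are illustrative):
val df = sqlContext.range(3).selectExpr("id", "id + 1 as a", "id + 2 as b")
val ds = df.as[(Long, Long)]
// org.apache.spark.sql.AnalysisException: Try to map struct<id:bigint,a:bigint,b:bigint>
// to Tuple2, but failed as the number of fields does not line up. ...
```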
spark git commit: [SPARK-12215][ML][DOC] User guide section for KMeans in spark.ml
Repository: spark Updated Branches: refs/heads/master 22f6cd86f -> 26d70bd2b [SPARK-12215][ML][DOC] User guide section for KMeans in spark.ml cc jkbradley Author: Yu ISHIKAWACloses #10244 from yu-iskw/SPARK-12215. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/26d70bd2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/26d70bd2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/26d70bd2 Branch: refs/heads/master Commit: 26d70bd2b42617ff731b6e9e6d77933b38597ebe Parents: 22f6cd8 Author: Yu ISHIKAWA Authored: Wed Dec 16 10:43:45 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 10:43:45 2015 -0800 -- docs/ml-clustering.md | 71 .../spark/examples/ml/JavaKMeansExample.java| 8 ++- .../spark/examples/ml/KMeansExample.scala | 49 +++--- 3 files changed, 100 insertions(+), 28 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/26d70bd2/docs/ml-clustering.md -- diff --git a/docs/ml-clustering.md b/docs/ml-clustering.md index a59f7e3..440c455 100644 --- a/docs/ml-clustering.md +++ b/docs/ml-clustering.md @@ -11,6 +11,77 @@ In this section, we introduce the pipeline API for [clustering in mllib](mllib-c * This will become a table of contents (this text will be scraped). {:toc} +## K-means + +[k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the +most commonly used clustering algorithms that clusters the data points into a +predefined number of clusters. The MLlib implementation includes a parallelized +variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method +called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf). + +`KMeans` is implemented as an `Estimator` and generates a `KMeansModel` as the base model. + +### Input Columns + + + + + Param name + Type(s) + Default + Description + + + + + featuresCol + Vector + "features" + Feature vector + + + + +### Output Columns + + + + + Param name + Type(s) + Default + Description + + + + + predictionCol + Int + "prediction" + Predicted cluster center + + + + +### Example + + + + +Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.KMeans) for more details. + +{% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %} + + + +Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/KMeans.html) for more details. 
+ +{% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %} + + + + + ## Latent Dirichlet allocation (LDA) `LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and `OnlineLDAOptimizer`, http://git-wip-us.apache.org/repos/asf/spark/blob/26d70bd2/examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java -- diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java index 47665ff..96481d8 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java @@ -23,6 +23,9 @@ import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.api.java.function.Function; +import org.apache.spark.sql.SQLContext; +import org.apache.spark.sql.catalyst.expressions.GenericRow; +// $example on$ import org.apache.spark.ml.clustering.KMeansModel; import org.apache.spark.ml.clustering.KMeans; import org.apache.spark.mllib.linalg.Vector; @@ -30,11 +33,10 @@ import org.apache.spark.mllib.linalg.VectorUDT; import org.apache.spark.mllib.linalg.Vectors; import org.apache.spark.sql.DataFrame; import org.apache.spark.sql.Row; -import org.apache.spark.sql.SQLContext; -import org.apache.spark.sql.catalyst.expressions.GenericRow; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; +// $example off$ /** @@ -74,6 +76,7 @@ public class JavaKMeansExample { JavaSparkContext jsc = new JavaSparkContext(conf); SQLContext sqlContext = new SQLContext(jsc); +// $example on$ // Loads data JavaRDD points = jsc.textFile(inputFile).map(new ParsePoint()); StructField[] fields = {new StructField("features", new VectorUDT(), false,
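The guide's new section boils down to the following `spark.ml` usage; a condensed sketch, assuming `dataset` is a DataFrame with a `Vector` column named "features":

```scala
import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans()
  .setK(2)
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
val model = kmeans.fit(dataset)
model.clusterCenters.foreach(println)                  // learned cluster centers
model.transform(dataset).select("prediction").show(5)  // per-row cluster assignments
```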
spark git commit: [SPARK-12215][ML][DOC] User guide section for KMeans in spark.ml
Repository: spark Updated Branches: refs/heads/branch-1.6 ac0e2ea7c -> 16edd933d [SPARK-12215][ML][DOC] User guide section for KMeans in spark.ml cc jkbradley Author: Yu ISHIKAWACloses #10244 from yu-iskw/SPARK-12215. (cherry picked from commit 26d70bd2b42617ff731b6e9e6d77933b38597ebe) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/16edd933 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/16edd933 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/16edd933 Branch: refs/heads/branch-1.6 Commit: 16edd933d7323f8b6861409bbd62bc1efe244c14 Parents: ac0e2ea Author: Yu ISHIKAWA Authored: Wed Dec 16 10:43:45 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 10:43:55 2015 -0800 -- docs/ml-clustering.md | 71 .../spark/examples/ml/JavaKMeansExample.java| 8 ++- .../spark/examples/ml/KMeansExample.scala | 49 +++--- 3 files changed, 100 insertions(+), 28 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/16edd933/docs/ml-clustering.md -- diff --git a/docs/ml-clustering.md b/docs/ml-clustering.md index a59f7e3..440c455 100644 --- a/docs/ml-clustering.md +++ b/docs/ml-clustering.md @@ -11,6 +11,77 @@ In this section, we introduce the pipeline API for [clustering in mllib](mllib-c * This will become a table of contents (this text will be scraped). {:toc} +## K-means + +[k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the +most commonly used clustering algorithms that clusters the data points into a +predefined number of clusters. The MLlib implementation includes a parallelized +variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method +called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf). + +`KMeans` is implemented as an `Estimator` and generates a `KMeansModel` as the base model. + +### Input Columns + + + + + Param name + Type(s) + Default + Description + + + + + featuresCol + Vector + "features" + Feature vector + + + + +### Output Columns + + + + + Param name + Type(s) + Default + Description + + + + + predictionCol + Int + "prediction" + Predicted cluster center + + + + +### Example + + + + +Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.KMeans) for more details. + +{% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %} + + + +Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/KMeans.html) for more details. 
+ +{% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %} + + + + + ## Latent Dirichlet allocation (LDA) `LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and `OnlineLDAOptimizer`, http://git-wip-us.apache.org/repos/asf/spark/blob/16edd933/examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java -- diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java index 47665ff..96481d8 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java @@ -23,6 +23,9 @@ import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.api.java.function.Function; +import org.apache.spark.sql.SQLContext; +import org.apache.spark.sql.catalyst.expressions.GenericRow; +// $example on$ import org.apache.spark.ml.clustering.KMeansModel; import org.apache.spark.ml.clustering.KMeans; import org.apache.spark.mllib.linalg.Vector; @@ -30,11 +33,10 @@ import org.apache.spark.mllib.linalg.VectorUDT; import org.apache.spark.mllib.linalg.Vectors; import org.apache.spark.sql.DataFrame; import org.apache.spark.sql.Row; -import org.apache.spark.sql.SQLContext; -import org.apache.spark.sql.catalyst.expressions.GenericRow; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; +// $example off$ /** @@ -74,6 +76,7 @@ public class JavaKMeansExample { JavaSparkContext jsc = new JavaSparkContext(conf); SQLContext sqlContext = new SQLContext(jsc); +// $example on$ // Loads data JavaRDD points =
spark git commit: [SPARK-9694][ML] Add random seed Param to Scala CrossValidator
Repository: spark Updated Branches: refs/heads/master 7b6dc29d0 -> 860dc7f2f [SPARK-9694][ML] Add random seed Param to Scala CrossValidator Add random seed Param to Scala CrossValidator Author: Yanbo LiangCloses #9108 from yanboliang/spark-9694. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/860dc7f2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/860dc7f2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/860dc7f2 Branch: refs/heads/master Commit: 860dc7f2f8dd01f2562ba83b7af27ba29d91cb62 Parents: 7b6dc29 Author: Yanbo Liang Authored: Wed Dec 16 11:05:37 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 11:05:37 2015 -0800 -- .../org/apache/spark/ml/tuning/CrossValidator.scala | 11 --- .../main/scala/org/apache/spark/mllib/util/MLUtils.scala | 8 2 files changed, 16 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/860dc7f2/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala b/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala index 5c09f1a..40f8857 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala @@ -29,8 +29,9 @@ import org.apache.spark.ml.classification.OneVsRestParams import org.apache.spark.ml.evaluation.Evaluator import org.apache.spark.ml.feature.RFormulaModel import org.apache.spark.ml.param._ -import org.apache.spark.ml.util.DefaultParamsReader.Metadata +import org.apache.spark.ml.param.shared.HasSeed import org.apache.spark.ml.util._ +import org.apache.spark.ml.util.DefaultParamsReader.Metadata import org.apache.spark.mllib.util.MLUtils import org.apache.spark.sql.DataFrame import org.apache.spark.sql.types.StructType @@ -39,7 +40,7 @@ import org.apache.spark.sql.types.StructType /** * Params for [[CrossValidator]] and [[CrossValidatorModel]]. */ -private[ml] trait CrossValidatorParams extends ValidatorParams { +private[ml] trait CrossValidatorParams extends ValidatorParams with HasSeed { /** * Param for number of folds for cross validation. Must be >= 2. 
* Default: 3 @@ -85,6 +86,10 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String) @Since("1.2.0") def setNumFolds(value: Int): this.type = set(numFolds, value) + /** @group setParam */ + @Since("2.0.0") + def setSeed(value: Long): this.type = set(seed, value) + @Since("1.4.0") override def fit(dataset: DataFrame): CrossValidatorModel = { val schema = dataset.schema @@ -95,7 +100,7 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String) val epm = $(estimatorParamMaps) val numModels = epm.length val metrics = new Array[Double](epm.length) -val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0) +val splits = MLUtils.kFold(dataset.rdd, $(numFolds), $(seed)) splits.zipWithIndex.foreach { case ((training, validation), splitIndex) => val trainingDataset = sqlCtx.createDataFrame(training, schema).cache() val validationDataset = sqlCtx.createDataFrame(validation, schema).cache() http://git-wip-us.apache.org/repos/asf/spark/blob/860dc7f2/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala index 414ea99..4c9151f 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala @@ -265,6 +265,14 @@ object MLUtils { */ @Since("1.0.0") def kFold[T: ClassTag](rdd: RDD[T], numFolds: Int, seed: Int): Array[(RDD[T], RDD[T])] = { +kFold(rdd, numFolds, seed.toLong) + } + + /** + * Version of [[kFold()]] taking a Long seed. + */ + @Since("2.0.0") + def kFold[T: ClassTag](rdd: RDD[T], numFolds: Int, seed: Long): Array[(RDD[T], RDD[T])] = { val numFoldsF = numFolds.toFloat (1 to numFolds).map { fold => val sampler = new BernoulliCellSampler[T]((fold - 1) / numFoldsF, fold / numFoldsF,
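With the new setter, fold assignment becomes reproducible across runs. A minimal sketch, assuming an estimator `lr`, an `evaluator`, a `paramGrid`, and a `training` DataFrame are already defined:

```scala
import org.apache.spark.ml.tuning.CrossValidator

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
  .setSeed(42L)  // the same seed now yields the same splits from MLUtils.kFold
val cvModel = cv.fit(training)
```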
spark git commit: [SPARK-12324][MLLIB][DOC] Fixes the sidebar in the ML documentation
Repository: spark Updated Branches: refs/heads/branch-1.6 fb08f7b78 -> a2d584ed9 [SPARK-12324][MLLIB][DOC] Fixes the sidebar in the ML documentation This fixes the sidebar, using a pure CSS mechanism to hide it when the browser's viewport is too narrow. Credit goes to the original author Titan-C (mentioned in the NOTICE). Note that I am not a CSS expert, so I can only address comments up to some extent. Default view: https://cloud.githubusercontent.com/assets/7594753/11793597/6d1d6eda-a261-11e5-836b-6eb2054e9054.png;> When collapsed manually by the user: https://cloud.githubusercontent.com/assets/7594753/11793669/c991989e-a261-11e5-8bf6-aecf3bdb6319.png;> Disappears when column is too narrow: https://cloud.githubusercontent.com/assets/7594753/11793607/7754dbcc-a261-11e5-8b15-e0d074b0e47c.png;> Can still be opened by the user if necessary: https://cloud.githubusercontent.com/assets/7594753/11793612/7bf82968-a261-11e5-9cc3-e827a7a6b2b0.png;> Author: Timothy HunterCloses #10297 from thunterdb/12324. (cherry picked from commit a6325fc401f68d9fa30cc947c44acc9d64ebda7b) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a2d584ed Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a2d584ed Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a2d584ed Branch: refs/heads/branch-1.6 Commit: a2d584ed9ab3c073df057bed5314bdf877a47616 Parents: fb08f7b Author: Timothy Hunter Authored: Wed Dec 16 10:12:33 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 10:12:47 2015 -0800 -- NOTICE| 9 ++- docs/_layouts/global.html | 35 +++ docs/css/main.css | 137 ++--- docs/js/main.js | 2 +- 4 files changed, 149 insertions(+), 34 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a2d584ed/NOTICE -- diff --git a/NOTICE b/NOTICE index 7f7769f..571f8c2 100644 --- a/NOTICE +++ b/NOTICE @@ -606,4 +606,11 @@ Vis.js uses and redistributes the following third-party libraries: - keycharm https://github.com/AlexDM0/keycharm - The MIT License \ No newline at end of file + The MIT License + +=== + +The CSS style for the navigation sidebar of the documentation was originally +submitted by Ãscar Nájera for the scikit-learn project. The scikit-learn project +is distributed under the 3-Clause BSD license. 
+=== http://git-wip-us.apache.org/repos/asf/spark/blob/a2d584ed/docs/_layouts/global.html -- diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html index 0b5b0cd..3089474 100755 --- a/docs/_layouts/global.html +++ b/docs/_layouts/global.html @@ -1,3 +1,4 @@ + @@ -127,20 +128,32 @@ {% if page.url contains "/ml" %} - {% include nav-left-wrapper-ml.html nav-mllib=site.data.menu-mllib nav-ml=site.data.menu-ml %} -{% endif %} - +{% include nav-left-wrapper-ml.html nav-mllib=site.data.menu-mllib nav-ml=site.data.menu-ml %} + + + +{% if page.displayTitle %} +{{ page.displayTitle }} +{% else %} +{{ page.title }} +{% endif %} + +{{ content }} - - {% if page.displayTitle %} -{{ page.displayTitle }} - {% else %} -{{ page.title }} - {% endif %} + +{% else %} + +{% if page.displayTitle %} +{{ page.displayTitle }} +{% else %} +{{ page.title }} +{% endif %} - {{ content }} +{{ content }} - + +{% endif %} + http://git-wip-us.apache.org/repos/asf/spark/blob/a2d584ed/docs/css/main.css -- diff --git a/docs/css/main.css b/docs/css/main.css index 356b324..175e800 100755 --- a/docs/css/main.css +++ b/docs/css/main.css @@ -40,17 +40,14 @@ } body .container-wrapper { - position: absolute; - width: 100%; - display: flex; -} - -body #content { + background-color: #FFF; + color: #1D1F22; + max-width: 1024px; + margin-top:
spark git commit: [SPARK-12324][MLLIB][DOC] Fixes the sidebar in the ML documentation
Repository: spark Updated Branches: refs/heads/master 1a3d0cd9f -> a6325fc40 [SPARK-12324][MLLIB][DOC] Fixes the sidebar in the ML documentation This fixes the sidebar, using a pure CSS mechanism to hide it when the browser's viewport is too narrow. Credit goes to the original author Titan-C (mentioned in the NOTICE). Note that I am not a CSS expert, so I can only address comments up to some extent. Default view: https://cloud.githubusercontent.com/assets/7594753/11793597/6d1d6eda-a261-11e5-836b-6eb2054e9054.png;> When collapsed manually by the user: https://cloud.githubusercontent.com/assets/7594753/11793669/c991989e-a261-11e5-8bf6-aecf3bdb6319.png;> Disappears when column is too narrow: https://cloud.githubusercontent.com/assets/7594753/11793607/7754dbcc-a261-11e5-8b15-e0d074b0e47c.png;> Can still be opened by the user if necessary: https://cloud.githubusercontent.com/assets/7594753/11793612/7bf82968-a261-11e5-9cc3-e827a7a6b2b0.png;> Author: Timothy HunterCloses #10297 from thunterdb/12324. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a6325fc4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a6325fc4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a6325fc4 Branch: refs/heads/master Commit: a6325fc401f68d9fa30cc947c44acc9d64ebda7b Parents: 1a3d0cd Author: Timothy Hunter Authored: Wed Dec 16 10:12:33 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 10:12:33 2015 -0800 -- NOTICE| 9 ++- docs/_layouts/global.html | 35 +++ docs/css/main.css | 137 ++--- docs/js/main.js | 2 +- 4 files changed, 149 insertions(+), 34 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a6325fc4/NOTICE -- diff --git a/NOTICE b/NOTICE index 7f7769f..571f8c2 100644 --- a/NOTICE +++ b/NOTICE @@ -606,4 +606,11 @@ Vis.js uses and redistributes the following third-party libraries: - keycharm https://github.com/AlexDM0/keycharm - The MIT License \ No newline at end of file + The MIT License + +=== + +The CSS style for the navigation sidebar of the documentation was originally +submitted by Ãscar Nájera for the scikit-learn project. The scikit-learn project +is distributed under the 3-Clause BSD license. 
+=== http://git-wip-us.apache.org/repos/asf/spark/blob/a6325fc4/docs/_layouts/global.html -- diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html index 0b5b0cd..3089474 100755 --- a/docs/_layouts/global.html +++ b/docs/_layouts/global.html @@ -1,3 +1,4 @@ + @@ -127,20 +128,32 @@ {% if page.url contains "/ml" %} - {% include nav-left-wrapper-ml.html nav-mllib=site.data.menu-mllib nav-ml=site.data.menu-ml %} -{% endif %} - +{% include nav-left-wrapper-ml.html nav-mllib=site.data.menu-mllib nav-ml=site.data.menu-ml %} + + + +{% if page.displayTitle %} +{{ page.displayTitle }} +{% else %} +{{ page.title }} +{% endif %} + +{{ content }} - - {% if page.displayTitle %} -{{ page.displayTitle }} - {% else %} -{{ page.title }} - {% endif %} + +{% else %} + +{% if page.displayTitle %} +{{ page.displayTitle }} +{% else %} +{{ page.title }} +{% endif %} - {{ content }} +{{ content }} - + +{% endif %} + http://git-wip-us.apache.org/repos/asf/spark/blob/a6325fc4/docs/css/main.css -- diff --git a/docs/css/main.css b/docs/css/main.css index 356b324..175e800 100755 --- a/docs/css/main.css +++ b/docs/css/main.css @@ -40,17 +40,14 @@ } body .container-wrapper { - position: absolute; - width: 100%; - display: flex; -} - -body #content { + background-color: #FFF; + color: #1D1F22; + max-width: 1024px; + margin-top: 10px; + margin-left: auto; + margin-right: auto; + border-radius: 15px; position: relative; - - line-height: 1.6; /* Inspired by
spark git commit: [SPARK-12318][SPARKR] Save mode in SparkR should be error by default
Repository: spark Updated Branches: refs/heads/master 54c512ba9 -> 2eb5af5f0 [SPARK-12318][SPARKR] Save mode in SparkR should be error by default shivaram Please help review. Author: Jeff ZhangCloses #10290 from zjffdu/SPARK-12318. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2eb5af5f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2eb5af5f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2eb5af5f Branch: refs/heads/master Commit: 2eb5af5f0d3c424dc617bb1a18dd0210ea9ba0bc Parents: 54c512b Author: Jeff Zhang Authored: Wed Dec 16 10:32:32 2015 -0800 Committer: Shivaram Venkataraman Committed: Wed Dec 16 10:32:32 2015 -0800 -- R/pkg/R/DataFrame.R | 10 +- docs/sparkr.md | 9 - 2 files changed, 13 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2eb5af5f/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 764597d..380a13f 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -1886,7 +1886,7 @@ setMethod("except", #' @param df A SparkSQL DataFrame #' @param path A name for the table #' @param source A name for external data source -#' @param mode One of 'append', 'overwrite', 'error', 'ignore' save mode +#' @param mode One of 'append', 'overwrite', 'error', 'ignore' save mode (it is 'error' by default) #' #' @family DataFrame functions #' @rdname write.df @@ -1903,7 +1903,7 @@ setMethod("except", #' } setMethod("write.df", signature(df = "DataFrame", path = "character"), - function(df, path, source = NULL, mode = "append", ...){ + function(df, path, source = NULL, mode = "error", ...){ if (is.null(source)) { sqlContext <- get(".sparkRSQLsc", envir = .sparkREnv) source <- callJMethod(sqlContext, "getConf", "spark.sql.sources.default", @@ -1928,7 +1928,7 @@ setMethod("write.df", #' @export setMethod("saveDF", signature(df = "DataFrame", path = "character"), - function(df, path, source = NULL, mode = "append", ...){ + function(df, path, source = NULL, mode = "error", ...){ write.df(df, path, source, mode, ...) }) @@ -1951,7 +1951,7 @@ setMethod("saveDF", #' @param df A SparkSQL DataFrame #' @param tableName A name for the table #' @param source A name for external data source -#' @param mode One of 'append', 'overwrite', 'error', 'ignore' save mode +#' @param mode One of 'append', 'overwrite', 'error', 'ignore' save mode (it is 'error' by default) #' #' @family DataFrame functions #' @rdname saveAsTable @@ -1968,7 +1968,7 @@ setMethod("saveDF", setMethod("saveAsTable", signature(df = "DataFrame", tableName = "character", source = "character", mode = "character"), - function(df, tableName, source = NULL, mode="append", ...){ + function(df, tableName, source = NULL, mode="error", ...){ if (is.null(source)) { sqlContext <- get(".sparkRSQLsc", envir = .sparkREnv) source <- callJMethod(sqlContext, "getConf", "spark.sql.sources.default", http://git-wip-us.apache.org/repos/asf/spark/blob/2eb5af5f/docs/sparkr.md -- diff --git a/docs/sparkr.md b/docs/sparkr.md index 0114878..9ddd2ed 100644 --- a/docs/sparkr.md +++ b/docs/sparkr.md @@ -148,7 +148,7 @@ printSchema(people) The data sources API can also be used to save out DataFrames into multiple file formats. For example we can save the DataFrame from the previous example -to a Parquet file using `write.df` +to a Parquet file using `write.df` (Until Spark 1.6, the default mode for writes was `append`. 
It was changed in Spark 1.7 to `error` to match the Scala API) {% highlight r %} @@ -387,3 +387,10 @@ The following functions are masked by the SparkR package: Since part of SparkR is modeled on the `dplyr` package, certain functions in SparkR share the same names with those in `dplyr`. Depending on the load order of the two packages, some functions from the package loaded first are masked by those in the package loaded after. In such case, prefix such calls with the package name, for instance, `SparkR::cume_dist(x)` or `dplyr::cume_dist(x)`. You can inspect the search path in R with [`search()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/search.html) + + +# Migration Guide + +## Upgrading From SparkR 1.6 to 1.7 + + - Until Spark 1.6, the default mode for writes was `append`. It was changed in Spark 1.7 to `error` to match the Scala API.
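The change only affects callers who relied on the implicit default; passing the mode explicitly is robust across versions in every language binding. The Scala form, with a placeholder path:

```scala
import org.apache.spark.sql.SaveMode

// Explicit save modes are unaffected by the default change:
df.write.mode(SaveMode.Append).parquet("/tmp/people.parquet")
// or the string form that SparkR's `mode` argument maps onto:
df.write.mode("append").parquet("/tmp/people.parquet")
```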
spark git commit: [SPARK-12318][SPARKR] Save mode in SparkR should be error by default
Repository: spark Updated Branches: refs/heads/branch-1.6 16edd933d -> f81512729 [SPARK-12318][SPARKR] Save mode in SparkR should be error by default shivaram Please help review. Author: Jeff ZhangCloses #10290 from zjffdu/SPARK-12318. (cherry picked from commit 2eb5af5f0d3c424dc617bb1a18dd0210ea9ba0bc) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f8151272 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f8151272 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f8151272 Branch: refs/heads/branch-1.6 Commit: f815127294c06320204d9affa4f35da7ec3a710d Parents: 16edd93 Author: Jeff Zhang Authored: Wed Dec 16 10:32:32 2015 -0800 Committer: Shivaram Venkataraman Committed: Wed Dec 16 10:48:54 2015 -0800 -- R/pkg/R/DataFrame.R | 10 +- docs/sparkr.md | 9 - 2 files changed, 13 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f8151272/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 7292433..0cfa12b9 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -1925,7 +1925,7 @@ setMethod("except", #' @param df A SparkSQL DataFrame #' @param path A name for the table #' @param source A name for external data source -#' @param mode One of 'append', 'overwrite', 'error', 'ignore' save mode +#' @param mode One of 'append', 'overwrite', 'error', 'ignore' save mode (it is 'error' by default) #' #' @family DataFrame functions #' @rdname write.df @@ -1942,7 +1942,7 @@ setMethod("except", #' } setMethod("write.df", signature(df = "DataFrame", path = "character"), - function(df, path, source = NULL, mode = "append", ...){ + function(df, path, source = NULL, mode = "error", ...){ if (is.null(source)) { sqlContext <- get(".sparkRSQLsc", envir = .sparkREnv) source <- callJMethod(sqlContext, "getConf", "spark.sql.sources.default", @@ -1967,7 +1967,7 @@ setMethod("write.df", #' @export setMethod("saveDF", signature(df = "DataFrame", path = "character"), - function(df, path, source = NULL, mode = "append", ...){ + function(df, path, source = NULL, mode = "error", ...){ write.df(df, path, source, mode, ...) }) @@ -1990,7 +1990,7 @@ setMethod("saveDF", #' @param df A SparkSQL DataFrame #' @param tableName A name for the table #' @param source A name for external data source -#' @param mode One of 'append', 'overwrite', 'error', 'ignore' save mode +#' @param mode One of 'append', 'overwrite', 'error', 'ignore' save mode (it is 'error' by default) #' #' @family DataFrame functions #' @rdname saveAsTable @@ -2007,7 +2007,7 @@ setMethod("saveDF", setMethod("saveAsTable", signature(df = "DataFrame", tableName = "character", source = "character", mode = "character"), - function(df, tableName, source = NULL, mode="append", ...){ + function(df, tableName, source = NULL, mode="error", ...){ if (is.null(source)) { sqlContext <- get(".sparkRSQLsc", envir = .sparkREnv) source <- callJMethod(sqlContext, "getConf", "spark.sql.sources.default", http://git-wip-us.apache.org/repos/asf/spark/blob/f8151272/docs/sparkr.md -- diff --git a/docs/sparkr.md b/docs/sparkr.md index 0114878..9ddd2ed 100644 --- a/docs/sparkr.md +++ b/docs/sparkr.md @@ -148,7 +148,7 @@ printSchema(people) The data sources API can also be used to save out DataFrames into multiple file formats. 
For example we can save the DataFrame from the previous example -to a Parquet file using `write.df` +to a Parquet file using `write.df` (Until Spark 1.6, the default mode for writes was `append`. It was changed in Spark 1.7 to `error` to match the Scala API) {% highlight r %} @@ -387,3 +387,10 @@ The following functions are masked by the SparkR package: Since part of SparkR is modeled on the `dplyr` package, certain functions in SparkR share the same names with those in `dplyr`. Depending on the load order of the two packages, some functions from the package loaded first are masked by those in the package loaded after. In such cases, prefix such calls with the package name, for instance, `SparkR::cume_dist(x)` or `dplyr::cume_dist(x)`. You can inspect the search path in R with [`search()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/search.html) + + +# Migration Guide + +## Upgrading From SparkR 1.6 to 1.7 + + - Until Spark 1.6, the default mode for writes was `append`. It was changed in Spark 1.7 to `error` to match the Scala API.
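Since the note above points to the Scala API as the reference behavior, a minimal Scala sketch of that behavior may help (the `SaveModeSketch` object, `df`, and the output path are illustrative, not from the commit):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object SaveModeSketch {
  def save(df: DataFrame): Unit = {
    // The Scala API's default mode is ErrorIfExists, which SparkR's
    // write.df now matches with its "error" default: this call fails
    // if people.parquet already exists.
    df.write.format("parquet").save("people.parquet")

    // Passing the mode explicitly keeps scripts portable across Spark
    // versions in which the default differed.
    df.write.mode(SaveMode.Append).format("parquet").save("people.parquet")
  }
}
```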
spark git commit: [SPARK-12345][MESOS] Filter SPARK_HOME when submitting Spark jobs with Mesos cluster mode.
Repository: spark Updated Branches: refs/heads/master 26d70bd2b -> ad8c1f0b8 [SPARK-12345][MESOS] Filter SPARK_HOME when submitting Spark jobs with Mesos cluster mode. SPARK_HOME is now causing problems with Mesos cluster mode, since the spark-submit script was recently changed to give precedence to SPARK_HOME, when defined, while running spark-class scripts. We should skip passing SPARK_HOME from the Spark client in cluster mode with Mesos, since Mesos shouldn't use this configuration but should use spark.executor.home instead. Author: Timothy ChenCloses #10332 from tnachen/scheduler_ui. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ad8c1f0b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ad8c1f0b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ad8c1f0b Branch: refs/heads/master Commit: ad8c1f0b840284d05da737fb2cc5ebf8848f4490 Parents: 26d70bd Author: Timothy Chen Authored: Wed Dec 16 10:54:15 2015 -0800 Committer: Andrew Or Committed: Wed Dec 16 10:54:15 2015 -0800 -- .../org/apache/spark/deploy/rest/mesos/MesosRestServer.scala | 7 ++- .../spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala | 2 +- 2 files changed, 7 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ad8c1f0b/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala b/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala index 868cc35..24510db 100644 --- a/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala +++ b/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala @@ -94,7 +94,12 @@ private[mesos] class MesosSubmitRequestServlet( val driverMemory = sparkProperties.get("spark.driver.memory") val driverCores = sparkProperties.get("spark.driver.cores") val appArgs = request.appArgs -val environmentVariables = request.environmentVariables +// We don't want to pass down SPARK_HOME when launching Spark apps +// with Mesos cluster mode since it's populated by default on the client and it will +// cause the spark-submit script to look for files in SPARK_HOME instead. +// We only need the ability to specify where to find the spark-submit script, +// which users can set via the spark.executor.home or spark.home configurations. +val environmentVariables = request.environmentVariables.filter(!_.equals("SPARK_HOME")) val name = request.sparkProperties.get("spark.app.name").getOrElse(mainClass) // Construct driver description http://git-wip-us.apache.org/repos/asf/spark/blob/ad8c1f0b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala index 721861f..573355b 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala @@ -34,7 +34,7 @@ import org.apache.spark.util.Utils /** * Shared trait for implementing a Mesos Scheduler. This holds common state and helper - * methods and Mesos scheduler will use. + * methods the Mesos scheduler will use. 
*/ private[mesos] trait MesosSchedulerUtils extends Logging { // Lock used to wait for scheduler to be registered
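The subtle point in the hunk above is what the filter predicate receives. Below is a self-contained Scala sketch of the collections semantics involved (the map contents are made up, and this is not the MesosRestServer code; whether the patch's `.filter` call sees keys or whole entries depends on the runtime type of `environmentVariables`):

```scala
object EnvFilterSketch {
  def main(args: Array[String]): Unit = {
    val env = Map("SPARK_HOME" -> "/opt/spark", "PATH" -> "/usr/bin")

    // On a Map, filter's predicate receives a (key, value) tuple, so
    // comparing it to a String is never true and nothing is dropped.
    val viaEntries = env.filter(!_.equals("SPARK_HOME"))

    // filterKeys applies the predicate to the key alone, which is the
    // intent when stripping a single environment variable.
    val viaKeys = env.filterKeys(!_.equals("SPARK_HOME"))

    println(viaEntries.keys) // Set(SPARK_HOME, PATH)
    println(viaKeys.keys)    // Set(PATH)
  }
}
```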
spark git commit: [SPARK-6518][MLLIB][EXAMPLE][DOC] Add example code and user guide for bisecting k-means
Repository: spark Updated Branches: refs/heads/branch-1.6 e5b85713d -> e1adf6d7d [SPARK-6518][MLLIB][EXAMPLE][DOC] Add example code and user guide for bisecting k-means This PR includes only example code in order to finish it quickly. I'll send another PR for the docs soon. Author: Yu ISHIKAWACloses #9952 from yu-iskw/SPARK-6518. (cherry picked from commit 7b6dc29d0ebbfb3bb941130f8542120b6bc3e234) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e1adf6d7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e1adf6d7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e1adf6d7 Branch: refs/heads/branch-1.6 Commit: e1adf6d7d1c755fb16a0030e66ce9cff348c3de8 Parents: e5b8571 Author: Yu ISHIKAWA Authored: Wed Dec 16 10:55:42 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 10:55:54 2015 -0800 -- docs/mllib-clustering.md| 35 ++ docs/mllib-guide.md | 1 + .../mllib/JavaBisectingKMeansExample.java | 69 .../examples/mllib/BisectingKMeansExample.scala | 60 + 4 files changed, 165 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e1adf6d7/docs/mllib-clustering.md -- diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md index 48d64cd..93cd0c1 100644 --- a/docs/mllib-clustering.md +++ b/docs/mllib-clustering.md @@ -718,6 +718,41 @@ sameModel = LDAModel.load(sc, "myModelPath") +## Bisecting k-means + +Bisecting K-means can often be much faster than regular K-means, but it will generally produce a different clustering. + +Bisecting k-means is a kind of [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering). +Hierarchical clustering is one of the most commonly used methods of cluster analysis, which seeks to build a hierarchy of clusters. +Strategies for hierarchical clustering generally fall into two types: + +- Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. +- Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. + +The bisecting k-means algorithm is a kind of divisive algorithm. +The implementation in MLlib has the following parameters: + +* *k*: the desired number of leaf clusters (default: 4). The actual number could be smaller if there are no divisible leaf clusters. +* *maxIterations*: the max number of k-means iterations to split clusters (default: 20) +* *minDivisibleClusterSize*: the minimum number of points (if >= 1.0) or the minimum proportion of points (if < 1.0) of a divisible cluster (default: 1) +* *seed*: a random seed (default: hash value of the class name) + +**Examples** + + + +Refer to the [`BisectingKMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeans) and [`BisectingKMeansModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeansModel) for details on the API. + +{% include_example scala/org/apache/spark/examples/mllib/BisectingKMeansExample.scala %} + + + +Refer to the [`BisectingKMeans` Java docs](api/java/org/apache/spark/mllib/clustering/BisectingKMeans.html) and [`BisectingKMeansModel` Java docs](api/java/org/apache/spark/mllib/clustering/BisectingKMeansModel.html) for details on the API. 
+ +{% include_example java/org/apache/spark/examples/mllib/JavaBisectingKMeansExample.java %} + + + ## Streaming k-means When data arrive in a stream, we may want to estimate clusters dynamically, http://git-wip-us.apache.org/repos/asf/spark/blob/e1adf6d7/docs/mllib-guide.md -- diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index 7fef6b5..680ed48 100644 --- a/docs/mllib-guide.md +++ b/docs/mllib-guide.md @@ -49,6 +49,7 @@ We list major functionality from both below, with links to detailed guides. * [Gaussian mixture](mllib-clustering.html#gaussian-mixture) * [power iteration clustering (PIC)](mllib-clustering.html#power-iteration-clustering-pic) * [latent Dirichlet allocation (LDA)](mllib-clustering.html#latent-dirichlet-allocation-lda) + * [bisecting k-means](mllib-clustering.html#bisecting-kmeans) * [streaming k-means](mllib-clustering.html#streaming-k-means) * [Dimensionality reduction](mllib-dimensionality-reduction.html) * [singular value decomposition (SVD)](mllib-dimensionality-reduction.html#singular-value-decomposition-svd)
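A compact Scala sketch of the four parameters listed in the guide above (toy data and arbitrary values; the commit's real examples live in BisectingKMeansExample.scala and JavaBisectingKMeansExample.java):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.Vectors

object BisectingKMeansSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BisectingKMeansSketch"))

    // Toy 2-D points; real data would come from a file or table.
    val data = sc.parallelize(Seq(
      Vectors.dense(0.1, 0.1), Vectors.dense(0.3, 0.3),
      Vectors.dense(10.1, 10.1), Vectors.dense(10.3, 10.3),
      Vectors.dense(20.1, 20.1), Vectors.dense(20.3, 20.3)))

    val model = new BisectingKMeans()
      .setK(3)                         // desired number of leaf clusters
      .setMaxIterations(20)            // k-means iterations per split
      .setMinDivisibleClusterSize(1.0) // points (>= 1.0) or fraction (< 1.0)
      .setSeed(42L)
      .run(data)

    println(s"Compute cost: ${model.computeCost(data)}")
    model.clusterCenters.zipWithIndex.foreach { case (center, idx) =>
      println(s"Cluster center $idx: $center")
    }
    sc.stop()
  }
}
```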
spark git commit: [SPARK-6518][MLLIB][EXAMPLE][DOC] Add example code and user guide for bisecting k-means
Repository: spark Updated Branches: refs/heads/master ad8c1f0b8 -> 7b6dc29d0 [SPARK-6518][MLLIB][EXAMPLE][DOC] Add example code and user guide for bisecting k-means This PR includes only example code in order to finish it quickly. I'll send another PR for the docs soon. Author: Yu ISHIKAWACloses #9952 from yu-iskw/SPARK-6518. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7b6dc29d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7b6dc29d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7b6dc29d Branch: refs/heads/master Commit: 7b6dc29d0ebbfb3bb941130f8542120b6bc3e234 Parents: ad8c1f0 Author: Yu ISHIKAWA Authored: Wed Dec 16 10:55:42 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 10:55:42 2015 -0800 -- docs/mllib-clustering.md| 35 ++ docs/mllib-guide.md | 1 + .../mllib/JavaBisectingKMeansExample.java | 69 .../examples/mllib/BisectingKMeansExample.scala | 60 + 4 files changed, 165 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7b6dc29d/docs/mllib-clustering.md -- diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md index 48d64cd..93cd0c1 100644 --- a/docs/mllib-clustering.md +++ b/docs/mllib-clustering.md @@ -718,6 +718,41 @@ sameModel = LDAModel.load(sc, "myModelPath") +## Bisecting k-means + +Bisecting K-means can often be much faster than regular K-means, but it will generally produce a different clustering. + +Bisecting k-means is a kind of [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering). +Hierarchical clustering is one of the most commonly used methods of cluster analysis, which seeks to build a hierarchy of clusters. +Strategies for hierarchical clustering generally fall into two types: + +- Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. +- Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. + +The bisecting k-means algorithm is a kind of divisive algorithm. +The implementation in MLlib has the following parameters: + +* *k*: the desired number of leaf clusters (default: 4). The actual number could be smaller if there are no divisible leaf clusters. +* *maxIterations*: the max number of k-means iterations to split clusters (default: 20) +* *minDivisibleClusterSize*: the minimum number of points (if >= 1.0) or the minimum proportion of points (if < 1.0) of a divisible cluster (default: 1) +* *seed*: a random seed (default: hash value of the class name) + +**Examples** + + + +Refer to the [`BisectingKMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeans) and [`BisectingKMeansModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeansModel) for details on the API. + +{% include_example scala/org/apache/spark/examples/mllib/BisectingKMeansExample.scala %} + + + +Refer to the [`BisectingKMeans` Java docs](api/java/org/apache/spark/mllib/clustering/BisectingKMeans.html) and [`BisectingKMeansModel` Java docs](api/java/org/apache/spark/mllib/clustering/BisectingKMeansModel.html) for details on the API. 
+ +{% include_example java/org/apache/spark/examples/mllib/JavaBisectingKMeansExample.java %} + + + ## Streaming k-means When data arrive in a stream, we may want to estimate clusters dynamically, http://git-wip-us.apache.org/repos/asf/spark/blob/7b6dc29d/docs/mllib-guide.md -- diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index 7fef6b5..680ed48 100644 --- a/docs/mllib-guide.md +++ b/docs/mllib-guide.md @@ -49,6 +49,7 @@ We list major functionality from both below, with links to detailed guides. * [Gaussian mixture](mllib-clustering.html#gaussian-mixture) * [power iteration clustering (PIC)](mllib-clustering.html#power-iteration-clustering-pic) * [latent Dirichlet allocation (LDA)](mllib-clustering.html#latent-dirichlet-allocation-lda) + * [bisecting k-means](mllib-clustering.html#bisecting-kmeans) * [streaming k-means](mllib-clustering.html#streaming-k-means) * [Dimensionality reduction](mllib-dimensionality-reduction.html) * [singular value decomposition (SVD)](mllib-dimensionality-reduction.html#singular-value-decomposition-svd)
spark git commit: [SPARK-12345][MESOS] Filter SPARK_HOME when submitting Spark jobs with Mesos cluster mode.
Repository: spark Updated Branches: refs/heads/branch-1.6 f81512729 -> e5b85713d [SPARK-12345][MESOS] Filter SPARK_HOME when submitting Spark jobs with Mesos cluster mode. SPARK_HOME is now causing problems with Mesos cluster mode, since the spark-submit script was recently changed to give precedence to SPARK_HOME, when defined, while running spark-class scripts. We should skip passing SPARK_HOME from the Spark client in cluster mode with Mesos, since Mesos shouldn't use this configuration but should use spark.executor.home instead. Author: Timothy ChenCloses #10332 from tnachen/scheduler_ui. (cherry picked from commit ad8c1f0b840284d05da737fb2cc5ebf8848f4490) Signed-off-by: Andrew Or Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e5b85713 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e5b85713 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e5b85713 Branch: refs/heads/branch-1.6 Commit: e5b85713d8a0dbbb1a0a07481f5afa6c5098147f Parents: f815127 Author: Timothy Chen Authored: Wed Dec 16 10:54:15 2015 -0800 Committer: Andrew Or Committed: Wed Dec 16 10:55:25 2015 -0800 -- .../org/apache/spark/deploy/rest/mesos/MesosRestServer.scala | 7 ++- .../spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala | 2 +- 2 files changed, 7 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e5b85713/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala b/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala index 868cc35..7c01ae4 100644 --- a/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala +++ b/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala @@ -94,7 +94,12 @@ private[mesos] class MesosSubmitRequestServlet( val driverMemory = sparkProperties.get("spark.driver.memory") val driverCores = sparkProperties.get("spark.driver.cores") val appArgs = request.appArgs -val environmentVariables = request.environmentVariables +// We don't want to pass down SPARK_HOME when launching Spark apps with Mesos cluster mode +// since it's populated by default on the client and it will cause the spark-submit script to +// look for files in SPARK_HOME instead. We only need the ability to specify where to find +// the spark-submit script, which users can set via the spark.executor.home or spark.home +// configurations (SPARK-12345). +val environmentVariables = request.environmentVariables.filter(!_.equals("SPARK_HOME")) val name = request.sparkProperties.get("spark.app.name").getOrElse(mainClass) // Construct driver description http://git-wip-us.apache.org/repos/asf/spark/blob/e5b85713/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala index 721861f..573355b 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala @@ -34,7 +34,7 @@ import org.apache.spark.util.Utils /** * Shared trait for implementing a Mesos Scheduler. This holds common state and helper - * methods and Mesos scheduler will use. + * methods the Mesos scheduler will use. 
*/ private[mesos] trait MesosSchedulerUtils extends Logging { // Lock used to wait for scheduler to be registered
spark git commit: [SPARK-12309][ML] Use sqlContext from MLlibTestSparkContext for spark.ml test suites
Repository: spark Updated Branches: refs/heads/master 860dc7f2f -> d252b2d54 [SPARK-12309][ML] Use sqlContext from MLlibTestSparkContext for spark.ml test suites Use ```sqlContext``` from ```MLlibTestSparkContext``` rather than creating a new one for spark.ml test suites. I have checked thoroughly and found four test cases that need to be updated. cc mengxr jkbradley Author: Yanbo LiangCloses #10279 from yanboliang/spark-12309. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d252b2d5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d252b2d5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d252b2d5 Branch: refs/heads/master Commit: d252b2d544a75f6c5523be3492494955050acf50 Parents: 860dc7f Author: Yanbo Liang Authored: Wed Dec 16 11:07:54 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 11:07:54 2015 -0800 -- .../scala/org/apache/spark/ml/feature/MinMaxScalerSuite.scala| 4 +--- .../test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala | 3 +-- .../scala/org/apache/spark/ml/feature/VectorSlicerSuite.scala| 4 +--- mllib/src/test/scala/org/apache/spark/ml/impl/TreeTests.scala| 2 +- .../scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala | 3 +-- 5 files changed, 5 insertions(+), 11 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d252b2d5/mllib/src/test/scala/org/apache/spark/ml/feature/MinMaxScalerSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/MinMaxScalerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/MinMaxScalerSuite.scala index 09183fe..035bfc0 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/MinMaxScalerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/MinMaxScalerSuite.scala @@ -21,13 +21,11 @@ import org.apache.spark.SparkFunSuite import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils} import org.apache.spark.mllib.linalg.{Vector, Vectors} import org.apache.spark.mllib.util.MLlibTestSparkContext -import org.apache.spark.sql.{Row, SQLContext} +import org.apache.spark.sql.Row class MinMaxScalerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { test("MinMaxScaler fit basic case") { -val sqlContext = new SQLContext(sc) - val data = Array( Vectors.dense(1, 0, Long.MinValue), Vectors.dense(2, 0, 0), http://git-wip-us.apache.org/repos/asf/spark/blob/d252b2d5/mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala index de3d438..4688339 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala @@ -22,7 +22,7 @@ import org.apache.spark.ml.util.DefaultReadWriteTest import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors} import org.apache.spark.mllib.util.MLlibTestSparkContext import org.apache.spark.mllib.util.TestingUtils._ -import org.apache.spark.sql.{DataFrame, Row, SQLContext} +import org.apache.spark.sql.{DataFrame, Row} class NormalizerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { @@ -61,7 +61,6 @@ class NormalizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa Vectors.sparse(3, Seq()) ) -val sqlContext = new SQLContext(sc) dataFrame = sqlContext.createDataFrame(sc.parallelize(data, 
2).map(NormalizerSuite.FeatureData)) normalizer = new Normalizer() .setInputCol("features") http://git-wip-us.apache.org/repos/asf/spark/blob/d252b2d5/mllib/src/test/scala/org/apache/spark/ml/feature/VectorSlicerSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorSlicerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorSlicerSuite.scala index 74706a2..8acc336 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorSlicerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorSlicerSuite.scala @@ -24,7 +24,7 @@ import org.apache.spark.ml.util.DefaultReadWriteTest import org.apache.spark.mllib.linalg.{Vector, Vectors} import org.apache.spark.mllib.util.MLlibTestSparkContext import org.apache.spark.sql.types.StructType -import org.apache.spark.sql.{DataFrame, Row, SQLContext} +import org.apache.spark.sql.{DataFrame, Row} class
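The post-SPARK-12309 pattern, as a hedged sketch (the suite name and data are illustrative; it assumes `MLlibTestSparkContext` wires up `sc` and `sqlContext` in `beforeAll`, as the diffs above rely on, and that the code lives inside Spark's own test tree where `SparkFunSuite` is visible):

```scala
import org.apache.spark.SparkFunSuite
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLlibTestSparkContext

// A sketch of a spark.ml test suite after SPARK-12309: no `new SQLContext(sc)`;
// the suite reuses the sqlContext provided by MLlibTestSparkContext.
class ExampleSuite extends SparkFunSuite with MLlibTestSparkContext {
  test("uses the shared sqlContext") {
    val df = sqlContext.createDataFrame(Seq(
      (0, Vectors.dense(1.0, 0.5)),
      (1, Vectors.dense(0.0, 2.0))
    )).toDF("id", "features")
    assert(df.count() === 2)
  }
}
```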
spark git commit: [SPARK-8745] [SQL] remove GenerateProjection
Repository: spark Updated Branches: refs/heads/master a6325fc40 -> 54c512ba9 [SPARK-8745] [SQL] remove GenerateProjection cc rxin Author: Davies LiuCloses #10316 from davies/remove_generate_projection. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/54c512ba Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/54c512ba Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/54c512ba Branch: refs/heads/master Commit: 54c512ba906edfc25b8081ad67498e99d884452b Parents: a6325fc Author: Davies Liu Authored: Wed Dec 16 10:22:48 2015 -0800 Committer: Davies Liu Committed: Wed Dec 16 10:22:48 2015 -0800 -- .../codegen/GenerateProjection.scala| 238 --- .../codegen/GenerateSafeProjection.scala| 4 + .../expressions/CodeGenerationSuite.scala | 5 +- .../expressions/ExpressionEvalHelper.scala | 39 +-- .../expressions/MathFunctionsSuite.scala| 4 +- .../codegen/CodegenExpressionCachingSuite.scala | 18 -- .../spark/sql/execution/local/ExpandNode.scala | 4 +- .../spark/sql/execution/local/LocalNode.scala | 18 -- 8 files changed, 11 insertions(+), 319 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/54c512ba/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateProjection.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateProjection.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateProjection.scala deleted file mode 100644 index f229f20..000 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateProjection.scala +++ /dev/null @@ -1,238 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - *http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.spark.sql.catalyst.expressions.codegen - -import org.apache.spark.sql.catalyst.InternalRow -import org.apache.spark.sql.catalyst.expressions._ -import org.apache.spark.sql.types._ - -/** - * Java can not access Projection (in package object) - */ -abstract class BaseProjection extends Projection {} - -abstract class CodeGenMutableRow extends MutableRow with BaseGenericInternalRow - -/** - * Generates bytecode that produces a new [[InternalRow]] object based on a fixed set of input - * [[Expression Expressions]] and a given input [[InternalRow]]. The returned [[InternalRow]] - * object is custom generated based on the output types of the [[Expression]] to avoid boxing of - * primitive values. 
- */ -object GenerateProjection extends CodeGenerator[Seq[Expression], Projection] { - - protected def canonicalize(in: Seq[Expression]): Seq[Expression] = -in.map(ExpressionCanonicalizer.execute) - - protected def bind(in: Seq[Expression], inputSchema: Seq[Attribute]): Seq[Expression] = -in.map(BindReferences.bindReference(_, inputSchema)) - - // Make Mutablility optional... - protected def create(expressions: Seq[Expression]): Projection = { -val ctx = newCodeGenContext() -val columns = expressions.zipWithIndex.map { - case (e, i) => -s"private ${ctx.javaType(e.dataType)} c$i = ${ctx.defaultValue(e.dataType)};\n" -}.mkString("\n") - -val initColumns = expressions.zipWithIndex.map { - case (e, i) => -val eval = e.gen(ctx) -s""" -{ - // column$i - ${eval.code} - nullBits[$i] = ${eval.isNull}; - if (!${eval.isNull}) { -c$i = ${eval.value}; - } -} -""" -}.mkString("\n") - -val getCases = (0 until expressions.size).map { i => - s"case $i: return c$i;" -}.mkString("\n") - -val updateCases = expressions.zipWithIndex.map { case (e, i) => - s"case $i: { c$i = (${ctx.boxedType(e.dataType)})value; return;}" -}.mkString("\n") - -val specificAccessorFunctions = ctx.primitiveTypes.map
spark git commit: [SPARK-12380] [PYSPARK] use SQLContext.getOrCreate in mllib
Repository: spark Updated Branches: refs/heads/branch-1.6 04e868b63 -> 552b38f87 [SPARK-12380] [PYSPARK] use SQLContext.getOrCreate in mllib MLlib should use SQLContext.getOrCreate() instead of creating a new SQLContext. Author: Davies LiuCloses #10338 from davies/create_context. (cherry picked from commit 27b98e99d21a0cc34955337f82a71a18f9220ab2) Signed-off-by: Davies Liu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/552b38f8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/552b38f8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/552b38f8 Branch: refs/heads/branch-1.6 Commit: 552b38f87fc0f6fab61b1e5405be58908b7f5544 Parents: 04e868b Author: Davies Liu Authored: Wed Dec 16 15:48:11 2015 -0800 Committer: Davies Liu Committed: Wed Dec 16 15:48:21 2015 -0800 -- python/pyspark/mllib/common.py | 6 +++--- python/pyspark/mllib/evaluation.py | 10 +- python/pyspark/mllib/feature.py| 4 +--- 3 files changed, 9 insertions(+), 11 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/552b38f8/python/pyspark/mllib/common.py -- diff --git a/python/pyspark/mllib/common.py b/python/pyspark/mllib/common.py index a439a48..9fda1b1 100644 --- a/python/pyspark/mllib/common.py +++ b/python/pyspark/mllib/common.py @@ -102,7 +102,7 @@ def _java2py(sc, r, encoding="bytes"): return RDD(jrdd, sc) if clsName == 'DataFrame': -return DataFrame(r, SQLContext(sc)) +return DataFrame(r, SQLContext.getOrCreate(sc)) if clsName in _picklable_classes: r = sc._jvm.SerDe.dumps(r) @@ -125,7 +125,7 @@ def callJavaFunc(sc, func, *args): def callMLlibFunc(name, *args): """ Call API in PythonMLLibAPI """ -sc = SparkContext._active_spark_context +sc = SparkContext.getOrCreate() api = getattr(sc._jvm.PythonMLLibAPI(), name) return callJavaFunc(sc, api, *args) @@ -135,7 +135,7 @@ class JavaModelWrapper(object): Wrapper for the model in JVM """ def __init__(self, java_model): -self._sc = SparkContext._active_spark_context +self._sc = SparkContext.getOrCreate() self._java_model = java_model def __del__(self): http://git-wip-us.apache.org/repos/asf/spark/blob/552b38f8/python/pyspark/mllib/evaluation.py -- diff --git a/python/pyspark/mllib/evaluation.py b/python/pyspark/mllib/evaluation.py index 8c87ee9..22e68ea 100644 --- a/python/pyspark/mllib/evaluation.py +++ b/python/pyspark/mllib/evaluation.py @@ -44,7 +44,7 @@ class BinaryClassificationMetrics(JavaModelWrapper): def __init__(self, scoreAndLabels): sc = scoreAndLabels.ctx -sql_ctx = SQLContext(sc) +sql_ctx = SQLContext.getOrCreate(sc) df = sql_ctx.createDataFrame(scoreAndLabels, schema=StructType([ StructField("score", DoubleType(), nullable=False), StructField("label", DoubleType(), nullable=False)])) @@ -103,7 +103,7 @@ class RegressionMetrics(JavaModelWrapper): def __init__(self, predictionAndObservations): sc = predictionAndObservations.ctx -sql_ctx = SQLContext(sc) +sql_ctx = SQLContext.getOrCreate(sc) df = sql_ctx.createDataFrame(predictionAndObservations, schema=StructType([ StructField("prediction", DoubleType(), nullable=False), StructField("observation", DoubleType(), nullable=False)])) @@ -197,7 +197,7 @@ class MulticlassMetrics(JavaModelWrapper): def __init__(self, predictionAndLabels): sc = predictionAndLabels.ctx -sql_ctx = SQLContext(sc) +sql_ctx = SQLContext.getOrCreate(sc) df = sql_ctx.createDataFrame(predictionAndLabels, schema=StructType([ StructField("prediction", DoubleType(), nullable=False), StructField("label", DoubleType(), nullable=False)])) @@ -338,7 +338,7 @@ 
class RankingMetrics(JavaModelWrapper): def __init__(self, predictionAndLabels): sc = predictionAndLabels.ctx -sql_ctx = SQLContext(sc) +sql_ctx = SQLContext.getOrCreate(sc) df = sql_ctx.createDataFrame(predictionAndLabels, schema=sql_ctx._inferSchema(predictionAndLabels)) java_model = callMLlibFunc("newRankingMetrics", df._jdf) @@ -424,7 +424,7 @@ class MultilabelMetrics(JavaModelWrapper): def __init__(self, predictionAndLabels): sc = predictionAndLabels.ctx -sql_ctx = SQLContext(sc) +sql_ctx = SQLContext.getOrCreate(sc) df =
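The same idiom exists in the Scala API; here is a small sketch of why `getOrCreate` is preferable inside library code (the helper class and method are illustrative, not from the patch):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// A sketch of library code that needs a SQLContext but must not clobber
// the one the application may already be using.
class MetricsHelper(sc: SparkContext) {
  // getOrCreate returns the existing singleton SQLContext for this
  // SparkContext if one exists, and only constructs one otherwise.
  private val sqlContext = SQLContext.getOrCreate(sc)

  def emptyCount(): Long = sqlContext.emptyDataFrame.count()
}
```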
spark git commit: [SPARK-12380] [PYSPARK] use SQLContext.getOrCreate in mllib
Repository: spark Updated Branches: refs/heads/master 3a44aebd0 -> 27b98e99d [SPARK-12380] [PYSPARK] use SQLContext.getOrCreate in mllib MLlib should use SQLContext.getOrCreate() instead of creating a new SQLContext. Author: Davies LiuCloses #10338 from davies/create_context. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/27b98e99 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/27b98e99 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/27b98e99 Branch: refs/heads/master Commit: 27b98e99d21a0cc34955337f82a71a18f9220ab2 Parents: 3a44aeb Author: Davies Liu Authored: Wed Dec 16 15:48:11 2015 -0800 Committer: Davies Liu Committed: Wed Dec 16 15:48:11 2015 -0800 -- python/pyspark/mllib/common.py | 6 +++--- python/pyspark/mllib/evaluation.py | 10 +- python/pyspark/mllib/feature.py| 4 +--- 3 files changed, 9 insertions(+), 11 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/27b98e99/python/pyspark/mllib/common.py -- diff --git a/python/pyspark/mllib/common.py b/python/pyspark/mllib/common.py index a439a48..9fda1b1 100644 --- a/python/pyspark/mllib/common.py +++ b/python/pyspark/mllib/common.py @@ -102,7 +102,7 @@ def _java2py(sc, r, encoding="bytes"): return RDD(jrdd, sc) if clsName == 'DataFrame': -return DataFrame(r, SQLContext(sc)) +return DataFrame(r, SQLContext.getOrCreate(sc)) if clsName in _picklable_classes: r = sc._jvm.SerDe.dumps(r) @@ -125,7 +125,7 @@ def callJavaFunc(sc, func, *args): def callMLlibFunc(name, *args): """ Call API in PythonMLLibAPI """ -sc = SparkContext._active_spark_context +sc = SparkContext.getOrCreate() api = getattr(sc._jvm.PythonMLLibAPI(), name) return callJavaFunc(sc, api, *args) @@ -135,7 +135,7 @@ class JavaModelWrapper(object): Wrapper for the model in JVM """ def __init__(self, java_model): -self._sc = SparkContext._active_spark_context +self._sc = SparkContext.getOrCreate() self._java_model = java_model def __del__(self): http://git-wip-us.apache.org/repos/asf/spark/blob/27b98e99/python/pyspark/mllib/evaluation.py -- diff --git a/python/pyspark/mllib/evaluation.py b/python/pyspark/mllib/evaluation.py index 8c87ee9..22e68ea 100644 --- a/python/pyspark/mllib/evaluation.py +++ b/python/pyspark/mllib/evaluation.py @@ -44,7 +44,7 @@ class BinaryClassificationMetrics(JavaModelWrapper): def __init__(self, scoreAndLabels): sc = scoreAndLabels.ctx -sql_ctx = SQLContext(sc) +sql_ctx = SQLContext.getOrCreate(sc) df = sql_ctx.createDataFrame(scoreAndLabels, schema=StructType([ StructField("score", DoubleType(), nullable=False), StructField("label", DoubleType(), nullable=False)])) @@ -103,7 +103,7 @@ class RegressionMetrics(JavaModelWrapper): def __init__(self, predictionAndObservations): sc = predictionAndObservations.ctx -sql_ctx = SQLContext(sc) +sql_ctx = SQLContext.getOrCreate(sc) df = sql_ctx.createDataFrame(predictionAndObservations, schema=StructType([ StructField("prediction", DoubleType(), nullable=False), StructField("observation", DoubleType(), nullable=False)])) @@ -197,7 +197,7 @@ class MulticlassMetrics(JavaModelWrapper): def __init__(self, predictionAndLabels): sc = predictionAndLabels.ctx -sql_ctx = SQLContext(sc) +sql_ctx = SQLContext.getOrCreate(sc) df = sql_ctx.createDataFrame(predictionAndLabels, schema=StructType([ StructField("prediction", DoubleType(), nullable=False), StructField("label", DoubleType(), nullable=False)])) @@ -338,7 +338,7 @@ class RankingMetrics(JavaModelWrapper): def __init__(self, predictionAndLabels): sc = 
predictionAndLabels.ctx -sql_ctx = SQLContext(sc) +sql_ctx = SQLContext.getOrCreate(sc) df = sql_ctx.createDataFrame(predictionAndLabels, schema=sql_ctx._inferSchema(predictionAndLabels)) java_model = callMLlibFunc("newRankingMetrics", df._jdf) @@ -424,7 +424,7 @@ class MultilabelMetrics(JavaModelWrapper): def __init__(self, predictionAndLabels): sc = predictionAndLabels.ctx -sql_ctx = SQLContext(sc) +sql_ctx = SQLContext.getOrCreate(sc) df = sql_ctx.createDataFrame(predictionAndLabels, schema=sql_ctx._inferSchema(predictionAndLabels)) java_class =
spark git commit: [SPARK-9690][ML][PYTHON] pyspark CrossValidator random seed
Repository: spark Updated Branches: refs/heads/master 9657ee878 -> 3a44aebd0 [SPARK-9690][ML][PYTHON] pyspark CrossValidator random seed Extend CrossValidator with HasSeed in PySpark. This PR replaces [https://github.com/apache/spark/pull/7997] CC: yanboliang thunterdb mmenestret Would one of you mind taking a look? Thanks! Author: Joseph K. BradleyAuthor: Martin MENESTRET Closes #10268 from jkbradley/pyspark-cv-seed. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3a44aebd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3a44aebd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3a44aebd Branch: refs/heads/master Commit: 3a44aebd0c5331f6ff00734fa44ef63f8d18cfbb Parents: 9657ee8 Author: Martin Menestret Authored: Wed Dec 16 14:05:35 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 14:05:35 2015 -0800 -- python/pyspark/ml/tuning.py | 20 +--- 1 file changed, 13 insertions(+), 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3a44aebd/python/pyspark/ml/tuning.py -- diff --git a/python/pyspark/ml/tuning.py b/python/pyspark/ml/tuning.py index 705ee53..08f8db5 100644 --- a/python/pyspark/ml/tuning.py +++ b/python/pyspark/ml/tuning.py @@ -19,8 +19,9 @@ import itertools import numpy as np from pyspark import since -from pyspark.ml.param import Params, Param from pyspark.ml import Estimator, Model +from pyspark.ml.param import Params, Param +from pyspark.ml.param.shared import HasSeed from pyspark.ml.util import keyword_only from pyspark.sql.functions import rand @@ -89,7 +90,7 @@ class ParamGridBuilder(object): return [dict(zip(keys, prod)) for prod in itertools.product(*grid_values)] -class CrossValidator(Estimator): +class CrossValidator(Estimator, HasSeed): """ K-fold cross validation. @@ -129,9 +130,11 @@ class CrossValidator(Estimator): numFolds = Param(Params._dummy(), "numFolds", "number of folds for cross validation") @keyword_only -def __init__(self, estimator=None, estimatorParamMaps=None, evaluator=None, numFolds=3): +def __init__(self, estimator=None, estimatorParamMaps=None, evaluator=None, numFolds=3, + seed=None): """ -__init__(self, estimator=None, estimatorParamMaps=None, evaluator=None, numFolds=3) +__init__(self, estimator=None, estimatorParamMaps=None, evaluator=None, numFolds=3,\ + seed=None) """ super(CrossValidator, self).__init__() #: param for estimator to be cross-validated @@ -151,9 +154,11 @@ class CrossValidator(Estimator): @keyword_only @since("1.4.0") -def setParams(self, estimator=None, estimatorParamMaps=None, evaluator=None, numFolds=3): +def setParams(self, estimator=None, estimatorParamMaps=None, evaluator=None, numFolds=3, + seed=None): """ -setParams(self, estimator=None, estimatorParamMaps=None, evaluator=None, numFolds=3): +setParams(self, estimator=None, estimatorParamMaps=None, evaluator=None, numFolds=3,\ + seed=None): Sets params for cross validator. """ kwargs = self.setParams._input_kwargs @@ -225,9 +230,10 @@ class CrossValidator(Estimator): numModels = len(epm) eva = self.getOrDefault(self.evaluator) nFolds = self.getOrDefault(self.numFolds) +seed = self.getOrDefault(self.seed) h = 1.0 / nFolds randCol = self.uid + "_rand" -df = dataset.select("*", rand(0).alias(randCol)) +df = dataset.select("*", rand(seed).alias(randCol)) metrics = np.zeros(numModels) for i in range(nFolds): validateLB = i * h
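The fold-assignment mechanism that the seed now controls can be sketched in Scala as follows (a standalone illustration mirroring the Python code above, not the CrossValidator source; the column name is illustrative, where PySpark uses `uid + "_rand"`):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, rand}

object KFoldSketch {
  // Deterministic k-fold split via a seeded random column -- the same
  // mechanism the patch above seeds in PySpark's CrossValidator.
  def kFoldSplits(dataset: DataFrame, nFolds: Int, seed: Long): Seq[(DataFrame, DataFrame)] = {
    val h = 1.0 / nFolds
    val randCol = "cv_rand"
    val df = dataset.select(col("*"), rand(seed).as(randCol))
    (0 until nFolds).map { i =>
      val validateLB = i * h
      val validateUB = (i + 1) * h
      val validation = df.filter(df(randCol) >= validateLB && df(randCol) < validateUB)
      val training   = df.filter(df(randCol) < validateLB || df(randCol) >= validateUB)
      (training.drop(randCol), validation.drop(randCol))
    }
  }
}
```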
spark git commit: [SPARK-11677][SQL] ORC filter tests all pass if filters are actually not pushed down.
Repository: spark Updated Branches: refs/heads/master edf65cd96 -> 9657ee878 [SPARK-11677][SQL] ORC filter tests all pass if filters are actually not pushed down. Currently ORC filters are not tested properly. All the tests pass even if the filters are not pushed down or disabled. In this PR, I add some logic for this. Since ORC does not fully filter record by record, this checks the count of the result and whether it contains the expected values. Author: hyukjinkwonCloses #9687 from HyukjinKwon/SPARK-11677. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9657ee87 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9657ee87 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9657ee87 Branch: refs/heads/master Commit: 9657ee87888422c5596987fe760b49117a0ea4e2 Parents: edf65cd Author: hyukjinkwon Authored: Wed Dec 16 13:24:49 2015 -0800 Committer: Michael Armbrust Committed: Wed Dec 16 13:24:49 2015 -0800 -- .../spark/sql/hive/orc/OrcQuerySuite.scala | 53 +--- 1 file changed, 36 insertions(+), 17 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9657ee87/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala index 7efeab5..2156806 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala @@ -350,28 +350,47 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest { withTempPath { dir => withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") { import testImplicits._ - val path = dir.getCanonicalPath +val path = dir.getCanonicalPath + +// For field "a", the first column has odd integers. This is to check the filtered count +// when `isNull` is performed. For Field "b", `isNotNull` of ORC file filters rows +// only when all the values are null (maybe this works differently when the data +// or query is complicated). So, simply here a column only having `null` is added. +val data = (0 until 10).map { i => + val maybeInt = if (i % 2 == 0) None else Some(i) + val nullValue: Option[String] = None + (maybeInt, nullValue) +} +createDataFrame(data).toDF("a", "b").write.orc(path) val df = sqlContext.read.orc(path) -def checkPredicate(pred: Column, answer: Seq[Long]): Unit = { - checkAnswer(df.where(pred), answer.map(Row(_))) +def checkPredicate(pred: Column, answer: Seq[Row]): Unit = { + val sourceDf = stripSparkFilter(df.where(pred)) + val data = sourceDf.collect().toSet + val expectedData = answer.toSet + + // When a filter is pushed to ORC, ORC can apply it to rows. So, we can check + // the number of rows returned from the ORC to make sure our filter pushdown work. + // A tricky part is, ORC does not process filter rows fully but return some possible + // results. So, this checks if the number of result is less than the original count + // of data, and then checks if it contains the expected data. 
+ val isOrcFiltered = sourceDf.count < 10 && expectedData.subsetOf(data) + assert(isOrcFiltered) } -checkPredicate('id === 5, Seq(5L)) -checkPredicate('id <=> 5, Seq(5L)) -checkPredicate('id < 5, 0L to 4L) -checkPredicate('id <= 5, 0L to 5L) -checkPredicate('id > 5, 6L to 9L) -checkPredicate('id >= 5, 5L to 9L) -checkPredicate('id.isNull, Seq.empty[Long]) -checkPredicate('id.isNotNull, 0L to 9L) -checkPredicate('id.isin(1L, 3L, 5L), Seq(1L, 3L, 5L)) -checkPredicate('id > 0 && 'id < 3, 1L to 2L) -checkPredicate('id < 1 || 'id > 8, Seq(0L, 9L)) -checkPredicate(!('id > 3), 0L to 3L) -checkPredicate(!('id > 0 && 'id < 3), Seq(0L) ++ (3L to 9L)) +checkPredicate('a === 5, List(5).map(Row(_, null))) +checkPredicate('a <=> 5, List(5).map(Row(_, null))) +checkPredicate('a < 5, List(1, 3).map(Row(_, null))) +checkPredicate('a <= 5, List(1, 3, 5).map(Row(_, null))) +checkPredicate('a > 5, List(7, 9).map(Row(_, null))) +checkPredicate('a >= 5, List(5, 7, 9).map(Row(_, null))) +checkPredicate('a.isNull, List(null).map(Row(_, null))) +checkPredicate('b.isNotNull, List()) +
spark git commit: [SPARK-12164][SQL] Decode the encoded values and then display
Repository: spark Updated Branches: refs/heads/master a783a8ed4 -> edf65cd96 [SPARK-12164][SQL] Decode the encoded values and then display Based on the suggestions from marmbrus cloud-fan in https://github.com/apache/spark/pull/10165 , this PR is to print the decoded values (user objects) in `Dataset.show` ```scala implicit val kryoEncoder = Encoders.kryo[KryoClassData] val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), KryoClassData("c", 3)).toDS() ds.show(20, false); ``` The current output is like ``` +--+ |value | +--+ |[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 97, 2]| |[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 98, 4]| |[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 99, 6]| +--+ ``` After the fix, it will be like the below if and only if the user overrides the `toString` function in the class `KryoClassData` ```scala override def toString: String = s"KryoClassData($a, $b)" ``` ``` +---+ |value | +---+ |KryoClassData(a, 1)| |KryoClassData(b, 2)| |KryoClassData(c, 3)| +---+ ``` If users do not override the `toString` function, the results will be like ``` +---+ |value | +---+ |org.apache.spark.sql.KryoClassData68ef| |org.apache.spark.sql.KryoClassData6915| |org.apache.spark.sql.KryoClassData693b| +---+ ``` Question: Should we add another optional parameter in the function `show`? It would decide whether the function `show` displays the hex values or the object values. Author: gatorsmileCloses #10215 from gatorsmile/showDecodedValue. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/edf65cd9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/edf65cd9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/edf65cd9 Branch: refs/heads/master Commit: edf65cd961b913ef54104770630a50fd4b120b4b Parents: a783a8e Author: gatorsmile Authored: Wed Dec 16 13:22:34 2015 -0800 Committer: Michael Armbrust Committed: Wed Dec 16 13:22:34 2015 -0800 -- .../scala/org/apache/spark/sql/DataFrame.scala | 50 +-- .../scala/org/apache/spark/sql/Dataset.scala| 37 ++- .../apache/spark/sql/execution/Queryable.scala | 65 .../org/apache/spark/sql/DataFrameSuite.scala | 15 + .../org/apache/spark/sql/DatasetSuite.scala | 14 + 5 files changed, 133 insertions(+), 48 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/edf65cd9/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala index 497bd48..6250e95 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala @@ -165,13 +165,11 @@ class DataFrame private[sql]( * @param _numRows Number of rows to show * @param truncate Whether truncate long strings and align cells right */ - private[sql] def showString(_numRows: Int, truncate: Boolean = true): String = { + override private[sql] def showString(_numRows: Int, truncate: Boolean = true): String = { val numRows = _numRows.max(0) -val sb = new StringBuilder val takeResult = take(numRows + 1) val hasMoreData = takeResult.length > numRows val data = takeResult.take(numRows) -val numCols = schema.fieldNames.length
spark git commit: [MINOR] Add missing interpolation in NettyRPCEnv
Repository: spark Updated Branches: refs/heads/branch-1.6 552b38f87 -> 638b89bc3 [MINOR] Add missing interpolation in NettyRPCEnv ``` Exception in thread "main" org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in ${timeout.duration}. This timeout is controlled by spark.rpc.askTimeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) ``` Author: Andrew OrCloses #10334 from andrewor14/rpc-typo. (cherry picked from commit 861549acdbc11920cde51fc57752a8bc241064e5) Signed-off-by: Shixiong Zhu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/638b89bc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/638b89bc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/638b89bc Branch: refs/heads/branch-1.6 Commit: 638b89bc3b1c421fe11cbaf52649225662d3d3ce Parents: 552b38f Author: Andrew Or Authored: Wed Dec 16 16:13:48 2015 -0800 Committer: Shixiong Zhu Committed: Wed Dec 16 16:13:55 2015 -0800 -- core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/638b89bc/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala -- diff --git a/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala b/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala index 9d353bb..a53bc5e 100644 --- a/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala +++ b/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala @@ -239,7 +239,7 @@ private[netty] class NettyRpcEnv( val timeoutCancelable = timeoutScheduler.schedule(new Runnable { override def run(): Unit = { promise.tryFailure( - new TimeoutException("Cannot receive any reply in ${timeout.duration}")) + new TimeoutException(s"Cannot receive any reply in ${timeout.duration}")) } }, timeout.duration.toNanos, TimeUnit.NANOSECONDS) promise.future.onComplete { v =>
spark git commit: [MINOR] Add missing interpolation in NettyRPCEnv
Repository: spark Updated Branches: refs/heads/master 27b98e99d -> 861549acd [MINOR] Add missing interpolation in NettyRPCEnv ``` Exception in thread "main" org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in ${timeout.duration}. This timeout is controlled by spark.rpc.askTimeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) ``` Author: Andrew OrCloses #10334 from andrewor14/rpc-typo. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/861549ac Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/861549ac Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/861549ac Branch: refs/heads/master Commit: 861549acdbc11920cde51fc57752a8bc241064e5 Parents: 27b98e9 Author: Andrew Or Authored: Wed Dec 16 16:13:48 2015 -0800 Committer: Shixiong Zhu Committed: Wed Dec 16 16:13:48 2015 -0800 -- core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/861549ac/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala -- diff --git a/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala b/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala index f82fd4e..de3db6b 100644 --- a/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala +++ b/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala @@ -232,7 +232,7 @@ private[netty] class NettyRpcEnv( val timeoutCancelable = timeoutScheduler.schedule(new Runnable { override def run(): Unit = { promise.tryFailure( - new TimeoutException("Cannot receive any reply in ${timeout.duration}")) + new TimeoutException(s"Cannot receive any reply in ${timeout.duration}")) } }, timeout.duration.toNanos, TimeUnit.NANOSECONDS) promise.future.onComplete { v =>
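For readers skimming the one-character diff above, here is a tiny Scala sketch of what the missing `s` prefix changes (the object and strings are illustrative):

```scala
object InterpolationSketch extends App {
  case class Timeout(duration: String)
  val timeout = Timeout("120 seconds")

  // Without the s prefix, ${...} is literal text -- the bug being fixed.
  println("Cannot receive any reply in ${timeout.duration}")
  // With the s interpolator, the expression is evaluated and spliced in.
  println(s"Cannot receive any reply in ${timeout.duration}")
}
```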
spark git commit: [SPARK-12365][CORE] Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called
Repository: spark Updated Branches: refs/heads/master 38d9795a4 -> f590178d7 [SPARK-12365][CORE] Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called SPARK-9886 fixed ExternalBlockStore.scala This PR fixes the remaining references to Runtime.getRuntime.addShutdownHook() Author: tedyuCloses #10325 from ted-yu/master. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f590178d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f590178d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f590178d Branch: refs/heads/master Commit: f590178d7a06221a93286757c68b23919bee9f03 Parents: 38d9795 Author: tedyu Authored: Wed Dec 16 19:02:12 2015 -0800 Committer: Andrew Or Committed: Wed Dec 16 19:02:12 2015 -0800 -- .../spark/deploy/ExternalShuffleService.scala | 18 +-- .../deploy/mesos/MesosClusterDispatcher.scala | 13 --- .../apache/spark/util/ShutdownHookManager.scala | 4 scalastyle-config.xml | 12 ++ .../hive/thriftserver/SparkSQLCLIDriver.scala | 24 +--- 5 files changed, 38 insertions(+), 33 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f590178d/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala b/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala index e8a1e35..7fc96e4 100644 --- a/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala +++ b/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala @@ -28,7 +28,7 @@ import org.apache.spark.network.sasl.SaslServerBootstrap import org.apache.spark.network.server.{TransportServerBootstrap, TransportServer} import org.apache.spark.network.shuffle.ExternalShuffleBlockHandler import org.apache.spark.network.util.TransportConf -import org.apache.spark.util.Utils +import org.apache.spark.util.{ShutdownHookManager, Utils} /** * Provides a server from which Executors can read shuffle files (rather than reading directly from @@ -118,19 +118,13 @@ object ExternalShuffleService extends Logging { server = newShuffleService(sparkConf, securityManager) server.start() -installShutdownHook() +ShutdownHookManager.addShutdownHook { () => + logInfo("Shutting down shuffle service.") + server.stop() + barrier.countDown() +} // keep running until the process is terminated barrier.await() } - - private def installShutdownHook(): Unit = { -Runtime.getRuntime.addShutdownHook(new Thread("External Shuffle Service shutdown thread") { - override def run() { -logInfo("Shutting down shuffle service.") -server.stop() -barrier.countDown() - } -}) - } } http://git-wip-us.apache.org/repos/asf/spark/blob/f590178d/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala b/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala index 5d4e5b8..389eff5 100644 --- a/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala +++ b/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala @@ -22,7 +22,7 @@ import java.util.concurrent.CountDownLatch import org.apache.spark.deploy.mesos.ui.MesosClusterUI import org.apache.spark.deploy.rest.mesos.MesosRestServer import org.apache.spark.scheduler.cluster.mesos._ -import org.apache.spark.util.SignalLogger +import org.apache.spark.util.{ShutdownHookManager, SignalLogger} import 
org.apache.spark.{Logging, SecurityManager, SparkConf} /* @@ -103,14 +103,11 @@ private[mesos] object MesosClusterDispatcher extends Logging { } val dispatcher = new MesosClusterDispatcher(dispatcherArgs, conf) dispatcher.start() -val shutdownHook = new Thread() { - override def run() { -logInfo("Shutdown hook is shutting down dispatcher") -dispatcher.stop() -dispatcher.awaitShutdown() - } +ShutdownHookManager.addShutdownHook { () => + logInfo("Shutdown hook is shutting down dispatcher") + dispatcher.stop() + dispatcher.awaitShutdown() } -Runtime.getRuntime.addShutdownHook(shutdownHook) dispatcher.awaitShutdown() } } http://git-wip-us.apache.org/repos/asf/spark/blob/f590178d/core/src/main/scala/org/apache/spark/util/ShutdownHookManager.scala
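A sketch of the before/after pattern this refactor standardizes (the object, hook bodies, and package name are illustrative; `ShutdownHookManager` is a `private[spark]` utility, so the sketch assumes code living under the `org.apache.spark` package tree):

```scala
package org.apache.spark.sketch // assumption: needed for private[spark] visibility

import org.apache.spark.util.ShutdownHookManager

object ShutdownHookSketch {
  // Before: a raw JVM hook, outside Spark's prioritized shutdown sequence.
  def installRawHook(stop: () => Unit): Unit = {
    Runtime.getRuntime.addShutdownHook(new Thread("example-shutdown") {
      override def run(): Unit = stop()
    })
  }

  // After: registering through ShutdownHookManager so the hook participates
  // in Spark's priority-ordered shutdown.
  def installManagedHook(stop: () => Unit): AnyRef = {
    ShutdownHookManager.addShutdownHook { () =>
      stop()
    }
  }
}
```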
spark git commit: [SPARK-12365][CORE] Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called
Repository: spark
Updated Branches: refs/heads/branch-1.6 fb02e4e3b -> 4af64385b

[SPARK-12365][CORE] Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called

SPARK-9886 fixed ExternalBlockStore.scala. This PR fixes the remaining references to Runtime.getRuntime.addShutdownHook().

Author: tedyu

Closes #10325 from ted-yu/master.

(cherry picked from commit f590178d7a06221a93286757c68b23919bee9f03)
Signed-off-by: Andrew Or

Conflicts:
	sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4af64385
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4af64385
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4af64385

Branch: refs/heads/branch-1.6
Commit: 4af64385b085002d94c54d11bbd144f9f026bbd8
Parents: fb02e4e
Author: tedyu
Authored: Wed Dec 16 19:02:12 2015 -0800
Committer: Andrew Or
Committed: Wed Dec 16 19:03:30 2015 -0800

--
 .../spark/deploy/ExternalShuffleService.scala   | 18 ++
 .../deploy/mesos/MesosClusterDispatcher.scala   | 13 +
 .../apache/spark/util/ShutdownHookManager.scala |  4
 scalastyle-config.xml                           | 12
 4 files changed, 27 insertions(+), 20 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/4af64385/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala
--
diff --git a/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala b/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala
index e8a1e35..7fc96e4 100644
--- a/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala
@@ -28,7 +28,7 @@
 import org.apache.spark.network.sasl.SaslServerBootstrap
 import org.apache.spark.network.server.{TransportServerBootstrap, TransportServer}
 import org.apache.spark.network.shuffle.ExternalShuffleBlockHandler
 import org.apache.spark.network.util.TransportConf
-import org.apache.spark.util.Utils
+import org.apache.spark.util.{ShutdownHookManager, Utils}

 /**
  * Provides a server from which Executors can read shuffle files (rather than reading directly from
@@ -118,19 +118,13 @@ object ExternalShuffleService extends Logging {
     server = newShuffleService(sparkConf, securityManager)
     server.start()

-    installShutdownHook()
+    ShutdownHookManager.addShutdownHook { () =>
+      logInfo("Shutting down shuffle service.")
+      server.stop()
+      barrier.countDown()
+    }

     // keep running until the process is terminated
     barrier.await()
   }
-
-  private def installShutdownHook(): Unit = {
-    Runtime.getRuntime.addShutdownHook(new Thread("External Shuffle Service shutdown thread") {
-      override def run() {
-        logInfo("Shutting down shuffle service.")
-        server.stop()
-        barrier.countDown()
-      }
-    })
-  }
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/4af64385/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala
--
diff --git a/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala b/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala
index 5d4e5b8..389eff5 100644
--- a/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala
@@ -22,7 +22,7 @@ import java.util.concurrent.CountDownLatch
 import org.apache.spark.deploy.mesos.ui.MesosClusterUI
 import org.apache.spark.deploy.rest.mesos.MesosRestServer
 import org.apache.spark.scheduler.cluster.mesos._
-import org.apache.spark.util.SignalLogger
+import org.apache.spark.util.{ShutdownHookManager, SignalLogger}
 import org.apache.spark.{Logging, SecurityManager, SparkConf}

 /*
@@ -103,14 +103,11 @@ private[mesos] object MesosClusterDispatcher extends Logging {
     }
     val dispatcher = new MesosClusterDispatcher(dispatcherArgs, conf)
     dispatcher.start()
-    val shutdownHook = new Thread() {
-      override def run() {
-        logInfo("Shutdown hook is shutting down dispatcher")
-        dispatcher.stop()
-        dispatcher.awaitShutdown()
-      }
+    ShutdownHookManager.addShutdownHook { () =>
+      logInfo("Shutdown hook is shutting down dispatcher")
+      dispatcher.stop()
+      dispatcher.awaitShutdown()
     }
-    Runtime.getRuntime.addShutdownHook(shutdownHook)
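For reference, a minimal sketch of the registration idiom this commit converges on. Note that `ShutdownHookManager` is `private[spark]` in this release, so the sketch has to live under the `org.apache.spark` package and reflects the in-tree idiom rather than a public API; the package and object names below are illustrative:

```
package org.apache.spark.demo

import org.apache.spark.util.ShutdownHookManager

object HookDemo {
  def main(args: Array[String]): Unit = {
    // Register cleanup through Spark's manager instead of a raw JVM hook;
    // hooks run on shutdown in priority order and are managed in one place.
    ShutdownHookManager.addShutdownHook { () =>
      println("Shutting down service.")
    }
  }
}
```

Compared with `Runtime.getRuntime.addShutdownHook(new Thread { ... })`, this avoids one ad-hoc thread per component and gives Spark a single place to order and cancel its hooks.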
spark git commit: [SPARK-12186][WEB UI] Send the complete request URI including the query string when redirecting.
Repository: spark
Updated Branches: refs/heads/branch-1.6 4af64385b -> 154567dca

[SPARK-12186][WEB UI] Send the complete request URI including the query string when redirecting.

Author: Rohit Agarwal

Closes #10180 from mindprince/SPARK-12186.

(cherry picked from commit fdb38227564c1af40cbfb97df420b23eb04c002b)
Signed-off-by: Andrew Or

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/154567dc
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/154567dc
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/154567dc

Branch: refs/heads/branch-1.6
Commit: 154567dca126d4992c9c9b08d71d22e9af43c995
Parents: 4af6438
Author: Rohit Agarwal
Authored: Wed Dec 16 19:04:33 2015 -0800
Committer: Andrew Or
Committed: Wed Dec 16 19:04:43 2015 -0800

--
 .../scala/org/apache/spark/deploy/history/HistoryServer.scala | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/154567dc/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
--
diff --git a/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala b/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
index d4f327c..f31fef0 100644
--- a/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
@@ -103,7 +103,9 @@ class HistoryServer(
       // Note we don't use the UI retrieved from the cache; the cache loader above will register
       // the app's UI, and all we need to do is redirect the user to the same URI that was
       // requested, and the proper data should be served at that point.
-      res.sendRedirect(res.encodeRedirectURL(req.getRequestURI()))
+      // Also, make sure that the redirect url contains the query string present in the request.
+      val requestURI = req.getRequestURI + Option(req.getQueryString).map("?" + _).getOrElse("")
+      res.sendRedirect(res.encodeRedirectURL(requestURI))
     }

     // SPARK-5983 ensure TRACE is not supported
spark git commit: [SPARK-12186][WEB UI] Send the complete request URI including the query string when redirecting.
Repository: spark
Updated Branches: refs/heads/master f590178d7 -> fdb382275

[SPARK-12186][WEB UI] Send the complete request URI including the query string when redirecting.

Author: Rohit Agarwal

Closes #10180 from mindprince/SPARK-12186.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fdb38227
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fdb38227
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fdb38227

Branch: refs/heads/master
Commit: fdb38227564c1af40cbfb97df420b23eb04c002b
Parents: f590178
Author: Rohit Agarwal
Authored: Wed Dec 16 19:04:33 2015 -0800
Committer: Andrew Or
Committed: Wed Dec 16 19:04:33 2015 -0800

--
 .../scala/org/apache/spark/deploy/history/HistoryServer.scala | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/fdb38227/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
--
diff --git a/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala b/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
index d4f327c..f31fef0 100644
--- a/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
@@ -103,7 +103,9 @@ class HistoryServer(
       // Note we don't use the UI retrieved from the cache; the cache loader above will register
       // the app's UI, and all we need to do is redirect the user to the same URI that was
       // requested, and the proper data should be served at that point.
-      res.sendRedirect(res.encodeRedirectURL(req.getRequestURI()))
+      // Also, make sure that the redirect url contains the query string present in the request.
+      val requestURI = req.getRequestURI + Option(req.getQueryString).map("?" + _).getOrElse("")
+      res.sendRedirect(res.encodeRedirectURL(requestURI))
     }

     // SPARK-5983 ensure TRACE is not supported
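A standalone sketch of the redirect pattern both commits apply, assuming only the servlet API (the object and method names are illustrative):

```
import javax.servlet.http.{HttpServletRequest, HttpServletResponse}

object Redirects {
  // getQueryString returns null when the URL carries no '?', hence the Option guard.
  def redirectPreservingQuery(req: HttpServletRequest, res: HttpServletResponse): Unit = {
    val target = req.getRequestURI + Option(req.getQueryString).map("?" + _).getOrElse("")
    res.sendRedirect(res.encodeRedirectURL(target))
  }
}
```

With this, a request like `/history/<appId>?<params>` is redirected with its parameters intact instead of having them dropped.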
spark git commit: [SPARK-12390] Clean up unused serializer parameter in BlockManager
Repository: spark
Updated Branches: refs/heads/master d1508dd9b -> 97678edea

[SPARK-12390] Clean up unused serializer parameter in BlockManager

No change in functionality is intended. This only changes internal API.

Author: Andrew Or

Closes #10343 from andrewor14/clean-bm-serializer.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/97678ede
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/97678ede
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/97678ede

Branch: refs/heads/master
Commit: 97678edeaaafc19ea18d044233a952d2e2e89fbc
Parents: d1508dd
Author: Andrew Or
Authored: Wed Dec 16 20:01:47 2015 -0800
Committer: Andrew Or
Committed: Wed Dec 16 20:01:47 2015 -0800

--
 .../org/apache/spark/storage/BlockManager.scala | 29 
 .../org/apache/spark/storage/DiskStore.scala    | 10 ---
 2 files changed, 11 insertions(+), 28 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/97678ede/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
--
diff --git a/core/src/main/scala/org/apache/spark/storage/BlockManager.scala b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
index 540e1ec..6074fc5 100644
--- a/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
+++ b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
@@ -1190,20 +1190,16 @@ private[spark] class BlockManager(
   def dataSerializeStream(
       blockId: BlockId,
       outputStream: OutputStream,
-      values: Iterator[Any],
-      serializer: Serializer = defaultSerializer): Unit = {
+      values: Iterator[Any]): Unit = {
     val byteStream = new BufferedOutputStream(outputStream)
-    val ser = serializer.newInstance()
+    val ser = defaultSerializer.newInstance()
     ser.serializeStream(wrapForCompression(blockId, byteStream)).writeAll(values).close()
   }

   /** Serializes into a byte buffer. */
-  def dataSerialize(
-      blockId: BlockId,
-      values: Iterator[Any],
-      serializer: Serializer = defaultSerializer): ByteBuffer = {
+  def dataSerialize(blockId: BlockId, values: Iterator[Any]): ByteBuffer = {
     val byteStream = new ByteBufferOutputStream(4096)
-    dataSerializeStream(blockId, byteStream, values, serializer)
+    dataSerializeStream(blockId, byteStream, values)
     byteStream.toByteBuffer
   }

@@ -1211,24 +1207,21 @@ private[spark] class BlockManager(
    * Deserializes a ByteBuffer into an iterator of values and disposes of it when the end of
    * the iterator is reached.
    */
-  def dataDeserialize(
-      blockId: BlockId,
-      bytes: ByteBuffer,
-      serializer: Serializer = defaultSerializer): Iterator[Any] = {
+  def dataDeserialize(blockId: BlockId, bytes: ByteBuffer): Iterator[Any] = {
     bytes.rewind()
-    dataDeserializeStream(blockId, new ByteBufferInputStream(bytes, true), serializer)
+    dataDeserializeStream(blockId, new ByteBufferInputStream(bytes, true))
   }

   /**
    * Deserializes a InputStream into an iterator of values and disposes of it when the end of
    * the iterator is reached.
    */
-  def dataDeserializeStream(
-      blockId: BlockId,
-      inputStream: InputStream,
-      serializer: Serializer = defaultSerializer): Iterator[Any] = {
+  def dataDeserializeStream(blockId: BlockId, inputStream: InputStream): Iterator[Any] = {
     val stream = new BufferedInputStream(inputStream)
-    serializer.newInstance().deserializeStream(wrapForCompression(blockId, stream)).asIterator
+    defaultSerializer
+      .newInstance()
+      .deserializeStream(wrapForCompression(blockId, stream))
+      .asIterator
   }

   def stop(): Unit = {

http://git-wip-us.apache.org/repos/asf/spark/blob/97678ede/core/src/main/scala/org/apache/spark/storage/DiskStore.scala
--
diff --git a/core/src/main/scala/org/apache/spark/storage/DiskStore.scala b/core/src/main/scala/org/apache/spark/storage/DiskStore.scala
index c008b9d..6c44771 100644
--- a/core/src/main/scala/org/apache/spark/storage/DiskStore.scala
+++ b/core/src/main/scala/org/apache/spark/storage/DiskStore.scala
@@ -144,16 +144,6 @@ private[spark] class DiskStore(blockManager: BlockManager, diskManager: DiskBloc
     getBytes(blockId).map(buffer => blockManager.dataDeserialize(blockId, buffer))
   }

-  /**
-   * A version of getValues that allows a custom serializer. This is used as part of the
-   * shuffle short-circuit code.
-   */
-  def getValues(blockId: BlockId, serializer: Serializer): Option[Iterator[Any]] = {
-    // TODO: Should bypass getBytes and use a stream based implementation, so that
-    //
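A self-contained sketch of the round trip the simplified methods now standardize on, with one fixed serializer shared by the write and read paths (the helpers are stand-ins, not the `BlockManager` internals, and compression wrapping is omitted):

```
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.apache.spark.SparkConf
import org.apache.spark.serializer.JavaSerializer

object RoundTrip {
  // The single "default" serializer every call site now shares.
  val defaultSerializer = new JavaSerializer(new SparkConf())

  def serialize(values: Iterator[Any]): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    defaultSerializer.newInstance().serializeStream(out).writeAll(values).close()
    out.toByteArray
  }

  def deserialize(bytes: Array[Byte]): Iterator[Any] =
    defaultSerializer.newInstance()
      .deserializeStream(new ByteArrayInputStream(bytes))
      .asIterator

  def main(args: Array[String]): Unit =
    println(deserialize(serialize(Iterator(1, 2, 3))).toList)  // List(1, 2, 3)
}
```

Dropping the `serializer` parameter removes the temptation to mix serializers between writing and reading a block, which only works when both sides happen to agree.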
spark git commit: [SPARK-12057][SQL] Prevent failure on corrupt JSON records
Repository: spark
Updated Branches: refs/heads/branch-1.6 4ad08035d -> d509194b8

[SPARK-12057][SQL] Prevent failure on corrupt JSON records

This PR makes the JSON parser and schema inference handle more cases where we have unparsed records. It is based on #10043. The last commit fixes the failed test and updates the logic of schema inference.

Regarding the schema inference change, if we have something like
```
{"f1":1}
[1,2,3]
```
originally, we will get a DF without any column. After this change, we will get a DF with columns `f1` and `_corrupt_record`. Basically, for the second row, `[1,2,3]` will be the value of `_corrupt_record`.

When merging this PR, please make sure that the author is simplyianm.

JIRA: https://issues.apache.org/jira/browse/SPARK-12057

Closes #10043

Author: Ian Macalinao
Author: Yin Huai

Closes #10288 from yhuai/handleCorruptJson.

(cherry picked from commit 9d66c4216ad830812848c657bbcd8cd50949e199)
Signed-off-by: Reynold Xin

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d509194b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d509194b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d509194b

Branch: refs/heads/branch-1.6
Commit: d509194b81abc3c7bf9563d26560d596e1415627
Parents: 4ad0803
Author: Yin Huai
Authored: Wed Dec 16 23:18:53 2015 -0800
Committer: Reynold Xin
Committed: Wed Dec 16 23:19:06 2015 -0800

--
 .../datasources/json/InferSchema.scala          | 37 +---
 .../datasources/json/JacksonParser.scala        | 19 ++
 .../execution/datasources/json/JsonSuite.scala  | 37
 .../datasources/json/TestJsonData.scala         |  9 -
 4 files changed, 90 insertions(+), 12 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/d509194b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala
index 922fd5b..59ba4ae 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala
@@ -61,7 +61,10 @@ private[json] object InferSchema {
           StructType(Seq(StructField(columnNameOfCorruptRecords, StringType)))
         }
       }
-    }.treeAggregate[DataType](StructType(Seq()))(compatibleRootType, compatibleRootType)
+    }.treeAggregate[DataType](
+      StructType(Seq()))(
+      compatibleRootType(columnNameOfCorruptRecords),
+      compatibleRootType(columnNameOfCorruptRecords))

     canonicalizeType(rootType) match {
       case Some(st: StructType) => st
@@ -170,12 +173,38 @@ private[json] object InferSchema {
     case other => Some(other)
   }

+  private def withCorruptField(
+      struct: StructType,
+      columnNameOfCorruptRecords: String): StructType = {
+    if (!struct.fieldNames.contains(columnNameOfCorruptRecords)) {
+      // If this given struct does not have a column used for corrupt records,
+      // add this field.
+      struct.add(columnNameOfCorruptRecords, StringType, nullable = true)
+    } else {
+      // Otherwise, just return this struct.
+      struct
+    }
+  }
+
   /**
    * Remove top-level ArrayType wrappers and merge the remaining schemas
    */
-  private def compatibleRootType: (DataType, DataType) => DataType = {
-    case (ArrayType(ty1, _), ty2) => compatibleRootType(ty1, ty2)
-    case (ty1, ArrayType(ty2, _)) => compatibleRootType(ty1, ty2)
+  private def compatibleRootType(
+      columnNameOfCorruptRecords: String): (DataType, DataType) => DataType = {
+    // Since we support array of json objects at the top level,
+    // we need to check the element type and find the root level data type.
+    case (ArrayType(ty1, _), ty2) => compatibleRootType(columnNameOfCorruptRecords)(ty1, ty2)
+    case (ty1, ArrayType(ty2, _)) => compatibleRootType(columnNameOfCorruptRecords)(ty1, ty2)
+    // If we see any other data type at the root level, we get records that cannot be
+    // parsed. So, we use the struct as the data type and add the corrupt field to the schema.
+    case (struct: StructType, NullType) => struct
+    case (NullType, struct: StructType) => struct
+    case (struct: StructType, o) if !o.isInstanceOf[StructType] =>
+      withCorruptField(struct, columnNameOfCorruptRecords)
+    case (o, struct: StructType) if !o.isInstanceOf[StructType] =>
+      withCorruptField(struct,
spark git commit: [SPARK-12057][SQL] Prevent failure on corrupt JSON records
Repository: spark
Updated Branches: refs/heads/master 437583f69 -> 9d66c4216

[SPARK-12057][SQL] Prevent failure on corrupt JSON records

This PR makes the JSON parser and schema inference handle more cases where we have unparsed records. It is based on #10043. The last commit fixes the failed test and updates the logic of schema inference.

Regarding the schema inference change, if we have something like
```
{"f1":1}
[1,2,3]
```
originally, we will get a DF without any column. After this change, we will get a DF with columns `f1` and `_corrupt_record`. Basically, for the second row, `[1,2,3]` will be the value of `_corrupt_record`.

When merging this PR, please make sure that the author is simplyianm.

JIRA: https://issues.apache.org/jira/browse/SPARK-12057

Closes #10043

Author: Ian Macalinao
Author: Yin Huai

Closes #10288 from yhuai/handleCorruptJson.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9d66c421
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9d66c421
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9d66c421

Branch: refs/heads/master
Commit: 9d66c4216ad830812848c657bbcd8cd50949e199
Parents: 437583f
Author: Yin Huai
Authored: Wed Dec 16 23:18:53 2015 -0800
Committer: Reynold Xin
Committed: Wed Dec 16 23:18:53 2015 -0800

--
 .../datasources/json/InferSchema.scala          | 37 +---
 .../datasources/json/JacksonParser.scala        | 19 ++
 .../execution/datasources/json/JsonSuite.scala  | 37
 .../datasources/json/TestJsonData.scala         |  9 -
 4 files changed, 90 insertions(+), 12 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/9d66c421/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala
index 922fd5b..59ba4ae 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala
@@ -61,7 +61,10 @@ private[json] object InferSchema {
           StructType(Seq(StructField(columnNameOfCorruptRecords, StringType)))
         }
       }
-    }.treeAggregate[DataType](StructType(Seq()))(compatibleRootType, compatibleRootType)
+    }.treeAggregate[DataType](
+      StructType(Seq()))(
+      compatibleRootType(columnNameOfCorruptRecords),
+      compatibleRootType(columnNameOfCorruptRecords))

     canonicalizeType(rootType) match {
       case Some(st: StructType) => st
@@ -170,12 +173,38 @@ private[json] object InferSchema {
     case other => Some(other)
   }

+  private def withCorruptField(
+      struct: StructType,
+      columnNameOfCorruptRecords: String): StructType = {
+    if (!struct.fieldNames.contains(columnNameOfCorruptRecords)) {
+      // If this given struct does not have a column used for corrupt records,
+      // add this field.
+      struct.add(columnNameOfCorruptRecords, StringType, nullable = true)
+    } else {
+      // Otherwise, just return this struct.
+      struct
+    }
+  }
+
   /**
    * Remove top-level ArrayType wrappers and merge the remaining schemas
    */
-  private def compatibleRootType: (DataType, DataType) => DataType = {
-    case (ArrayType(ty1, _), ty2) => compatibleRootType(ty1, ty2)
-    case (ty1, ArrayType(ty2, _)) => compatibleRootType(ty1, ty2)
+  private def compatibleRootType(
+      columnNameOfCorruptRecords: String): (DataType, DataType) => DataType = {
+    // Since we support array of json objects at the top level,
+    // we need to check the element type and find the root level data type.
+    case (ArrayType(ty1, _), ty2) => compatibleRootType(columnNameOfCorruptRecords)(ty1, ty2)
+    case (ty1, ArrayType(ty2, _)) => compatibleRootType(columnNameOfCorruptRecords)(ty1, ty2)
+    // If we see any other data type at the root level, we get records that cannot be
+    // parsed. So, we use the struct as the data type and add the corrupt field to the schema.
+    case (struct: StructType, NullType) => struct
+    case (NullType, struct: StructType) => struct
+    case (struct: StructType, o) if !o.isInstanceOf[StructType] =>
+      withCorruptField(struct, columnNameOfCorruptRecords)
+    case (o, struct: StructType) if !o.isInstanceOf[StructType] =>
+      withCorruptField(struct, columnNameOfCorruptRecords)
+    // If we get anything else, we call compatibleType.
+    // Usually, when we reach here, ty1 and ty2 are two
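To see the new behavior end to end, a sketch against the public JSON reader (the local-mode setup and the sample rows are illustrative):

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CorruptJsonDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("corrupt-json").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // One parsable object and one top-level array, as in the example above.
    val rows = sc.parallelize(Seq("""{"f1":1}""", """[1,2,3]"""))
    val df = sqlContext.read.json(rows)

    df.printSchema()  // expected: both `f1` and `_corrupt_record`
    df.show()         // "[1,2,3]" should land in `_corrupt_record`
    sc.stop()
  }
}
```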
spark git commit: [SPARK-11904][PYSPARK] reduceByKeyAndWindow does not require checkpointing when invFunc is None
Repository: spark
Updated Branches: refs/heads/master 97678edea -> 437583f69

[SPARK-11904][PYSPARK] reduceByKeyAndWindow does not require checkpointing when invFunc is None

When invFunc is None, `reduceByKeyAndWindow(func, None, winsize, slidesize)` is equivalent to `reduceByKey(func).window(winsize, slidesize).reduceByKey(func)` and no checkpoint is necessary. The corresponding Scala code does exactly that, but the Python code always creates a windowed stream with obligatory checkpointing. The patch fixes this.

I do not know how to unit-test this.

Author: David Tolpin

Closes #9888 from dtolpin/master.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/437583f6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/437583f6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/437583f6

Branch: refs/heads/master
Commit: 437583f692e30b8dc03b339a34e92595d7b992ba
Parents: 97678ed
Author: David Tolpin
Authored: Wed Dec 16 22:10:24 2015 -0800
Committer: Shixiong Zhu
Committed: Wed Dec 16 22:10:24 2015 -0800

--
 python/pyspark/streaming/dstream.py | 45
 1 file changed, 23 insertions(+), 22 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/437583f6/python/pyspark/streaming/dstream.py
--
diff --git a/python/pyspark/streaming/dstream.py b/python/pyspark/streaming/dstream.py
index f61137c..b994a53 100644
--- a/python/pyspark/streaming/dstream.py
+++ b/python/pyspark/streaming/dstream.py
@@ -542,31 +542,32 @@ class DStream(object):

         reduced = self.reduceByKey(func, numPartitions)

-        def reduceFunc(t, a, b):
-            b = b.reduceByKey(func, numPartitions)
-            r = a.union(b).reduceByKey(func, numPartitions) if a else b
-            if filterFunc:
-                r = r.filter(filterFunc)
-            return r
-
-        def invReduceFunc(t, a, b):
-            b = b.reduceByKey(func, numPartitions)
-            joined = a.leftOuterJoin(b, numPartitions)
-            return joined.mapValues(lambda kv: invFunc(kv[0], kv[1])
-                                    if kv[1] is not None else kv[0])
-
-        jreduceFunc = TransformFunction(self._sc, reduceFunc, reduced._jrdd_deserializer)
         if invFunc:
+            def reduceFunc(t, a, b):
+                b = b.reduceByKey(func, numPartitions)
+                r = a.union(b).reduceByKey(func, numPartitions) if a else b
+                if filterFunc:
+                    r = r.filter(filterFunc)
+                return r
+
+            def invReduceFunc(t, a, b):
+                b = b.reduceByKey(func, numPartitions)
+                joined = a.leftOuterJoin(b, numPartitions)
+                return joined.mapValues(lambda kv: invFunc(kv[0], kv[1])
+                                        if kv[1] is not None else kv[0])
+
+            jreduceFunc = TransformFunction(self._sc, reduceFunc, reduced._jrdd_deserializer)
             jinvReduceFunc = TransformFunction(self._sc, invReduceFunc, reduced._jrdd_deserializer)
+            if slideDuration is None:
+                slideDuration = self._slideDuration
+            dstream = self._sc._jvm.PythonReducedWindowedDStream(
+                reduced._jdstream.dstream(),
+                jreduceFunc, jinvReduceFunc,
+                self._ssc._jduration(windowDuration),
+                self._ssc._jduration(slideDuration))
+            return DStream(dstream.asJavaDStream(), self._ssc, self._sc.serializer)
         else:
-            jinvReduceFunc = None
-        if slideDuration is None:
-            slideDuration = self._slideDuration
-        dstream = self._sc._jvm.PythonReducedWindowedDStream(reduced._jdstream.dstream(),
-                                                             jreduceFunc, jinvReduceFunc,
-                                                             self._ssc._jduration(windowDuration),
-                                                             self._ssc._jduration(slideDuration))
-        return DStream(dstream.asJavaDStream(), self._ssc, self._sc.serializer)
+            return reduced.window(windowDuration, slideDuration).reduceByKey(func, numPartitions)

     def updateStateByKey(self, updateFunc, numPartitions=None, initialRDD=None):
         """
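The Scala-side equivalence the patch mirrors can be sketched as follows (the socket source, durations, and checkpoint directory are illustrative):

```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("window-sketch").setMaster("local[2]"), Seconds(1))
    val pairs = ssc.socketTextStream("localhost", 9999).map(word => (word, 1))

    // With an inverse function, windows are maintained incrementally and
    // checkpointing is mandatory:
    ssc.checkpoint("/tmp/window-ckpt")
    val incremental =
      pairs.reduceByKeyAndWindow(_ + _, _ - _, Seconds(30), Seconds(10))

    // Without one, each window is simply recomputed from scratch -- the shape
    // the Python path now produces, with no checkpoint requirement:
    val recomputed =
      pairs.reduceByKey(_ + _).window(Seconds(30), Seconds(10)).reduceByKey(_ + _)

    incremental.print()
    recomputed.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```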
spark git commit: [SPARK-10248][CORE] track exceptions in dagscheduler event loop in tests
Repository: spark
Updated Branches: refs/heads/branch-1.6 638b89bc3 -> fb02e4e3b

[SPARK-10248][CORE] track exceptions in dagscheduler event loop in tests

`DAGSchedulerEventLoop` normally only logs errors (so it can continue to process more events, from other jobs). However, this is not desirable in the tests -- the tests should be able to easily detect any exception, and also shouldn't silently succeed if there is an exception.

This was suggested by mateiz on https://github.com/apache/spark/pull/7699. It may have already turned up an issue in "zero split job".

Author: Imran Rashid

Closes #8466 from squito/SPARK-10248.

(cherry picked from commit 38d9795a4fa07086d65ff705ce86648345618736)
Signed-off-by: Andrew Or

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fb02e4e3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fb02e4e3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fb02e4e3

Branch: refs/heads/branch-1.6
Commit: fb02e4e3bcc50a8f823dfecdb2eef71287225e7b
Parents: 638b89b
Author: Imran Rashid
Authored: Wed Dec 16 19:01:05 2015 -0800
Committer: Andrew Or
Committed: Wed Dec 16 19:01:13 2015 -0800

--
 .../apache/spark/scheduler/DAGScheduler.scala  |  5 ++--
 .../spark/scheduler/DAGSchedulerSuite.scala    | 28 ++--
 2 files changed, 29 insertions(+), 4 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/fb02e4e3/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
--
diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
index e01a960..b805bde 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
@@ -802,7 +802,8 @@ class DAGScheduler(
   private[scheduler] def cleanUpAfterSchedulerStop() {
     for (job <- activeJobs) {
-      val error = new SparkException("Job cancelled because SparkContext was shut down")
+      val error =
+        new SparkException(s"Job ${job.jobId} cancelled because SparkContext was shut down")
       job.listener.jobFailed(error)
       // Tell the listeners that all of the running stages have ended.  Don't bother
       // cancelling the stages because if the DAG scheduler is stopped, the entire application
@@ -1291,7 +1292,7 @@ class DAGScheduler(
     case TaskResultLost =>
       // Do nothing here; the TaskScheduler handles these failures and resubmits the task.

-    case other =>
+    case _: ExecutorLostFailure | TaskKilled | UnknownReason =>
       // Unrecognized failure - also do nothing. If the task fails repeatedly, the TaskScheduler
       // will abort the job.
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/fb02e4e3/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
--
diff --git a/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
index 653d41f..2869f0f 100644
--- a/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
+++ b/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
@@ -45,6 +45,13 @@ class DAGSchedulerEventProcessLoopTester(dagScheduler: DAGScheduler)
       case NonFatal(e) => onError(e)
     }
   }
+
+  override def onError(e: Throwable): Unit = {
+    logError("Error in DAGSchedulerEventLoop: ", e)
+    dagScheduler.stop()
+    throw e
+  }
+
 }

 /**
@@ -300,13 +307,18 @@ class DAGSchedulerSuite

   test("zero split job") {
     var numResults = 0
+    var failureReason: Option[Exception] = None
     val fakeListener = new JobListener() {
-      override def taskSucceeded(partition: Int, value: Any) = numResults += 1
-      override def jobFailed(exception: Exception) = throw exception
+      override def taskSucceeded(partition: Int, value: Any): Unit = numResults += 1
+      override def jobFailed(exception: Exception): Unit = {
+        failureReason = Some(exception)
+      }
     }
     val jobId = submit(new MyRDD(sc, 0, Nil), Array(), listener = fakeListener)
     assert(numResults === 0)
     cancel(jobId)
+    assert(failureReason.isDefined)
+    assert(failureReason.get.getMessage() === "Job 0 cancelled ")
   }

   test("run trivial job") {
@@ -1675,6 +1687,18 @@ class DAGSchedulerSuite
     assert(stackTraceString.contains("org.scalatest.FunSuite"))
   }

+  test("catch errors in event loop") {
+    // this is a test of our testing framework -- make sure errors
spark git commit: [SPARK-10248][CORE] track exceptions in dagscheduler event loop in tests
Repository: spark
Updated Branches: refs/heads/master ce5fd4008 -> 38d9795a4

[SPARK-10248][CORE] track exceptions in dagscheduler event loop in tests

`DAGSchedulerEventLoop` normally only logs errors (so it can continue to process more events, from other jobs). However, this is not desirable in the tests -- the tests should be able to easily detect any exception, and also shouldn't silently succeed if there is an exception.

This was suggested by mateiz on https://github.com/apache/spark/pull/7699. It may have already turned up an issue in "zero split job".

Author: Imran Rashid

Closes #8466 from squito/SPARK-10248.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/38d9795a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/38d9795a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/38d9795a

Branch: refs/heads/master
Commit: 38d9795a4fa07086d65ff705ce86648345618736
Parents: ce5fd40
Author: Imran Rashid
Authored: Wed Dec 16 19:01:05 2015 -0800
Committer: Andrew Or
Committed: Wed Dec 16 19:01:05 2015 -0800

--
 .../apache/spark/scheduler/DAGScheduler.scala  |  5 ++--
 .../spark/scheduler/DAGSchedulerSuite.scala    | 28 ++--
 2 files changed, 29 insertions(+), 4 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/38d9795a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
--
diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
index 8d0e0c8..b128ed5 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
@@ -805,7 +805,8 @@ class DAGScheduler(
   private[scheduler] def cleanUpAfterSchedulerStop() {
     for (job <- activeJobs) {
-      val error = new SparkException("Job cancelled because SparkContext was shut down")
+      val error =
+        new SparkException(s"Job ${job.jobId} cancelled because SparkContext was shut down")
       job.listener.jobFailed(error)
       // Tell the listeners that all of the running stages have ended.  Don't bother
       // cancelling the stages because if the DAG scheduler is stopped, the entire application
@@ -1295,7 +1296,7 @@ class DAGScheduler(
     case TaskResultLost =>
       // Do nothing here; the TaskScheduler handles these failures and resubmits the task.

-    case other =>
+    case _: ExecutorLostFailure | TaskKilled | UnknownReason =>
       // Unrecognized failure - also do nothing. If the task fails repeatedly, the TaskScheduler
       // will abort the job.
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/38d9795a/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
--
diff --git a/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
index 653d41f..2869f0f 100644
--- a/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
+++ b/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
@@ -45,6 +45,13 @@ class DAGSchedulerEventProcessLoopTester(dagScheduler: DAGScheduler)
       case NonFatal(e) => onError(e)
     }
   }
+
+  override def onError(e: Throwable): Unit = {
+    logError("Error in DAGSchedulerEventLoop: ", e)
+    dagScheduler.stop()
+    throw e
+  }
+
 }

 /**
@@ -300,13 +307,18 @@ class DAGSchedulerSuite

   test("zero split job") {
     var numResults = 0
+    var failureReason: Option[Exception] = None
     val fakeListener = new JobListener() {
-      override def taskSucceeded(partition: Int, value: Any) = numResults += 1
-      override def jobFailed(exception: Exception) = throw exception
+      override def taskSucceeded(partition: Int, value: Any): Unit = numResults += 1
+      override def jobFailed(exception: Exception): Unit = {
+        failureReason = Some(exception)
+      }
     }
     val jobId = submit(new MyRDD(sc, 0, Nil), Array(), listener = fakeListener)
     assert(numResults === 0)
     cancel(jobId)
+    assert(failureReason.isDefined)
+    assert(failureReason.get.getMessage() === "Job 0 cancelled ")
   }

   test("run trivial job") {
@@ -1675,6 +1687,18 @@ class DAGSchedulerSuite
     assert(stackTraceString.contains("org.scalatest.FunSuite"))
   }

+  test("catch errors in event loop") {
+    // this is a test of our testing framework -- make sure errors in event loop don't get ignored
+
+    // just run some bad event that will throw an exception -- we'll give a null
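The pattern generalizes beyond the scheduler; a minimal self-contained sketch (the class names are illustrative, not the Spark classes):

```
import scala.util.control.NonFatal

// A synchronous stand-in for an event loop: production code logs in onError
// and keeps processing, while a test subclass records and rethrows so a
// failing event can never pass silently.
abstract class MiniEventLoop[E] {
  protected def onReceive(event: E): Unit
  protected def onError(e: Throwable): Unit
  def post(event: E): Unit =
    try onReceive(event) catch { case NonFatal(e) => onError(e) }
}

class TestLoop extends MiniEventLoop[String] {
  var lastError: Option[Throwable] = None
  override protected def onReceive(event: String): Unit =
    if (event == "bad") throw new IllegalStateException("boom")
  override protected def onError(e: Throwable): Unit = {
    lastError = Some(e)
    throw e  // surface the failure to the calling test
  }
}
```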
spark git commit: [SPARK-12386][CORE] Fix NPE when spark.executor.port is set.
Repository: spark
Updated Branches: refs/heads/branch-1.6 154567dca -> 4ad08035d

[SPARK-12386][CORE] Fix NPE when spark.executor.port is set.

Author: Marcelo Vanzin

Closes #10339 from vanzin/SPARK-12386.

(cherry picked from commit d1508dd9b765489913bc948575a69ebab82f217b)
Signed-off-by: Andrew Or

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4ad08035
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4ad08035
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4ad08035

Branch: refs/heads/branch-1.6
Commit: 4ad08035d28b8f103132da9779340c5e64e2d1c2
Parents: 154567d
Author: Marcelo Vanzin
Authored: Wed Dec 16 19:47:49 2015 -0800
Committer: Andrew Or
Committed: Wed Dec 16 19:47:57 2015 -0800

--
 core/src/main/scala/org/apache/spark/SparkEnv.scala | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/4ad08035/core/src/main/scala/org/apache/spark/SparkEnv.scala
--
diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala b/core/src/main/scala/org/apache/spark/SparkEnv.scala
index 84230e3..52acde1 100644
--- a/core/src/main/scala/org/apache/spark/SparkEnv.scala
+++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala
@@ -256,7 +256,12 @@ object SparkEnv extends Logging {
       if (rpcEnv.isInstanceOf[AkkaRpcEnv]) {
         rpcEnv.asInstanceOf[AkkaRpcEnv].actorSystem
       } else {
-        val actorSystemPort = if (port == 0) 0 else rpcEnv.address.port + 1
+        val actorSystemPort =
+          if (port == 0 || rpcEnv.address == null) {
+            port
+          } else {
+            rpcEnv.address.port + 1
+          }
         // Create a ActorSystem for legacy codes
         AkkaUtils.createActorSystem(
           actorSystemName + "ActorSystem",
spark git commit: [SPARK-12386][CORE] Fix NPE when spark.executor.port is set.
Repository: spark
Updated Branches: refs/heads/master fdb382275 -> d1508dd9b

[SPARK-12386][CORE] Fix NPE when spark.executor.port is set.

Author: Marcelo Vanzin

Closes #10339 from vanzin/SPARK-12386.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d1508dd9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d1508dd9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d1508dd9

Branch: refs/heads/master
Commit: d1508dd9b765489913bc948575a69ebab82f217b
Parents: fdb3822
Author: Marcelo Vanzin
Authored: Wed Dec 16 19:47:49 2015 -0800
Committer: Andrew Or
Committed: Wed Dec 16 19:47:49 2015 -0800

--
 core/src/main/scala/org/apache/spark/SparkEnv.scala | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/d1508dd9/core/src/main/scala/org/apache/spark/SparkEnv.scala
--
diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala b/core/src/main/scala/org/apache/spark/SparkEnv.scala
index 84230e3..52acde1 100644
--- a/core/src/main/scala/org/apache/spark/SparkEnv.scala
+++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala
@@ -256,7 +256,12 @@ object SparkEnv extends Logging {
       if (rpcEnv.isInstanceOf[AkkaRpcEnv]) {
         rpcEnv.asInstanceOf[AkkaRpcEnv].actorSystem
       } else {
-        val actorSystemPort = if (port == 0) 0 else rpcEnv.address.port + 1
+        val actorSystemPort =
+          if (port == 0 || rpcEnv.address == null) {
+            port
+          } else {
+            rpcEnv.address.port + 1
+          }
         // Create a ActorSystem for legacy codes
         AkkaUtils.createActorSystem(
           actorSystemName + "ActorSystem",
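The guard reads naturally as a small total function (a sketch; the helper name is made up):

```
object PortGuard {
  // Fall back to the requested port when the RPC env exposes no bound address,
  // which is the executor-side case that previously dereferenced null.
  def legacyActorSystemPort(requestedPort: Int, boundPort: Option[Int]): Int =
    boundPort match {
      case Some(p) if requestedPort != 0 => p + 1
      case _ => requestedPort
    }

  def main(args: Array[String]): Unit = {
    println(legacyActorSystemPort(0, Some(45001)))     // 0
    println(legacyActorSystemPort(45000, None))        // 45000 (the NPE case before the fix)
    println(legacyActorSystemPort(45000, Some(45000))) // 45001
  }
}
```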
spark git commit: MAINTENANCE: Automated closing of pull requests.
Repository: spark
Updated Branches: refs/heads/master 861549acd -> ce5fd4008

MAINTENANCE: Automated closing of pull requests.

This commit exists to close the following pull requests on Github:

Closes #1217 (requested by ankurdave, srowen)
Closes #4650 (requested by andrewor14)
Closes #5307 (requested by vanzin)
Closes #5664 (requested by andrewor14)
Closes #5713 (requested by marmbrus)
Closes #5722 (requested by andrewor14)
Closes #6685 (requested by srowen)
Closes #7074 (requested by srowen)
Closes #7119 (requested by andrewor14)
Closes #7997 (requested by jkbradley)
Closes #8292 (requested by srowen)
Closes #8975 (requested by andrewor14, vanzin)
Closes #8980 (requested by andrewor14, davies)

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ce5fd400
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ce5fd400
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ce5fd400

Branch: refs/heads/master
Commit: ce5fd4008e890ef8ebc2d3cb703a666783ad6c02
Parents: 861549a
Author: Andrew Or
Authored: Wed Dec 16 17:05:57 2015 -0800
Committer: Andrew Or
Committed: Wed Dec 16 17:05:57 2015 -0800
[2/2] spark git commit: Revert "[SPARK-12105] [SQL] add convenient show functions"
Revert "[SPARK-12105] [SQL] add convenient show functions" This reverts commit 31b391019ff6eb5a483f4b3e62fd082de7ff8416. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1a3d0cd9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1a3d0cd9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1a3d0cd9 Branch: refs/heads/master Commit: 1a3d0cd9f013aee1f03b1c632c91ae0951bccbb0 Parents: 18ea11c Author: Reynold XinAuthored: Wed Dec 16 00:57:34 2015 -0800 Committer: Reynold Xin Committed: Wed Dec 16 00:57:34 2015 -0800 -- .../scala/org/apache/spark/sql/DataFrame.scala | 25 +++- 1 file changed, 9 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1a3d0cd9/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala index b69d441..497bd48 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala @@ -161,23 +161,16 @@ class DataFrame private[sql]( } /** -* Compose the string representing rows for output -*/ - def showString(): String = { -showString(20) - } - - /** * Compose the string representing rows for output - * @param numRows Number of rows to show + * @param _numRows Number of rows to show * @param truncate Whether truncate long strings and align cells right */ - def showString(numRows: Int, truncate: Boolean = true): String = { -val _numRows = numRows.max(0) + private[sql] def showString(_numRows: Int, truncate: Boolean = true): String = { +val numRows = _numRows.max(0) val sb = new StringBuilder -val takeResult = take(_numRows + 1) -val hasMoreData = takeResult.length > _numRows -val data = takeResult.take(_numRows) +val takeResult = take(numRows + 1) +val hasMoreData = takeResult.length > numRows +val data = takeResult.take(numRows) val numCols = schema.fieldNames.length // For array values, replace Seq and Array with square brackets @@ -231,10 +224,10 @@ class DataFrame private[sql]( sb.append(sep) -// For Data that has more than "_numRows" records +// For Data that has more than "numRows" records if (hasMoreData) { - val rowsString = if (_numRows == 1) "row" else "rows" - sb.append(s"only showing top $_numRows $rowsString\n") + val rowsString = if (numRows == 1) "row" else "rows" + sb.append(s"only showing top $numRows $rowsString\n") } sb.toString() - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[1/2] spark git commit: Revert "[HOTFIX] Compile error from commit 31b3910"
Repository: spark
Updated Branches: refs/heads/master 554d840a9 -> 1a3d0cd9f

Revert "[HOTFIX] Compile error from commit 31b3910"

This reverts commit 840bd2e008da5b22bfa73c587ea2c57666fffc60.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/18ea11c3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/18ea11c3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/18ea11c3

Branch: refs/heads/master
Commit: 18ea11c3a84e5eafd81fa0fe7c09224e79c4e93f
Parents: 554d840
Author: Reynold Xin
Authored: Wed Dec 16 00:57:07 2015 -0800
Committer: Reynold Xin
Committed: Wed Dec 16 00:57:07 2015 -0800

--
 sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/18ea11c3/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
index 33b03be..b69d441 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
@@ -234,7 +234,7 @@ class DataFrame private[sql](
     // For Data that has more than "_numRows" records
     if (hasMoreData) {
       val rowsString = if (_numRows == 1) "row" else "rows"
-      sb.append(s"only showing top ${_numRows} $rowsString\n")
+      sb.append(s"only showing top $_numRows $rowsString\n")
     }

     sb.toString()
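For context on the hotfix being unwound here: in a Scala `s`-interpolator, an identifier that begins with an underscore cannot follow `$` directly and has to be braced, which is what the hotfix had patched before the whole feature was reverted. A sketch of the pitfall, using `_numRows` as in the reverted code:

```
object InterpolationPitfall {
  def main(args: Array[String]): Unit = {
    val _numRows = 5
    // println(s"only showing top $_numRows rows")  // does not compile: invalid string interpolation
    println(s"only showing top ${_numRows} rows")   // prints: only showing top 5 rows
  }
}
```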