spark git commit: [SPARK-11608][MLLIB][DOC] Added migration guide for MLlib 1.6
Repository: spark Updated Branches: refs/heads/master 6a880afa8 -> 8148cc7a5 [SPARK-11608][MLLIB][DOC] Added migration guide for MLlib 1.6 No known breaking changes, but some deprecations and changes of behavior. CC: mengxr Author: Joseph K. BradleyCloses #10235 from jkbradley/mllib-guide-update-1.6. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8148cc7a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8148cc7a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8148cc7a Branch: refs/heads/master Commit: 8148cc7a5c9f52c82c2eb7652d9aeba85e72d406 Parents: 6a880af Author: Joseph K. Bradley Authored: Wed Dec 16 11:53:04 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 11:53:04 2015 -0800 -- docs/mllib-guide.md| 38 ++--- docs/mllib-migration-guides.md | 19 +++ 2 files changed, 42 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8148cc7a/docs/mllib-guide.md -- diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index 680ed48..7ef91a1 100644 --- a/docs/mllib-guide.md +++ b/docs/mllib-guide.md @@ -74,7 +74,7 @@ We list major functionality from both below, with links to detailed guides. * [Advanced topics](ml-advanced.html) Some techniques are not available yet in spark.ml, most notably dimensionality reduction -Users can seemlessly combine the implementation of these techniques found in `spark.mllib` with the rest of the algorithms found in `spark.ml`. +Users can seamlessly combine the implementation of these techniques found in `spark.mllib` with the rest of the algorithms found in `spark.ml`. # Dependencies @@ -101,24 +101,32 @@ MLlib is under active development. The APIs marked `Experimental`/`DeveloperApi` may change in future releases, and the migration guide below will explain all changes between releases. -## From 1.4 to 1.5 +## From 1.5 to 1.6 -In the `spark.mllib` package, there are no break API changes but several behavior changes: +There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are +deprecations and changes of behavior. -* [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005): - `RegressionMetrics.explainedVariance` returns the average regression sum of squares. -* [SPARK-8600](https://issues.apache.org/jira/browse/SPARK-8600): `NaiveBayesModel.labels` become - sorted. -* [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382): `GradientDescent` has a default - convergence tolerance `1e-3`, and hence iterations might end earlier than 1.4. +Deprecations: -In the `spark.ml` package, there exists one break API change and one behavior change: +* [SPARK-11358](https://issues.apache.org/jira/browse/SPARK-11358): + In `spark.mllib.clustering.KMeans`, the `runs` parameter has been deprecated. +* [SPARK-10592](https://issues.apache.org/jira/browse/SPARK-10592): + In `spark.ml.classification.LogisticRegressionModel` and + `spark.ml.regression.LinearRegressionModel`, the `weights` field has been deprecated in favor of + the new name `coefficients`. This helps disambiguate from instance (row) "weights" given to + algorithms. -* [SPARK-9268](https://issues.apache.org/jira/browse/SPARK-9268): Java's varargs support is removed - from `Params.setDefault` due to a - [Scala compiler bug](https://issues.scala-lang.org/browse/SI-9013). -* [SPARK-10097](https://issues.apache.org/jira/browse/SPARK-10097): `Evaluator.isLargerBetter` is - added to indicate metric ordering. 
Metrics like RMSE no longer flip signs as in 1.4. +Changes of behavior: + +* [SPARK-7770](https://issues.apache.org/jira/browse/SPARK-7770): + `spark.mllib.tree.GradientBoostedTrees`: `validationTol` has changed semantics in 1.6. + Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of + `GradientDescent`'s `convergenceTol`: For large errors, it uses relative error (relative to the + previous error); for small errors (`< 0.01`), it uses absolute error. +* [SPARK-11069](https://issues.apache.org/jira/browse/SPARK-11069): + `spark.ml.feature.RegexTokenizer`: Previously, it did not convert strings to lowercase before + tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the + behavior of the simpler `Tokenizer` transformer. ## Previous Spark versions http://git-wip-us.apache.org/repos/asf/spark/blob/8148cc7a/docs/mllib-migration-guides.md -- diff --git a/docs/mllib-migration-guides.md b/docs/mllib-migration-guides.md index
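For code that relied on the old `RegexTokenizer` behavior, the new lowercasing default can be switched off explicitly. A minimal Scala sketch, assuming a DataFrame with an illustrative `sentence` column:

```scala
import org.apache.spark.ml.feature.RegexTokenizer

// As of 1.6, RegexTokenizer lowercases its input by default, matching Tokenizer.
// setToLowercase(false) restores the pre-1.6 case-preserving behavior.
val tokenizer = new RegexTokenizer()
  .setInputCol("sentence")   // illustrative column name
  .setOutputCol("words")
  .setToLowercase(false)
// tokenizer.transform(df) then yields case-preserving tokens, as before 1.6.
```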
spark git commit: [SPARK-12310][SPARKR] Add write.json and write.parquet for SparkR
Repository: spark Updated Branches: refs/heads/branch-1.6 a2d584ed9 -> ac0e2ea7c [SPARK-12310][SPARKR] Add write.json and write.parquet for SparkR Add ```write.json``` and ```write.parquet``` for SparkR, and deprecated ```saveAsParquetFile```. Author: Yanbo LiangCloses #10281 from yanboliang/spark-12310. (cherry picked from commit 22f6cd86fc2e2d6f6ad2c3aae416732c46ebf1b1) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ac0e2ea7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ac0e2ea7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ac0e2ea7 Branch: refs/heads/branch-1.6 Commit: ac0e2ea7c712e91503b02ae3c12fa2fcf5079886 Parents: a2d584e Author: Yanbo Liang Authored: Wed Dec 16 10:34:30 2015 -0800 Committer: Shivaram Venkataraman Committed: Wed Dec 16 10:34:54 2015 -0800 -- R/pkg/NAMESPACE | 4 +- R/pkg/R/DataFrame.R | 51 ++-- R/pkg/R/generics.R| 16 +++- R/pkg/inst/tests/testthat/test_sparkSQL.R | 104 ++--- 4 files changed, 119 insertions(+), 56 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ac0e2ea7/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index cab39d6..ccc01fe 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -92,7 +92,9 @@ exportMethods("arrange", "with", "withColumn", "withColumnRenamed", - "write.df") + "write.df", + "write.json", + "write.parquet") exportClasses("Column") http://git-wip-us.apache.org/repos/asf/spark/blob/ac0e2ea7/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 764597d..7292433 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -596,17 +596,44 @@ setMethod("toJSON", RDD(jrdd, serializedMode = "string") }) -#' saveAsParquetFile +#' write.json +#' +#' Save the contents of a DataFrame as a JSON file (one object per line). Files written out +#' with this method can be read back in as a DataFrame using read.json(). +#' +#' @param x A SparkSQL DataFrame +#' @param path The directory where the file is saved +#' +#' @family DataFrame functions +#' @rdname write.json +#' @name write.json +#' @export +#' @examples +#'\dontrun{ +#' sc <- sparkR.init() +#' sqlContext <- sparkRSQL.init(sc) +#' path <- "path/to/file.json" +#' df <- read.json(sqlContext, path) +#' write.json(df, "/tmp/sparkr-tmp/") +#'} +setMethod("write.json", + signature(x = "DataFrame", path = "character"), + function(x, path) { +write <- callJMethod(x@sdf, "write") +invisible(callJMethod(write, "json", path)) + }) + +#' write.parquet #' #' Save the contents of a DataFrame as a Parquet file, preserving the schema. Files written out -#' with this method can be read back in as a DataFrame using parquetFile(). +#' with this method can be read back in as a DataFrame using read.parquet(). 
#' #' @param x A SparkSQL DataFrame #' @param path The directory where the file is saved #' #' @family DataFrame functions -#' @rdname saveAsParquetFile -#' @name saveAsParquetFile +#' @rdname write.parquet +#' @name write.parquet #' @export #' @examples #'\dontrun{ @@ -614,12 +641,24 @@ setMethod("toJSON", #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" #' df <- read.json(sqlContext, path) -#' saveAsParquetFile(df, "/tmp/sparkr-tmp/") +#' write.parquet(df, "/tmp/sparkr-tmp1/") +#' saveAsParquetFile(df, "/tmp/sparkr-tmp2/") #'} +setMethod("write.parquet", + signature(x = "DataFrame", path = "character"), + function(x, path) { +write <- callJMethod(x@sdf, "write") +invisible(callJMethod(write, "parquet", path)) + }) + +#' @rdname write.parquet +#' @name saveAsParquetFile +#' @export setMethod("saveAsParquetFile", signature(x = "DataFrame", path = "character"), function(x, path) { -invisible(callJMethod(x@sdf, "saveAsParquetFile", path)) +.Deprecated("write.parquet") +write.parquet(x, path) }) #' Distinct http://git-wip-us.apache.org/repos/asf/spark/blob/ac0e2ea7/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index c383e6e..62be2dd 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -519,10 +519,6 @@
spark git commit: [SPARK-12310][SPARKR] Add write.json and write.parquet for SparkR
Repository: spark Updated Branches: refs/heads/master 2eb5af5f0 -> 22f6cd86f [SPARK-12310][SPARKR] Add write.json and write.parquet for SparkR Add ```write.json``` and ```write.parquet``` for SparkR, and deprecated ```saveAsParquetFile```. Author: Yanbo LiangCloses #10281 from yanboliang/spark-12310. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/22f6cd86 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/22f6cd86 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/22f6cd86 Branch: refs/heads/master Commit: 22f6cd86fc2e2d6f6ad2c3aae416732c46ebf1b1 Parents: 2eb5af5 Author: Yanbo Liang Authored: Wed Dec 16 10:34:30 2015 -0800 Committer: Shivaram Venkataraman Committed: Wed Dec 16 10:34:30 2015 -0800 -- R/pkg/NAMESPACE | 4 +- R/pkg/R/DataFrame.R | 51 ++-- R/pkg/R/generics.R| 16 +++- R/pkg/inst/tests/testthat/test_sparkSQL.R | 104 ++--- 4 files changed, 119 insertions(+), 56 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/22f6cd86/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index cab39d6..ccc01fe 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -92,7 +92,9 @@ exportMethods("arrange", "with", "withColumn", "withColumnRenamed", - "write.df") + "write.df", + "write.json", + "write.parquet") exportClasses("Column") http://git-wip-us.apache.org/repos/asf/spark/blob/22f6cd86/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 380a13f..0cfa12b9 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -596,17 +596,44 @@ setMethod("toJSON", RDD(jrdd, serializedMode = "string") }) -#' saveAsParquetFile +#' write.json +#' +#' Save the contents of a DataFrame as a JSON file (one object per line). Files written out +#' with this method can be read back in as a DataFrame using read.json(). +#' +#' @param x A SparkSQL DataFrame +#' @param path The directory where the file is saved +#' +#' @family DataFrame functions +#' @rdname write.json +#' @name write.json +#' @export +#' @examples +#'\dontrun{ +#' sc <- sparkR.init() +#' sqlContext <- sparkRSQL.init(sc) +#' path <- "path/to/file.json" +#' df <- read.json(sqlContext, path) +#' write.json(df, "/tmp/sparkr-tmp/") +#'} +setMethod("write.json", + signature(x = "DataFrame", path = "character"), + function(x, path) { +write <- callJMethod(x@sdf, "write") +invisible(callJMethod(write, "json", path)) + }) + +#' write.parquet #' #' Save the contents of a DataFrame as a Parquet file, preserving the schema. Files written out -#' with this method can be read back in as a DataFrame using parquetFile(). +#' with this method can be read back in as a DataFrame using read.parquet(). 
#' #' @param x A SparkSQL DataFrame #' @param path The directory where the file is saved #' #' @family DataFrame functions -#' @rdname saveAsParquetFile -#' @name saveAsParquetFile +#' @rdname write.parquet +#' @name write.parquet #' @export #' @examples #'\dontrun{ @@ -614,12 +641,24 @@ setMethod("toJSON", #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" #' df <- read.json(sqlContext, path) -#' saveAsParquetFile(df, "/tmp/sparkr-tmp/") +#' write.parquet(df, "/tmp/sparkr-tmp1/") +#' saveAsParquetFile(df, "/tmp/sparkr-tmp2/") #'} +setMethod("write.parquet", + signature(x = "DataFrame", path = "character"), + function(x, path) { +write <- callJMethod(x@sdf, "write") +invisible(callJMethod(write, "parquet", path)) + }) + +#' @rdname write.parquet +#' @name saveAsParquetFile +#' @export setMethod("saveAsParquetFile", signature(x = "DataFrame", path = "character"), function(x, path) { -invisible(callJMethod(x@sdf, "saveAsParquetFile", path)) +.Deprecated("write.parquet") +write.parquet(x, path) }) #' Distinct http://git-wip-us.apache.org/repos/asf/spark/blob/22f6cd86/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index c383e6e..62be2dd 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -519,10 +519,6 @@ setGeneric("sample_frac", #' @export setGeneric("sampleBy", function(x, col, fractions, seed) { standardGeneric("sampleBy") }) -#' @rdname saveAsParquetFile
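Both R wrappers delegate to the JVM `DataFrameWriter` through `callJMethod`, as the `setMethod` bodies above show. A minimal sketch of the equivalent direct Scala calls, assuming an active `sqlContext` and using placeholder paths:

```scala
// Scala equivalents of the new SparkR wrappers (paths are placeholders):
val df = sqlContext.read.json("path/to/file.json")
df.write.json("/tmp/sparkr-tmp/")      // what write.json(df, path) invokes
df.write.parquet("/tmp/sparkr-tmp1/")  // what write.parquet(df, path) invokes
```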
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v1.6.0-rc3 [created] 168c89e07
[1/2] spark git commit: Preparing Spark release v1.6.0-rc3
Repository: spark Updated Branches: refs/heads/branch-1.6 e1adf6d7d -> aee88eb55 Preparing Spark release v1.6.0-rc3 Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/168c89e0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/168c89e0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/168c89e0 Branch: refs/heads/branch-1.6 Commit: 168c89e07c51fa24b0bb88582c739cec0acb44d7 Parents: e1adf6d Author: Patrick WendellAuthored: Wed Dec 16 11:23:41 2015 -0800 Committer: Patrick Wendell Committed: Wed Dec 16 11:23:41 2015 -0800 -- assembly/pom.xml| 2 +- bagel/pom.xml | 2 +- core/pom.xml| 2 +- docker-integration-tests/pom.xml| 2 +- examples/pom.xml| 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml | 2 +- external/kafka-assembly/pom.xml | 2 +- external/kafka/pom.xml | 2 +- external/mqtt-assembly/pom.xml | 2 +- external/mqtt/pom.xml | 2 +- external/twitter/pom.xml| 2 +- external/zeromq/pom.xml | 2 +- extras/java8-tests/pom.xml | 2 +- extras/kinesis-asl-assembly/pom.xml | 2 +- extras/kinesis-asl/pom.xml | 2 +- extras/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml | 2 +- launcher/pom.xml| 2 +- mllib/pom.xml | 2 +- network/common/pom.xml | 2 +- network/shuffle/pom.xml | 2 +- network/yarn/pom.xml| 2 +- pom.xml | 2 +- repl/pom.xml| 2 +- sql/catalyst/pom.xml| 2 +- sql/core/pom.xml| 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml| 2 +- streaming/pom.xml | 2 +- tags/pom.xml| 2 +- tools/pom.xml | 2 +- unsafe/pom.xml | 2 +- yarn/pom.xml| 2 +- 35 files changed, 35 insertions(+), 35 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/168c89e0/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 4b60ee0..fbabaa5 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/168c89e0/bagel/pom.xml -- diff --git a/bagel/pom.xml b/bagel/pom.xml index 672e946..1b3e417 100644 --- a/bagel/pom.xml +++ b/bagel/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/168c89e0/core/pom.xml -- diff --git a/core/pom.xml b/core/pom.xml index 61744bb..15b8d75 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/168c89e0/docker-integration-tests/pom.xml -- diff --git a/docker-integration-tests/pom.xml b/docker-integration-tests/pom.xml index 39d3f34..d579879 100644 --- a/docker-integration-tests/pom.xml +++ b/docker-integration-tests/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/168c89e0/examples/pom.xml -- diff --git a/examples/pom.xml b/examples/pom.xml index f5ab2a7..37b15bb 100644 --- a/examples/pom.xml +++ b/examples/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/168c89e0/external/flume-assembly/pom.xml -- diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml index dceedcf..295455a 100644 --- a/external/flume-assembly/pom.xml +++ b/external/flume-assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../../pom.xml 
http://git-wip-us.apache.org/repos/asf/spark/blob/168c89e0/external/flume-sink/pom.xml
spark git commit: [SPARK-12364][ML][SPARKR] Add ML example for SparkR
Repository: spark Updated Branches: refs/heads/branch-1.6 dffa6100d -> 04e868b63 [SPARK-12364][ML][SPARKR] Add ML example for SparkR We have a DataFrame example for SparkR; we also need to add an ML example under ```examples/src/main/r```. cc mengxr jkbradley shivaram Author: Yanbo Liang. Closes #10324 from yanboliang/spark-12364. (cherry picked from commit 1a8b2a17db7ab7a213d553079b83274aeebba86f) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/04e868b6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/04e868b6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/04e868b6 Branch: refs/heads/branch-1.6 Commit: 04e868b63bfda5afe5cb1a0d6387fb873ad393ba Parents: dffa610 Author: Yanbo Liang Authored: Wed Dec 16 12:59:22 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 12:59:33 2015 -0800 -- examples/src/main/r/ml.R | 54 +++ 1 file changed, 54 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/04e868b6/examples/src/main/r/ml.R -- diff --git a/examples/src/main/r/ml.R b/examples/src/main/r/ml.R new file mode 100644 index 000..a0c9039 --- /dev/null +++ b/examples/src/main/r/ml.R @@ -0,0 +1,54 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# To run this example use +# ./bin/sparkR examples/src/main/r/ml.R + +# Load SparkR library into your R session +library(SparkR) + +# Initialize SparkContext and SQLContext +sc <- sparkR.init(appName="SparkR-ML-example") +sqlContext <- sparkRSQL.init(sc) + +# Train GLM of family 'gaussian' +training1 <- suppressWarnings(createDataFrame(sqlContext, iris)) +test1 <- training1 +model1 <- glm(Sepal_Length ~ Sepal_Width + Species, training1, family = "gaussian") + +# Model summary +summary(model1) + +# Prediction +predictions1 <- predict(model1, test1) +head(select(predictions1, "Sepal_Length", "prediction")) + +# Train GLM of family 'binomial' +training2 <- filter(training1, training1$Species != "setosa") +test2 <- training2 +model2 <- glm(Species ~ Sepal_Length + Sepal_Width, data = training2, family = "binomial") + +# Model summary +summary(model2) + +# Prediction (Currently the output of prediction for binomial GLM is the indexed label, +# we need to transform back to the original string label later) +predictions2 <- predict(model2, test2) +head(select(predictions2, "Species", "prediction")) + +# Stop the SparkContext now +sparkR.stop()
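SparkR's `glm` is implemented on top of `spark.ml`'s `RFormula` feature transformer. A rough Scala analogue of the gaussian-family model in `ml.R`, assuming the iris data is already loaded as a DataFrame named `irisDF`:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.regression.LinearRegression

// Sketch of glm(Sepal_Length ~ Sepal_Width + Species, ..., family = "gaussian");
// `irisDF` is an assumed DataFrame with the iris columns.
val formula = new RFormula().setFormula("Sepal_Length ~ Sepal_Width + Species")
val model = new Pipeline().setStages(Array(formula, new LinearRegression())).fit(irisDF)
model.transform(irisDF).select("Sepal_Length", "prediction").show(5)
```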
spark git commit: [SPARK-11608][MLLIB][DOC] Added migration guide for MLlib 1.6
Repository: spark Updated Branches: refs/heads/branch-1.6 aee88eb55 -> dffa6100d [SPARK-11608][MLLIB][DOC] Added migration guide for MLlib 1.6 No known breaking changes, but some deprecations and changes of behavior. CC: mengxr Author: Joseph K. BradleyCloses #10235 from jkbradley/mllib-guide-update-1.6. (cherry picked from commit 8148cc7a5c9f52c82c2eb7652d9aeba85e72d406) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dffa6100 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dffa6100 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dffa6100 Branch: refs/heads/branch-1.6 Commit: dffa6100d7d96eb38bf8a56f546d66f7a884b03f Parents: aee88eb Author: Joseph K. Bradley Authored: Wed Dec 16 11:53:04 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 11:53:15 2015 -0800 -- docs/mllib-guide.md| 38 ++--- docs/mllib-migration-guides.md | 19 +++ 2 files changed, 42 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dffa6100/docs/mllib-guide.md -- diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index 680ed48..7ef91a1 100644 --- a/docs/mllib-guide.md +++ b/docs/mllib-guide.md @@ -74,7 +74,7 @@ We list major functionality from both below, with links to detailed guides. * [Advanced topics](ml-advanced.html) Some techniques are not available yet in spark.ml, most notably dimensionality reduction -Users can seemlessly combine the implementation of these techniques found in `spark.mllib` with the rest of the algorithms found in `spark.ml`. +Users can seamlessly combine the implementation of these techniques found in `spark.mllib` with the rest of the algorithms found in `spark.ml`. # Dependencies @@ -101,24 +101,32 @@ MLlib is under active development. The APIs marked `Experimental`/`DeveloperApi` may change in future releases, and the migration guide below will explain all changes between releases. -## From 1.4 to 1.5 +## From 1.5 to 1.6 -In the `spark.mllib` package, there are no break API changes but several behavior changes: +There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are +deprecations and changes of behavior. -* [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005): - `RegressionMetrics.explainedVariance` returns the average regression sum of squares. -* [SPARK-8600](https://issues.apache.org/jira/browse/SPARK-8600): `NaiveBayesModel.labels` become - sorted. -* [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382): `GradientDescent` has a default - convergence tolerance `1e-3`, and hence iterations might end earlier than 1.4. +Deprecations: -In the `spark.ml` package, there exists one break API change and one behavior change: +* [SPARK-11358](https://issues.apache.org/jira/browse/SPARK-11358): + In `spark.mllib.clustering.KMeans`, the `runs` parameter has been deprecated. +* [SPARK-10592](https://issues.apache.org/jira/browse/SPARK-10592): + In `spark.ml.classification.LogisticRegressionModel` and + `spark.ml.regression.LinearRegressionModel`, the `weights` field has been deprecated in favor of + the new name `coefficients`. This helps disambiguate from instance (row) "weights" given to + algorithms. -* [SPARK-9268](https://issues.apache.org/jira/browse/SPARK-9268): Java's varargs support is removed - from `Params.setDefault` due to a - [Scala compiler bug](https://issues.scala-lang.org/browse/SI-9013). 
-* [SPARK-10097](https://issues.apache.org/jira/browse/SPARK-10097): `Evaluator.isLargerBetter` is - added to indicate metric ordering. Metrics like RMSE no longer flip signs as in 1.4. +Changes of behavior: + +* [SPARK-7770](https://issues.apache.org/jira/browse/SPARK-7770): + `spark.mllib.tree.GradientBoostedTrees`: `validationTol` has changed semantics in 1.6. + Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of + `GradientDescent`'s `convergenceTol`: For large errors, it uses relative error (relative to the + previous error); for small errors (`< 0.01`), it uses absolute error. +* [SPARK-11069](https://issues.apache.org/jira/browse/SPARK-11069): + `spark.ml.feature.RegexTokenizer`: Previously, it did not convert strings to lowercase before + tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the + behavior of the simpler `Tokenizer` transformer. ## Previous Spark versions http://git-wip-us.apache.org/repos/asf/spark/blob/dffa6100/docs/mllib-migration-guides.md
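The `validationTol` change above matters when early stopping is used via `runWithValidation`. A minimal sketch of where the tolerance is set, assuming `train` and `validation` are existing `RDD[LabeledPoint]`s:

```scala
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

val boostingStrategy = BoostingStrategy.defaultParams("Regression")
// In 1.6 this is a relative tolerance for large errors and an absolute
// threshold once the error drops below 0.01, mirroring GradientDescent's
// convergenceTol; before 1.6 it was always an absolute change in error.
boostingStrategy.validationTol = 1e-3
val model = new GradientBoostedTrees(boostingStrategy).runWithValidation(train, validation)
```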
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v1.6.0-rc3 [deleted] 00a39d9c0
spark git commit: [SPARK-12361][PYSPARK][TESTS] Should set PYSPARK_DRIVER_PYTHON before Python tests
Repository: spark Updated Branches: refs/heads/master d252b2d54 -> 6a880afa8 [SPARK-12361][PYSPARK][TESTS] Should set PYSPARK_DRIVER_PYTHON before Python tests Although this patch still doesn't solve the issue of why the return code is 0 (see the JIRA description), it resolves the Python version mismatch. Author: Jeff Zhang. Closes #10322 from zjffdu/SPARK-12361. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6a880afa Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6a880afa Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6a880afa Branch: refs/heads/master Commit: 6a880afa831348b413ba95b98ff089377b950666 Parents: d252b2d Author: Jeff Zhang Authored: Wed Dec 16 11:29:47 2015 -0800 Committer: Josh Rosen Committed: Wed Dec 16 11:29:51 2015 -0800 -- python/run-tests.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6a880afa/python/run-tests.py -- diff --git a/python/run-tests.py b/python/run-tests.py index f5857f8..ee73eb1 100755 --- a/python/run-tests.py +++ b/python/run-tests.py @@ -56,7 +56,8 @@ LOGGER = logging.getLogger() def run_individual_python_test(test_name, pyspark_python): env = dict(os.environ) -env.update({'SPARK_TESTING': '1', 'PYSPARK_PYTHON': which(pyspark_python)}) +env.update({'SPARK_TESTING': '1', 'PYSPARK_PYTHON': which(pyspark_python), +'PYSPARK_DRIVER_PYTHON': which(pyspark_python)}) LOGGER.debug("Starting test(%s): %s", pyspark_python, test_name) start_time = time.time() try:
spark git commit: [SPARK-12320][SQL] throw exception if the number of fields does not line up for Tuple encoder
Repository: spark Updated Branches: refs/heads/master 1a8b2a17d -> a783a8ed4 [SPARK-12320][SQL] throw exception if the number of fields does not line up for Tuple encoder Author: Wenchen FanCloses #10293 from cloud-fan/err-msg. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a783a8ed Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a783a8ed Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a783a8ed Branch: refs/heads/master Commit: a783a8ed49814a09fde653433a3d6de398ddf888 Parents: 1a8b2a1 Author: Wenchen Fan Authored: Wed Dec 16 13:18:56 2015 -0800 Committer: Michael Armbrust Committed: Wed Dec 16 13:20:12 2015 -0800 -- .../apache/spark/sql/catalyst/dsl/package.scala | 3 +- .../catalyst/encoders/ExpressionEncoder.scala | 36 +++- .../expressions/complexTypeExtractors.scala | 10 ++-- .../encoders/EncoderResolutionSuite.scala | 60 +--- .../catalyst/expressions/ComplexTypeSuite.scala | 2 +- 5 files changed, 93 insertions(+), 18 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a783a8ed/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala index e509711..8102c93 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala @@ -227,9 +227,10 @@ package object dsl { AttributeReference(s, mapType, nullable = true)() /** Creates a new AttributeReference of type struct */ - def struct(fields: StructField*): AttributeReference = struct(StructType(fields)) def struct(structType: StructType): AttributeReference = AttributeReference(s, structType, nullable = true)() + def struct(attrs: AttributeReference*): AttributeReference = +struct(StructType.fromAttributes(attrs)) } implicit class DslAttribute(a: AttributeReference) { http://git-wip-us.apache.org/repos/asf/spark/blob/a783a8ed/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala index 363178b..7a4401c 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala @@ -244,9 +244,41 @@ case class ExpressionEncoder[T]( def resolve( schema: Seq[Attribute], outerScopes: ConcurrentMap[String, AnyRef]): ExpressionEncoder[T] = { -val positionToAttribute = AttributeMap.toIndex(schema) +def fail(st: StructType, maxOrdinal: Int): Unit = { + throw new AnalysisException(s"Try to map ${st.simpleString} to Tuple${maxOrdinal + 1}, " + +"but failed as the number of fields does not line up.\n" + +" - Input schema: " + StructType.fromAttributes(schema).simpleString + "\n" + +" - Target schema: " + this.schema.simpleString) +} + +var maxOrdinal = -1 +fromRowExpression.foreach { + case b: BoundReference => if (b.ordinal > maxOrdinal) maxOrdinal = b.ordinal + case _ => +} +if (maxOrdinal >= 0 && maxOrdinal != schema.length - 1) { + fail(StructType.fromAttributes(schema), maxOrdinal) +} + val unbound = fromRowExpression transform { - case b: BoundReference => positionToAttribute(b.ordinal) + case b: 
BoundReference => schema(b.ordinal) +} + +val exprToMaxOrdinal = scala.collection.mutable.HashMap.empty[Expression, Int] +unbound.foreach { + case g: GetStructField => +val maxOrdinal = exprToMaxOrdinal.getOrElse(g.child, -1) +if (maxOrdinal < g.ordinal) { + exprToMaxOrdinal.update(g.child, g.ordinal) +} + case _ => +} +exprToMaxOrdinal.foreach { + case (expr, maxOrdinal) => +val schema = expr.dataType.asInstanceOf[StructType] +if (maxOrdinal != schema.length - 1) { + fail(schema, maxOrdinal) +} } val plan = Project(Alias(unbound, "")() :: Nil, LocalRelation(schema)) http://git-wip-us.apache.org/repos/asf/spark/blob/a783a8ed/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala
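The new check surfaces at analysis time when a Dataset's target tuple arity disagrees with the input schema. A small sketch of the resulting error, with invented column names:

```scala
import sqlContext.implicits._

// Three columns, but a two-field target type (names are illustrative):
val df = sqlContext.range(3).selectExpr("id", "id + 1 as a", "id + 2 as b")
val ds = df.as[(Long, Long)]
// org.apache.spark.sql.AnalysisException: Try to map struct<id:bigint,a:bigint,b:bigint>
// to Tuple2, but failed as the number of fields does not line up. ...
```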
spark git commit: [SPARK-12215][ML][DOC] User guide section for KMeans in spark.ml
Repository: spark Updated Branches: refs/heads/master 22f6cd86f -> 26d70bd2b [SPARK-12215][ML][DOC] User guide section for KMeans in spark.ml cc jkbradley Author: Yu ISHIKAWACloses #10244 from yu-iskw/SPARK-12215. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/26d70bd2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/26d70bd2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/26d70bd2 Branch: refs/heads/master Commit: 26d70bd2b42617ff731b6e9e6d77933b38597ebe Parents: 22f6cd8 Author: Yu ISHIKAWA Authored: Wed Dec 16 10:43:45 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 10:43:45 2015 -0800 -- docs/ml-clustering.md | 71 .../spark/examples/ml/JavaKMeansExample.java| 8 ++- .../spark/examples/ml/KMeansExample.scala | 49 +++--- 3 files changed, 100 insertions(+), 28 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/26d70bd2/docs/ml-clustering.md -- diff --git a/docs/ml-clustering.md b/docs/ml-clustering.md index a59f7e3..440c455 100644 --- a/docs/ml-clustering.md +++ b/docs/ml-clustering.md @@ -11,6 +11,77 @@ In this section, we introduce the pipeline API for [clustering in mllib](mllib-c * This will become a table of contents (this text will be scraped). {:toc} +## K-means + +[k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the +most commonly used clustering algorithms that clusters the data points into a +predefined number of clusters. The MLlib implementation includes a parallelized +variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method +called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf). + +`KMeans` is implemented as an `Estimator` and generates a `KMeansModel` as the base model. + +### Input Columns + + + + + Param name + Type(s) + Default + Description + + + + + featuresCol + Vector + "features" + Feature vector + + + + +### Output Columns + + + + + Param name + Type(s) + Default + Description + + + + + predictionCol + Int + "prediction" + Predicted cluster center + + + + +### Example + + + + +Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.KMeans) for more details. + +{% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %} + + + +Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/KMeans.html) for more details. 
+ +{% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %} + + + + + ## Latent Dirichlet allocation (LDA) `LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and `OnlineLDAOptimizer`, http://git-wip-us.apache.org/repos/asf/spark/blob/26d70bd2/examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java -- diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java index 47665ff..96481d8 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java @@ -23,6 +23,9 @@ import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.api.java.function.Function; +import org.apache.spark.sql.SQLContext; +import org.apache.spark.sql.catalyst.expressions.GenericRow; +// $example on$ import org.apache.spark.ml.clustering.KMeansModel; import org.apache.spark.ml.clustering.KMeans; import org.apache.spark.mllib.linalg.Vector; @@ -30,11 +33,10 @@ import org.apache.spark.mllib.linalg.VectorUDT; import org.apache.spark.mllib.linalg.Vectors; import org.apache.spark.sql.DataFrame; import org.apache.spark.sql.Row; -import org.apache.spark.sql.SQLContext; -import org.apache.spark.sql.catalyst.expressions.GenericRow; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; +// $example off$ /** @@ -74,6 +76,7 @@ public class JavaKMeansExample { JavaSparkContext jsc = new JavaSparkContext(conf); SQLContext sqlContext = new SQLContext(jsc); +// $example on$ // Loads data JavaRDD points = jsc.textFile(inputFile).map(new ParsePoint()); StructField[] fields = {new StructField("features", new VectorUDT(), false,
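The guide's new section boils down to the following `spark.ml` usage; a condensed sketch, assuming `dataset` is a DataFrame with a `Vector` column named "features":

```scala
import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans()
  .setK(2)
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
val model = kmeans.fit(dataset)
model.clusterCenters.foreach(println)                  // learned cluster centers
model.transform(dataset).select("prediction").show(5)  // per-row cluster assignments
```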
spark git commit: [SPARK-12215][ML][DOC] User guide section for KMeans in spark.ml
Repository: spark Updated Branches: refs/heads/branch-1.6 ac0e2ea7c -> 16edd933d [SPARK-12215][ML][DOC] User guide section for KMeans in spark.ml cc jkbradley Author: Yu ISHIKAWACloses #10244 from yu-iskw/SPARK-12215. (cherry picked from commit 26d70bd2b42617ff731b6e9e6d77933b38597ebe) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/16edd933 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/16edd933 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/16edd933 Branch: refs/heads/branch-1.6 Commit: 16edd933d7323f8b6861409bbd62bc1efe244c14 Parents: ac0e2ea Author: Yu ISHIKAWA Authored: Wed Dec 16 10:43:45 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 10:43:55 2015 -0800 -- docs/ml-clustering.md | 71 .../spark/examples/ml/JavaKMeansExample.java| 8 ++- .../spark/examples/ml/KMeansExample.scala | 49 +++--- 3 files changed, 100 insertions(+), 28 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/16edd933/docs/ml-clustering.md -- diff --git a/docs/ml-clustering.md b/docs/ml-clustering.md index a59f7e3..440c455 100644 --- a/docs/ml-clustering.md +++ b/docs/ml-clustering.md @@ -11,6 +11,77 @@ In this section, we introduce the pipeline API for [clustering in mllib](mllib-c * This will become a table of contents (this text will be scraped). {:toc} +## K-means + +[k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the +most commonly used clustering algorithms that clusters the data points into a +predefined number of clusters. The MLlib implementation includes a parallelized +variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method +called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf). + +`KMeans` is implemented as an `Estimator` and generates a `KMeansModel` as the base model. + +### Input Columns + + + + + Param name + Type(s) + Default + Description + + + + + featuresCol + Vector + "features" + Feature vector + + + + +### Output Columns + + + + + Param name + Type(s) + Default + Description + + + + + predictionCol + Int + "prediction" + Predicted cluster center + + + + +### Example + + + + +Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.KMeans) for more details. + +{% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %} + + + +Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/KMeans.html) for more details. 
+ +{% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %} + + + + + ## Latent Dirichlet allocation (LDA) `LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and `OnlineLDAOptimizer`, http://git-wip-us.apache.org/repos/asf/spark/blob/16edd933/examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java -- diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java index 47665ff..96481d8 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java @@ -23,6 +23,9 @@ import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.api.java.function.Function; +import org.apache.spark.sql.SQLContext; +import org.apache.spark.sql.catalyst.expressions.GenericRow; +// $example on$ import org.apache.spark.ml.clustering.KMeansModel; import org.apache.spark.ml.clustering.KMeans; import org.apache.spark.mllib.linalg.Vector; @@ -30,11 +33,10 @@ import org.apache.spark.mllib.linalg.VectorUDT; import org.apache.spark.mllib.linalg.Vectors; import org.apache.spark.sql.DataFrame; import org.apache.spark.sql.Row; -import org.apache.spark.sql.SQLContext; -import org.apache.spark.sql.catalyst.expressions.GenericRow; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; +// $example off$ /** @@ -74,6 +76,7 @@ public class JavaKMeansExample { JavaSparkContext jsc = new JavaSparkContext(conf); SQLContext sqlContext = new SQLContext(jsc); +// $example on$ // Loads data JavaRDD points =
spark git commit: [SPARK-9694][ML] Add random seed Param to Scala CrossValidator
Repository: spark Updated Branches: refs/heads/master 7b6dc29d0 -> 860dc7f2f [SPARK-9694][ML] Add random seed Param to Scala CrossValidator Add random seed Param to Scala CrossValidator Author: Yanbo LiangCloses #9108 from yanboliang/spark-9694. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/860dc7f2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/860dc7f2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/860dc7f2 Branch: refs/heads/master Commit: 860dc7f2f8dd01f2562ba83b7af27ba29d91cb62 Parents: 7b6dc29 Author: Yanbo Liang Authored: Wed Dec 16 11:05:37 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 11:05:37 2015 -0800 -- .../org/apache/spark/ml/tuning/CrossValidator.scala | 11 --- .../main/scala/org/apache/spark/mllib/util/MLUtils.scala | 8 2 files changed, 16 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/860dc7f2/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala b/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala index 5c09f1a..40f8857 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala @@ -29,8 +29,9 @@ import org.apache.spark.ml.classification.OneVsRestParams import org.apache.spark.ml.evaluation.Evaluator import org.apache.spark.ml.feature.RFormulaModel import org.apache.spark.ml.param._ -import org.apache.spark.ml.util.DefaultParamsReader.Metadata +import org.apache.spark.ml.param.shared.HasSeed import org.apache.spark.ml.util._ +import org.apache.spark.ml.util.DefaultParamsReader.Metadata import org.apache.spark.mllib.util.MLUtils import org.apache.spark.sql.DataFrame import org.apache.spark.sql.types.StructType @@ -39,7 +40,7 @@ import org.apache.spark.sql.types.StructType /** * Params for [[CrossValidator]] and [[CrossValidatorModel]]. */ -private[ml] trait CrossValidatorParams extends ValidatorParams { +private[ml] trait CrossValidatorParams extends ValidatorParams with HasSeed { /** * Param for number of folds for cross validation. Must be >= 2. 
* Default: 3 @@ -85,6 +86,10 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String) @Since("1.2.0") def setNumFolds(value: Int): this.type = set(numFolds, value) + /** @group setParam */ + @Since("2.0.0") + def setSeed(value: Long): this.type = set(seed, value) + @Since("1.4.0") override def fit(dataset: DataFrame): CrossValidatorModel = { val schema = dataset.schema @@ -95,7 +100,7 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String) val epm = $(estimatorParamMaps) val numModels = epm.length val metrics = new Array[Double](epm.length) -val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0) +val splits = MLUtils.kFold(dataset.rdd, $(numFolds), $(seed)) splits.zipWithIndex.foreach { case ((training, validation), splitIndex) => val trainingDataset = sqlCtx.createDataFrame(training, schema).cache() val validationDataset = sqlCtx.createDataFrame(validation, schema).cache() http://git-wip-us.apache.org/repos/asf/spark/blob/860dc7f2/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala index 414ea99..4c9151f 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala @@ -265,6 +265,14 @@ object MLUtils { */ @Since("1.0.0") def kFold[T: ClassTag](rdd: RDD[T], numFolds: Int, seed: Int): Array[(RDD[T], RDD[T])] = { +kFold(rdd, numFolds, seed.toLong) + } + + /** + * Version of [[kFold()]] taking a Long seed. + */ + @Since("2.0.0") + def kFold[T: ClassTag](rdd: RDD[T], numFolds: Int, seed: Long): Array[(RDD[T], RDD[T])] = { val numFoldsF = numFolds.toFloat (1 to numFolds).map { fold => val sampler = new BernoulliCellSampler[T]((fold - 1) / numFoldsF, fold / numFoldsF,
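With the new setter, fold assignment becomes reproducible across runs. A minimal sketch, assuming an estimator `lr`, an `evaluator`, a `paramGrid`, and a `training` DataFrame are already defined:

```scala
import org.apache.spark.ml.tuning.CrossValidator

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
  .setSeed(42L)  // the same seed now yields the same splits from MLUtils.kFold
val cvModel = cv.fit(training)
```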
spark git commit: [SPARK-12324][MLLIB][DOC] Fixes the sidebar in the ML documentation
Repository: spark Updated Branches: refs/heads/branch-1.6 fb08f7b78 -> a2d584ed9 [SPARK-12324][MLLIB][DOC] Fixes the sidebar in the ML documentation This fixes the sidebar, using a pure CSS mechanism to hide it when the browser's viewport is too narrow. Credit goes to the original author Titan-C (mentioned in the NOTICE). Note that I am not a CSS expert, so I can only address comments up to some extent. Default view: https://cloud.githubusercontent.com/assets/7594753/11793597/6d1d6eda-a261-11e5-836b-6eb2054e9054.png;> When collapsed manually by the user: https://cloud.githubusercontent.com/assets/7594753/11793669/c991989e-a261-11e5-8bf6-aecf3bdb6319.png;> Disappears when column is too narrow: https://cloud.githubusercontent.com/assets/7594753/11793607/7754dbcc-a261-11e5-8b15-e0d074b0e47c.png;> Can still be opened by the user if necessary: https://cloud.githubusercontent.com/assets/7594753/11793612/7bf82968-a261-11e5-9cc3-e827a7a6b2b0.png;> Author: Timothy HunterCloses #10297 from thunterdb/12324. (cherry picked from commit a6325fc401f68d9fa30cc947c44acc9d64ebda7b) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a2d584ed Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a2d584ed Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a2d584ed Branch: refs/heads/branch-1.6 Commit: a2d584ed9ab3c073df057bed5314bdf877a47616 Parents: fb08f7b Author: Timothy Hunter Authored: Wed Dec 16 10:12:33 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 10:12:47 2015 -0800 -- NOTICE| 9 ++- docs/_layouts/global.html | 35 +++ docs/css/main.css | 137 ++--- docs/js/main.js | 2 +- 4 files changed, 149 insertions(+), 34 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a2d584ed/NOTICE -- diff --git a/NOTICE b/NOTICE index 7f7769f..571f8c2 100644 --- a/NOTICE +++ b/NOTICE @@ -606,4 +606,11 @@ Vis.js uses and redistributes the following third-party libraries: - keycharm https://github.com/AlexDM0/keycharm - The MIT License \ No newline at end of file + The MIT License + +=== + +The CSS style for the navigation sidebar of the documentation was originally +submitted by Ãscar Nájera for the scikit-learn project. The scikit-learn project +is distributed under the 3-Clause BSD license. 
+=== http://git-wip-us.apache.org/repos/asf/spark/blob/a2d584ed/docs/_layouts/global.html -- diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html index 0b5b0cd..3089474 100755 --- a/docs/_layouts/global.html +++ b/docs/_layouts/global.html @@ -1,3 +1,4 @@ + @@ -127,20 +128,32 @@ {% if page.url contains "/ml" %} - {% include nav-left-wrapper-ml.html nav-mllib=site.data.menu-mllib nav-ml=site.data.menu-ml %} -{% endif %} - +{% include nav-left-wrapper-ml.html nav-mllib=site.data.menu-mllib nav-ml=site.data.menu-ml %} + + + +{% if page.displayTitle %} +{{ page.displayTitle }} +{% else %} +{{ page.title }} +{% endif %} + +{{ content }} - - {% if page.displayTitle %} -{{ page.displayTitle }} - {% else %} -{{ page.title }} - {% endif %} + +{% else %} + +{% if page.displayTitle %} +{{ page.displayTitle }} +{% else %} +{{ page.title }} +{% endif %} - {{ content }} +{{ content }} - + +{% endif %} + http://git-wip-us.apache.org/repos/asf/spark/blob/a2d584ed/docs/css/main.css -- diff --git a/docs/css/main.css b/docs/css/main.css index 356b324..175e800 100755 --- a/docs/css/main.css +++ b/docs/css/main.css @@ -40,17 +40,14 @@ } body .container-wrapper { - position: absolute; - width: 100%; - display: flex; -} - -body #content { + background-color: #FFF; + color: #1D1F22; + max-width: 1024px; + margin-top:
spark git commit: [SPARK-12324][MLLIB][DOC] Fixes the sidebar in the ML documentation
Repository: spark Updated Branches: refs/heads/master 1a3d0cd9f -> a6325fc40 [SPARK-12324][MLLIB][DOC] Fixes the sidebar in the ML documentation This fixes the sidebar, using a pure CSS mechanism to hide it when the browser's viewport is too narrow. Credit goes to the original author Titan-C (mentioned in the NOTICE). Note that I am not a CSS expert, so I can only address comments up to some extent. Default view: https://cloud.githubusercontent.com/assets/7594753/11793597/6d1d6eda-a261-11e5-836b-6eb2054e9054.png;> When collapsed manually by the user: https://cloud.githubusercontent.com/assets/7594753/11793669/c991989e-a261-11e5-8bf6-aecf3bdb6319.png;> Disappears when column is too narrow: https://cloud.githubusercontent.com/assets/7594753/11793607/7754dbcc-a261-11e5-8b15-e0d074b0e47c.png;> Can still be opened by the user if necessary: https://cloud.githubusercontent.com/assets/7594753/11793612/7bf82968-a261-11e5-9cc3-e827a7a6b2b0.png;> Author: Timothy HunterCloses #10297 from thunterdb/12324. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a6325fc4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a6325fc4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a6325fc4 Branch: refs/heads/master Commit: a6325fc401f68d9fa30cc947c44acc9d64ebda7b Parents: 1a3d0cd Author: Timothy Hunter Authored: Wed Dec 16 10:12:33 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 10:12:33 2015 -0800 -- NOTICE| 9 ++- docs/_layouts/global.html | 35 +++ docs/css/main.css | 137 ++--- docs/js/main.js | 2 +- 4 files changed, 149 insertions(+), 34 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a6325fc4/NOTICE -- diff --git a/NOTICE b/NOTICE index 7f7769f..571f8c2 100644 --- a/NOTICE +++ b/NOTICE @@ -606,4 +606,11 @@ Vis.js uses and redistributes the following third-party libraries: - keycharm https://github.com/AlexDM0/keycharm - The MIT License \ No newline at end of file + The MIT License + +=== + +The CSS style for the navigation sidebar of the documentation was originally +submitted by Ãscar Nájera for the scikit-learn project. The scikit-learn project +is distributed under the 3-Clause BSD license. 
+=== http://git-wip-us.apache.org/repos/asf/spark/blob/a6325fc4/docs/_layouts/global.html -- diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html index 0b5b0cd..3089474 100755 --- a/docs/_layouts/global.html +++ b/docs/_layouts/global.html @@ -1,3 +1,4 @@ + @@ -127,20 +128,32 @@ {% if page.url contains "/ml" %} - {% include nav-left-wrapper-ml.html nav-mllib=site.data.menu-mllib nav-ml=site.data.menu-ml %} -{% endif %} - +{% include nav-left-wrapper-ml.html nav-mllib=site.data.menu-mllib nav-ml=site.data.menu-ml %} + + + +{% if page.displayTitle %} +{{ page.displayTitle }} +{% else %} +{{ page.title }} +{% endif %} + +{{ content }} - - {% if page.displayTitle %} -{{ page.displayTitle }} - {% else %} -{{ page.title }} - {% endif %} + +{% else %} + +{% if page.displayTitle %} +{{ page.displayTitle }} +{% else %} +{{ page.title }} +{% endif %} - {{ content }} +{{ content }} - + +{% endif %} + http://git-wip-us.apache.org/repos/asf/spark/blob/a6325fc4/docs/css/main.css -- diff --git a/docs/css/main.css b/docs/css/main.css index 356b324..175e800 100755 --- a/docs/css/main.css +++ b/docs/css/main.css @@ -40,17 +40,14 @@ } body .container-wrapper { - position: absolute; - width: 100%; - display: flex; -} - -body #content { + background-color: #FFF; + color: #1D1F22; + max-width: 1024px; + margin-top: 10px; + margin-left: auto; + margin-right: auto; + border-radius: 15px; position: relative; - - line-height: 1.6; /* Inspired by
spark git commit: [SPARK-12318][SPARKR] Save mode in SparkR should be error by default
Repository: spark Updated Branches: refs/heads/master 54c512ba9 -> 2eb5af5f0 [SPARK-12318][SPARKR] Save mode in SparkR should be error by default shivaram Please help review. Author: Jeff ZhangCloses #10290 from zjffdu/SPARK-12318. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2eb5af5f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2eb5af5f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2eb5af5f Branch: refs/heads/master Commit: 2eb5af5f0d3c424dc617bb1a18dd0210ea9ba0bc Parents: 54c512b Author: Jeff Zhang Authored: Wed Dec 16 10:32:32 2015 -0800 Committer: Shivaram Venkataraman Committed: Wed Dec 16 10:32:32 2015 -0800 -- R/pkg/R/DataFrame.R | 10 +- docs/sparkr.md | 9 - 2 files changed, 13 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2eb5af5f/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 764597d..380a13f 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -1886,7 +1886,7 @@ setMethod("except", #' @param df A SparkSQL DataFrame #' @param path A name for the table #' @param source A name for external data source -#' @param mode One of 'append', 'overwrite', 'error', 'ignore' save mode +#' @param mode One of 'append', 'overwrite', 'error', 'ignore' save mode (it is 'error' by default) #' #' @family DataFrame functions #' @rdname write.df @@ -1903,7 +1903,7 @@ setMethod("except", #' } setMethod("write.df", signature(df = "DataFrame", path = "character"), - function(df, path, source = NULL, mode = "append", ...){ + function(df, path, source = NULL, mode = "error", ...){ if (is.null(source)) { sqlContext <- get(".sparkRSQLsc", envir = .sparkREnv) source <- callJMethod(sqlContext, "getConf", "spark.sql.sources.default", @@ -1928,7 +1928,7 @@ setMethod("write.df", #' @export setMethod("saveDF", signature(df = "DataFrame", path = "character"), - function(df, path, source = NULL, mode = "append", ...){ + function(df, path, source = NULL, mode = "error", ...){ write.df(df, path, source, mode, ...) }) @@ -1951,7 +1951,7 @@ setMethod("saveDF", #' @param df A SparkSQL DataFrame #' @param tableName A name for the table #' @param source A name for external data source -#' @param mode One of 'append', 'overwrite', 'error', 'ignore' save mode +#' @param mode One of 'append', 'overwrite', 'error', 'ignore' save mode (it is 'error' by default) #' #' @family DataFrame functions #' @rdname saveAsTable @@ -1968,7 +1968,7 @@ setMethod("saveDF", setMethod("saveAsTable", signature(df = "DataFrame", tableName = "character", source = "character", mode = "character"), - function(df, tableName, source = NULL, mode="append", ...){ + function(df, tableName, source = NULL, mode="error", ...){ if (is.null(source)) { sqlContext <- get(".sparkRSQLsc", envir = .sparkREnv) source <- callJMethod(sqlContext, "getConf", "spark.sql.sources.default", http://git-wip-us.apache.org/repos/asf/spark/blob/2eb5af5f/docs/sparkr.md -- diff --git a/docs/sparkr.md b/docs/sparkr.md index 0114878..9ddd2ed 100644 --- a/docs/sparkr.md +++ b/docs/sparkr.md @@ -148,7 +148,7 @@ printSchema(people) The data sources API can also be used to save out DataFrames into multiple file formats. For example we can save the DataFrame from the previous example -to a Parquet file using `write.df` +to a Parquet file using `write.df` (Until Spark 1.6, the default mode for writes was `append`. 
It was changed in Spark 1.7 to `error` to match the Scala API) {% highlight r %} @@ -387,3 +387,10 @@ The following functions are masked by the SparkR package: Since part of SparkR is modeled on the `dplyr` package, certain functions in SparkR share the same names with those in `dplyr`. Depending on the load order of the two packages, some functions from the package loaded first are masked by those in the package loaded after. In such case, prefix such calls with the package name, for instance, `SparkR::cume_dist(x)` or `dplyr::cume_dist(x)`. You can inspect the search path in R with [`search()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/search.html) + + +# Migration Guide + +## Upgrading From SparkR 1.6 to 1.7 + + - Until Spark 1.6, the default mode for writes was `append`. It was changed in Spark 1.7 to `error` to match the Scala API.
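The change only affects callers who relied on the implicit default; passing the mode explicitly is robust across versions in every language binding. The Scala form, with a placeholder path:

```scala
import org.apache.spark.sql.SaveMode

// Explicit save modes are unaffected by the default change:
df.write.mode(SaveMode.Append).parquet("/tmp/people.parquet")
// or the string form that SparkR's `mode` argument maps onto:
df.write.mode("append").parquet("/tmp/people.parquet")
```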
spark git commit: [SPARK-12318][SPARKR] Save mode in SparkR should be error by default
Repository: spark Updated Branches: refs/heads/branch-1.6 16edd933d -> f81512729 [SPARK-12318][SPARKR] Save mode in SparkR should be error by default shivaram Please help review. Author: Jeff ZhangCloses #10290 from zjffdu/SPARK-12318. (cherry picked from commit 2eb5af5f0d3c424dc617bb1a18dd0210ea9ba0bc) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f8151272 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f8151272 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f8151272 Branch: refs/heads/branch-1.6 Commit: f815127294c06320204d9affa4f35da7ec3a710d Parents: 16edd93 Author: Jeff Zhang Authored: Wed Dec 16 10:32:32 2015 -0800 Committer: Shivaram Venkataraman Committed: Wed Dec 16 10:48:54 2015 -0800 -- R/pkg/R/DataFrame.R | 10 +- docs/sparkr.md | 9 - 2 files changed, 13 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f8151272/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 7292433..0cfa12b9 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -1925,7 +1925,7 @@ setMethod("except", #' @param df A SparkSQL DataFrame #' @param path A name for the table #' @param source A name for external data source -#' @param mode One of 'append', 'overwrite', 'error', 'ignore' save mode +#' @param mode One of 'append', 'overwrite', 'error', 'ignore' save mode (it is 'error' by default) #' #' @family DataFrame functions #' @rdname write.df @@ -1942,7 +1942,7 @@ setMethod("except", #' } setMethod("write.df", signature(df = "DataFrame", path = "character"), - function(df, path, source = NULL, mode = "append", ...){ + function(df, path, source = NULL, mode = "error", ...){ if (is.null(source)) { sqlContext <- get(".sparkRSQLsc", envir = .sparkREnv) source <- callJMethod(sqlContext, "getConf", "spark.sql.sources.default", @@ -1967,7 +1967,7 @@ setMethod("write.df", #' @export setMethod("saveDF", signature(df = "DataFrame", path = "character"), - function(df, path, source = NULL, mode = "append", ...){ + function(df, path, source = NULL, mode = "error", ...){ write.df(df, path, source, mode, ...) }) @@ -1990,7 +1990,7 @@ setMethod("saveDF", #' @param df A SparkSQL DataFrame #' @param tableName A name for the table #' @param source A name for external data source -#' @param mode One of 'append', 'overwrite', 'error', 'ignore' save mode +#' @param mode One of 'append', 'overwrite', 'error', 'ignore' save mode (it is 'error' by default) #' #' @family DataFrame functions #' @rdname saveAsTable @@ -2007,7 +2007,7 @@ setMethod("saveDF", setMethod("saveAsTable", signature(df = "DataFrame", tableName = "character", source = "character", mode = "character"), - function(df, tableName, source = NULL, mode="append", ...){ + function(df, tableName, source = NULL, mode="error", ...){ if (is.null(source)) { sqlContext <- get(".sparkRSQLsc", envir = .sparkREnv) source <- callJMethod(sqlContext, "getConf", "spark.sql.sources.default", http://git-wip-us.apache.org/repos/asf/spark/blob/f8151272/docs/sparkr.md -- diff --git a/docs/sparkr.md b/docs/sparkr.md index 0114878..9ddd2ed 100644 --- a/docs/sparkr.md +++ b/docs/sparkr.md @@ -148,7 +148,7 @@ printSchema(people) The data sources API can also be used to save out DataFrames into multiple file formats. 
For example we can save the DataFrame from the previous example -to a Parquet file using `write.df` +to a Parquet file using `write.df` (Until Spark 1.6, the default mode for writes was `append`. It was changed in Spark 1.7 to `error` to match the Scala API) {% highlight r %} @@ -387,3 +387,10 @@ The following functions are masked by the SparkR package: Since part of SparkR is modeled on the `dplyr` package, certain functions in SparkR share the same names with those in `dplyr`. Depending on the load order of the two packages, some functions from the package loaded first are masked by those in the package loaded after. In such cases, prefix such calls with the package name, for instance, `SparkR::cume_dist(x)` or `dplyr::cume_dist(x)`. You can inspect the search path in R with [`search()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/search.html) + + +# Migration Guide + +## Upgrading From SparkR 1.6 to 1.7 + + - Until Spark 1.6, the default mode for writes was `append`. It was changed in Spark 1.7 to `error` to match the Scala API.
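Since the note above points to the Scala API as the reference behavior, a minimal Scala sketch of that behavior may help (the `SaveModeSketch` object, `df`, and the output path are illustrative, not from the commit):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object SaveModeSketch {
  def save(df: DataFrame): Unit = {
    // The Scala API's default mode is ErrorIfExists, which SparkR's
    // write.df now matches with its "error" default: this call fails
    // if people.parquet already exists.
    df.write.format("parquet").save("people.parquet")

    // Passing the mode explicitly keeps scripts portable across Spark
    // versions in which the default differed.
    df.write.mode(SaveMode.Append).format("parquet").save("people.parquet")
  }
}
```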
spark git commit: [SPARK-12345][MESOS] Filter SPARK_HOME when submitting Spark jobs with Mesos cluster mode.
Repository: spark Updated Branches: refs/heads/master 26d70bd2b -> ad8c1f0b8 [SPARK-12345][MESOS] Filter SPARK_HOME when submitting Spark jobs with Mesos cluster mode. SPARK_HOME is now causing problems with Mesos cluster mode, since the spark-submit script was recently changed to give precedence to SPARK_HOME, when defined, while running spark-class scripts. We should skip passing SPARK_HOME from the Spark client in cluster mode with Mesos, since Mesos shouldn't use this configuration but should use spark.executor.home instead. Author: Timothy ChenCloses #10332 from tnachen/scheduler_ui. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ad8c1f0b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ad8c1f0b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ad8c1f0b Branch: refs/heads/master Commit: ad8c1f0b840284d05da737fb2cc5ebf8848f4490 Parents: 26d70bd Author: Timothy Chen Authored: Wed Dec 16 10:54:15 2015 -0800 Committer: Andrew Or Committed: Wed Dec 16 10:54:15 2015 -0800 -- .../org/apache/spark/deploy/rest/mesos/MesosRestServer.scala | 7 ++- .../spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala | 2 +- 2 files changed, 7 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ad8c1f0b/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala b/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala index 868cc35..24510db 100644 --- a/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala +++ b/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala @@ -94,7 +94,12 @@ private[mesos] class MesosSubmitRequestServlet( val driverMemory = sparkProperties.get("spark.driver.memory") val driverCores = sparkProperties.get("spark.driver.cores") val appArgs = request.appArgs -val environmentVariables = request.environmentVariables +// We don't want to pass down SPARK_HOME when launching Spark apps +// with Mesos cluster mode since it's populated by default on the client and it will +// cause the spark-submit script to look for files in SPARK_HOME instead. +// We only need the ability to specify where to find the spark-submit script, +// which users can set via the spark.executor.home or spark.home configurations. +val environmentVariables = request.environmentVariables.filter(!_.equals("SPARK_HOME")) val name = request.sparkProperties.get("spark.app.name").getOrElse(mainClass) // Construct driver description http://git-wip-us.apache.org/repos/asf/spark/blob/ad8c1f0b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala index 721861f..573355b 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala @@ -34,7 +34,7 @@ import org.apache.spark.util.Utils /** * Shared trait for implementing a Mesos Scheduler. This holds common state and helper - * methods and Mesos scheduler will use. + * methods the Mesos scheduler will use. 
*/ private[mesos] trait MesosSchedulerUtils extends Logging { // Lock used to wait for scheduler to be registered
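The subtle point in the hunk above is what the filter predicate receives. Below is a self-contained Scala sketch of the collections semantics involved (the map contents are made up, and this is not the MesosRestServer code; whether the patch's `.filter` call sees keys or whole entries depends on the runtime type of `environmentVariables`):

```scala
object EnvFilterSketch {
  def main(args: Array[String]): Unit = {
    val env = Map("SPARK_HOME" -> "/opt/spark", "PATH" -> "/usr/bin")

    // On a Map, filter's predicate receives a (key, value) tuple, so
    // comparing it to a String is never true and nothing is dropped.
    val viaEntries = env.filter(!_.equals("SPARK_HOME"))

    // filterKeys applies the predicate to the key alone, which is the
    // intent when stripping a single environment variable.
    val viaKeys = env.filterKeys(!_.equals("SPARK_HOME"))

    println(viaEntries.keys) // Set(SPARK_HOME, PATH)
    println(viaKeys.keys)    // Set(PATH)
  }
}
```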
spark git commit: [SPARK-6518][MLLIB][EXAMPLE][DOC] Add example code and user guide for bisecting k-means
Repository: spark Updated Branches: refs/heads/branch-1.6 e5b85713d -> e1adf6d7d [SPARK-6518][MLLIB][EXAMPLE][DOC] Add example code and user guide for bisecting k-means This PR includes only example code in order to finish it quickly. I'll send another PR for the docs soon. Author: Yu ISHIKAWACloses #9952 from yu-iskw/SPARK-6518. (cherry picked from commit 7b6dc29d0ebbfb3bb941130f8542120b6bc3e234) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e1adf6d7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e1adf6d7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e1adf6d7 Branch: refs/heads/branch-1.6 Commit: e1adf6d7d1c755fb16a0030e66ce9cff348c3de8 Parents: e5b8571 Author: Yu ISHIKAWA Authored: Wed Dec 16 10:55:42 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 10:55:54 2015 -0800 -- docs/mllib-clustering.md| 35 ++ docs/mllib-guide.md | 1 + .../mllib/JavaBisectingKMeansExample.java | 69 .../examples/mllib/BisectingKMeansExample.scala | 60 + 4 files changed, 165 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e1adf6d7/docs/mllib-clustering.md -- diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md index 48d64cd..93cd0c1 100644 --- a/docs/mllib-clustering.md +++ b/docs/mllib-clustering.md @@ -718,6 +718,41 @@ sameModel = LDAModel.load(sc, "myModelPath") +## Bisecting k-means + +Bisecting K-means can often be much faster than regular K-means, but it will generally produce a different clustering. + +Bisecting k-means is a kind of [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering). +Hierarchical clustering is one of the most commonly used methods of cluster analysis, which seeks to build a hierarchy of clusters. +Strategies for hierarchical clustering generally fall into two types: + +- Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. +- Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. + +The bisecting k-means algorithm is a kind of divisive algorithm. +The implementation in MLlib has the following parameters: + +* *k*: the desired number of leaf clusters (default: 4). The actual number could be smaller if there are no divisible leaf clusters. +* *maxIterations*: the max number of k-means iterations to split clusters (default: 20) +* *minDivisibleClusterSize*: the minimum number of points (if >= 1.0) or the minimum proportion of points (if < 1.0) of a divisible cluster (default: 1) +* *seed*: a random seed (default: hash value of the class name) + +**Examples** + + + +Refer to the [`BisectingKMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeans) and [`BisectingKMeansModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeansModel) for details on the API. + +{% include_example scala/org/apache/spark/examples/mllib/BisectingKMeansExample.scala %} + + + +Refer to the [`BisectingKMeans` Java docs](api/java/org/apache/spark/mllib/clustering/BisectingKMeans.html) and [`BisectingKMeansModel` Java docs](api/java/org/apache/spark/mllib/clustering/BisectingKMeansModel.html) for details on the API. 
+ +{% include_example java/org/apache/spark/examples/mllib/JavaBisectingKMeansExample.java %} + + + ## Streaming k-means When data arrive in a stream, we may want to estimate clusters dynamically, http://git-wip-us.apache.org/repos/asf/spark/blob/e1adf6d7/docs/mllib-guide.md -- diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index 7fef6b5..680ed48 100644 --- a/docs/mllib-guide.md +++ b/docs/mllib-guide.md @@ -49,6 +49,7 @@ We list major functionality from both below, with links to detailed guides. * [Gaussian mixture](mllib-clustering.html#gaussian-mixture) * [power iteration clustering (PIC)](mllib-clustering.html#power-iteration-clustering-pic) * [latent Dirichlet allocation (LDA)](mllib-clustering.html#latent-dirichlet-allocation-lda) + * [bisecting k-means](mllib-clustering.html#bisecting-kmeans) * [streaming k-means](mllib-clustering.html#streaming-k-means) * [Dimensionality reduction](mllib-dimensionality-reduction.html) * [singular value decomposition (SVD)](mllib-dimensionality-reduction.html#singular-value-decomposition-svd)
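A compact Scala sketch of the four parameters listed in the guide above (toy data and arbitrary values; the commit's real examples live in BisectingKMeansExample.scala and JavaBisectingKMeansExample.java):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.Vectors

object BisectingKMeansSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BisectingKMeansSketch"))

    // Toy 2-D points; real data would come from a file or table.
    val data = sc.parallelize(Seq(
      Vectors.dense(0.1, 0.1), Vectors.dense(0.3, 0.3),
      Vectors.dense(10.1, 10.1), Vectors.dense(10.3, 10.3),
      Vectors.dense(20.1, 20.1), Vectors.dense(20.3, 20.3)))

    val model = new BisectingKMeans()
      .setK(3)                         // desired number of leaf clusters
      .setMaxIterations(20)            // k-means iterations per split
      .setMinDivisibleClusterSize(1.0) // points (>= 1.0) or fraction (< 1.0)
      .setSeed(42L)
      .run(data)

    println(s"Compute cost: ${model.computeCost(data)}")
    model.clusterCenters.zipWithIndex.foreach { case (center, idx) =>
      println(s"Cluster center $idx: $center")
    }
    sc.stop()
  }
}
```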
spark git commit: [SPARK-6518][MLLIB][EXAMPLE][DOC] Add example code and user guide for bisecting k-means
Repository: spark Updated Branches: refs/heads/master ad8c1f0b8 -> 7b6dc29d0 [SPARK-6518][MLLIB][EXAMPLE][DOC] Add example code and user guide for bisecting k-means This PR includes only example code in order to finish it quickly. I'll send another PR for the docs soon. Author: Yu ISHIKAWACloses #9952 from yu-iskw/SPARK-6518. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7b6dc29d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7b6dc29d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7b6dc29d Branch: refs/heads/master Commit: 7b6dc29d0ebbfb3bb941130f8542120b6bc3e234 Parents: ad8c1f0 Author: Yu ISHIKAWA Authored: Wed Dec 16 10:55:42 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 10:55:42 2015 -0800 -- docs/mllib-clustering.md| 35 ++ docs/mllib-guide.md | 1 + .../mllib/JavaBisectingKMeansExample.java | 69 .../examples/mllib/BisectingKMeansExample.scala | 60 + 4 files changed, 165 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7b6dc29d/docs/mllib-clustering.md -- diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md index 48d64cd..93cd0c1 100644 --- a/docs/mllib-clustering.md +++ b/docs/mllib-clustering.md @@ -718,6 +718,41 @@ sameModel = LDAModel.load(sc, "myModelPath") +## Bisecting k-means + +Bisecting K-means can often be much faster than regular K-means, but it will generally produce a different clustering. + +Bisecting k-means is a kind of [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering). +Hierarchical clustering is one of the most commonly used methods of cluster analysis, which seeks to build a hierarchy of clusters. +Strategies for hierarchical clustering generally fall into two types: + +- Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. +- Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. + +The bisecting k-means algorithm is a kind of divisive algorithm. +The implementation in MLlib has the following parameters: + +* *k*: the desired number of leaf clusters (default: 4). The actual number could be smaller if there are no divisible leaf clusters. +* *maxIterations*: the max number of k-means iterations to split clusters (default: 20) +* *minDivisibleClusterSize*: the minimum number of points (if >= 1.0) or the minimum proportion of points (if < 1.0) of a divisible cluster (default: 1) +* *seed*: a random seed (default: hash value of the class name) + +**Examples** + + + +Refer to the [`BisectingKMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeans) and [`BisectingKMeansModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeansModel) for details on the API. + +{% include_example scala/org/apache/spark/examples/mllib/BisectingKMeansExample.scala %} + + + +Refer to the [`BisectingKMeans` Java docs](api/java/org/apache/spark/mllib/clustering/BisectingKMeans.html) and [`BisectingKMeansModel` Java docs](api/java/org/apache/spark/mllib/clustering/BisectingKMeansModel.html) for details on the API. 
+ +{% include_example java/org/apache/spark/examples/mllib/JavaBisectingKMeansExample.java %} + + + ## Streaming k-means When data arrive in a stream, we may want to estimate clusters dynamically, http://git-wip-us.apache.org/repos/asf/spark/blob/7b6dc29d/docs/mllib-guide.md -- diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index 7fef6b5..680ed48 100644 --- a/docs/mllib-guide.md +++ b/docs/mllib-guide.md @@ -49,6 +49,7 @@ We list major functionality from both below, with links to detailed guides. * [Gaussian mixture](mllib-clustering.html#gaussian-mixture) * [power iteration clustering (PIC)](mllib-clustering.html#power-iteration-clustering-pic) * [latent Dirichlet allocation (LDA)](mllib-clustering.html#latent-dirichlet-allocation-lda) + * [bisecting k-means](mllib-clustering.html#bisecting-kmeans) * [streaming k-means](mllib-clustering.html#streaming-k-means) * [Dimensionality reduction](mllib-dimensionality-reduction.html) * [singular value decomposition (SVD)](mllib-dimensionality-reduction.html#singular-value-decomposition-svd)
spark git commit: [SPARK-12345][MESOS] Filter SPARK_HOME when submitting Spark jobs with Mesos cluster mode.
Repository: spark Updated Branches: refs/heads/branch-1.6 f81512729 -> e5b85713d [SPARK-12345][MESOS] Filter SPARK_HOME when submitting Spark jobs with Mesos cluster mode. SPARK_HOME is now causing problems with Mesos cluster mode, since the spark-submit script was recently changed to give precedence to SPARK_HOME, when defined, while running spark-class scripts. We should skip passing SPARK_HOME from the Spark client in cluster mode with Mesos, since Mesos shouldn't use this configuration but should use spark.executor.home instead. Author: Timothy ChenCloses #10332 from tnachen/scheduler_ui. (cherry picked from commit ad8c1f0b840284d05da737fb2cc5ebf8848f4490) Signed-off-by: Andrew Or Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e5b85713 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e5b85713 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e5b85713 Branch: refs/heads/branch-1.6 Commit: e5b85713d8a0dbbb1a0a07481f5afa6c5098147f Parents: f815127 Author: Timothy Chen Authored: Wed Dec 16 10:54:15 2015 -0800 Committer: Andrew Or Committed: Wed Dec 16 10:55:25 2015 -0800 -- .../org/apache/spark/deploy/rest/mesos/MesosRestServer.scala | 7 ++- .../spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala | 2 +- 2 files changed, 7 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e5b85713/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala b/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala index 868cc35..7c01ae4 100644 --- a/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala +++ b/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala @@ -94,7 +94,12 @@ private[mesos] class MesosSubmitRequestServlet( val driverMemory = sparkProperties.get("spark.driver.memory") val driverCores = sparkProperties.get("spark.driver.cores") val appArgs = request.appArgs -val environmentVariables = request.environmentVariables +// We don't want to pass down SPARK_HOME when launching Spark apps with Mesos cluster mode +// since it's populated by default on the client and it will cause the spark-submit script to +// look for files in SPARK_HOME instead. We only need the ability to specify where to find +// the spark-submit script, which users can set via the spark.executor.home or spark.home +// configurations (SPARK-12345). +val environmentVariables = request.environmentVariables.filter(!_.equals("SPARK_HOME")) val name = request.sparkProperties.get("spark.app.name").getOrElse(mainClass) // Construct driver description http://git-wip-us.apache.org/repos/asf/spark/blob/e5b85713/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala index 721861f..573355b 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala @@ -34,7 +34,7 @@ import org.apache.spark.util.Utils /** * Shared trait for implementing a Mesos Scheduler. This holds common state and helper - * methods and Mesos scheduler will use. + * methods the Mesos scheduler will use. 
*/ private[mesos] trait MesosSchedulerUtils extends Logging { // Lock used to wait for scheduler to be registered
spark git commit: [SPARK-12309][ML] Use sqlContext from MLlibTestSparkContext for spark.ml test suites
Repository: spark Updated Branches: refs/heads/master 860dc7f2f -> d252b2d54 [SPARK-12309][ML] Use sqlContext from MLlibTestSparkContext for spark.ml test suites Use ```sqlContext``` from ```MLlibTestSparkContext``` rather than creating a new one for spark.ml test suites. I have checked thoroughly and found four test cases that need to be updated. cc mengxr jkbradley Author: Yanbo LiangCloses #10279 from yanboliang/spark-12309. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d252b2d5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d252b2d5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d252b2d5 Branch: refs/heads/master Commit: d252b2d544a75f6c5523be3492494955050acf50 Parents: 860dc7f Author: Yanbo Liang Authored: Wed Dec 16 11:07:54 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 11:07:54 2015 -0800 -- .../scala/org/apache/spark/ml/feature/MinMaxScalerSuite.scala| 4 +--- .../test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala | 3 +-- .../scala/org/apache/spark/ml/feature/VectorSlicerSuite.scala| 4 +--- mllib/src/test/scala/org/apache/spark/ml/impl/TreeTests.scala| 2 +- .../scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala | 3 +-- 5 files changed, 5 insertions(+), 11 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d252b2d5/mllib/src/test/scala/org/apache/spark/ml/feature/MinMaxScalerSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/MinMaxScalerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/MinMaxScalerSuite.scala index 09183fe..035bfc0 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/MinMaxScalerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/MinMaxScalerSuite.scala @@ -21,13 +21,11 @@ import org.apache.spark.SparkFunSuite import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils} import org.apache.spark.mllib.linalg.{Vector, Vectors} import org.apache.spark.mllib.util.MLlibTestSparkContext -import org.apache.spark.sql.{Row, SQLContext} +import org.apache.spark.sql.Row class MinMaxScalerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { test("MinMaxScaler fit basic case") { -val sqlContext = new SQLContext(sc) - val data = Array( Vectors.dense(1, 0, Long.MinValue), Vectors.dense(2, 0, 0), http://git-wip-us.apache.org/repos/asf/spark/blob/d252b2d5/mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala index de3d438..4688339 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala @@ -22,7 +22,7 @@ import org.apache.spark.ml.util.DefaultReadWriteTest import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors} import org.apache.spark.mllib.util.MLlibTestSparkContext import org.apache.spark.mllib.util.TestingUtils._ -import org.apache.spark.sql.{DataFrame, Row, SQLContext} +import org.apache.spark.sql.{DataFrame, Row} class NormalizerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { @@ -61,7 +61,6 @@ class NormalizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa Vectors.sparse(3, Seq()) ) -val sqlContext = new SQLContext(sc) dataFrame = sqlContext.createDataFrame(sc.parallelize(data, 
2).map(NormalizerSuite.FeatureData)) normalizer = new Normalizer() .setInputCol("features") http://git-wip-us.apache.org/repos/asf/spark/blob/d252b2d5/mllib/src/test/scala/org/apache/spark/ml/feature/VectorSlicerSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorSlicerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorSlicerSuite.scala index 74706a2..8acc336 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorSlicerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorSlicerSuite.scala @@ -24,7 +24,7 @@ import org.apache.spark.ml.util.DefaultReadWriteTest import org.apache.spark.mllib.linalg.{Vector, Vectors} import org.apache.spark.mllib.util.MLlibTestSparkContext import org.apache.spark.sql.types.StructType -import org.apache.spark.sql.{DataFrame, Row, SQLContext} +import org.apache.spark.sql.{DataFrame, Row} class
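The post-SPARK-12309 pattern, as a hedged sketch (the suite name and data are illustrative; it assumes `MLlibTestSparkContext` wires up `sc` and `sqlContext` in `beforeAll`, as the diffs above rely on, and that the code lives inside Spark's own test tree where `SparkFunSuite` is visible):

```scala
import org.apache.spark.SparkFunSuite
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLlibTestSparkContext

// A sketch of a spark.ml test suite after SPARK-12309: no `new SQLContext(sc)`;
// the suite reuses the sqlContext provided by MLlibTestSparkContext.
class ExampleSuite extends SparkFunSuite with MLlibTestSparkContext {
  test("uses the shared sqlContext") {
    val df = sqlContext.createDataFrame(Seq(
      (0, Vectors.dense(1.0, 0.5)),
      (1, Vectors.dense(0.0, 2.0))
    )).toDF("id", "features")
    assert(df.count() === 2)
  }
}
```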
spark git commit: [SPARK-8745] [SQL] remove GenerateProjection
Repository: spark Updated Branches: refs/heads/master a6325fc40 -> 54c512ba9 [SPARK-8745] [SQL] remove GenerateProjection cc rxin Author: Davies LiuCloses #10316 from davies/remove_generate_projection. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/54c512ba Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/54c512ba Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/54c512ba Branch: refs/heads/master Commit: 54c512ba906edfc25b8081ad67498e99d884452b Parents: a6325fc Author: Davies Liu Authored: Wed Dec 16 10:22:48 2015 -0800 Committer: Davies Liu Committed: Wed Dec 16 10:22:48 2015 -0800 -- .../codegen/GenerateProjection.scala| 238 --- .../codegen/GenerateSafeProjection.scala| 4 + .../expressions/CodeGenerationSuite.scala | 5 +- .../expressions/ExpressionEvalHelper.scala | 39 +-- .../expressions/MathFunctionsSuite.scala| 4 +- .../codegen/CodegenExpressionCachingSuite.scala | 18 -- .../spark/sql/execution/local/ExpandNode.scala | 4 +- .../spark/sql/execution/local/LocalNode.scala | 18 -- 8 files changed, 11 insertions(+), 319 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/54c512ba/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateProjection.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateProjection.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateProjection.scala deleted file mode 100644 index f229f20..000 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateProjection.scala +++ /dev/null @@ -1,238 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - *http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.spark.sql.catalyst.expressions.codegen - -import org.apache.spark.sql.catalyst.InternalRow -import org.apache.spark.sql.catalyst.expressions._ -import org.apache.spark.sql.types._ - -/** - * Java can not access Projection (in package object) - */ -abstract class BaseProjection extends Projection {} - -abstract class CodeGenMutableRow extends MutableRow with BaseGenericInternalRow - -/** - * Generates bytecode that produces a new [[InternalRow]] object based on a fixed set of input - * [[Expression Expressions]] and a given input [[InternalRow]]. The returned [[InternalRow]] - * object is custom generated based on the output types of the [[Expression]] to avoid boxing of - * primitive values. 
- */ -object GenerateProjection extends CodeGenerator[Seq[Expression], Projection] { - - protected def canonicalize(in: Seq[Expression]): Seq[Expression] = -in.map(ExpressionCanonicalizer.execute) - - protected def bind(in: Seq[Expression], inputSchema: Seq[Attribute]): Seq[Expression] = -in.map(BindReferences.bindReference(_, inputSchema)) - - // Make Mutablility optional... - protected def create(expressions: Seq[Expression]): Projection = { -val ctx = newCodeGenContext() -val columns = expressions.zipWithIndex.map { - case (e, i) => -s"private ${ctx.javaType(e.dataType)} c$i = ${ctx.defaultValue(e.dataType)};\n" -}.mkString("\n") - -val initColumns = expressions.zipWithIndex.map { - case (e, i) => -val eval = e.gen(ctx) -s""" -{ - // column$i - ${eval.code} - nullBits[$i] = ${eval.isNull}; - if (!${eval.isNull}) { -c$i = ${eval.value}; - } -} -""" -}.mkString("\n") - -val getCases = (0 until expressions.size).map { i => - s"case $i: return c$i;" -}.mkString("\n") - -val updateCases = expressions.zipWithIndex.map { case (e, i) => - s"case $i: { c$i = (${ctx.boxedType(e.dataType)})value; return;}" -}.mkString("\n") - -val specificAccessorFunctions = ctx.primitiveTypes.map
spark git commit: [SPARK-12380] [PYSPARK] use SQLContext.getOrCreate in mllib
Repository: spark Updated Branches: refs/heads/branch-1.6 04e868b63 -> 552b38f87 [SPARK-12380] [PYSPARK] use SQLContext.getOrCreate in mllib MLlib should use SQLContext.getOrCreate() instead of creating a new SQLContext. Author: Davies LiuCloses #10338 from davies/create_context. (cherry picked from commit 27b98e99d21a0cc34955337f82a71a18f9220ab2) Signed-off-by: Davies Liu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/552b38f8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/552b38f8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/552b38f8 Branch: refs/heads/branch-1.6 Commit: 552b38f87fc0f6fab61b1e5405be58908b7f5544 Parents: 04e868b Author: Davies Liu Authored: Wed Dec 16 15:48:11 2015 -0800 Committer: Davies Liu Committed: Wed Dec 16 15:48:21 2015 -0800 -- python/pyspark/mllib/common.py | 6 +++--- python/pyspark/mllib/evaluation.py | 10 +- python/pyspark/mllib/feature.py| 4 +--- 3 files changed, 9 insertions(+), 11 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/552b38f8/python/pyspark/mllib/common.py -- diff --git a/python/pyspark/mllib/common.py b/python/pyspark/mllib/common.py index a439a48..9fda1b1 100644 --- a/python/pyspark/mllib/common.py +++ b/python/pyspark/mllib/common.py @@ -102,7 +102,7 @@ def _java2py(sc, r, encoding="bytes"): return RDD(jrdd, sc) if clsName == 'DataFrame': -return DataFrame(r, SQLContext(sc)) +return DataFrame(r, SQLContext.getOrCreate(sc)) if clsName in _picklable_classes: r = sc._jvm.SerDe.dumps(r) @@ -125,7 +125,7 @@ def callJavaFunc(sc, func, *args): def callMLlibFunc(name, *args): """ Call API in PythonMLLibAPI """ -sc = SparkContext._active_spark_context +sc = SparkContext.getOrCreate() api = getattr(sc._jvm.PythonMLLibAPI(), name) return callJavaFunc(sc, api, *args) @@ -135,7 +135,7 @@ class JavaModelWrapper(object): Wrapper for the model in JVM """ def __init__(self, java_model): -self._sc = SparkContext._active_spark_context +self._sc = SparkContext.getOrCreate() self._java_model = java_model def __del__(self): http://git-wip-us.apache.org/repos/asf/spark/blob/552b38f8/python/pyspark/mllib/evaluation.py -- diff --git a/python/pyspark/mllib/evaluation.py b/python/pyspark/mllib/evaluation.py index 8c87ee9..22e68ea 100644 --- a/python/pyspark/mllib/evaluation.py +++ b/python/pyspark/mllib/evaluation.py @@ -44,7 +44,7 @@ class BinaryClassificationMetrics(JavaModelWrapper): def __init__(self, scoreAndLabels): sc = scoreAndLabels.ctx -sql_ctx = SQLContext(sc) +sql_ctx = SQLContext.getOrCreate(sc) df = sql_ctx.createDataFrame(scoreAndLabels, schema=StructType([ StructField("score", DoubleType(), nullable=False), StructField("label", DoubleType(), nullable=False)])) @@ -103,7 +103,7 @@ class RegressionMetrics(JavaModelWrapper): def __init__(self, predictionAndObservations): sc = predictionAndObservations.ctx -sql_ctx = SQLContext(sc) +sql_ctx = SQLContext.getOrCreate(sc) df = sql_ctx.createDataFrame(predictionAndObservations, schema=StructType([ StructField("prediction", DoubleType(), nullable=False), StructField("observation", DoubleType(), nullable=False)])) @@ -197,7 +197,7 @@ class MulticlassMetrics(JavaModelWrapper): def __init__(self, predictionAndLabels): sc = predictionAndLabels.ctx -sql_ctx = SQLContext(sc) +sql_ctx = SQLContext.getOrCreate(sc) df = sql_ctx.createDataFrame(predictionAndLabels, schema=StructType([ StructField("prediction", DoubleType(), nullable=False), StructField("label", DoubleType(), nullable=False)])) @@ -338,7 +338,7 @@ 
class RankingMetrics(JavaModelWrapper): def __init__(self, predictionAndLabels): sc = predictionAndLabels.ctx -sql_ctx = SQLContext(sc) +sql_ctx = SQLContext.getOrCreate(sc) df = sql_ctx.createDataFrame(predictionAndLabels, schema=sql_ctx._inferSchema(predictionAndLabels)) java_model = callMLlibFunc("newRankingMetrics", df._jdf) @@ -424,7 +424,7 @@ class MultilabelMetrics(JavaModelWrapper): def __init__(self, predictionAndLabels): sc = predictionAndLabels.ctx -sql_ctx = SQLContext(sc) +sql_ctx = SQLContext.getOrCreate(sc) df =
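The same idiom exists in the Scala API; here is a small sketch of why `getOrCreate` is preferable inside library code (the helper class and method are illustrative, not from the patch):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// A sketch of library code that needs a SQLContext but must not clobber
// the one the application may already be using.
class MetricsHelper(sc: SparkContext) {
  // getOrCreate returns the existing singleton SQLContext for this
  // SparkContext if one exists, and only constructs one otherwise.
  private val sqlContext = SQLContext.getOrCreate(sc)

  def emptyCount(): Long = sqlContext.emptyDataFrame.count()
}
```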
spark git commit: [SPARK-12380] [PYSPARK] use SQLContext.getOrCreate in mllib
Repository: spark Updated Branches: refs/heads/master 3a44aebd0 -> 27b98e99d [SPARK-12380] [PYSPARK] use SQLContext.getOrCreate in mllib MLlib should use SQLContext.getOrCreate() instead of creating a new SQLContext. Author: Davies LiuCloses #10338 from davies/create_context. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/27b98e99 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/27b98e99 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/27b98e99 Branch: refs/heads/master Commit: 27b98e99d21a0cc34955337f82a71a18f9220ab2 Parents: 3a44aeb Author: Davies Liu Authored: Wed Dec 16 15:48:11 2015 -0800 Committer: Davies Liu Committed: Wed Dec 16 15:48:11 2015 -0800 -- python/pyspark/mllib/common.py | 6 +++--- python/pyspark/mllib/evaluation.py | 10 +- python/pyspark/mllib/feature.py| 4 +--- 3 files changed, 9 insertions(+), 11 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/27b98e99/python/pyspark/mllib/common.py -- diff --git a/python/pyspark/mllib/common.py b/python/pyspark/mllib/common.py index a439a48..9fda1b1 100644 --- a/python/pyspark/mllib/common.py +++ b/python/pyspark/mllib/common.py @@ -102,7 +102,7 @@ def _java2py(sc, r, encoding="bytes"): return RDD(jrdd, sc) if clsName == 'DataFrame': -return DataFrame(r, SQLContext(sc)) +return DataFrame(r, SQLContext.getOrCreate(sc)) if clsName in _picklable_classes: r = sc._jvm.SerDe.dumps(r) @@ -125,7 +125,7 @@ def callJavaFunc(sc, func, *args): def callMLlibFunc(name, *args): """ Call API in PythonMLLibAPI """ -sc = SparkContext._active_spark_context +sc = SparkContext.getOrCreate() api = getattr(sc._jvm.PythonMLLibAPI(), name) return callJavaFunc(sc, api, *args) @@ -135,7 +135,7 @@ class JavaModelWrapper(object): Wrapper for the model in JVM """ def __init__(self, java_model): -self._sc = SparkContext._active_spark_context +self._sc = SparkContext.getOrCreate() self._java_model = java_model def __del__(self): http://git-wip-us.apache.org/repos/asf/spark/blob/27b98e99/python/pyspark/mllib/evaluation.py -- diff --git a/python/pyspark/mllib/evaluation.py b/python/pyspark/mllib/evaluation.py index 8c87ee9..22e68ea 100644 --- a/python/pyspark/mllib/evaluation.py +++ b/python/pyspark/mllib/evaluation.py @@ -44,7 +44,7 @@ class BinaryClassificationMetrics(JavaModelWrapper): def __init__(self, scoreAndLabels): sc = scoreAndLabels.ctx -sql_ctx = SQLContext(sc) +sql_ctx = SQLContext.getOrCreate(sc) df = sql_ctx.createDataFrame(scoreAndLabels, schema=StructType([ StructField("score", DoubleType(), nullable=False), StructField("label", DoubleType(), nullable=False)])) @@ -103,7 +103,7 @@ class RegressionMetrics(JavaModelWrapper): def __init__(self, predictionAndObservations): sc = predictionAndObservations.ctx -sql_ctx = SQLContext(sc) +sql_ctx = SQLContext.getOrCreate(sc) df = sql_ctx.createDataFrame(predictionAndObservations, schema=StructType([ StructField("prediction", DoubleType(), nullable=False), StructField("observation", DoubleType(), nullable=False)])) @@ -197,7 +197,7 @@ class MulticlassMetrics(JavaModelWrapper): def __init__(self, predictionAndLabels): sc = predictionAndLabels.ctx -sql_ctx = SQLContext(sc) +sql_ctx = SQLContext.getOrCreate(sc) df = sql_ctx.createDataFrame(predictionAndLabels, schema=StructType([ StructField("prediction", DoubleType(), nullable=False), StructField("label", DoubleType(), nullable=False)])) @@ -338,7 +338,7 @@ class RankingMetrics(JavaModelWrapper): def __init__(self, predictionAndLabels): sc = 
predictionAndLabels.ctx -sql_ctx = SQLContext(sc) +sql_ctx = SQLContext.getOrCreate(sc) df = sql_ctx.createDataFrame(predictionAndLabels, schema=sql_ctx._inferSchema(predictionAndLabels)) java_model = callMLlibFunc("newRankingMetrics", df._jdf) @@ -424,7 +424,7 @@ class MultilabelMetrics(JavaModelWrapper): def __init__(self, predictionAndLabels): sc = predictionAndLabels.ctx -sql_ctx = SQLContext(sc) +sql_ctx = SQLContext.getOrCreate(sc) df = sql_ctx.createDataFrame(predictionAndLabels, schema=sql_ctx._inferSchema(predictionAndLabels)) java_class =
spark git commit: [SPARK-9690][ML][PYTHON] pyspark CrossValidator random seed
Repository: spark Updated Branches: refs/heads/master 9657ee878 -> 3a44aebd0 [SPARK-9690][ML][PYTHON] pyspark CrossValidator random seed Extend CrossValidator with HasSeed in PySpark. This PR replaces [https://github.com/apache/spark/pull/7997] CC: yanboliang thunterdb mmenestret Would one of you mind taking a look? Thanks! Author: Joseph K. BradleyAuthor: Martin MENESTRET Closes #10268 from jkbradley/pyspark-cv-seed. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3a44aebd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3a44aebd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3a44aebd Branch: refs/heads/master Commit: 3a44aebd0c5331f6ff00734fa44ef63f8d18cfbb Parents: 9657ee8 Author: Martin Menestret Authored: Wed Dec 16 14:05:35 2015 -0800 Committer: Joseph K. Bradley Committed: Wed Dec 16 14:05:35 2015 -0800 -- python/pyspark/ml/tuning.py | 20 +--- 1 file changed, 13 insertions(+), 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3a44aebd/python/pyspark/ml/tuning.py -- diff --git a/python/pyspark/ml/tuning.py b/python/pyspark/ml/tuning.py index 705ee53..08f8db5 100644 --- a/python/pyspark/ml/tuning.py +++ b/python/pyspark/ml/tuning.py @@ -19,8 +19,9 @@ import itertools import numpy as np from pyspark import since -from pyspark.ml.param import Params, Param from pyspark.ml import Estimator, Model +from pyspark.ml.param import Params, Param +from pyspark.ml.param.shared import HasSeed from pyspark.ml.util import keyword_only from pyspark.sql.functions import rand @@ -89,7 +90,7 @@ class ParamGridBuilder(object): return [dict(zip(keys, prod)) for prod in itertools.product(*grid_values)] -class CrossValidator(Estimator): +class CrossValidator(Estimator, HasSeed): """ K-fold cross validation. @@ -129,9 +130,11 @@ class CrossValidator(Estimator): numFolds = Param(Params._dummy(), "numFolds", "number of folds for cross validation") @keyword_only -def __init__(self, estimator=None, estimatorParamMaps=None, evaluator=None, numFolds=3): +def __init__(self, estimator=None, estimatorParamMaps=None, evaluator=None, numFolds=3, + seed=None): """ -__init__(self, estimator=None, estimatorParamMaps=None, evaluator=None, numFolds=3) +__init__(self, estimator=None, estimatorParamMaps=None, evaluator=None, numFolds=3,\ + seed=None) """ super(CrossValidator, self).__init__() #: param for estimator to be cross-validated @@ -151,9 +154,11 @@ class CrossValidator(Estimator): @keyword_only @since("1.4.0") -def setParams(self, estimator=None, estimatorParamMaps=None, evaluator=None, numFolds=3): +def setParams(self, estimator=None, estimatorParamMaps=None, evaluator=None, numFolds=3, + seed=None): """ -setParams(self, estimator=None, estimatorParamMaps=None, evaluator=None, numFolds=3): +setParams(self, estimator=None, estimatorParamMaps=None, evaluator=None, numFolds=3,\ + seed=None): Sets params for cross validator. """ kwargs = self.setParams._input_kwargs @@ -225,9 +230,10 @@ class CrossValidator(Estimator): numModels = len(epm) eva = self.getOrDefault(self.evaluator) nFolds = self.getOrDefault(self.numFolds) +seed = self.getOrDefault(self.seed) h = 1.0 / nFolds randCol = self.uid + "_rand" -df = dataset.select("*", rand(0).alias(randCol)) +df = dataset.select("*", rand(seed).alias(randCol)) metrics = np.zeros(numModels) for i in range(nFolds): validateLB = i * h
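The fold-assignment mechanism that the seed now controls can be sketched in Scala as follows (a standalone illustration mirroring the Python code above, not the CrossValidator source; the column name is illustrative, where PySpark uses `uid + "_rand"`):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, rand}

object KFoldSketch {
  // Deterministic k-fold split via a seeded random column -- the same
  // mechanism the patch above seeds in PySpark's CrossValidator.
  def kFoldSplits(dataset: DataFrame, nFolds: Int, seed: Long): Seq[(DataFrame, DataFrame)] = {
    val h = 1.0 / nFolds
    val randCol = "cv_rand"
    val df = dataset.select(col("*"), rand(seed).as(randCol))
    (0 until nFolds).map { i =>
      val validateLB = i * h
      val validateUB = (i + 1) * h
      val validation = df.filter(df(randCol) >= validateLB && df(randCol) < validateUB)
      val training   = df.filter(df(randCol) < validateLB || df(randCol) >= validateUB)
      (training.drop(randCol), validation.drop(randCol))
    }
  }
}
```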
spark git commit: [SPARK-11677][SQL] ORC filter tests all pass if filters are actually not pushed down.
Repository: spark Updated Branches: refs/heads/master edf65cd96 -> 9657ee878 [SPARK-11677][SQL] ORC filter tests all pass if filters are actually not pushed down. Currently ORC filters are not tested properly. All the tests pass even if the filters are not pushed down or disabled. In this PR, I add some logic for this. Since ORC does not fully filter record by record, this checks the count of the result and whether it contains the expected values. Author: hyukjinkwonCloses #9687 from HyukjinKwon/SPARK-11677. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9657ee87 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9657ee87 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9657ee87 Branch: refs/heads/master Commit: 9657ee87888422c5596987fe760b49117a0ea4e2 Parents: edf65cd Author: hyukjinkwon Authored: Wed Dec 16 13:24:49 2015 -0800 Committer: Michael Armbrust Committed: Wed Dec 16 13:24:49 2015 -0800 -- .../spark/sql/hive/orc/OrcQuerySuite.scala | 53 +--- 1 file changed, 36 insertions(+), 17 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9657ee87/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala index 7efeab5..2156806 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala @@ -350,28 +350,47 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest { withTempPath { dir => withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") { import testImplicits._ - val path = dir.getCanonicalPath +val path = dir.getCanonicalPath + +// For field "a", the first column has odd integers. This is to check the filtered count +// when `isNull` is performed. For Field "b", `isNotNull` of ORC file filters rows +// only when all the values are null (maybe this works differently when the data +// or query is complicated). So, simply here a column only having `null` is added. +val data = (0 until 10).map { i => + val maybeInt = if (i % 2 == 0) None else Some(i) + val nullValue: Option[String] = None + (maybeInt, nullValue) +} +createDataFrame(data).toDF("a", "b").write.orc(path) val df = sqlContext.read.orc(path) -def checkPredicate(pred: Column, answer: Seq[Long]): Unit = { - checkAnswer(df.where(pred), answer.map(Row(_))) +def checkPredicate(pred: Column, answer: Seq[Row]): Unit = { + val sourceDf = stripSparkFilter(df.where(pred)) + val data = sourceDf.collect().toSet + val expectedData = answer.toSet + + // When a filter is pushed to ORC, ORC can apply it to rows. So, we can check + // the number of rows returned from the ORC to make sure our filter pushdown work. + // A tricky part is, ORC does not process filter rows fully but return some possible + // results. So, this checks if the number of result is less than the original count + // of data, and then checks if it contains the expected data. 
+ val isOrcFiltered = sourceDf.count < 10 && expectedData.subsetOf(data) + assert(isOrcFiltered) } -checkPredicate('id === 5, Seq(5L)) -checkPredicate('id <=> 5, Seq(5L)) -checkPredicate('id < 5, 0L to 4L) -checkPredicate('id <= 5, 0L to 5L) -checkPredicate('id > 5, 6L to 9L) -checkPredicate('id >= 5, 5L to 9L) -checkPredicate('id.isNull, Seq.empty[Long]) -checkPredicate('id.isNotNull, 0L to 9L) -checkPredicate('id.isin(1L, 3L, 5L), Seq(1L, 3L, 5L)) -checkPredicate('id > 0 && 'id < 3, 1L to 2L) -checkPredicate('id < 1 || 'id > 8, Seq(0L, 9L)) -checkPredicate(!('id > 3), 0L to 3L) -checkPredicate(!('id > 0 && 'id < 3), Seq(0L) ++ (3L to 9L)) +checkPredicate('a === 5, List(5).map(Row(_, null))) +checkPredicate('a <=> 5, List(5).map(Row(_, null))) +checkPredicate('a < 5, List(1, 3).map(Row(_, null))) +checkPredicate('a <= 5, List(1, 3, 5).map(Row(_, null))) +checkPredicate('a > 5, List(7, 9).map(Row(_, null))) +checkPredicate('a >= 5, List(5, 7, 9).map(Row(_, null))) +checkPredicate('a.isNull, List(null).map(Row(_, null))) +checkPredicate('b.isNotNull, List()) +
spark git commit: [SPARK-12164][SQL] Decode the encoded values and then display
Repository: spark Updated Branches: refs/heads/master a783a8ed4 -> edf65cd96 [SPARK-12164][SQL] Decode the encoded values and then display Based on the suggestions from marmbrus cloud-fan in https://github.com/apache/spark/pull/10165 , this PR is to print the decoded values (user objects) in `Dataset.show` ```scala implicit val kryoEncoder = Encoders.kryo[KryoClassData] val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), KryoClassData("c", 3)).toDS() ds.show(20, false); ``` The current output is like ``` +--+ |value | +--+ |[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 97, 2]| |[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 98, 4]| |[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 99, 6]| +--+ ``` After the fix, it will be like the below if and only if the user overrides the `toString` function in the class `KryoClassData` ```scala override def toString: String = s"KryoClassData($a, $b)" ``` ``` +---+ |value | +---+ |KryoClassData(a, 1)| |KryoClassData(b, 2)| |KryoClassData(c, 3)| +---+ ``` If users do not override the `toString` function, the results will be like ``` +---+ |value | +---+ |org.apache.spark.sql.KryoClassData68ef| |org.apache.spark.sql.KryoClassData6915| |org.apache.spark.sql.KryoClassData693b| +---+ ``` Question: Should we add another optional parameter in the function `show`? It would decide whether the function `show` displays the hex values or the object values. Author: gatorsmileCloses #10215 from gatorsmile/showDecodedValue. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/edf65cd9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/edf65cd9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/edf65cd9 Branch: refs/heads/master Commit: edf65cd961b913ef54104770630a50fd4b120b4b Parents: a783a8e Author: gatorsmile Authored: Wed Dec 16 13:22:34 2015 -0800 Committer: Michael Armbrust Committed: Wed Dec 16 13:22:34 2015 -0800 -- .../scala/org/apache/spark/sql/DataFrame.scala | 50 +-- .../scala/org/apache/spark/sql/Dataset.scala| 37 ++- .../apache/spark/sql/execution/Queryable.scala | 65 .../org/apache/spark/sql/DataFrameSuite.scala | 15 + .../org/apache/spark/sql/DatasetSuite.scala | 14 + 5 files changed, 133 insertions(+), 48 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/edf65cd9/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala index 497bd48..6250e95 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala @@ -165,13 +165,11 @@ class DataFrame private[sql]( * @param _numRows Number of rows to show * @param truncate Whether truncate long strings and align cells right */ - private[sql] def showString(_numRows: Int, truncate: Boolean = true): String = { + override private[sql] def showString(_numRows: Int, truncate: Boolean = true): String = { val numRows = _numRows.max(0) -val sb = new StringBuilder val takeResult = take(numRows + 1) val hasMoreData = takeResult.length > numRows val data = takeResult.take(numRows) -val numCols = schema.fieldNames.length
spark git commit: [MINOR] Add missing interpolation in NettyRPCEnv
Repository: spark Updated Branches: refs/heads/branch-1.6 552b38f87 -> 638b89bc3 [MINOR] Add missing interpolation in NettyRPCEnv ``` Exception in thread "main" org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in ${timeout.duration}. This timeout is controlled by spark.rpc.askTimeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) ``` Author: Andrew OrCloses #10334 from andrewor14/rpc-typo. (cherry picked from commit 861549acdbc11920cde51fc57752a8bc241064e5) Signed-off-by: Shixiong Zhu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/638b89bc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/638b89bc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/638b89bc Branch: refs/heads/branch-1.6 Commit: 638b89bc3b1c421fe11cbaf52649225662d3d3ce Parents: 552b38f Author: Andrew Or Authored: Wed Dec 16 16:13:48 2015 -0800 Committer: Shixiong Zhu Committed: Wed Dec 16 16:13:55 2015 -0800 -- core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/638b89bc/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala -- diff --git a/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala b/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala index 9d353bb..a53bc5e 100644 --- a/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala +++ b/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala @@ -239,7 +239,7 @@ private[netty] class NettyRpcEnv( val timeoutCancelable = timeoutScheduler.schedule(new Runnable { override def run(): Unit = { promise.tryFailure( - new TimeoutException("Cannot receive any reply in ${timeout.duration}")) + new TimeoutException(s"Cannot receive any reply in ${timeout.duration}")) } }, timeout.duration.toNanos, TimeUnit.NANOSECONDS) promise.future.onComplete { v =>
spark git commit: [MINOR] Add missing interpolation in NettyRPCEnv
Repository: spark Updated Branches: refs/heads/master 27b98e99d -> 861549acd [MINOR] Add missing interpolation in NettyRPCEnv ``` Exception in thread "main" org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in ${timeout.duration}. This timeout is controlled by spark.rpc.askTimeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) ``` Author: Andrew OrCloses #10334 from andrewor14/rpc-typo. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/861549ac Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/861549ac Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/861549ac Branch: refs/heads/master Commit: 861549acdbc11920cde51fc57752a8bc241064e5 Parents: 27b98e9 Author: Andrew Or Authored: Wed Dec 16 16:13:48 2015 -0800 Committer: Shixiong Zhu Committed: Wed Dec 16 16:13:48 2015 -0800 -- core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/861549ac/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala -- diff --git a/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala b/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala index f82fd4e..de3db6b 100644 --- a/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala +++ b/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala @@ -232,7 +232,7 @@ private[netty] class NettyRpcEnv( val timeoutCancelable = timeoutScheduler.schedule(new Runnable { override def run(): Unit = { promise.tryFailure( - new TimeoutException("Cannot receive any reply in ${timeout.duration}")) + new TimeoutException(s"Cannot receive any reply in ${timeout.duration}")) } }, timeout.duration.toNanos, TimeUnit.NANOSECONDS) promise.future.onComplete { v =>
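For readers skimming the one-character diff above, here is a tiny Scala sketch of what the missing `s` prefix changes (the object and strings are illustrative):

```scala
object InterpolationSketch extends App {
  case class Timeout(duration: String)
  val timeout = Timeout("120 seconds")

  // Without the s prefix, ${...} is literal text -- the bug being fixed.
  println("Cannot receive any reply in ${timeout.duration}")
  // With the s interpolator, the expression is evaluated and spliced in.
  println(s"Cannot receive any reply in ${timeout.duration}")
}
```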
spark git commit: [SPARK-12365][CORE] Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called
Repository: spark Updated Branches: refs/heads/master 38d9795a4 -> f590178d7 [SPARK-12365][CORE] Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called SPARK-9886 fixed ExternalBlockStore.scala This PR fixes the remaining references to Runtime.getRuntime.addShutdownHook() Author: tedyuCloses #10325 from ted-yu/master. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f590178d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f590178d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f590178d Branch: refs/heads/master Commit: f590178d7a06221a93286757c68b23919bee9f03 Parents: 38d9795 Author: tedyu Authored: Wed Dec 16 19:02:12 2015 -0800 Committer: Andrew Or Committed: Wed Dec 16 19:02:12 2015 -0800 -- .../spark/deploy/ExternalShuffleService.scala | 18 +-- .../deploy/mesos/MesosClusterDispatcher.scala | 13 --- .../apache/spark/util/ShutdownHookManager.scala | 4 scalastyle-config.xml | 12 ++ .../hive/thriftserver/SparkSQLCLIDriver.scala | 24 +--- 5 files changed, 38 insertions(+), 33 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f590178d/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala b/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala index e8a1e35..7fc96e4 100644 --- a/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala +++ b/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala @@ -28,7 +28,7 @@ import org.apache.spark.network.sasl.SaslServerBootstrap import org.apache.spark.network.server.{TransportServerBootstrap, TransportServer} import org.apache.spark.network.shuffle.ExternalShuffleBlockHandler import org.apache.spark.network.util.TransportConf -import org.apache.spark.util.Utils +import org.apache.spark.util.{ShutdownHookManager, Utils} /** * Provides a server from which Executors can read shuffle files (rather than reading directly from @@ -118,19 +118,13 @@ object ExternalShuffleService extends Logging { server = newShuffleService(sparkConf, securityManager) server.start() -installShutdownHook() +ShutdownHookManager.addShutdownHook { () => + logInfo("Shutting down shuffle service.") + server.stop() + barrier.countDown() +} // keep running until the process is terminated barrier.await() } - - private def installShutdownHook(): Unit = { -Runtime.getRuntime.addShutdownHook(new Thread("External Shuffle Service shutdown thread") { - override def run() { -logInfo("Shutting down shuffle service.") -server.stop() -barrier.countDown() - } -}) - } } http://git-wip-us.apache.org/repos/asf/spark/blob/f590178d/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala b/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala index 5d4e5b8..389eff5 100644 --- a/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala +++ b/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala @@ -22,7 +22,7 @@ import java.util.concurrent.CountDownLatch import org.apache.spark.deploy.mesos.ui.MesosClusterUI import org.apache.spark.deploy.rest.mesos.MesosRestServer import org.apache.spark.scheduler.cluster.mesos._ -import org.apache.spark.util.SignalLogger +import org.apache.spark.util.{ShutdownHookManager, SignalLogger} import 
org.apache.spark.{Logging, SecurityManager, SparkConf} /* @@ -103,14 +103,11 @@ private[mesos] object MesosClusterDispatcher extends Logging { } val dispatcher = new MesosClusterDispatcher(dispatcherArgs, conf) dispatcher.start() -val shutdownHook = new Thread() { - override def run() { -logInfo("Shutdown hook is shutting down dispatcher") -dispatcher.stop() -dispatcher.awaitShutdown() - } +ShutdownHookManager.addShutdownHook { () => + logInfo("Shutdown hook is shutting down dispatcher") + dispatcher.stop() + dispatcher.awaitShutdown() } -Runtime.getRuntime.addShutdownHook(shutdownHook) dispatcher.awaitShutdown() } } http://git-wip-us.apache.org/repos/asf/spark/blob/f590178d/core/src/main/scala/org/apache/spark/util/ShutdownHookManager.scala
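A sketch of the before/after pattern this refactor standardizes (the object, hook bodies, and package name are illustrative; `ShutdownHookManager` is a `private[spark]` utility, so the sketch assumes code living under the `org.apache.spark` package tree):

```scala
package org.apache.spark.sketch // assumption: needed for private[spark] visibility

import org.apache.spark.util.ShutdownHookManager

object ShutdownHookSketch {
  // Before: a raw JVM hook, outside Spark's prioritized shutdown sequence.
  def installRawHook(stop: () => Unit): Unit = {
    Runtime.getRuntime.addShutdownHook(new Thread("example-shutdown") {
      override def run(): Unit = stop()
    })
  }

  // After: registering through ShutdownHookManager so the hook participates
  // in Spark's priority-ordered shutdown.
  def installManagedHook(stop: () => Unit): AnyRef = {
    ShutdownHookManager.addShutdownHook { () =>
      stop()
    }
  }
}
```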
spark git commit: [SPARK-12365][CORE] Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called
Repository: spark
Updated Branches: refs/heads/branch-1.6 fb02e4e3b -> 4af64385b

[SPARK-12365][CORE] Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called

SPARK-9886 fixed ExternalBlockStore.scala. This PR fixes the remaining references to Runtime.getRuntime.addShutdownHook().

Author: tedyu

Closes #10325 from ted-yu/master.

(cherry picked from commit f590178d7a06221a93286757c68b23919bee9f03)
Signed-off-by: Andrew Or

Conflicts:
	sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4af64385
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4af64385
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4af64385

Branch: refs/heads/branch-1.6
Commit: 4af64385b085002d94c54d11bbd144f9f026bbd8
Parents: fb02e4e
Author: tedyu
Authored: Wed Dec 16 19:02:12 2015 -0800
Committer: Andrew Or
Committed: Wed Dec 16 19:03:30 2015 -0800

--
 .../spark/deploy/ExternalShuffleService.scala   | 18 ++
 .../deploy/mesos/MesosClusterDispatcher.scala   | 13 +
 .../apache/spark/util/ShutdownHookManager.scala |  4
 scalastyle-config.xml                           | 12
 4 files changed, 27 insertions(+), 20 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/4af64385/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala
--
diff --git a/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala b/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala
index e8a1e35..7fc96e4 100644
--- a/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala
@@ -28,7 +28,7 @@
 import org.apache.spark.network.sasl.SaslServerBootstrap
 import org.apache.spark.network.server.{TransportServerBootstrap, TransportServer}
 import org.apache.spark.network.shuffle.ExternalShuffleBlockHandler
 import org.apache.spark.network.util.TransportConf
-import org.apache.spark.util.Utils
+import org.apache.spark.util.{ShutdownHookManager, Utils}

 /**
  * Provides a server from which Executors can read shuffle files (rather than reading directly from
@@ -118,19 +118,13 @@ object ExternalShuffleService extends Logging {
     server = newShuffleService(sparkConf, securityManager)
     server.start()

-    installShutdownHook()
+    ShutdownHookManager.addShutdownHook { () =>
+      logInfo("Shutting down shuffle service.")
+      server.stop()
+      barrier.countDown()
+    }

     // keep running until the process is terminated
     barrier.await()
   }
-
-  private def installShutdownHook(): Unit = {
-    Runtime.getRuntime.addShutdownHook(new Thread("External Shuffle Service shutdown thread") {
-      override def run() {
-        logInfo("Shutting down shuffle service.")
-        server.stop()
-        barrier.countDown()
-      }
-    })
-  }
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/4af64385/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala
--
diff --git a/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala b/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala
index 5d4e5b8..389eff5 100644
--- a/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala
@@ -22,7 +22,7 @@ import java.util.concurrent.CountDownLatch
 import org.apache.spark.deploy.mesos.ui.MesosClusterUI
 import org.apache.spark.deploy.rest.mesos.MesosRestServer
 import org.apache.spark.scheduler.cluster.mesos._
-import org.apache.spark.util.SignalLogger
+import org.apache.spark.util.{ShutdownHookManager, SignalLogger}
 import org.apache.spark.{Logging, SecurityManager, SparkConf}

 /*
@@ -103,14 +103,11 @@ private[mesos] object MesosClusterDispatcher extends Logging {
     }
     val dispatcher = new MesosClusterDispatcher(dispatcherArgs, conf)
     dispatcher.start()
-    val shutdownHook = new Thread() {
-      override def run() {
-        logInfo("Shutdown hook is shutting down dispatcher")
-        dispatcher.stop()
-        dispatcher.awaitShutdown()
-      }
+    ShutdownHookManager.addShutdownHook { () =>
+      logInfo("Shutdown hook is shutting down dispatcher")
+      dispatcher.stop()
+      dispatcher.awaitShutdown()
     }
-    Runtime.getRuntime.addShutdownHook(shutdownHook)
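For reference, a minimal sketch of the registration idiom this commit converges on. Note that `ShutdownHookManager` is `private[spark]` in this release, so the sketch has to live under the `org.apache.spark` package and reflects the in-tree idiom rather than a public API; the package and object names below are illustrative:

```
package org.apache.spark.demo

import org.apache.spark.util.ShutdownHookManager

object HookDemo {
  def main(args: Array[String]): Unit = {
    // Register cleanup through Spark's manager instead of a raw JVM hook;
    // hooks run on shutdown in priority order and are managed in one place.
    ShutdownHookManager.addShutdownHook { () =>
      println("Shutting down service.")
    }
  }
}
```

Compared with `Runtime.getRuntime.addShutdownHook(new Thread { ... })`, this avoids one ad-hoc thread per component and gives Spark a single place to order and cancel its hooks.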
spark git commit: [SPARK-12186][WEB UI] Send the complete request URI including the query string when redirecting.
Repository: spark
Updated Branches: refs/heads/branch-1.6 4af64385b -> 154567dca

[SPARK-12186][WEB UI] Send the complete request URI including the query string when redirecting.

Author: Rohit Agarwal

Closes #10180 from mindprince/SPARK-12186.

(cherry picked from commit fdb38227564c1af40cbfb97df420b23eb04c002b)
Signed-off-by: Andrew Or

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/154567dc
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/154567dc
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/154567dc

Branch: refs/heads/branch-1.6
Commit: 154567dca126d4992c9c9b08d71d22e9af43c995
Parents: 4af6438
Author: Rohit Agarwal
Authored: Wed Dec 16 19:04:33 2015 -0800
Committer: Andrew Or
Committed: Wed Dec 16 19:04:43 2015 -0800

--
 .../scala/org/apache/spark/deploy/history/HistoryServer.scala | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/154567dc/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
--
diff --git a/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala b/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
index d4f327c..f31fef0 100644
--- a/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
@@ -103,7 +103,9 @@ class HistoryServer(
       // Note we don't use the UI retrieved from the cache; the cache loader above will register
       // the app's UI, and all we need to do is redirect the user to the same URI that was
       // requested, and the proper data should be served at that point.
-      res.sendRedirect(res.encodeRedirectURL(req.getRequestURI()))
+      // Also, make sure that the redirect url contains the query string present in the request.
+      val requestURI = req.getRequestURI + Option(req.getQueryString).map("?" + _).getOrElse("")
+      res.sendRedirect(res.encodeRedirectURL(requestURI))
     }

     // SPARK-5983 ensure TRACE is not supported
spark git commit: [SPARK-12186][WEB UI] Send the complete request URI including the query string when redirecting.
Repository: spark
Updated Branches: refs/heads/master f590178d7 -> fdb382275

[SPARK-12186][WEB UI] Send the complete request URI including the query string when redirecting.

Author: Rohit Agarwal

Closes #10180 from mindprince/SPARK-12186.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fdb38227
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fdb38227
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fdb38227

Branch: refs/heads/master
Commit: fdb38227564c1af40cbfb97df420b23eb04c002b
Parents: f590178
Author: Rohit Agarwal
Authored: Wed Dec 16 19:04:33 2015 -0800
Committer: Andrew Or
Committed: Wed Dec 16 19:04:33 2015 -0800

--
 .../scala/org/apache/spark/deploy/history/HistoryServer.scala | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/fdb38227/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
--
diff --git a/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala b/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
index d4f327c..f31fef0 100644
--- a/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
@@ -103,7 +103,9 @@ class HistoryServer(
       // Note we don't use the UI retrieved from the cache; the cache loader above will register
       // the app's UI, and all we need to do is redirect the user to the same URI that was
       // requested, and the proper data should be served at that point.
-      res.sendRedirect(res.encodeRedirectURL(req.getRequestURI()))
+      // Also, make sure that the redirect url contains the query string present in the request.
+      val requestURI = req.getRequestURI + Option(req.getQueryString).map("?" + _).getOrElse("")
+      res.sendRedirect(res.encodeRedirectURL(requestURI))
     }

     // SPARK-5983 ensure TRACE is not supported
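A standalone sketch of the redirect pattern both commits apply, assuming only the servlet API (the object and method names are illustrative):

```
import javax.servlet.http.{HttpServletRequest, HttpServletResponse}

object Redirects {
  // getQueryString returns null when the URL carries no '?', hence the Option guard.
  def redirectPreservingQuery(req: HttpServletRequest, res: HttpServletResponse): Unit = {
    val target = req.getRequestURI + Option(req.getQueryString).map("?" + _).getOrElse("")
    res.sendRedirect(res.encodeRedirectURL(target))
  }
}
```

With this, a request like `/history/<appId>?<params>` is redirected with its parameters intact instead of having them dropped.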
spark git commit: [SPARK-12390] Clean up unused serializer parameter in BlockManager
Repository: spark
Updated Branches: refs/heads/master d1508dd9b -> 97678edea

[SPARK-12390] Clean up unused serializer parameter in BlockManager

No change in functionality is intended. This only changes internal API.

Author: Andrew Or

Closes #10343 from andrewor14/clean-bm-serializer.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/97678ede
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/97678ede
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/97678ede

Branch: refs/heads/master
Commit: 97678edeaaafc19ea18d044233a952d2e2e89fbc
Parents: d1508dd
Author: Andrew Or
Authored: Wed Dec 16 20:01:47 2015 -0800
Committer: Andrew Or
Committed: Wed Dec 16 20:01:47 2015 -0800

--
 .../org/apache/spark/storage/BlockManager.scala | 29 
 .../org/apache/spark/storage/DiskStore.scala    | 10 ---
 2 files changed, 11 insertions(+), 28 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/97678ede/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
--
diff --git a/core/src/main/scala/org/apache/spark/storage/BlockManager.scala b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
index 540e1ec..6074fc5 100644
--- a/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
+++ b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
@@ -1190,20 +1190,16 @@ private[spark] class BlockManager(
   def dataSerializeStream(
       blockId: BlockId,
       outputStream: OutputStream,
-      values: Iterator[Any],
-      serializer: Serializer = defaultSerializer): Unit = {
+      values: Iterator[Any]): Unit = {
     val byteStream = new BufferedOutputStream(outputStream)
-    val ser = serializer.newInstance()
+    val ser = defaultSerializer.newInstance()
     ser.serializeStream(wrapForCompression(blockId, byteStream)).writeAll(values).close()
   }

   /** Serializes into a byte buffer. */
-  def dataSerialize(
-      blockId: BlockId,
-      values: Iterator[Any],
-      serializer: Serializer = defaultSerializer): ByteBuffer = {
+  def dataSerialize(blockId: BlockId, values: Iterator[Any]): ByteBuffer = {
     val byteStream = new ByteBufferOutputStream(4096)
-    dataSerializeStream(blockId, byteStream, values, serializer)
+    dataSerializeStream(blockId, byteStream, values)
     byteStream.toByteBuffer
   }

@@ -1211,24 +1207,21 @@ private[spark] class BlockManager(
    * Deserializes a ByteBuffer into an iterator of values and disposes of it when the end of
    * the iterator is reached.
    */
-  def dataDeserialize(
-      blockId: BlockId,
-      bytes: ByteBuffer,
-      serializer: Serializer = defaultSerializer): Iterator[Any] = {
+  def dataDeserialize(blockId: BlockId, bytes: ByteBuffer): Iterator[Any] = {
     bytes.rewind()
-    dataDeserializeStream(blockId, new ByteBufferInputStream(bytes, true), serializer)
+    dataDeserializeStream(blockId, new ByteBufferInputStream(bytes, true))
   }

   /**
    * Deserializes a InputStream into an iterator of values and disposes of it when the end of
    * the iterator is reached.
    */
-  def dataDeserializeStream(
-      blockId: BlockId,
-      inputStream: InputStream,
-      serializer: Serializer = defaultSerializer): Iterator[Any] = {
+  def dataDeserializeStream(blockId: BlockId, inputStream: InputStream): Iterator[Any] = {
     val stream = new BufferedInputStream(inputStream)
-    serializer.newInstance().deserializeStream(wrapForCompression(blockId, stream)).asIterator
+    defaultSerializer
+      .newInstance()
+      .deserializeStream(wrapForCompression(blockId, stream))
+      .asIterator
   }

   def stop(): Unit = {

http://git-wip-us.apache.org/repos/asf/spark/blob/97678ede/core/src/main/scala/org/apache/spark/storage/DiskStore.scala
--
diff --git a/core/src/main/scala/org/apache/spark/storage/DiskStore.scala b/core/src/main/scala/org/apache/spark/storage/DiskStore.scala
index c008b9d..6c44771 100644
--- a/core/src/main/scala/org/apache/spark/storage/DiskStore.scala
+++ b/core/src/main/scala/org/apache/spark/storage/DiskStore.scala
@@ -144,16 +144,6 @@ private[spark] class DiskStore(blockManager: BlockManager, diskManager: DiskBloc
     getBytes(blockId).map(buffer => blockManager.dataDeserialize(blockId, buffer))
   }

-  /**
-   * A version of getValues that allows a custom serializer. This is used as part of the
-   * shuffle short-circuit code.
-   */
-  def getValues(blockId: BlockId, serializer: Serializer): Option[Iterator[Any]] = {
-    // TODO: Should bypass getBytes and use a stream based implementation, so that
-    //
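A self-contained sketch of the round trip the simplified methods now standardize on, with one fixed serializer shared by the write and read paths (the helpers are stand-ins, not the `BlockManager` internals, and compression wrapping is omitted):

```
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.apache.spark.SparkConf
import org.apache.spark.serializer.JavaSerializer

object RoundTrip {
  // The single "default" serializer every call site now shares.
  val defaultSerializer = new JavaSerializer(new SparkConf())

  def serialize(values: Iterator[Any]): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    defaultSerializer.newInstance().serializeStream(out).writeAll(values).close()
    out.toByteArray
  }

  def deserialize(bytes: Array[Byte]): Iterator[Any] =
    defaultSerializer.newInstance()
      .deserializeStream(new ByteArrayInputStream(bytes))
      .asIterator

  def main(args: Array[String]): Unit =
    println(deserialize(serialize(Iterator(1, 2, 3))).toList)  // List(1, 2, 3)
}
```

Dropping the `serializer` parameter removes the temptation to mix serializers between writing and reading a block, which only works when both sides happen to agree.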
spark git commit: [SPARK-12057][SQL] Prevent failure on corrupt JSON records
Repository: spark
Updated Branches: refs/heads/branch-1.6 4ad08035d -> d509194b8

[SPARK-12057][SQL] Prevent failure on corrupt JSON records

This PR makes the JSON parser and schema inference handle more cases where we have unparsed records. It is based on #10043. The last commit fixes the failed test and updates the logic of schema inference.

Regarding the schema inference change, if we have something like
```
{"f1":1}
[1,2,3]
```
originally, we will get a DF without any column. After this change, we will get a DF with columns `f1` and `_corrupt_record`. Basically, for the second row, `[1,2,3]` will be the value of `_corrupt_record`.

When merging this PR, please make sure that the author is simplyianm.

JIRA: https://issues.apache.org/jira/browse/SPARK-12057

Closes #10043

Author: Ian Macalinao
Author: Yin Huai

Closes #10288 from yhuai/handleCorruptJson.

(cherry picked from commit 9d66c4216ad830812848c657bbcd8cd50949e199)
Signed-off-by: Reynold Xin

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d509194b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d509194b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d509194b

Branch: refs/heads/branch-1.6
Commit: d509194b81abc3c7bf9563d26560d596e1415627
Parents: 4ad0803
Author: Yin Huai
Authored: Wed Dec 16 23:18:53 2015 -0800
Committer: Reynold Xin
Committed: Wed Dec 16 23:19:06 2015 -0800

--
 .../datasources/json/InferSchema.scala          | 37 +---
 .../datasources/json/JacksonParser.scala        | 19 ++
 .../execution/datasources/json/JsonSuite.scala  | 37
 .../datasources/json/TestJsonData.scala         |  9 -
 4 files changed, 90 insertions(+), 12 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/d509194b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala
index 922fd5b..59ba4ae 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala
@@ -61,7 +61,10 @@ private[json] object InferSchema {
           StructType(Seq(StructField(columnNameOfCorruptRecords, StringType)))
         }
       }
-    }.treeAggregate[DataType](StructType(Seq()))(compatibleRootType, compatibleRootType)
+    }.treeAggregate[DataType](
+      StructType(Seq()))(
+      compatibleRootType(columnNameOfCorruptRecords),
+      compatibleRootType(columnNameOfCorruptRecords))

     canonicalizeType(rootType) match {
       case Some(st: StructType) => st
@@ -170,12 +173,38 @@ private[json] object InferSchema {
     case other => Some(other)
   }

+  private def withCorruptField(
+      struct: StructType,
+      columnNameOfCorruptRecords: String): StructType = {
+    if (!struct.fieldNames.contains(columnNameOfCorruptRecords)) {
+      // If this given struct does not have a column used for corrupt records,
+      // add this field.
+      struct.add(columnNameOfCorruptRecords, StringType, nullable = true)
+    } else {
+      // Otherwise, just return this struct.
+      struct
+    }
+  }
+
   /**
    * Remove top-level ArrayType wrappers and merge the remaining schemas
    */
-  private def compatibleRootType: (DataType, DataType) => DataType = {
-    case (ArrayType(ty1, _), ty2) => compatibleRootType(ty1, ty2)
-    case (ty1, ArrayType(ty2, _)) => compatibleRootType(ty1, ty2)
+  private def compatibleRootType(
+      columnNameOfCorruptRecords: String): (DataType, DataType) => DataType = {
+    // Since we support array of json objects at the top level,
+    // we need to check the element type and find the root level data type.
+    case (ArrayType(ty1, _), ty2) => compatibleRootType(columnNameOfCorruptRecords)(ty1, ty2)
+    case (ty1, ArrayType(ty2, _)) => compatibleRootType(columnNameOfCorruptRecords)(ty1, ty2)
+    // If we see any other data type at the root level, we get records that cannot be
+    // parsed. So, we use the struct as the data type and add the corrupt field to the schema.
+    case (struct: StructType, NullType) => struct
+    case (NullType, struct: StructType) => struct
+    case (struct: StructType, o) if !o.isInstanceOf[StructType] =>
+      withCorruptField(struct, columnNameOfCorruptRecords)
+    case (o, struct: StructType) if !o.isInstanceOf[StructType] =>
+      withCorruptField(struct,
spark git commit: [SPARK-12057][SQL] Prevent failure on corrupt JSON records
Repository: spark
Updated Branches: refs/heads/master 437583f69 -> 9d66c4216

[SPARK-12057][SQL] Prevent failure on corrupt JSON records

This PR makes the JSON parser and schema inference handle more cases where we have unparsed records. It is based on #10043. The last commit fixes the failed test and updates the logic of schema inference.

Regarding the schema inference change, if we have something like
```
{"f1":1}
[1,2,3]
```
originally, we will get a DF without any column. After this change, we will get a DF with columns `f1` and `_corrupt_record`. Basically, for the second row, `[1,2,3]` will be the value of `_corrupt_record`.

When merging this PR, please make sure that the author is simplyianm.

JIRA: https://issues.apache.org/jira/browse/SPARK-12057

Closes #10043

Author: Ian Macalinao
Author: Yin Huai

Closes #10288 from yhuai/handleCorruptJson.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9d66c421
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9d66c421
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9d66c421

Branch: refs/heads/master
Commit: 9d66c4216ad830812848c657bbcd8cd50949e199
Parents: 437583f
Author: Yin Huai
Authored: Wed Dec 16 23:18:53 2015 -0800
Committer: Reynold Xin
Committed: Wed Dec 16 23:18:53 2015 -0800

--
 .../datasources/json/InferSchema.scala          | 37 +---
 .../datasources/json/JacksonParser.scala        | 19 ++
 .../execution/datasources/json/JsonSuite.scala  | 37
 .../datasources/json/TestJsonData.scala         |  9 -
 4 files changed, 90 insertions(+), 12 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/9d66c421/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala
index 922fd5b..59ba4ae 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala
@@ -61,7 +61,10 @@ private[json] object InferSchema {
           StructType(Seq(StructField(columnNameOfCorruptRecords, StringType)))
         }
       }
-    }.treeAggregate[DataType](StructType(Seq()))(compatibleRootType, compatibleRootType)
+    }.treeAggregate[DataType](
+      StructType(Seq()))(
+      compatibleRootType(columnNameOfCorruptRecords),
+      compatibleRootType(columnNameOfCorruptRecords))

     canonicalizeType(rootType) match {
       case Some(st: StructType) => st
@@ -170,12 +173,38 @@ private[json] object InferSchema {
     case other => Some(other)
   }

+  private def withCorruptField(
+      struct: StructType,
+      columnNameOfCorruptRecords: String): StructType = {
+    if (!struct.fieldNames.contains(columnNameOfCorruptRecords)) {
+      // If this given struct does not have a column used for corrupt records,
+      // add this field.
+      struct.add(columnNameOfCorruptRecords, StringType, nullable = true)
+    } else {
+      // Otherwise, just return this struct.
+      struct
+    }
+  }
+
   /**
    * Remove top-level ArrayType wrappers and merge the remaining schemas
    */
-  private def compatibleRootType: (DataType, DataType) => DataType = {
-    case (ArrayType(ty1, _), ty2) => compatibleRootType(ty1, ty2)
-    case (ty1, ArrayType(ty2, _)) => compatibleRootType(ty1, ty2)
+  private def compatibleRootType(
+      columnNameOfCorruptRecords: String): (DataType, DataType) => DataType = {
+    // Since we support array of json objects at the top level,
+    // we need to check the element type and find the root level data type.
+    case (ArrayType(ty1, _), ty2) => compatibleRootType(columnNameOfCorruptRecords)(ty1, ty2)
+    case (ty1, ArrayType(ty2, _)) => compatibleRootType(columnNameOfCorruptRecords)(ty1, ty2)
+    // If we see any other data type at the root level, we get records that cannot be
+    // parsed. So, we use the struct as the data type and add the corrupt field to the schema.
+    case (struct: StructType, NullType) => struct
+    case (NullType, struct: StructType) => struct
+    case (struct: StructType, o) if !o.isInstanceOf[StructType] =>
+      withCorruptField(struct, columnNameOfCorruptRecords)
+    case (o, struct: StructType) if !o.isInstanceOf[StructType] =>
+      withCorruptField(struct, columnNameOfCorruptRecords)
+    // If we get anything else, we call compatibleType.
+    // Usually, when we reach here, ty1 and ty2 are two
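To see the new behavior end to end, a sketch against the public JSON reader (the local-mode setup and the sample rows are illustrative):

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CorruptJsonDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("corrupt-json").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // One parsable object and one top-level array, as in the example above.
    val rows = sc.parallelize(Seq("""{"f1":1}""", """[1,2,3]"""))
    val df = sqlContext.read.json(rows)

    df.printSchema()  // expected: both `f1` and `_corrupt_record`
    df.show()         // "[1,2,3]" should land in `_corrupt_record`
    sc.stop()
  }
}
```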
spark git commit: [SPARK-11904][PYSPARK] reduceByKeyAndWindow does not require checkpointing when invFunc is None
Repository: spark
Updated Branches: refs/heads/master 97678edea -> 437583f69

[SPARK-11904][PYSPARK] reduceByKeyAndWindow does not require checkpointing when invFunc is None

When invFunc is None, `reduceByKeyAndWindow(func, None, winsize, slidesize)` is equivalent to `reduceByKey(func).window(winsize, slidesize).reduceByKey(func)` and no checkpoint is necessary. The corresponding Scala code does exactly that, but the Python code always creates a windowed stream with obligatory checkpointing. The patch fixes this.

I do not know how to unit-test this.

Author: David Tolpin

Closes #9888 from dtolpin/master.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/437583f6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/437583f6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/437583f6

Branch: refs/heads/master
Commit: 437583f692e30b8dc03b339a34e92595d7b992ba
Parents: 97678ed
Author: David Tolpin
Authored: Wed Dec 16 22:10:24 2015 -0800
Committer: Shixiong Zhu
Committed: Wed Dec 16 22:10:24 2015 -0800

--
 python/pyspark/streaming/dstream.py | 45
 1 file changed, 23 insertions(+), 22 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/437583f6/python/pyspark/streaming/dstream.py
--
diff --git a/python/pyspark/streaming/dstream.py b/python/pyspark/streaming/dstream.py
index f61137c..b994a53 100644
--- a/python/pyspark/streaming/dstream.py
+++ b/python/pyspark/streaming/dstream.py
@@ -542,31 +542,32 @@ class DStream(object):

         reduced = self.reduceByKey(func, numPartitions)

-        def reduceFunc(t, a, b):
-            b = b.reduceByKey(func, numPartitions)
-            r = a.union(b).reduceByKey(func, numPartitions) if a else b
-            if filterFunc:
-                r = r.filter(filterFunc)
-            return r
-
-        def invReduceFunc(t, a, b):
-            b = b.reduceByKey(func, numPartitions)
-            joined = a.leftOuterJoin(b, numPartitions)
-            return joined.mapValues(lambda kv: invFunc(kv[0], kv[1])
-                                    if kv[1] is not None else kv[0])
-
-        jreduceFunc = TransformFunction(self._sc, reduceFunc, reduced._jrdd_deserializer)
         if invFunc:
+            def reduceFunc(t, a, b):
+                b = b.reduceByKey(func, numPartitions)
+                r = a.union(b).reduceByKey(func, numPartitions) if a else b
+                if filterFunc:
+                    r = r.filter(filterFunc)
+                return r
+
+            def invReduceFunc(t, a, b):
+                b = b.reduceByKey(func, numPartitions)
+                joined = a.leftOuterJoin(b, numPartitions)
+                return joined.mapValues(lambda kv: invFunc(kv[0], kv[1])
+                                        if kv[1] is not None else kv[0])
+
+            jreduceFunc = TransformFunction(self._sc, reduceFunc, reduced._jrdd_deserializer)
             jinvReduceFunc = TransformFunction(self._sc, invReduceFunc, reduced._jrdd_deserializer)
+            if slideDuration is None:
+                slideDuration = self._slideDuration
+            dstream = self._sc._jvm.PythonReducedWindowedDStream(
+                reduced._jdstream.dstream(),
+                jreduceFunc, jinvReduceFunc,
+                self._ssc._jduration(windowDuration),
+                self._ssc._jduration(slideDuration))
+            return DStream(dstream.asJavaDStream(), self._ssc, self._sc.serializer)
         else:
-            jinvReduceFunc = None
-        if slideDuration is None:
-            slideDuration = self._slideDuration
-        dstream = self._sc._jvm.PythonReducedWindowedDStream(reduced._jdstream.dstream(),
-                                                             jreduceFunc, jinvReduceFunc,
-                                                             self._ssc._jduration(windowDuration),
-                                                             self._ssc._jduration(slideDuration))
-        return DStream(dstream.asJavaDStream(), self._ssc, self._sc.serializer)
+            return reduced.window(windowDuration, slideDuration).reduceByKey(func, numPartitions)

     def updateStateByKey(self, updateFunc, numPartitions=None, initialRDD=None):
         """
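The Scala-side equivalence the patch mirrors can be sketched as follows (the socket source, durations, and checkpoint directory are illustrative):

```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("window-sketch").setMaster("local[2]"), Seconds(1))
    val pairs = ssc.socketTextStream("localhost", 9999).map(word => (word, 1))

    // With an inverse function, windows are maintained incrementally and
    // checkpointing is mandatory:
    ssc.checkpoint("/tmp/window-ckpt")
    val incremental =
      pairs.reduceByKeyAndWindow(_ + _, _ - _, Seconds(30), Seconds(10))

    // Without one, each window is simply recomputed from scratch -- the shape
    // the Python path now produces, with no checkpoint requirement:
    val recomputed =
      pairs.reduceByKey(_ + _).window(Seconds(30), Seconds(10)).reduceByKey(_ + _)

    incremental.print()
    recomputed.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```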
spark git commit: [SPARK-10248][CORE] track exceptions in dagscheduler event loop in tests
Repository: spark
Updated Branches: refs/heads/branch-1.6 638b89bc3 -> fb02e4e3b

[SPARK-10248][CORE] track exceptions in dagscheduler event loop in tests

`DAGSchedulerEventLoop` normally only logs errors (so it can continue to process more events, from other jobs). However, this is not desirable in the tests -- the tests should be able to easily detect any exception, and also shouldn't silently succeed if there is an exception.

This was suggested by mateiz on https://github.com/apache/spark/pull/7699. It may have already turned up an issue in "zero split job".

Author: Imran Rashid

Closes #8466 from squito/SPARK-10248.

(cherry picked from commit 38d9795a4fa07086d65ff705ce86648345618736)
Signed-off-by: Andrew Or

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fb02e4e3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fb02e4e3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fb02e4e3

Branch: refs/heads/branch-1.6
Commit: fb02e4e3bcc50a8f823dfecdb2eef71287225e7b
Parents: 638b89b
Author: Imran Rashid
Authored: Wed Dec 16 19:01:05 2015 -0800
Committer: Andrew Or
Committed: Wed Dec 16 19:01:13 2015 -0800

--
 .../apache/spark/scheduler/DAGScheduler.scala  |  5 ++--
 .../spark/scheduler/DAGSchedulerSuite.scala    | 28 ++--
 2 files changed, 29 insertions(+), 4 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/fb02e4e3/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
--
diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
index e01a960..b805bde 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
@@ -802,7 +802,8 @@ class DAGScheduler(
   private[scheduler] def cleanUpAfterSchedulerStop() {
     for (job <- activeJobs) {
-      val error = new SparkException("Job cancelled because SparkContext was shut down")
+      val error =
+        new SparkException(s"Job ${job.jobId} cancelled because SparkContext was shut down")
       job.listener.jobFailed(error)
       // Tell the listeners that all of the running stages have ended.  Don't bother
       // cancelling the stages because if the DAG scheduler is stopped, the entire application
@@ -1291,7 +1292,7 @@ class DAGScheduler(
     case TaskResultLost =>
       // Do nothing here; the TaskScheduler handles these failures and resubmits the task.

-    case other =>
+    case _: ExecutorLostFailure | TaskKilled | UnknownReason =>
       // Unrecognized failure - also do nothing. If the task fails repeatedly, the TaskScheduler
       // will abort the job.
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/fb02e4e3/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
--
diff --git a/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
index 653d41f..2869f0f 100644
--- a/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
+++ b/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
@@ -45,6 +45,13 @@ class DAGSchedulerEventProcessLoopTester(dagScheduler: DAGScheduler)
       case NonFatal(e) => onError(e)
     }
   }
+
+  override def onError(e: Throwable): Unit = {
+    logError("Error in DAGSchedulerEventLoop: ", e)
+    dagScheduler.stop()
+    throw e
+  }
+
 }

 /**
@@ -300,13 +307,18 @@ class DAGSchedulerSuite

   test("zero split job") {
     var numResults = 0
+    var failureReason: Option[Exception] = None
     val fakeListener = new JobListener() {
-      override def taskSucceeded(partition: Int, value: Any) = numResults += 1
-      override def jobFailed(exception: Exception) = throw exception
+      override def taskSucceeded(partition: Int, value: Any): Unit = numResults += 1
+      override def jobFailed(exception: Exception): Unit = {
+        failureReason = Some(exception)
+      }
     }
     val jobId = submit(new MyRDD(sc, 0, Nil), Array(), listener = fakeListener)
     assert(numResults === 0)
     cancel(jobId)
+    assert(failureReason.isDefined)
+    assert(failureReason.get.getMessage() === "Job 0 cancelled ")
   }

   test("run trivial job") {
@@ -1675,6 +1687,18 @@ class DAGSchedulerSuite
     assert(stackTraceString.contains("org.scalatest.FunSuite"))
   }

+  test("catch errors in event loop") {
+    // this is a test of our testing framework -- make sure errors
spark git commit: [SPARK-10248][CORE] track exceptions in dagscheduler event loop in tests
Repository: spark
Updated Branches: refs/heads/master ce5fd4008 -> 38d9795a4

[SPARK-10248][CORE] track exceptions in dagscheduler event loop in tests

`DAGSchedulerEventLoop` normally only logs errors (so it can continue to process more events, from other jobs). However, this is not desirable in the tests -- the tests should be able to easily detect any exception, and also shouldn't silently succeed if there is an exception.

This was suggested by mateiz on https://github.com/apache/spark/pull/7699. It may have already turned up an issue in "zero split job".

Author: Imran Rashid

Closes #8466 from squito/SPARK-10248.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/38d9795a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/38d9795a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/38d9795a

Branch: refs/heads/master
Commit: 38d9795a4fa07086d65ff705ce86648345618736
Parents: ce5fd40
Author: Imran Rashid
Authored: Wed Dec 16 19:01:05 2015 -0800
Committer: Andrew Or
Committed: Wed Dec 16 19:01:05 2015 -0800

--
 .../apache/spark/scheduler/DAGScheduler.scala  |  5 ++--
 .../spark/scheduler/DAGSchedulerSuite.scala    | 28 ++--
 2 files changed, 29 insertions(+), 4 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/38d9795a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
--
diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
index 8d0e0c8..b128ed5 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
@@ -805,7 +805,8 @@ class DAGScheduler(
   private[scheduler] def cleanUpAfterSchedulerStop() {
     for (job <- activeJobs) {
-      val error = new SparkException("Job cancelled because SparkContext was shut down")
+      val error =
+        new SparkException(s"Job ${job.jobId} cancelled because SparkContext was shut down")
       job.listener.jobFailed(error)
       // Tell the listeners that all of the running stages have ended.  Don't bother
       // cancelling the stages because if the DAG scheduler is stopped, the entire application
@@ -1295,7 +1296,7 @@ class DAGScheduler(
     case TaskResultLost =>
       // Do nothing here; the TaskScheduler handles these failures and resubmits the task.

-    case other =>
+    case _: ExecutorLostFailure | TaskKilled | UnknownReason =>
       // Unrecognized failure - also do nothing. If the task fails repeatedly, the TaskScheduler
       // will abort the job.
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/38d9795a/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
--
diff --git a/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
index 653d41f..2869f0f 100644
--- a/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
+++ b/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
@@ -45,6 +45,13 @@ class DAGSchedulerEventProcessLoopTester(dagScheduler: DAGScheduler)
       case NonFatal(e) => onError(e)
     }
   }
+
+  override def onError(e: Throwable): Unit = {
+    logError("Error in DAGSchedulerEventLoop: ", e)
+    dagScheduler.stop()
+    throw e
+  }
+
 }

 /**
@@ -300,13 +307,18 @@ class DAGSchedulerSuite

   test("zero split job") {
     var numResults = 0
+    var failureReason: Option[Exception] = None
     val fakeListener = new JobListener() {
-      override def taskSucceeded(partition: Int, value: Any) = numResults += 1
-      override def jobFailed(exception: Exception) = throw exception
+      override def taskSucceeded(partition: Int, value: Any): Unit = numResults += 1
+      override def jobFailed(exception: Exception): Unit = {
+        failureReason = Some(exception)
+      }
     }
     val jobId = submit(new MyRDD(sc, 0, Nil), Array(), listener = fakeListener)
     assert(numResults === 0)
     cancel(jobId)
+    assert(failureReason.isDefined)
+    assert(failureReason.get.getMessage() === "Job 0 cancelled ")
   }

   test("run trivial job") {
@@ -1675,6 +1687,18 @@ class DAGSchedulerSuite
     assert(stackTraceString.contains("org.scalatest.FunSuite"))
   }

+  test("catch errors in event loop") {
+    // this is a test of our testing framework -- make sure errors in event loop don't get ignored
+
+    // just run some bad event that will throw an exception -- we'll give a null
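The pattern generalizes beyond the scheduler; a minimal self-contained sketch (the class names are illustrative, not the Spark classes):

```
import scala.util.control.NonFatal

// A synchronous stand-in for an event loop: production code logs in onError
// and keeps processing, while a test subclass records and rethrows so a
// failing event can never pass silently.
abstract class MiniEventLoop[E] {
  protected def onReceive(event: E): Unit
  protected def onError(e: Throwable): Unit
  def post(event: E): Unit =
    try onReceive(event) catch { case NonFatal(e) => onError(e) }
}

class TestLoop extends MiniEventLoop[String] {
  var lastError: Option[Throwable] = None
  override protected def onReceive(event: String): Unit =
    if (event == "bad") throw new IllegalStateException("boom")
  override protected def onError(e: Throwable): Unit = {
    lastError = Some(e)
    throw e  // surface the failure to the calling test
  }
}
```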
spark git commit: [SPARK-12386][CORE] Fix NPE when spark.executor.port is set.
Repository: spark
Updated Branches: refs/heads/branch-1.6 154567dca -> 4ad08035d

[SPARK-12386][CORE] Fix NPE when spark.executor.port is set.

Author: Marcelo Vanzin

Closes #10339 from vanzin/SPARK-12386.

(cherry picked from commit d1508dd9b765489913bc948575a69ebab82f217b)
Signed-off-by: Andrew Or

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4ad08035
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4ad08035
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4ad08035

Branch: refs/heads/branch-1.6
Commit: 4ad08035d28b8f103132da9779340c5e64e2d1c2
Parents: 154567d
Author: Marcelo Vanzin
Authored: Wed Dec 16 19:47:49 2015 -0800
Committer: Andrew Or
Committed: Wed Dec 16 19:47:57 2015 -0800

--
 core/src/main/scala/org/apache/spark/SparkEnv.scala | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/4ad08035/core/src/main/scala/org/apache/spark/SparkEnv.scala
--
diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala b/core/src/main/scala/org/apache/spark/SparkEnv.scala
index 84230e3..52acde1 100644
--- a/core/src/main/scala/org/apache/spark/SparkEnv.scala
+++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala
@@ -256,7 +256,12 @@ object SparkEnv extends Logging {
       if (rpcEnv.isInstanceOf[AkkaRpcEnv]) {
         rpcEnv.asInstanceOf[AkkaRpcEnv].actorSystem
       } else {
-        val actorSystemPort = if (port == 0) 0 else rpcEnv.address.port + 1
+        val actorSystemPort =
+          if (port == 0 || rpcEnv.address == null) {
+            port
+          } else {
+            rpcEnv.address.port + 1
+          }
         // Create a ActorSystem for legacy codes
         AkkaUtils.createActorSystem(
           actorSystemName + "ActorSystem",
spark git commit: [SPARK-12386][CORE] Fix NPE when spark.executor.port is set.
Repository: spark
Updated Branches: refs/heads/master fdb382275 -> d1508dd9b

[SPARK-12386][CORE] Fix NPE when spark.executor.port is set.

Author: Marcelo Vanzin

Closes #10339 from vanzin/SPARK-12386.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d1508dd9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d1508dd9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d1508dd9

Branch: refs/heads/master
Commit: d1508dd9b765489913bc948575a69ebab82f217b
Parents: fdb3822
Author: Marcelo Vanzin
Authored: Wed Dec 16 19:47:49 2015 -0800
Committer: Andrew Or
Committed: Wed Dec 16 19:47:49 2015 -0800

--
 core/src/main/scala/org/apache/spark/SparkEnv.scala | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/d1508dd9/core/src/main/scala/org/apache/spark/SparkEnv.scala
--
diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala b/core/src/main/scala/org/apache/spark/SparkEnv.scala
index 84230e3..52acde1 100644
--- a/core/src/main/scala/org/apache/spark/SparkEnv.scala
+++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala
@@ -256,7 +256,12 @@ object SparkEnv extends Logging {
       if (rpcEnv.isInstanceOf[AkkaRpcEnv]) {
         rpcEnv.asInstanceOf[AkkaRpcEnv].actorSystem
       } else {
-        val actorSystemPort = if (port == 0) 0 else rpcEnv.address.port + 1
+        val actorSystemPort =
+          if (port == 0 || rpcEnv.address == null) {
+            port
+          } else {
+            rpcEnv.address.port + 1
+          }
         // Create a ActorSystem for legacy codes
         AkkaUtils.createActorSystem(
           actorSystemName + "ActorSystem",
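The guard reads naturally as a small total function (a sketch; the helper name is made up):

```
object PortGuard {
  // Fall back to the requested port when the RPC env exposes no bound address,
  // which is the executor-side case that previously dereferenced null.
  def legacyActorSystemPort(requestedPort: Int, boundPort: Option[Int]): Int =
    boundPort match {
      case Some(p) if requestedPort != 0 => p + 1
      case _ => requestedPort
    }

  def main(args: Array[String]): Unit = {
    println(legacyActorSystemPort(0, Some(45001)))     // 0
    println(legacyActorSystemPort(45000, None))        // 45000 (the NPE case before the fix)
    println(legacyActorSystemPort(45000, Some(45000))) // 45001
  }
}
```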
spark git commit: MAINTENANCE: Automated closing of pull requests.
Repository: spark
Updated Branches: refs/heads/master 861549acd -> ce5fd4008

MAINTENANCE: Automated closing of pull requests.

This commit exists to close the following pull requests on Github:

Closes #1217 (requested by ankurdave, srowen)
Closes #4650 (requested by andrewor14)
Closes #5307 (requested by vanzin)
Closes #5664 (requested by andrewor14)
Closes #5713 (requested by marmbrus)
Closes #5722 (requested by andrewor14)
Closes #6685 (requested by srowen)
Closes #7074 (requested by srowen)
Closes #7119 (requested by andrewor14)
Closes #7997 (requested by jkbradley)
Closes #8292 (requested by srowen)
Closes #8975 (requested by andrewor14, vanzin)
Closes #8980 (requested by andrewor14, davies)

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ce5fd400
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ce5fd400
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ce5fd400

Branch: refs/heads/master
Commit: ce5fd4008e890ef8ebc2d3cb703a666783ad6c02
Parents: 861549a
Author: Andrew Or
Authored: Wed Dec 16 17:05:57 2015 -0800
Committer: Andrew Or
Committed: Wed Dec 16 17:05:57 2015 -0800
[2/2] spark git commit: Revert "[SPARK-12105] [SQL] add convenient show functions"
Revert "[SPARK-12105] [SQL] add convenient show functions" This reverts commit 31b391019ff6eb5a483f4b3e62fd082de7ff8416. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1a3d0cd9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1a3d0cd9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1a3d0cd9 Branch: refs/heads/master Commit: 1a3d0cd9f013aee1f03b1c632c91ae0951bccbb0 Parents: 18ea11c Author: Reynold XinAuthored: Wed Dec 16 00:57:34 2015 -0800 Committer: Reynold Xin Committed: Wed Dec 16 00:57:34 2015 -0800 -- .../scala/org/apache/spark/sql/DataFrame.scala | 25 +++- 1 file changed, 9 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1a3d0cd9/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala index b69d441..497bd48 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala @@ -161,23 +161,16 @@ class DataFrame private[sql]( } /** -* Compose the string representing rows for output -*/ - def showString(): String = { -showString(20) - } - - /** * Compose the string representing rows for output - * @param numRows Number of rows to show + * @param _numRows Number of rows to show * @param truncate Whether truncate long strings and align cells right */ - def showString(numRows: Int, truncate: Boolean = true): String = { -val _numRows = numRows.max(0) + private[sql] def showString(_numRows: Int, truncate: Boolean = true): String = { +val numRows = _numRows.max(0) val sb = new StringBuilder -val takeResult = take(_numRows + 1) -val hasMoreData = takeResult.length > _numRows -val data = takeResult.take(_numRows) +val takeResult = take(numRows + 1) +val hasMoreData = takeResult.length > numRows +val data = takeResult.take(numRows) val numCols = schema.fieldNames.length // For array values, replace Seq and Array with square brackets @@ -231,10 +224,10 @@ class DataFrame private[sql]( sb.append(sep) -// For Data that has more than "_numRows" records +// For Data that has more than "numRows" records if (hasMoreData) { - val rowsString = if (_numRows == 1) "row" else "rows" - sb.append(s"only showing top $_numRows $rowsString\n") + val rowsString = if (numRows == 1) "row" else "rows" + sb.append(s"only showing top $numRows $rowsString\n") } sb.toString() - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[1/2] spark git commit: Revert "[HOTFIX] Compile error from commit 31b3910"
Repository: spark
Updated Branches: refs/heads/master 554d840a9 -> 1a3d0cd9f

Revert "[HOTFIX] Compile error from commit 31b3910"

This reverts commit 840bd2e008da5b22bfa73c587ea2c57666fffc60.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/18ea11c3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/18ea11c3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/18ea11c3

Branch: refs/heads/master
Commit: 18ea11c3a84e5eafd81fa0fe7c09224e79c4e93f
Parents: 554d840
Author: Reynold Xin
Authored: Wed Dec 16 00:57:07 2015 -0800
Committer: Reynold Xin
Committed: Wed Dec 16 00:57:07 2015 -0800

--
 sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/18ea11c3/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
index 33b03be..b69d441 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
@@ -234,7 +234,7 @@ class DataFrame private[sql](
     // For Data that has more than "_numRows" records
     if (hasMoreData) {
       val rowsString = if (_numRows == 1) "row" else "rows"
-      sb.append(s"only showing top ${_numRows} $rowsString\n")
+      sb.append(s"only showing top $_numRows $rowsString\n")
     }

     sb.toString()
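For context on the hotfix being unwound here: in a Scala `s`-interpolator, an identifier that begins with an underscore cannot follow `$` directly and has to be braced, which is what the hotfix had patched before the whole feature was reverted. A sketch of the pitfall, using `_numRows` as in the reverted code:

```
object InterpolationPitfall {
  def main(args: Array[String]): Unit = {
    val _numRows = 5
    // println(s"only showing top $_numRows rows")  // does not compile: invalid string interpolation
    println(s"only showing top ${_numRows} rows")   // prints: only showing top 5 rows
  }
}
```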