spark git commit: Copy pyspark and SparkR packages to latest release dir too
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 e8f351f9a -> 2c88e1dc3

Copy pyspark and SparkR packages to latest release dir too

## What changes were proposed in this pull request?

Copy pyspark and SparkR packages to latest release dir, as per comment [here](https://github.com/apache/spark/pull/16226#discussion_r91664822)

Author: Felix Cheung

Closes #16227 from felixcheung/pyrftp.

(cherry picked from commit c074c96dc57bf18b28fafdcac0c768d75c642cba)
Signed-off-by: Shivaram Venkataraman

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2c88e1dc
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2c88e1dc
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2c88e1dc

Branch: refs/heads/branch-2.1
Commit: 2c88e1dc31e1b90605ad8ab85b20b131b4b3c722
Parents: e8f351f
Author: Felix Cheung
Authored: Thu Dec 8 22:52:34 2016 -0800
Committer: Shivaram Venkataraman
Committed: Thu Dec 8 22:53:02 2016 -0800

----------------------------------------------------------------------
 dev/create-release/release-build.sh | 2 ++
 1 file changed, 2 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/2c88e1dc/dev/create-release/release-build.sh
----------------------------------------------------------------------
diff --git a/dev/create-release/release-build.sh b/dev/create-release/release-build.sh
index 7c77791..c0663b8 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -251,6 +251,8 @@ if [[ "$1" == "package" ]]; then
   # Put to new directory:
   LFTP mkdir -p $dest_dir
   LFTP mput -O $dest_dir 'spark-*'
+  LFTP mput -O $dest_dir 'pyspark-*'
+  LFTP mput -O $dest_dir 'SparkR-*'
   # Delete /latest directory and rename new upload to /latest
   LFTP "rm -r -f $REMOTE_PARENT_DIR/latest || exit 0"
   LFTP mv $dest_dir "$REMOTE_PARENT_DIR/latest"
spark git commit: Copy pyspark and SparkR packages to latest release dir too
Repository: spark
Updated Branches:
  refs/heads/master 934035ae7 -> c074c96dc

Copy pyspark and SparkR packages to latest release dir too

## What changes were proposed in this pull request?

Copy pyspark and SparkR packages to latest release dir, as per comment [here](https://github.com/apache/spark/pull/16226#discussion_r91664822)

Author: Felix Cheung

Closes #16227 from felixcheung/pyrftp.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c074c96d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c074c96d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c074c96d

Branch: refs/heads/master
Commit: c074c96dc57bf18b28fafdcac0c768d75c642cba
Parents: 934035a
Author: Felix Cheung
Authored: Thu Dec 8 22:52:34 2016 -0800
Committer: Shivaram Venkataraman
Committed: Thu Dec 8 22:52:34 2016 -0800

----------------------------------------------------------------------
 dev/create-release/release-build.sh | 2 ++
 1 file changed, 2 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/c074c96d/dev/create-release/release-build.sh
----------------------------------------------------------------------
diff --git a/dev/create-release/release-build.sh b/dev/create-release/release-build.sh
index 7c77791..c0663b8 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -251,6 +251,8 @@ if [[ "$1" == "package" ]]; then
   # Put to new directory:
   LFTP mkdir -p $dest_dir
   LFTP mput -O $dest_dir 'spark-*'
+  LFTP mput -O $dest_dir 'pyspark-*'
+  LFTP mput -O $dest_dir 'SparkR-*'
   # Delete /latest directory and rename new upload to /latest
   LFTP "rm -r -f $REMOTE_PARENT_DIR/latest || exit 0"
   LFTP mv $dest_dir "$REMOTE_PARENT_DIR/latest"
spark git commit: Copy the SparkR source package with LFTP
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 4ceed95b4 -> e8f351f9a

Copy the SparkR source package with LFTP

This PR adds a line in release-build.sh to copy the SparkR source archive using LFTP

Author: Shivaram Venkataraman

Closes #16226 from shivaram/fix-sparkr-copy-build.

(cherry picked from commit 934035ae7cb648fe61665d8efe0b7aa2bbe4ca47)
Signed-off-by: Shivaram Venkataraman

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e8f351f9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e8f351f9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e8f351f9

Branch: refs/heads/branch-2.1
Commit: e8f351f9a670fc4d43f15c8d7cd57e49fb9ceba2
Parents: 4ceed95b
Author: Shivaram Venkataraman
Authored: Thu Dec 8 22:21:24 2016 -0800
Committer: Shivaram Venkataraman
Committed: Thu Dec 8 22:21:36 2016 -0800

----------------------------------------------------------------------
 dev/create-release/release-build.sh | 1 +
 1 file changed, 1 insertion(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/e8f351f9/dev/create-release/release-build.sh
----------------------------------------------------------------------
diff --git a/dev/create-release/release-build.sh b/dev/create-release/release-build.sh
index 1b05b20..7c77791 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -258,6 +258,7 @@ if [[ "$1" == "package" ]]; then
   LFTP mkdir -p $dest_dir
   LFTP mput -O $dest_dir 'spark-*'
   LFTP mput -O $dest_dir 'pyspark-*'
+  LFTP mput -O $dest_dir 'SparkR-*'
   exit 0
 fi
spark git commit: Copy the SparkR source package with LFTP
Repository: spark
Updated Branches:
  refs/heads/master 9338aa4f8 -> 934035ae7

Copy the SparkR source package with LFTP

This PR adds a line in release-build.sh to copy the SparkR source archive using LFTP

Author: Shivaram Venkataraman

Closes #16226 from shivaram/fix-sparkr-copy-build.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/934035ae
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/934035ae
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/934035ae

Branch: refs/heads/master
Commit: 934035ae7cb648fe61665d8efe0b7aa2bbe4ca47
Parents: 9338aa4
Author: Shivaram Venkataraman
Authored: Thu Dec 8 22:21:24 2016 -0800
Committer: Shivaram Venkataraman
Committed: Thu Dec 8 22:21:24 2016 -0800

----------------------------------------------------------------------
 dev/create-release/release-build.sh | 1 +
 1 file changed, 1 insertion(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/934035ae/dev/create-release/release-build.sh
----------------------------------------------------------------------
diff --git a/dev/create-release/release-build.sh b/dev/create-release/release-build.sh
index 1b05b20..7c77791 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -258,6 +258,7 @@ if [[ "$1" == "package" ]]; then
   LFTP mkdir -p $dest_dir
   LFTP mput -O $dest_dir 'spark-*'
   LFTP mput -O $dest_dir 'pyspark-*'
+  LFTP mput -O $dest_dir 'SparkR-*'
   exit 0
 fi
spark git commit: [SPARK-18697][BUILD] Upgrade sbt plugins
Repository: spark Updated Branches: refs/heads/master 86a96034c -> 9338aa4f8 [SPARK-18697][BUILD] Upgrade sbt plugins ## What changes were proposed in this pull request? This PR is to upgrade sbt plugins. The following sbt plugins will be upgraded: ``` sbteclipse-plugin: 4.0.0 -> 5.0.1 sbt-mima-plugin: 0.1.11 -> 0.1.12 org.ow2.asm/asm: 5.0.3 -> 5.1 org.ow2.asm/asm-commons: 5.0.3 -> 5.1 ``` ## How was this patch tested? Pass the Jenkins build. Author: Weiqing YangCloses #16223 from weiqingy/SPARK_18697. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9338aa4f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9338aa4f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9338aa4f Branch: refs/heads/master Commit: 9338aa4f89821c5640f7d007a67f9b947aa2bcd4 Parents: 86a9603 Author: Weiqing Yang Authored: Fri Dec 9 14:13:01 2016 +0800 Committer: Sean Owen Committed: Fri Dec 9 14:13:01 2016 +0800 -- project/plugins.sbt | 8 1 file changed, 4 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9338aa4f/project/plugins.sbt -- diff --git a/project/plugins.sbt b/project/plugins.sbt index 76597d2..84d1239 100644 --- a/project/plugins.sbt +++ b/project/plugins.sbt @@ -1,12 +1,12 @@ addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2") -addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0") +addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "5.0.1") addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.8.2") addSbtPlugin("org.scalastyle" %% "scalastyle-sbt-plugin" % "0.8.0") -addSbtPlugin("com.typesafe" % "sbt-mima-plugin" % "0.1.11") +addSbtPlugin("com.typesafe" % "sbt-mima-plugin" % "0.1.12") addSbtPlugin("com.alpinenow" % "junit_xml_listener" % "0.5.1") @@ -16,9 +16,9 @@ addSbtPlugin("com.cavorite" % "sbt-avro" % "0.3.2") addSbtPlugin("io.spray" % "sbt-revolver" % "0.8.0") -libraryDependencies += "org.ow2.asm" % "asm" % "5.0.3" +libraryDependencies += "org.ow2.asm" % "asm" % "5.1" -libraryDependencies += "org.ow2.asm" % "asm-commons" % "5.0.3" +libraryDependencies += "org.ow2.asm" % "asm-commons" % "5.1" addSbtPlugin("com.simplytyped" % "sbt-antlr4" % "0.7.11") - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-18349][SPARKR] Update R API documentation on ml model summary
Repository: spark Updated Branches: refs/heads/branch-2.1 ef5646b4c -> 4ceed95b4 [SPARK-18349][SPARKR] Update R API documentation on ml model summary ## What changes were proposed in this pull request? In this PR, the document of `summary` method is improved in the format: returns summary information of the fitted model, which is a list. The list includes ... Since `summary` in R is mainly about the model, which is not the same as `summary` object on scala side, if there is one, the scala API doc is not pointed here. In current document, some `return` have `.` and some don't have. `.` is added to missed ones. Since spark.logit `summary` has a big refactoring, this PR doesn't include this one. It will be changed when the `spark.logit` PR is merged. ## How was this patch tested? Manual build. Author: wm...@hotmail.comCloses #16150 from wangmiao1981/audit2. (cherry picked from commit 86a96034ccb47c5bba2cd739d793240afcfc25f6) Signed-off-by: Felix Cheung Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4ceed95b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4ceed95b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4ceed95b Branch: refs/heads/branch-2.1 Commit: 4ceed95b43d0cd9665004865095a40926efcc289 Parents: ef5646b Author: wm...@hotmail.com Authored: Thu Dec 8 22:08:19 2016 -0800 Committer: Felix Cheung Committed: Thu Dec 8 22:08:51 2016 -0800 -- R/pkg/R/mllib.R| 147 R/pkg/inst/tests/testthat/test_mllib.R | 2 + 2 files changed, 86 insertions(+), 63 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4ceed95b/R/pkg/R/mllib.R -- diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R index 632e4ad..5df843c 100644 --- a/R/pkg/R/mllib.R +++ b/R/pkg/R/mllib.R @@ -191,7 +191,7 @@ predict_internal <- function(object, newData) { #' @param regParam regularization parameter for L2 regularization. #' @param ... additional arguments passed to the method. #' @aliases spark.glm,SparkDataFrame,formula-method -#' @return \code{spark.glm} returns a fitted generalized linear model +#' @return \code{spark.glm} returns a fitted generalized linear model. #' @rdname spark.glm #' @name spark.glm #' @export @@ -277,12 +277,12 @@ setMethod("glm", signature(formula = "formula", family = "ANY", data = "SparkDat # Returns the summary of a model produced by glm() or spark.glm(), similarly to R's summary(). #' @param object a fitted generalized linear model. -#' @return \code{summary} returns a summary object of the fitted model, a list of components -#' including at least the coefficients matrix (which includes coefficients, standard error -#' of coefficients, t value and p value), null/residual deviance, null/residual degrees of -#' freedom, AIC and number of iterations IRLS takes. If there are collinear columns -#' in you data, the coefficients matrix only provides coefficients. -#' +#' @return \code{summary} returns summary information of the fitted model, which is a list. +#' The list of components includes at least the \code{coefficients} (coefficients matrix, which includes +#' coefficients, standard error of coefficients, t value and p value), +#' \code{null.deviance} (null/residual degrees of freedom), \code{aic} (AIC) +#' and \code{iter} (number of iterations IRLS takes). If there are collinear columns in the data, +#' the coefficients matrix only provides coefficients. 
#' @rdname spark.glm #' @export #' @note summary(GeneralizedLinearRegressionModel) since 2.0.0 @@ -328,7 +328,7 @@ setMethod("summary", signature(object = "GeneralizedLinearRegressionModel"), # Prints the summary of GeneralizedLinearRegressionModel #' @rdname spark.glm -#' @param x summary object of fitted generalized linear model returned by \code{summary} function +#' @param x summary object of fitted generalized linear model returned by \code{summary} function. #' @export #' @note print.summary.GeneralizedLinearRegressionModel since 2.0.0 print.summary.GeneralizedLinearRegressionModel <- function(x, ...) { @@ -361,7 +361,7 @@ print.summary.GeneralizedLinearRegressionModel <- function(x, ...) { #' @param newData a SparkDataFrame for testing. #' @return \code{predict} returns a SparkDataFrame containing predicted labels in a column named -#' "prediction" +#' "prediction". #' @rdname spark.glm #' @export #' @note predict(GeneralizedLinearRegressionModel) since 1.5.0 @@ -375,7 +375,7 @@ setMethod("predict", signature(object = "GeneralizedLinearRegressionModel"), #'
spark git commit: [SPARK-18349][SPARKR] Update R API documentation on ml model summary
Repository: spark Updated Branches: refs/heads/master 4ac8b20bf -> 86a96034c [SPARK-18349][SPARKR] Update R API documentation on ml model summary ## What changes were proposed in this pull request? In this PR, the document of `summary` method is improved in the format: returns summary information of the fitted model, which is a list. The list includes ... Since `summary` in R is mainly about the model, which is not the same as `summary` object on scala side, if there is one, the scala API doc is not pointed here. In current document, some `return` have `.` and some don't have. `.` is added to missed ones. Since spark.logit `summary` has a big refactoring, this PR doesn't include this one. It will be changed when the `spark.logit` PR is merged. ## How was this patch tested? Manual build. Author: wm...@hotmail.comCloses #16150 from wangmiao1981/audit2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/86a96034 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/86a96034 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/86a96034 Branch: refs/heads/master Commit: 86a96034ccb47c5bba2cd739d793240afcfc25f6 Parents: 4ac8b20 Author: wm...@hotmail.com Authored: Thu Dec 8 22:08:19 2016 -0800 Committer: Felix Cheung Committed: Thu Dec 8 22:08:19 2016 -0800 -- R/pkg/R/mllib.R| 147 R/pkg/inst/tests/testthat/test_mllib.R | 2 + 2 files changed, 86 insertions(+), 63 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/86a96034/R/pkg/R/mllib.R -- diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R index 632e4ad..5df843c 100644 --- a/R/pkg/R/mllib.R +++ b/R/pkg/R/mllib.R @@ -191,7 +191,7 @@ predict_internal <- function(object, newData) { #' @param regParam regularization parameter for L2 regularization. #' @param ... additional arguments passed to the method. #' @aliases spark.glm,SparkDataFrame,formula-method -#' @return \code{spark.glm} returns a fitted generalized linear model +#' @return \code{spark.glm} returns a fitted generalized linear model. #' @rdname spark.glm #' @name spark.glm #' @export @@ -277,12 +277,12 @@ setMethod("glm", signature(formula = "formula", family = "ANY", data = "SparkDat # Returns the summary of a model produced by glm() or spark.glm(), similarly to R's summary(). #' @param object a fitted generalized linear model. -#' @return \code{summary} returns a summary object of the fitted model, a list of components -#' including at least the coefficients matrix (which includes coefficients, standard error -#' of coefficients, t value and p value), null/residual deviance, null/residual degrees of -#' freedom, AIC and number of iterations IRLS takes. If there are collinear columns -#' in you data, the coefficients matrix only provides coefficients. -#' +#' @return \code{summary} returns summary information of the fitted model, which is a list. +#' The list of components includes at least the \code{coefficients} (coefficients matrix, which includes +#' coefficients, standard error of coefficients, t value and p value), +#' \code{null.deviance} (null/residual degrees of freedom), \code{aic} (AIC) +#' and \code{iter} (number of iterations IRLS takes). If there are collinear columns in the data, +#' the coefficients matrix only provides coefficients. 
#' @rdname spark.glm #' @export #' @note summary(GeneralizedLinearRegressionModel) since 2.0.0 @@ -328,7 +328,7 @@ setMethod("summary", signature(object = "GeneralizedLinearRegressionModel"), # Prints the summary of GeneralizedLinearRegressionModel #' @rdname spark.glm -#' @param x summary object of fitted generalized linear model returned by \code{summary} function +#' @param x summary object of fitted generalized linear model returned by \code{summary} function. #' @export #' @note print.summary.GeneralizedLinearRegressionModel since 2.0.0 print.summary.GeneralizedLinearRegressionModel <- function(x, ...) { @@ -361,7 +361,7 @@ print.summary.GeneralizedLinearRegressionModel <- function(x, ...) { #' @param newData a SparkDataFrame for testing. #' @return \code{predict} returns a SparkDataFrame containing predicted labels in a column named -#' "prediction" +#' "prediction". #' @rdname spark.glm #' @export #' @note predict(GeneralizedLinearRegressionModel) since 1.5.0 @@ -375,7 +375,7 @@ setMethod("predict", signature(object = "GeneralizedLinearRegressionModel"), #' @param newData a SparkDataFrame for testing. #' @return \code{predict} returns a SparkDataFrame containing predicted labeled in a
spark git commit: [SPARKR][PYSPARK] Fix R source package name to match Spark version. Remove pip tar.gz from distribution
Repository: spark Updated Branches: refs/heads/branch-2.1 1cafc76ea -> ef5646b4c [SPARKR][PYSPARK] Fix R source package name to match Spark version. Remove pip tar.gz from distribution ## What changes were proposed in this pull request? Fixes name of R source package so that the `cp` in release-build.sh works correctly. Issue discussed in https://github.com/apache/spark/pull/16014#issuecomment-265867125 Author: Shivaram VenkataramanCloses #16221 from shivaram/fix-sparkr-release-build-name. (cherry picked from commit 4ac8b20bf2f962d9b8b6b209468896758d49efe3) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ef5646b4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ef5646b4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ef5646b4 Branch: refs/heads/branch-2.1 Commit: ef5646b4c6792a96e85d1dd4bb3103ba8306949b Parents: 1cafc76 Author: Shivaram Venkataraman Authored: Thu Dec 8 18:26:54 2016 -0800 Committer: Shivaram Venkataraman Committed: Thu Dec 8 18:27:05 2016 -0800 -- dev/make-distribution.sh | 9 + 1 file changed, 9 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ef5646b4/dev/make-distribution.sh -- diff --git a/dev/make-distribution.sh b/dev/make-distribution.sh index fe281bb..4da7d57 100755 --- a/dev/make-distribution.sh +++ b/dev/make-distribution.sh @@ -222,11 +222,14 @@ fi # Make R package - this is used for both CRAN release and packing R layout into distribution if [ "$MAKE_R" == "true" ]; then echo "Building R source package" + R_PACKAGE_VERSION=`grep Version $SPARK_HOME/R/pkg/DESCRIPTION | awk '{print $NF}'` pushd "$SPARK_HOME/R" > /dev/null # Build source package and run full checks # Install source package to get it to generate vignettes, etc. # Do not source the check-cran.sh - it should be run from where it is for it to set SPARK_HOME NO_TESTS=1 CLEAN_INSTALL=1 "$SPARK_HOME/"R/check-cran.sh + # Make a copy of R source package matching the Spark release version. + cp $SPARK_HOME/R/SparkR_"$R_PACKAGE_VERSION".tar.gz $SPARK_HOME/R/SparkR_"$VERSION".tar.gz popd > /dev/null else echo "Skipping building R source package" @@ -238,6 +241,12 @@ cp "$SPARK_HOME"/conf/*.template "$DISTDIR"/conf cp "$SPARK_HOME/README.md" "$DISTDIR" cp -r "$SPARK_HOME/bin" "$DISTDIR" cp -r "$SPARK_HOME/python" "$DISTDIR" + +# Remove the python distribution from dist/ if we built it +if [ "$MAKE_PIP" == "true" ]; then + rm -f $DISTDIR/python/dist/pyspark-*.tar.gz +fi + cp -r "$SPARK_HOME/sbin" "$DISTDIR" # Copy SparkR if it exists if [ -d "$SPARK_HOME"/R/lib/SparkR ]; then - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARKR][PYSPARK] Fix R source package name to match Spark version. Remove pip tar.gz from distribution
Repository: spark Updated Branches: refs/heads/master 458fa3325 -> 4ac8b20bf [SPARKR][PYSPARK] Fix R source package name to match Spark version. Remove pip tar.gz from distribution ## What changes were proposed in this pull request? Fixes name of R source package so that the `cp` in release-build.sh works correctly. Issue discussed in https://github.com/apache/spark/pull/16014#issuecomment-265867125 Author: Shivaram VenkataramanCloses #16221 from shivaram/fix-sparkr-release-build-name. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4ac8b20b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4ac8b20b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4ac8b20b Branch: refs/heads/master Commit: 4ac8b20bf2f962d9b8b6b209468896758d49efe3 Parents: 458fa33 Author: Shivaram Venkataraman Authored: Thu Dec 8 18:26:54 2016 -0800 Committer: Shivaram Venkataraman Committed: Thu Dec 8 18:26:54 2016 -0800 -- dev/make-distribution.sh | 9 + 1 file changed, 9 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4ac8b20b/dev/make-distribution.sh -- diff --git a/dev/make-distribution.sh b/dev/make-distribution.sh index fe281bb..4da7d57 100755 --- a/dev/make-distribution.sh +++ b/dev/make-distribution.sh @@ -222,11 +222,14 @@ fi # Make R package - this is used for both CRAN release and packing R layout into distribution if [ "$MAKE_R" == "true" ]; then echo "Building R source package" + R_PACKAGE_VERSION=`grep Version $SPARK_HOME/R/pkg/DESCRIPTION | awk '{print $NF}'` pushd "$SPARK_HOME/R" > /dev/null # Build source package and run full checks # Install source package to get it to generate vignettes, etc. # Do not source the check-cran.sh - it should be run from where it is for it to set SPARK_HOME NO_TESTS=1 CLEAN_INSTALL=1 "$SPARK_HOME/"R/check-cran.sh + # Make a copy of R source package matching the Spark release version. + cp $SPARK_HOME/R/SparkR_"$R_PACKAGE_VERSION".tar.gz $SPARK_HOME/R/SparkR_"$VERSION".tar.gz popd > /dev/null else echo "Skipping building R source package" @@ -238,6 +241,12 @@ cp "$SPARK_HOME"/conf/*.template "$DISTDIR"/conf cp "$SPARK_HOME/README.md" "$DISTDIR" cp -r "$SPARK_HOME/bin" "$DISTDIR" cp -r "$SPARK_HOME/python" "$DISTDIR" + +# Remove the python distribution from dist/ if we built it +if [ "$MAKE_PIP" == "true" ]; then + rm -f $DISTDIR/python/dist/pyspark-*.tar.gz +fi + cp -r "$SPARK_HOME/sbin" "$DISTDIR" # Copy SparkR if it exists if [ -d "$SPARK_HOME"/R/lib/SparkR ]; then - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-18774][CORE][SQL] Ignore non-existing files when ignoreCorruptFiles is enabled (branch 2.1)
Repository: spark Updated Branches: refs/heads/branch-2.1 fcd22e538 -> 1cafc76ea [SPARK-18774][CORE][SQL] Ignore non-existing files when ignoreCorruptFiles is enabled (branch 2.1) ## What changes were proposed in this pull request? Backport #16203 to branch 2.1. ## How was this patch tested? Jennkins Author: Shixiong ZhuCloses #16216 from zsxwing/SPARK-18774-2.1. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1cafc76e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1cafc76e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1cafc76e Branch: refs/heads/branch-2.1 Commit: 1cafc76ea1e9eef40b24060d1cd7c4aaf9f16a49 Parents: fcd22e5 Author: Shixiong Zhu Authored: Thu Dec 8 17:58:44 2016 -0800 Committer: Reynold Xin Committed: Thu Dec 8 17:58:44 2016 -0800 -- .../apache/spark/internal/config/package.scala | 3 +- .../scala/org/apache/spark/rdd/HadoopRDD.scala | 30 +++- .../org/apache/spark/rdd/NewHadoopRDD.scala | 50 .../sql/execution/datasources/FileScanRDD.scala | 3 ++ .../org/apache/spark/sql/internal/SQLConf.scala | 3 +- 5 files changed, 57 insertions(+), 32 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1cafc76e/core/src/main/scala/org/apache/spark/internal/config/package.scala -- diff --git a/core/src/main/scala/org/apache/spark/internal/config/package.scala b/core/src/main/scala/org/apache/spark/internal/config/package.scala index 4a3e3d5..8ce9883 100644 --- a/core/src/main/scala/org/apache/spark/internal/config/package.scala +++ b/core/src/main/scala/org/apache/spark/internal/config/package.scala @@ -203,7 +203,8 @@ package object config { private[spark] val IGNORE_CORRUPT_FILES = ConfigBuilder("spark.files.ignoreCorruptFiles") .doc("Whether to ignore corrupt files. If true, the Spark jobs will continue to run when " + - "encountering corrupt files and contents that have been read will still be returned.") + "encountering corrupted or non-existing files and contents that have been read will still " + + "be returned.") .booleanConf .createWithDefault(false) http://git-wip-us.apache.org/repos/asf/spark/blob/1cafc76e/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala -- diff --git a/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala index 3133a28..b56ebf4 100644 --- a/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala +++ b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala @@ -210,12 +210,12 @@ class HadoopRDD[K, V]( override def compute(theSplit: Partition, context: TaskContext): InterruptibleIterator[(K, V)] = { val iter = new NextIterator[(K, V)] { - val split = theSplit.asInstanceOf[HadoopPartition] + private val split = theSplit.asInstanceOf[HadoopPartition] logInfo("Input split: " + split.inputSplit) - val jobConf = getJobConf() + private val jobConf = getJobConf() - val inputMetrics = context.taskMetrics().inputMetrics - val existingBytesRead = inputMetrics.bytesRead + private val inputMetrics = context.taskMetrics().inputMetrics + private val existingBytesRead = inputMetrics.bytesRead // Sets the thread local variable for the file's name split.inputSplit.value match { @@ -225,7 +225,7 @@ class HadoopRDD[K, V]( // Find a function that will return the FileSystem bytes read by this thread. 
Do this before // creating RecordReader, because RecordReader's constructor might read some bytes - val getBytesReadCallback: Option[() => Long] = split.inputSplit.value match { + private val getBytesReadCallback: Option[() => Long] = split.inputSplit.value match { case _: FileSplit | _: CombineFileSplit => SparkHadoopUtil.get.getFSBytesReadOnThreadCallback() case _ => None @@ -235,23 +235,31 @@ class HadoopRDD[K, V]( // If we do a coalesce, however, we are likely to compute multiple partitions in the same // task and in the same thread, in which case we need to avoid override values written by // previous partitions (SPARK-13071). - def updateBytesRead(): Unit = { + private def updateBytesRead(): Unit = { getBytesReadCallback.foreach { getBytesRead => inputMetrics.setBytesRead(existingBytesRead + getBytesRead()) } } - var reader: RecordReader[K, V] = null - val inputFormat = getInputFormat(jobConf) + private var reader: RecordReader[K, V] = null + private val
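To see the behaviour this flag controls end to end, the following minimal sketch (not part of the commit; the app name and input path are placeholders) enables `spark.files.ignoreCorruptFiles` on a local SparkContext and reads a glob whose files may be corrupted or deleted between listing and reading. The SQL data source path has the analogous `spark.sql.files.ignoreCorruptFiles` setting, which the same commit updates in `SQLConf`.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IgnoreCorruptFilesDemo {
  def main(args: Array[String]): Unit = {
    // With the flag on, corrupted (and, after SPARK-18774, non-existing) files are
    // skipped and whatever content was read successfully is still returned.
    val conf = new SparkConf()
      .setAppName("ignore-corrupt-files-demo")          // placeholder app name
      .setMaster("local[2]")
      .set("spark.files.ignoreCorruptFiles", "true")    // core (RDD) reads
    val sc = new SparkContext(conf)

    // Placeholder path; some matched files may disappear or be truncated after
    // listing, and the count still completes instead of failing the job.
    val lines = sc.textFile("/tmp/demo-input/*.txt")
    println(s"read ${lines.count()} lines")

    sc.stop()
  }
}
```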
spark git commit: [SPARK-18776][SS] Make Offset for FileStreamSource corrected formatted in json
Repository: spark Updated Branches: refs/heads/master 202fcd21c -> 458fa3325 [SPARK-18776][SS] Make Offset for FileStreamSource corrected formatted in json ## What changes were proposed in this pull request? - Changed FileStreamSource to use new FileStreamSourceOffset rather than LongOffset. The field is named as `logOffset` to make it more clear that this is a offset in the file stream log. - Fixed bug in FileStreamSourceLog, the field endId in the FileStreamSourceLog.get(startId, endId) was not being used at all. No test caught it earlier. Only my updated tests caught it. Other minor changes - Dont use batchId in the FileStreamSource, as calling it batch id is extremely miss leading. With multiple sources, it may happen that a new batch has no new data from a file source. So offset of FileStreamSource != batchId after that batch. ## How was this patch tested? Updated unit test. Author: Tathagata DasCloses #16205 from tdas/SPARK-18776. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/458fa332 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/458fa332 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/458fa332 Branch: refs/heads/master Commit: 458fa3325e5f8c21c50e406ac8059d6236f93a9c Parents: 202fcd2 Author: Tathagata Das Authored: Thu Dec 8 17:53:34 2016 -0800 Committer: Tathagata Das Committed: Thu Dec 8 17:53:34 2016 -0800 -- .../sql/kafka010/KafkaSourceOffsetSuite.scala | 2 +- .../execution/streaming/FileStreamSource.scala | 32 ++-- .../streaming/FileStreamSourceLog.scala | 2 +- .../streaming/FileStreamSourceOffset.scala | 53 .../file-source-offset-version-2.1.0-json.txt | 1 + .../file-source-offset-version-2.1.0-long.txt | 1 + .../file-source-offset-version-2.1.0.txt| 1 - .../offset-log-version-2.1.0/0 | 4 +- .../streaming/FileStreamSourceSuite.scala | 2 +- .../execution/streaming/OffsetSeqLogSuite.scala | 2 +- .../sql/streaming/FileStreamSourceSuite.scala | 30 +++ 11 files changed, 96 insertions(+), 34 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/458fa332/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala -- diff --git a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala index 22668fd..10b35c7 100644 --- a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala +++ b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala @@ -90,7 +90,7 @@ class KafkaSourceOffsetSuite extends OffsetSuite with SharedSQLContext { } } - test("read Spark 2.1.0 log format") { + test("read Spark 2.1.0 offset format") { val offset = readFromResource("kafka-source-offset-version-2.1.0.txt") assert(KafkaSourceOffset(offset) === KafkaSourceOffset(("topic1", 0, 456L), ("topic1", 1, 789L), ("topic2", 0, 0L))) http://git-wip-us.apache.org/repos/asf/spark/blob/458fa332/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala index 8494aef..20e0dce 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala +++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala @@ -57,7 +57,7 @@ class FileStreamSource( private val metadataLog = new FileStreamSourceLog(FileStreamSourceLog.VERSION, sparkSession, metadataPath) - private var maxBatchId = metadataLog.getLatest().map(_._1).getOrElse(-1L) + private var metadataLogCurrentOffset = metadataLog.getLatest().map(_._1).getOrElse(-1L) /** Maximum number of new files to be considered in each batch */ private val maxFilesPerBatch = sourceOptions.maxFilesPerTrigger @@ -79,7 +79,7 @@ class FileStreamSource( * `synchronized` on this method is for solving race conditions in tests. In the normal usage, * there is no race here, so the cost of `synchronized` should be rare. */ - private def fetchMaxOffset(): LongOffset = synchronized { + private def fetchMaxOffset(): FileStreamSourceOffset = synchronized { // All the new files found - ignore aged files and files
spark git commit: [SPARK-18776][SS] Make Offset for FileStreamSource corrected formatted in json
Repository: spark Updated Branches: refs/heads/branch-2.1 e43209fe2 -> fcd22e538 [SPARK-18776][SS] Make Offset for FileStreamSource corrected formatted in json ## What changes were proposed in this pull request? - Changed FileStreamSource to use new FileStreamSourceOffset rather than LongOffset. The field is named as `logOffset` to make it more clear that this is a offset in the file stream log. - Fixed bug in FileStreamSourceLog, the field endId in the FileStreamSourceLog.get(startId, endId) was not being used at all. No test caught it earlier. Only my updated tests caught it. Other minor changes - Dont use batchId in the FileStreamSource, as calling it batch id is extremely miss leading. With multiple sources, it may happen that a new batch has no new data from a file source. So offset of FileStreamSource != batchId after that batch. ## How was this patch tested? Updated unit test. Author: Tathagata DasCloses #16205 from tdas/SPARK-18776. (cherry picked from commit 458fa3325e5f8c21c50e406ac8059d6236f93a9c) Signed-off-by: Tathagata Das Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fcd22e53 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fcd22e53 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fcd22e53 Branch: refs/heads/branch-2.1 Commit: fcd22e5389a7dffda32be0e143d772f611a0f3d9 Parents: e43209f Author: Tathagata Das Authored: Thu Dec 8 17:53:34 2016 -0800 Committer: Tathagata Das Committed: Thu Dec 8 17:53:45 2016 -0800 -- .../sql/kafka010/KafkaSourceOffsetSuite.scala | 2 +- .../execution/streaming/FileStreamSource.scala | 32 ++-- .../streaming/FileStreamSourceLog.scala | 2 +- .../streaming/FileStreamSourceOffset.scala | 53 .../file-source-offset-version-2.1.0-json.txt | 1 + .../file-source-offset-version-2.1.0-long.txt | 1 + .../file-source-offset-version-2.1.0.txt| 1 - .../offset-log-version-2.1.0/0 | 4 +- .../streaming/FileStreamSourceSuite.scala | 2 +- .../execution/streaming/OffsetSeqLogSuite.scala | 2 +- .../sql/streaming/FileStreamSourceSuite.scala | 30 +++ 11 files changed, 96 insertions(+), 34 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fcd22e53/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala -- diff --git a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala index 22668fd..10b35c7 100644 --- a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala +++ b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala @@ -90,7 +90,7 @@ class KafkaSourceOffsetSuite extends OffsetSuite with SharedSQLContext { } } - test("read Spark 2.1.0 log format") { + test("read Spark 2.1.0 offset format") { val offset = readFromResource("kafka-source-offset-version-2.1.0.txt") assert(KafkaSourceOffset(offset) === KafkaSourceOffset(("topic1", 0, 456L), ("topic1", 1, 789L), ("topic2", 0, 0L))) http://git-wip-us.apache.org/repos/asf/spark/blob/fcd22e53/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala index 8494aef..20e0dce 100644 --- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala @@ -57,7 +57,7 @@ class FileStreamSource( private val metadataLog = new FileStreamSourceLog(FileStreamSourceLog.VERSION, sparkSession, metadataPath) - private var maxBatchId = metadataLog.getLatest().map(_._1).getOrElse(-1L) + private var metadataLogCurrentOffset = metadataLog.getLatest().map(_._1).getOrElse(-1L) /** Maximum number of new files to be considered in each batch */ private val maxFilesPerBatch = sourceOptions.maxFilesPerTrigger @@ -79,7 +79,7 @@ class FileStreamSource( * `synchronized` on this method is for solving race conditions in tests. In the normal usage, * there is no race here, so the cost of `synchronized` should be rare. */ - private def fetchMaxOffset(): LongOffset = synchronized
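As a rough sketch of what "Offset ... formatted in json" means in practice (illustrative code only, not the Spark classes above; the demo class name and the use of json4s, which ships with Spark, are assumptions made for the example), a one-field offset can be round-tripped through a `{"logOffset": N}` style JSON document like this:

```scala
import org.json4s.NoTypeHints
import org.json4s.jackson.Serialization

// Stand-in for a file-source offset: a single position in the file stream log.
case class DemoFileSourceOffset(logOffset: Long)

object OffsetJsonDemo {
  // json4s formats used for both writing and reading the offset
  private implicit val formats = Serialization.formats(NoTypeHints)

  def toJson(offset: DemoFileSourceOffset): String = Serialization.write(offset)

  def fromJson(json: String): DemoFileSourceOffset =
    Serialization.read[DemoFileSourceOffset](json)

  def main(args: Array[String]): Unit = {
    val json = toJson(DemoFileSourceOffset(logOffset = 345L))
    println(json)                              // {"logOffset":345}
    assert(fromJson(json).logOffset == 345L)   // round-trips cleanly
  }
}
```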
spark git commit: [SPARK-18590][SPARKR] Change the R source build to Hadoop 2.6
Repository: spark
Updated Branches:
  refs/heads/master 3261e25da -> 202fcd21c

[SPARK-18590][SPARKR] Change the R source build to Hadoop 2.6

This PR changes the SparkR source release tarball to be built using the Hadoop 2.6 profile. Previously it was using the without hadoop profile which leads to an error as discussed in https://github.com/apache/spark/pull/16014#issuecomment-265843991

Author: Shivaram Venkataraman

Closes #16218 from shivaram/fix-sparkr-release-build.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/202fcd21
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/202fcd21
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/202fcd21

Branch: refs/heads/master
Commit: 202fcd21ce01393fa6dfaa1c2126e18e9b85ee96
Parents: 3261e25
Author: Shivaram Venkataraman
Authored: Thu Dec 8 13:01:46 2016 -0800
Committer: Shivaram Venkataraman
Committed: Thu Dec 8 13:01:46 2016 -0800

----------------------------------------------------------------------
 dev/create-release/release-build.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/202fcd21/dev/create-release/release-build.sh
----------------------------------------------------------------------
diff --git a/dev/create-release/release-build.sh b/dev/create-release/release-build.sh
index 8863ee6..1b05b20 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -238,10 +238,10 @@ if [[ "$1" == "package" ]]; then
   FLAGS="-Psparkr -Phive -Phive-thriftserver -Pyarn -Pmesos"
   make_binary_release "hadoop2.3" "-Phadoop-2.3 $FLAGS" "3033" &
   make_binary_release "hadoop2.4" "-Phadoop-2.4 $FLAGS" "3034" &
-  make_binary_release "hadoop2.6" "-Phadoop-2.6 $FLAGS" "3035" &
+  make_binary_release "hadoop2.6" "-Phadoop-2.6 $FLAGS" "3035" "withr" &
   make_binary_release "hadoop2.7" "-Phadoop-2.7 $FLAGS" "3036" "withpip" &
   make_binary_release "hadoop2.4-without-hive" "-Psparkr -Phadoop-2.4 -Pyarn -Pmesos" "3037" &
-  make_binary_release "without-hadoop" "-Psparkr -Phadoop-provided -Pyarn -Pmesos" "3038" "withr" &
+  make_binary_release "without-hadoop" "-Psparkr -Phadoop-provided -Pyarn -Pmesos" "3038" &
   wait
   rm -rf spark-$SPARK_VERSION-bin-*/
spark git commit: [SPARK-18590][SPARKR] Change the R source build to Hadoop 2.6
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 9483242f4 -> e43209fe2

[SPARK-18590][SPARKR] Change the R source build to Hadoop 2.6

This PR changes the SparkR source release tarball to be built using the Hadoop 2.6 profile. Previously it was using the without hadoop profile which leads to an error as discussed in https://github.com/apache/spark/pull/16014#issuecomment-265843991

Author: Shivaram Venkataraman

Closes #16218 from shivaram/fix-sparkr-release-build.

(cherry picked from commit 202fcd21ce01393fa6dfaa1c2126e18e9b85ee96)
Signed-off-by: Shivaram Venkataraman

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e43209fe
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e43209fe
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e43209fe

Branch: refs/heads/branch-2.1
Commit: e43209fe2a69fb239dff8bc1a18297d3696f0dcd
Parents: 9483242
Author: Shivaram Venkataraman
Authored: Thu Dec 8 13:01:46 2016 -0800
Committer: Shivaram Venkataraman
Committed: Thu Dec 8 13:01:54 2016 -0800

----------------------------------------------------------------------
 dev/create-release/release-build.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/e43209fe/dev/create-release/release-build.sh
----------------------------------------------------------------------
diff --git a/dev/create-release/release-build.sh b/dev/create-release/release-build.sh
index 8863ee6..1b05b20 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -238,10 +238,10 @@ if [[ "$1" == "package" ]]; then
   FLAGS="-Psparkr -Phive -Phive-thriftserver -Pyarn -Pmesos"
   make_binary_release "hadoop2.3" "-Phadoop-2.3 $FLAGS" "3033" &
   make_binary_release "hadoop2.4" "-Phadoop-2.4 $FLAGS" "3034" &
-  make_binary_release "hadoop2.6" "-Phadoop-2.6 $FLAGS" "3035" &
+  make_binary_release "hadoop2.6" "-Phadoop-2.6 $FLAGS" "3035" "withr" &
   make_binary_release "hadoop2.7" "-Phadoop-2.7 $FLAGS" "3036" "withpip" &
   make_binary_release "hadoop2.4-without-hive" "-Psparkr -Phadoop-2.4 -Pyarn -Pmesos" "3037" &
-  make_binary_release "without-hadoop" "-Psparkr -Phadoop-provided -Pyarn -Pmesos" "3038" "withr" &
+  make_binary_release "without-hadoop" "-Psparkr -Phadoop-provided -Pyarn -Pmesos" "3038" &
   wait
   rm -rf spark-$SPARK_VERSION-bin-*/
spark git commit: [SPARK-18760][SQL] Consistent format specification for FileFormats
Repository: spark Updated Branches: refs/heads/branch-2.1 a03564418 -> 9483242f4 [SPARK-18760][SQL] Consistent format specification for FileFormats ## What changes were proposed in this pull request? This patch fixes the format specification in explain for file sources (Parquet and Text formats are the only two that are different from the rest): Before: ``` scala> spark.read.text("test.text").explain() == Physical Plan == *FileScan text [value#15] Batched: false, Format: org.apache.spark.sql.execution.datasources.text.TextFileFormatxyz, Location: InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], PushedFilters: [], ReadSchema: struct ``` After: ``` scala> spark.read.text("test.text").explain() == Physical Plan == *FileScan text [value#15] Batched: false, Format: Text, Location: InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], PushedFilters: [], ReadSchema: struct ``` Also closes #14680. ## How was this patch tested? Verified in spark-shell. Author: Reynold XinCloses #16187 from rxin/SPARK-18760. (cherry picked from commit 5f894d23a54ea99f75f8b722e111e5270f7f80cf) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9483242f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9483242f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9483242f Branch: refs/heads/branch-2.1 Commit: 9483242f4c6cc13001e5a967810718b26beb2361 Parents: a035644 Author: Reynold Xin Authored: Thu Dec 8 12:52:05 2016 -0800 Committer: Reynold Xin Committed: Thu Dec 8 12:52:21 2016 -0800 -- .../sql/execution/datasources/parquet/ParquetFileFormat.scala | 2 +- .../spark/sql/execution/datasources/text/TextFileFormat.scala | 2 ++ .../apache/spark/sql/streaming/FileStreamSourceSuite.scala| 7 --- 3 files changed, 7 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9483242f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala index 031a0fe..0965ffe 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala @@ -61,7 +61,7 @@ class ParquetFileFormat override def shortName(): String = "parquet" - override def toString: String = "ParquetFormat" + override def toString: String = "Parquet" override def hashCode(): Int = getClass.hashCode() http://git-wip-us.apache.org/repos/asf/spark/blob/9483242f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala index 8e04396..3e89082 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala @@ -43,6 +43,8 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister { override def shortName(): String = "text" + override def toString: String = "Text" + private def verifySchema(schema: StructType): 
Unit = { if (schema.size != 1) { throw new AnalysisException( http://git-wip-us.apache.org/repos/asf/spark/blob/9483242f/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala index 7b6fe83..267c462 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala @@ -31,7 +31,8 @@ import org.apache.spark.sql.test.SharedSQLContext import org.apache.spark.sql.types._ import org.apache.spark.util.Utils -class FileStreamSourceTest extends StreamTest with SharedSQLContext with PrivateMethodTester { +abstract class FileStreamSourceTest + extends StreamTest with SharedSQLContext with
spark git commit: [SPARK-18751][CORE] Fix deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext
Repository: spark Updated Branches: refs/heads/master c3d3a9d0e -> 26432df9c [SPARK-18751][CORE] Fix deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext ## What changes were proposed in this pull request? When `SparkContext.stop` is called in `Utils.tryOrStopSparkContext` (the following three places), it will cause deadlock because the `stop` method needs to wait for the thread running `stop` to exit. - ContextCleaner.keepCleaning - LiveListenerBus.listenerThread.run - TaskSchedulerImpl.start This PR adds `SparkContext.stopInNewThread` and uses it to eliminate the potential deadlock. I also removed my changes in #15775 since they are not necessary now. ## How was this patch tested? Jenkins Author: Shixiong ZhuCloses #16178 from zsxwing/fix-stop-deadlock. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/26432df9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/26432df9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/26432df9 Branch: refs/heads/master Commit: 26432df9cc6ffe569583aa628c6ecd7050b38316 Parents: c3d3a9d Author: Shixiong Zhu Authored: Thu Dec 8 11:54:04 2016 -0800 Committer: Shixiong Zhu Committed: Thu Dec 8 11:54:04 2016 -0800 -- .../scala/org/apache/spark/SparkContext.scala | 35 +++- .../scala/org/apache/spark/rpc/RpcEnv.scala | 5 --- .../org/apache/spark/rpc/netty/Dispatcher.scala | 1 - .../apache/spark/rpc/netty/NettyRpcEnv.scala| 5 --- .../apache/spark/scheduler/DAGScheduler.scala | 2 +- .../cluster/StandaloneSchedulerBackend.scala| 2 +- .../scala/org/apache/spark/util/Utils.scala | 2 +- .../org/apache/spark/rpc/RpcEnvSuite.scala | 13 8 files changed, 23 insertions(+), 42 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/26432df9/core/src/main/scala/org/apache/spark/SparkContext.scala -- diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala index be4dae1..b42820a 100644 --- a/core/src/main/scala/org/apache/spark/SparkContext.scala +++ b/core/src/main/scala/org/apache/spark/SparkContext.scala @@ -1760,25 +1760,30 @@ class SparkContext(config: SparkConf) extends Logging { def listJars(): Seq[String] = addedJars.keySet.toSeq /** - * Shut down the SparkContext. + * When stopping SparkContext inside Spark components, it's easy to cause dead-lock since Spark + * may wait for some internal threads to finish. It's better to use this method to stop + * SparkContext instead. */ - def stop(): Unit = { -if (env.rpcEnv.isInRPCThread) { - // `stop` will block until all RPC threads exit, so we cannot call stop inside a RPC thread. - // We should launch a new thread to call `stop` to avoid dead-lock. - new Thread("stop-spark-context") { -setDaemon(true) - -override def run(): Unit = { - _stop() + private[spark] def stopInNewThread(): Unit = { +new Thread("stop-spark-context") { + setDaemon(true) + + override def run(): Unit = { +try { + SparkContext.this.stop() +} catch { + case e: Throwable => +logError(e.getMessage, e) +throw e } - }.start() -} else { - _stop() -} + } +}.start() } - private def _stop() { + /** + * Shut down the SparkContext. 
+ */ + def stop(): Unit = { if (LiveListenerBus.withinListenerThread.value) { throw new SparkException( s"Cannot stop SparkContext within listener thread of ${LiveListenerBus.name}") http://git-wip-us.apache.org/repos/asf/spark/blob/26432df9/core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala -- diff --git a/core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala b/core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala index bbc4163..530743c 100644 --- a/core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala +++ b/core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala @@ -146,11 +146,6 @@ private[spark] abstract class RpcEnv(conf: SparkConf) { * @param uri URI with location of the file. */ def openChannel(uri: String): ReadableByteChannel - - /** - * Return if the current thread is a RPC thread. - */ - def isInRPCThread: Boolean } /** http://git-wip-us.apache.org/repos/asf/spark/blob/26432df9/core/src/main/scala/org/apache/spark/rpc/netty/Dispatcher.scala -- diff --git a/core/src/main/scala/org/apache/spark/rpc/netty/Dispatcher.scala
spark git commit: [SPARK-18751][CORE] Fix deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext
Repository: spark Updated Branches: refs/heads/branch-2.1 d69df9073 -> a03564418 [SPARK-18751][CORE] Fix deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext ## What changes were proposed in this pull request? When `SparkContext.stop` is called in `Utils.tryOrStopSparkContext` (the following three places), it will cause deadlock because the `stop` method needs to wait for the thread running `stop` to exit. - ContextCleaner.keepCleaning - LiveListenerBus.listenerThread.run - TaskSchedulerImpl.start This PR adds `SparkContext.stopInNewThread` and uses it to eliminate the potential deadlock. I also removed my changes in #15775 since they are not necessary now. ## How was this patch tested? Jenkins Author: Shixiong ZhuCloses #16178 from zsxwing/fix-stop-deadlock. (cherry picked from commit 26432df9cc6ffe569583aa628c6ecd7050b38316) Signed-off-by: Shixiong Zhu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a0356441 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a0356441 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a0356441 Branch: refs/heads/branch-2.1 Commit: a035644182646a2160ac16ecd6c7f4d98be2caad Parents: d69df90 Author: Shixiong Zhu Authored: Thu Dec 8 11:54:04 2016 -0800 Committer: Shixiong Zhu Committed: Thu Dec 8 11:54:10 2016 -0800 -- .../scala/org/apache/spark/SparkContext.scala | 35 +++- .../scala/org/apache/spark/rpc/RpcEnv.scala | 5 --- .../org/apache/spark/rpc/netty/Dispatcher.scala | 1 - .../apache/spark/rpc/netty/NettyRpcEnv.scala| 5 --- .../apache/spark/scheduler/DAGScheduler.scala | 2 +- .../cluster/StandaloneSchedulerBackend.scala| 2 +- .../scala/org/apache/spark/util/Utils.scala | 2 +- .../org/apache/spark/rpc/RpcEnvSuite.scala | 13 8 files changed, 23 insertions(+), 42 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a0356441/core/src/main/scala/org/apache/spark/SparkContext.scala -- diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala index b8414b5..8f8392f 100644 --- a/core/src/main/scala/org/apache/spark/SparkContext.scala +++ b/core/src/main/scala/org/apache/spark/SparkContext.scala @@ -1757,25 +1757,30 @@ class SparkContext(config: SparkConf) extends Logging { def listJars(): Seq[String] = addedJars.keySet.toSeq /** - * Shut down the SparkContext. + * When stopping SparkContext inside Spark components, it's easy to cause dead-lock since Spark + * may wait for some internal threads to finish. It's better to use this method to stop + * SparkContext instead. */ - def stop(): Unit = { -if (env.rpcEnv.isInRPCThread) { - // `stop` will block until all RPC threads exit, so we cannot call stop inside a RPC thread. - // We should launch a new thread to call `stop` to avoid dead-lock. - new Thread("stop-spark-context") { -setDaemon(true) - -override def run(): Unit = { - _stop() + private[spark] def stopInNewThread(): Unit = { +new Thread("stop-spark-context") { + setDaemon(true) + + override def run(): Unit = { +try { + SparkContext.this.stop() +} catch { + case e: Throwable => +logError(e.getMessage, e) +throw e } - }.start() -} else { - _stop() -} + } +}.start() } - private def _stop() { + /** + * Shut down the SparkContext. 
+ */ + def stop(): Unit = { if (LiveListenerBus.withinListenerThread.value) { throw new SparkException( s"Cannot stop SparkContext within listener thread of ${LiveListenerBus.name}") http://git-wip-us.apache.org/repos/asf/spark/blob/a0356441/core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala -- diff --git a/core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala b/core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala index bbc4163..530743c 100644 --- a/core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala +++ b/core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala @@ -146,11 +146,6 @@ private[spark] abstract class RpcEnv(conf: SparkConf) { * @param uri URI with location of the file. */ def openChannel(uri: String): ReadableByteChannel - - /** - * Return if the current thread is a RPC thread. - */ - def isInRPCThread: Boolean } /** http://git-wip-us.apache.org/repos/asf/spark/blob/a0356441/core/src/main/scala/org/apache/spark/rpc/netty/Dispatcher.scala
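The deadlock being fixed is easy to reproduce outside Spark: if `stop()` joins a worker thread, the worker must never call `stop()` directly, or it ends up waiting on itself. The toy sketch below (all names are invented for illustration; this is not Spark code) shows the "hand the stop call to a fresh daemon thread" pattern that `stopInNewThread` applies:

```scala
// Toy component whose stop() joins its worker thread, mirroring the situation
// described above for ContextCleaner / LiveListenerBus / TaskSchedulerImpl.
object StopFromWorkerDemo {
  private val worker: Thread = new Thread("worker") {
    override def run(): Unit = {
      // ... work happens here, then the worker decides everything must shut down.
      // Calling stop() directly at this point would join the current thread and hang.
      stopInNewThread()
    }
  }

  def start(): Unit = worker.start()

  def stop(): Unit = {
    // In real code this would also release resources before or after the join.
    worker.join()
    println("stopped cleanly, no deadlock")
  }

  // Delegate the blocking stop() call to a short-lived daemon thread so the
  // worker thread itself can exit and the join() in stop() can return.
  private def stopInNewThread(): Unit = {
    val t = new Thread("stop-in-new-thread") {
      override def run(): Unit = stop()
    }
    t.setDaemon(true)
    t.start()
  }

  def main(args: Array[String]): Unit = {
    start()
    worker.join()       // the worker exits right after delegating the stop call
    Thread.sleep(200)   // give the daemon stop thread a moment to finish printing
  }
}
```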
spark git commit: [SPARK-18590][SPARKR] build R source package when making distribution
Repository: spark Updated Branches: refs/heads/branch-2.1 e0173f14e -> d69df9073 [SPARK-18590][SPARKR] build R source package when making distribution This PR has 2 key changes. One, we are building source package (aka bundle package) for SparkR which could be released on CRAN. Two, we should include in the official Spark binary distributions SparkR installed from this source package instead (which would have help/vignettes rds needed for those to work when the SparkR package is loaded in R, whereas earlier approach with devtools does not) But, because of various differences in how R performs different tasks, this PR is a fair bit more complicated. More details below. This PR also includes a few minor fixes. These are the additional steps in make-distribution; please see [here](https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md) on what's going to a CRAN release, which is now run during make-distribution.sh. 1. package needs to be installed because the first code block in vignettes is `library(SparkR)` without lib path 2. `R CMD build` will build vignettes (this process runs Spark/SparkR code and captures outputs into pdf documentation) 3. `R CMD check` on the source package will install package and build vignettes again (this time from source packaged) - this is a key step required to release R package on CRAN (will skip tests here but tests will need to pass for CRAN release process to success - ideally, during release signoff we should install from the R source package and run tests) 4. `R CMD Install` on the source package (this is the only way to generate doc/vignettes rds files correctly, not in step # 1) (the output of this step is what we package into Spark dist and sparkr.zip) Alternatively, R CMD build should already be installing the package in a temp directory though it might just be finding this location and set it to lib.loc parameter; another approach is perhaps we could try calling `R CMD INSTALL --build pkg` instead. But in any case, despite installing the package multiple times this is relatively fast. Building vignettes takes a while though. Manually, CI. Author: Felix CheungCloses #16014 from felixcheung/rdist. (cherry picked from commit c3d3a9d0e85b834abef87069e4edd27db87fc607) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d69df907 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d69df907 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d69df907 Branch: refs/heads/branch-2.1 Commit: d69df9073274f7ab3a3598bb182a3233fd7775cd Parents: e0173f1 Author: Felix Cheung Authored: Thu Dec 8 11:29:31 2016 -0800 Committer: Shivaram Venkataraman Committed: Thu Dec 8 11:31:24 2016 -0800 -- R/CRAN_RELEASE.md | 2 +- R/check-cran.sh | 19 ++- R/install-dev.sh| 2 +- R/pkg/.Rbuildignore | 3 +++ R/pkg/DESCRIPTION | 13 ++--- R/pkg/NAMESPACE | 2 +- dev/create-release/release-build.sh | 27 +++ dev/make-distribution.sh| 25 + 8 files changed, 74 insertions(+), 19 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d69df907/R/CRAN_RELEASE.md -- diff --git a/R/CRAN_RELEASE.md b/R/CRAN_RELEASE.md index bea8f9f..d6084c7 100644 --- a/R/CRAN_RELEASE.md +++ b/R/CRAN_RELEASE.md @@ -7,7 +7,7 @@ To release SparkR as a package to CRAN, we would use the `devtools` package. Ple First, check that the `Version:` field in the `pkg/DESCRIPTION` file is updated. Also, check for stale files not under source control. 
-Note that while `check-cran.sh` is running `R CMD check`, it is doing so with `--no-manual --no-vignettes`, which skips a few vignettes or PDF checks - therefore it will be preferred to run `R CMD check` on the source package built manually before uploading a release. +Note that while `run-tests.sh` runs `check-cran.sh` (which runs `R CMD check`), it is doing so with `--no-manual --no-vignettes`, which skips a few vignettes or PDF checks - therefore it will be preferred to run `R CMD check` on the source package built manually before uploading a release. Also note that for CRAN checks for pdf vignettes to success, `qpdf` tool must be there (to install it, eg. `yum -q -y install qpdf`). To upload a release, we would need to update the `cran-comments.md`. This should generally contain the results from running the `check-cran.sh` script along with comments on status of all `WARNING` (should not be
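The four packaging steps described in the commit message above boil down to a short sequence of shell commands. The sketch below is only illustrative: it assumes it is run from the R/ directory of a Spark checkout with a working Spark build, and the exact flags and paths used by make-distribution.sh and check-cran.sh may differ.

```
# Hedged sketch of the R source-package workflow (run from the R/ directory).

# 1. Install SparkR so that `library(SparkR)` in the vignettes resolves.
./install-dev.sh

# 2. Build the source package; this also builds the vignettes, running SparkR code.
R CMD build pkg

# 3. Check the source package as CRAN would; tests can be skipped at this stage.
R CMD check --no-tests SparkR_*.tar.gz

# 4. Install from the source package so the help/vignette .rds files are generated.
R CMD INSTALL SparkR_*.tar.gz
```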
spark git commit: [SPARK-18590][SPARKR] build R source package when making distribution
Repository: spark Updated Branches: refs/heads/master 3c68944b2 -> c3d3a9d0e

[SPARK-18590][SPARKR] build R source package when making distribution

## What changes were proposed in this pull request?

This PR has 2 key changes. One, we are building a source package (aka bundle package) for SparkR which could be released on CRAN. Two, we should include in the official Spark binary distributions a SparkR installed from this source package instead (which would have the help/vignettes rds needed for those to work when the SparkR package is loaded in R, whereas the earlier approach with devtools does not). But, because of various differences in how R performs different tasks, this PR is a fair bit more complicated. More details below. This PR also includes a few minor fixes.

### more details

These are the additional steps in make-distribution; please see [here](https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md) on what goes into a CRAN release, which is now run during make-distribution.sh.

1. The package needs to be installed first, because the first code block in the vignettes is `library(SparkR)` without a lib path.
2. `R CMD build` will build the vignettes (this process runs Spark/SparkR code and captures the outputs into the pdf documentation).
3. `R CMD check` on the source package will install the package and build the vignettes again (this time from the source package) - this is a key step required to release an R package on CRAN (tests are skipped here, but tests will need to pass for the CRAN release process to succeed - ideally, during release signoff we should install from the R source package and run tests).
4. `R CMD INSTALL` on the source package (this is the only way to generate the doc/vignettes rds files correctly, not step #1); the output of this step is what we package into the Spark dist and sparkr.zip.

Alternatively, `R CMD build` should already be installing the package in a temp directory, though it might just be a matter of finding this location and setting it as the lib.loc parameter; another approach is perhaps to try calling `R CMD INSTALL --build pkg` instead. But in any case, despite installing the package multiple times, this is relatively fast. Building the vignettes takes a while though.

## How was this patch tested?

Manually, CI.

Author: Felix Cheung
Closes #16014 from felixcheung/rdist.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c3d3a9d0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c3d3a9d0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c3d3a9d0
Branch: refs/heads/master
Commit: c3d3a9d0e85b834abef87069e4edd27db87fc607
Parents: 3c68944
Author: Felix Cheung
Authored: Thu Dec 8 11:29:31 2016 -0800
Committer: Shivaram Venkataraman
Committed: Thu Dec 8 11:29:31 2016 -0800
--
R/CRAN_RELEASE.md                   |  2 +-
R/check-cran.sh                     | 19 ++-
R/install-dev.sh                    |  2 +-
R/pkg/.Rbuildignore                 |  3 +++
R/pkg/DESCRIPTION                   | 13 ++---
R/pkg/NAMESPACE                     |  2 +-
dev/create-release/release-build.sh | 27 +++
dev/make-distribution.sh            | 25 +
8 files changed, 74 insertions(+), 19 deletions(-)
--
http://git-wip-us.apache.org/repos/asf/spark/blob/c3d3a9d0/R/CRAN_RELEASE.md
--
diff --git a/R/CRAN_RELEASE.md b/R/CRAN_RELEASE.md index bea8f9f..d6084c7 100644 --- a/R/CRAN_RELEASE.md +++ b/R/CRAN_RELEASE.md @@ -7,7 +7,7 @@ To release SparkR as a package to CRAN, we would use the `devtools` package. Ple First, check that the `Version:` field in the `pkg/DESCRIPTION` file is updated. Also, check for stale files not under source control.
-Note that while `check-cran.sh` is running `R CMD check`, it is doing so with `--no-manual --no-vignettes`, which skips a few vignettes or PDF checks - therefore it will be preferred to run `R CMD check` on the source package built manually before uploading a release. +Note that while `run-tests.sh` runs `check-cran.sh` (which runs `R CMD check`), it is doing so with `--no-manual --no-vignettes`, which skips a few vignettes or PDF checks - therefore it will be preferred to run `R CMD check` on the source package built manually before uploading a release. Also note that for CRAN checks for pdf vignettes to success, `qpdf` tool must be there (to install it, eg. `yum -q -y install qpdf`). To upload a release, we would need to update the `cran-comments.md`. This should generally contain the results from running the `check-cran.sh` script along with comments on status of all `WARNING` (should not be any) or `NOTE`. As a part of
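The point about `library(SparkR)` being called without a lib path in the vignettes can be made concrete with a tiny R illustration. The install location below (the R/lib directory populated by install-dev.sh) is an assumption used for the example, not something mandated by the commit.

```
# Vignettes load SparkR from the default library search path:
library(SparkR)

# Loading from an explicit location instead, e.g. the lib/ directory that
# install-dev.sh populates (the path here is only an example):
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
```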
spark git commit: [SPARK-16589] [PYTHON] Chained cartesian produces incorrect number of records
Repository: spark Updated Branches: refs/heads/branch-2.1 726217eb7 -> e0173f14e

[SPARK-16589] [PYTHON] Chained cartesian produces incorrect number of records

## What changes were proposed in this pull request?

Fixes a bug in the Python implementation of rdd cartesian product related to batching that showed up in repeated cartesian products with seemingly random results. The root cause was multiple iterators pulling from the same stream in the wrong order because of logic that ignored batching.

`CartesianDeserializer` and `PairDeserializer` were changed to implement `_load_stream_without_unbatching` and borrow the one-line implementation of `load_stream` from `BatchedSerializer`. The default implementation of `_load_stream_without_unbatching` was changed to give consistent results (always an iterable) so that it could be used without additional checks. `PairDeserializer` no longer extends `CartesianDeserializer` as it was not really proper. If wanted, a new common super class could be added. Both `CartesianDeserializer` and `PairDeserializer` now only extend `Serializer` (which has no `dump_stream` implementation) since they are only meant for *de*serialization.

## How was this patch tested?

Additional unit tests (sourced from #14248) plus one for testing a cartesian with zip.

Author: Andrew Ray
Closes #16121 from aray/fix-cartesian.

(cherry picked from commit 3c68944b229aaaeeaee3efcbae3e3be9a2914855)
Signed-off-by: Davies Liu

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e0173f14
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e0173f14
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e0173f14
Branch: refs/heads/branch-2.1
Commit: e0173f14e3ea28d83c1c46bf97f7d3755960a8fc
Parents: 726217e
Author: Andrew Ray
Authored: Thu Dec 8 11:08:12 2016 -0800
Committer: Davies Liu
Committed: Thu Dec 8 11:08:27 2016 -0800
--
python/pyspark/serializers.py | 58 +++---
python/pyspark/tests.py       | 18
2 files changed, 53 insertions(+), 23 deletions(-)
--
http://git-wip-us.apache.org/repos/asf/spark/blob/e0173f14/python/pyspark/serializers.py
--
diff --git a/python/pyspark/serializers.py b/python/pyspark/serializers.py index 2a13269..c4f2f08 100644 --- a/python/pyspark/serializers.py +++ b/python/pyspark/serializers.py @@ -61,7 +61,7 @@ import itertools if sys.version < '3': import cPickle as pickle protocol = 2 -from itertools import izip as zip +from itertools import izip as zip, imap as map else: import pickle protocol = 3 @@ -96,7 +96,12 @@ class Serializer(object): raise NotImplementedError def _load_stream_without_unbatching(self, stream): -return self.load_stream(stream) +""" +Return an iterator of deserialized batches (lists) of objects from the input stream. +if the serializer does not operate on batches the default implementation returns an +iterator of single element lists. +""" +return map(lambda x: [x], self.load_stream(stream)) # Note: our notion of "equality" is that output generated by # equal serializers can be deserialized using the same serializer. @@ -278,50 +283,57 @@ class AutoBatchedSerializer(BatchedSerializer): return "AutoBatchedSerializer(%s)" % self.serializer -class CartesianDeserializer(FramedSerializer): +class CartesianDeserializer(Serializer): """ Deserializes the JavaRDD cartesian() of two PythonRDDs. +Due to pyspark batching we cannot simply use the result of the Java RDD cartesian, +we additionally need to do the cartesian within each pair of batches.
""" def __init__(self, key_ser, val_ser): -FramedSerializer.__init__(self) self.key_ser = key_ser self.val_ser = val_ser -def prepare_keys_values(self, stream): -key_stream = self.key_ser._load_stream_without_unbatching(stream) -val_stream = self.val_ser._load_stream_without_unbatching(stream) -key_is_batched = isinstance(self.key_ser, BatchedSerializer) -val_is_batched = isinstance(self.val_ser, BatchedSerializer) -for (keys, vals) in zip(key_stream, val_stream): -keys = keys if key_is_batched else [keys] -vals = vals if val_is_batched else [vals] -yield (keys, vals) +def _load_stream_without_unbatching(self, stream): +key_batch_stream = self.key_ser._load_stream_without_unbatching(stream) +val_batch_stream = self.val_ser._load_stream_without_unbatching(stream) +for (key_batch, val_batch) in
spark git commit: [SPARK-16589] [PYTHON] Chained cartesian produces incorrect number of records
Repository: spark Updated Branches: refs/heads/master ed8869ebb -> 3c68944b2

[SPARK-16589] [PYTHON] Chained cartesian produces incorrect number of records

## What changes were proposed in this pull request?

Fixes a bug in the Python implementation of rdd cartesian product related to batching that showed up in repeated cartesian products with seemingly random results. The root cause was multiple iterators pulling from the same stream in the wrong order because of logic that ignored batching.

`CartesianDeserializer` and `PairDeserializer` were changed to implement `_load_stream_without_unbatching` and borrow the one-line implementation of `load_stream` from `BatchedSerializer`. The default implementation of `_load_stream_without_unbatching` was changed to give consistent results (always an iterable) so that it could be used without additional checks. `PairDeserializer` no longer extends `CartesianDeserializer` as it was not really proper. If wanted, a new common super class could be added. Both `CartesianDeserializer` and `PairDeserializer` now only extend `Serializer` (which has no `dump_stream` implementation) since they are only meant for *de*serialization.

## How was this patch tested?

Additional unit tests (sourced from #14248) plus one for testing a cartesian with zip.

Author: Andrew Ray
Closes #16121 from aray/fix-cartesian.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3c68944b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3c68944b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3c68944b
Branch: refs/heads/master
Commit: 3c68944b229aaaeeaee3efcbae3e3be9a2914855
Parents: ed8869e
Author: Andrew Ray
Authored: Thu Dec 8 11:08:12 2016 -0800
Committer: Davies Liu
Committed: Thu Dec 8 11:08:12 2016 -0800
--
python/pyspark/serializers.py | 58 +++---
python/pyspark/tests.py       | 18
2 files changed, 53 insertions(+), 23 deletions(-)
--
http://git-wip-us.apache.org/repos/asf/spark/blob/3c68944b/python/pyspark/serializers.py
--
diff --git a/python/pyspark/serializers.py b/python/pyspark/serializers.py index 2a13269..c4f2f08 100644 --- a/python/pyspark/serializers.py +++ b/python/pyspark/serializers.py @@ -61,7 +61,7 @@ import itertools if sys.version < '3': import cPickle as pickle protocol = 2 -from itertools import izip as zip +from itertools import izip as zip, imap as map else: import pickle protocol = 3 @@ -96,7 +96,12 @@ class Serializer(object): raise NotImplementedError def _load_stream_without_unbatching(self, stream): -return self.load_stream(stream) +""" +Return an iterator of deserialized batches (lists) of objects from the input stream. +if the serializer does not operate on batches the default implementation returns an +iterator of single element lists. +""" +return map(lambda x: [x], self.load_stream(stream)) # Note: our notion of "equality" is that output generated by # equal serializers can be deserialized using the same serializer. @@ -278,50 +283,57 @@ class AutoBatchedSerializer(BatchedSerializer): return "AutoBatchedSerializer(%s)" % self.serializer -class CartesianDeserializer(FramedSerializer): +class CartesianDeserializer(Serializer): """ Deserializes the JavaRDD cartesian() of two PythonRDDs. +Due to pyspark batching we cannot simply use the result of the Java RDD cartesian, +we additionally need to do the cartesian within each pair of batches.
""" def __init__(self, key_ser, val_ser): -FramedSerializer.__init__(self) self.key_ser = key_ser self.val_ser = val_ser -def prepare_keys_values(self, stream): -key_stream = self.key_ser._load_stream_without_unbatching(stream) -val_stream = self.val_ser._load_stream_without_unbatching(stream) -key_is_batched = isinstance(self.key_ser, BatchedSerializer) -val_is_batched = isinstance(self.val_ser, BatchedSerializer) -for (keys, vals) in zip(key_stream, val_stream): -keys = keys if key_is_batched else [keys] -vals = vals if val_is_batched else [vals] -yield (keys, vals) +def _load_stream_without_unbatching(self, stream): +key_batch_stream = self.key_ser._load_stream_without_unbatching(stream) +val_batch_stream = self.val_ser._load_stream_without_unbatching(stream) +for (key_batch, val_batch) in zip(key_batch_stream, val_batch_stream): +# for correctness with repeated cartesian/zip this must be returned as
spark git commit: [SPARK-8617][WEBUI] HistoryServer: Include in-progress files during cleanup
Repository: spark Updated Branches: refs/heads/master b44d1b8fc -> ed8869ebb

[SPARK-8617][WEBUI] HistoryServer: Include in-progress files during cleanup

## What changes were proposed in this pull request?

- Removed the `attempt.completed` filter so the cleaner would also include orphaned in-progress files.
- Use the loading time as lastUpdated for in-progress files. Keep using the modTime for completed files. The first change prevents deletion of in-progress job files; the second ensures that the lastUpdated time won't change for completed jobs in the event of a HistoryServer reboot.

## How was this patch tested?

Added new unit tests and verified via existing tests.

Author: Ergin Seyfe
Closes #16165 from seyfe/clear_old_inprogress_files.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ed8869eb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ed8869eb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ed8869eb
Branch: refs/heads/master
Commit: ed8869ebbf39783b16daba2e2498a2bc1889306f
Parents: b44d1b8
Author: Ergin Seyfe
Authored: Thu Dec 8 10:21:09 2016 -0800
Committer: Marcelo Vanzin
Committed: Thu Dec 8 10:21:09 2016 -0800
--
.../deploy/history/FsHistoryProvider.scala      | 10 ++--
.../deploy/history/FsHistoryProviderSuite.scala | 50 +++-
2 files changed, 55 insertions(+), 5 deletions(-)
--
http://git-wip-us.apache.org/repos/asf/spark/blob/ed8869eb/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
--
diff --git a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala index 8ef69b1..3011ed0 100644 --- a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala +++ b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala @@ -446,9 +446,13 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) } val logPath = fileStatus.getPath() - val appCompleted = isApplicationCompleted(fileStatus) + // Use loading time as lastUpdated since some filesystems don't update modifiedTime + // each time file is updated. However use modifiedTime for completed jobs so lastUpdated + // won't change whenever HistoryServer restarts and reloads the file. + val lastUpdated = if (appCompleted) fileStatus.getModificationTime else clock.getTimeMillis() + val appListener = replay(fileStatus, appCompleted, new ReplayListenerBus(), eventsFilter) // Without an app ID, new logs will render incorrectly in the listing page, so do not list or @@ -461,7 +465,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) appListener.appAttemptId, appListener.startTime.getOrElse(-1L), appListener.endTime.getOrElse(-1L), - fileStatus.getModificationTime(), + lastUpdated, appListener.sparkUser.getOrElse(NOT_STARTED), appCompleted, fileStatus.getLen() @@ -546,7 +550,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) val appsToRetain = new mutable.LinkedHashMap[String, FsApplicationHistoryInfo]() def shouldClean(attempt: FsApplicationAttemptInfo): Boolean = { -now - attempt.lastUpdated > maxAge && attempt.completed +now - attempt.lastUpdated > maxAge } // Scan all logs from the log directory.
http://git-wip-us.apache.org/repos/asf/spark/blob/ed8869eb/core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala b/core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala index 2c41c43..027f412 100644 --- a/core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala @@ -66,7 +66,8 @@ class FsHistoryProviderSuite extends SparkFunSuite with BeforeAndAfter with Matc } test("Parse application logs") { -val provider = new FsHistoryProvider(createTestConf()) +val clock = new ManualClock(12345678) +val provider = new FsHistoryProvider(createTestConf(), clock) // Write a new-style application log. val newAppComplete = newLogFile("new1", None, inProgress = false) @@ -109,12 +110,15 @@ class FsHistoryProviderSuite extends SparkFunSuite with BeforeAndAfter with Matc List(ApplicationAttemptInfo(None, start, end, lastMod, user, completed)))
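The two behavioural changes above reduce to a small amount of logic, sketched here outside of FsHistoryProvider. The names and types are simplified placeholders, not the actual class members.

```
// Simplified Scala sketch of the cleanup rules this commit introduces.
// lastUpdated: completed apps keep their file modification time, while
// in-progress apps use the time at which the log was loaded.
def lastUpdated(appCompleted: Boolean, modTime: Long, loadTimeMillis: Long): Long =
  if (appCompleted) modTime else loadTimeMillis

// After the change, staleness alone decides cleanup, so orphaned
// in-progress logs are eventually removed as well.
def shouldClean(lastUpdated: Long, now: Long, maxAge: Long): Boolean =
  now - lastUpdated > maxAge
```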
spark git commit: [SPARK-18662][HOTFIX] Add new resource-managers directories to SparkLauncher.
Repository: spark Updated Branches: refs/heads/master 6a5a7254d -> b44d1b8fc

[SPARK-18662][HOTFIX] Add new resource-managers directories to SparkLauncher.

These directories are added to the classpath of applications when testing or using SPARK_PREPEND_CLASSES; otherwise updated classes are not seen. Also, add the mesos directory, which was missing.

Author: Marcelo Vanzin
Closes #16202 from vanzin/SPARK-18662.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b44d1b8f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b44d1b8f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b44d1b8f
Branch: refs/heads/master
Commit: b44d1b8fcf00b238df434cf70ad09460b27adf07
Parents: 6a5a725
Author: Marcelo Vanzin
Authored: Thu Dec 8 09:48:33 2016 -0800
Committer: Marcelo Vanzin
Committed: Thu Dec 8 09:48:33 2016 -0800
--
.../java/org/apache/spark/launcher/AbstractCommandBuilder.java | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
--
http://git-wip-us.apache.org/repos/asf/spark/blob/b44d1b8f/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java
--
diff --git a/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java b/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java index c748808..ba43659 100644 --- a/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java +++ b/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java @@ -158,12 +158,13 @@ abstract class AbstractCommandBuilder { "launcher", "mllib", "repl", +"resource-managers/mesos", +"resource-managers/yarn", "sql/catalyst", "sql/core", "sql/hive", "sql/hive-thriftserver", -"streaming", -"yarn" +"streaming" ); if (prependClasses) { if (!isTesting) {
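The directory list above only matters when class files are picked up straight from the build output rather than from the packaged jars. A hedged shell illustration of the workflow that relies on it (the build and shell scripts shown are the standard Spark ones; the actual classpath assembly is done by AbstractCommandBuilder):

```
# With SPARK_PREPEND_CLASSES set, locally built classes - including the new
# resource-managers/* modules - are put on the classpath ahead of the assembly
# jars, so rebuilt classes are visible without repackaging.
export SPARK_PREPEND_CLASSES=1
./build/sbt compile
./bin/spark-shell
```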
spark git commit: [SPARK-18667][PYSPARK][SQL] Change the way to group row in BatchEvalPythonExec so input_file_name function can work with UDF in pyspark
Repository: spark Updated Branches: refs/heads/branch-2.1 9095c152e -> 726217eb7

[SPARK-18667][PYSPARK][SQL] Change the way to group row in BatchEvalPythonExec so input_file_name function can work with UDF in pyspark

## What changes were proposed in this pull request?

`input_file_name` doesn't return the filename when working with a UDF in PySpark. An example shows the problem:

    from pyspark.sql.functions import *
    from pyspark.sql.types import *

    def filename(path):
        return path

    sourceFile = udf(filename, StringType())
    spark.read.json("tmp.json").select(sourceFile(input_file_name())).show()

    +---------------------------+
    |filename(input_file_name())|
    +---------------------------+
    |                           |
    +---------------------------+

The cause of this issue is that we group rows in `BatchEvalPythonExec` for batch processing of PythonUDF. Currently we group rows first and then evaluate expressions on the rows. If the data is less than the required number of rows for a group, the iterator will be consumed to the end before the evaluation. However, once the iterator reaches the end, we unset the input filename, so the input_file_name expression can't return the correct filename. This patch fixes the approach used to group the batch of rows: we evaluate the expressions first and then group the evaluated results into batches.

## How was this patch tested?

Added unit test to PySpark. Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh
Closes #16115 from viirya/fix-py-udf-input-filename.

(cherry picked from commit 6a5a7254dc37952505989e9e580a14543adb730c)
Signed-off-by: Wenchen Fan

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/726217eb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/726217eb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/726217eb
Branch: refs/heads/branch-2.1
Commit: 726217eb7f783e10571a043546694b5b3c90ac77
Parents: 9095c15
Author: Liang-Chi Hsieh
Authored: Thu Dec 8 23:22:18 2016 +0800
Committer: Wenchen Fan
Committed: Thu Dec 8 23:22:40 2016 +0800
--
python/pyspark/sql/tests.py                    |  8 +
.../execution/python/BatchEvalPythonExec.scala | 35 +---
2 files changed, 24 insertions(+), 19 deletions(-)
--
http://git-wip-us.apache.org/repos/asf/spark/blob/726217eb/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 50df68b..66320bd 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -412,6 +412,14 @@ class SQLTests(ReusedPySparkTestCase): res.explain(True) self.assertEqual(res.collect(), [Row(id=0, copy=0)]) +def test_udf_with_input_file_name(self): +from pyspark.sql.functions import udf, input_file_name +from pyspark.sql.types import StringType +sourceFile = udf(lambda path: path, StringType()) +filePath = "python/test_support/sql/people1.json" +row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first() +self.assertTrue(row[0].find("people1.json") != -1) + def test_basic_functions(self): rdd = self.sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}']) df = self.spark.read.json(rdd) http://git-wip-us.apache.org/repos/asf/spark/blob/726217eb/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala index dcaf2c7..7a5ac48 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala +++
b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala @@ -119,26 +119,23 @@ case class BatchEvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], chi val pickle = new Pickler(needConversion) // Input iterator to Python: input rows are grouped so we send them in batches to Python. // For each row, add it to the queue. - val inputIterator = iter.grouped(100).map { inputRows => -val toBePickled = inputRows.map { inputRow => - queue.add(inputRow.asInstanceOf[UnsafeRow]) - val row = projection(inputRow) - if (needConversion) { -EvaluatePython.toJava(row, schema) - } else { -// fast path for these types that does not need conversion in
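The ordering problem described above can be reproduced outside Spark with a toy generator whose per-row state is cleared once the iterator is exhausted, similar to how the input file name is unset at the end of a partition. This is plain illustrative Python, not the BatchEvalPythonExec code.

```
# Toy model: per-row state that is unset once the row iterator is exhausted.
state = {"current_file": None}

def rows():
    for f in ["a.json", "b.json"]:
        state["current_file"] = f     # set while this row is being produced
        yield f
    state["current_file"] = None      # unset when the iterator is exhausted

def evaluate(row):
    return state["current_file"]      # like input_file_name(): reads state lazily

# Group first (old behaviour): the small input is drained into one batch
# before evaluation, so the state is already unset by evaluation time.
batch = list(rows())
print([evaluate(r) for r in batch])       # [None, None]

# Evaluate first, then batch (new behaviour): state is still set per row.
print([evaluate(r) for r in rows()])      # ['a.json', 'b.json']
```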
spark git commit: [SPARK-18667][PYSPARK][SQL] Change the way to group row in BatchEvalPythonExec so input_file_name function can work with UDF in pyspark
Repository: spark Updated Branches: refs/heads/master 7f3c778fd -> 6a5a7254d

[SPARK-18667][PYSPARK][SQL] Change the way to group row in BatchEvalPythonExec so input_file_name function can work with UDF in pyspark

## What changes were proposed in this pull request?

`input_file_name` doesn't return the filename when working with a UDF in PySpark. An example shows the problem:

    from pyspark.sql.functions import *
    from pyspark.sql.types import *

    def filename(path):
        return path

    sourceFile = udf(filename, StringType())
    spark.read.json("tmp.json").select(sourceFile(input_file_name())).show()

    +---------------------------+
    |filename(input_file_name())|
    +---------------------------+
    |                           |
    +---------------------------+

The cause of this issue is that we group rows in `BatchEvalPythonExec` for batch processing of PythonUDF. Currently we group rows first and then evaluate expressions on the rows. If the data is less than the required number of rows for a group, the iterator will be consumed to the end before the evaluation. However, once the iterator reaches the end, we unset the input filename, so the input_file_name expression can't return the correct filename. This patch fixes the approach used to group the batch of rows: we evaluate the expressions first and then group the evaluated results into batches.

## How was this patch tested?

Added unit test to PySpark. Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh
Closes #16115 from viirya/fix-py-udf-input-filename.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6a5a7254
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6a5a7254
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6a5a7254
Branch: refs/heads/master
Commit: 6a5a7254dc37952505989e9e580a14543adb730c
Parents: 7f3c778
Author: Liang-Chi Hsieh
Authored: Thu Dec 8 23:22:18 2016 +0800
Committer: Wenchen Fan
Committed: Thu Dec 8 23:22:18 2016 +0800
--
python/pyspark/sql/tests.py                    |  8 +
.../execution/python/BatchEvalPythonExec.scala | 35 +---
2 files changed, 24 insertions(+), 19 deletions(-)
--
http://git-wip-us.apache.org/repos/asf/spark/blob/6a5a7254/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 50df68b..66320bd 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -412,6 +412,14 @@ class SQLTests(ReusedPySparkTestCase): res.explain(True) self.assertEqual(res.collect(), [Row(id=0, copy=0)]) +def test_udf_with_input_file_name(self): +from pyspark.sql.functions import udf, input_file_name +from pyspark.sql.types import StringType +sourceFile = udf(lambda path: path, StringType()) +filePath = "python/test_support/sql/people1.json" +row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first() +self.assertTrue(row[0].find("people1.json") != -1) + def test_basic_functions(self): rdd = self.sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}']) df = self.spark.read.json(rdd) http://git-wip-us.apache.org/repos/asf/spark/blob/6a5a7254/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala index dcaf2c7..7a5ac48 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala @@ -119,26 +119,23 @@ case class
BatchEvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], chi val pickle = new Pickler(needConversion) // Input iterator to Python: input rows are grouped so we send them in batches to Python. // For each row, add it to the queue. - val inputIterator = iter.grouped(100).map { inputRows => -val toBePickled = inputRows.map { inputRow => - queue.add(inputRow.asInstanceOf[UnsafeRow]) - val row = projection(inputRow) - if (needConversion) { -EvaluatePython.toJava(row, schema) - } else { -// fast path for these types that does not need conversion in Python -val fields = new Array[Any](row.numFields) -var i = 0 -while (i < row.numFields) { -
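After the fix, the reproduction from the commit message should show the actual path. A minimal PySpark check, assuming an active SparkSession named `spark` and some JSON input file (the path below is just an example):

```
from pyspark.sql.functions import input_file_name, udf
from pyspark.sql.types import StringType

# The UDF should now see the real file path instead of an empty string.
source_file = udf(lambda path: path, StringType())
df = spark.read.json("tmp.json")   # example input path
df.select(source_file(input_file_name())).show(truncate=False)
```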
spark git commit: [SPARK-18718][TESTS] Skip some test failures due to path length limitation and fix tests to pass on Windows
Repository: spark Updated Branches: refs/heads/master 9bf8f3cd4 -> 7f3c778fd

[SPARK-18718][TESTS] Skip some test failures due to path length limitation and fix tests to pass on Windows

## What changes were proposed in this pull request?

There are some tests that fail on Windows due to wrongly formatted paths and to the path length limitation, as below.

This PR proposes both to fix the failing tests by fixing the paths for the tests below:

- `InsertSuite`

```
Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.sources.InsertSuite *** ABORTED *** (12 seconds, 547 milliseconds)
org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectssparkarget mpspark-177945ef-9128-42b4-8c07-de31f78bbbd6;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
```

- `PathOptionSuite`

```
- path option also exist for write path *** FAILED *** (1 second, 93 milliseconds)
"C:[projectsspark arget mp]spark-5ab34a58-df8d-..." did not equal "C:[\projects\spark\target\tmp\]spark-5ab34a58-df8d-..." (PathOptionSuite.scala:93)
org.scalatest.exceptions.TestFailedException:
at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
...
```

- `UDFSuite`

```
- SPARK-8005 input_file_name *** FAILED *** (2 seconds, 234 milliseconds)
"file:///C:/projects/spark/target/tmp/spark-e4e5720a-2006-48f9-8b11-797bf59794bf/part-1-26fb05e4-603d-471d-ae9d-b9549e0c7765.snappy.parquet" did not contain "C:\projects\spark\target\tmp\spark-e4e5720a-2006-48f9-8b11-797bf59794bf" (UDFSuite.scala:67)
org.scalatest.exceptions.TestFailedException:
at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
...
```

and to skip the tests below, which fail on Windows due to the path length limitation:

- `SparkLauncherSuite`

```
Test org.apache.spark.launcher.SparkLauncherSuite.testChildProcLauncher failed: java.lang.AssertionError: expected:<0> but was:<1>, took 0.062 sec
at org.apache.spark.launcher.SparkLauncherSuite.testChildProcLauncher(SparkLauncherSuite.java:177)
...
```

The stderr from the process is `The filename or extension is too long`, which is equivalent to the one below.

- `BroadcastJoinSuite`

```
04:09:40.882 ERROR org.apache.spark.deploy.worker.ExecutorRunner: Error running executor
java.io.IOException: Cannot run program "C:\Progra~1\Java\jdk1.8.0\bin\java" (in directory "C:\projects\spark\work\app-20161205040542-\51658"): CreateProcess error=206, The filename or extension is too long
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.deploy.worker.ExecutorRunner.org$apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:167)
at org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73)
Caused by: java.io.IOException: CreateProcess error=206, The filename or extension is too long
at java.lang.ProcessImpl.create(Native Method)
at java.lang.ProcessImpl.<init>(ProcessImpl.java:386)
at java.lang.ProcessImpl.start(ProcessImpl.java:137)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 2 more
04:09:40.929 ERROR org.apache.spark.deploy.worker.ExecutorRunner: Error running executor (apparently the same error message repeats indefinitely)
...
```

## How was this patch tested?

Manually tested via AppVeyor.

**Before**
`InsertSuite`: https://ci.appveyor.com/project/spark-test/spark/build/148-InsertSuite-pr
`PathOptionSuite`: https://ci.appveyor.com/project/spark-test/spark/build/139-PathOptionSuite-pr
`UDFSuite`: https://ci.appveyor.com/project/spark-test/spark/build/143-UDFSuite-pr
`SparkLauncherSuite`: https://ci.appveyor.com/project/spark-test/spark/build/141-SparkLauncherSuite-pr
`BroadcastJoinSuite`: https://ci.appveyor.com/project/spark-test/spark/build/145-BroadcastJoinSuite-pr

**After**
`PathOptionSuite`: https://ci.appveyor.com/project/spark-test/spark/build/140-PathOptionSuite-pr
`SparkLauncherSuite`: https://ci.appveyor.com/project/spark-test/spark/build/142-SparkLauncherSuite-pr
`UDFSuite`: https://ci.appveyor.com/project/spark-test/spark/build/144-UDFSuite-pr
`InsertSuite`: https://ci.appveyor.com/project/spark-test/spark/build/147-InsertSuite-pr
`BroadcastJoinSuite`: https://ci.appveyor.com/project/spark-test/spark/build/149-BroadcastJoinSuite-pr

Author: hyukjinkwon
Closes #16147 from HyukjinKwon/fix-tests.
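Several of the failures above come down to comparing a Windows-style absolute path against its URI form. A small Scala illustration of the mismatch (the path is an example, and this is not the test code itself):

```
import java.io.File

// On Windows the absolute path uses backslashes, while the URI form uses
// forward slashes and a file: scheme, so assertions must normalize one side.
val dir = new File("""C:\projects\spark\target\tmp\spark-example""")
println(dir.getAbsolutePath)   // C:\projects\spark\target\tmp\spark-example (on Windows)
println(dir.toURI.toString)    // file:/C:/projects/spark/target/tmp/spark-example
```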
spark git commit: [SPARK-18325][SPARKR][ML] SparkR ML wrappers example code and user guide
Repository: spark Updated Branches: refs/heads/branch-2.1 48aa6775d -> 9095c152e

[SPARK-18325][SPARKR][ML] SparkR ML wrappers example code and user guide

## What changes were proposed in this pull request?

* Add all R examples for ML wrappers which were added during the 2.1 release cycle.
* Split the whole ```ml.R``` example file into an individual example for each algorithm, which will be convenient for users to rerun them.
* Add corresponding examples to the ML user guide.
* Update the ML section of the SparkR user guide.

Note: MLlib Scala/Java/Python examples will be consistent; however, SparkR examples may differ from them, since R users may use the algorithms in a different way, for example, using an R ```formula``` to specify ```featuresCol``` and ```labelCol```.

## How was this patch tested?

Ran all examples manually.

Author: Yanbo Liang
Closes #16148 from yanboliang/spark-18325.

(cherry picked from commit 9bf8f3cd4f62f921c32fb50b8abf49576a80874f)
Signed-off-by: Yanbo Liang

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9095c152
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9095c152
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9095c152
Branch: refs/heads/branch-2.1
Commit: 9095c152e7fedf469dcc4887f5b6a1882cd74c28
Parents: 48aa677
Author: Yanbo Liang
Authored: Thu Dec 8 06:19:38 2016 -0800
Committer: Yanbo Liang
Committed: Thu Dec 8 06:20:28 2016 -0800
--
docs/ml-classification-regression.md     |  67 +++-
docs/ml-clustering.md                    |  18 +++-
docs/ml-collaborative-filtering.md       |   8 ++
docs/sparkr.md                           |  46
examples/src/main/r/ml.R                 | 148 --
examples/src/main/r/ml/als.R             |  45
examples/src/main/r/ml/gaussianMixture.R |  42
examples/src/main/r/ml/gbt.R             |  63 +++
examples/src/main/r/ml/glm.R             |  57 ++
examples/src/main/r/ml/isoreg.R          |  42
examples/src/main/r/ml/kmeans.R          |  44
examples/src/main/r/ml/kstest.R          |  39 +++
examples/src/main/r/ml/lda.R             |  46
examples/src/main/r/ml/logit.R           |  63 +++
examples/src/main/r/ml/ml.R              |  65 +++
examples/src/main/r/ml/mlp.R             |  48 +
examples/src/main/r/ml/naiveBayes.R      |  41 +++
examples/src/main/r/ml/randomForest.R    |  63 +++
examples/src/main/r/ml/survreg.R         |  43
19 files changed, 810 insertions(+), 178 deletions(-)
--
http://git-wip-us.apache.org/repos/asf/spark/blob/9095c152/docs/ml-classification-regression.md
--
diff --git a/docs/ml-classification-regression.md b/docs/ml-classification-regression.md index 557a53c..2ffea64 100644 --- a/docs/ml-classification-regression.md +++ b/docs/ml-classification-regression.md @@ -75,6 +75,13 @@ More details on parameters can be found in the [Python API documentation](api/py {% include_example python/ml/logistic_regression_with_elastic_net.py %} + + +More details on parameters can be found in the [R API documentation](api/R/spark.logit.html). + +{% include_example binomial r/ml/logit.R %} + + The `spark.ml` implementation of logistic regression also supports @@ -171,6 +178,13 @@ model with elastic net regularization. {% include_example python/ml/multiclass_logistic_regression_with_elastic_net.py %} + + +More details on parameters can be found in the [R API documentation](api/R/spark.logit.html). + +{% include_example multinomial r/ml/logit.R %} + + @@ -242,6 +256,14 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat {% include_example python/ml/random_forest_classifier_example.py %} + + + +Refer to the [R API docs](api/R/spark.randomForest.html) for more details.
+ +{% include_example classification r/ml/randomForest.R %} + + ## Gradient-boosted tree classifier @@ -275,6 +297,14 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat {% include_example python/ml/gradient_boosted_tree_classifier_example.py %} + + + +Refer to the [R API docs](api/R/spark.gbt.html) for more details. + +{% include_example classification r/ml/gbt.R %} + + ## Multilayer perceptron classifier @@ -324,6 +354,13 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat {% include_example python/ml/multilayer_perceptron_classification.py %} + + +Refer to the [R API docs](api/R/spark.mlp.html) for more details. + +{% include_example r/ml/mlp.R %} + + @@ -400,7 +437,7 @@ Refer to the
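To give a flavour of the per-algorithm R examples being added, here is a minimal SparkR-style sketch that uses an R formula to pick the label and feature columns. It assumes an active SparkR session (created with sparkR.session()) and is not one of the added example files verbatim.

```
# Minimal SparkR sketch: generalized linear model via an R formula.
df <- createDataFrame(iris)   # columns become Sepal_Length, Sepal_Width, ...
model <- spark.glm(df, Sepal_Length ~ Sepal_Width + Petal_Length, family = "gaussian")
summary(model)
head(predict(model, df))
```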
spark git commit: [SPARK-18325][SPARKR][ML] SparkR ML wrappers example code and user guide
Repository: spark Updated Branches: refs/heads/master b47b892e4 -> 9bf8f3cd4

[SPARK-18325][SPARKR][ML] SparkR ML wrappers example code and user guide

## What changes were proposed in this pull request?

* Add all R examples for ML wrappers which were added during the 2.1 release cycle.
* Split the whole ```ml.R``` example file into an individual example for each algorithm, which will be convenient for users to rerun them.
* Add corresponding examples to the ML user guide.
* Update the ML section of the SparkR user guide.

Note: MLlib Scala/Java/Python examples will be consistent; however, SparkR examples may differ from them, since R users may use the algorithms in a different way, for example, using an R ```formula``` to specify ```featuresCol``` and ```labelCol```.

## How was this patch tested?

Ran all examples manually.

Author: Yanbo Liang
Closes #16148 from yanboliang/spark-18325.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9bf8f3cd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9bf8f3cd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9bf8f3cd
Branch: refs/heads/master
Commit: 9bf8f3cd4f62f921c32fb50b8abf49576a80874f
Parents: b47b892
Author: Yanbo Liang
Authored: Thu Dec 8 06:19:38 2016 -0800
Committer: Yanbo Liang
Committed: Thu Dec 8 06:19:38 2016 -0800
--
docs/ml-classification-regression.md     |  67 +++-
docs/ml-clustering.md                    |  18 +++-
docs/ml-collaborative-filtering.md       |   8 ++
docs/sparkr.md                           |  46
examples/src/main/r/ml.R                 | 148 --
examples/src/main/r/ml/als.R             |  45
examples/src/main/r/ml/gaussianMixture.R |  42
examples/src/main/r/ml/gbt.R             |  63 +++
examples/src/main/r/ml/glm.R             |  57 ++
examples/src/main/r/ml/isoreg.R          |  42
examples/src/main/r/ml/kmeans.R          |  44
examples/src/main/r/ml/kstest.R          |  39 +++
examples/src/main/r/ml/lda.R             |  46
examples/src/main/r/ml/logit.R           |  63 +++
examples/src/main/r/ml/ml.R              |  65 +++
examples/src/main/r/ml/mlp.R             |  48 +
examples/src/main/r/ml/naiveBayes.R      |  41 +++
examples/src/main/r/ml/randomForest.R    |  63 +++
examples/src/main/r/ml/survreg.R         |  43
19 files changed, 810 insertions(+), 178 deletions(-)
--
http://git-wip-us.apache.org/repos/asf/spark/blob/9bf8f3cd/docs/ml-classification-regression.md
--
diff --git a/docs/ml-classification-regression.md b/docs/ml-classification-regression.md index bb9390f..782ee58 100644 --- a/docs/ml-classification-regression.md +++ b/docs/ml-classification-regression.md @@ -75,6 +75,13 @@ More details on parameters can be found in the [Python API documentation](api/py {% include_example python/ml/logistic_regression_with_elastic_net.py %} + + +More details on parameters can be found in the [R API documentation](api/R/spark.logit.html). + +{% include_example binomial r/ml/logit.R %} + + The `spark.ml` implementation of logistic regression also supports @@ -171,6 +178,13 @@ model with elastic net regularization. {% include_example python/ml/multiclass_logistic_regression_with_elastic_net.py %} + + +More details on parameters can be found in the [R API documentation](api/R/spark.logit.html). + +{% include_example multinomial r/ml/logit.R %} + + @@ -242,6 +256,14 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat {% include_example python/ml/random_forest_classifier_example.py %} + + + +Refer to the [R API docs](api/R/spark.randomForest.html) for more details.
+ +{% include_example classification r/ml/randomForest.R %} + + ## Gradient-boosted tree classifier @@ -275,6 +297,14 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat {% include_example python/ml/gradient_boosted_tree_classifier_example.py %} + + + +Refer to the [R API docs](api/R/spark.gbt.html) for more details. + +{% include_example classification r/ml/gbt.R %} + + ## Multilayer perceptron classifier @@ -324,6 +354,13 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat {% include_example python/ml/multilayer_perceptron_classification.py %} + + +Refer to the [R API docs](api/R/spark.mlp.html) for more details. + +{% include_example r/ml/mlp.R %} + + @@ -400,7 +437,7 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat Refer to the [R API docs](api/R/spark.naiveBayes.html) for
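In the same spirit, a random forest classification sketch using the formula interface; the dataset, parameters and session setup are assumptions made for illustration, not the contents of the added randomForest.R example.

```
# Minimal SparkR sketch: random forest classification via an R formula.
df <- createDataFrame(iris)
model <- spark.randomForest(df, Species ~ Petal_Length + Petal_Width,
                            type = "classification", numTrees = 10)
summary(model)
head(predict(model, df))
```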