spark git commit: Copy pyspark and SparkR packages to latest release dir too

2016-12-08 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 e8f351f9a -> 2c88e1dc3


Copy pyspark and SparkR packages to latest release dir too

## What changes were proposed in this pull request?

Copy pyspark and SparkR packages to latest release dir, as per comment 
[here](https://github.com/apache/spark/pull/16226#discussion_r91664822)

Author: Felix Cheung 

Closes #16227 from felixcheung/pyrftp.

(cherry picked from commit c074c96dc57bf18b28fafdcac0c768d75c642cba)
Signed-off-by: Shivaram Venkataraman 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2c88e1dc
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2c88e1dc
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2c88e1dc

Branch: refs/heads/branch-2.1
Commit: 2c88e1dc31e1b90605ad8ab85b20b131b4b3c722
Parents: e8f351f
Author: Felix Cheung 
Authored: Thu Dec 8 22:52:34 2016 -0800
Committer: Shivaram Venkataraman 
Committed: Thu Dec 8 22:53:02 2016 -0800

--
 dev/create-release/release-build.sh | 2 ++
 1 file changed, 2 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2c88e1dc/dev/create-release/release-build.sh
--
diff --git a/dev/create-release/release-build.sh 
b/dev/create-release/release-build.sh
index 7c77791..c0663b8 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -251,6 +251,8 @@ if [[ "$1" == "package" ]]; then
   # Put to new directory:
   LFTP mkdir -p $dest_dir
   LFTP mput -O $dest_dir 'spark-*'
+  LFTP mput -O $dest_dir 'pyspark-*'
+  LFTP mput -O $dest_dir 'SparkR-*'
   # Delete /latest directory and rename new upload to /latest
   LFTP "rm -r -f $REMOTE_PARENT_DIR/latest || exit 0"
   LFTP mv $dest_dir "$REMOTE_PARENT_DIR/latest"





spark git commit: Copy pyspark and SparkR packages to latest release dir too

2016-12-08 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/master 934035ae7 -> c074c96dc


Copy pyspark and SparkR packages to latest release dir too

## What changes were proposed in this pull request?

Copy pyspark and SparkR packages to latest release dir, as per comment 
[here](https://github.com/apache/spark/pull/16226#discussion_r91664822)

Author: Felix Cheung 

Closes #16227 from felixcheung/pyrftp.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c074c96d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c074c96d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c074c96d

Branch: refs/heads/master
Commit: c074c96dc57bf18b28fafdcac0c768d75c642cba
Parents: 934035a
Author: Felix Cheung 
Authored: Thu Dec 8 22:52:34 2016 -0800
Committer: Shivaram Venkataraman 
Committed: Thu Dec 8 22:52:34 2016 -0800

--
 dev/create-release/release-build.sh | 2 ++
 1 file changed, 2 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c074c96d/dev/create-release/release-build.sh
--
diff --git a/dev/create-release/release-build.sh 
b/dev/create-release/release-build.sh
index 7c77791..c0663b8 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -251,6 +251,8 @@ if [[ "$1" == "package" ]]; then
   # Put to new directory:
   LFTP mkdir -p $dest_dir
   LFTP mput -O $dest_dir 'spark-*'
+  LFTP mput -O $dest_dir 'pyspark-*'
+  LFTP mput -O $dest_dir 'SparkR-*'
   # Delete /latest directory and rename new upload to /latest
   LFTP "rm -r -f $REMOTE_PARENT_DIR/latest || exit 0"
   LFTP mv $dest_dir "$REMOTE_PARENT_DIR/latest"





spark git commit: Copy the SparkR source package with LFTP

2016-12-08 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 4ceed95b4 -> e8f351f9a


Copy the SparkR source package with LFTP

This PR adds a line in release-build.sh to copy the SparkR source archive using 
LFTP

Author: Shivaram Venkataraman 

Closes #16226 from shivaram/fix-sparkr-copy-build.

(cherry picked from commit 934035ae7cb648fe61665d8efe0b7aa2bbe4ca47)
Signed-off-by: Shivaram Venkataraman 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e8f351f9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e8f351f9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e8f351f9

Branch: refs/heads/branch-2.1
Commit: e8f351f9a670fc4d43f15c8d7cd57e49fb9ceba2
Parents: 4ceed95b
Author: Shivaram Venkataraman 
Authored: Thu Dec 8 22:21:24 2016 -0800
Committer: Shivaram Venkataraman 
Committed: Thu Dec 8 22:21:36 2016 -0800

--
 dev/create-release/release-build.sh | 1 +
 1 file changed, 1 insertion(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e8f351f9/dev/create-release/release-build.sh
--
diff --git a/dev/create-release/release-build.sh 
b/dev/create-release/release-build.sh
index 1b05b20..7c77791 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -258,6 +258,7 @@ if [[ "$1" == "package" ]]; then
   LFTP mkdir -p $dest_dir
   LFTP mput -O $dest_dir 'spark-*'
   LFTP mput -O $dest_dir 'pyspark-*'
+  LFTP mput -O $dest_dir 'SparkR-*'
   exit 0
 fi
 





spark git commit: Copy the SparkR source package with LFTP

2016-12-08 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/master 9338aa4f8 -> 934035ae7


Copy the SparkR source package with LFTP

This PR adds a line in release-build.sh to copy the SparkR source archive using 
LFTP

Author: Shivaram Venkataraman 

Closes #16226 from shivaram/fix-sparkr-copy-build.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/934035ae
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/934035ae
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/934035ae

Branch: refs/heads/master
Commit: 934035ae7cb648fe61665d8efe0b7aa2bbe4ca47
Parents: 9338aa4
Author: Shivaram Venkataraman 
Authored: Thu Dec 8 22:21:24 2016 -0800
Committer: Shivaram Venkataraman 
Committed: Thu Dec 8 22:21:24 2016 -0800

--
 dev/create-release/release-build.sh | 1 +
 1 file changed, 1 insertion(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/934035ae/dev/create-release/release-build.sh
--
diff --git a/dev/create-release/release-build.sh 
b/dev/create-release/release-build.sh
index 1b05b20..7c77791 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -258,6 +258,7 @@ if [[ "$1" == "package" ]]; then
   LFTP mkdir -p $dest_dir
   LFTP mput -O $dest_dir 'spark-*'
   LFTP mput -O $dest_dir 'pyspark-*'
+  LFTP mput -O $dest_dir 'SparkR-*'
   exit 0
 fi
 





spark git commit: [SPARK-18697][BUILD] Upgrade sbt plugins

2016-12-08 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 86a96034c -> 9338aa4f8


[SPARK-18697][BUILD] Upgrade sbt plugins

## What changes were proposed in this pull request?

This PR is to upgrade sbt plugins. The following sbt plugins will be upgraded:
```
sbteclipse-plugin: 4.0.0 -> 5.0.1
sbt-mima-plugin: 0.1.11 -> 0.1.12
org.ow2.asm/asm: 5.0.3 -> 5.1
org.ow2.asm/asm-commons: 5.0.3 -> 5.1
```
## How was this patch tested?
Pass the Jenkins build.

Author: Weiqing Yang 

Closes #16223 from weiqingy/SPARK_18697.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9338aa4f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9338aa4f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9338aa4f

Branch: refs/heads/master
Commit: 9338aa4f89821c5640f7d007a67f9b947aa2bcd4
Parents: 86a9603
Author: Weiqing Yang 
Authored: Fri Dec 9 14:13:01 2016 +0800
Committer: Sean Owen 
Committed: Fri Dec 9 14:13:01 2016 +0800

--
 project/plugins.sbt | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9338aa4f/project/plugins.sbt
--
diff --git a/project/plugins.sbt b/project/plugins.sbt
index 76597d2..84d1239 100644
--- a/project/plugins.sbt
+++ b/project/plugins.sbt
@@ -1,12 +1,12 @@
 addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")
 
-addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0")
+addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "5.0.1")
 
 addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.8.2")
 
 addSbtPlugin("org.scalastyle" %% "scalastyle-sbt-plugin" % "0.8.0")
 
-addSbtPlugin("com.typesafe" % "sbt-mima-plugin" % "0.1.11")
+addSbtPlugin("com.typesafe" % "sbt-mima-plugin" % "0.1.12")
 
 addSbtPlugin("com.alpinenow" % "junit_xml_listener" % "0.5.1")
 
@@ -16,9 +16,9 @@ addSbtPlugin("com.cavorite" % "sbt-avro" % "0.3.2")
 
 addSbtPlugin("io.spray" % "sbt-revolver" % "0.8.0")
 
-libraryDependencies += "org.ow2.asm"  % "asm" % "5.0.3"
+libraryDependencies += "org.ow2.asm"  % "asm" % "5.1"
 
-libraryDependencies += "org.ow2.asm"  % "asm-commons" % "5.0.3"
+libraryDependencies += "org.ow2.asm"  % "asm-commons" % "5.1"
 
 addSbtPlugin("com.simplytyped" % "sbt-antlr4" % "0.7.11")
 





spark git commit: [SPARK-18349][SPARKR] Update R API documentation on ml model summary

2016-12-08 Thread felixcheung
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 ef5646b4c -> 4ceed95b4


[SPARK-18349][SPARKR] Update R API documentation on ml model summary

## What changes were proposed in this pull request?
In this PR, the documentation of the `summary` method is improved to follow the format:

returns summary information of the fitted model, which is a list. The list includes ...

Since `summary` in R is mainly about the model, which is not the same as the `summary` object on the Scala side (if one exists), the Scala API doc is not linked here.

In the current documentation, some `@return` entries end with `.` and some don't; the missing periods are added.

Since the `spark.logit` `summary` is undergoing a big refactoring, this PR doesn't cover it. It will be updated when the `spark.logit` PR is merged.

## How was this patch tested?

Manual build.

Author: wm...@hotmail.com 

Closes #16150 from wangmiao1981/audit2.

(cherry picked from commit 86a96034ccb47c5bba2cd739d793240afcfc25f6)
Signed-off-by: Felix Cheung 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4ceed95b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4ceed95b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4ceed95b

Branch: refs/heads/branch-2.1
Commit: 4ceed95b43d0cd9665004865095a40926efcc289
Parents: ef5646b
Author: wm...@hotmail.com 
Authored: Thu Dec 8 22:08:19 2016 -0800
Committer: Felix Cheung 
Committed: Thu Dec 8 22:08:51 2016 -0800

--
 R/pkg/R/mllib.R| 147 
 R/pkg/inst/tests/testthat/test_mllib.R |   2 +
 2 files changed, 86 insertions(+), 63 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4ceed95b/R/pkg/R/mllib.R
--
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index 632e4ad..5df843c 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -191,7 +191,7 @@ predict_internal <- function(object, newData) {
 #' @param regParam regularization parameter for L2 regularization.
 #' @param ... additional arguments passed to the method.
 #' @aliases spark.glm,SparkDataFrame,formula-method
-#' @return \code{spark.glm} returns a fitted generalized linear model
+#' @return \code{spark.glm} returns a fitted generalized linear model.
 #' @rdname spark.glm
 #' @name spark.glm
 #' @export
@@ -277,12 +277,12 @@ setMethod("glm", signature(formula = "formula", family = 
"ANY", data = "SparkDat
 #  Returns the summary of a model produced by glm() or spark.glm(), similarly 
to R's summary().
 
 #' @param object a fitted generalized linear model.
-#' @return \code{summary} returns a summary object of the fitted model, a list 
of components
-#' including at least the coefficients matrix (which includes 
coefficients, standard error
-#' of coefficients, t value and p value), null/residual deviance, 
null/residual degrees of
-#' freedom, AIC and number of iterations IRLS takes. If there are 
collinear columns
-#' in you data, the coefficients matrix only provides coefficients.
-#'
+#' @return \code{summary} returns summary information of the fitted model, 
which is a list.
+#' The list of components includes at least the \code{coefficients} 
(coefficients matrix, which includes
+#' coefficients, standard error of coefficients, t value and p value),
+#' \code{null.deviance} (null/residual degrees of freedom), \code{aic} 
(AIC)
+#' and \code{iter} (number of iterations IRLS takes). If there are 
collinear columns in the data,
+#' the coefficients matrix only provides coefficients.
 #' @rdname spark.glm
 #' @export
 #' @note summary(GeneralizedLinearRegressionModel) since 2.0.0
@@ -328,7 +328,7 @@ setMethod("summary", signature(object = 
"GeneralizedLinearRegressionModel"),
 #  Prints the summary of GeneralizedLinearRegressionModel
 
 #' @rdname spark.glm
-#' @param x summary object of fitted generalized linear model returned by 
\code{summary} function
+#' @param x summary object of fitted generalized linear model returned by 
\code{summary} function.
 #' @export
 #' @note print.summary.GeneralizedLinearRegressionModel since 2.0.0
 print.summary.GeneralizedLinearRegressionModel <- function(x, ...) {
@@ -361,7 +361,7 @@ print.summary.GeneralizedLinearRegressionModel <- 
function(x, ...) {
 
 #' @param newData a SparkDataFrame for testing.
 #' @return \code{predict} returns a SparkDataFrame containing predicted labels 
in a column named
-#' "prediction"
+#' "prediction".
 #' @rdname spark.glm
 #' @export
 #' @note predict(GeneralizedLinearRegressionModel) since 1.5.0
@@ -375,7 +375,7 @@ setMethod("predict", signature(object = 
"GeneralizedLinearRegressionModel"),
 
 #' 

spark git commit: [SPARK-18349][SPARKR] Update R API documentation on ml model summary

2016-12-08 Thread felixcheung
Repository: spark
Updated Branches:
  refs/heads/master 4ac8b20bf -> 86a96034c


[SPARK-18349][SPARKR] Update R API documentation on ml model summary

## What changes were proposed in this pull request?
In this PR, the documentation of the `summary` method is improved to follow the format:

returns summary information of the fitted model, which is a list. The list includes ...

Since `summary` in R is mainly about the model, which is not the same as the `summary` object on the Scala side (if one exists), the Scala API doc is not linked here.

In the current documentation, some `@return` entries end with `.` and some don't; the missing periods are added.

Since the `spark.logit` `summary` is undergoing a big refactoring, this PR doesn't cover it. It will be updated when the `spark.logit` PR is merged.

## How was this patch tested?

Manual build.

Author: wm...@hotmail.com 

Closes #16150 from wangmiao1981/audit2.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/86a96034
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/86a96034
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/86a96034

Branch: refs/heads/master
Commit: 86a96034ccb47c5bba2cd739d793240afcfc25f6
Parents: 4ac8b20
Author: wm...@hotmail.com 
Authored: Thu Dec 8 22:08:19 2016 -0800
Committer: Felix Cheung 
Committed: Thu Dec 8 22:08:19 2016 -0800

--
 R/pkg/R/mllib.R| 147 
 R/pkg/inst/tests/testthat/test_mllib.R |   2 +
 2 files changed, 86 insertions(+), 63 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/86a96034/R/pkg/R/mllib.R
--
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index 632e4ad..5df843c 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -191,7 +191,7 @@ predict_internal <- function(object, newData) {
 #' @param regParam regularization parameter for L2 regularization.
 #' @param ... additional arguments passed to the method.
 #' @aliases spark.glm,SparkDataFrame,formula-method
-#' @return \code{spark.glm} returns a fitted generalized linear model
+#' @return \code{spark.glm} returns a fitted generalized linear model.
 #' @rdname spark.glm
 #' @name spark.glm
 #' @export
@@ -277,12 +277,12 @@ setMethod("glm", signature(formula = "formula", family = 
"ANY", data = "SparkDat
 #  Returns the summary of a model produced by glm() or spark.glm(), similarly 
to R's summary().
 
 #' @param object a fitted generalized linear model.
-#' @return \code{summary} returns a summary object of the fitted model, a list 
of components
-#' including at least the coefficients matrix (which includes 
coefficients, standard error
-#' of coefficients, t value and p value), null/residual deviance, 
null/residual degrees of
-#' freedom, AIC and number of iterations IRLS takes. If there are 
collinear columns
-#' in you data, the coefficients matrix only provides coefficients.
-#'
+#' @return \code{summary} returns summary information of the fitted model, 
which is a list.
+#' The list of components includes at least the \code{coefficients} 
(coefficients matrix, which includes
+#' coefficients, standard error of coefficients, t value and p value),
+#' \code{null.deviance} (null/residual degrees of freedom), \code{aic} 
(AIC)
+#' and \code{iter} (number of iterations IRLS takes). If there are 
collinear columns in the data,
+#' the coefficients matrix only provides coefficients.
 #' @rdname spark.glm
 #' @export
 #' @note summary(GeneralizedLinearRegressionModel) since 2.0.0
@@ -328,7 +328,7 @@ setMethod("summary", signature(object = 
"GeneralizedLinearRegressionModel"),
 #  Prints the summary of GeneralizedLinearRegressionModel
 
 #' @rdname spark.glm
-#' @param x summary object of fitted generalized linear model returned by 
\code{summary} function
+#' @param x summary object of fitted generalized linear model returned by 
\code{summary} function.
 #' @export
 #' @note print.summary.GeneralizedLinearRegressionModel since 2.0.0
 print.summary.GeneralizedLinearRegressionModel <- function(x, ...) {
@@ -361,7 +361,7 @@ print.summary.GeneralizedLinearRegressionModel <- 
function(x, ...) {
 
 #' @param newData a SparkDataFrame for testing.
 #' @return \code{predict} returns a SparkDataFrame containing predicted labels 
in a column named
-#' "prediction"
+#' "prediction".
 #' @rdname spark.glm
 #' @export
 #' @note predict(GeneralizedLinearRegressionModel) since 1.5.0
@@ -375,7 +375,7 @@ setMethod("predict", signature(object = 
"GeneralizedLinearRegressionModel"),
 
 #' @param newData a SparkDataFrame for testing.
 #' @return \code{predict} returns a SparkDataFrame containing predicted 
labeled in a 

spark git commit: [SPARKR][PYSPARK] Fix R source package name to match Spark version. Remove pip tar.gz from distribution

2016-12-08 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 1cafc76ea -> ef5646b4c


[SPARKR][PYSPARK] Fix R source package name to match Spark version. Remove pip 
tar.gz from distribution

## What changes were proposed in this pull request?

Fixes name of R source package so that the `cp` in release-build.sh works 
correctly.

Issue discussed in 
https://github.com/apache/spark/pull/16014#issuecomment-265867125

Author: Shivaram Venkataraman 

Closes #16221 from shivaram/fix-sparkr-release-build-name.

(cherry picked from commit 4ac8b20bf2f962d9b8b6b209468896758d49efe3)
Signed-off-by: Shivaram Venkataraman 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ef5646b4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ef5646b4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ef5646b4

Branch: refs/heads/branch-2.1
Commit: ef5646b4c6792a96e85d1dd4bb3103ba8306949b
Parents: 1cafc76
Author: Shivaram Venkataraman 
Authored: Thu Dec 8 18:26:54 2016 -0800
Committer: Shivaram Venkataraman 
Committed: Thu Dec 8 18:27:05 2016 -0800

--
 dev/make-distribution.sh | 9 +
 1 file changed, 9 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ef5646b4/dev/make-distribution.sh
--
diff --git a/dev/make-distribution.sh b/dev/make-distribution.sh
index fe281bb..4da7d57 100755
--- a/dev/make-distribution.sh
+++ b/dev/make-distribution.sh
@@ -222,11 +222,14 @@ fi
 # Make R package - this is used for both CRAN release and packing R layout 
into distribution
 if [ "$MAKE_R" == "true" ]; then
   echo "Building R source package"
+  R_PACKAGE_VERSION=`grep Version $SPARK_HOME/R/pkg/DESCRIPTION | awk '{print 
$NF}'`
   pushd "$SPARK_HOME/R" > /dev/null
   # Build source package and run full checks
   # Install source package to get it to generate vignettes, etc.
   # Do not source the check-cran.sh - it should be run from where it is for it 
to set SPARK_HOME
   NO_TESTS=1 CLEAN_INSTALL=1 "$SPARK_HOME/"R/check-cran.sh
+  # Make a copy of R source package matching the Spark release version.
+  cp $SPARK_HOME/R/SparkR_"$R_PACKAGE_VERSION".tar.gz 
$SPARK_HOME/R/SparkR_"$VERSION".tar.gz
   popd > /dev/null
 else
   echo "Skipping building R source package"
@@ -238,6 +241,12 @@ cp "$SPARK_HOME"/conf/*.template "$DISTDIR"/conf
 cp "$SPARK_HOME/README.md" "$DISTDIR"
 cp -r "$SPARK_HOME/bin" "$DISTDIR"
 cp -r "$SPARK_HOME/python" "$DISTDIR"
+
+# Remove the python distribution from dist/ if we built it
+if [ "$MAKE_PIP" == "true" ]; then
+  rm -f $DISTDIR/python/dist/pyspark-*.tar.gz
+fi
+
 cp -r "$SPARK_HOME/sbin" "$DISTDIR"
 # Copy SparkR if it exists
 if [ -d "$SPARK_HOME"/R/lib/SparkR ]; then





spark git commit: [SPARKR][PYSPARK] Fix R source package name to match Spark version. Remove pip tar.gz from distribution

2016-12-08 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/master 458fa3325 -> 4ac8b20bf


[SPARKR][PYSPARK] Fix R source package name to match Spark version. Remove pip 
tar.gz from distribution

## What changes were proposed in this pull request?

Fixes name of R source package so that the `cp` in release-build.sh works 
correctly.

Issue discussed in 
https://github.com/apache/spark/pull/16014#issuecomment-265867125

Author: Shivaram Venkataraman 

Closes #16221 from shivaram/fix-sparkr-release-build-name.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4ac8b20b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4ac8b20b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4ac8b20b

Branch: refs/heads/master
Commit: 4ac8b20bf2f962d9b8b6b209468896758d49efe3
Parents: 458fa33
Author: Shivaram Venkataraman 
Authored: Thu Dec 8 18:26:54 2016 -0800
Committer: Shivaram Venkataraman 
Committed: Thu Dec 8 18:26:54 2016 -0800

--
 dev/make-distribution.sh | 9 +
 1 file changed, 9 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4ac8b20b/dev/make-distribution.sh
--
diff --git a/dev/make-distribution.sh b/dev/make-distribution.sh
index fe281bb..4da7d57 100755
--- a/dev/make-distribution.sh
+++ b/dev/make-distribution.sh
@@ -222,11 +222,14 @@ fi
 # Make R package - this is used for both CRAN release and packing R layout 
into distribution
 if [ "$MAKE_R" == "true" ]; then
   echo "Building R source package"
+  R_PACKAGE_VERSION=`grep Version $SPARK_HOME/R/pkg/DESCRIPTION | awk '{print 
$NF}'`
   pushd "$SPARK_HOME/R" > /dev/null
   # Build source package and run full checks
   # Install source package to get it to generate vignettes, etc.
   # Do not source the check-cran.sh - it should be run from where it is for it 
to set SPARK_HOME
   NO_TESTS=1 CLEAN_INSTALL=1 "$SPARK_HOME/"R/check-cran.sh
+  # Make a copy of R source package matching the Spark release version.
+  cp $SPARK_HOME/R/SparkR_"$R_PACKAGE_VERSION".tar.gz 
$SPARK_HOME/R/SparkR_"$VERSION".tar.gz
   popd > /dev/null
 else
   echo "Skipping building R source package"
@@ -238,6 +241,12 @@ cp "$SPARK_HOME"/conf/*.template "$DISTDIR"/conf
 cp "$SPARK_HOME/README.md" "$DISTDIR"
 cp -r "$SPARK_HOME/bin" "$DISTDIR"
 cp -r "$SPARK_HOME/python" "$DISTDIR"
+
+# Remove the python distribution from dist/ if we built it
+if [ "$MAKE_PIP" == "true" ]; then
+  rm -f $DISTDIR/python/dist/pyspark-*.tar.gz
+fi
+
 cp -r "$SPARK_HOME/sbin" "$DISTDIR"
 # Copy SparkR if it exists
 if [ -d "$SPARK_HOME"/R/lib/SparkR ]; then





spark git commit: [SPARK-18774][CORE][SQL] Ignore non-existing files when ignoreCorruptFiles is enabled (branch 2.1)

2016-12-08 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 fcd22e538 -> 1cafc76ea


[SPARK-18774][CORE][SQL] Ignore non-existing files when ignoreCorruptFiles is 
enabled (branch 2.1)

## What changes were proposed in this pull request?

Backport #16203 to branch 2.1.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu 

Closes #16216 from zsxwing/SPARK-18774-2.1.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1cafc76e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1cafc76e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1cafc76e

Branch: refs/heads/branch-2.1
Commit: 1cafc76ea1e9eef40b24060d1cd7c4aaf9f16a49
Parents: fcd22e5
Author: Shixiong Zhu 
Authored: Thu Dec 8 17:58:44 2016 -0800
Committer: Reynold Xin 
Committed: Thu Dec 8 17:58:44 2016 -0800

--
 .../apache/spark/internal/config/package.scala  |  3 +-
 .../scala/org/apache/spark/rdd/HadoopRDD.scala  | 30 +++-
 .../org/apache/spark/rdd/NewHadoopRDD.scala | 50 
 .../sql/execution/datasources/FileScanRDD.scala |  3 ++
 .../org/apache/spark/sql/internal/SQLConf.scala |  3 +-
 5 files changed, 57 insertions(+), 32 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1cafc76e/core/src/main/scala/org/apache/spark/internal/config/package.scala
--
diff --git a/core/src/main/scala/org/apache/spark/internal/config/package.scala 
b/core/src/main/scala/org/apache/spark/internal/config/package.scala
index 4a3e3d5..8ce9883 100644
--- a/core/src/main/scala/org/apache/spark/internal/config/package.scala
+++ b/core/src/main/scala/org/apache/spark/internal/config/package.scala
@@ -203,7 +203,8 @@ package object config {
 
   private[spark] val IGNORE_CORRUPT_FILES = 
ConfigBuilder("spark.files.ignoreCorruptFiles")
 .doc("Whether to ignore corrupt files. If true, the Spark jobs will 
continue to run when " +
-  "encountering corrupt files and contents that have been read will still 
be returned.")
+  "encountering corrupted or non-existing files and contents that have 
been read will still " +
+  "be returned.")
 .booleanConf
 .createWithDefault(false)
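
For illustration, a hedged usage sketch of the key documented above (the app name and input path are made up): with the flag enabled, a job keeps running when some input files are corrupted or, with this change, no longer exist.

```
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: spark.files.ignoreCorruptFiles is the config key from the diff
// above; everything else (app name, paths) is illustrative.
object IgnoreCorruptFilesSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ignore-corrupt-files-sketch")
      .setMaster("local[*]")
      .set("spark.files.ignoreCorruptFiles", "true")
    val sc = new SparkContext(conf)
    // Files in the glob that are corrupted or have been deleted are skipped
    // instead of failing the job; contents already read are still returned.
    val lineCount = sc.textFile("/tmp/input/*.txt").count()
    println(s"lines read: $lineCount")
    sc.stop()
  }
}
```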
 

http://git-wip-us.apache.org/repos/asf/spark/blob/1cafc76e/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
--
diff --git a/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala 
b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
index 3133a28..b56ebf4 100644
--- a/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
+++ b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
@@ -210,12 +210,12 @@ class HadoopRDD[K, V](
   override def compute(theSplit: Partition, context: TaskContext): 
InterruptibleIterator[(K, V)] = {
 val iter = new NextIterator[(K, V)] {
 
-  val split = theSplit.asInstanceOf[HadoopPartition]
+  private val split = theSplit.asInstanceOf[HadoopPartition]
   logInfo("Input split: " + split.inputSplit)
-  val jobConf = getJobConf()
+  private val jobConf = getJobConf()
 
-  val inputMetrics = context.taskMetrics().inputMetrics
-  val existingBytesRead = inputMetrics.bytesRead
+  private val inputMetrics = context.taskMetrics().inputMetrics
+  private val existingBytesRead = inputMetrics.bytesRead
 
   // Sets the thread local variable for the file's name
   split.inputSplit.value match {
@@ -225,7 +225,7 @@ class HadoopRDD[K, V](
 
   // Find a function that will return the FileSystem bytes read by this 
thread. Do this before
   // creating RecordReader, because RecordReader's constructor might read 
some bytes
-  val getBytesReadCallback: Option[() => Long] = split.inputSplit.value 
match {
+  private val getBytesReadCallback: Option[() => Long] = 
split.inputSplit.value match {
 case _: FileSplit | _: CombineFileSplit =>
   SparkHadoopUtil.get.getFSBytesReadOnThreadCallback()
 case _ => None
@@ -235,23 +235,31 @@ class HadoopRDD[K, V](
   // If we do a coalesce, however, we are likely to compute multiple 
partitions in the same
   // task and in the same thread, in which case we need to avoid override 
values written by
   // previous partitions (SPARK-13071).
-  def updateBytesRead(): Unit = {
+  private def updateBytesRead(): Unit = {
 getBytesReadCallback.foreach { getBytesRead =>
   inputMetrics.setBytesRead(existingBytesRead + getBytesRead())
 }
   }
 
-  var reader: RecordReader[K, V] = null
-  val inputFormat = getInputFormat(jobConf)
+  private var reader: RecordReader[K, V] = null
+  private val 

spark git commit: [SPARK-18776][SS] Make Offset for FileStreamSource corrected formatted in json

2016-12-08 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/master 202fcd21c -> 458fa3325


[SPARK-18776][SS] Make Offset for FileStreamSource corrected formatted in json

## What changes were proposed in this pull request?

- Changed FileStreamSource to use the new FileStreamSourceOffset rather than LongOffset. The field is named `logOffset` to make it clearer that this is an offset in the file stream log.
- Fixed a bug in FileStreamSourceLog: the field endId in FileStreamSourceLog.get(startId, endId) was not being used at all. No test caught it earlier; only my updated tests caught it.

Other minor changes
- Don't use batchId in the FileStreamSource, as calling it a batch id is extremely misleading. With multiple sources, it may happen that a new batch has no new data from a file source, so the offset of FileStreamSource != batchId after that batch.
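
For illustration, a minimal self-contained sketch of the idea (simplified names and hand-rolled JSON, not the actual Spark class shown in the diff below): an offset that carries a named `logOffset` field and serializes to JSON is self-describing in a way a bare LongOffset is not.

```
// Sketch only; assumes simplified names and hand-rolled JSON rather than the
// serialization library the real FileStreamSourceOffset would use.
case class SketchFileStreamSourceOffset(logOffset: Long) {
  // A JSON form such as {"logOffset":5} says what the number means; "5" does not.
  def json: String = s"""{"logOffset":$logOffset}"""
}

object SketchFileStreamSourceOffset {
  def main(args: Array[String]): Unit = {
    println(SketchFileStreamSourceOffset(5L).json) // prints {"logOffset":5}
  }
}
```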

## How was this patch tested?

Updated unit test.

Author: Tathagata Das 

Closes #16205 from tdas/SPARK-18776.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/458fa332
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/458fa332
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/458fa332

Branch: refs/heads/master
Commit: 458fa3325e5f8c21c50e406ac8059d6236f93a9c
Parents: 202fcd2
Author: Tathagata Das 
Authored: Thu Dec 8 17:53:34 2016 -0800
Committer: Tathagata Das 
Committed: Thu Dec 8 17:53:34 2016 -0800

--
 .../sql/kafka010/KafkaSourceOffsetSuite.scala   |  2 +-
 .../execution/streaming/FileStreamSource.scala  | 32 ++--
 .../streaming/FileStreamSourceLog.scala |  2 +-
 .../streaming/FileStreamSourceOffset.scala  | 53 
 .../file-source-offset-version-2.1.0-json.txt   |  1 +
 .../file-source-offset-version-2.1.0-long.txt   |  1 +
 .../file-source-offset-version-2.1.0.txt|  1 -
 .../offset-log-version-2.1.0/0  |  4 +-
 .../streaming/FileStreamSourceSuite.scala   |  2 +-
 .../execution/streaming/OffsetSeqLogSuite.scala |  2 +-
 .../sql/streaming/FileStreamSourceSuite.scala   | 30 +++
 11 files changed, 96 insertions(+), 34 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/458fa332/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala
--
diff --git 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala
 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala
index 22668fd..10b35c7 100644
--- 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala
+++ 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala
@@ -90,7 +90,7 @@ class KafkaSourceOffsetSuite extends OffsetSuite with 
SharedSQLContext {
 }
   }
 
-  test("read Spark 2.1.0 log format") {
+  test("read Spark 2.1.0 offset format") {
 val offset = readFromResource("kafka-source-offset-version-2.1.0.txt")
 assert(KafkaSourceOffset(offset) ===
   KafkaSourceOffset(("topic1", 0, 456L), ("topic1", 1, 789L), ("topic2", 
0, 0L)))

http://git-wip-us.apache.org/repos/asf/spark/blob/458fa332/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
index 8494aef..20e0dce 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
@@ -57,7 +57,7 @@ class FileStreamSource(
 
   private val metadataLog =
 new FileStreamSourceLog(FileStreamSourceLog.VERSION, sparkSession, 
metadataPath)
-  private var maxBatchId = metadataLog.getLatest().map(_._1).getOrElse(-1L)
+  private var metadataLogCurrentOffset = 
metadataLog.getLatest().map(_._1).getOrElse(-1L)
 
   /** Maximum number of new files to be considered in each batch */
   private val maxFilesPerBatch = sourceOptions.maxFilesPerTrigger
@@ -79,7 +79,7 @@ class FileStreamSource(
* `synchronized` on this method is for solving race conditions in tests. In 
the normal usage,
* there is no race here, so the cost of `synchronized` should be rare.
*/
-  private def fetchMaxOffset(): LongOffset = synchronized {
+  private def fetchMaxOffset(): FileStreamSourceOffset = synchronized {
 // All the new files found - ignore aged files and files 

spark git commit: [SPARK-18776][SS] Make Offset for FileStreamSource corrected formatted in json

2016-12-08 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 e43209fe2 -> fcd22e538


[SPARK-18776][SS] Make Offset for FileStreamSource corrected formatted in json

## What changes were proposed in this pull request?

- Changed FileStreamSource to use the new FileStreamSourceOffset rather than LongOffset. The field is named `logOffset` to make it clearer that this is an offset in the file stream log.
- Fixed a bug in FileStreamSourceLog: the field endId in FileStreamSourceLog.get(startId, endId) was not being used at all. No test caught it earlier; only my updated tests caught it.

Other minor changes
- Don't use batchId in the FileStreamSource, as calling it a batch id is extremely misleading. With multiple sources, it may happen that a new batch has no new data from a file source, so the offset of FileStreamSource != batchId after that batch.

## How was this patch tested?

Updated unit test.

Author: Tathagata Das 

Closes #16205 from tdas/SPARK-18776.

(cherry picked from commit 458fa3325e5f8c21c50e406ac8059d6236f93a9c)
Signed-off-by: Tathagata Das 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fcd22e53
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fcd22e53
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fcd22e53

Branch: refs/heads/branch-2.1
Commit: fcd22e5389a7dffda32be0e143d772f611a0f3d9
Parents: e43209f
Author: Tathagata Das 
Authored: Thu Dec 8 17:53:34 2016 -0800
Committer: Tathagata Das 
Committed: Thu Dec 8 17:53:45 2016 -0800

--
 .../sql/kafka010/KafkaSourceOffsetSuite.scala   |  2 +-
 .../execution/streaming/FileStreamSource.scala  | 32 ++--
 .../streaming/FileStreamSourceLog.scala |  2 +-
 .../streaming/FileStreamSourceOffset.scala  | 53 
 .../file-source-offset-version-2.1.0-json.txt   |  1 +
 .../file-source-offset-version-2.1.0-long.txt   |  1 +
 .../file-source-offset-version-2.1.0.txt|  1 -
 .../offset-log-version-2.1.0/0  |  4 +-
 .../streaming/FileStreamSourceSuite.scala   |  2 +-
 .../execution/streaming/OffsetSeqLogSuite.scala |  2 +-
 .../sql/streaming/FileStreamSourceSuite.scala   | 30 +++
 11 files changed, 96 insertions(+), 34 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/fcd22e53/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala
--
diff --git 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala
 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala
index 22668fd..10b35c7 100644
--- 
a/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala
+++ 
b/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala
@@ -90,7 +90,7 @@ class KafkaSourceOffsetSuite extends OffsetSuite with 
SharedSQLContext {
 }
   }
 
-  test("read Spark 2.1.0 log format") {
+  test("read Spark 2.1.0 offset format") {
 val offset = readFromResource("kafka-source-offset-version-2.1.0.txt")
 assert(KafkaSourceOffset(offset) ===
   KafkaSourceOffset(("topic1", 0, 456L), ("topic1", 1, 789L), ("topic2", 
0, 0L)))

http://git-wip-us.apache.org/repos/asf/spark/blob/fcd22e53/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
index 8494aef..20e0dce 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
@@ -57,7 +57,7 @@ class FileStreamSource(
 
   private val metadataLog =
 new FileStreamSourceLog(FileStreamSourceLog.VERSION, sparkSession, 
metadataPath)
-  private var maxBatchId = metadataLog.getLatest().map(_._1).getOrElse(-1L)
+  private var metadataLogCurrentOffset = 
metadataLog.getLatest().map(_._1).getOrElse(-1L)
 
   /** Maximum number of new files to be considered in each batch */
   private val maxFilesPerBatch = sourceOptions.maxFilesPerTrigger
@@ -79,7 +79,7 @@ class FileStreamSource(
* `synchronized` on this method is for solving race conditions in tests. In 
the normal usage,
* there is no race here, so the cost of `synchronized` should be rare.
*/
-  private def fetchMaxOffset(): LongOffset = synchronized 

spark git commit: [SPARK-18590][SPARKR] Change the R source build to Hadoop 2.6

2016-12-08 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/master 3261e25da -> 202fcd21c


[SPARK-18590][SPARKR] Change the R source build to Hadoop 2.6

This PR changes the SparkR source release tarball to be built using the Hadoop 2.6 profile. Previously it was using the without-hadoop profile, which leads to an error as discussed in https://github.com/apache/spark/pull/16014#issuecomment-265843991

Author: Shivaram Venkataraman 

Closes #16218 from shivaram/fix-sparkr-release-build.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/202fcd21
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/202fcd21
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/202fcd21

Branch: refs/heads/master
Commit: 202fcd21ce01393fa6dfaa1c2126e18e9b85ee96
Parents: 3261e25
Author: Shivaram Venkataraman 
Authored: Thu Dec 8 13:01:46 2016 -0800
Committer: Shivaram Venkataraman 
Committed: Thu Dec 8 13:01:46 2016 -0800

--
 dev/create-release/release-build.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/202fcd21/dev/create-release/release-build.sh
--
diff --git a/dev/create-release/release-build.sh 
b/dev/create-release/release-build.sh
index 8863ee6..1b05b20 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -238,10 +238,10 @@ if [[ "$1" == "package" ]]; then
   FLAGS="-Psparkr -Phive -Phive-thriftserver -Pyarn -Pmesos"
   make_binary_release "hadoop2.3" "-Phadoop-2.3 $FLAGS" "3033" &
   make_binary_release "hadoop2.4" "-Phadoop-2.4 $FLAGS" "3034" &
-  make_binary_release "hadoop2.6" "-Phadoop-2.6 $FLAGS" "3035" &
+  make_binary_release "hadoop2.6" "-Phadoop-2.6 $FLAGS" "3035" "withr" &
   make_binary_release "hadoop2.7" "-Phadoop-2.7 $FLAGS" "3036" "withpip" &
   make_binary_release "hadoop2.4-without-hive" "-Psparkr -Phadoop-2.4 -Pyarn 
-Pmesos" "3037" &
-  make_binary_release "without-hadoop" "-Psparkr -Phadoop-provided -Pyarn 
-Pmesos" "3038" "withr" &
+  make_binary_release "without-hadoop" "-Psparkr -Phadoop-provided -Pyarn 
-Pmesos" "3038" &
   wait
   rm -rf spark-$SPARK_VERSION-bin-*/
 





spark git commit: [SPARK-18590][SPARKR] Change the R source build to Hadoop 2.6

2016-12-08 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 9483242f4 -> e43209fe2


[SPARK-18590][SPARKR] Change the R source build to Hadoop 2.6

This PR changes the SparkR source release tarball to be built using the Hadoop 2.6 profile. Previously it was using the without-hadoop profile, which leads to an error as discussed in https://github.com/apache/spark/pull/16014#issuecomment-265843991

Author: Shivaram Venkataraman 

Closes #16218 from shivaram/fix-sparkr-release-build.

(cherry picked from commit 202fcd21ce01393fa6dfaa1c2126e18e9b85ee96)
Signed-off-by: Shivaram Venkataraman 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e43209fe
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e43209fe
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e43209fe

Branch: refs/heads/branch-2.1
Commit: e43209fe2a69fb239dff8bc1a18297d3696f0dcd
Parents: 9483242
Author: Shivaram Venkataraman 
Authored: Thu Dec 8 13:01:46 2016 -0800
Committer: Shivaram Venkataraman 
Committed: Thu Dec 8 13:01:54 2016 -0800

--
 dev/create-release/release-build.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e43209fe/dev/create-release/release-build.sh
--
diff --git a/dev/create-release/release-build.sh 
b/dev/create-release/release-build.sh
index 8863ee6..1b05b20 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -238,10 +238,10 @@ if [[ "$1" == "package" ]]; then
   FLAGS="-Psparkr -Phive -Phive-thriftserver -Pyarn -Pmesos"
   make_binary_release "hadoop2.3" "-Phadoop-2.3 $FLAGS" "3033" &
   make_binary_release "hadoop2.4" "-Phadoop-2.4 $FLAGS" "3034" &
-  make_binary_release "hadoop2.6" "-Phadoop-2.6 $FLAGS" "3035" &
+  make_binary_release "hadoop2.6" "-Phadoop-2.6 $FLAGS" "3035" "withr" &
   make_binary_release "hadoop2.7" "-Phadoop-2.7 $FLAGS" "3036" "withpip" &
   make_binary_release "hadoop2.4-without-hive" "-Psparkr -Phadoop-2.4 -Pyarn 
-Pmesos" "3037" &
-  make_binary_release "without-hadoop" "-Psparkr -Phadoop-provided -Pyarn 
-Pmesos" "3038" "withr" &
+  make_binary_release "without-hadoop" "-Psparkr -Phadoop-provided -Pyarn 
-Pmesos" "3038" &
   wait
   rm -rf spark-$SPARK_VERSION-bin-*/
 





spark git commit: [SPARK-18760][SQL] Consistent format specification for FileFormats

2016-12-08 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 a03564418 -> 9483242f4


[SPARK-18760][SQL] Consistent format specification for FileFormats

## What changes were proposed in this pull request?
This patch fixes the format specification in explain for file sources (Parquet 
and Text formats are the only two that are different from the rest):

Before:
```
scala> spark.read.text("test.text").explain()
== Physical Plan ==
*FileScan text [value#15] Batched: false, Format: 
org.apache.spark.sql.execution.datasources.text.TextFileFormatxyz, Location: 
InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct<value:string>
```

After:
```
scala> spark.read.text("test.text").explain()
== Physical Plan ==
*FileScan text [value#15] Batched: false, Format: Text, Location: 
InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct<value:string>
```

Also closes #14680.
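
For illustration, a minimal sketch of the convention this patch enforces (the names below are invented, not Spark's actual classes): a file format's `toString` should be the short, human-readable name that `explain()` prints, rather than the full class name.

```
// Sketch only: illustrates overriding toString to a short display name.
trait SketchFileFormat {
  def shortName(): String
  // What a query plan would show for this format, e.g. "Text" or "Parquet".
  override def toString: String = shortName().capitalize
}

class SketchTextFileFormat extends SketchFileFormat {
  override def shortName(): String = "text"
}

object SketchFileFormatDemo {
  def main(args: Array[String]): Unit = {
    println(new SketchTextFileFormat().toString) // prints "Text"
  }
}
```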

## How was this patch tested?
Verified in spark-shell.

Author: Reynold Xin 

Closes #16187 from rxin/SPARK-18760.

(cherry picked from commit 5f894d23a54ea99f75f8b722e111e5270f7f80cf)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9483242f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9483242f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9483242f

Branch: refs/heads/branch-2.1
Commit: 9483242f4c6cc13001e5a967810718b26beb2361
Parents: a035644
Author: Reynold Xin 
Authored: Thu Dec 8 12:52:05 2016 -0800
Committer: Reynold Xin 
Committed: Thu Dec 8 12:52:21 2016 -0800

--
 .../sql/execution/datasources/parquet/ParquetFileFormat.scala | 2 +-
 .../spark/sql/execution/datasources/text/TextFileFormat.scala | 2 ++
 .../apache/spark/sql/streaming/FileStreamSourceSuite.scala| 7 ---
 3 files changed, 7 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9483242f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
index 031a0fe..0965ffe 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
@@ -61,7 +61,7 @@ class ParquetFileFormat
 
   override def shortName(): String = "parquet"
 
-  override def toString: String = "ParquetFormat"
+  override def toString: String = "Parquet"
 
   override def hashCode(): Int = getClass.hashCode()
 

http://git-wip-us.apache.org/repos/asf/spark/blob/9483242f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
index 8e04396..3e89082 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
@@ -43,6 +43,8 @@ class TextFileFormat extends TextBasedFileFormat with 
DataSourceRegister {
 
   override def shortName(): String = "text"
 
+  override def toString: String = "Text"
+
   private def verifySchema(schema: StructType): Unit = {
 if (schema.size != 1) {
   throw new AnalysisException(

http://git-wip-us.apache.org/repos/asf/spark/blob/9483242f/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
index 7b6fe83..267c462 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
@@ -31,7 +31,8 @@ import org.apache.spark.sql.test.SharedSQLContext
 import org.apache.spark.sql.types._
 import org.apache.spark.util.Utils
 
-class FileStreamSourceTest extends StreamTest with SharedSQLContext with 
PrivateMethodTester {
+abstract class FileStreamSourceTest
+  extends StreamTest with SharedSQLContext with 

spark git commit: [SPARK-18751][CORE] Fix deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext

2016-12-08 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master c3d3a9d0e -> 26432df9c


[SPARK-18751][CORE] Fix deadlock when SparkContext.stop is called in 
Utils.tryOrStopSparkContext

## What changes were proposed in this pull request?

When `SparkContext.stop` is called in `Utils.tryOrStopSparkContext` (the 
following three places), it will cause deadlock because the `stop` method needs 
to wait for the thread running `stop` to exit.

- ContextCleaner.keepCleaning
- LiveListenerBus.listenerThread.run
- TaskSchedulerImpl.start

This PR adds `SparkContext.stopInNewThread` and uses it to eliminate the 
potential deadlock. I also removed my changes in #15775 since they are not 
necessary now.
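
For illustration, a minimal standalone sketch of the pattern (simplified signature; the real change is the `private[spark] stopInNewThread` method in the diff below): the stop call is handed to a fresh daemon thread so that an internal thread never ends up waiting on itself.

```
// Sketch only: shows why handing stop() to a new daemon thread avoids the
// self-join deadlock described above.
object StopInNewThreadSketch {
  def stopInNewThread(stop: () => Unit): Unit = {
    new Thread("stop-spark-context") {
      setDaemon(true)  // do not keep the JVM alive just for this thread
      override def run(): Unit = {
        try {
          stop()       // safe: this thread is not one that stop() waits on
        } catch {
          case e: Throwable =>
            e.printStackTrace()
            throw e
        }
      }
    }.start()
  }
}
```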

## How was this patch tested?

Jenkins

Author: Shixiong Zhu 

Closes #16178 from zsxwing/fix-stop-deadlock.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/26432df9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/26432df9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/26432df9

Branch: refs/heads/master
Commit: 26432df9cc6ffe569583aa628c6ecd7050b38316
Parents: c3d3a9d
Author: Shixiong Zhu 
Authored: Thu Dec 8 11:54:04 2016 -0800
Committer: Shixiong Zhu 
Committed: Thu Dec 8 11:54:04 2016 -0800

--
 .../scala/org/apache/spark/SparkContext.scala   | 35 +++-
 .../scala/org/apache/spark/rpc/RpcEnv.scala |  5 ---
 .../org/apache/spark/rpc/netty/Dispatcher.scala |  1 -
 .../apache/spark/rpc/netty/NettyRpcEnv.scala|  5 ---
 .../apache/spark/scheduler/DAGScheduler.scala   |  2 +-
 .../cluster/StandaloneSchedulerBackend.scala|  2 +-
 .../scala/org/apache/spark/util/Utils.scala |  2 +-
 .../org/apache/spark/rpc/RpcEnvSuite.scala  | 13 
 8 files changed, 23 insertions(+), 42 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/26432df9/core/src/main/scala/org/apache/spark/SparkContext.scala
--
diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala 
b/core/src/main/scala/org/apache/spark/SparkContext.scala
index be4dae1..b42820a 100644
--- a/core/src/main/scala/org/apache/spark/SparkContext.scala
+++ b/core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -1760,25 +1760,30 @@ class SparkContext(config: SparkConf) extends Logging {
   def listJars(): Seq[String] = addedJars.keySet.toSeq
 
   /**
-   * Shut down the SparkContext.
+   * When stopping SparkContext inside Spark components, it's easy to cause 
dead-lock since Spark
+   * may wait for some internal threads to finish. It's better to use this 
method to stop
+   * SparkContext instead.
*/
-  def stop(): Unit = {
-if (env.rpcEnv.isInRPCThread) {
-  // `stop` will block until all RPC threads exit, so we cannot call stop 
inside a RPC thread.
-  // We should launch a new thread to call `stop` to avoid dead-lock.
-  new Thread("stop-spark-context") {
-setDaemon(true)
-
-override def run(): Unit = {
-  _stop()
+  private[spark] def stopInNewThread(): Unit = {
+new Thread("stop-spark-context") {
+  setDaemon(true)
+
+  override def run(): Unit = {
+try {
+  SparkContext.this.stop()
+} catch {
+  case e: Throwable =>
+logError(e.getMessage, e)
+throw e
 }
-  }.start()
-} else {
-  _stop()
-}
+  }
+}.start()
   }
 
-  private def _stop() {
+  /**
+   * Shut down the SparkContext.
+   */
+  def stop(): Unit = {
 if (LiveListenerBus.withinListenerThread.value) {
   throw new SparkException(
 s"Cannot stop SparkContext within listener thread of 
${LiveListenerBus.name}")

http://git-wip-us.apache.org/repos/asf/spark/blob/26432df9/core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala
--
diff --git a/core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala 
b/core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala
index bbc4163..530743c 100644
--- a/core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala
+++ b/core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala
@@ -146,11 +146,6 @@ private[spark] abstract class RpcEnv(conf: SparkConf) {
* @param uri URI with location of the file.
*/
   def openChannel(uri: String): ReadableByteChannel
-
-  /**
-   * Return if the current thread is a RPC thread.
-   */
-  def isInRPCThread: Boolean
 }
 
 /**

http://git-wip-us.apache.org/repos/asf/spark/blob/26432df9/core/src/main/scala/org/apache/spark/rpc/netty/Dispatcher.scala
--
diff --git a/core/src/main/scala/org/apache/spark/rpc/netty/Dispatcher.scala 

spark git commit: [SPARK-18751][CORE] Fix deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext

2016-12-08 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 d69df9073 -> a03564418


[SPARK-18751][CORE] Fix deadlock when SparkContext.stop is called in 
Utils.tryOrStopSparkContext

## What changes were proposed in this pull request?

When `SparkContext.stop` is called in `Utils.tryOrStopSparkContext` (the 
following three places), it will cause deadlock because the `stop` method needs 
to wait for the thread running `stop` to exit.

- ContextCleaner.keepCleaning
- LiveListenerBus.listenerThread.run
- TaskSchedulerImpl.start

This PR adds `SparkContext.stopInNewThread` and uses it to eliminate the 
potential deadlock. I also removed my changes in #15775 since they are not 
necessary now.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu 

Closes #16178 from zsxwing/fix-stop-deadlock.

(cherry picked from commit 26432df9cc6ffe569583aa628c6ecd7050b38316)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a0356441
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a0356441
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a0356441

Branch: refs/heads/branch-2.1
Commit: a035644182646a2160ac16ecd6c7f4d98be2caad
Parents: d69df90
Author: Shixiong Zhu 
Authored: Thu Dec 8 11:54:04 2016 -0800
Committer: Shixiong Zhu 
Committed: Thu Dec 8 11:54:10 2016 -0800

--
 .../scala/org/apache/spark/SparkContext.scala   | 35 +++-
 .../scala/org/apache/spark/rpc/RpcEnv.scala |  5 ---
 .../org/apache/spark/rpc/netty/Dispatcher.scala |  1 -
 .../apache/spark/rpc/netty/NettyRpcEnv.scala|  5 ---
 .../apache/spark/scheduler/DAGScheduler.scala   |  2 +-
 .../cluster/StandaloneSchedulerBackend.scala|  2 +-
 .../scala/org/apache/spark/util/Utils.scala |  2 +-
 .../org/apache/spark/rpc/RpcEnvSuite.scala  | 13 
 8 files changed, 23 insertions(+), 42 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a0356441/core/src/main/scala/org/apache/spark/SparkContext.scala
--
diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala 
b/core/src/main/scala/org/apache/spark/SparkContext.scala
index b8414b5..8f8392f 100644
--- a/core/src/main/scala/org/apache/spark/SparkContext.scala
+++ b/core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -1757,25 +1757,30 @@ class SparkContext(config: SparkConf) extends Logging {
   def listJars(): Seq[String] = addedJars.keySet.toSeq
 
   /**
-   * Shut down the SparkContext.
+   * When stopping SparkContext inside Spark components, it's easy to cause 
dead-lock since Spark
+   * may wait for some internal threads to finish. It's better to use this 
method to stop
+   * SparkContext instead.
*/
-  def stop(): Unit = {
-if (env.rpcEnv.isInRPCThread) {
-  // `stop` will block until all RPC threads exit, so we cannot call stop 
inside a RPC thread.
-  // We should launch a new thread to call `stop` to avoid dead-lock.
-  new Thread("stop-spark-context") {
-setDaemon(true)
-
-override def run(): Unit = {
-  _stop()
+  private[spark] def stopInNewThread(): Unit = {
+new Thread("stop-spark-context") {
+  setDaemon(true)
+
+  override def run(): Unit = {
+try {
+  SparkContext.this.stop()
+} catch {
+  case e: Throwable =>
+logError(e.getMessage, e)
+throw e
 }
-  }.start()
-} else {
-  _stop()
-}
+  }
+}.start()
   }
 
-  private def _stop() {
+  /**
+   * Shut down the SparkContext.
+   */
+  def stop(): Unit = {
 if (LiveListenerBus.withinListenerThread.value) {
   throw new SparkException(
 s"Cannot stop SparkContext within listener thread of 
${LiveListenerBus.name}")

http://git-wip-us.apache.org/repos/asf/spark/blob/a0356441/core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala
--
diff --git a/core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala 
b/core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala
index bbc4163..530743c 100644
--- a/core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala
+++ b/core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala
@@ -146,11 +146,6 @@ private[spark] abstract class RpcEnv(conf: SparkConf) {
* @param uri URI with location of the file.
*/
   def openChannel(uri: String): ReadableByteChannel
-
-  /**
-   * Return if the current thread is a RPC thread.
-   */
-  def isInRPCThread: Boolean
 }
 
 /**

http://git-wip-us.apache.org/repos/asf/spark/blob/a0356441/core/src/main/scala/org/apache/spark/rpc/netty/Dispatcher.scala

spark git commit: [SPARK-18590][SPARKR] build R source package when making distribution

2016-12-08 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 e0173f14e -> d69df9073


[SPARK-18590][SPARKR] build R source package when making distribution

This PR has 2 key changes. One, we are building a source package (aka bundle package) for SparkR which could be released on CRAN. Two, the official Spark binary distributions should instead include SparkR installed from this source package (which has the help/vignettes rds files needed for those to work when the SparkR package is loaded in R, whereas the earlier devtools-based approach does not).

But, because of various differences in how R performs different tasks, this PR 
is a fair bit more complicated. More details below.

This PR also includes a few minor fixes.

These are the additional steps in make-distribution; please see [here](https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md) for what goes into a CRAN release, which is now run during make-distribution.sh.
1. The package needs to be installed because the first code block in the vignettes is `library(SparkR)` without a lib path
2. `R CMD build` will build the vignettes (this process runs Spark/SparkR code and captures the outputs into pdf documentation)
3. `R CMD check` on the source package will install the package and build the vignettes again (this time from the source package) - this is a key step required to release an R package on CRAN
 (tests are skipped here, but they will need to pass for the CRAN release process to succeed - ideally, during release signoff we should install from the R source package and run the tests)
4. `R CMD INSTALL` on the source package (this is the only way to generate the doc/vignettes rds files correctly, not in step #1)
 (the output of this step is what we package into the Spark dist and sparkr.zip)

Alternatively, `R CMD build` should already be installing the package in a temp directory, though it might just be a matter of finding that location and setting it as the lib.loc parameter; another approach is perhaps to call `R CMD INSTALL --build pkg` instead. But in any case, despite installing the package multiple times, this is relatively fast. Building the vignettes takes a while, though.
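
For illustration, the R invocations these steps describe could be sketched as below (a rough sketch only - the real logic lives in the bash scripts `dev/make-distribution.sh` and `R/check-cran.sh`; the library path, the check flags, and the `SparkR_2.1.0.tar.gz` artifact name are assumptions here, not the scripts' exact arguments):

```python
import subprocess

def run(*cmd):
    # Echo and run one command from the repository root, failing fast on errors.
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

# Step 1: install the package so the vignettes' `library(SparkR)` call works.
run("R", "CMD", "INSTALL", "--library=R/lib", "R/pkg")
# Step 2: build the source package; this runs SparkR code and builds the vignettes.
run("R", "CMD", "build", "R/pkg")
# Step 3: CRAN-style check of the source package (tests/manual/vignettes skipped,
# as described above; the tarball name is an assumed example).
run("R", "CMD", "check", "--no-tests", "--no-manual", "--no-vignettes",
    "SparkR_2.1.0.tar.gz")
# Step 4: install from the source package; this is what generates the
# help/vignette rds files that go into the Spark dist and sparkr.zip.
run("R", "CMD", "INSTALL", "--library=R/lib", "SparkR_2.1.0.tar.gz")
```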

Manually, CI.

Author: Felix Cheung 

Closes #16014 from felixcheung/rdist.

(cherry picked from commit c3d3a9d0e85b834abef87069e4edd27db87fc607)
Signed-off-by: Shivaram Venkataraman 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d69df907
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d69df907
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d69df907

Branch: refs/heads/branch-2.1
Commit: d69df9073274f7ab3a3598bb182a3233fd7775cd
Parents: e0173f1
Author: Felix Cheung 
Authored: Thu Dec 8 11:29:31 2016 -0800
Committer: Shivaram Venkataraman 
Committed: Thu Dec 8 11:31:24 2016 -0800

--
 R/CRAN_RELEASE.md   |  2 +-
 R/check-cran.sh | 19 ++-
 R/install-dev.sh|  2 +-
 R/pkg/.Rbuildignore |  3 +++
 R/pkg/DESCRIPTION   | 13 ++---
 R/pkg/NAMESPACE |  2 +-
 dev/create-release/release-build.sh | 27 +++
 dev/make-distribution.sh| 25 +
 8 files changed, 74 insertions(+), 19 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d69df907/R/CRAN_RELEASE.md
--
diff --git a/R/CRAN_RELEASE.md b/R/CRAN_RELEASE.md
index bea8f9f..d6084c7 100644
--- a/R/CRAN_RELEASE.md
+++ b/R/CRAN_RELEASE.md
@@ -7,7 +7,7 @@ To release SparkR as a package to CRAN, we would use the 
`devtools` package. Ple
 
 First, check that the `Version:` field in the `pkg/DESCRIPTION` file is 
updated. Also, check for stale files not under source control.
 
-Note that while `check-cran.sh` is running `R CMD check`, it is doing so with 
`--no-manual --no-vignettes`, which skips a few vignettes or PDF checks - 
therefore it will be preferred to run `R CMD check` on the source package built 
manually before uploading a release.
+Note that while `run-tests.sh` runs `check-cran.sh` (which runs `R CMD 
check`), it is doing so with `--no-manual --no-vignettes`, which skips a few 
vignettes or PDF checks - therefore it will be preferred to run `R CMD check` 
on the source package built manually before uploading a release. Also note that 
for CRAN checks for pdf vignettes to success, `qpdf` tool must be there (to 
install it, eg. `yum -q -y install qpdf`).
 
 To upload a release, we would need to update the `cran-comments.md`. This 
should generally contain the results from running the `check-cran.sh` script 
along with comments on status of all `WARNING` (should not be 

spark git commit: [SPARK-18590][SPARKR] build R source package when making distribution

2016-12-08 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/master 3c68944b2 -> c3d3a9d0e


[SPARK-18590][SPARKR] build R source package when making distribution

## What changes were proposed in this pull request?

This PR has 2 key changes. One, we are building a source package (aka bundle package) for SparkR which could be released on CRAN. Two, the official Spark binary distributions should instead include SparkR installed from this source package (which has the help/vignettes rds files needed for those to work when the SparkR package is loaded in R, whereas the earlier devtools-based approach does not).

But, because of various differences in how R performs different tasks, this PR 
is a fair bit more complicated. More details below.

This PR also includes a few minor fixes.

### more details

These are the additional steps in make-distribution; please see [here](https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md) for what goes into a CRAN release, which is now run during make-distribution.sh.
1. The package needs to be installed because the first code block in the vignettes is `library(SparkR)` without a lib path
2. `R CMD build` will build the vignettes (this process runs Spark/SparkR code and captures the outputs into pdf documentation)
3. `R CMD check` on the source package will install the package and build the vignettes again (this time from the source package) - this is a key step required to release an R package on CRAN
 (tests are skipped here, but they will need to pass for the CRAN release process to succeed - ideally, during release signoff we should install from the R source package and run the tests)
4. `R CMD INSTALL` on the source package (this is the only way to generate the doc/vignettes rds files correctly, not in step #1)
 (the output of this step is what we package into the Spark dist and sparkr.zip)

Alternatively, `R CMD build` should already be installing the package in a temp directory, though it might just be a matter of finding that location and setting it as the lib.loc parameter; another approach is perhaps to call `R CMD INSTALL --build pkg` instead. But in any case, despite installing the package multiple times, this is relatively fast. Building the vignettes takes a while, though.

## How was this patch tested?

Manually, CI.

Author: Felix Cheung 

Closes #16014 from felixcheung/rdist.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c3d3a9d0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c3d3a9d0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c3d3a9d0

Branch: refs/heads/master
Commit: c3d3a9d0e85b834abef87069e4edd27db87fc607
Parents: 3c68944
Author: Felix Cheung 
Authored: Thu Dec 8 11:29:31 2016 -0800
Committer: Shivaram Venkataraman 
Committed: Thu Dec 8 11:29:31 2016 -0800

--
 R/CRAN_RELEASE.md   |  2 +-
 R/check-cran.sh | 19 ++-
 R/install-dev.sh|  2 +-
 R/pkg/.Rbuildignore |  3 +++
 R/pkg/DESCRIPTION   | 13 ++---
 R/pkg/NAMESPACE |  2 +-
 dev/create-release/release-build.sh | 27 +++
 dev/make-distribution.sh| 25 +
 8 files changed, 74 insertions(+), 19 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c3d3a9d0/R/CRAN_RELEASE.md
--
diff --git a/R/CRAN_RELEASE.md b/R/CRAN_RELEASE.md
index bea8f9f..d6084c7 100644
--- a/R/CRAN_RELEASE.md
+++ b/R/CRAN_RELEASE.md
@@ -7,7 +7,7 @@ To release SparkR as a package to CRAN, we would use the 
`devtools` package. Ple
 
 First, check that the `Version:` field in the `pkg/DESCRIPTION` file is 
updated. Also, check for stale files not under source control.
 
-Note that while `check-cran.sh` is running `R CMD check`, it is doing so with 
`--no-manual --no-vignettes`, which skips a few vignettes or PDF checks - 
therefore it will be preferred to run `R CMD check` on the source package built 
manually before uploading a release.
+Note that while `run-tests.sh` runs `check-cran.sh` (which runs `R CMD 
check`), it is doing so with `--no-manual --no-vignettes`, which skips a few 
vignettes or PDF checks - therefore it will be preferred to run `R CMD check` 
on the source package built manually before uploading a release. Also note that 
for CRAN checks for pdf vignettes to success, `qpdf` tool must be there (to 
install it, eg. `yum -q -y install qpdf`).
 
 To upload a release, we would need to update the `cran-comments.md`. This 
should generally contain the results from running the `check-cran.sh` script 
along with comments on status of all `WARNING` (should not be any) or `NOTE`. 
As a part of 

spark git commit: [SPARK-16589] [PYTHON] Chained cartesian produces incorrect number of records

2016-12-08 Thread davies
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 726217eb7 -> e0173f14e


[SPARK-16589] [PYTHON] Chained cartesian produces incorrect number of records

## What changes were proposed in this pull request?

Fixes a bug in the Python implementation of the RDD cartesian product related to batching, which showed up in repeated cartesian products with seemingly random results. The root cause was multiple iterators pulling from the same stream in the wrong order because of logic that ignored batching.

`CartesianDeserializer` and `PairDeserializer` were changed to implement 
`_load_stream_without_unbatching` and borrow the one line implementation of 
`load_stream` from `BatchedSerializer`. The default implementation of 
`_load_stream_without_unbatching` was changed to give consistent results 
(always an iterable) so that it could be used without additional checks.

`PairDeserializer` no longer extends `CartesianDeserializer`, as it was not really proper. If wanted, a new common superclass could be added.

Both `CartesianDeserializer` and `PairDeserializer` now only extend 
`Serializer` (which has no `dump_stream` implementation) since they are only 
meant for *de*serialization.
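
To illustrate the batching issue in isolation (a standalone sketch, not the `pyspark.serializers` code itself): the Java-side cartesian interleaves one batch of keys with one batch of values, so the Python side has to zip the two batch streams in lock-step and take the cartesian product within each aligned pair of batches, instead of unbatching the two streams independently:

```python
from itertools import product

def batches(stream):
    # Stand-in for _load_stream_without_unbatching: always yield lists (batches),
    # wrapping a lone object in a one-element list when there is no batching.
    for item in stream:
        yield item if isinstance(item, list) else [item]

def cartesian(key_stream, val_stream):
    # Zip the two batch streams in lock-step, then take the cartesian product
    # inside each aligned pair of batches.
    for key_batch, val_batch in zip(batches(key_stream), batches(val_stream)):
        for pair in product(key_batch, val_batch):
            yield pair

# Two aligned batches of keys and values, as the Java-side cartesian emits them.
keys = [[1, 2], [3]]
vals = [["a", "b"], ["c"]]
print(list(cartesian(keys, vals)))
# [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), (3, 'c')]
```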

## How was this patch tested?

Additional unit tests (sourced from #14248) plus one for testing a cartesian 
with zip.

Author: Andrew Ray 

Closes #16121 from aray/fix-cartesian.

(cherry picked from commit 3c68944b229aaaeeaee3efcbae3e3be9a2914855)
Signed-off-by: Davies Liu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e0173f14
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e0173f14
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e0173f14

Branch: refs/heads/branch-2.1
Commit: e0173f14e3ea28d83c1c46bf97f7d3755960a8fc
Parents: 726217e
Author: Andrew Ray 
Authored: Thu Dec 8 11:08:12 2016 -0800
Committer: Davies Liu 
Committed: Thu Dec 8 11:08:27 2016 -0800

--
 python/pyspark/serializers.py | 58 +++---
 python/pyspark/tests.py   | 18 
 2 files changed, 53 insertions(+), 23 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e0173f14/python/pyspark/serializers.py
--
diff --git a/python/pyspark/serializers.py b/python/pyspark/serializers.py
index 2a13269..c4f2f08 100644
--- a/python/pyspark/serializers.py
+++ b/python/pyspark/serializers.py
@@ -61,7 +61,7 @@ import itertools
 if sys.version < '3':
 import cPickle as pickle
 protocol = 2
-from itertools import izip as zip
+from itertools import izip as zip, imap as map
 else:
 import pickle
 protocol = 3
@@ -96,7 +96,12 @@ class Serializer(object):
 raise NotImplementedError
 
 def _load_stream_without_unbatching(self, stream):
-return self.load_stream(stream)
+"""
+Return an iterator of deserialized batches (lists) of objects from the 
input stream.
+if the serializer does not operate on batches the default 
implementation returns an
+iterator of single element lists.
+"""
+return map(lambda x: [x], self.load_stream(stream))
 
 # Note: our notion of "equality" is that output generated by
 # equal serializers can be deserialized using the same serializer.
@@ -278,50 +283,57 @@ class AutoBatchedSerializer(BatchedSerializer):
 return "AutoBatchedSerializer(%s)" % self.serializer
 
 
-class CartesianDeserializer(FramedSerializer):
+class CartesianDeserializer(Serializer):
 
 """
 Deserializes the JavaRDD cartesian() of two PythonRDDs.
+Due to pyspark batching we cannot simply use the result of the Java RDD 
cartesian,
+we additionally need to do the cartesian within each pair of batches.
 """
 
 def __init__(self, key_ser, val_ser):
-FramedSerializer.__init__(self)
 self.key_ser = key_ser
 self.val_ser = val_ser
 
-def prepare_keys_values(self, stream):
-key_stream = self.key_ser._load_stream_without_unbatching(stream)
-val_stream = self.val_ser._load_stream_without_unbatching(stream)
-key_is_batched = isinstance(self.key_ser, BatchedSerializer)
-val_is_batched = isinstance(self.val_ser, BatchedSerializer)
-for (keys, vals) in zip(key_stream, val_stream):
-keys = keys if key_is_batched else [keys]
-vals = vals if val_is_batched else [vals]
-yield (keys, vals)
+def _load_stream_without_unbatching(self, stream):
+key_batch_stream = self.key_ser._load_stream_without_unbatching(stream)
+val_batch_stream = self.val_ser._load_stream_without_unbatching(stream)
+for (key_batch, val_batch) in 

spark git commit: [SPARK-16589] [PYTHON] Chained cartesian produces incorrect number of records

2016-12-08 Thread davies
Repository: spark
Updated Branches:
  refs/heads/master ed8869ebb -> 3c68944b2


[SPARK-16589] [PYTHON] Chained cartesian produces incorrect number of records

## What changes were proposed in this pull request?

Fixes a bug in the Python implementation of the RDD cartesian product related to batching, which showed up in repeated cartesian products with seemingly random results. The root cause was multiple iterators pulling from the same stream in the wrong order because of logic that ignored batching.

`CartesianDeserializer` and `PairDeserializer` were changed to implement 
`_load_stream_without_unbatching` and borrow the one line implementation of 
`load_stream` from `BatchedSerializer`. The default implementation of 
`_load_stream_without_unbatching` was changed to give consistent results 
(always an iterable) so that it could be used without additional checks.

`PairDeserializer` no longer extends `CartesianDeserializer`, as it was not really proper. If wanted, a new common superclass could be added.

Both `CartesianDeserializer` and `PairDeserializer` now only extend 
`Serializer` (which has no `dump_stream` implementation) since they are only 
meant for *de*serialization.

## How was this patch tested?

Additional unit tests (sourced from #14248) plus one for testing a cartesian 
with zip.

Author: Andrew Ray 

Closes #16121 from aray/fix-cartesian.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3c68944b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3c68944b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3c68944b

Branch: refs/heads/master
Commit: 3c68944b229aaaeeaee3efcbae3e3be9a2914855
Parents: ed8869e
Author: Andrew Ray 
Authored: Thu Dec 8 11:08:12 2016 -0800
Committer: Davies Liu 
Committed: Thu Dec 8 11:08:12 2016 -0800

--
 python/pyspark/serializers.py | 58 +++---
 python/pyspark/tests.py   | 18 
 2 files changed, 53 insertions(+), 23 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3c68944b/python/pyspark/serializers.py
--
diff --git a/python/pyspark/serializers.py b/python/pyspark/serializers.py
index 2a13269..c4f2f08 100644
--- a/python/pyspark/serializers.py
+++ b/python/pyspark/serializers.py
@@ -61,7 +61,7 @@ import itertools
 if sys.version < '3':
 import cPickle as pickle
 protocol = 2
-from itertools import izip as zip
+from itertools import izip as zip, imap as map
 else:
 import pickle
 protocol = 3
@@ -96,7 +96,12 @@ class Serializer(object):
 raise NotImplementedError
 
 def _load_stream_without_unbatching(self, stream):
-return self.load_stream(stream)
+"""
+Return an iterator of deserialized batches (lists) of objects from the 
input stream.
+if the serializer does not operate on batches the default 
implementation returns an
+iterator of single element lists.
+"""
+return map(lambda x: [x], self.load_stream(stream))
 
 # Note: our notion of "equality" is that output generated by
 # equal serializers can be deserialized using the same serializer.
@@ -278,50 +283,57 @@ class AutoBatchedSerializer(BatchedSerializer):
 return "AutoBatchedSerializer(%s)" % self.serializer
 
 
-class CartesianDeserializer(FramedSerializer):
+class CartesianDeserializer(Serializer):
 
 """
 Deserializes the JavaRDD cartesian() of two PythonRDDs.
+Due to pyspark batching we cannot simply use the result of the Java RDD 
cartesian,
+we additionally need to do the cartesian within each pair of batches.
 """
 
 def __init__(self, key_ser, val_ser):
-FramedSerializer.__init__(self)
 self.key_ser = key_ser
 self.val_ser = val_ser
 
-def prepare_keys_values(self, stream):
-key_stream = self.key_ser._load_stream_without_unbatching(stream)
-val_stream = self.val_ser._load_stream_without_unbatching(stream)
-key_is_batched = isinstance(self.key_ser, BatchedSerializer)
-val_is_batched = isinstance(self.val_ser, BatchedSerializer)
-for (keys, vals) in zip(key_stream, val_stream):
-keys = keys if key_is_batched else [keys]
-vals = vals if val_is_batched else [vals]
-yield (keys, vals)
+def _load_stream_without_unbatching(self, stream):
+key_batch_stream = self.key_ser._load_stream_without_unbatching(stream)
+val_batch_stream = self.val_ser._load_stream_without_unbatching(stream)
+for (key_batch, val_batch) in zip(key_batch_stream, val_batch_stream):
+# for correctness with repeated cartesian/zip this must be 
returned as 

spark git commit: [SPARK-8617][WEBUI] HistoryServer: Include in-progress files during cleanup

2016-12-08 Thread vanzin
Repository: spark
Updated Branches:
  refs/heads/master b44d1b8fc -> ed8869ebb


[SPARK-8617][WEBUI] HistoryServer: Include in-progress files during cleanup

## What changes were proposed in this pull request?
- Removed the `attempt.completed` filter so the cleaner also includes orphaned in-progress files.
- Use the loading time as lastUpdated for in-progress files, and keep using the modTime for completed files (see the sketch after this list). The first change prevents deletion of in-progress job files; the second ensures that the lastUpdated time won't change for completed jobs in the event of a HistoryServer reboot.
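
A small sketch of the two rules above (plain Python for illustration, not the Scala in `FsHistoryProvider`; the function and parameter names are illustrative):

```python
def last_updated(app_completed: bool, mod_time_ms: int, load_time_ms: int) -> int:
    # Completed logs keep the file's modification time, so a HistoryServer
    # restart does not change lastUpdated for finished applications.
    # In-progress logs use the loading time, since some filesystems do not
    # update the modification time on every write.
    return mod_time_ms if app_completed else load_time_ms

def should_clean(last_updated_ms: int, now_ms: int, max_age_ms: int) -> bool:
    # The old `attempt.completed` condition is gone, so orphaned in-progress
    # logs older than max_age are cleaned up as well.
    return now_ms - last_updated_ms > max_age_ms
```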

## How was this patch tested?
Added new unit tests and verified via existing tests.

Author: Ergin Seyfe 

Closes #16165 from seyfe/clear_old_inprogress_files.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ed8869eb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ed8869eb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ed8869eb

Branch: refs/heads/master
Commit: ed8869ebbf39783b16daba2e2498a2bc1889306f
Parents: b44d1b8
Author: Ergin Seyfe 
Authored: Thu Dec 8 10:21:09 2016 -0800
Committer: Marcelo Vanzin 
Committed: Thu Dec 8 10:21:09 2016 -0800

--
 .../deploy/history/FsHistoryProvider.scala  | 10 ++--
 .../deploy/history/FsHistoryProviderSuite.scala | 50 +++-
 2 files changed, 55 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ed8869eb/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala 
b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
index 8ef69b1..3011ed0 100644
--- 
a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
+++ 
b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
@@ -446,9 +446,13 @@ private[history] class FsHistoryProvider(conf: SparkConf, 
clock: Clock)
   }
 
   val logPath = fileStatus.getPath()
-
   val appCompleted = isApplicationCompleted(fileStatus)
 
+  // Use loading time as lastUpdated since some filesystems don't update 
modifiedTime
+  // each time file is updated. However use modifiedTime for completed 
jobs so lastUpdated
+  // won't change whenever HistoryServer restarts and reloads the file.
+  val lastUpdated = if (appCompleted) fileStatus.getModificationTime else 
clock.getTimeMillis()
+
   val appListener = replay(fileStatus, appCompleted, new 
ReplayListenerBus(), eventsFilter)
 
   // Without an app ID, new logs will render incorrectly in the listing 
page, so do not list or
@@ -461,7 +465,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, 
clock: Clock)
   appListener.appAttemptId,
   appListener.startTime.getOrElse(-1L),
   appListener.endTime.getOrElse(-1L),
-  fileStatus.getModificationTime(),
+  lastUpdated,
   appListener.sparkUser.getOrElse(NOT_STARTED),
   appCompleted,
   fileStatus.getLen()
@@ -546,7 +550,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, 
clock: Clock)
   val appsToRetain = new mutable.LinkedHashMap[String, 
FsApplicationHistoryInfo]()
 
   def shouldClean(attempt: FsApplicationAttemptInfo): Boolean = {
-now - attempt.lastUpdated > maxAge && attempt.completed
+now - attempt.lastUpdated > maxAge
   }
 
   // Scan all logs from the log directory.

http://git-wip-us.apache.org/repos/asf/spark/blob/ed8869eb/core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala
--
diff --git 
a/core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala
 
b/core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala
index 2c41c43..027f412 100644
--- 
a/core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala
+++ 
b/core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala
@@ -66,7 +66,8 @@ class FsHistoryProviderSuite extends SparkFunSuite with 
BeforeAndAfter with Matc
   }
 
   test("Parse application logs") {
-val provider = new FsHistoryProvider(createTestConf())
+val clock = new ManualClock(12345678)
+val provider = new FsHistoryProvider(createTestConf(), clock)
 
 // Write a new-style application log.
 val newAppComplete = newLogFile("new1", None, inProgress = false)
@@ -109,12 +110,15 @@ class FsHistoryProviderSuite extends SparkFunSuite with 
BeforeAndAfter with Matc
   List(ApplicationAttemptInfo(None, start, end, lastMod, user, 
completed)))
  

spark git commit: [SPARK-18662][HOTFIX] Add new resource-managers directories to SparkLauncher.

2016-12-08 Thread vanzin
Repository: spark
Updated Branches:
  refs/heads/master 6a5a7254d -> b44d1b8fc


[SPARK-18662][HOTFIX] Add new resource-managers directories to SparkLauncher.

These directories are added to the classpath of applications when testing or using SPARK_PREPEND_CLASSES; otherwise updated classes are not seen. Also, add the mesos directory, which was missing.

Author: Marcelo Vanzin 

Closes #16202 from vanzin/SPARK-18662.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b44d1b8f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b44d1b8f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b44d1b8f

Branch: refs/heads/master
Commit: b44d1b8fcf00b238df434cf70ad09460b27adf07
Parents: 6a5a725
Author: Marcelo Vanzin 
Authored: Thu Dec 8 09:48:33 2016 -0800
Committer: Marcelo Vanzin 
Committed: Thu Dec 8 09:48:33 2016 -0800

--
 .../java/org/apache/spark/launcher/AbstractCommandBuilder.java  | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b44d1b8f/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java
--
diff --git 
a/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java 
b/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java
index c748808..ba43659 100644
--- 
a/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java
+++ 
b/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java
@@ -158,12 +158,13 @@ abstract class AbstractCommandBuilder {
 "launcher",
 "mllib",
 "repl",
+"resource-managers/mesos",
+"resource-managers/yarn",
 "sql/catalyst",
 "sql/core",
 "sql/hive",
 "sql/hive-thriftserver",
-"streaming",
-"yarn"
+"streaming"
   );
   if (prependClasses) {
 if (!isTesting) {





spark git commit: [SPARK-18667][PYSPARK][SQL] Change the way to group row in BatchEvalPythonExec so input_file_name function can work with UDF in pyspark

2016-12-08 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 9095c152e -> 726217eb7


[SPARK-18667][PYSPARK][SQL] Change the way to group row in BatchEvalPythonExec 
so input_file_name function can work with UDF in pyspark

## What changes were proposed in this pull request?

`input_file_name` doesn't return the filename when used with a UDF in PySpark. An example shows the problem:

from pyspark.sql.functions import *
from pyspark.sql.types import *

def filename(path):
return path

sourceFile = udf(filename, StringType())
spark.read.json("tmp.json").select(sourceFile(input_file_name())).show()

+---+
|filename(input_file_name())|
+---+
|   |
+---+

The cause of this issue is that we group rows in `BatchEvalPythonExec` for batched processing of Python UDFs. Currently we group the rows first and then evaluate the expressions on them. If the data has fewer rows than required for a group, the iterator is consumed to the end before the evaluation. However, once the iterator reaches the end, we unset the input filename, so the input_file_name expression can't return the correct filename.

This patch fixes how the batches of rows are formed: we evaluate the expressions first and then group the evaluated results into batches.
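
To see why the order matters, here is a toy sketch in plain Python (not `BatchEvalPythonExec` itself; the `current_file` global stands in for the task-local input file name, which is set while a row is being produced and cleared once the input iterator is exhausted):

```python
from itertools import islice

current_file = None  # set while a row is being produced, cleared at end of input

def rows():
    global current_file
    for name in ("a.json", "b.json"):
        current_file = name
        yield {"value": name.upper()}
    current_file = None  # unset once the input iterator is exhausted

def grouped(iterator, n):
    iterator = iter(iterator)
    while True:
        batch = list(islice(iterator, n))
        if not batch:
            return
        yield batch

def group_then_evaluate(batch_size=100):
    # Old behaviour: the (small) input is drained into one batch first, so
    # current_file is already None by the time the "UDF" runs.
    return [(row["value"], current_file)
            for batch in grouped(rows(), batch_size) for row in batch]

def evaluate_then_group(batch_size=100):
    # New behaviour: evaluate while iterating, then batch the evaluated results.
    evaluated = ((row["value"], current_file) for row in rows())
    return [result for batch in grouped(evaluated, batch_size) for result in batch]

print(group_then_evaluate())   # [('A.JSON', None), ('B.JSON', None)]
print(evaluate_then_group())   # [('A.JSON', 'a.json'), ('B.JSON', 'b.json')]
```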

## How was this patch tested?

Added unit test to PySpark.

Please review http://spark.apache.org/contributing.html before opening a pull 
request.

Author: Liang-Chi Hsieh 

Closes #16115 from viirya/fix-py-udf-input-filename.

(cherry picked from commit 6a5a7254dc37952505989e9e580a14543adb730c)
Signed-off-by: Wenchen Fan 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/726217eb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/726217eb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/726217eb

Branch: refs/heads/branch-2.1
Commit: 726217eb7f783e10571a043546694b5b3c90ac77
Parents: 9095c15
Author: Liang-Chi Hsieh 
Authored: Thu Dec 8 23:22:18 2016 +0800
Committer: Wenchen Fan 
Committed: Thu Dec 8 23:22:40 2016 +0800

--
 python/pyspark/sql/tests.py |  8 +
 .../execution/python/BatchEvalPythonExec.scala  | 35 +---
 2 files changed, 24 insertions(+), 19 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/726217eb/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 50df68b..66320bd 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -412,6 +412,14 @@ class SQLTests(ReusedPySparkTestCase):
 res.explain(True)
 self.assertEqual(res.collect(), [Row(id=0, copy=0)])
 
+def test_udf_with_input_file_name(self):
+from pyspark.sql.functions import udf, input_file_name
+from pyspark.sql.types import StringType
+sourceFile = udf(lambda path: path, StringType())
+filePath = "python/test_support/sql/people1.json"
+row = 
self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
+self.assertTrue(row[0].find("people1.json") != -1)
+
 def test_basic_functions(self):
 rdd = self.sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}'])
 df = self.spark.read.json(rdd)

http://git-wip-us.apache.org/repos/asf/spark/blob/726217eb/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
index dcaf2c7..7a5ac48 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
@@ -119,26 +119,23 @@ case class BatchEvalPythonExec(udfs: Seq[PythonUDF], 
output: Seq[Attribute], chi
   val pickle = new Pickler(needConversion)
   // Input iterator to Python: input rows are grouped so we send them in 
batches to Python.
   // For each row, add it to the queue.
-  val inputIterator = iter.grouped(100).map { inputRows =>
-val toBePickled = inputRows.map { inputRow =>
-  queue.add(inputRow.asInstanceOf[UnsafeRow])
-  val row = projection(inputRow)
-  if (needConversion) {
-EvaluatePython.toJava(row, schema)
-  } else {
-// fast path for these types that does not need conversion in 

spark git commit: [SPARK-18667][PYSPARK][SQL] Change the way to group row in BatchEvalPythonExec so input_file_name function can work with UDF in pyspark

2016-12-08 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 7f3c778fd -> 6a5a7254d


[SPARK-18667][PYSPARK][SQL] Change the way to group row in BatchEvalPythonExec 
so input_file_name function can work with UDF in pyspark

## What changes were proposed in this pull request?

`input_file_name` doesn't return the filename when used with a UDF in PySpark. An example shows the problem:

from pyspark.sql.functions import *
from pyspark.sql.types import *

def filename(path):
return path

sourceFile = udf(filename, StringType())
spark.read.json("tmp.json").select(sourceFile(input_file_name())).show()

+---+
|filename(input_file_name())|
+---+
|   |
+---+

The cause of this issue is that we group rows in `BatchEvalPythonExec` for batched processing of Python UDFs. Currently we group the rows first and then evaluate the expressions on them. If the data has fewer rows than required for a group, the iterator is consumed to the end before the evaluation. However, once the iterator reaches the end, we unset the input filename, so the input_file_name expression can't return the correct filename.

This patch fixes how the batches of rows are formed: we evaluate the expressions first and then group the evaluated results into batches.

## How was this patch tested?

Added unit test to PySpark.

Please review http://spark.apache.org/contributing.html before opening a pull 
request.

Author: Liang-Chi Hsieh 

Closes #16115 from viirya/fix-py-udf-input-filename.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6a5a7254
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6a5a7254
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6a5a7254

Branch: refs/heads/master
Commit: 6a5a7254dc37952505989e9e580a14543adb730c
Parents: 7f3c778
Author: Liang-Chi Hsieh 
Authored: Thu Dec 8 23:22:18 2016 +0800
Committer: Wenchen Fan 
Committed: Thu Dec 8 23:22:18 2016 +0800

--
 python/pyspark/sql/tests.py |  8 +
 .../execution/python/BatchEvalPythonExec.scala  | 35 +---
 2 files changed, 24 insertions(+), 19 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6a5a7254/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 50df68b..66320bd 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -412,6 +412,14 @@ class SQLTests(ReusedPySparkTestCase):
 res.explain(True)
 self.assertEqual(res.collect(), [Row(id=0, copy=0)])
 
+def test_udf_with_input_file_name(self):
+from pyspark.sql.functions import udf, input_file_name
+from pyspark.sql.types import StringType
+sourceFile = udf(lambda path: path, StringType())
+filePath = "python/test_support/sql/people1.json"
+row = 
self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
+self.assertTrue(row[0].find("people1.json") != -1)
+
 def test_basic_functions(self):
 rdd = self.sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}'])
 df = self.spark.read.json(rdd)

http://git-wip-us.apache.org/repos/asf/spark/blob/6a5a7254/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
index dcaf2c7..7a5ac48 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
@@ -119,26 +119,23 @@ case class BatchEvalPythonExec(udfs: Seq[PythonUDF], 
output: Seq[Attribute], chi
   val pickle = new Pickler(needConversion)
   // Input iterator to Python: input rows are grouped so we send them in 
batches to Python.
   // For each row, add it to the queue.
-  val inputIterator = iter.grouped(100).map { inputRows =>
-val toBePickled = inputRows.map { inputRow =>
-  queue.add(inputRow.asInstanceOf[UnsafeRow])
-  val row = projection(inputRow)
-  if (needConversion) {
-EvaluatePython.toJava(row, schema)
-  } else {
-// fast path for these types that does not need conversion in 
Python
-val fields = new Array[Any](row.numFields)
-var i = 0
-while (i < row.numFields) {
-   

spark git commit: [SPARK-18718][TESTS] Skip some test failures due to path length limitation and fix tests to pass on Windows

2016-12-08 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 9bf8f3cd4 -> 7f3c778fd


[SPARK-18718][TESTS] Skip some test failures due to path length limitation and 
fix tests to pass on Windows

## What changes were proposed in this pull request?

There are some tests failing on Windows due to wrongly formatted paths and the path length limitation, as shown below.

This PR proposes both to fix the failing tests by fixing the paths for the tests below:

- `InsertSuite`
  ```
  Exception encountered when attempting to run a suite with class name: 
org.apache.spark.sql.sources.InsertSuite *** ABORTED *** (12 seconds, 547 
milliseconds)
  org.apache.spark.sql.AnalysisException: Path does not exist: 
file:/C:projectssparkarget   
mpspark-177945ef-9128-42b4-8c07-de31f78bbbd6;
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  ```

- `PathOptionSuite`
  ```
  - path option also exist for write path *** FAILED *** (1 second, 93 
milliseconds)
"C:[projectsspark   arget   mp]spark-5ab34a58-df8d-..." did not equal 
"C:[\projects\spark\target\tmp\]spark-5ab34a58-df8d-..." 
(PathOptionSuite.scala:93)
org.scalatest.exceptions.TestFailedException:
at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
...
  ```

- `UDFSuite`
  ```
  - SPARK-8005 input_file_name *** FAILED *** (2 seconds, 234 milliseconds)

"file:///C:/projects/spark/target/tmp/spark-e4e5720a-2006-48f9-8b11-797bf59794bf/part-1-26fb05e4-603d-471d-ae9d-b9549e0c7765.snappy.parquet"
 did not contain 
"C:\projects\spark\target\tmp\spark-e4e5720a-2006-48f9-8b11-797bf59794bf" 
(UDFSuite.scala:67)
org.scalatest.exceptions.TestFailedException:
  at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
  at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
...
  ```

and to skip the tests below, which fail on Windows due to the path length limitation.

- `SparkLauncherSuite`
  ```
  Test org.apache.spark.launcher.SparkLauncherSuite.testChildProcLauncher 
failed: java.lang.AssertionError: expected:<0> but was:<1>, took 0.062 sec
at 
org.apache.spark.launcher.SparkLauncherSuite.testChildProcLauncher(SparkLauncherSuite.java:177)
  ...
  ```

  The stderr from the process is `The filename or extension is too long`, which is equivalent to the error below.

- `BroadcastJoinSuite`
  ```
  04:09:40.882 ERROR org.apache.spark.deploy.worker.ExecutorRunner: Error 
running executor
  java.io.IOException: Cannot run program "C:\Progra~1\Java\jdk1.8.0\bin\java" 
(in directory "C:\projects\spark\work\app-20161205040542-\51658"): 
CreateProcess error=206, The filename or extension is too long
  at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
  at 
org.apache.spark.deploy.worker.ExecutorRunner.org$apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:167)
  at 
org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73)
  Caused by: java.io.IOException: CreateProcess error=206, The filename or 
extension is too long
  at java.lang.ProcessImpl.create(Native Method)
  at java.lang.ProcessImpl.(ProcessImpl.java:386)
  at java.lang.ProcessImpl.start(ProcessImpl.java:137)
  at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
  ... 2 more
  04:09:40.929 ERROR org.apache.spark.deploy.worker.ExecutorRunner: Error 
running executor

(apparently the same error message repeats indefinitely)

  ...
  ```

## How was this patch tested?

Manually tested via AppVeyor.

**Before**

`InsertSuite`: 
https://ci.appveyor.com/project/spark-test/spark/build/148-InsertSuite-pr
`PathOptionSuite`: 
https://ci.appveyor.com/project/spark-test/spark/build/139-PathOptionSuite-pr
`UDFSuite`: 
https://ci.appveyor.com/project/spark-test/spark/build/143-UDFSuite-pr
`SparkLauncherSuite`: 
https://ci.appveyor.com/project/spark-test/spark/build/141-SparkLauncherSuite-pr
`BroadcastJoinSuite`: 
https://ci.appveyor.com/project/spark-test/spark/build/145-BroadcastJoinSuite-pr

**After**

`PathOptionSuite`: 
https://ci.appveyor.com/project/spark-test/spark/build/140-PathOptionSuite-pr
`SparkLauncherSuite`: 
https://ci.appveyor.com/project/spark-test/spark/build/142-SparkLauncherSuite-pr
`UDFSuite`: 
https://ci.appveyor.com/project/spark-test/spark/build/144-UDFSuite-pr
`InsertSuite`: 
https://ci.appveyor.com/project/spark-test/spark/build/147-InsertSuite-pr
`BroadcastJoinSuite`: 
https://ci.appveyor.com/project/spark-test/spark/build/149-BroadcastJoinSuite-pr

Author: hyukjinkwon 

Closes #16147 from HyukjinKwon/fix-tests.



spark git commit: [SPARK-18325][SPARKR][ML] SparkR ML wrappers example code and user guide

2016-12-08 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 48aa6775d -> 9095c152e


[SPARK-18325][SPARKR][ML] SparkR ML wrappers example code and user guide

## What changes were proposed in this pull request?
* Add all R examples for ML wrappers which were added during the 2.1 release cycle.
* Split the whole ```ml.R``` example file into individual examples for each algorithm, which makes it convenient for users to rerun them.
* Add the corresponding examples to the ML user guide.
* Update the ML section of the SparkR user guide.

Note: the MLlib Scala/Java/Python examples will be consistent; however, SparkR examples may differ from them, since R users may use the algorithms in a different way, for example, using an R ```formula``` to specify ```featuresCol``` and ```labelCol```.

## How was this patch tested?
Ran all examples manually.

Author: Yanbo Liang 

Closes #16148 from yanboliang/spark-18325.

(cherry picked from commit 9bf8f3cd4f62f921c32fb50b8abf49576a80874f)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9095c152
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9095c152
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9095c152

Branch: refs/heads/branch-2.1
Commit: 9095c152e7fedf469dcc4887f5b6a1882cd74c28
Parents: 48aa677
Author: Yanbo Liang 
Authored: Thu Dec 8 06:19:38 2016 -0800
Committer: Yanbo Liang 
Committed: Thu Dec 8 06:20:28 2016 -0800

--
 docs/ml-classification-regression.md |  67 +++-
 docs/ml-clustering.md|  18 +++-
 docs/ml-collaborative-filtering.md   |   8 ++
 docs/sparkr.md   |  46 
 examples/src/main/r/ml.R | 148 --
 examples/src/main/r/ml/als.R |  45 
 examples/src/main/r/ml/gaussianMixture.R |  42 
 examples/src/main/r/ml/gbt.R |  63 +++
 examples/src/main/r/ml/glm.R |  57 ++
 examples/src/main/r/ml/isoreg.R  |  42 
 examples/src/main/r/ml/kmeans.R  |  44 
 examples/src/main/r/ml/kstest.R  |  39 +++
 examples/src/main/r/ml/lda.R |  46 
 examples/src/main/r/ml/logit.R   |  63 +++
 examples/src/main/r/ml/ml.R  |  65 +++
 examples/src/main/r/ml/mlp.R |  48 +
 examples/src/main/r/ml/naiveBayes.R  |  41 +++
 examples/src/main/r/ml/randomForest.R|  63 +++
 examples/src/main/r/ml/survreg.R |  43 
 19 files changed, 810 insertions(+), 178 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9095c152/docs/ml-classification-regression.md
--
diff --git a/docs/ml-classification-regression.md 
b/docs/ml-classification-regression.md
index 557a53c..2ffea64 100644
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -75,6 +75,13 @@ More details on parameters can be found in the [Python API 
documentation](api/py
 {% include_example python/ml/logistic_regression_with_elastic_net.py %}
 
 
+
+
+More details on parameters can be found in the [R API 
documentation](api/R/spark.logit.html).
+
+{% include_example binomial r/ml/logit.R %}
+
+
 
 
 The `spark.ml` implementation of logistic regression also supports
@@ -171,6 +178,13 @@ model with elastic net regularization.
 {% include_example 
python/ml/multiclass_logistic_regression_with_elastic_net.py %}
 
 
+
+
+More details on parameters can be found in the [R API 
documentation](api/R/spark.logit.html).
+
+{% include_example multinomial r/ml/logit.R %}
+
+
 
 
 
@@ -242,6 +256,14 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 
 {% include_example python/ml/random_forest_classifier_example.py %}
 
+
+
+
+Refer to the [R API docs](api/R/spark.randomForest.html) for more details.
+
+{% include_example classification r/ml/randomForest.R %}
+
+
 
 
 ## Gradient-boosted tree classifier
@@ -275,6 +297,14 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 
 {% include_example python/ml/gradient_boosted_tree_classifier_example.py %}
 
+
+
+
+Refer to the [R API docs](api/R/spark.gbt.html) for more details.
+
+{% include_example classification r/ml/gbt.R %}
+
+
 
 
 ## Multilayer perceptron classifier
@@ -324,6 +354,13 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 {% include_example python/ml/multilayer_perceptron_classification.py %}
 
 
+
+
+Refer to the [R API docs](api/R/spark.mlp.html) for more details.
+
+{% include_example r/ml/mlp.R %}
+
+
 
 
 
@@ -400,7 +437,7 @@ Refer to the 

spark git commit: [SPARK-18325][SPARKR][ML] SparkR ML wrappers example code and user guide

2016-12-08 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master b47b892e4 -> 9bf8f3cd4


[SPARK-18325][SPARKR][ML] SparkR ML wrappers example code and user guide

## What changes were proposed in this pull request?
* Add all R examples for ML wrappers which were added during the 2.1 release cycle.
* Split the whole ```ml.R``` example file into individual examples for each algorithm, which makes it convenient for users to rerun them.
* Add the corresponding examples to the ML user guide.
* Update the ML section of the SparkR user guide.

Note: the MLlib Scala/Java/Python examples will be consistent; however, SparkR examples may differ from them, since R users may use the algorithms in a different way, for example, using an R ```formula``` to specify ```featuresCol``` and ```labelCol```.

## How was this patch tested?
Ran all examples manually.

Author: Yanbo Liang 

Closes #16148 from yanboliang/spark-18325.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9bf8f3cd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9bf8f3cd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9bf8f3cd

Branch: refs/heads/master
Commit: 9bf8f3cd4f62f921c32fb50b8abf49576a80874f
Parents: b47b892
Author: Yanbo Liang 
Authored: Thu Dec 8 06:19:38 2016 -0800
Committer: Yanbo Liang 
Committed: Thu Dec 8 06:19:38 2016 -0800

--
 docs/ml-classification-regression.md |  67 +++-
 docs/ml-clustering.md|  18 +++-
 docs/ml-collaborative-filtering.md   |   8 ++
 docs/sparkr.md   |  46 
 examples/src/main/r/ml.R | 148 --
 examples/src/main/r/ml/als.R |  45 
 examples/src/main/r/ml/gaussianMixture.R |  42 
 examples/src/main/r/ml/gbt.R |  63 +++
 examples/src/main/r/ml/glm.R |  57 ++
 examples/src/main/r/ml/isoreg.R  |  42 
 examples/src/main/r/ml/kmeans.R  |  44 
 examples/src/main/r/ml/kstest.R  |  39 +++
 examples/src/main/r/ml/lda.R |  46 
 examples/src/main/r/ml/logit.R   |  63 +++
 examples/src/main/r/ml/ml.R  |  65 +++
 examples/src/main/r/ml/mlp.R |  48 +
 examples/src/main/r/ml/naiveBayes.R  |  41 +++
 examples/src/main/r/ml/randomForest.R|  63 +++
 examples/src/main/r/ml/survreg.R |  43 
 19 files changed, 810 insertions(+), 178 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9bf8f3cd/docs/ml-classification-regression.md
--
diff --git a/docs/ml-classification-regression.md 
b/docs/ml-classification-regression.md
index bb9390f..782ee58 100644
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -75,6 +75,13 @@ More details on parameters can be found in the [Python API 
documentation](api/py
 {% include_example python/ml/logistic_regression_with_elastic_net.py %}
 
 
+
+
+More details on parameters can be found in the [R API 
documentation](api/R/spark.logit.html).
+
+{% include_example binomial r/ml/logit.R %}
+
+
 
 
 The `spark.ml` implementation of logistic regression also supports
@@ -171,6 +178,13 @@ model with elastic net regularization.
 {% include_example 
python/ml/multiclass_logistic_regression_with_elastic_net.py %}
 
 
+
+
+More details on parameters can be found in the [R API 
documentation](api/R/spark.logit.html).
+
+{% include_example multinomial r/ml/logit.R %}
+
+
 
 
 
@@ -242,6 +256,14 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 
 {% include_example python/ml/random_forest_classifier_example.py %}
 
+
+
+
+Refer to the [R API docs](api/R/spark.randomForest.html) for more details.
+
+{% include_example classification r/ml/randomForest.R %}
+
+
 
 
 ## Gradient-boosted tree classifier
@@ -275,6 +297,14 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 
 {% include_example python/ml/gradient_boosted_tree_classifier_example.py %}
 
+
+
+
+Refer to the [R API docs](api/R/spark.gbt.html) for more details.
+
+{% include_example classification r/ml/gbt.R %}
+
+
 
 
 ## Multilayer perceptron classifier
@@ -324,6 +354,13 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 {% include_example python/ml/multilayer_perceptron_classification.py %}
 
 
+
+
+Refer to the [R API docs](api/R/spark.mlp.html) for more details.
+
+{% include_example r/ml/mlp.R %}
+
+
 
 
 
@@ -400,7 +437,7 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 
 Refer to the [R API docs](api/R/spark.naiveBayes.html) for