date:20180510

svn commit: r26841 - in /dev/spark/2.4.0-SNAPSHOT-2018_05_10_20_01-75cf369-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-05-10 Thread pwendell

Author: pwendell
Date: Fri May 11 03:15:44 2018
New Revision: 26841

Log:
Apache Spark 2.4.0-SNAPSHOT-2018_05_10_20_01-75cf369 docs


[This commit notification would consist of 1462 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

svn commit: r26840 - in /dev/spark/2.3.1-SNAPSHOT-2018_05_10_18_01-414e4e3-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-05-10 Thread pwendell

Author: pwendell
Date: Fri May 11 01:15:34 2018
New Revision: 26840

Log:
Apache Spark 2.3.1-SNAPSHOT-2018_05_10_18_01-414e4e3 docs


[This commit notification would consist of 1443 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

spark git commit: [SPARK-24197][SPARKR][SQL] Adding array_sort function to SparkR

2018-05-10 Thread gurwls223

Repository: spark
Updated Branches:
  refs/heads/master a4206d58e -> 75cf369c7


[SPARK-24197][SPARKR][SQL] Adding array_sort function to SparkR

## What changes were proposed in this pull request?

The PR adds array_sort function to SparkR.

## How was this patch tested?

Tests added into R/pkg/tests/fulltests/test_sparkSQL.R

## Example
```
> df <- createDataFrame(list(list(list(2L, 1L, 3L, NA)), list(list(NA, 6L, 5L, 
> NA, 4L
> head(collect(select(df, array_sort(df[[1]]
```
Result:
```
   array_sort(_1)
1 1, 2, 3, NA
2 4, 5, 6, NA, NA
```

Author: Marek Novotny 

Closes #21294 from mn-mikke/SPARK-24197.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/75cf369c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/75cf369c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/75cf369c

Branch: refs/heads/master
Commit: 75cf369c742e7c7b68f384d123447c97be95c9f0
Parents: a4206d5
Author: Marek Novotny 
Authored: Fri May 11 09:05:35 2018 +0800
Committer: hyukjinkwon 
Committed: Fri May 11 09:05:35 2018 +0800

--
 R/pkg/NAMESPACE   |  1 +
 R/pkg/R/functions.R   | 21 ++---
 R/pkg/R/generics.R|  4 
 R/pkg/tests/fulltests/test_sparkSQL.R | 13 +
 4 files changed, 32 insertions(+), 7 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/75cf369c/R/pkg/NAMESPACE
--
diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE
index 8cd0035..5f82096 100644
--- a/R/pkg/NAMESPACE
+++ b/R/pkg/NAMESPACE
@@ -204,6 +204,7 @@ exportMethods("%<=>%",
   "array_max",
   "array_min",
   "array_position",
+  "array_sort",
   "asc",
   "ascii",
   "asin",

http://git-wip-us.apache.org/repos/asf/spark/blob/75cf369c/R/pkg/R/functions.R
--
diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R
index 04d0e46..1f97054 100644
--- a/R/pkg/R/functions.R
+++ b/R/pkg/R/functions.R
@@ -207,7 +207,7 @@ NULL
 #' tmp <- mutate(df, v1 = create_array(df$mpg, df$cyl, df$hp))
 #' head(select(tmp, array_contains(tmp$v1, 21), size(tmp$v1)))
 #' head(select(tmp, array_max(tmp$v1), array_min(tmp$v1)))
-#' head(select(tmp, array_position(tmp$v1, 21)))
+#' head(select(tmp, array_position(tmp$v1, 21), array_sort(tmp$v1)))
 #' head(select(tmp, flatten(tmp$v1)))
 #' tmp2 <- mutate(tmp, v2 = explode(tmp$v1))
 #' head(tmp2)
@@ -3044,6 +3044,20 @@ setMethod("array_position",
   })
 
 #' @details
+#' \code{array_sort}: Sorts the input array in ascending order. The elements 
of the input array
+#' must be orderable. NA elements will be placed at the end of the returned 
array.
+#'
+#' @rdname column_collection_functions
+#' @aliases array_sort array_sort,Column-method
+#' @note array_sort since 2.4.0
+setMethod("array_sort",
+  signature(x = "Column"),
+  function(x) {
+jc <- callJStatic("org.apache.spark.sql.functions", "array_sort", 
x@jc)
+column(jc)
+  })
+
+#' @details
 #' \code{flatten}: Transforms an array of arrays into a single array.
 #'
 #' @rdname column_collection_functions
@@ -3125,8 +3139,9 @@ setMethod("size",
   })
 
 #' @details
-#' \code{sort_array}: Sorts the input array in ascending or descending order 
according
-#' to the natural ordering of the array elements.
+#' \code{sort_array}: Sorts the input array in ascending or descending order 
according to
+#' the natural ordering of the array elements. NA elements will be placed at 
the beginning of
+#' the returned array in ascending order or at the end of the returned array 
in descending order.
 #'
 #' @rdname column_collection_functions
 #' @param asc a logical flag indicating the sorting order.

http://git-wip-us.apache.org/repos/asf/spark/blob/75cf369c/R/pkg/R/generics.R
--
diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R
index 4ef12d1..5faa51e 100644
--- a/R/pkg/R/generics.R
+++ b/R/pkg/R/generics.R
@@ -769,6 +769,10 @@ setGeneric("array_min", function(x) { 
standardGeneric("array_min") })
 #' @name NULL
 setGeneric("array_position", function(x, value) { 
standardGeneric("array_position") })
 
+#' @rdname column_collection_functions
+#' @name NULL
+setGeneric("array_sort", function(x) { standardGeneric("array_sort") })
+
 #' @rdname column_string_functions
 #' @name NULL
 setGeneric("ascii", function(x) { standardGeneric("ascii") })

http://git-wip-us.apache.org/repos/asf/spark/blob/75cf369c/R/pkg/tests/fulltests/test_sparkSQL.R

spark git commit: [SPARK-22938][SQL][FOLLOWUP] Assert that SQLConf.get is accessed only on the driver

2018-05-10 Thread gurwls223

Repository: spark
Updated Branches:
  refs/heads/master d3c426a5b -> a4206d58e


[SPARK-22938][SQL][FOLLOWUP] Assert that SQLConf.get is accessed only on the 
driver

## What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/20136 . #20136 
didn't really work because in the test, we are using local backend, which 
shares the driver side `SparkEnv`, so `SparkEnv.get.executorId == 
SparkContext.DRIVER_IDENTIFIER` doesn't work.

This PR changes the check to `TaskContext.get != null`, and move the check to 
`SQLConf.get`, and fix all the places that violate this check:
* `InMemoryTableScanExec#createAndDecompressColumn` is executed inside 
`rdd.map`, we can't access `conf.offHeapColumnVectorEnabled` there. 
https://github.com/apache/spark/pull/21223 merged
* `DataType#sameType` may be executed in executor side, for things like json 
schema inference, so we can't call `conf.caseSensitiveAnalysis` there. This 
contributes to most of the code changes, as we need to add `caseSensitive` 
parameter to a lot of methods.
* `ParquetFilters` is used in the file scan function, which is executed in 
executor side, so we can't can't call `conf.parquetFilterPushDownDate` there. 
https://github.com/apache/spark/pull/21224 merged
* `WindowExec#createBoundOrdering` is called on executor side, so we can't use 
`conf.sessionLocalTimezone` there. https://github.com/apache/spark/pull/21225 
merged
* `JsonToStructs` can be serialized to executors and evaluate, we should not 
call `SQLConf.get.getConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA)` in the 
body. https://github.com/apache/spark/pull/21226 merged

## How was this patch tested?

existing test

Author: Wenchen Fan 

Closes #21190 from cloud-fan/minor.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a4206d58
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a4206d58
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a4206d58

Branch: refs/heads/master
Commit: a4206d58e05ab9ed6f01fee57e18dee65cbc4efc
Parents: d3c426a
Author: Wenchen Fan 
Authored: Fri May 11 09:01:40 2018 +0800
Committer: hyukjinkwon 
Committed: Fri May 11 09:01:40 2018 +0800

--
 .../sql/catalyst/analysis/CheckAnalysis.scala   |   5 +-
 .../catalyst/analysis/ResolveInlineTables.scala |   4 +-
 .../sql/catalyst/analysis/TypeCoercion.scala| 156 +++
 .../org/apache/spark/sql/internal/SQLConf.scala |  16 +-
 .../org/apache/spark/sql/types/DataType.scala   |   8 +-
 .../catalyst/analysis/TypeCoercionSuite.scala   |  70 -
 .../org/apache/spark/sql/SparkSession.scala |  21 ++-
 .../datasources/PartitioningUtils.scala |   5 +-
 .../datasources/json/JsonInferSchema.scala  |  39 +++--
 .../execution/datasources/json/JsonSuite.scala  |   4 +-
 10 files changed, 188 insertions(+), 140 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a4206d58/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index 90bda2a..94b0561 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -24,6 +24,7 @@ import 
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
 import org.apache.spark.sql.catalyst.optimizer.BooleanSimplification
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types._
 
 /**
@@ -260,7 +261,9 @@ trait CheckAnalysis extends PredicateHelper {
   // Check if the data types match.
   dataTypes(child).zip(ref).zipWithIndex.foreach { case ((dt1, 
dt2), ci) =>
 // SPARK-18058: we shall not care about the nullability of 
columns
-if (TypeCoercion.findWiderTypeForTwo(dt1.asNullable, 
dt2.asNullable).isEmpty) {
+val widerType = TypeCoercion.findWiderTypeForTwo(
+  dt1.asNullable, dt2.asNullable, 
SQLConf.get.caseSensitiveAnalysis)
+if (widerType.isEmpty) {
   failAnalysis(
 s"""
   |${operator.nodeName} can only be performed on tables 
with the compatible

svn commit: r26838 - in /dev/spark/2.4.0-SNAPSHOT-2018_05_10_16_01-d3c426a-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-05-10 Thread pwendell

Author: pwendell
Date: Thu May 10 23:16:23 2018
New Revision: 26838

Log:
Apache Spark 2.4.0-SNAPSHOT-2018_05_10_16_01-d3c426a docs


[This commit notification would consist of 1462 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

spark git commit: [SPARK-10878][CORE] Fix race condition when multiple clients resolves artifacts at the same time

2018-05-10 Thread vanzin

Repository: spark
Updated Branches:
  refs/heads/branch-2.3 4c49b12da -> 414e4e3d7


[SPARK-10878][CORE] Fix race condition when multiple clients resolves artifacts 
at the same time

## What changes were proposed in this pull request?

When multiple clients attempt to resolve artifacts via the `--packages` 
parameter, they could run into race condition when they each attempt to modify 
the dummy `org.apache.spark-spark-submit-parent-default.xml` file created in 
the default ivy cache dir.
This PR changes the behavior to encode UUID in the dummy module descriptor so 
each client will operate on a different resolution file in the ivy cache dir. 
In addition, this patch changes the behavior of when and which resolution files 
are cleaned to prevent accumulation of resolution files in the default ivy 
cache dir.

Since this PR is a successor of #18801, close #18801. Many codes were ported 
from #18801. **Many efforts were put here. I think this PR should credit to 
Victsm .**

## How was this patch tested?

added UT into `SparkSubmitUtilsSuite`

Author: Kazuaki Ishizaki 

Closes #21251 from kiszk/SPARK-10878.

(cherry picked from commit d3c426a5b02abdec49ff45df12a8f11f9e473a88)
Signed-off-by: Marcelo Vanzin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/414e4e3d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/414e4e3d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/414e4e3d

Branch: refs/heads/branch-2.3
Commit: 414e4e3d70caa950a63fab1c8cac3314fb961b0c
Parents: 4c49b12
Author: Kazuaki Ishizaki 
Authored: Thu May 10 14:41:55 2018 -0700
Committer: Marcelo Vanzin 
Committed: Thu May 10 14:42:06 2018 -0700

--
 .../org/apache/spark/deploy/SparkSubmit.scala   | 42 +++-
 .../spark/deploy/SparkSubmitUtilsSuite.scala| 15 +++
 2 files changed, 47 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/414e4e3d/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
--
diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala 
b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
index deb52a4..d1347f1 100644
--- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
@@ -22,6 +22,7 @@ import java.lang.reflect.{InvocationTargetException, 
Modifier, UndeclaredThrowab
 import java.net.URL
 import java.security.PrivilegedExceptionAction
 import java.text.ParseException
+import java.util.UUID
 
 import scala.annotation.tailrec
 import scala.collection.mutable.{ArrayBuffer, HashMap, Map}
@@ -1209,7 +1210,33 @@ private[spark] object SparkSubmitUtils {
 
   /** A nice function to use in tests as well. Values are dummy strings. */
   def getModuleDescriptor: DefaultModuleDescriptor = 
DefaultModuleDescriptor.newDefaultInstance(
-ModuleRevisionId.newInstance("org.apache.spark", "spark-submit-parent", 
"1.0"))
+// Include UUID in module name, so multiple clients resolving maven 
coordinate at the same time
+// do not modify the same resolution file concurrently.
+ModuleRevisionId.newInstance("org.apache.spark",
+  s"spark-submit-parent-${UUID.randomUUID.toString}",
+  "1.0"))
+
+  /**
+   * Clear ivy resolution from current launch. The resolution file is usually 
at
+   * ~/.ivy2/org.apache.spark-spark-submit-parent-$UUID-default.xml,
+   * ~/.ivy2/resolved-org.apache.spark-spark-submit-parent-$UUID-1.0.xml, and
+   * 
~/.ivy2/resolved-org.apache.spark-spark-submit-parent-$UUID-1.0.properties.
+   * Since each launch will have its own resolution files created, delete them 
after
+   * each resolution to prevent accumulation of these files in the ivy cache 
dir.
+   */
+  private def clearIvyResolutionFiles(
+  mdId: ModuleRevisionId,
+  ivySettings: IvySettings,
+  ivyConfName: String): Unit = {
+val currentResolutionFiles = Seq(
+  s"${mdId.getOrganisation}-${mdId.getName}-$ivyConfName.xml",
+  
s"resolved-${mdId.getOrganisation}-${mdId.getName}-${mdId.getRevision}.xml",
+  
s"resolved-${mdId.getOrganisation}-${mdId.getName}-${mdId.getRevision}.properties"
+)
+currentResolutionFiles.foreach { filename =>
+  new File(ivySettings.getDefaultCache, filename).delete()
+}
+  }
 
   /**
* Resolves any dependencies that were supplied through maven coordinates
@@ -1260,14 +1287,6 @@ private[spark] object SparkSubmitUtils {
 
 // A Module descriptor must be specified. Entries are dummy strings
 val md = getModuleDescriptor
-// clear ivy resolution from previous launches. The resolution file is 
usually at
-

spark git commit: [SPARK-10878][CORE] Fix race condition when multiple clients resolves artifacts at the same time

2018-05-10 Thread vanzin

Repository: spark
Updated Branches:
  refs/heads/master 3e2600538 -> d3c426a5b


[SPARK-10878][CORE] Fix race condition when multiple clients resolves artifacts 
at the same time

## What changes were proposed in this pull request?

When multiple clients attempt to resolve artifacts via the `--packages` 
parameter, they could run into race condition when they each attempt to modify 
the dummy `org.apache.spark-spark-submit-parent-default.xml` file created in 
the default ivy cache dir.
This PR changes the behavior to encode UUID in the dummy module descriptor so 
each client will operate on a different resolution file in the ivy cache dir. 
In addition, this patch changes the behavior of when and which resolution files 
are cleaned to prevent accumulation of resolution files in the default ivy 
cache dir.

Since this PR is a successor of #18801, close #18801. Many codes were ported 
from #18801. **Many efforts were put here. I think this PR should credit to 
Victsm .**

## How was this patch tested?

added UT into `SparkSubmitUtilsSuite`

Author: Kazuaki Ishizaki 

Closes #21251 from kiszk/SPARK-10878.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d3c426a5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d3c426a5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d3c426a5

Branch: refs/heads/master
Commit: d3c426a5b02abdec49ff45df12a8f11f9e473a88
Parents: 3e26005
Author: Kazuaki Ishizaki 
Authored: Thu May 10 14:41:55 2018 -0700
Committer: Marcelo Vanzin 
Committed: Thu May 10 14:41:55 2018 -0700

--
 .../org/apache/spark/deploy/SparkSubmit.scala   | 42 +++-
 .../spark/deploy/SparkSubmitUtilsSuite.scala| 15 +++
 2 files changed, 47 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d3c426a5/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
--
diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala 
b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
index 427c797..087e9c3 100644
--- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
@@ -22,6 +22,7 @@ import java.lang.reflect.{InvocationTargetException, 
Modifier, UndeclaredThrowab
 import java.net.URL
 import java.security.PrivilegedExceptionAction
 import java.text.ParseException
+import java.util.UUID
 
 import scala.annotation.tailrec
 import scala.collection.mutable.{ArrayBuffer, HashMap, Map}
@@ -1204,7 +1205,33 @@ private[spark] object SparkSubmitUtils {
 
   /** A nice function to use in tests as well. Values are dummy strings. */
   def getModuleDescriptor: DefaultModuleDescriptor = 
DefaultModuleDescriptor.newDefaultInstance(
-ModuleRevisionId.newInstance("org.apache.spark", "spark-submit-parent", 
"1.0"))
+// Include UUID in module name, so multiple clients resolving maven 
coordinate at the same time
+// do not modify the same resolution file concurrently.
+ModuleRevisionId.newInstance("org.apache.spark",
+  s"spark-submit-parent-${UUID.randomUUID.toString}",
+  "1.0"))
+
+  /**
+   * Clear ivy resolution from current launch. The resolution file is usually 
at
+   * ~/.ivy2/org.apache.spark-spark-submit-parent-$UUID-default.xml,
+   * ~/.ivy2/resolved-org.apache.spark-spark-submit-parent-$UUID-1.0.xml, and
+   * 
~/.ivy2/resolved-org.apache.spark-spark-submit-parent-$UUID-1.0.properties.
+   * Since each launch will have its own resolution files created, delete them 
after
+   * each resolution to prevent accumulation of these files in the ivy cache 
dir.
+   */
+  private def clearIvyResolutionFiles(
+  mdId: ModuleRevisionId,
+  ivySettings: IvySettings,
+  ivyConfName: String): Unit = {
+val currentResolutionFiles = Seq(
+  s"${mdId.getOrganisation}-${mdId.getName}-$ivyConfName.xml",
+  
s"resolved-${mdId.getOrganisation}-${mdId.getName}-${mdId.getRevision}.xml",
+  
s"resolved-${mdId.getOrganisation}-${mdId.getName}-${mdId.getRevision}.properties"
+)
+currentResolutionFiles.foreach { filename =>
+  new File(ivySettings.getDefaultCache, filename).delete()
+}
+  }
 
   /**
* Resolves any dependencies that were supplied through maven coordinates
@@ -1255,14 +1282,6 @@ private[spark] object SparkSubmitUtils {
 
 // A Module descriptor must be specified. Entries are dummy strings
 val md = getModuleDescriptor
-// clear ivy resolution from previous launches. The resolution file is 
usually at
-// ~/.ivy2/org.apache.spark-spark-submit-parent-default.xml. In 
between runs, this file
-// leads to confusion with

spark git commit: [SPARK-19181][CORE] Fixing flaky "SparkListenerSuite.local metrics"

2018-05-10 Thread vanzin

Repository: spark
Updated Branches:
  refs/heads/master 6282fc64e -> 3e2600538


[SPARK-19181][CORE] Fixing flaky "SparkListenerSuite.local metrics"

## What changes were proposed in this pull request?

Sometimes "SparkListenerSuite.local metrics" test fails because the average of 
executorDeserializeTime is too short. As squito suggested to avoid these 
situations in one of the task a reference introduced to an object implementing 
a custom Externalizable.readExternal which sleeps 1ms before returning.

## How was this patch tested?

With unit tests (and checking the effect of this change to the average with a 
much larger sleep time).

Author: âattilapirosâ 
Author: Attila Zsolt Piros <2017933+attilapi...@users.noreply.github.com>

Closes #21280 from attilapiros/SPARK-19181.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3e260053
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3e260053
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3e260053

Branch: refs/heads/master
Commit: 3e2600538ee477ffe3f23fba57719e035219550b
Parents: 6282fc6
Author: âattilapirosâ 
Authored: Thu May 10 14:26:38 2018 -0700
Committer: Marcelo Vanzin 
Committed: Thu May 10 14:26:38 2018 -0700

--
 .../apache/spark/scheduler/SparkListenerSuite.scala  | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3e260053/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala
--
diff --git 
a/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala 
b/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala
index da6ecb8..fa47a52 100644
--- a/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala
+++ b/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala
@@ -17,6 +17,7 @@
 
 package org.apache.spark.scheduler
 
+import java.io.{Externalizable, IOException, ObjectInput, ObjectOutput}
 import java.util.concurrent.Semaphore
 
 import scala.collection.JavaConverters._
@@ -294,10 +295,13 @@ class SparkListenerSuite extends SparkFunSuite with 
LocalSparkContext with Match
 val listener = new SaveStageAndTaskInfo
 sc.addSparkListener(listener)
 sc.addSparkListener(new StatsReportListener)
-// just to make sure some of the tasks take a noticeable amount of time
+// just to make sure some of the tasks and their deserialization take a 
noticeable
+// amount of time
+val slowDeserializable = new SlowDeserializable
 val w = { i: Int =>
   if (i == 0) {
 Thread.sleep(100)
+slowDeserializable.use()
   }
   i
 }
@@ -583,3 +587,12 @@ private class FirehoseListenerThatAcceptsSparkConf(conf: 
SparkConf) extends Spar
 case _ =>
   }
 }
+
+private class SlowDeserializable extends Externalizable {
+
+  override def writeExternal(out: ObjectOutput): Unit = { }
+
+  override def readExternal(in: ObjectInput): Unit = Thread.sleep(1)
+
+  def use(): Unit = { }
+}


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

spark git commit: [SPARK-19181][CORE] Fixing flaky "SparkListenerSuite.local metrics"

2018-05-10 Thread vanzin

Repository: spark
Updated Branches:
  refs/heads/branch-2.3 16cd9ac52 -> 4c49b12da


[SPARK-19181][CORE] Fixing flaky "SparkListenerSuite.local metrics"

## What changes were proposed in this pull request?

Sometimes "SparkListenerSuite.local metrics" test fails because the average of 
executorDeserializeTime is too short. As squito suggested to avoid these 
situations in one of the task a reference introduced to an object implementing 
a custom Externalizable.readExternal which sleeps 1ms before returning.

## How was this patch tested?

With unit tests (and checking the effect of this change to the average with a 
much larger sleep time).

Author: âattilapirosâ 
Author: Attila Zsolt Piros <2017933+attilapi...@users.noreply.github.com>

Closes #21280 from attilapiros/SPARK-19181.

(cherry picked from commit 3e2600538ee477ffe3f23fba57719e035219550b)
Signed-off-by: Marcelo Vanzin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4c49b12d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4c49b12d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4c49b12d

Branch: refs/heads/branch-2.3
Commit: 4c49b12da512ae29e2e4b773a334abbf6a4f08f1
Parents: 16cd9ac
Author: âattilapirosâ 
Authored: Thu May 10 14:26:38 2018 -0700
Committer: Marcelo Vanzin 
Committed: Thu May 10 14:26:47 2018 -0700

--
 .../apache/spark/scheduler/SparkListenerSuite.scala  | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4c49b12d/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala
--
diff --git 
a/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala 
b/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala
index da6ecb8..fa47a52 100644
--- a/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala
+++ b/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala
@@ -17,6 +17,7 @@
 
 package org.apache.spark.scheduler
 
+import java.io.{Externalizable, IOException, ObjectInput, ObjectOutput}
 import java.util.concurrent.Semaphore
 
 import scala.collection.JavaConverters._
@@ -294,10 +295,13 @@ class SparkListenerSuite extends SparkFunSuite with 
LocalSparkContext with Match
 val listener = new SaveStageAndTaskInfo
 sc.addSparkListener(listener)
 sc.addSparkListener(new StatsReportListener)
-// just to make sure some of the tasks take a noticeable amount of time
+// just to make sure some of the tasks and their deserialization take a 
noticeable
+// amount of time
+val slowDeserializable = new SlowDeserializable
 val w = { i: Int =>
   if (i == 0) {
 Thread.sleep(100)
+slowDeserializable.use()
   }
   i
 }
@@ -583,3 +587,12 @@ private class FirehoseListenerThatAcceptsSparkConf(conf: 
SparkConf) extends Spar
 case _ =>
   }
 }
+
+private class SlowDeserializable extends Externalizable {
+
+  override def writeExternal(out: ObjectOutput): Unit = { }
+
+  override def readExternal(in: ObjectInput): Unit = Thread.sleep(1)
+
+  def use(): Unit = { }
+}


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

svn commit: r26837 - in /dev/spark/2.3.1-SNAPSHOT-2018_05_10_14_01-16cd9ac-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-05-10 Thread pwendell

Author: pwendell
Date: Thu May 10 21:15:50 2018
New Revision: 26837

Log:
Apache Spark 2.3.1-SNAPSHOT-2018_05_10_14_01-16cd9ac docs


[This commit notification would consist of 1443 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

svn commit: r26835 - in /dev/spark/2.4.0-SNAPSHOT-2018_05_10_12_01-6282fc6-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-05-10 Thread pwendell

Author: pwendell
Date: Thu May 10 19:15:10 2018
New Revision: 26835

Log:
Apache Spark 2.4.0-SNAPSHOT-2018_05_10_12_01-6282fc6 docs


[This commit notification would consist of 1462 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

spark git commit: [SPARK-24137][K8S] Mount local directories as empty dir volumes.

2018-05-10 Thread foxish

Repository: spark
Updated Branches:
  refs/heads/master f4fed0512 -> 6282fc64e


[SPARK-24137][K8S] Mount local directories as empty dir volumes.

## What changes were proposed in this pull request?

Drastically improves performance and won't cause Spark applications to fail 
because they write too much data to the Docker image's specific file system. 
The file system's directories that back emptydir volumes are generally larger 
and more performant.

## How was this patch tested?

Has been in use via the prototype version of Kubernetes support, but lost in 
the transition to here.

Author: mcheah 

Closes #21238 from mccheah/mount-local-dirs.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6282fc64
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6282fc64
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6282fc64

Branch: refs/heads/master
Commit: 6282fc64e32fc2f70e79ace14efd4922e4535dbb
Parents: f4fed05
Author: mcheah 
Authored: Thu May 10 11:36:41 2018 -0700
Committer: Anirudh Ramanathan 
Committed: Thu May 10 11:36:41 2018 -0700

--
 .../main/scala/org/apache/spark/SparkConf.scala |   5 +-
 .../k8s/features/LocalDirsFeatureStep.scala |  77 +
 .../k8s/submit/KubernetesDriverBuilder.scala|  10 +-
 .../cluster/k8s/KubernetesExecutorBuilder.scala |   9 +-
 .../features/LocalDirsFeatureStepSuite.scala| 111 +++
 .../submit/KubernetesDriverBuilderSuite.scala   |  13 ++-
 .../k8s/KubernetesExecutorBuilderSuite.scala|  12 +-
 7 files changed, 223 insertions(+), 14 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6282fc64/core/src/main/scala/org/apache/spark/SparkConf.scala
--
diff --git a/core/src/main/scala/org/apache/spark/SparkConf.scala 
b/core/src/main/scala/org/apache/spark/SparkConf.scala
index 129956e..dab4095 100644
--- a/core/src/main/scala/org/apache/spark/SparkConf.scala
+++ b/core/src/main/scala/org/apache/spark/SparkConf.scala
@@ -454,8 +454,9 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable 
with Logging with Seria
*/
   private[spark] def validateSettings() {
 if (contains("spark.local.dir")) {
-  val msg = "In Spark 1.0 and later spark.local.dir will be overridden by 
the value set by " +
-"the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and 
LOCAL_DIRS in YARN)."
+  val msg = "Note that spark.local.dir will be overridden by the value set 
by " +
+"the cluster manager (via SPARK_LOCAL_DIRS in 
mesos/standalone/kubernetes and LOCAL_DIRS" +
+" in YARN)."
   logWarning(msg)
 }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/6282fc64/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/LocalDirsFeatureStep.scala
--
diff --git 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/LocalDirsFeatureStep.scala
 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/LocalDirsFeatureStep.scala
new file mode 100644
index 000..70b3073
--- /dev/null
+++ 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/LocalDirsFeatureStep.scala
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.deploy.k8s.features
+
+import java.nio.file.Paths
+import java.util.UUID
+
+import io.fabric8.kubernetes.api.model.{ContainerBuilder, HasMetadata, 
PodBuilder, VolumeBuilder, VolumeMountBuilder}
+
+import org.apache.spark.deploy.k8s.{KubernetesConf, 
KubernetesDriverSpecificConf, KubernetesRoleSpecificConf, SparkPod}
+
+private[spark] class LocalDirsFeatureStep(
+conf: KubernetesConf[_ <: KubernetesRoleSpecificConf],
+defaultLocalDir: String = s"/var/data/spark-${UUID.randomUUID}")
+  extends KubernetesFeatureConfigStep {
+
+  //

[2/2] spark git commit: [PYSPARK] Update py4j to version 0.10.7.

2018-05-10 Thread vanzin

[PYSPARK] Update py4j to version 0.10.7.

(cherry picked from commit cc613b552e753d03cb62661591de59e1c8d82c74)
Signed-off-by: Marcelo Vanzin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/323dc3ad
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/323dc3ad
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/323dc3ad

Branch: refs/heads/branch-2.3
Commit: 323dc3ad02e63a7c99b5bd6da618d6020657ecba
Parents: eab10f9
Author: Marcelo Vanzin 
Authored: Fri Apr 13 14:28:24 2018 -0700
Committer: Marcelo Vanzin 
Committed: Thu May 10 10:47:37 2018 -0700

--
 LICENSE |   2 +-
 bin/pyspark |   6 +-
 bin/pyspark2.cmd|   2 +-
 core/pom.xml|   2 +-
 .../org/apache/spark/SecurityManager.scala  |  11 +-
 .../spark/api/python/PythonGatewayServer.scala  |  50 ++---
 .../org/apache/spark/api/python/PythonRDD.scala |  29 --
 .../apache/spark/api/python/PythonUtils.scala   |   2 +-
 .../spark/api/python/PythonWorkerFactory.scala  |  21 ++--
 .../org/apache/spark/deploy/PythonRunner.scala  |  12 ++-
 .../apache/spark/internal/config/package.scala  |   5 +
 .../spark/security/SocketAuthHelper.scala   | 101 +++
 .../scala/org/apache/spark/util/Utils.scala |  12 +++
 .../spark/security/SocketAuthHelperSuite.scala  |  97 ++
 dev/deps/spark-deps-hadoop-2.6  |   2 +-
 dev/deps/spark-deps-hadoop-2.7  |   2 +-
 dev/run-pip-tests   |   2 +-
 python/README.md|   2 +-
 python/docs/Makefile|   2 +-
 python/lib/py4j-0.10.6-src.zip  | Bin 80352 -> 0 bytes
 python/lib/py4j-0.10.7-src.zip  | Bin 0 -> 42437 bytes
 python/pyspark/context.py   |   4 +-
 python/pyspark/daemon.py|  21 +++-
 python/pyspark/java_gateway.py  |  93 ++---
 python/pyspark/rdd.py   |  21 ++--
 python/pyspark/sql/dataframe.py |  12 +--
 python/pyspark/worker.py|   7 +-
 python/setup.py |   2 +-
 .../org/apache/spark/deploy/yarn/Client.scala   |   2 +-
 .../spark/deploy/yarn/YarnClusterSuite.scala|   2 +-
 sbin/spark-config.sh|   2 +-
 .../scala/org/apache/spark/sql/Dataset.scala|   6 +-
 32 files changed, 418 insertions(+), 116 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/323dc3ad/LICENSE
--
diff --git a/LICENSE b/LICENSE
index c2b0d72..820f14d 100644
--- a/LICENSE
+++ b/LICENSE
@@ -263,7 +263,7 @@ The text of each license is also included at 
licenses/LICENSE-[project].txt.
  (New BSD license) Protocol Buffer Java API 
(org.spark-project.protobuf:protobuf-java:2.4.1-shaded - 
http://code.google.com/p/protobuf)
  (The BSD License) Fortran to Java ARPACK 
(net.sourceforge.f2j:arpack_combined_all:0.1 - http://f2j.sourceforge.net)
  (The BSD License) xmlenc Library (xmlenc:xmlenc:0.52 - 
http://xmlenc.sourceforge.net)
- (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.6 - 
http://py4j.sourceforge.net/)
+ (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.7 - 
http://py4j.sourceforge.net/)
  (Two-clause BSD-style license) JUnit-Interface 
(com.novocode:junit-interface:0.10 - http://github.com/szeiger/junit-interface/)
  (BSD licence) sbt and sbt-launch-lib.bash
  (BSD 3 Clause) d3.min.js 
(https://github.com/mbostock/d3/blob/master/LICENSE)

http://git-wip-us.apache.org/repos/asf/spark/blob/323dc3ad/bin/pyspark
--
diff --git a/bin/pyspark b/bin/pyspark
index dd28627..5d5affb 100755
--- a/bin/pyspark
+++ b/bin/pyspark
@@ -25,14 +25,14 @@ source "${SPARK_HOME}"/bin/load-spark-env.sh
 export _SPARK_CMD_USAGE="Usage: ./bin/pyspark [options]"
 
 # In Spark 2.0, IPYTHON and IPYTHON_OPTS are removed and pyspark fails to 
launch if either option
-# is set in the user's environment. Instead, users should set 
PYSPARK_DRIVER_PYTHON=ipython 
+# is set in the user's environment. Instead, users should set 
PYSPARK_DRIVER_PYTHON=ipython
 # to use IPython and set PYSPARK_DRIVER_PYTHON_OPTS to pass options when 
starting the Python driver
 # (e.g. PYSPARK_DRIVER_PYTHON_OPTS='notebook').  This supports full 
customization of the IPython
 # and executor Python executables.
 
 # Fail noisily if removed options are set
 if [[ -n "$IPYTHON" || -n "$IPYTHON_OPTS" ]]; then
-  echo "Error in

[1/2] spark git commit: [SPARKR] Match pyspark features in SparkR communication protocol.

2018-05-10 Thread vanzin

Repository: spark
Updated Branches:
  refs/heads/branch-2.3 eab10f994 -> 16cd9ac52


[SPARKR] Match pyspark features in SparkR communication protocol.

(cherry picked from commit 628c7b517969c4a7ccb26ea67ab3dd61266073ca)
Signed-off-by: Marcelo Vanzin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/16cd9ac5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/16cd9ac5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/16cd9ac5

Branch: refs/heads/branch-2.3
Commit: 16cd9ac5264831e061c033b26fe1173ebc88e5d1
Parents: 323dc3a
Author: Marcelo Vanzin 
Authored: Tue Apr 17 13:29:43 2018 -0700
Committer: Marcelo Vanzin 
Committed: Thu May 10 10:47:37 2018 -0700

--
 R/pkg/R/client.R|  4 +-
 R/pkg/R/deserialize.R   | 10 ++--
 R/pkg/R/sparkR.R| 39 --
 R/pkg/inst/worker/daemon.R  |  4 +-
 R/pkg/inst/worker/worker.R  |  5 +-
 .../org/apache/spark/api/r/RAuthHelper.scala| 38 ++
 .../scala/org/apache/spark/api/r/RBackend.scala | 43 ---
 .../spark/api/r/RBackendAuthHandler.scala   | 55 
 .../scala/org/apache/spark/api/r/RRunner.scala  | 35 +
 .../scala/org/apache/spark/deploy/RRunner.scala |  6 ++-
 10 files changed, 210 insertions(+), 29 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/16cd9ac5/R/pkg/R/client.R
--
diff --git a/R/pkg/R/client.R b/R/pkg/R/client.R
index 9d82814..7244cc9 100644
--- a/R/pkg/R/client.R
+++ b/R/pkg/R/client.R
@@ -19,7 +19,7 @@
 
 # Creates a SparkR client connection object
 # if one doesn't already exist
-connectBackend <- function(hostname, port, timeout) {
+connectBackend <- function(hostname, port, timeout, authSecret) {
   if (exists(".sparkRcon", envir = .sparkREnv)) {
 if (isOpen(.sparkREnv[[".sparkRCon"]])) {
   cat("SparkRBackend client connection already exists\n")
@@ -29,7 +29,7 @@ connectBackend <- function(hostname, port, timeout) {
 
   con <- socketConnection(host = hostname, port = port, server = FALSE,
   blocking = TRUE, open = "wb", timeout = timeout)
-
+  doServerAuth(con, authSecret)
   assign(".sparkRCon", con, envir = .sparkREnv)
   con
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/16cd9ac5/R/pkg/R/deserialize.R
--
diff --git a/R/pkg/R/deserialize.R b/R/pkg/R/deserialize.R
index a90f7d3..cb03f16 100644
--- a/R/pkg/R/deserialize.R
+++ b/R/pkg/R/deserialize.R
@@ -60,14 +60,18 @@ readTypedObject <- function(con, type) {
 stop(paste("Unsupported type for deserialization", type)))
 }
 
-readString <- function(con) {
-  stringLen <- readInt(con)
-  raw <- readBin(con, raw(), stringLen, endian = "big")
+readStringData <- function(con, len) {
+  raw <- readBin(con, raw(), len, endian = "big")
   string <- rawToChar(raw)
   Encoding(string) <- "UTF-8"
   string
 }
 
+readString <- function(con) {
+  stringLen <- readInt(con)
+  readStringData(con, stringLen)
+}
+
 readInt <- function(con) {
   readBin(con, integer(), n = 1, endian = "big")
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/16cd9ac5/R/pkg/R/sparkR.R
--
diff --git a/R/pkg/R/sparkR.R b/R/pkg/R/sparkR.R
index 965471f..7430d84 100644
--- a/R/pkg/R/sparkR.R
+++ b/R/pkg/R/sparkR.R
@@ -161,6 +161,10 @@ sparkR.sparkContext <- function(
 " please use the --packages commandline instead", sep = 
","))
 }
 backendPort <- existingPort
+authSecret <- Sys.getenv("SPARKR_BACKEND_AUTH_SECRET")
+if (nchar(authSecret) == 0) {
+  stop("Auth secret not provided in environment.")
+}
   } else {
 path <- tempfile(pattern = "backend_port")
 submitOps <- getClientModeSparkSubmitOpts(
@@ -189,16 +193,27 @@ sparkR.sparkContext <- function(
 monitorPort <- readInt(f)
 rLibPath <- readString(f)
 connectionTimeout <- readInt(f)
+
+# Don't use readString() so that we can provide a useful
+# error message if the R and Java versions are mismatched.
+authSecretLen = readInt(f)
+if (length(authSecretLen) == 0 || authSecretLen == 0) {
+  stop("Unexpected EOF in JVM connection data. Mismatched versions?")
+}
+authSecret <- readStringData(f, authSecretLen)
 close(f)
 file.remove(path)
 if (length(backendPort) == 0 || backendPort == 0 ||
 length(monitorPort) == 0 || monitorPort == 0 ||
-length(rLibPath) != 1) {
+length(rLibPath) != 1 || length(authSecret) == 0) {
   stop("JVM failed

svn commit: r26831 - in /dev/spark/2.3.1-SNAPSHOT-2018_05_10_10_01-eab10f9-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-05-10 Thread pwendell

Author: pwendell
Date: Thu May 10 17:15:43 2018
New Revision: 26831

Log:
Apache Spark 2.3.1-SNAPSHOT-2018_05_10_10_01-eab10f9 docs


[This commit notification would consist of 1443 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

spark git commit: [SPARK-24171] Adding a note for non-deterministic functions

2018-05-10 Thread lixiao

Repository: spark
Updated Branches:
  refs/heads/master 94d671448 -> f4fed0512


[SPARK-24171] Adding a note for non-deterministic functions

## What changes were proposed in this pull request?

I propose to add a clear statement for functions like `collect_list()` about 
non-deterministic behavior of such functions. The behavior must be taken into 
account by user while creating and running queries.

Author: Maxim Gekk 

Closes #21228 from MaxGekk/deterministic-comments.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f4fed051
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f4fed051
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f4fed051

Branch: refs/heads/master
Commit: f4fed0512101a67d9dae50ace11d3940b910e05e
Parents: 94d6714
Author: Maxim Gekk 
Authored: Thu May 10 09:44:49 2018 -0700
Committer: gatorsmile 
Committed: Thu May 10 09:44:49 2018 -0700

--
 R/pkg/R/functions.R | 11 +
 python/pyspark/sql/functions.py | 18 
 .../expressions/MonotonicallyIncreasingID.scala |  1 +
 .../spark/sql/catalyst/expressions/misc.scala   |  5 ++-
 .../expressions/randomExpressions.scala |  8 ++--
 .../scala/org/apache/spark/sql/functions.scala  | 46 ++--
 6 files changed, 81 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f4fed051/R/pkg/R/functions.R
--
diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R
index 0ec99d1..04d0e46 100644
--- a/R/pkg/R/functions.R
+++ b/R/pkg/R/functions.R
@@ -805,6 +805,8 @@ setMethod("factorial",
 #'
 #' The function by default returns the first values it sees. It will return 
the first non-missing
 #' value it sees when na.rm is set to true. If all values are missing, then NA 
is returned.
+#' Note: the function is non-deterministic because its results depends on 
order of rows which
+#' may be non-deterministic after a shuffle.
 #'
 #' @param na.rm a logical value indicating whether NA values should be stripped
 #'before the computation proceeds.
@@ -948,6 +950,8 @@ setMethod("kurtosis",
 #'
 #' The function by default returns the last values it sees. It will return the 
last non-missing
 #' value it sees when na.rm is set to true. If all values are missing, then NA 
is returned.
+#' Note: the function is non-deterministic because its results depends on 
order of rows which
+#' may be non-deterministic after a shuffle.
 #'
 #' @param x column to compute on.
 #' @param na.rm a logical value indicating whether NA values should be stripped
@@ -1201,6 +1205,7 @@ setMethod("minute",
 #' 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
 #' This is equivalent to the MONOTONICALLY_INCREASING_ID function in SQL.
 #' The method should be used with no argument.
+#' Note: the function is non-deterministic because its result depends on 
partition IDs.
 #'
 #' @rdname column_nonaggregate_functions
 #' @aliases monotonically_increasing_id 
monotonically_increasing_id,missing-method
@@ -2584,6 +2589,7 @@ setMethod("lpad", signature(x = "Column", len = 
"numeric", pad = "character"),
 #' @details
 #' \code{rand}: Generates a random column with independent and identically 
distributed (i.i.d.)
 #' samples from U[0.0, 1.0].
+#' Note: the function is non-deterministic in general case.
 #'
 #' @rdname column_nonaggregate_functions
 #' @param seed a random seed. Can be missing.
@@ -2612,6 +2618,7 @@ setMethod("rand", signature(seed = "numeric"),
 #' @details
 #' \code{randn}: Generates a column with independent and identically 
distributed (i.i.d.) samples
 #' from the standard normal distribution.
+#' Note: the function is non-deterministic in general case.
 #'
 #' @rdname column_nonaggregate_functions
 #' @aliases randn randn,missing-method
@@ -3188,6 +3195,8 @@ setMethod("create_map",
 
 #' @details
 #' \code{collect_list}: Creates a list of objects with duplicates.
+#' Note: the function is non-deterministic because the order of collected 
results depends
+#' on order of rows which may be non-deterministic after a shuffle.
 #'
 #' @rdname column_aggregate_functions
 #' @aliases collect_list collect_list,Column-method
@@ -3207,6 +3216,8 @@ setMethod("collect_list",
 
 #' @details
 #' \code{collect_set}: Creates a list of objects with duplicate elements 
eliminated.
+#' Note: the function is non-deterministic because the order of collected 
results depends
+#' on order of rows which may be non-deterministic after a shuffle.
 #'
 #' @rdname column_aggregate_functions
 #' @aliases collect_set collect_set,Column-method

http://git-wip-us.apache.org/repos/asf/spark/blob/f4fed051/python/pyspark/sql/functions.py

spark git commit: [SPARK-24068][BACKPORT-2.3] Propagating DataFrameReader's options to Text datasource on schema inferring

2018-05-10 Thread gurwls223

Repository: spark
Updated Branches:
  refs/heads/branch-2.3 8889d7864 -> eab10f994


[SPARK-24068][BACKPORT-2.3] Propagating DataFrameReader's options to Text 
datasource on schema inferring

## What changes were proposed in this pull request?

While reading CSV or JSON files, DataFrameReader's options are converted to 
Hadoop's parameters, for example there:
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L302

but the options are not propagated to Text datasource on schema inferring, for 
instance:
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L184-L188

The PR proposes propagation of user's options to Text datasource on scheme 
inferring in similar way as user's options are converted to Hadoop parameters 
if schema is specified.

## How was this patch tested?
The changes were tested manually by using https://github.com/twitter/hadoop-lzo:

```
hadoop-lzo> mvn clean package
hadoop-lzo> ln -s ./target/hadoop-lzo-0.4.21-SNAPSHOT.jar ./hadoop-lzo.jar
```
Create 2 test files in JSON and CSV format and compress them:
```shell
$ cat test.csv
col1|col2
a|1
$ lzop test.csv
$ cat test.json
{"col1":"a","col2":1}
$ lzop test.json
```
Run `spark-shell` with hadoop-lzo:
```
bin/spark-shell --jars ~/hadoop-lzo/hadoop-lzo.jar
```
reading compressed CSV and JSON without schema:
```scala
spark.read.option("io.compression.codecs", 
"com.hadoop.compression.lzo.LzopCodec").option("inferSchema",true).option("header",true).option("sep","|").csv("test.csv.lzo").show()
+++
|col1|col2|
+++
|   a|   1|
+++
```
```scala
spark.read.option("io.compression.codecs", 
"com.hadoop.compression.lzo.LzopCodec").option("multiLine", 
true).json("test.json.lzo").printSchema
root
 |-- col1: string (nullable = true)
 |-- col2: long (nullable = true)
```

Author: Maxim Gekk 
Author: Maxim Gekk 

Closes #21292 from MaxGekk/text-options-backport-v2.3.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eab10f99
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eab10f99
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eab10f99

Branch: refs/heads/branch-2.3
Commit: eab10f9945c1d01daa45c233a39dedfd184f543c
Parents: 8889d78
Author: Maxim Gekk 
Authored: Fri May 11 00:28:43 2018 +0800
Committer: hyukjinkwon 
Committed: Fri May 11 00:28:43 2018 +0800

--
 .../spark/sql/catalyst/json/JSONOptions.scala   |  2 +-
 .../execution/datasources/csv/CSVDataSource.scala   |  6 --
 .../sql/execution/datasources/csv/CSVOptions.scala  |  2 +-
 .../execution/datasources/csv/UnivocityParser.scala |  2 --
 .../execution/datasources/json/JsonDataSource.scala | 16 ++--
 5 files changed, 16 insertions(+), 12 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/eab10f99/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
index 652412b..190fcc6 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
@@ -31,7 +31,7 @@ import org.apache.spark.sql.catalyst.util._
  * Most of these map directly to Jackson's internal options, specified in 
[[JsonParser.Feature]].
  */
 private[sql] class JSONOptions(
-@transient private val parameters: CaseInsensitiveMap[String],
+@transient val parameters: CaseInsensitiveMap[String],
 defaultTimeZoneId: String,
 defaultColumnNameOfCorruptRecord: String)
   extends Logging with Serializable  {

http://git-wip-us.apache.org/repos/asf/spark/blob/eab10f99/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
index 4870d75..fffad17 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
@@ -184,7 +184,8 @@ object TextInputCSVDataSource extends CSVDataSource {
 DataSource.apply(
   sparkSession,
   paths = paths,
-

svn commit: r26830 - in /dev/spark/2.4.0-SNAPSHOT-2018_05_10_08_01-94d6714-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-05-10 Thread pwendell

Author: pwendell
Date: Thu May 10 15:16:10 2018
New Revision: 26830

Log:
Apache Spark 2.4.0-SNAPSHOT-2018_05_10_08_01-94d6714 docs


[This commit notification would consist of 1462 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

spark git commit: [SPARK-23907][SQL] Add regr_* functions

2018-05-10 Thread ueshin

Repository: spark
Updated Branches:
  refs/heads/master e3d434994 -> 94d671448


[SPARK-23907][SQL] Add regr_* functions

## What changes were proposed in this pull request?

The PR introduces regr_slope, regr_intercept, regr_r2, regr_sxx, regr_syy, 
regr_sxy, regr_avgx, regr_avgy, regr_count.

The implementation of this functions mirrors Hive's one in HIVE-15978.

## How was this patch tested?

added UT (values compared with Hive)

Author: Marco Gaido 

Closes #21054 from mgaido91/SPARK-23907.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/94d67144
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/94d67144
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/94d67144

Branch: refs/heads/master
Commit: 94d671448240c8f6da11d2523ba9e4ae5b56a410
Parents: e3d4349
Author: Marco Gaido 
Authored: Thu May 10 20:38:52 2018 +0900
Committer: Takuya UESHIN 
Committed: Thu May 10 20:38:52 2018 +0900

--
 .../catalyst/analysis/FunctionRegistry.scala|   9 +
 .../expressions/aggregate/Average.scala |  47 +++--
 .../aggregate/CentralMomentAgg.scala|  60 +++---
 .../catalyst/expressions/aggregate/Corr.scala   |  52 ++---
 .../catalyst/expressions/aggregate/Count.scala  |  47 +++--
 .../expressions/aggregate/Covariance.scala  |  36 ++--
 .../expressions/aggregate/regression.scala  | 190 +++
 .../scala/org/apache/spark/sql/functions.scala  | 172 +
 .../sql-tests/inputs/udaf-regrfunctions.sql |  56 ++
 .../results/udaf-regrfunctions.sql.out  |  93 +
 .../spark/sql/DataFrameAggregateSuite.scala |  71 ++-
 11 files changed, 721 insertions(+), 112 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/94d67144/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
index 87b0911..087d000 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
@@ -299,6 +299,15 @@ object FunctionRegistry {
 expression[CollectList]("collect_list"),
 expression[CollectSet]("collect_set"),
 expression[CountMinSketchAgg]("count_min_sketch"),
+expression[RegrCount]("regr_count"),
+expression[RegrSXX]("regr_sxx"),
+expression[RegrSYY]("regr_syy"),
+expression[RegrAvgX]("regr_avgx"),
+expression[RegrAvgY]("regr_avgy"),
+expression[RegrSXY]("regr_sxy"),
+expression[RegrSlope]("regr_slope"),
+expression[RegrR2]("regr_r2"),
+expression[RegrIntercept]("regr_intercept"),
 
 // string functions
 expression[Ascii]("ascii"),

http://git-wip-us.apache.org/repos/asf/spark/blob/94d67144/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala
index 708bdbf..a133bc2 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala
@@ -23,24 +23,12 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.util.TypeUtils
 import org.apache.spark.sql.types._
 
-@ExpressionDescription(
-  usage = "_FUNC_(expr) - Returns the mean calculated from values of a group.")
-case class Average(child: Expression) extends DeclarativeAggregate with 
ImplicitCastInputTypes {
-
-  override def prettyName: String = "avg"
-
-  override def children: Seq[Expression] = child :: Nil
+abstract class AverageLike(child: Expression) extends DeclarativeAggregate {
 
   override def nullable: Boolean = true
-
   // Return data type.
   override def dataType: DataType = resultType
 
-  override def inputTypes: Seq[AbstractDataType] = Seq(NumericType)
-
-  override def checkInputDataTypes(): TypeCheckResult =
-TypeUtils.checkForNumericExpr(child.dataType, "function average")
-
   private lazy val resultType = child.dataType match {
 case DecimalType.Fixed(p, s) =>
   DecimalType.bounded(p + 4, s + 4)
@@ -62,14 +50,6 @@ case class Average(child: Expression) extends 
DeclarativeAggregate with Implicit
 /* count =

svn commit: r26841 - in /dev/spark/2.4.0-SNAPSHOT-2018_05_10_20_01-75cf369-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

svn commit: r26840 - in /dev/spark/2.3.1-SNAPSHOT-2018_05_10_18_01-414e4e3-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

spark git commit: [SPARK-24197][SPARKR][SQL] Adding array_sort function to SparkR

spark git commit: [SPARK-22938][SQL][FOLLOWUP] Assert that SQLConf.get is accessed only on the driver

svn commit: r26838 - in /dev/spark/2.4.0-SNAPSHOT-2018_05_10_16_01-d3c426a-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

spark git commit: [SPARK-10878][CORE] Fix race condition when multiple clients resolves artifacts at the same time

spark git commit: [SPARK-10878][CORE] Fix race condition when multiple clients resolves artifacts at the same time

spark git commit: [SPARK-19181][CORE] Fixing flaky "SparkListenerSuite.local metrics"

spark git commit: [SPARK-19181][CORE] Fixing flaky "SparkListenerSuite.local metrics"

svn commit: r26837 - in /dev/spark/2.3.1-SNAPSHOT-2018_05_10_14_01-16cd9ac-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

svn commit: r26835 - in /dev/spark/2.4.0-SNAPSHOT-2018_05_10_12_01-6282fc6-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

spark git commit: [SPARK-24137][K8S] Mount local directories as empty dir volumes.

[2/2] spark git commit: [PYSPARK] Update py4j to version 0.10.7.

[1/2] spark git commit: [SPARKR] Match pyspark features in SparkR communication protocol.

svn commit: r26831 - in /dev/spark/2.3.1-SNAPSHOT-2018_05_10_10_01-eab10f9-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

spark git commit: [SPARK-24171] Adding a note for non-deterministic functions

spark git commit: [SPARK-24068][BACKPORT-2.3] Propagating DataFrameReader's options to Text datasource on schema inferring

svn commit: r26830 - in /dev/spark/2.4.0-SNAPSHOT-2018_05_10_08_01-94d6714-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

spark git commit: [SPARK-23907][SQL] Add regr_* functions

19 matches

Site Navigation

Mail list logo

Footer information