svn commit: r26841 - in /dev/spark/2.4.0-SNAPSHOT-2018_05_10_20_01-75cf369-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Fri May 11 03:15:44 2018 New Revision: 26841 Log: Apache Spark 2.4.0-SNAPSHOT-2018_05_10_20_01-75cf369 docs [This commit notification would consist of 1462 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r26840 - in /dev/spark/2.3.1-SNAPSHOT-2018_05_10_18_01-414e4e3-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Fri May 11 01:15:34 2018 New Revision: 26840 Log: Apache Spark 2.3.1-SNAPSHOT-2018_05_10_18_01-414e4e3 docs [This commit notification would consist of 1443 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-24197][SPARKR][SQL] Adding array_sort function to SparkR
Repository: spark Updated Branches: refs/heads/master a4206d58e -> 75cf369c7 [SPARK-24197][SPARKR][SQL] Adding array_sort function to SparkR ## What changes were proposed in this pull request? The PR adds array_sort function to SparkR. ## How was this patch tested? Tests added into R/pkg/tests/fulltests/test_sparkSQL.R ## Example ``` > df <- createDataFrame(list(list(list(2L, 1L, 3L, NA)), list(list(NA, 6L, 5L, > NA, 4L > head(collect(select(df, array_sort(df[[1]] ``` Result: ``` array_sort(_1) 1 1, 2, 3, NA 2 4, 5, 6, NA, NA ``` Author: Marek NovotnyCloses #21294 from mn-mikke/SPARK-24197. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/75cf369c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/75cf369c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/75cf369c Branch: refs/heads/master Commit: 75cf369c742e7c7b68f384d123447c97be95c9f0 Parents: a4206d5 Author: Marek Novotny Authored: Fri May 11 09:05:35 2018 +0800 Committer: hyukjinkwon Committed: Fri May 11 09:05:35 2018 +0800 -- R/pkg/NAMESPACE | 1 + R/pkg/R/functions.R | 21 ++--- R/pkg/R/generics.R| 4 R/pkg/tests/fulltests/test_sparkSQL.R | 13 + 4 files changed, 32 insertions(+), 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/75cf369c/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 8cd0035..5f82096 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -204,6 +204,7 @@ exportMethods("%<=>%", "array_max", "array_min", "array_position", + "array_sort", "asc", "ascii", "asin", http://git-wip-us.apache.org/repos/asf/spark/blob/75cf369c/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index 04d0e46..1f97054 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -207,7 +207,7 @@ NULL #' tmp <- mutate(df, v1 = create_array(df$mpg, df$cyl, df$hp)) #' head(select(tmp, array_contains(tmp$v1, 21), size(tmp$v1))) #' head(select(tmp, array_max(tmp$v1), array_min(tmp$v1))) -#' head(select(tmp, array_position(tmp$v1, 21))) +#' head(select(tmp, array_position(tmp$v1, 21), array_sort(tmp$v1))) #' head(select(tmp, flatten(tmp$v1))) #' tmp2 <- mutate(tmp, v2 = explode(tmp$v1)) #' head(tmp2) @@ -3044,6 +3044,20 @@ setMethod("array_position", }) #' @details +#' \code{array_sort}: Sorts the input array in ascending order. The elements of the input array +#' must be orderable. NA elements will be placed at the end of the returned array. +#' +#' @rdname column_collection_functions +#' @aliases array_sort array_sort,Column-method +#' @note array_sort since 2.4.0 +setMethod("array_sort", + signature(x = "Column"), + function(x) { +jc <- callJStatic("org.apache.spark.sql.functions", "array_sort", x@jc) +column(jc) + }) + +#' @details #' \code{flatten}: Transforms an array of arrays into a single array. #' #' @rdname column_collection_functions @@ -3125,8 +3139,9 @@ setMethod("size", }) #' @details -#' \code{sort_array}: Sorts the input array in ascending or descending order according -#' to the natural ordering of the array elements. +#' \code{sort_array}: Sorts the input array in ascending or descending order according to +#' the natural ordering of the array elements. NA elements will be placed at the beginning of +#' the returned array in ascending order or at the end of the returned array in descending order. #' #' @rdname column_collection_functions #' @param asc a logical flag indicating the sorting order. http://git-wip-us.apache.org/repos/asf/spark/blob/75cf369c/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index 4ef12d1..5faa51e 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -769,6 +769,10 @@ setGeneric("array_min", function(x) { standardGeneric("array_min") }) #' @name NULL setGeneric("array_position", function(x, value) { standardGeneric("array_position") }) +#' @rdname column_collection_functions +#' @name NULL +setGeneric("array_sort", function(x) { standardGeneric("array_sort") }) + #' @rdname column_string_functions #' @name NULL setGeneric("ascii", function(x) { standardGeneric("ascii") }) http://git-wip-us.apache.org/repos/asf/spark/blob/75cf369c/R/pkg/tests/fulltests/test_sparkSQL.R
spark git commit: [SPARK-22938][SQL][FOLLOWUP] Assert that SQLConf.get is accessed only on the driver
Repository: spark Updated Branches: refs/heads/master d3c426a5b -> a4206d58e [SPARK-22938][SQL][FOLLOWUP] Assert that SQLConf.get is accessed only on the driver ## What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/20136 . #20136 didn't really work because in the test, we are using local backend, which shares the driver side `SparkEnv`, so `SparkEnv.get.executorId == SparkContext.DRIVER_IDENTIFIER` doesn't work. This PR changes the check to `TaskContext.get != null`, and move the check to `SQLConf.get`, and fix all the places that violate this check: * `InMemoryTableScanExec#createAndDecompressColumn` is executed inside `rdd.map`, we can't access `conf.offHeapColumnVectorEnabled` there. https://github.com/apache/spark/pull/21223 merged * `DataType#sameType` may be executed in executor side, for things like json schema inference, so we can't call `conf.caseSensitiveAnalysis` there. This contributes to most of the code changes, as we need to add `caseSensitive` parameter to a lot of methods. * `ParquetFilters` is used in the file scan function, which is executed in executor side, so we can't can't call `conf.parquetFilterPushDownDate` there. https://github.com/apache/spark/pull/21224 merged * `WindowExec#createBoundOrdering` is called on executor side, so we can't use `conf.sessionLocalTimezone` there. https://github.com/apache/spark/pull/21225 merged * `JsonToStructs` can be serialized to executors and evaluate, we should not call `SQLConf.get.getConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA)` in the body. https://github.com/apache/spark/pull/21226 merged ## How was this patch tested? existing test Author: Wenchen FanCloses #21190 from cloud-fan/minor. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a4206d58 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a4206d58 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a4206d58 Branch: refs/heads/master Commit: a4206d58e05ab9ed6f01fee57e18dee65cbc4efc Parents: d3c426a Author: Wenchen Fan Authored: Fri May 11 09:01:40 2018 +0800 Committer: hyukjinkwon Committed: Fri May 11 09:01:40 2018 +0800 -- .../sql/catalyst/analysis/CheckAnalysis.scala | 5 +- .../catalyst/analysis/ResolveInlineTables.scala | 4 +- .../sql/catalyst/analysis/TypeCoercion.scala| 156 +++ .../org/apache/spark/sql/internal/SQLConf.scala | 16 +- .../org/apache/spark/sql/types/DataType.scala | 8 +- .../catalyst/analysis/TypeCoercionSuite.scala | 70 - .../org/apache/spark/sql/SparkSession.scala | 21 ++- .../datasources/PartitioningUtils.scala | 5 +- .../datasources/json/JsonInferSchema.scala | 39 +++-- .../execution/datasources/json/JsonSuite.scala | 4 +- 10 files changed, 188 insertions(+), 140 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a4206d58/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala index 90bda2a..94b0561 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala @@ -24,6 +24,7 @@ import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression import org.apache.spark.sql.catalyst.optimizer.BooleanSimplification import org.apache.spark.sql.catalyst.plans._ import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types._ /** @@ -260,7 +261,9 @@ trait CheckAnalysis extends PredicateHelper { // Check if the data types match. dataTypes(child).zip(ref).zipWithIndex.foreach { case ((dt1, dt2), ci) => // SPARK-18058: we shall not care about the nullability of columns -if (TypeCoercion.findWiderTypeForTwo(dt1.asNullable, dt2.asNullable).isEmpty) { +val widerType = TypeCoercion.findWiderTypeForTwo( + dt1.asNullable, dt2.asNullable, SQLConf.get.caseSensitiveAnalysis) +if (widerType.isEmpty) { failAnalysis( s""" |${operator.nodeName} can only be performed on tables with the compatible
svn commit: r26838 - in /dev/spark/2.4.0-SNAPSHOT-2018_05_10_16_01-d3c426a-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Thu May 10 23:16:23 2018 New Revision: 26838 Log: Apache Spark 2.4.0-SNAPSHOT-2018_05_10_16_01-d3c426a docs [This commit notification would consist of 1462 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10878][CORE] Fix race condition when multiple clients resolves artifacts at the same time
Repository: spark Updated Branches: refs/heads/branch-2.3 4c49b12da -> 414e4e3d7 [SPARK-10878][CORE] Fix race condition when multiple clients resolves artifacts at the same time ## What changes were proposed in this pull request? When multiple clients attempt to resolve artifacts via the `--packages` parameter, they could run into race condition when they each attempt to modify the dummy `org.apache.spark-spark-submit-parent-default.xml` file created in the default ivy cache dir. This PR changes the behavior to encode UUID in the dummy module descriptor so each client will operate on a different resolution file in the ivy cache dir. In addition, this patch changes the behavior of when and which resolution files are cleaned to prevent accumulation of resolution files in the default ivy cache dir. Since this PR is a successor of #18801, close #18801. Many codes were ported from #18801. **Many efforts were put here. I think this PR should credit to Victsm .** ## How was this patch tested? added UT into `SparkSubmitUtilsSuite` Author: Kazuaki IshizakiCloses #21251 from kiszk/SPARK-10878. (cherry picked from commit d3c426a5b02abdec49ff45df12a8f11f9e473a88) Signed-off-by: Marcelo Vanzin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/414e4e3d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/414e4e3d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/414e4e3d Branch: refs/heads/branch-2.3 Commit: 414e4e3d70caa950a63fab1c8cac3314fb961b0c Parents: 4c49b12 Author: Kazuaki Ishizaki Authored: Thu May 10 14:41:55 2018 -0700 Committer: Marcelo Vanzin Committed: Thu May 10 14:42:06 2018 -0700 -- .../org/apache/spark/deploy/SparkSubmit.scala | 42 +++- .../spark/deploy/SparkSubmitUtilsSuite.scala| 15 +++ 2 files changed, 47 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/414e4e3d/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala index deb52a4..d1347f1 100644 --- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala +++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala @@ -22,6 +22,7 @@ import java.lang.reflect.{InvocationTargetException, Modifier, UndeclaredThrowab import java.net.URL import java.security.PrivilegedExceptionAction import java.text.ParseException +import java.util.UUID import scala.annotation.tailrec import scala.collection.mutable.{ArrayBuffer, HashMap, Map} @@ -1209,7 +1210,33 @@ private[spark] object SparkSubmitUtils { /** A nice function to use in tests as well. Values are dummy strings. */ def getModuleDescriptor: DefaultModuleDescriptor = DefaultModuleDescriptor.newDefaultInstance( -ModuleRevisionId.newInstance("org.apache.spark", "spark-submit-parent", "1.0")) +// Include UUID in module name, so multiple clients resolving maven coordinate at the same time +// do not modify the same resolution file concurrently. +ModuleRevisionId.newInstance("org.apache.spark", + s"spark-submit-parent-${UUID.randomUUID.toString}", + "1.0")) + + /** + * Clear ivy resolution from current launch. The resolution file is usually at + * ~/.ivy2/org.apache.spark-spark-submit-parent-$UUID-default.xml, + * ~/.ivy2/resolved-org.apache.spark-spark-submit-parent-$UUID-1.0.xml, and + * ~/.ivy2/resolved-org.apache.spark-spark-submit-parent-$UUID-1.0.properties. + * Since each launch will have its own resolution files created, delete them after + * each resolution to prevent accumulation of these files in the ivy cache dir. + */ + private def clearIvyResolutionFiles( + mdId: ModuleRevisionId, + ivySettings: IvySettings, + ivyConfName: String): Unit = { +val currentResolutionFiles = Seq( + s"${mdId.getOrganisation}-${mdId.getName}-$ivyConfName.xml", + s"resolved-${mdId.getOrganisation}-${mdId.getName}-${mdId.getRevision}.xml", + s"resolved-${mdId.getOrganisation}-${mdId.getName}-${mdId.getRevision}.properties" +) +currentResolutionFiles.foreach { filename => + new File(ivySettings.getDefaultCache, filename).delete() +} + } /** * Resolves any dependencies that were supplied through maven coordinates @@ -1260,14 +1287,6 @@ private[spark] object SparkSubmitUtils { // A Module descriptor must be specified. Entries are dummy strings val md = getModuleDescriptor -// clear ivy resolution from previous launches. The resolution file is usually at -
spark git commit: [SPARK-10878][CORE] Fix race condition when multiple clients resolves artifacts at the same time
Repository: spark Updated Branches: refs/heads/master 3e2600538 -> d3c426a5b [SPARK-10878][CORE] Fix race condition when multiple clients resolves artifacts at the same time ## What changes were proposed in this pull request? When multiple clients attempt to resolve artifacts via the `--packages` parameter, they could run into race condition when they each attempt to modify the dummy `org.apache.spark-spark-submit-parent-default.xml` file created in the default ivy cache dir. This PR changes the behavior to encode UUID in the dummy module descriptor so each client will operate on a different resolution file in the ivy cache dir. In addition, this patch changes the behavior of when and which resolution files are cleaned to prevent accumulation of resolution files in the default ivy cache dir. Since this PR is a successor of #18801, close #18801. Many codes were ported from #18801. **Many efforts were put here. I think this PR should credit to Victsm .** ## How was this patch tested? added UT into `SparkSubmitUtilsSuite` Author: Kazuaki IshizakiCloses #21251 from kiszk/SPARK-10878. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d3c426a5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d3c426a5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d3c426a5 Branch: refs/heads/master Commit: d3c426a5b02abdec49ff45df12a8f11f9e473a88 Parents: 3e26005 Author: Kazuaki Ishizaki Authored: Thu May 10 14:41:55 2018 -0700 Committer: Marcelo Vanzin Committed: Thu May 10 14:41:55 2018 -0700 -- .../org/apache/spark/deploy/SparkSubmit.scala | 42 +++- .../spark/deploy/SparkSubmitUtilsSuite.scala| 15 +++ 2 files changed, 47 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d3c426a5/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala index 427c797..087e9c3 100644 --- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala +++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala @@ -22,6 +22,7 @@ import java.lang.reflect.{InvocationTargetException, Modifier, UndeclaredThrowab import java.net.URL import java.security.PrivilegedExceptionAction import java.text.ParseException +import java.util.UUID import scala.annotation.tailrec import scala.collection.mutable.{ArrayBuffer, HashMap, Map} @@ -1204,7 +1205,33 @@ private[spark] object SparkSubmitUtils { /** A nice function to use in tests as well. Values are dummy strings. */ def getModuleDescriptor: DefaultModuleDescriptor = DefaultModuleDescriptor.newDefaultInstance( -ModuleRevisionId.newInstance("org.apache.spark", "spark-submit-parent", "1.0")) +// Include UUID in module name, so multiple clients resolving maven coordinate at the same time +// do not modify the same resolution file concurrently. +ModuleRevisionId.newInstance("org.apache.spark", + s"spark-submit-parent-${UUID.randomUUID.toString}", + "1.0")) + + /** + * Clear ivy resolution from current launch. The resolution file is usually at + * ~/.ivy2/org.apache.spark-spark-submit-parent-$UUID-default.xml, + * ~/.ivy2/resolved-org.apache.spark-spark-submit-parent-$UUID-1.0.xml, and + * ~/.ivy2/resolved-org.apache.spark-spark-submit-parent-$UUID-1.0.properties. + * Since each launch will have its own resolution files created, delete them after + * each resolution to prevent accumulation of these files in the ivy cache dir. + */ + private def clearIvyResolutionFiles( + mdId: ModuleRevisionId, + ivySettings: IvySettings, + ivyConfName: String): Unit = { +val currentResolutionFiles = Seq( + s"${mdId.getOrganisation}-${mdId.getName}-$ivyConfName.xml", + s"resolved-${mdId.getOrganisation}-${mdId.getName}-${mdId.getRevision}.xml", + s"resolved-${mdId.getOrganisation}-${mdId.getName}-${mdId.getRevision}.properties" +) +currentResolutionFiles.foreach { filename => + new File(ivySettings.getDefaultCache, filename).delete() +} + } /** * Resolves any dependencies that were supplied through maven coordinates @@ -1255,14 +1282,6 @@ private[spark] object SparkSubmitUtils { // A Module descriptor must be specified. Entries are dummy strings val md = getModuleDescriptor -// clear ivy resolution from previous launches. The resolution file is usually at -// ~/.ivy2/org.apache.spark-spark-submit-parent-default.xml. In between runs, this file -// leads to confusion with
spark git commit: [SPARK-19181][CORE] Fixing flaky "SparkListenerSuite.local metrics"
Repository: spark Updated Branches: refs/heads/master 6282fc64e -> 3e2600538 [SPARK-19181][CORE] Fixing flaky "SparkListenerSuite.local metrics" ## What changes were proposed in this pull request? Sometimes "SparkListenerSuite.local metrics" test fails because the average of executorDeserializeTime is too short. As squito suggested to avoid these situations in one of the task a reference introduced to an object implementing a custom Externalizable.readExternal which sleeps 1ms before returning. ## How was this patch tested? With unit tests (and checking the effect of this change to the average with a much larger sleep time). Author: âattilapirosâAuthor: Attila Zsolt Piros <2017933+attilapi...@users.noreply.github.com> Closes #21280 from attilapiros/SPARK-19181. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3e260053 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3e260053 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3e260053 Branch: refs/heads/master Commit: 3e2600538ee477ffe3f23fba57719e035219550b Parents: 6282fc6 Author: âattilapirosâ Authored: Thu May 10 14:26:38 2018 -0700 Committer: Marcelo Vanzin Committed: Thu May 10 14:26:38 2018 -0700 -- .../apache/spark/scheduler/SparkListenerSuite.scala | 15 ++- 1 file changed, 14 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3e260053/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala index da6ecb8..fa47a52 100644 --- a/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala +++ b/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala @@ -17,6 +17,7 @@ package org.apache.spark.scheduler +import java.io.{Externalizable, IOException, ObjectInput, ObjectOutput} import java.util.concurrent.Semaphore import scala.collection.JavaConverters._ @@ -294,10 +295,13 @@ class SparkListenerSuite extends SparkFunSuite with LocalSparkContext with Match val listener = new SaveStageAndTaskInfo sc.addSparkListener(listener) sc.addSparkListener(new StatsReportListener) -// just to make sure some of the tasks take a noticeable amount of time +// just to make sure some of the tasks and their deserialization take a noticeable +// amount of time +val slowDeserializable = new SlowDeserializable val w = { i: Int => if (i == 0) { Thread.sleep(100) +slowDeserializable.use() } i } @@ -583,3 +587,12 @@ private class FirehoseListenerThatAcceptsSparkConf(conf: SparkConf) extends Spar case _ => } } + +private class SlowDeserializable extends Externalizable { + + override def writeExternal(out: ObjectOutput): Unit = { } + + override def readExternal(in: ObjectInput): Unit = Thread.sleep(1) + + def use(): Unit = { } +} - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-19181][CORE] Fixing flaky "SparkListenerSuite.local metrics"
Repository: spark Updated Branches: refs/heads/branch-2.3 16cd9ac52 -> 4c49b12da [SPARK-19181][CORE] Fixing flaky "SparkListenerSuite.local metrics" ## What changes were proposed in this pull request? Sometimes "SparkListenerSuite.local metrics" test fails because the average of executorDeserializeTime is too short. As squito suggested to avoid these situations in one of the task a reference introduced to an object implementing a custom Externalizable.readExternal which sleeps 1ms before returning. ## How was this patch tested? With unit tests (and checking the effect of this change to the average with a much larger sleep time). Author: âattilapirosâAuthor: Attila Zsolt Piros <2017933+attilapi...@users.noreply.github.com> Closes #21280 from attilapiros/SPARK-19181. (cherry picked from commit 3e2600538ee477ffe3f23fba57719e035219550b) Signed-off-by: Marcelo Vanzin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4c49b12d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4c49b12d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4c49b12d Branch: refs/heads/branch-2.3 Commit: 4c49b12da512ae29e2e4b773a334abbf6a4f08f1 Parents: 16cd9ac Author: âattilapirosâ Authored: Thu May 10 14:26:38 2018 -0700 Committer: Marcelo Vanzin Committed: Thu May 10 14:26:47 2018 -0700 -- .../apache/spark/scheduler/SparkListenerSuite.scala | 15 ++- 1 file changed, 14 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4c49b12d/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala index da6ecb8..fa47a52 100644 --- a/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala +++ b/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala @@ -17,6 +17,7 @@ package org.apache.spark.scheduler +import java.io.{Externalizable, IOException, ObjectInput, ObjectOutput} import java.util.concurrent.Semaphore import scala.collection.JavaConverters._ @@ -294,10 +295,13 @@ class SparkListenerSuite extends SparkFunSuite with LocalSparkContext with Match val listener = new SaveStageAndTaskInfo sc.addSparkListener(listener) sc.addSparkListener(new StatsReportListener) -// just to make sure some of the tasks take a noticeable amount of time +// just to make sure some of the tasks and their deserialization take a noticeable +// amount of time +val slowDeserializable = new SlowDeserializable val w = { i: Int => if (i == 0) { Thread.sleep(100) +slowDeserializable.use() } i } @@ -583,3 +587,12 @@ private class FirehoseListenerThatAcceptsSparkConf(conf: SparkConf) extends Spar case _ => } } + +private class SlowDeserializable extends Externalizable { + + override def writeExternal(out: ObjectOutput): Unit = { } + + override def readExternal(in: ObjectInput): Unit = Thread.sleep(1) + + def use(): Unit = { } +} - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r26837 - in /dev/spark/2.3.1-SNAPSHOT-2018_05_10_14_01-16cd9ac-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Thu May 10 21:15:50 2018 New Revision: 26837 Log: Apache Spark 2.3.1-SNAPSHOT-2018_05_10_14_01-16cd9ac docs [This commit notification would consist of 1443 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r26835 - in /dev/spark/2.4.0-SNAPSHOT-2018_05_10_12_01-6282fc6-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Thu May 10 19:15:10 2018 New Revision: 26835 Log: Apache Spark 2.4.0-SNAPSHOT-2018_05_10_12_01-6282fc6 docs [This commit notification would consist of 1462 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-24137][K8S] Mount local directories as empty dir volumes.
Repository: spark Updated Branches: refs/heads/master f4fed0512 -> 6282fc64e [SPARK-24137][K8S] Mount local directories as empty dir volumes. ## What changes were proposed in this pull request? Drastically improves performance and won't cause Spark applications to fail because they write too much data to the Docker image's specific file system. The file system's directories that back emptydir volumes are generally larger and more performant. ## How was this patch tested? Has been in use via the prototype version of Kubernetes support, but lost in the transition to here. Author: mcheahCloses #21238 from mccheah/mount-local-dirs. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6282fc64 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6282fc64 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6282fc64 Branch: refs/heads/master Commit: 6282fc64e32fc2f70e79ace14efd4922e4535dbb Parents: f4fed05 Author: mcheah Authored: Thu May 10 11:36:41 2018 -0700 Committer: Anirudh Ramanathan Committed: Thu May 10 11:36:41 2018 -0700 -- .../main/scala/org/apache/spark/SparkConf.scala | 5 +- .../k8s/features/LocalDirsFeatureStep.scala | 77 + .../k8s/submit/KubernetesDriverBuilder.scala| 10 +- .../cluster/k8s/KubernetesExecutorBuilder.scala | 9 +- .../features/LocalDirsFeatureStepSuite.scala| 111 +++ .../submit/KubernetesDriverBuilderSuite.scala | 13 ++- .../k8s/KubernetesExecutorBuilderSuite.scala| 12 +- 7 files changed, 223 insertions(+), 14 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6282fc64/core/src/main/scala/org/apache/spark/SparkConf.scala -- diff --git a/core/src/main/scala/org/apache/spark/SparkConf.scala b/core/src/main/scala/org/apache/spark/SparkConf.scala index 129956e..dab4095 100644 --- a/core/src/main/scala/org/apache/spark/SparkConf.scala +++ b/core/src/main/scala/org/apache/spark/SparkConf.scala @@ -454,8 +454,9 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Seria */ private[spark] def validateSettings() { if (contains("spark.local.dir")) { - val msg = "In Spark 1.0 and later spark.local.dir will be overridden by the value set by " + -"the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN)." + val msg = "Note that spark.local.dir will be overridden by the value set by " + +"the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS" + +" in YARN)." logWarning(msg) } http://git-wip-us.apache.org/repos/asf/spark/blob/6282fc64/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/LocalDirsFeatureStep.scala -- diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/LocalDirsFeatureStep.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/LocalDirsFeatureStep.scala new file mode 100644 index 000..70b3073 --- /dev/null +++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/LocalDirsFeatureStep.scala @@ -0,0 +1,77 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s.features + +import java.nio.file.Paths +import java.util.UUID + +import io.fabric8.kubernetes.api.model.{ContainerBuilder, HasMetadata, PodBuilder, VolumeBuilder, VolumeMountBuilder} + +import org.apache.spark.deploy.k8s.{KubernetesConf, KubernetesDriverSpecificConf, KubernetesRoleSpecificConf, SparkPod} + +private[spark] class LocalDirsFeatureStep( +conf: KubernetesConf[_ <: KubernetesRoleSpecificConf], +defaultLocalDir: String = s"/var/data/spark-${UUID.randomUUID}") + extends KubernetesFeatureConfigStep { + + //
[2/2] spark git commit: [PYSPARK] Update py4j to version 0.10.7.
[PYSPARK] Update py4j to version 0.10.7. (cherry picked from commit cc613b552e753d03cb62661591de59e1c8d82c74) Signed-off-by: Marcelo VanzinProject: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/323dc3ad Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/323dc3ad Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/323dc3ad Branch: refs/heads/branch-2.3 Commit: 323dc3ad02e63a7c99b5bd6da618d6020657ecba Parents: eab10f9 Author: Marcelo Vanzin Authored: Fri Apr 13 14:28:24 2018 -0700 Committer: Marcelo Vanzin Committed: Thu May 10 10:47:37 2018 -0700 -- LICENSE | 2 +- bin/pyspark | 6 +- bin/pyspark2.cmd| 2 +- core/pom.xml| 2 +- .../org/apache/spark/SecurityManager.scala | 11 +- .../spark/api/python/PythonGatewayServer.scala | 50 ++--- .../org/apache/spark/api/python/PythonRDD.scala | 29 -- .../apache/spark/api/python/PythonUtils.scala | 2 +- .../spark/api/python/PythonWorkerFactory.scala | 21 ++-- .../org/apache/spark/deploy/PythonRunner.scala | 12 ++- .../apache/spark/internal/config/package.scala | 5 + .../spark/security/SocketAuthHelper.scala | 101 +++ .../scala/org/apache/spark/util/Utils.scala | 12 +++ .../spark/security/SocketAuthHelperSuite.scala | 97 ++ dev/deps/spark-deps-hadoop-2.6 | 2 +- dev/deps/spark-deps-hadoop-2.7 | 2 +- dev/run-pip-tests | 2 +- python/README.md| 2 +- python/docs/Makefile| 2 +- python/lib/py4j-0.10.6-src.zip | Bin 80352 -> 0 bytes python/lib/py4j-0.10.7-src.zip | Bin 0 -> 42437 bytes python/pyspark/context.py | 4 +- python/pyspark/daemon.py| 21 +++- python/pyspark/java_gateway.py | 93 ++--- python/pyspark/rdd.py | 21 ++-- python/pyspark/sql/dataframe.py | 12 +-- python/pyspark/worker.py| 7 +- python/setup.py | 2 +- .../org/apache/spark/deploy/yarn/Client.scala | 2 +- .../spark/deploy/yarn/YarnClusterSuite.scala| 2 +- sbin/spark-config.sh| 2 +- .../scala/org/apache/spark/sql/Dataset.scala| 6 +- 32 files changed, 418 insertions(+), 116 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/323dc3ad/LICENSE -- diff --git a/LICENSE b/LICENSE index c2b0d72..820f14d 100644 --- a/LICENSE +++ b/LICENSE @@ -263,7 +263,7 @@ The text of each license is also included at licenses/LICENSE-[project].txt. (New BSD license) Protocol Buffer Java API (org.spark-project.protobuf:protobuf-java:2.4.1-shaded - http://code.google.com/p/protobuf) (The BSD License) Fortran to Java ARPACK (net.sourceforge.f2j:arpack_combined_all:0.1 - http://f2j.sourceforge.net) (The BSD License) xmlenc Library (xmlenc:xmlenc:0.52 - http://xmlenc.sourceforge.net) - (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.6 - http://py4j.sourceforge.net/) + (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.7 - http://py4j.sourceforge.net/) (Two-clause BSD-style license) JUnit-Interface (com.novocode:junit-interface:0.10 - http://github.com/szeiger/junit-interface/) (BSD licence) sbt and sbt-launch-lib.bash (BSD 3 Clause) d3.min.js (https://github.com/mbostock/d3/blob/master/LICENSE) http://git-wip-us.apache.org/repos/asf/spark/blob/323dc3ad/bin/pyspark -- diff --git a/bin/pyspark b/bin/pyspark index dd28627..5d5affb 100755 --- a/bin/pyspark +++ b/bin/pyspark @@ -25,14 +25,14 @@ source "${SPARK_HOME}"/bin/load-spark-env.sh export _SPARK_CMD_USAGE="Usage: ./bin/pyspark [options]" # In Spark 2.0, IPYTHON and IPYTHON_OPTS are removed and pyspark fails to launch if either option -# is set in the user's environment. Instead, users should set PYSPARK_DRIVER_PYTHON=ipython +# is set in the user's environment. Instead, users should set PYSPARK_DRIVER_PYTHON=ipython # to use IPython and set PYSPARK_DRIVER_PYTHON_OPTS to pass options when starting the Python driver # (e.g. PYSPARK_DRIVER_PYTHON_OPTS='notebook'). This supports full customization of the IPython # and executor Python executables. # Fail noisily if removed options are set if [[ -n "$IPYTHON" || -n "$IPYTHON_OPTS" ]]; then - echo "Error in
[1/2] spark git commit: [SPARKR] Match pyspark features in SparkR communication protocol.
Repository: spark Updated Branches: refs/heads/branch-2.3 eab10f994 -> 16cd9ac52 [SPARKR] Match pyspark features in SparkR communication protocol. (cherry picked from commit 628c7b517969c4a7ccb26ea67ab3dd61266073ca) Signed-off-by: Marcelo VanzinProject: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/16cd9ac5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/16cd9ac5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/16cd9ac5 Branch: refs/heads/branch-2.3 Commit: 16cd9ac5264831e061c033b26fe1173ebc88e5d1 Parents: 323dc3a Author: Marcelo Vanzin Authored: Tue Apr 17 13:29:43 2018 -0700 Committer: Marcelo Vanzin Committed: Thu May 10 10:47:37 2018 -0700 -- R/pkg/R/client.R| 4 +- R/pkg/R/deserialize.R | 10 ++-- R/pkg/R/sparkR.R| 39 -- R/pkg/inst/worker/daemon.R | 4 +- R/pkg/inst/worker/worker.R | 5 +- .../org/apache/spark/api/r/RAuthHelper.scala| 38 ++ .../scala/org/apache/spark/api/r/RBackend.scala | 43 --- .../spark/api/r/RBackendAuthHandler.scala | 55 .../scala/org/apache/spark/api/r/RRunner.scala | 35 + .../scala/org/apache/spark/deploy/RRunner.scala | 6 ++- 10 files changed, 210 insertions(+), 29 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/16cd9ac5/R/pkg/R/client.R -- diff --git a/R/pkg/R/client.R b/R/pkg/R/client.R index 9d82814..7244cc9 100644 --- a/R/pkg/R/client.R +++ b/R/pkg/R/client.R @@ -19,7 +19,7 @@ # Creates a SparkR client connection object # if one doesn't already exist -connectBackend <- function(hostname, port, timeout) { +connectBackend <- function(hostname, port, timeout, authSecret) { if (exists(".sparkRcon", envir = .sparkREnv)) { if (isOpen(.sparkREnv[[".sparkRCon"]])) { cat("SparkRBackend client connection already exists\n") @@ -29,7 +29,7 @@ connectBackend <- function(hostname, port, timeout) { con <- socketConnection(host = hostname, port = port, server = FALSE, blocking = TRUE, open = "wb", timeout = timeout) - + doServerAuth(con, authSecret) assign(".sparkRCon", con, envir = .sparkREnv) con } http://git-wip-us.apache.org/repos/asf/spark/blob/16cd9ac5/R/pkg/R/deserialize.R -- diff --git a/R/pkg/R/deserialize.R b/R/pkg/R/deserialize.R index a90f7d3..cb03f16 100644 --- a/R/pkg/R/deserialize.R +++ b/R/pkg/R/deserialize.R @@ -60,14 +60,18 @@ readTypedObject <- function(con, type) { stop(paste("Unsupported type for deserialization", type))) } -readString <- function(con) { - stringLen <- readInt(con) - raw <- readBin(con, raw(), stringLen, endian = "big") +readStringData <- function(con, len) { + raw <- readBin(con, raw(), len, endian = "big") string <- rawToChar(raw) Encoding(string) <- "UTF-8" string } +readString <- function(con) { + stringLen <- readInt(con) + readStringData(con, stringLen) +} + readInt <- function(con) { readBin(con, integer(), n = 1, endian = "big") } http://git-wip-us.apache.org/repos/asf/spark/blob/16cd9ac5/R/pkg/R/sparkR.R -- diff --git a/R/pkg/R/sparkR.R b/R/pkg/R/sparkR.R index 965471f..7430d84 100644 --- a/R/pkg/R/sparkR.R +++ b/R/pkg/R/sparkR.R @@ -161,6 +161,10 @@ sparkR.sparkContext <- function( " please use the --packages commandline instead", sep = ",")) } backendPort <- existingPort +authSecret <- Sys.getenv("SPARKR_BACKEND_AUTH_SECRET") +if (nchar(authSecret) == 0) { + stop("Auth secret not provided in environment.") +} } else { path <- tempfile(pattern = "backend_port") submitOps <- getClientModeSparkSubmitOpts( @@ -189,16 +193,27 @@ sparkR.sparkContext <- function( monitorPort <- readInt(f) rLibPath <- readString(f) connectionTimeout <- readInt(f) + +# Don't use readString() so that we can provide a useful +# error message if the R and Java versions are mismatched. +authSecretLen = readInt(f) +if (length(authSecretLen) == 0 || authSecretLen == 0) { + stop("Unexpected EOF in JVM connection data. Mismatched versions?") +} +authSecret <- readStringData(f, authSecretLen) close(f) file.remove(path) if (length(backendPort) == 0 || backendPort == 0 || length(monitorPort) == 0 || monitorPort == 0 || -length(rLibPath) != 1) { +length(rLibPath) != 1 || length(authSecret) == 0) { stop("JVM failed
svn commit: r26831 - in /dev/spark/2.3.1-SNAPSHOT-2018_05_10_10_01-eab10f9-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Thu May 10 17:15:43 2018 New Revision: 26831 Log: Apache Spark 2.3.1-SNAPSHOT-2018_05_10_10_01-eab10f9 docs [This commit notification would consist of 1443 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-24171] Adding a note for non-deterministic functions
Repository: spark Updated Branches: refs/heads/master 94d671448 -> f4fed0512 [SPARK-24171] Adding a note for non-deterministic functions ## What changes were proposed in this pull request? I propose to add a clear statement for functions like `collect_list()` about non-deterministic behavior of such functions. The behavior must be taken into account by user while creating and running queries. Author: Maxim GekkCloses #21228 from MaxGekk/deterministic-comments. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f4fed051 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f4fed051 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f4fed051 Branch: refs/heads/master Commit: f4fed0512101a67d9dae50ace11d3940b910e05e Parents: 94d6714 Author: Maxim Gekk Authored: Thu May 10 09:44:49 2018 -0700 Committer: gatorsmile Committed: Thu May 10 09:44:49 2018 -0700 -- R/pkg/R/functions.R | 11 + python/pyspark/sql/functions.py | 18 .../expressions/MonotonicallyIncreasingID.scala | 1 + .../spark/sql/catalyst/expressions/misc.scala | 5 ++- .../expressions/randomExpressions.scala | 8 ++-- .../scala/org/apache/spark/sql/functions.scala | 46 ++-- 6 files changed, 81 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f4fed051/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index 0ec99d1..04d0e46 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -805,6 +805,8 @@ setMethod("factorial", #' #' The function by default returns the first values it sees. It will return the first non-missing #' value it sees when na.rm is set to true. If all values are missing, then NA is returned. +#' Note: the function is non-deterministic because its results depends on order of rows which +#' may be non-deterministic after a shuffle. #' #' @param na.rm a logical value indicating whether NA values should be stripped #'before the computation proceeds. @@ -948,6 +950,8 @@ setMethod("kurtosis", #' #' The function by default returns the last values it sees. It will return the last non-missing #' value it sees when na.rm is set to true. If all values are missing, then NA is returned. +#' Note: the function is non-deterministic because its results depends on order of rows which +#' may be non-deterministic after a shuffle. #' #' @param x column to compute on. #' @param na.rm a logical value indicating whether NA values should be stripped @@ -1201,6 +1205,7 @@ setMethod("minute", #' 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594. #' This is equivalent to the MONOTONICALLY_INCREASING_ID function in SQL. #' The method should be used with no argument. +#' Note: the function is non-deterministic because its result depends on partition IDs. #' #' @rdname column_nonaggregate_functions #' @aliases monotonically_increasing_id monotonically_increasing_id,missing-method @@ -2584,6 +2589,7 @@ setMethod("lpad", signature(x = "Column", len = "numeric", pad = "character"), #' @details #' \code{rand}: Generates a random column with independent and identically distributed (i.i.d.) #' samples from U[0.0, 1.0]. +#' Note: the function is non-deterministic in general case. #' #' @rdname column_nonaggregate_functions #' @param seed a random seed. Can be missing. @@ -2612,6 +2618,7 @@ setMethod("rand", signature(seed = "numeric"), #' @details #' \code{randn}: Generates a column with independent and identically distributed (i.i.d.) samples #' from the standard normal distribution. +#' Note: the function is non-deterministic in general case. #' #' @rdname column_nonaggregate_functions #' @aliases randn randn,missing-method @@ -3188,6 +3195,8 @@ setMethod("create_map", #' @details #' \code{collect_list}: Creates a list of objects with duplicates. +#' Note: the function is non-deterministic because the order of collected results depends +#' on order of rows which may be non-deterministic after a shuffle. #' #' @rdname column_aggregate_functions #' @aliases collect_list collect_list,Column-method @@ -3207,6 +3216,8 @@ setMethod("collect_list", #' @details #' \code{collect_set}: Creates a list of objects with duplicate elements eliminated. +#' Note: the function is non-deterministic because the order of collected results depends +#' on order of rows which may be non-deterministic after a shuffle. #' #' @rdname column_aggregate_functions #' @aliases collect_set collect_set,Column-method http://git-wip-us.apache.org/repos/asf/spark/blob/f4fed051/python/pyspark/sql/functions.py
spark git commit: [SPARK-24068][BACKPORT-2.3] Propagating DataFrameReader's options to Text datasource on schema inferring
Repository: spark Updated Branches: refs/heads/branch-2.3 8889d7864 -> eab10f994 [SPARK-24068][BACKPORT-2.3] Propagating DataFrameReader's options to Text datasource on schema inferring ## What changes were proposed in this pull request? While reading CSV or JSON files, DataFrameReader's options are converted to Hadoop's parameters, for example there: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L302 but the options are not propagated to Text datasource on schema inferring, for instance: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L184-L188 The PR proposes propagation of user's options to Text datasource on scheme inferring in similar way as user's options are converted to Hadoop parameters if schema is specified. ## How was this patch tested? The changes were tested manually by using https://github.com/twitter/hadoop-lzo: ``` hadoop-lzo> mvn clean package hadoop-lzo> ln -s ./target/hadoop-lzo-0.4.21-SNAPSHOT.jar ./hadoop-lzo.jar ``` Create 2 test files in JSON and CSV format and compress them: ```shell $ cat test.csv col1|col2 a|1 $ lzop test.csv $ cat test.json {"col1":"a","col2":1} $ lzop test.json ``` Run `spark-shell` with hadoop-lzo: ``` bin/spark-shell --jars ~/hadoop-lzo/hadoop-lzo.jar ``` reading compressed CSV and JSON without schema: ```scala spark.read.option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").option("inferSchema",true).option("header",true).option("sep","|").csv("test.csv.lzo").show() +++ |col1|col2| +++ | a| 1| +++ ``` ```scala spark.read.option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").option("multiLine", true).json("test.json.lzo").printSchema root |-- col1: string (nullable = true) |-- col2: long (nullable = true) ``` Author: Maxim GekkAuthor: Maxim Gekk Closes #21292 from MaxGekk/text-options-backport-v2.3. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eab10f99 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eab10f99 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eab10f99 Branch: refs/heads/branch-2.3 Commit: eab10f9945c1d01daa45c233a39dedfd184f543c Parents: 8889d78 Author: Maxim Gekk Authored: Fri May 11 00:28:43 2018 +0800 Committer: hyukjinkwon Committed: Fri May 11 00:28:43 2018 +0800 -- .../spark/sql/catalyst/json/JSONOptions.scala | 2 +- .../execution/datasources/csv/CSVDataSource.scala | 6 -- .../sql/execution/datasources/csv/CSVOptions.scala | 2 +- .../execution/datasources/csv/UnivocityParser.scala | 2 -- .../execution/datasources/json/JsonDataSource.scala | 16 ++-- 5 files changed, 16 insertions(+), 12 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/eab10f99/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala index 652412b..190fcc6 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala @@ -31,7 +31,7 @@ import org.apache.spark.sql.catalyst.util._ * Most of these map directly to Jackson's internal options, specified in [[JsonParser.Feature]]. */ private[sql] class JSONOptions( -@transient private val parameters: CaseInsensitiveMap[String], +@transient val parameters: CaseInsensitiveMap[String], defaultTimeZoneId: String, defaultColumnNameOfCorruptRecord: String) extends Logging with Serializable { http://git-wip-us.apache.org/repos/asf/spark/blob/eab10f99/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala index 4870d75..fffad17 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala @@ -184,7 +184,8 @@ object TextInputCSVDataSource extends CSVDataSource { DataSource.apply( sparkSession, paths = paths, -
svn commit: r26830 - in /dev/spark/2.4.0-SNAPSHOT-2018_05_10_08_01-94d6714-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Thu May 10 15:16:10 2018 New Revision: 26830 Log: Apache Spark 2.4.0-SNAPSHOT-2018_05_10_08_01-94d6714 docs [This commit notification would consist of 1462 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-23907][SQL] Add regr_* functions
Repository: spark Updated Branches: refs/heads/master e3d434994 -> 94d671448 [SPARK-23907][SQL] Add regr_* functions ## What changes were proposed in this pull request? The PR introduces regr_slope, regr_intercept, regr_r2, regr_sxx, regr_syy, regr_sxy, regr_avgx, regr_avgy, regr_count. The implementation of this functions mirrors Hive's one in HIVE-15978. ## How was this patch tested? added UT (values compared with Hive) Author: Marco GaidoCloses #21054 from mgaido91/SPARK-23907. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/94d67144 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/94d67144 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/94d67144 Branch: refs/heads/master Commit: 94d671448240c8f6da11d2523ba9e4ae5b56a410 Parents: e3d4349 Author: Marco Gaido Authored: Thu May 10 20:38:52 2018 +0900 Committer: Takuya UESHIN Committed: Thu May 10 20:38:52 2018 +0900 -- .../catalyst/analysis/FunctionRegistry.scala| 9 + .../expressions/aggregate/Average.scala | 47 +++-- .../aggregate/CentralMomentAgg.scala| 60 +++--- .../catalyst/expressions/aggregate/Corr.scala | 52 ++--- .../catalyst/expressions/aggregate/Count.scala | 47 +++-- .../expressions/aggregate/Covariance.scala | 36 ++-- .../expressions/aggregate/regression.scala | 190 +++ .../scala/org/apache/spark/sql/functions.scala | 172 + .../sql-tests/inputs/udaf-regrfunctions.sql | 56 ++ .../results/udaf-regrfunctions.sql.out | 93 + .../spark/sql/DataFrameAggregateSuite.scala | 71 ++- 11 files changed, 721 insertions(+), 112 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/94d67144/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala index 87b0911..087d000 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala @@ -299,6 +299,15 @@ object FunctionRegistry { expression[CollectList]("collect_list"), expression[CollectSet]("collect_set"), expression[CountMinSketchAgg]("count_min_sketch"), +expression[RegrCount]("regr_count"), +expression[RegrSXX]("regr_sxx"), +expression[RegrSYY]("regr_syy"), +expression[RegrAvgX]("regr_avgx"), +expression[RegrAvgY]("regr_avgy"), +expression[RegrSXY]("regr_sxy"), +expression[RegrSlope]("regr_slope"), +expression[RegrR2]("regr_r2"), +expression[RegrIntercept]("regr_intercept"), // string functions expression[Ascii]("ascii"), http://git-wip-us.apache.org/repos/asf/spark/blob/94d67144/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala index 708bdbf..a133bc2 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala @@ -23,24 +23,12 @@ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.util.TypeUtils import org.apache.spark.sql.types._ -@ExpressionDescription( - usage = "_FUNC_(expr) - Returns the mean calculated from values of a group.") -case class Average(child: Expression) extends DeclarativeAggregate with ImplicitCastInputTypes { - - override def prettyName: String = "avg" - - override def children: Seq[Expression] = child :: Nil +abstract class AverageLike(child: Expression) extends DeclarativeAggregate { override def nullable: Boolean = true - // Return data type. override def dataType: DataType = resultType - override def inputTypes: Seq[AbstractDataType] = Seq(NumericType) - - override def checkInputDataTypes(): TypeCheckResult = -TypeUtils.checkForNumericExpr(child.dataType, "function average") - private lazy val resultType = child.dataType match { case DecimalType.Fixed(p, s) => DecimalType.bounded(p + 4, s + 4) @@ -62,14 +50,6 @@ case class Average(child: Expression) extends DeclarativeAggregate with Implicit /* count =