spark git commit: [SPARK-16632][SQL] Respect Hive schema when merging parquet schema.
Repository: spark Updated Branches: refs/heads/branch-2.0 6f209c8fa -> c2b5b3ca5 [SPARK-16632][SQL] Respect Hive schema when merging parquet schema. When Hive (or at least certain versions of Hive) creates parquet files containing tinyint or smallint columns, it stores them as int32, but doesn't annotate the parquet field as containing the corresponding int8 / int16 data. When Spark reads those files using the vectorized reader, it follows the parquet schema for these fields, but when actually reading the data it tries to use the type fetched from the metastore, and then fails because data has been loaded into the wrong fields in OnHeapColumnVector. So instead of blindly trusting the parquet schema, check whether the Catalyst-provided schema disagrees with it, and adjust the types so that the necessary metadata is present when loading the data into the ColumnVector instance. Tested with unit tests and with tests that create byte / short columns in Hive and try to read them from Spark. Author: Marcelo VanzinCloses #14272 from vanzin/SPARK-16632. (cherry picked from commit 75146be6ba5e9f559f5f15430310bb476ee0812c) Signed-off-by: Cheng Lian Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c2b5b3ca Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c2b5b3ca Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c2b5b3ca Branch: refs/heads/branch-2.0 Commit: c2b5b3ca538aaaef946653e60bd68e38c58dc41f Parents: 6f209c8 Author: Marcelo Vanzin Authored: Wed Jul 20 13:00:22 2016 +0800 Committer: Cheng Lian Committed: Wed Jul 20 13:49:45 2016 +0800 -- .../parquet/ParquetReadSupport.scala| 18 + .../parquet/ParquetSchemaSuite.scala| 39 2 files changed, 57 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c2b5b3ca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala index 12f4974..1628e4c 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala @@ -26,6 +26,8 @@ import org.apache.parquet.hadoop.api.{InitContext, ReadSupport} import org.apache.parquet.hadoop.api.ReadSupport.ReadContext import org.apache.parquet.io.api.RecordMaterializer import org.apache.parquet.schema._ +import org.apache.parquet.schema.OriginalType._ +import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName._ import org.apache.parquet.schema.Type.Repetition import org.apache.spark.internal.Logging @@ -116,6 +118,12 @@ private[parquet] object ParquetReadSupport { } private def clipParquetType(parquetType: Type, catalystType: DataType): Type = { +val primName = if (parquetType.isPrimitive()) { + parquetType.asPrimitiveType().getPrimitiveTypeName() +} else { + null +} + catalystType match { case t: ArrayType if !isPrimitiveCatalystType(t.elementType) => // Only clips array types with nested type as element type. @@ -130,6 +138,16 @@ private[parquet] object ParquetReadSupport { case t: StructType => clipParquetGroup(parquetType.asGroupType(), t) + case _: ByteType if primName == INT32 => +// SPARK-16632: Handle case where Hive stores bytes in a int32 field without specifying +// the original type. 
+Types.primitive(INT32, parquetType.getRepetition()).as(INT_8).named(parquetType.getName()) + + case _: ShortType if primName == INT32 => +// SPARK-16632: Handle case where Hive stores shorts in a int32 field without specifying +// the original type. +Types.primitive(INT32, parquetType.getRepetition()).as(INT_16).named(parquetType.getName()) + case _ => // UDTs and primitive types are not clipped. For UDTs, a clipped version might not be able // to be mapped to desired user-space types. So UDTs shouldn't participate schema merging. http://git-wip-us.apache.org/repos/asf/spark/blob/c2b5b3ca/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala
spark git commit: [SPARK-16632][SQL] Respect Hive schema when merging parquet schema.
Repository: spark Updated Branches: refs/heads/master fc2326362 -> 75146be6b [SPARK-16632][SQL] Respect Hive schema when merging parquet schema. When Hive (or at least certain versions of Hive) creates parquet files containing tinyint or smallint columns, it stores them as int32, but doesn't annotate the parquet field as containing the corresponding int8 / int16 data. When Spark reads those files using the vectorized reader, it follows the parquet schema for these fields, but when actually reading the data it tries to use the type fetched from the metastore, and then fails because data has been loaded into the wrong fields in OnHeapColumnVector. So instead of blindly trusting the parquet schema, check whether the Catalyst-provided schema disagrees with it, and adjust the types so that the necessary metadata is present when loading the data into the ColumnVector instance. Tested with unit tests and with tests that create byte / short columns in Hive and try to read them from Spark. Author: Marcelo VanzinCloses #14272 from vanzin/SPARK-16632. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/75146be6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/75146be6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/75146be6 Branch: refs/heads/master Commit: 75146be6ba5e9f559f5f15430310bb476ee0812c Parents: fc23263 Author: Marcelo Vanzin Authored: Wed Jul 20 13:00:22 2016 +0800 Committer: Cheng Lian Committed: Wed Jul 20 13:00:22 2016 +0800 -- .../parquet/ParquetReadSupport.scala| 18 + .../parquet/ParquetSchemaSuite.scala| 39 2 files changed, 57 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/75146be6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala index e6ef634..46d786d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala @@ -26,6 +26,8 @@ import org.apache.parquet.hadoop.api.{InitContext, ReadSupport} import org.apache.parquet.hadoop.api.ReadSupport.ReadContext import org.apache.parquet.io.api.RecordMaterializer import org.apache.parquet.schema._ +import org.apache.parquet.schema.OriginalType._ +import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName._ import org.apache.parquet.schema.Type.Repetition import org.apache.spark.internal.Logging @@ -120,6 +122,12 @@ private[parquet] object ParquetReadSupport { } private def clipParquetType(parquetType: Type, catalystType: DataType): Type = { +val primName = if (parquetType.isPrimitive()) { + parquetType.asPrimitiveType().getPrimitiveTypeName() +} else { + null +} + catalystType match { case t: ArrayType if !isPrimitiveCatalystType(t.elementType) => // Only clips array types with nested type as element type. @@ -134,6 +142,16 @@ private[parquet] object ParquetReadSupport { case t: StructType => clipParquetGroup(parquetType.asGroupType(), t) + case _: ByteType if primName == INT32 => +// SPARK-16632: Handle case where Hive stores bytes in a int32 field without specifying +// the original type. 
+Types.primitive(INT32, parquetType.getRepetition()).as(INT_8).named(parquetType.getName()) + + case _: ShortType if primName == INT32 => +// SPARK-16632: Handle case where Hive stores shorts in a int32 field without specifying +// the original type. +Types.primitive(INT32, parquetType.getRepetition()).as(INT_16).named(parquetType.getName()) + case _ => // UDTs and primitive types are not clipped. For UDTs, a clipped version might not be able // to be mapped to desired user-space types. So UDTs shouldn't participate schema merging. http://git-wip-us.apache.org/repos/asf/spark/blob/75146be6/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala index 8a980a7..31ebec0 100644 ---
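For readers skimming the diff above, here is a standalone sketch of the new branches in `clipParquetType` — not the exact Spark code, but the same idea, assuming parquet-mr's `Types` builder and Spark's Catalyst type objects are on the classpath. When the file schema declares a plain INT32 but the metastore (Catalyst) schema says byte or short, the field is re-annotated so the vectorized reader fills the matching `ColumnVector` slots.

```scala
import org.apache.parquet.schema.OriginalType._
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName._
import org.apache.parquet.schema.{Type, Types}
import org.apache.spark.sql.types.{ByteType, DataType, ShortType}

// Re-annotate un-tagged INT32 fields when the Catalyst (metastore) schema disagrees.
def adjustForHive(parquetField: Type, catalystType: DataType): Type = {
  val isPlainInt32 = parquetField.isPrimitive &&
    parquetField.asPrimitiveType.getPrimitiveTypeName == INT32
  catalystType match {
    case ByteType if isPlainInt32 =>
      Types.primitive(INT32, parquetField.getRepetition).as(INT_8).named(parquetField.getName)
    case ShortType if isPlainInt32 =>
      Types.primitive(INT32, parquetField.getRepetition).as(INT_16).named(parquetField.getName)
    case _ => parquetField   // everything else keeps the file's own type
  }
}
```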
spark git commit: [SPARK-10683][SPARK-16510][SPARKR] Move SparkR include jar test to SparkSubmitSuite
Repository: spark Updated Branches: refs/heads/branch-2.0 f58fd4620 -> 6f209c8fa [SPARK-10683][SPARK-16510][SPARKR] Move SparkR include jar test to SparkSubmitSuite ## What changes were proposed in this pull request? This change moves the include jar test from R to SparkSubmitSuite and uses a dynamically compiled jar. This helps us remove the binary jar from the R package and solves both the CRAN warnings and the lack of source being available for this jar. ## How was this patch tested? SparkR unit tests, SparkSubmitSuite, check-cran.sh Author: Shivaram VenkataramanCloses #14243 from shivaram/sparkr-jar-move. (cherry picked from commit fc23263623d5dcd1167fa93c094fe41ace77c326) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6f209c8f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6f209c8f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6f209c8f Branch: refs/heads/branch-2.0 Commit: 6f209c8faad0c928368852c881e2aaabe100b152 Parents: f58fd46 Author: Shivaram Venkataraman Authored: Tue Jul 19 19:28:08 2016 -0700 Committer: Shivaram Venkataraman Committed: Tue Jul 19 19:28:18 2016 -0700 -- .../inst/test_support/sparktestjar_2.10-1.0.jar | Bin 2886 -> 0 bytes R/pkg/inst/tests/testthat/jarTest.R | 10 ++--- R/pkg/inst/tests/testthat/test_includeJAR.R | 36 -- .../scala/org/apache/spark/api/r/RUtils.scala | 9 + .../apache/spark/deploy/SparkSubmitSuite.scala | 38 +++ 5 files changed, 52 insertions(+), 41 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6f209c8f/R/pkg/inst/test_support/sparktestjar_2.10-1.0.jar -- diff --git a/R/pkg/inst/test_support/sparktestjar_2.10-1.0.jar b/R/pkg/inst/test_support/sparktestjar_2.10-1.0.jar deleted file mode 100644 index 1d5c2af..000 Binary files a/R/pkg/inst/test_support/sparktestjar_2.10-1.0.jar and /dev/null differ http://git-wip-us.apache.org/repos/asf/spark/blob/6f209c8f/R/pkg/inst/tests/testthat/jarTest.R -- diff --git a/R/pkg/inst/tests/testthat/jarTest.R b/R/pkg/inst/tests/testthat/jarTest.R index 51754a4..c9615c8 100644 --- a/R/pkg/inst/tests/testthat/jarTest.R +++ b/R/pkg/inst/tests/testthat/jarTest.R @@ -16,17 +16,17 @@ # library(SparkR) -sparkR.session() +sc <- sparkR.session() -helloTest <- SparkR:::callJStatic("sparkR.test.hello", +helloTest <- SparkR:::callJStatic("sparkrtest.DummyClass", "helloWorld", "Dave") +stopifnot(identical(helloTest, "Hello Dave")) -basicFunction <- SparkR:::callJStatic("sparkR.test.basicFunction", +basicFunction <- SparkR:::callJStatic("sparkrtest.DummyClass", "addStuff", 2L, 2L) +stopifnot(basicFunction == 4L) sparkR.session.stop() -output <- c(helloTest, basicFunction) -writeLines(output) http://git-wip-us.apache.org/repos/asf/spark/blob/6f209c8f/R/pkg/inst/tests/testthat/test_includeJAR.R -- diff --git a/R/pkg/inst/tests/testthat/test_includeJAR.R b/R/pkg/inst/tests/testthat/test_includeJAR.R deleted file mode 100644 index 512dd39..000 --- a/R/pkg/inst/tests/testthat/test_includeJAR.R +++ /dev/null @@ -1,36 +0,0 @@ -# -# Licensed to the Apache Software Foundation (ASF) under one or more -# contributor license agreements. See the NOTICE file distributed with -# this work for additional information regarding copyright ownership. -# The ASF licenses this file to You under the Apache License, Version 2.0 -# (the "License"); you may not use this file except in compliance with -# the License. 
You may obtain a copy of the License at -# -#http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# -context("include an external JAR in SparkContext") - -runScript <- function() { - sparkHome <- Sys.getenv("SPARK_HOME") - sparkTestJarPath <- "R/lib/SparkR/test_support/sparktestjar_2.10-1.0.jar" - jarPath <- paste("--jars", shQuote(file.path(sparkHome, sparkTestJarPath))) - scriptPath <- file.path(sparkHome, "R/lib/SparkR/tests/testthat/jarTest.R") - submitPath <-
spark git commit: [SPARK-10683][SPARK-16510][SPARKR] Move SparkR include jar test to SparkSubmitSuite
Repository: spark Updated Branches: refs/heads/master 9674af6f6 -> fc2326362 [SPARK-10683][SPARK-16510][SPARKR] Move SparkR include jar test to SparkSubmitSuite ## What changes were proposed in this pull request? This change moves the include jar test from R to SparkSubmitSuite and uses a dynamically compiled jar. This helps us remove the binary jar from the R package and solves both the CRAN warnings and the lack of source being available for this jar. ## How was this patch tested? SparkR unit tests, SparkSubmitSuite, check-cran.sh Author: Shivaram VenkataramanCloses #14243 from shivaram/sparkr-jar-move. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fc232636 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fc232636 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fc232636 Branch: refs/heads/master Commit: fc23263623d5dcd1167fa93c094fe41ace77c326 Parents: 9674af6 Author: Shivaram Venkataraman Authored: Tue Jul 19 19:28:08 2016 -0700 Committer: Shivaram Venkataraman Committed: Tue Jul 19 19:28:08 2016 -0700 -- .../inst/test_support/sparktestjar_2.10-1.0.jar | Bin 2886 -> 0 bytes R/pkg/inst/tests/testthat/jarTest.R | 10 ++--- R/pkg/inst/tests/testthat/test_includeJAR.R | 36 -- .../scala/org/apache/spark/api/r/RUtils.scala | 9 + .../apache/spark/deploy/SparkSubmitSuite.scala | 38 +++ 5 files changed, 52 insertions(+), 41 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fc232636/R/pkg/inst/test_support/sparktestjar_2.10-1.0.jar -- diff --git a/R/pkg/inst/test_support/sparktestjar_2.10-1.0.jar b/R/pkg/inst/test_support/sparktestjar_2.10-1.0.jar deleted file mode 100644 index 1d5c2af..000 Binary files a/R/pkg/inst/test_support/sparktestjar_2.10-1.0.jar and /dev/null differ http://git-wip-us.apache.org/repos/asf/spark/blob/fc232636/R/pkg/inst/tests/testthat/jarTest.R -- diff --git a/R/pkg/inst/tests/testthat/jarTest.R b/R/pkg/inst/tests/testthat/jarTest.R index 51754a4..c9615c8 100644 --- a/R/pkg/inst/tests/testthat/jarTest.R +++ b/R/pkg/inst/tests/testthat/jarTest.R @@ -16,17 +16,17 @@ # library(SparkR) -sparkR.session() +sc <- sparkR.session() -helloTest <- SparkR:::callJStatic("sparkR.test.hello", +helloTest <- SparkR:::callJStatic("sparkrtest.DummyClass", "helloWorld", "Dave") +stopifnot(identical(helloTest, "Hello Dave")) -basicFunction <- SparkR:::callJStatic("sparkR.test.basicFunction", +basicFunction <- SparkR:::callJStatic("sparkrtest.DummyClass", "addStuff", 2L, 2L) +stopifnot(basicFunction == 4L) sparkR.session.stop() -output <- c(helloTest, basicFunction) -writeLines(output) http://git-wip-us.apache.org/repos/asf/spark/blob/fc232636/R/pkg/inst/tests/testthat/test_includeJAR.R -- diff --git a/R/pkg/inst/tests/testthat/test_includeJAR.R b/R/pkg/inst/tests/testthat/test_includeJAR.R deleted file mode 100644 index 512dd39..000 --- a/R/pkg/inst/tests/testthat/test_includeJAR.R +++ /dev/null @@ -1,36 +0,0 @@ -# -# Licensed to the Apache Software Foundation (ASF) under one or more -# contributor license agreements. See the NOTICE file distributed with -# this work for additional information regarding copyright ownership. -# The ASF licenses this file to You under the Apache License, Version 2.0 -# (the "License"); you may not use this file except in compliance with -# the License. 
You may obtain a copy of the License at -# -#http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# -context("include an external JAR in SparkContext") - -runScript <- function() { - sparkHome <- Sys.getenv("SPARK_HOME") - sparkTestJarPath <- "R/lib/SparkR/test_support/sparktestjar_2.10-1.0.jar" - jarPath <- paste("--jars", shQuote(file.path(sparkHome, sparkTestJarPath))) - scriptPath <- file.path(sparkHome, "R/lib/SparkR/tests/testthat/jarTest.R") - submitPath <- file.path(sparkHome, paste("bin/", determineSparkSubmitBin(), sep = "")) - combinedArgs <- paste(jarPath, scriptPath, sep = " ") - res <-
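To make the new flow concrete: `jarTest.R` above now resolves `sparkrtest.DummyClass` from a jar that `SparkSubmitSuite` compiles on the fly (the real class is generated as Java source at test time). Below is a Scala sketch of the entry points the jar has to expose, inferred purely from the calls and `stopifnot` checks in the R script — illustrative only.

```scala
package sparkrtest

// Illustrative only -- the real class is compiled dynamically by SparkSubmitSuite.
// Scala emits static forwarders for top-level objects, so SparkR's callJStatic can
// reach these as sparkrtest.DummyClass.helloWorld / addStuff.
object DummyClass {
  def helloWorld(name: String): String = "Hello " + name   // jarTest.R expects "Hello Dave"
  def addStuff(a: Int, b: Int): Int = a + b                // jarTest.R expects 4 for (2L, 2L)
}
```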
spark git commit: [SPARK-16568][SQL][DOCUMENTATION] update sql programming guide refreshTable API in python code
Repository: spark Updated Branches: refs/heads/branch-2.0 307f8922b -> f58fd4620 [SPARK-16568][SQL][DOCUMENTATION] update sql programming guide refreshTable API in python code ## What changes were proposed in this pull request? update `refreshTable` API in python code of the sql-programming-guide. This API is added in SPARK-15820 ## How was this patch tested? N/A Author: WeichenXuCloses #14220 from WeichenXu123/update_sql_doc_catalog. (cherry picked from commit 9674af6f6f81066139ea675de724f951bd0d49c9) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f58fd462 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f58fd462 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f58fd462 Branch: refs/heads/branch-2.0 Commit: f58fd4620f703fba0c8be0724c0150b08e984a2b Parents: 307f892 Author: WeichenXu Authored: Tue Jul 19 18:48:41 2016 -0700 Committer: Reynold Xin Committed: Tue Jul 19 18:48:49 2016 -0700 -- docs/sql-programming-guide.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f58fd462/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index a88efb7..8d92a43 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -869,8 +869,8 @@ spark.catalog().refreshTable("my_table"); {% highlight python %} -# spark is an existing HiveContext -spark.refreshTable("my_table") +# spark is an existing SparkSession +spark.catalog.refreshTable("my_table") {% endhighlight %} - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16568][SQL][DOCUMENTATION] update sql programming guide refreshTable API in python code
Repository: spark Updated Branches: refs/heads/master 004e29cba -> 9674af6f6 [SPARK-16568][SQL][DOCUMENTATION] update sql programming guide refreshTable API in python code ## What changes were proposed in this pull request? update `refreshTable` API in python code of the sql-programming-guide. This API is added in SPARK-15820 ## How was this patch tested? N/A Author: WeichenXuCloses #14220 from WeichenXu123/update_sql_doc_catalog. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9674af6f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9674af6f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9674af6f Branch: refs/heads/master Commit: 9674af6f6f81066139ea675de724f951bd0d49c9 Parents: 004e29c Author: WeichenXu Authored: Tue Jul 19 18:48:41 2016 -0700 Committer: Reynold Xin Committed: Tue Jul 19 18:48:41 2016 -0700 -- docs/sql-programming-guide.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9674af6f/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index 71f3ee4..3af935a 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -869,8 +869,8 @@ spark.catalog().refreshTable("my_table"); {% highlight python %} -# spark is an existing HiveContext -spark.refreshTable("my_table") +# spark is an existing SparkSession +spark.catalog.refreshTable("my_table") {% endhighlight %} - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
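The same call is available from Scala; a minimal sketch (assuming a Hive-backed table named `my_table` already exists) of when you would reach for it:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
// ... files backing my_table are rewritten by a job outside this session ...
spark.catalog.refreshTable("my_table")   // re-read metadata and drop stale cached data
spark.table("my_table").count()          // now reflects the new files
```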
spark git commit: [SPARK-14702] Make environment of SparkLauncher launched process more configurable
Repository: spark Updated Branches: refs/heads/master 2ae7b88a0 -> 004e29cba [SPARK-14702] Make environment of SparkLauncher launched process more configurable ## What changes were proposed in this pull request? Adds a few public methods to `SparkLauncher` to allow configuring some extra features of the `ProcessBuilder`, including the working directory, output and error stream redirection. ## How was this patch tested? Unit testing + simple Spark driver programs Author: Andrew DuffyCloses #14201 from andreweduffy/feature/launcher. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/004e29cb Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/004e29cb Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/004e29cb Branch: refs/heads/master Commit: 004e29cba518684d239d2d1661dce7c894a79f14 Parents: 2ae7b88 Author: Andrew Duffy Authored: Tue Jul 19 17:08:38 2016 -0700 Committer: Marcelo Vanzin Committed: Tue Jul 19 17:08:38 2016 -0700 -- .../spark/launcher/SparkLauncherSuite.java | 67 +++- .../spark/launcher/ChildProcAppHandle.java | 5 +- .../apache/spark/launcher/SparkLauncher.java| 167 --- 3 files changed, 208 insertions(+), 31 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/004e29cb/core/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java -- diff --git a/core/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java b/core/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java index 8ca54b2..e393db0 100644 --- a/core/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java +++ b/core/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java @@ -21,6 +21,7 @@ import java.util.Arrays; import java.util.HashMap; import java.util.Map; +import org.junit.Before; import org.junit.Test; import org.slf4j.Logger; import org.slf4j.LoggerFactory; @@ -40,10 +41,15 @@ public class SparkLauncherSuite { private static final Logger LOG = LoggerFactory.getLogger(SparkLauncherSuite.class); private static final NamedThreadFactory TF = new NamedThreadFactory("SparkLauncherSuite-%d"); + private SparkLauncher launcher; + + @Before + public void configureLauncher() { +launcher = new SparkLauncher().setSparkHome(System.getProperty("spark.test.home")); + } + @Test public void testSparkArgumentHandling() throws Exception { -SparkLauncher launcher = new SparkLauncher() - .setSparkHome(System.getProperty("spark.test.home")); SparkSubmitOptionParser opts = new SparkSubmitOptionParser(); launcher.addSparkArg(opts.HELP); @@ -85,14 +91,67 @@ public class SparkLauncherSuite { assertEquals("bar", launcher.builder.conf.get("spark.foo")); } + @Test(expected=IllegalStateException.class) + public void testRedirectTwiceFails() throws Exception { +launcher.setAppResource("fake-resource.jar") + .setMainClass("my.fake.class.Fake") + .redirectError() + .redirectError(ProcessBuilder.Redirect.PIPE) + .launch(); + } + + @Test(expected=IllegalStateException.class) + public void testRedirectToLogWithOthersFails() throws Exception { +launcher.setAppResource("fake-resource.jar") + .setMainClass("my.fake.class.Fake") + .redirectToLog("fakeLog") + .redirectError(ProcessBuilder.Redirect.PIPE) + .launch(); + } + + @Test + public void testRedirectErrorToOutput() throws Exception { +launcher.redirectError(); +assertTrue(launcher.redirectErrorStream); + } + + @Test + public void testRedirectsSimple() throws Exception { +launcher.redirectError(ProcessBuilder.Redirect.PIPE); +assertNotNull(launcher.errorStream); 
+assertEquals(launcher.errorStream.type(), ProcessBuilder.Redirect.Type.PIPE); + +launcher.redirectOutput(ProcessBuilder.Redirect.PIPE); +assertNotNull(launcher.outputStream); +assertEquals(launcher.outputStream.type(), ProcessBuilder.Redirect.Type.PIPE); + } + + @Test + public void testRedirectLastWins() throws Exception { +launcher.redirectError(ProcessBuilder.Redirect.PIPE) + .redirectError(ProcessBuilder.Redirect.INHERIT); +assertEquals(launcher.errorStream.type(), ProcessBuilder.Redirect.Type.INHERIT); + +launcher.redirectOutput(ProcessBuilder.Redirect.PIPE) + .redirectOutput(ProcessBuilder.Redirect.INHERIT); +assertEquals(launcher.outputStream.type(), ProcessBuilder.Redirect.Type.INHERIT); + } + + @Test + public void testRedirectToLog() throws Exception { +launcher.redirectToLog("fakeLogger"); +assertTrue(launcher.redirectToLog); +assertTrue(launcher.builder.getEffectiveConfig() +
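A hedged usage sketch of the new knobs (the jar path and main class below are placeholders, not part of the patch): instead of draining the child's streams by hand, callers can now hand redirects straight to the underlying `ProcessBuilder`, or route everything to a named logger via `redirectToLog`.

```scala
import org.apache.spark.launcher.SparkLauncher

val process = new SparkLauncher()
  .setAppResource("/path/to/app.jar")               // placeholder
  .setMainClass("com.example.Main")                 // placeholder
  .redirectOutput(ProcessBuilder.Redirect.INHERIT)  // child stdout -> this JVM's stdout
  .redirectError(ProcessBuilder.Redirect.INHERIT)   // child stderr -> this JVM's stderr
  .launch()
process.waitFor()
```

Per the new tests, `redirectToLog(...)` cannot be combined with the other redirect calls, and calling `redirectError` twice is rejected with an `IllegalStateException`.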
[2/2] spark git commit: Preparing development version 2.0.1-SNAPSHOT
Preparing development version 2.0.1-SNAPSHOT Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/307f8922 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/307f8922 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/307f8922 Branch: refs/heads/branch-2.0 Commit: 307f8922be5c781d83c295edbbe9ad0f0d2095e3 Parents: 13650fc Author: Patrick WendellAuthored: Tue Jul 19 14:02:33 2016 -0700 Committer: Patrick Wendell Committed: Tue Jul 19 14:02:33 2016 -0700 -- assembly/pom.xml | 2 +- common/network-common/pom.xml | 2 +- common/network-shuffle/pom.xml| 2 +- common/network-yarn/pom.xml | 2 +- common/sketch/pom.xml | 2 +- common/tags/pom.xml | 2 +- common/unsafe/pom.xml | 2 +- core/pom.xml | 2 +- examples/pom.xml | 2 +- external/docker-integration-tests/pom.xml | 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml| 2 +- external/java8-tests/pom.xml | 2 +- external/kafka-0-10-assembly/pom.xml | 2 +- external/kafka-0-10/pom.xml | 2 +- external/kafka-0-8-assembly/pom.xml | 2 +- external/kafka-0-8/pom.xml| 2 +- external/kinesis-asl-assembly/pom.xml | 2 +- external/kinesis-asl/pom.xml | 2 +- external/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml| 2 +- launcher/pom.xml | 2 +- mllib-local/pom.xml | 2 +- mllib/pom.xml | 2 +- pom.xml | 2 +- repl/pom.xml | 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- yarn/pom.xml | 2 +- 34 files changed, 34 insertions(+), 34 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/307f8922/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 5f546bb..507ddc7 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.11 -2.0.0 +2.0.1-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/307f8922/common/network-common/pom.xml -- diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml index 2eaa810..bc3b0fe 100644 --- a/common/network-common/pom.xml +++ b/common/network-common/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.0 +2.0.1-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/307f8922/common/network-shuffle/pom.xml -- diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml index f068d9d..2fb5835 100644 --- a/common/network-shuffle/pom.xml +++ b/common/network-shuffle/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.0 +2.0.1-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/307f8922/common/network-yarn/pom.xml -- diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml index fd22188..07d9f1c 100644 --- a/common/network-yarn/pom.xml +++ b/common/network-yarn/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.0 +2.0.1-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/307f8922/common/sketch/pom.xml -- diff --git a/common/sketch/pom.xml b/common/sketch/pom.xml index a17aba5..5e02efd 100644 --- a/common/sketch/pom.xml +++ b/common/sketch/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.0 +2.0.1-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/307f8922/common/tags/pom.xml -- diff --git a/common/tags/pom.xml b/common/tags/pom.xml index 0bd8846..e7fc6a2 100644 --- a/common/tags/pom.xml +++ b/common/tags/pom.xml @@ -22,7 +22,7 @@ 
org.apache.spark
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v2.0.0-rc5 [created] 13650fc58

[1/2] spark git commit: Preparing Spark release v2.0.0-rc5
Repository: spark Updated Branches: refs/heads/branch-2.0 80ab8b666 -> 307f8922b Preparing Spark release v2.0.0-rc5 Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/13650fc5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/13650fc5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/13650fc5 Branch: refs/heads/branch-2.0 Commit: 13650fc58e1fcf2cf2a26ba11c819185ae1acc1f Parents: 80ab8b6 Author: Patrick WendellAuthored: Tue Jul 19 14:02:27 2016 -0700 Committer: Patrick Wendell Committed: Tue Jul 19 14:02:27 2016 -0700 -- assembly/pom.xml | 2 +- common/network-common/pom.xml | 2 +- common/network-shuffle/pom.xml| 2 +- common/network-yarn/pom.xml | 2 +- common/sketch/pom.xml | 2 +- common/tags/pom.xml | 2 +- common/unsafe/pom.xml | 2 +- core/pom.xml | 2 +- examples/pom.xml | 2 +- external/docker-integration-tests/pom.xml | 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml| 2 +- external/java8-tests/pom.xml | 2 +- external/kafka-0-10-assembly/pom.xml | 2 +- external/kafka-0-10/pom.xml | 2 +- external/kafka-0-8-assembly/pom.xml | 2 +- external/kafka-0-8/pom.xml| 2 +- external/kinesis-asl-assembly/pom.xml | 2 +- external/kinesis-asl/pom.xml | 2 +- external/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml| 2 +- launcher/pom.xml | 2 +- mllib-local/pom.xml | 2 +- mllib/pom.xml | 2 +- pom.xml | 2 +- repl/pom.xml | 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- yarn/pom.xml | 2 +- 34 files changed, 34 insertions(+), 34 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/13650fc5/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 507ddc7..5f546bb 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.11 -2.0.1-SNAPSHOT +2.0.0 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/13650fc5/common/network-common/pom.xml -- diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml index bc3b0fe..2eaa810 100644 --- a/common/network-common/pom.xml +++ b/common/network-common/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.1-SNAPSHOT +2.0.0 ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/13650fc5/common/network-shuffle/pom.xml -- diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml index 2fb5835..f068d9d 100644 --- a/common/network-shuffle/pom.xml +++ b/common/network-shuffle/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.1-SNAPSHOT +2.0.0 ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/13650fc5/common/network-yarn/pom.xml -- diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml index 07d9f1c..fd22188 100644 --- a/common/network-yarn/pom.xml +++ b/common/network-yarn/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.1-SNAPSHOT +2.0.0 ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/13650fc5/common/sketch/pom.xml -- diff --git a/common/sketch/pom.xml b/common/sketch/pom.xml index 5e02efd..a17aba5 100644 --- a/common/sketch/pom.xml +++ b/common/sketch/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.0.1-SNAPSHOT +2.0.0 ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/13650fc5/common/tags/pom.xml -- diff --git a/common/tags/pom.xml b/common/tags/pom.xml index e7fc6a2..0bd8846 100644 ---
spark git commit: [SPARK-15705][SQL] Change the default value of spark.sql.hive.convertMetastoreOrc to false.
Repository: spark Updated Branches: refs/heads/branch-2.0 f18f9ca5b -> 80ab8b666 [SPARK-15705][SQL] Change the default value of spark.sql.hive.convertMetastoreOrc to false. ## What changes were proposed in this pull request? In 2.0, we add a new logic to convert HiveTableScan on ORC tables to Spark's native code path. However, during this conversion, we drop the original metastore schema (https://issues.apache.org/jira/browse/SPARK-15705). Because of this regression, I am changing the default value of `spark.sql.hive.convertMetastoreOrc` to false. Author: Yin HuaiCloses #14267 from yhuai/SPARK-15705-changeDefaultValue. (cherry picked from commit 2ae7b88a07140e012b6c60db3c4a2a8ca360c684) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/80ab8b66 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/80ab8b66 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/80ab8b66 Branch: refs/heads/branch-2.0 Commit: 80ab8b666f007de15fa9427f9734ed91238605b0 Parents: f18f9ca Author: Yin Huai Authored: Tue Jul 19 12:58:08 2016 -0700 Committer: Reynold Xin Committed: Tue Jul 19 12:58:13 2016 -0700 -- sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/80ab8b66/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala index 9ed357c..bdec611 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala @@ -97,10 +97,11 @@ private[spark] object HiveUtils extends Logging { .createWithDefault(false) val CONVERT_METASTORE_ORC = SQLConfigBuilder("spark.sql.hive.convertMetastoreOrc") +.internal() .doc("When set to false, Spark SQL will use the Hive SerDe for ORC tables instead of " + "the built in support.") .booleanConf -.createWithDefault(true) +.createWithDefault(false) val HIVE_METASTORE_SHARED_PREFIXES = SQLConfigBuilder("spark.sql.hive.metastore.sharedPrefixes") .doc("A comma separated list of class prefixes that should be loaded using the classloader " + - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-15705][SQL] Change the default value of spark.sql.hive.convertMetastoreOrc to false.
Repository: spark Updated Branches: refs/heads/master 162d04a30 -> 2ae7b88a0 [SPARK-15705][SQL] Change the default value of spark.sql.hive.convertMetastoreOrc to false. ## What changes were proposed in this pull request? In 2.0, we add a new logic to convert HiveTableScan on ORC tables to Spark's native code path. However, during this conversion, we drop the original metastore schema (https://issues.apache.org/jira/browse/SPARK-15705). Because of this regression, I am changing the default value of `spark.sql.hive.convertMetastoreOrc` to false. Author: Yin HuaiCloses #14267 from yhuai/SPARK-15705-changeDefaultValue. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2ae7b88a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2ae7b88a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2ae7b88a Branch: refs/heads/master Commit: 2ae7b88a07140e012b6c60db3c4a2a8ca360c684 Parents: 162d04a Author: Yin Huai Authored: Tue Jul 19 12:58:08 2016 -0700 Committer: Reynold Xin Committed: Tue Jul 19 12:58:08 2016 -0700 -- sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2ae7b88a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala index 9ed357c..bdec611 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala @@ -97,10 +97,11 @@ private[spark] object HiveUtils extends Logging { .createWithDefault(false) val CONVERT_METASTORE_ORC = SQLConfigBuilder("spark.sql.hive.convertMetastoreOrc") +.internal() .doc("When set to false, Spark SQL will use the Hive SerDe for ORC tables instead of " + "the built in support.") .booleanConf -.createWithDefault(true) +.createWithDefault(false) val HIVE_METASTORE_SHARED_PREFIXES = SQLConfigBuilder("spark.sql.hive.metastore.sharedPrefixes") .doc("A comma separated list of class prefixes that should be loaded using the classloader " + - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
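For users who still want the native ORC read path (the previous default), the flag can be flipped back explicitly when building the session; a minimal sketch:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.hive.convertMetastoreOrc", "true")  // opt back into HiveTableScan -> native ORC conversion
  .getOrCreate()
```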
spark git commit: [SPARK-16602][SQL] `Nvl` function should support numeric-string cases
Repository: spark Updated Branches: refs/heads/branch-2.0 6ca1d941b -> f18f9ca5b [SPARK-16602][SQL] `Nvl` function should support numeric-string cases ## What changes were proposed in this pull request? `Nvl` function should support numeric-straing cases like Hive/Spark1.6. Currently, `Nvl` finds the tightest common types among numeric types. This PR extends that to consider `String` type, too. ```scala - TypeCoercion.findTightestCommonTypeOfTwo(left.dataType, right.dataType).map { dtype => + TypeCoercion.findTightestCommonTypeToString(left.dataType, right.dataType).map { dtype => ``` **Before** ```scala scala> sql("select nvl('0', 1)").collect() org.apache.spark.sql.AnalysisException: cannot resolve `nvl("0", 1)` due to data type mismatch: input to function coalesce should all be the same type, but it's [string, int]; line 1 pos 7 ``` **After** ```scala scala> sql("select nvl('0', 1)").collect() res0: Array[org.apache.spark.sql.Row] = Array([0]) ``` ## How was this patch tested? Pass the Jenkins tests. Author: Dongjoon HyunCloses #14251 from dongjoon-hyun/SPARK-16602. (cherry picked from commit 162d04a30e38bb83d35865679145f8ea80b84c26) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f18f9ca5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f18f9ca5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f18f9ca5 Branch: refs/heads/branch-2.0 Commit: f18f9ca5b22ca11712694b1106463ae6efc1d646 Parents: 6ca1d94 Author: Dongjoon Hyun Authored: Tue Jul 19 10:28:17 2016 -0700 Committer: Reynold Xin Committed: Tue Jul 19 10:28:24 2016 -0700 -- .../spark/sql/catalyst/analysis/TypeCoercion.scala | 2 +- .../sql/catalyst/expressions/nullExpressions.scala | 2 +- .../catalyst/expressions/NullFunctionsSuite.scala| 15 +++ 3 files changed, 17 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f18f9ca5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala index baec6d1..9a040f8 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala @@ -100,7 +100,7 @@ object TypeCoercion { } /** Similar to [[findTightestCommonType]], but can promote all the way to StringType. 
*/ - private def findTightestCommonTypeToString(left: DataType, right: DataType): Option[DataType] = { + def findTightestCommonTypeToString(left: DataType, right: DataType): Option[DataType] = { findTightestCommonTypeOfTwo(left, right).orElse((left, right) match { case (StringType, t2: AtomicType) if t2 != BinaryType && t2 != BooleanType => Some(StringType) case (t1: AtomicType, StringType) if t1 != BinaryType && t1 != BooleanType => Some(StringType) http://git-wip-us.apache.org/repos/asf/spark/blob/f18f9ca5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala index 523fb05..1c18265 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala @@ -134,7 +134,7 @@ case class Nvl(left: Expression, right: Expression) extends RuntimeReplaceable { override def replaceForTypeCoercion(): Expression = { if (left.dataType != right.dataType) { - TypeCoercion.findTightestCommonTypeOfTwo(left.dataType, right.dataType).map { dtype => + TypeCoercion.findTightestCommonTypeToString(left.dataType, right.dataType).map { dtype => copy(left = Cast(left, dtype), right = Cast(right, dtype)) }.getOrElse(this) } else { http://git-wip-us.apache.org/repos/asf/spark/blob/f18f9ca5/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/NullFunctionsSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/NullFunctionsSuite.scala
spark git commit: [SPARK-16602][SQL] `Nvl` function should support numeric-string cases
Repository: spark Updated Branches: refs/heads/master 0bd76e872 -> 162d04a30 [SPARK-16602][SQL] `Nvl` function should support numeric-string cases ## What changes were proposed in this pull request? `Nvl` function should support numeric-straing cases like Hive/Spark1.6. Currently, `Nvl` finds the tightest common types among numeric types. This PR extends that to consider `String` type, too. ```scala - TypeCoercion.findTightestCommonTypeOfTwo(left.dataType, right.dataType).map { dtype => + TypeCoercion.findTightestCommonTypeToString(left.dataType, right.dataType).map { dtype => ``` **Before** ```scala scala> sql("select nvl('0', 1)").collect() org.apache.spark.sql.AnalysisException: cannot resolve `nvl("0", 1)` due to data type mismatch: input to function coalesce should all be the same type, but it's [string, int]; line 1 pos 7 ``` **After** ```scala scala> sql("select nvl('0', 1)").collect() res0: Array[org.apache.spark.sql.Row] = Array([0]) ``` ## How was this patch tested? Pass the Jenkins tests. Author: Dongjoon HyunCloses #14251 from dongjoon-hyun/SPARK-16602. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/162d04a3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/162d04a3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/162d04a3 Branch: refs/heads/master Commit: 162d04a30e38bb83d35865679145f8ea80b84c26 Parents: 0bd76e8 Author: Dongjoon Hyun Authored: Tue Jul 19 10:28:17 2016 -0700 Committer: Reynold Xin Committed: Tue Jul 19 10:28:17 2016 -0700 -- .../spark/sql/catalyst/analysis/TypeCoercion.scala | 2 +- .../sql/catalyst/expressions/nullExpressions.scala | 2 +- .../catalyst/expressions/NullFunctionsSuite.scala| 15 +++ 3 files changed, 17 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/162d04a3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala index baec6d1..9a040f8 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala @@ -100,7 +100,7 @@ object TypeCoercion { } /** Similar to [[findTightestCommonType]], but can promote all the way to StringType. 
*/ - private def findTightestCommonTypeToString(left: DataType, right: DataType): Option[DataType] = { + def findTightestCommonTypeToString(left: DataType, right: DataType): Option[DataType] = { findTightestCommonTypeOfTwo(left, right).orElse((left, right) match { case (StringType, t2: AtomicType) if t2 != BinaryType && t2 != BooleanType => Some(StringType) case (t1: AtomicType, StringType) if t1 != BinaryType && t1 != BooleanType => Some(StringType) http://git-wip-us.apache.org/repos/asf/spark/blob/162d04a3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala index 523fb05..1c18265 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala @@ -134,7 +134,7 @@ case class Nvl(left: Expression, right: Expression) extends RuntimeReplaceable { override def replaceForTypeCoercion(): Expression = { if (left.dataType != right.dataType) { - TypeCoercion.findTightestCommonTypeOfTwo(left.dataType, right.dataType).map { dtype => + TypeCoercion.findTightestCommonTypeToString(left.dataType, right.dataType).map { dtype => copy(left = Cast(left, dtype), right = Cast(right, dtype)) }.getOrElse(this) } else { http://git-wip-us.apache.org/repos/asf/spark/blob/162d04a3/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/NullFunctionsSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/NullFunctionsSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/NullFunctionsSuite.scala index ace6c15..712fe35 100644 ---
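A minimal sketch of the helper this change exposes (it was `private` before): numeric pairs still resolve to the tightest numeric type, and a numeric/string pair now widens to string, which is what lets `nvl('0', 1)` pass analysis.

```scala
import org.apache.spark.sql.catalyst.analysis.TypeCoercion
import org.apache.spark.sql.types._

TypeCoercion.findTightestCommonTypeToString(ByteType, ShortType)     // Some(ShortType)
TypeCoercion.findTightestCommonTypeToString(StringType, IntegerType) // Some(StringType)
TypeCoercion.findTightestCommonTypeToString(StringType, BooleanType) // None -- booleans stay excluded
```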
spark git commit: [SPARK-16620][CORE] Add back the tokenization process in `RDD.pipe(command: String)`
Repository: spark Updated Branches: refs/heads/branch-2.0 2c74b6d73 -> 6ca1d941b [SPARK-16620][CORE] Add back the tokenization process in `RDD.pipe(command: String)` ## What changes were proposed in this pull request? Currently `RDD.pipe(command: String)`: - works only when the command is specified without any options, such as `RDD.pipe("wc")` - does NOT work when the command is specified with some options, such as `RDD.pipe("wc -l")` This is a regression from Spark 1.6. This patch adds back the tokenization process in `RDD.pipe(command: String)` to fix this regression. ## How was this patch tested? Added a test which: - would pass in `1.6` - _[prior to this patch]_ would fail in `master` - _[after this patch]_ would pass in `master` Author: Liwei LinCloses #14256 from lw-lin/rdd-pipe. (cherry picked from commit 0bd76e872b60cb80295fc12654e370cf22390056) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6ca1d941 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6ca1d941 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6ca1d941 Branch: refs/heads/branch-2.0 Commit: 6ca1d941b0b417f10533ab3506a9f3cf60e6a7fe Parents: 2c74b6d Author: Liwei Lin Authored: Tue Jul 19 10:24:48 2016 -0700 Committer: Reynold Xin Committed: Tue Jul 19 10:25:24 2016 -0700 -- core/src/main/scala/org/apache/spark/rdd/RDD.scala | 8 ++-- .../scala/org/apache/spark/rdd/PipedRDDSuite.scala | 16 2 files changed, 22 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6ca1d941/core/src/main/scala/org/apache/spark/rdd/RDD.scala -- diff --git a/core/src/main/scala/org/apache/spark/rdd/RDD.scala b/core/src/main/scala/org/apache/spark/rdd/RDD.scala index b7a5b22..0804cde 100644 --- a/core/src/main/scala/org/apache/spark/rdd/RDD.scala +++ b/core/src/main/scala/org/apache/spark/rdd/RDD.scala @@ -699,14 +699,18 @@ abstract class RDD[T: ClassTag]( * Return an RDD created by piping elements to a forked external process. */ def pipe(command: String): RDD[String] = withScope { -pipe(command) +// Similar to Runtime.exec(), if we are given a single string, split it into words +// using a standard StringTokenizer (i.e. by spaces) +pipe(PipedRDD.tokenize(command)) } /** * Return an RDD created by piping elements to a forked external process. */ def pipe(command: String, env: Map[String, String]): RDD[String] = withScope { -pipe(command, env) +// Similar to Runtime.exec(), if we are given a single string, split it into words +// using a standard StringTokenizer (i.e. 
by spaces) +pipe(PipedRDD.tokenize(command), env) } /** http://git-wip-us.apache.org/repos/asf/spark/blob/6ca1d941/core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala b/core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala index 27cfdc7..5d56fc1 100644 --- a/core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala +++ b/core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala @@ -51,6 +51,22 @@ class PipedRDDSuite extends SparkFunSuite with SharedSparkContext { } } + test("basic pipe with tokenization") { +if (testCommandAvailable("wc")) { + val nums = sc.makeRDD(Array(1, 2, 3, 4), 2) + + // verify that both RDD.pipe(command: String) and RDD.pipe(command: String, env) work good + for (piped <- Seq(nums.pipe("wc -l"), nums.pipe("wc -l", Map[String, String]( { +val c = piped.collect() +assert(c.size === 2) +assert(c(0).trim === "2") +assert(c(1).trim === "2") + } +} else { + assert(true) +} + } + test("failure in iterating over pipe input") { if (testCommandAvailable("cat")) { val nums = - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
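A quick sketch of both forms from the shell (assuming `wc` is available on each executor's PATH, as in the new test): the single-string form is tokenized on whitespace, so it is again equivalent to passing the tokens yourself.

```scala
val nums = sc.parallelize(Seq(1, 2, 3, 4), 2)
nums.pipe("wc -l").map(_.trim).collect()          // Array("2", "2") -- one line count per partition
nums.pipe(Seq("wc", "-l")).map(_.trim).collect()  // same result, pre-tokenized form
```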
spark git commit: [SPARK-16494][ML] Upgrade breeze version to 0.12
Repository: spark Updated Branches: refs/heads/master 5d92326be -> 670891496 [SPARK-16494][ML] Upgrade breeze version to 0.12 ## What changes were proposed in this pull request? breeze 0.12 has been released for more than half a year, and it brings lots of new features, performance improvement and bug fixes. One of the biggest features is ```LBFGS-B``` which is an implementation of ```LBFGS``` with box constraints and much faster for some special case. We would like to implement Huber loss function for ```LinearRegression``` ([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)) and it requires ```LBFGS-B``` as the optimization solver. So we should bump up the dependent breeze version to 0.12. For more features, improvements and bug fixes of breeze 0.12, you can refer the following link: https://groups.google.com/forum/#!topic/scala-breeze/nEeRi_DcY5c ## How was this patch tested? No new tests, should pass the existing ones. Author: Yanbo LiangCloses #14150 from yanboliang/spark-16494. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/67089149 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/67089149 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/67089149 Branch: refs/heads/master Commit: 670891496a82538a5e2bf981a4044fb6f4cbb062 Parents: 5d92326 Author: Yanbo Liang Authored: Tue Jul 19 12:31:04 2016 +0100 Committer: Sean Owen Committed: Tue Jul 19 12:31:04 2016 +0100 -- dev/deps/spark-deps-hadoop-2.2 | 5 +++-- dev/deps/spark-deps-hadoop-2.3 | 5 +++-- dev/deps/spark-deps-hadoop-2.4 | 5 +++-- dev/deps/spark-deps-hadoop-2.6 | 5 +++-- dev/deps/spark-deps-hadoop-2.7 | 5 +++-- .../apache/spark/ml/classification/LogisticRegression.scala | 5 - .../apache/spark/ml/regression/AFTSurvivalRegression.scala | 6 -- .../org/apache/spark/ml/regression/LinearRegression.scala | 5 - .../scala/org/apache/spark/mllib/clustering/LDAModel.scala | 8 +++- .../org/apache/spark/mllib/clustering/LDAOptimizer.scala| 5 +++-- .../scala/org/apache/spark/mllib/optimization/LBFGS.scala | 5 - .../test/java/org/apache/spark/ml/feature/JavaPCASuite.java | 6 +- .../scala/org/apache/spark/mllib/clustering/LDASuite.scala | 4 ++-- .../scala/org/apache/spark/mllib/feature/PCASuite.scala | 9 ++--- pom.xml | 2 +- python/pyspark/ml/classification.py | 2 +- 16 files changed, 40 insertions(+), 42 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/67089149/dev/deps/spark-deps-hadoop-2.2 -- diff --git a/dev/deps/spark-deps-hadoop-2.2 b/dev/deps/spark-deps-hadoop-2.2 index feb3474..5d536b7 100644 --- a/dev/deps/spark-deps-hadoop-2.2 +++ b/dev/deps/spark-deps-hadoop-2.2 @@ -12,8 +12,8 @@ avro-1.7.7.jar avro-ipc-1.7.7.jar avro-mapred-1.7.7-hadoop2.jar bonecp-0.8.0.RELEASE.jar -breeze-macros_2.11-0.11.2.jar -breeze_2.11-0.11.2.jar +breeze-macros_2.11-0.12.jar +breeze_2.11-0.12.jar calcite-avatica-1.2.0-incubating.jar calcite-core-1.2.0-incubating.jar calcite-linq4j-1.2.0-incubating.jar @@ -147,6 +147,7 @@ scala-parser-combinators_2.11-1.0.4.jar scala-reflect-2.11.8.jar scala-xml_2.11-1.0.2.jar scalap-2.11.8.jar +shapeless_2.11-2.0.0.jar slf4j-api-1.7.16.jar slf4j-log4j12-1.7.16.jar snappy-0.2.jar http://git-wip-us.apache.org/repos/asf/spark/blob/67089149/dev/deps/spark-deps-hadoop-2.3 -- diff --git a/dev/deps/spark-deps-hadoop-2.3 b/dev/deps/spark-deps-hadoop-2.3 index 3e96035..d16f42a 100644 --- a/dev/deps/spark-deps-hadoop-2.3 +++ b/dev/deps/spark-deps-hadoop-2.3 @@ -15,8 +15,8 @@ avro-mapred-1.7.7-hadoop2.jar base64-2.3.8.jar 
bcprov-jdk15on-1.51.jar bonecp-0.8.0.RELEASE.jar -breeze-macros_2.11-0.11.2.jar -breeze_2.11-0.11.2.jar +breeze-macros_2.11-0.12.jar +breeze_2.11-0.12.jar calcite-avatica-1.2.0-incubating.jar calcite-core-1.2.0-incubating.jar calcite-linq4j-1.2.0-incubating.jar @@ -154,6 +154,7 @@ scala-parser-combinators_2.11-1.0.4.jar scala-reflect-2.11.8.jar scala-xml_2.11-1.0.2.jar scalap-2.11.8.jar +shapeless_2.11-2.0.0.jar slf4j-api-1.7.16.jar slf4j-log4j12-1.7.16.jar snappy-0.2.jar http://git-wip-us.apache.org/repos/asf/spark/blob/67089149/dev/deps/spark-deps-hadoop-2.4 -- diff --git a/dev/deps/spark-deps-hadoop-2.4 b/dev/deps/spark-deps-hadoop-2.4 index 3fc14a6..2e261cb 100644 --- a/dev/deps/spark-deps-hadoop-2.4 +++
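A hedged sketch of the box-constrained solver that motivates the bump. The class name `breeze.optimize.LBFGSB` and its `(lowerBounds, upperBounds)` constructor are assumptions based on breeze 0.12's documented API, not something shown in this diff.

```scala
import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, LBFGSB}

// Minimize (x - 3)^2 subject to 0 <= x <= 2; the box constraint pins the optimum at x = 2.
val f = new DiffFunction[DenseVector[Double]] {
  def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
    val d = x(0) - 3.0
    (d * d, DenseVector(2.0 * d))   // (value, gradient)
  }
}
val solver = new LBFGSB(DenseVector(0.0), DenseVector(2.0))   // assumed (lower, upper) bounds
val xOpt = solver.minimize(f, DenseVector(1.0))               // expect xOpt(0) close to 2.0
```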
spark git commit: [SPARK-16478] graphX (added graph caching in strongly connected components)
Repository: spark Updated Branches: refs/heads/master 6c4b9f4be -> 5d92326be [SPARK-16478] graphX (added graph caching in strongly connected components) ## What changes were proposed in this pull request? I added caching in every iteration for sccGraph that is returned in strongly connected components. Without this cache strongly connected components returned graph that needed to be computed from scratch when some intermediary caches didn't existed anymore. ## How was this patch tested? I tested it by running code similar to the one [on databrics](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4889410027417133/3634650767364730/3117184429335832/latest.html). Basically I generated large graph and computed strongly connected components with changed code, than simply run count on vertices and edges. Count after this update takes few seconds instead 20 minutes. # statement contribution is my original work and I license the work to the project under the project's open source license. Author: MichaÅ WesoÅowskiCloses #14137 from wesolowskim/SPARK-16478. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5d92326b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5d92326b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5d92326b Branch: refs/heads/master Commit: 5d92326be76cb15edc6e18e94a373e197f696803 Parents: 6c4b9f4 Author: MichaÅ WesoÅowski Authored: Tue Jul 19 12:18:42 2016 +0100 Committer: Sean Owen Committed: Tue Jul 19 12:18:42 2016 +0100 -- .../lib/StronglyConnectedComponents.scala | 86 1 file changed, 50 insertions(+), 36 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5d92326b/graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala -- diff --git a/graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala b/graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala old mode 100644 new mode 100755 index 1fa92b0..e4f80ff --- a/graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala +++ b/graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala @@ -44,6 +44,9 @@ object StronglyConnectedComponents { // graph we are going to work with in our iterations var sccWorkGraph = graph.mapVertices { case (vid, _) => (vid, false) }.cache() +// helper variables to unpersist cached graphs +var prevSccGraph = sccGraph + var numVertices = sccWorkGraph.numVertices var iter = 0 while (sccWorkGraph.numVertices > 0 && iter < numIter) { @@ -64,48 +67,59 @@ object StronglyConnectedComponents { // write values to sccGraph sccGraph = sccGraph.outerJoinVertices(finalVertices) { (vid, scc, opt) => opt.getOrElse(scc) -} +}.cache() +// materialize vertices and edges +sccGraph.vertices.count() +sccGraph.edges.count() +// sccGraph materialized so, unpersist can be done on previous +prevSccGraph.unpersist(blocking = false) +prevSccGraph = sccGraph + // only keep vertices that are not final sccWorkGraph = sccWorkGraph.subgraph(vpred = (vid, data) => !data._2).cache() } while (sccWorkGraph.numVertices < numVertices) - sccWorkGraph = sccWorkGraph.mapVertices{ case (vid, (color, isFinal)) => (vid, isFinal) } + // if iter < numIter at this point sccGraph that is returned + // will not be recomputed and pregel executions are pointless + if (iter < numIter) { +sccWorkGraph = sccWorkGraph.mapVertices { case (vid, (color, isFinal)) => (vid, isFinal) } - 
// collect min of all my neighbor's scc values, update if it's smaller than mine - // then notify any neighbors with scc values larger than mine - sccWorkGraph = Pregel[(VertexId, Boolean), ED, VertexId]( -sccWorkGraph, Long.MaxValue, activeDirection = EdgeDirection.Out)( -(vid, myScc, neighborScc) => (math.min(myScc._1, neighborScc), myScc._2), -e => { - if (e.srcAttr._1 < e.dstAttr._1) { -Iterator((e.dstId, e.srcAttr._1)) - } else { -Iterator() - } -}, -(vid1, vid2) => math.min(vid1, vid2)) +// collect min of all my neighbor's scc values, update if it's smaller than mine +// then notify any neighbors with scc values larger than mine +sccWorkGraph = Pregel[(VertexId, Boolean), ED, VertexId]( + sccWorkGraph, Long.MaxValue, activeDirection =
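The core of the patch is a cache, materialize, then unpersist cycle: cache the newly joined sccGraph, force its vertices and edges to materialize, and only then release the previously cached graph whose lineage is no longer needed. A minimal sketch of that pattern follows; the `iterate` helper and the `updateOnce` parameter are hypothetical names used only for illustration, not part of GraphX or of the patch itself.

```scala
import org.apache.spark.graphx.Graph

// Sketch of the caching pattern applied in StronglyConnectedComponents:
// cache the new graph, materialize it, then release the previous one.
def iterate[VD, ED](initial: Graph[VD, ED], numIter: Int)
                   (updateOnce: Graph[VD, ED] => Graph[VD, ED]): Graph[VD, ED] = {
  var current = initial.cache()
  for (_ <- 0 until numIter) {
    val next = updateOnce(current).cache() // cache the new lineage first
    next.vertices.count()                  // materialize vertices ...
    next.edges.count()                     // ... and edges, so `next` no longer depends on `current`
    current.unpersist(blocking = false)    // only now is it safe to drop the previous cached graph
    current = next
  }
  current
}
```

Without the unpersist step every iteration would keep its predecessor pinned in memory; without the two count() calls the previous graph could be released before the new one had actually been computed.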
spark git commit: [SPARK-16395][STREAMING] Fail if too many CheckpointWriteHandlers are queued up in the fixed thread pool
Repository: spark Updated Branches: refs/heads/master 8310c0741 -> 6c4b9f4be [SPARK-16395][STREAMING] Fail if too many CheckpointWriteHandlers are queued up in the fixed thread pool ## What changes were proposed in this pull request? Begin failing if checkpoint writes are being queued up faster than storage can write them, so that we fail fast instead of slowly filling memory. ## How was this patch tested? Jenkins tests Author: Sean Owen Closes #14152 from srowen/SPARK-16395. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6c4b9f4b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6c4b9f4b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6c4b9f4b Branch: refs/heads/master Commit: 6c4b9f4be6b429197c6a53f937a82c2ac5866d65 Parents: 8310c07 Author: Sean Owen Authored: Tue Jul 19 12:10:24 2016 +0100 Committer: Sean Owen Committed: Tue Jul 19 12:10:24 2016 +0100 -- .../scala/org/apache/spark/streaming/Checkpoint.scala | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6c4b9f4b/streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala -- diff --git a/streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala b/streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala index 0b11026..398fa65 100644 --- a/streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala +++ b/streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala @@ -18,8 +18,8 @@ package org.apache.spark.streaming import java.io._ -import java.util.concurrent.Executors -import java.util.concurrent.RejectedExecutionException +import java.util.concurrent.{ArrayBlockingQueue, RejectedExecutionException, + ThreadPoolExecutor, TimeUnit} import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs.{FileSystem, Path} @@ -184,7 +184,14 @@ class CheckpointWriter( hadoopConf: Configuration ) extends Logging { val MAX_ATTEMPTS = 3 - val executor = Executors.newFixedThreadPool(1) + + // Single-thread executor which rejects executions when a large amount have queued up. + // This fails fast since this typically means the checkpoint store will never keep up, and + // will otherwise lead to filling memory with waiting payloads of byte[] to write. + val executor = new ThreadPoolExecutor( +1, 1, +0L, TimeUnit.MILLISECONDS, +new ArrayBlockingQueue[Runnable](1000)) val compressionCodec = CompressionCodec.createCodec(conf) private var stopped = false @volatile private[this] var fs: FileSystem = null - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
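The effect of the change is easy to see in isolation: one worker thread backed by a bounded queue, so that once 1000 checkpoint writes are already waiting, further submissions are rejected instead of accumulating in memory. The sketch below is illustrative only; the object and method names are hypothetical, and only the executor construction mirrors the patch.

```scala
import java.util.concurrent.{ArrayBlockingQueue, RejectedExecutionException, ThreadPoolExecutor, TimeUnit}

object BoundedCheckpointExecutorSketch {
  // One thread, bounded queue: equivalent to Executors.newFixedThreadPool(1) except that
  // submissions fail once 1000 tasks are already queued (default AbortPolicy).
  val executor = new ThreadPoolExecutor(
    1, 1,
    0L, TimeUnit.MILLISECONDS,
    new ArrayBlockingQueue[Runnable](1000))

  def submitWrite(write: Runnable): Boolean =
    try {
      executor.execute(write)
      true
    } catch {
      case _: RejectedExecutionException =>
        // Queue is full: storage is not keeping up, so fail fast rather than buffer more payloads.
        false
    }
}
```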
spark git commit: [SPARK-16600][MLLIB] fix some latex formula syntax error
Repository: spark Updated Branches: refs/heads/branch-2.0 929fa287e -> 2c74b6d73 [SPARK-16600][MLLIB] fix some latex formula syntax error ## What changes were proposed in this pull request? `\partial\x` ==> `\partial x` `har{x_i}` ==> `hat{x_i}` ## How was this patch tested? N/A Author: WeichenXuCloses #14246 from WeichenXu123/fix_formular_err. (cherry picked from commit 8310c0741c0ca805ec74c1a78ba4a0f18e82d459) Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2c74b6d7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2c74b6d7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2c74b6d7 Branch: refs/heads/branch-2.0 Commit: 2c74b6d73beab4510fa7933dde9c0a5c218cce92 Parents: 929fa28 Author: WeichenXu Authored: Tue Jul 19 12:07:40 2016 +0100 Committer: Sean Owen Committed: Tue Jul 19 12:07:49 2016 +0100 -- .../org/apache/spark/ml/regression/LinearRegression.scala| 8 1 file changed, 4 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2c74b6d7/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala index 401f2c6..0a155e1 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala @@ -794,16 +794,16 @@ class LinearRegressionSummary private[regression] ( * * Now, the first derivative of the objective function in scaled space is * {{{ - * \frac{\partial L}{\partial\w_i} = diff/N (x_i - \bar{x_i}) / \hat{x_i} + * \frac{\partial L}{\partial w_i} = diff/N (x_i - \bar{x_i}) / \hat{x_i} * }}} * However, ($x_i - \bar{x_i}$) will densify the computation, so it's not * an ideal formula when the training dataset is sparse format. * - * This can be addressed by adding the dense \bar{x_i} / \har{x_i} terms + * This can be addressed by adding the dense \bar{x_i} / \hat{x_i} terms * in the end by keeping the sum of diff. The first derivative of total * objective function from all the samples is * {{{ - * \frac{\partial L}{\partial\w_i} = + * \frac{\partial L}{\partial w_i} = * 1/N \sum_j diff_j (x_{ij} - \bar{x_i}) / \hat{x_i} * = 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) - diffSum \bar{x_i}) / \hat{x_i}) * = 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) + correction_i) @@ -822,7 +822,7 @@ class LinearRegressionSummary private[regression] ( * the training dataset, which can be easily computed in distributed fashion, and is * sparse format friendly. * {{{ - * \frac{\partial L}{\partial\w_i} = 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) + * \frac{\partial L}{\partial w_i} = 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) * }}}, * * @param coefficients The coefficients corresponding to the features. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
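For reference, the two corrected first-derivative expressions quoted in the hunks above, rendered as LaTeX:

```latex
\frac{\partial L}{\partial w_i} = \frac{\mathrm{diff}}{N} \cdot \frac{x_i - \bar{x_i}}{\hat{x_i}}
\qquad\text{and}\qquad
\frac{\partial L}{\partial w_i} = \frac{1}{N} \sum_j \mathrm{diff}_j \, \frac{x_{ij} - \bar{x_i}}{\hat{x_i}}
```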
spark git commit: [SPARK-16600][MLLIB] fix some latex formula syntax error
Repository: spark Updated Branches: refs/heads/master 6caa22050 -> 8310c0741 [SPARK-16600][MLLIB] fix some latex formula syntax error ## What changes were proposed in this pull request? `\partial\x` ==> `\partial x` `har{x_i}` ==> `hat{x_i}` ## How was this patch tested? N/A Author: WeichenXuCloses #14246 from WeichenXu123/fix_formular_err. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8310c074 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8310c074 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8310c074 Branch: refs/heads/master Commit: 8310c0741c0ca805ec74c1a78ba4a0f18e82d459 Parents: 6caa220 Author: WeichenXu Authored: Tue Jul 19 12:07:40 2016 +0100 Committer: Sean Owen Committed: Tue Jul 19 12:07:40 2016 +0100 -- .../org/apache/spark/ml/regression/LinearRegression.scala| 8 1 file changed, 4 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8310c074/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala index 401f2c6..0a155e1 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala @@ -794,16 +794,16 @@ class LinearRegressionSummary private[regression] ( * * Now, the first derivative of the objective function in scaled space is * {{{ - * \frac{\partial L}{\partial\w_i} = diff/N (x_i - \bar{x_i}) / \hat{x_i} + * \frac{\partial L}{\partial w_i} = diff/N (x_i - \bar{x_i}) / \hat{x_i} * }}} * However, ($x_i - \bar{x_i}$) will densify the computation, so it's not * an ideal formula when the training dataset is sparse format. * - * This can be addressed by adding the dense \bar{x_i} / \har{x_i} terms + * This can be addressed by adding the dense \bar{x_i} / \hat{x_i} terms * in the end by keeping the sum of diff. The first derivative of total * objective function from all the samples is * {{{ - * \frac{\partial L}{\partial\w_i} = + * \frac{\partial L}{\partial w_i} = * 1/N \sum_j diff_j (x_{ij} - \bar{x_i}) / \hat{x_i} * = 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) - diffSum \bar{x_i}) / \hat{x_i}) * = 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) + correction_i) @@ -822,7 +822,7 @@ class LinearRegressionSummary private[regression] ( * the training dataset, which can be easily computed in distributed fashion, and is * sparse format friendly. * {{{ - * \frac{\partial L}{\partial\w_i} = 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) + * \frac{\partial L}{\partial w_i} = 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) * }}}, * * @param coefficients The coefficients corresponding to the features. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar
Repository: spark Updated Branches: refs/heads/branch-2.0 eb1c20fa0 -> 929fa287e [MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar ## What changes were proposed in this pull request? Minor fixes correcting some typos, punctuations, grammar. Adding more anchors for easy navigation. Fixing minor issues with code snippets. ## How was this patch tested? `jekyll serve` Author: Ahmed Mahran Closes #14234 from ahmed-mahran/b-struct-streaming-docs. (cherry picked from commit 6caa22050e221cf14e2db0544fd2766dd1102bda) Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/929fa287 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/929fa287 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/929fa287 Branch: refs/heads/branch-2.0 Commit: 929fa287e700c0e112f43e0c7b9bc746b5546c64 Parents: eb1c20f Author: Ahmed Mahran Authored: Tue Jul 19 12:01:54 2016 +0100 Committer: Sean Owen Committed: Tue Jul 19 12:06:26 2016 +0100 -- docs/structured-streaming-programming-guide.md | 154 +--- 1 file changed, 71 insertions(+), 83 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/929fa287/docs/structured-streaming-programming-guide.md -- diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md index 3ef39e4..aac8817 100644 --- a/docs/structured-streaming-programming-guide.md +++ b/docs/structured-streaming-programming-guide.md @@ -22,14 +22,49 @@ Let's say you want to maintain a running word count of text data received from +{% highlight scala %} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.SparkSession + +val spark = SparkSession + .builder + .appName("StructuredNetworkWordCount") + .getOrCreate() + +import spark.implicits._ +{% endhighlight %} +{% highlight java %} +import org.apache.spark.api.java.function.FlatMapFunction; +import org.apache.spark.sql.*; +import org.apache.spark.sql.streaming.StreamingQuery; + +import java.util.Arrays; +import java.util.Iterator; + +SparkSession spark = SparkSession +.builder() +.appName("JavaStructuredNetworkWordCount") +.getOrCreate(); +{% endhighlight %} +{% highlight python %} +from pyspark.sql import SparkSession +from pyspark.sql.functions import explode +from pyspark.sql.functions import split + +spark = SparkSession\ +.builder()\ +.appName("StructuredNetworkWordCount")\ +.getOrCreate() +{% endhighlight %} + @@ -39,18 +74,6 @@ Next, let's create a streaming DataFrame that represents text data received fr {% highlight scala %} -import org.apache.spark.sql.functions._ -import org.apache.spark.sql.SparkSession - -val spark = SparkSession - .builder - .appName("StructuredNetworkWordCount") - .getOrCreate() -{% endhighlight %} - -Next, let's create a streaming DataFrame that represents text data received from a server listening on localhost:, and transform the DataFrame to calculate word counts. - -{% highlight scala %} // Create DataFrame representing the stream of input lines from connection to localhost: val lines = spark.readStream .format("socket") @@ -65,30 +88,12 @@ val words = lines.as[String].flatMap(_.split(" ")) val wordCounts = words.groupBy("value").count() {% endhighlight %} -This `lines` DataFrame represents an unbounded table containing the streaming text data. This table contains one column of strings named "value", and each line in the streaming text data becomes a row in the table. 
Note, that this is not currently receiving any data as we are just setting up the transformation, and have not yet started it. Next, we have converted the DataFrame to a Dataset of String using `.as(Encoders.STRING())`, so that we can apply the `flatMap` operation to split each line into multiple words. The resultant `words` Dataset contains all the words. Finally, we have defined the `wordCounts` DataFrame by grouping by the unique values in the Dataset and counting them. Note that this is a streaming DataFrame which represents the running word counts of the stream. +This `lines` DataFrame represents an unbounded table containing the streaming text data. This table contains one column of strings named "value", and each line in the streaming text data becomes a row in the table. Note, that this is not currently receiving any data as we are just setting up the transformation, and have not yet started it. Next, we have converted the DataFrame to a Dataset of String using `.as[String]`, so that we can apply the `flatMap` operation to split each line
spark git commit: [MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar
Repository: spark Updated Branches: refs/heads/master 21a6dd2ae -> 6caa22050 [MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar ## What changes were proposed in this pull request? Minor fixes correcting some typos, punctuations, grammar. Adding more anchors for easy navigation. Fixing minor issues with code snippets. ## How was this patch tested? `jekyll serve` Author: Ahmed Mahran Closes #14234 from ahmed-mahran/b-struct-streaming-docs. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6caa2205 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6caa2205 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6caa2205 Branch: refs/heads/master Commit: 6caa22050e221cf14e2db0544fd2766dd1102bda Parents: 21a6dd2 Author: Ahmed Mahran Authored: Tue Jul 19 12:01:54 2016 +0100 Committer: Sean Owen Committed: Tue Jul 19 12:01:54 2016 +0100 -- docs/structured-streaming-programming-guide.md | 154 +--- 1 file changed, 71 insertions(+), 83 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6caa2205/docs/structured-streaming-programming-guide.md -- diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md index 3ef39e4..aac8817 100644 --- a/docs/structured-streaming-programming-guide.md +++ b/docs/structured-streaming-programming-guide.md @@ -22,14 +22,49 @@ Let's say you want to maintain a running word count of text data received from +{% highlight scala %} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.SparkSession + +val spark = SparkSession + .builder + .appName("StructuredNetworkWordCount") + .getOrCreate() + +import spark.implicits._ +{% endhighlight %} +{% highlight java %} +import org.apache.spark.api.java.function.FlatMapFunction; +import org.apache.spark.sql.*; +import org.apache.spark.sql.streaming.StreamingQuery; + +import java.util.Arrays; +import java.util.Iterator; + +SparkSession spark = SparkSession +.builder() +.appName("JavaStructuredNetworkWordCount") +.getOrCreate(); +{% endhighlight %} +{% highlight python %} +from pyspark.sql import SparkSession +from pyspark.sql.functions import explode +from pyspark.sql.functions import split + +spark = SparkSession\ +.builder()\ +.appName("StructuredNetworkWordCount")\ +.getOrCreate() +{% endhighlight %} + @@ -39,18 +74,6 @@ Next, let's create a streaming DataFrame that represents text data received fr {% highlight scala %} -import org.apache.spark.sql.functions._ -import org.apache.spark.sql.SparkSession - -val spark = SparkSession - .builder - .appName("StructuredNetworkWordCount") - .getOrCreate() -{% endhighlight %} - -Next, let's create a streaming DataFrame that represents text data received from a server listening on localhost:, and transform the DataFrame to calculate word counts. - -{% highlight scala %} // Create DataFrame representing the stream of input lines from connection to localhost: val lines = spark.readStream .format("socket") @@ -65,30 +88,12 @@ val words = lines.as[String].flatMap(_.split(" ")) val wordCounts = words.groupBy("value").count() {% endhighlight %} -This `lines` DataFrame represents an unbounded table containing the streaming text data. This table contains one column of strings named "value", and each line in the streaming text data becomes a row in the table. Note, that this is not currently receiving any data as we are just setting up the transformation, and have not yet started it. 
Next, we have converted the DataFrame to a Dataset of String using `.as(Encoders.STRING())`, so that we can apply the `flatMap` operation to split each line into multiple words. The resultant `words` Dataset contains all the words. Finally, we have defined the `wordCounts` DataFrame by grouping by the unique values in the Dataset and counting them. Note that this is a streaming DataFrame which represents the running word counts of the stream. +This `lines` DataFrame represents an unbounded table containing the streaming text data. This table contains one column of strings named "value", and each line in the streaming text data becomes a row in the table. Note, that this is not currently receiving any data as we are just setting up the transformation, and have not yet started it. Next, we have converted the DataFrame to a Dataset of String using `.as[String]`, so that we can apply the `flatMap` operation to split each line into multiple words. The resultant `words` Dataset contains all the words. Finally, we have defined the `wordCounts`
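Putting the guide's snippets from the hunks above together, a minimal end-to-end version of the Scala example looks like the sketch below. The socket host/port values and the console-sink/start section are assumptions added for illustration (they are not part of the hunks shown here); the source, `flatMap`, and `groupBy` steps mirror the guide.

```scala
import org.apache.spark.sql.SparkSession

object StructuredNetworkWordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("StructuredNetworkWordCount")
      .getOrCreate()
    import spark.implicits._

    // Stream of input lines from a socket; host and port are placeholder values
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()

    // Split each line into words, then count occurrences of each word
    val words = lines.as[String].flatMap(_.split(" "))
    val wordCounts = words.groupBy("value").count()

    // Print the running counts to the console (sink setup assumed, not shown in the hunks)
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
    query.awaitTermination()
  }
}
```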
spark git commit: [SPARK-16535][BUILD] In pom.xml, remove groupId which is redundant definition and inherited from the parent
Repository: spark Updated Branches: refs/heads/master 556a9437a -> 21a6dd2ae [SPARK-16535][BUILD] In pom.xml, remove groupId which is redundant definition and inherited from the parent https://issues.apache.org/jira/browse/SPARK-16535 ## What changes were proposed in this pull request? When I scan through the pom.xml of sub projects, I found this warning as below and attached screenshot ``` Definition of groupId is redundant, because it's inherited from the parent ``` ![screen shot 2016-07-13 at 3 13 11 pm](https://cloud.githubusercontent.com/assets/3925641/16823121/744f893e-4916-11e6-8a52-042f83b9db4e.png) I've tried to remove some of the lines with groupId definition, and the build on my local machine is still ok. ``` org.apache.spark ``` As I just find now `3.3.9` is being used in Spark 2.x, and Maven-3 supports versionless parent elements: Maven 3 will remove the need to specify the parent version in sub modules. THIS is great (in Maven 3.1). ref: http://stackoverflow.com/questions/3157240/maven-3-worth-it/3166762#3166762 ## How was this patch tested? I've tested by re-building the project, and build succeeded. Author: Xin RenCloses #14189 from keypointt/SPARK-16535. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/21a6dd2a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/21a6dd2a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/21a6dd2a Branch: refs/heads/master Commit: 21a6dd2aef65a23d92f93c43fa731c0505250363 Parents: 556a943 Author: Xin Ren Authored: Tue Jul 19 11:59:46 2016 +0100 Committer: Sean Owen Committed: Tue Jul 19 11:59:46 2016 +0100 -- assembly/pom.xml | 1 - common/network-common/pom.xml | 1 - common/network-shuffle/pom.xml| 1 - common/network-yarn/pom.xml | 1 - common/sketch/pom.xml | 1 - common/tags/pom.xml | 1 - common/unsafe/pom.xml | 1 - core/pom.xml | 1 - examples/pom.xml | 1 - external/flume-assembly/pom.xml | 1 - external/flume-sink/pom.xml | 1 - external/flume/pom.xml| 1 - external/java8-tests/pom.xml | 1 - external/kafka-0-10-assembly/pom.xml | 1 - external/kafka-0-10/pom.xml | 1 - external/kafka-0-8-assembly/pom.xml | 1 - external/kafka-0-8/pom.xml| 1 - external/kinesis-asl-assembly/pom.xml | 1 - external/kinesis-asl/pom.xml | 1 - external/spark-ganglia-lgpl/pom.xml | 1 - graphx/pom.xml| 1 - launcher/pom.xml | 1 - mllib-local/pom.xml | 1 - mllib/pom.xml | 1 - repl/pom.xml | 1 - sql/catalyst/pom.xml | 1 - sql/core/pom.xml | 1 - sql/hive-thriftserver/pom.xml | 1 - sql/hive/pom.xml | 1 - streaming/pom.xml | 1 - tools/pom.xml | 1 - yarn/pom.xml | 1 - 32 files changed, 32 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/21a6dd2a/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 1b25b7c..971a62f 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -25,7 +25,6 @@ ../pom.xml - org.apache.spark spark-assembly_2.11 Spark Project Assembly http://spark.apache.org/ http://git-wip-us.apache.org/repos/asf/spark/blob/21a6dd2a/common/network-common/pom.xml -- diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml index b1a37e8..81f0c6e 100644 --- a/common/network-common/pom.xml +++ b/common/network-common/pom.xml @@ -26,7 +26,6 @@ ../../pom.xml - org.apache.spark spark-network-common_2.11 jar Spark Project Networking http://git-wip-us.apache.org/repos/asf/spark/blob/21a6dd2a/common/network-shuffle/pom.xml -- diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml index 51c06b9..d211bd5 100644 --- 
a/common/network-shuffle/pom.xml +++ b/common/network-shuffle/pom.xml @@ -26,7 +26,6 @@ ../../pom.xml - org.apache.spark spark-network-shuffle_2.11 jar Spark Project Shuffle Streaming Service http://git-wip-us.apache.org/repos/asf/spark/blob/21a6dd2a/common/network-yarn/pom.xml -- diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml index d43eb71..606ad15 100644 ---
spark git commit: [MINOR][BUILD] Fix Java Linter `LineLength` errors
Repository: spark Updated Branches: refs/heads/master 6ee40d2cc -> 556a9437a [MINOR][BUILD] Fix Java Linter `LineLength` errors ## What changes were proposed in this pull request? This PR fixes four java linter `LineLength` errors. Those are all `LineLength` errors, but we had better remove all java linter errors before release. ## How was this patch tested? After pass the Jenkins, `./dev/lint-java`. Author: Dongjoon HyunCloses #14255 from dongjoon-hyun/minor_java_linter. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/556a9437 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/556a9437 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/556a9437 Branch: refs/heads/master Commit: 556a9437ac7b55079f5a8a91e669dcc36ca02696 Parents: 6ee40d2 Author: Dongjoon Hyun Authored: Tue Jul 19 11:51:43 2016 +0100 Committer: Sean Owen Committed: Tue Jul 19 11:51:43 2016 +0100 -- .../spark/network/shuffle/ExternalShuffleBlockHandler.java | 3 ++- .../java/org/apache/spark/network/yarn/YarnShuffleService.java | 2 +- .../apache/spark/examples/sql/JavaSQLDataSourceExample.java| 6 -- 3 files changed, 7 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/556a9437/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java -- diff --git a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java index 1cc0fb6..1270cef 100644 --- a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java +++ b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java @@ -113,7 +113,8 @@ public class ExternalShuffleBlockHandler extends RpcHandler { } } else if (msgObj instanceof RegisterExecutor) { - final Timer.Context responseDelayContext = metrics.registerExecutorRequestLatencyMillis.time(); + final Timer.Context responseDelayContext = +metrics.registerExecutorRequestLatencyMillis.time(); try { RegisterExecutor msg = (RegisterExecutor) msgObj; checkAuth(client, msg.appId); http://git-wip-us.apache.org/repos/asf/spark/blob/556a9437/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java -- diff --git a/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java b/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java index df17dac..22e47ac 100644 --- a/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java +++ b/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java @@ -131,7 +131,7 @@ public class YarnShuffleService extends AuxiliaryService { try { // In case this NM was killed while there were running spark applications, we need to restore - // lost state for the existing executors. We look for an existing file in the NM's local dirs. + // lost state for the existing executors. We look for an existing file in the NM's local dirs. // If we don't find one, then we choose a file to use to save the state next time. 
Even if // an application was stopped while the NM was down, we expect yarn to call stopApplication() // when it comes back http://git-wip-us.apache.org/repos/asf/spark/blob/556a9437/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java -- diff --git a/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java b/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java index 2b94b9f..ec02c8b 100644 --- a/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java +++ b/examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java @@ -110,11 +110,13 @@ public class JavaSQLDataSourceExample { usersDF.select("name", "favorite_color").write().save("namesAndFavColors.parquet"); // $example off:generic_load_save_functions$ // $example on:manual_load_options$ -Dataset peopleDF = spark.read().format("json").load("examples/src/main/resources/people.json"); +Dataset peopleDF = + spark.read().format("json").load("examples/src/main/resources/people.json"); peopleDF.select("name",
spark git commit: [DOC] improve python doc for rdd.histogram and dataframe.join
Repository: spark Updated Branches: refs/heads/master 1426a0805 -> 6ee40d2cc [DOC] improve python doc for rdd.histogram and dataframe.join ## What changes were proposed in this pull request? doc change only ## How was this patch tested? doc change only Author: Mortada MehyarCloses #14253 from mortada/histogram_typos. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6ee40d2c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6ee40d2c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6ee40d2c Branch: refs/heads/master Commit: 6ee40d2cc5f467c78be662c1639fc3d5b7f796cf Parents: 1426a08 Author: Mortada Mehyar Authored: Mon Jul 18 23:49:47 2016 -0700 Committer: Reynold Xin Committed: Mon Jul 18 23:49:47 2016 -0700 -- python/pyspark/rdd.py | 18 +- python/pyspark/sql/dataframe.py | 10 +- 2 files changed, 14 insertions(+), 14 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6ee40d2c/python/pyspark/rdd.py -- diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py index 6afe769..0508235 100644 --- a/python/pyspark/rdd.py +++ b/python/pyspark/rdd.py @@ -1027,20 +1027,20 @@ class RDD(object): If your histogram is evenly spaced (e.g. [0, 10, 20, 30]), this can be switched from an O(log n) inseration to O(1) per -element(where n = # buckets). +element (where n is the number of buckets). -Buckets must be sorted and not contain any duplicates, must be +Buckets must be sorted, not contain any duplicates, and have at least two elements. -If `buckets` is a number, it will generates buckets which are +If `buckets` is a number, it will generate buckets which are evenly spaced between the minimum and maximum of the RDD. For -example, if the min value is 0 and the max is 100, given buckets -as 2, the resulting buckets will be [0,50) [50,100]. buckets must -be at least 1 If the RDD contains infinity, NaN throws an exception -If the elements in RDD do not vary (max == min) always returns -a single bucket. +example, if the min value is 0 and the max is 100, given `buckets` +as 2, the resulting buckets will be [0,50) [50,100]. `buckets` must +be at least 1. An exception is raised if the RDD contains infinity. +If the elements in the RDD do not vary (max == min), a single bucket +will be used. -It will return a tuple of buckets and histogram. +The return value is a tuple of buckets and histogram. >>> rdd = sc.parallelize(range(51)) >>> rdd.histogram(2) http://git-wip-us.apache.org/repos/asf/spark/blob/6ee40d2c/python/pyspark/sql/dataframe.py -- diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py index adf549d..8ff9403 100644 --- a/python/pyspark/sql/dataframe.py +++ b/python/pyspark/sql/dataframe.py @@ -613,16 +613,16 @@ class DataFrame(object): def join(self, other, on=None, how=None): """Joins with another :class:`DataFrame`, using the given join expression. -The following performs a full outer join between ``df1`` and ``df2``. - :param other: Right side of the join -:param on: a string for join column name, a list of column names, -, a join expression (Column) or a list of Columns. -If `on` is a string or a list of string indicating the name of the join column(s), +:param on: a string for the join column name, a list of column names, +a join expression (Column), or a list of Columns. +If `on` is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join. :param how: str, default 'inner'. 
One of `inner`, `outer`, `left_outer`, `right_outer`, `leftsemi`. +The following performs a full outer join between ``df1`` and ``df2``. + >>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect() [Row(name=None, height=80), Row(name=u'Bob', height=85), Row(name=u'Alice', height=None)] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
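As a cross-check of the reworded `histogram` docstring above, the same evenly-spaced-bucket behaviour can be observed from Scala. This snippet is an illustrative analogue under local-mode assumptions, not part of the patch, which only touches the Python docs.

```scala
import org.apache.spark.sql.SparkSession

object HistogramSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("HistogramSketch").master("local[*]").getOrCreate()
    val rdd = spark.sparkContext.parallelize(0 to 50).map(_.toDouble)
    // Two evenly spaced buckets between the RDD's min (0) and max (50): [0, 25) and [25, 50]
    val (buckets, counts) = rdd.histogram(2)
    // buckets: Array(0.0, 25.0, 50.0); counts: Array(25, 26)
    println(buckets.mkString(", ") + " / " + counts.mkString(", "))
    spark.stop()
  }
}
```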
spark git commit: [DOC] improve python doc for rdd.histogram and dataframe.join
Repository: spark Updated Branches: refs/heads/branch-2.0 ef2a6f131 -> 504aa6f7a [DOC] improve python doc for rdd.histogram and dataframe.join ## What changes were proposed in this pull request? doc change only ## How was this patch tested? doc change only Author: Mortada MehyarCloses #14253 from mortada/histogram_typos. (cherry picked from commit 6ee40d2cc5f467c78be662c1639fc3d5b7f796cf) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/504aa6f7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/504aa6f7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/504aa6f7 Branch: refs/heads/branch-2.0 Commit: 504aa6f7a87973de0955aa8c124e2a036f8b3369 Parents: ef2a6f1 Author: Mortada Mehyar Authored: Mon Jul 18 23:49:47 2016 -0700 Committer: Reynold Xin Committed: Mon Jul 18 23:50:01 2016 -0700 -- python/pyspark/rdd.py | 18 +- python/pyspark/sql/dataframe.py | 10 +- 2 files changed, 14 insertions(+), 14 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/504aa6f7/python/pyspark/rdd.py -- diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py index 6afe769..0508235 100644 --- a/python/pyspark/rdd.py +++ b/python/pyspark/rdd.py @@ -1027,20 +1027,20 @@ class RDD(object): If your histogram is evenly spaced (e.g. [0, 10, 20, 30]), this can be switched from an O(log n) inseration to O(1) per -element(where n = # buckets). +element (where n is the number of buckets). -Buckets must be sorted and not contain any duplicates, must be +Buckets must be sorted, not contain any duplicates, and have at least two elements. -If `buckets` is a number, it will generates buckets which are +If `buckets` is a number, it will generate buckets which are evenly spaced between the minimum and maximum of the RDD. For -example, if the min value is 0 and the max is 100, given buckets -as 2, the resulting buckets will be [0,50) [50,100]. buckets must -be at least 1 If the RDD contains infinity, NaN throws an exception -If the elements in RDD do not vary (max == min) always returns -a single bucket. +example, if the min value is 0 and the max is 100, given `buckets` +as 2, the resulting buckets will be [0,50) [50,100]. `buckets` must +be at least 1. An exception is raised if the RDD contains infinity. +If the elements in the RDD do not vary (max == min), a single bucket +will be used. -It will return a tuple of buckets and histogram. +The return value is a tuple of buckets and histogram. >>> rdd = sc.parallelize(range(51)) >>> rdd.histogram(2) http://git-wip-us.apache.org/repos/asf/spark/blob/504aa6f7/python/pyspark/sql/dataframe.py -- diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py index c7d704a..b9f50ff 100644 --- a/python/pyspark/sql/dataframe.py +++ b/python/pyspark/sql/dataframe.py @@ -601,16 +601,16 @@ class DataFrame(object): def join(self, other, on=None, how=None): """Joins with another :class:`DataFrame`, using the given join expression. -The following performs a full outer join between ``df1`` and ``df2``. - :param other: Right side of the join -:param on: a string for join column name, a list of column names, -, a join expression (Column) or a list of Columns. -If `on` is a string or a list of string indicating the name of the join column(s), +:param on: a string for the join column name, a list of column names, +a join expression (Column), or a list of Columns. 
+If `on` is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join. :param how: str, default 'inner'. One of `inner`, `outer`, `left_outer`, `right_outer`, `leftsemi`. +The following performs a full outer join between ``df1`` and ``df2``. + >>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect() [Row(name=None, height=80), Row(name=u'Bob', height=85), Row(name=u'Alice', height=None)] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16303][DOCS][EXAMPLES] Minor Scala/Java example update
Repository: spark Updated Branches: refs/heads/branch-2.0 24ea87519 -> ef2a6f131 [SPARK-16303][DOCS][EXAMPLES] Minor Scala/Java example update ## What changes were proposed in this pull request? This PR moves one and the last hard-coded Scala example snippet from the SQL programming guide into `SparkSqlExample.scala`. It also renames all Scala/Java example files so that all "Sql" in the file names are updated to "SQL". ## How was this patch tested? Manually verified the generated HTML page. Author: Cheng LianCloses #14245 from liancheng/minor-scala-example-update. (cherry picked from commit 1426a080528bdb470b5e81300d892af45dd188bf) Signed-off-by: Yin Huai Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ef2a6f13 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ef2a6f13 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ef2a6f13 Branch: refs/heads/branch-2.0 Commit: ef2a6f1310777bb6ea2b157a873c3785231b104a Parents: 24ea875 Author: Cheng Lian Authored: Mon Jul 18 23:07:59 2016 -0700 Committer: Yin Huai Committed: Mon Jul 18 23:08:11 2016 -0700 -- docs/sql-programming-guide.md | 57 ++-- .../examples/sql/JavaSQLDataSourceExample.java | 217 .../spark/examples/sql/JavaSparkSQLExample.java | 336 +++ .../spark/examples/sql/JavaSparkSqlExample.java | 336 --- .../examples/sql/JavaSqlDataSourceExample.java | 217 .../examples/sql/SQLDataSourceExample.scala | 148 .../spark/examples/sql/SparkSQLExample.scala| 254 ++ .../spark/examples/sql/SparkSqlExample.scala| 254 -- .../examples/sql/SqlDataSourceExample.scala | 148 9 files changed, 983 insertions(+), 984 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ef2a6f13/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index a4127da..a88efb7 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -65,14 +65,14 @@ Throughout this document, we will often refer to Scala/Java Datasets of `Row`s a The entry point into all functionality in Spark is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`: -{% include_example init_session scala/org/apache/spark/examples/sql/SparkSqlExample.scala %} +{% include_example init_session scala/org/apache/spark/examples/sql/SparkSQLExample.scala %} The entry point into all functionality in Spark is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`: -{% include_example init_session java/org/apache/spark/examples/sql/JavaSparkSqlExample.java %} +{% include_example init_session java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %} @@ -105,7 +105,7 @@ from a Hive table, or from [Spark data sources](#data-sources). As an example, the following creates a DataFrame based on the content of a JSON file: -{% include_example create_df scala/org/apache/spark/examples/sql/SparkSqlExample.scala %} +{% include_example create_df scala/org/apache/spark/examples/sql/SparkSQLExample.scala %} @@ -114,7 +114,7 @@ from a Hive table, or from [Spark data sources](#data-sources). 
As an example, the following creates a DataFrame based on the content of a JSON file: -{% include_example create_df java/org/apache/spark/examples/sql/JavaSparkSqlExample.java %} +{% include_example create_df java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %} @@ -155,7 +155,7 @@ Here we include some basic examples of structured data processing using Datasets -{% include_example untyped_ops scala/org/apache/spark/examples/sql/SparkSqlExample.scala %} +{% include_example untyped_ops scala/org/apache/spark/examples/sql/SparkSQLExample.scala %} For a complete list of the types of operations that can be performed on a Dataset refer to the [API Documentation](api/scala/index.html#org.apache.spark.sql.Dataset). @@ -164,7 +164,7 @@ In addition to simple column references and expressions, Datasets also have a ri -{% include_example untyped_ops java/org/apache/spark/examples/sql/JavaSparkSqlExample.java %} +{% include_example untyped_ops java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %} For a complete list of the types of operations that can be performed on a Dataset refer to the [API Documentation](api/java/org/apache/spark/sql/Dataset.html). @@
spark git commit: [SPARK-16303][DOCS][EXAMPLES] Minor Scala/Java example update
Repository: spark Updated Branches: refs/heads/master e5fbb182c -> 1426a0805 [SPARK-16303][DOCS][EXAMPLES] Minor Scala/Java example update ## What changes were proposed in this pull request? This PR moves one and the last hard-coded Scala example snippet from the SQL programming guide into `SparkSqlExample.scala`. It also renames all Scala/Java example files so that all "Sql" in the file names are updated to "SQL". ## How was this patch tested? Manually verified the generated HTML page. Author: Cheng LianCloses #14245 from liancheng/minor-scala-example-update. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1426a080 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1426a080 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1426a080 Branch: refs/heads/master Commit: 1426a080528bdb470b5e81300d892af45dd188bf Parents: e5fbb18 Author: Cheng Lian Authored: Mon Jul 18 23:07:59 2016 -0700 Committer: Yin Huai Committed: Mon Jul 18 23:07:59 2016 -0700 -- docs/sql-programming-guide.md | 57 ++-- .../examples/sql/JavaSQLDataSourceExample.java | 217 .../spark/examples/sql/JavaSparkSQLExample.java | 336 +++ .../spark/examples/sql/JavaSparkSqlExample.java | 336 --- .../examples/sql/JavaSqlDataSourceExample.java | 217 .../examples/sql/SQLDataSourceExample.scala | 148 .../spark/examples/sql/SparkSQLExample.scala| 254 ++ .../spark/examples/sql/SparkSqlExample.scala| 254 -- .../examples/sql/SqlDataSourceExample.scala | 148 9 files changed, 983 insertions(+), 984 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1426a080/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index 4413fdd..71f3ee4 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -65,14 +65,14 @@ Throughout this document, we will often refer to Scala/Java Datasets of `Row`s a The entry point into all functionality in Spark is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`: -{% include_example init_session scala/org/apache/spark/examples/sql/SparkSqlExample.scala %} +{% include_example init_session scala/org/apache/spark/examples/sql/SparkSQLExample.scala %} The entry point into all functionality in Spark is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`: -{% include_example init_session java/org/apache/spark/examples/sql/JavaSparkSqlExample.java %} +{% include_example init_session java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %} @@ -105,7 +105,7 @@ from a Hive table, or from [Spark data sources](#data-sources). As an example, the following creates a DataFrame based on the content of a JSON file: -{% include_example create_df scala/org/apache/spark/examples/sql/SparkSqlExample.scala %} +{% include_example create_df scala/org/apache/spark/examples/sql/SparkSQLExample.scala %} @@ -114,7 +114,7 @@ from a Hive table, or from [Spark data sources](#data-sources). 
As an example, the following creates a DataFrame based on the content of a JSON file: -{% include_example create_df java/org/apache/spark/examples/sql/JavaSparkSqlExample.java %} +{% include_example create_df java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %} @@ -155,7 +155,7 @@ Here we include some basic examples of structured data processing using Datasets -{% include_example untyped_ops scala/org/apache/spark/examples/sql/SparkSqlExample.scala %} +{% include_example untyped_ops scala/org/apache/spark/examples/sql/SparkSQLExample.scala %} For a complete list of the types of operations that can be performed on a Dataset refer to the [API Documentation](api/scala/index.html#org.apache.spark.sql.Dataset). @@ -164,7 +164,7 @@ In addition to simple column references and expressions, Datasets also have a ri -{% include_example untyped_ops java/org/apache/spark/examples/sql/JavaSparkSqlExample.java %} +{% include_example untyped_ops java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %} For a complete list of the types of operations that can be performed on a Dataset refer to the [API Documentation](api/java/org/apache/spark/sql/Dataset.html). @@ -249,13 +249,13 @@ In addition to simple column references and expressions, DataFrames also have a The `sql` function on a