[spark] branch branch-3.1 updated: [SPARK-34681][SQL] Fix bug for full outer shuffled hash join when building left side with non-equal condition
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.1 by this push: new 9e61043 [SPARK-34681][SQL] Fix bug for full outer shuffled hash join when building left side with non-equal condition 9e61043 is described below commit 9e610438a6211aa8629637644c512a41332d12a5 Author: Cheng Su AuthorDate: Tue Mar 9 22:55:27 2021 -0800 [SPARK-34681][SQL] Fix bug for full outer shuffled hash join when building left side with non-equal condition ### What changes were proposed in this pull request? For full outer shuffled hash join with building hash map on left side, and having non-equal condition, the join can produce wrong result. The root cause is `boundCondition` in `HashJoin.scala` always assumes the left side row is `streamedPlan` and right side row is `buildPlan` ([streamedPlan.output ++ buildPlan.output](https://github.com/apache/spark/blob/branch-3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala#L141)). This is valid assumption, except for full outer + build left case. The fix is to correct `boundCondition` in `HashJoin.scala` to handle full outer + build left case properly. See reproduce in https://issues.apache.org/jira/browse/SPARK-32399?focusedCommentId=17298414&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17298414 . ### Why are the changes needed? Fix data correctness bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Changed the test in `OuterJoinSuite.scala` to cover full outer shuffled hash join. Before this change, the unit test `basic full outer join using ShuffledHashJoin` in `OuterJoinSuite.scala` is failed. Closes #31792 from c21/join-bugfix. Authored-by: Cheng Su Signed-off-by: Dongjoon Hyun (cherry picked from commit a916690dd9aac40df38922dbea233785354a2f2a) Signed-off-by: Dongjoon Hyun --- .../spark/sql/execution/joins/HashJoin.scala | 8 +++- .../spark/sql/execution/joins/OuterJoinSuite.scala | 22 ++ 2 files changed, 17 insertions(+), 13 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala index 53bd591..42219ee 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala @@ -138,7 +138,13 @@ trait HashJoin extends BaseJoinExec with CodegenSupport { UnsafeProjection.create(streamedBoundKeys) @transient protected[this] lazy val boundCondition = if (condition.isDefined) { -Predicate.create(condition.get, streamedPlan.output ++ buildPlan.output).eval _ +if (joinType == FullOuter && buildSide == BuildLeft) { + // Put join left side before right side. This is to be consistent with + // `ShuffledHashJoinExec.fullOuterJoin`. + Predicate.create(condition.get, buildPlan.output ++ streamedPlan.output).eval _ +} else { + Predicate.create(condition.get, streamedPlan.output ++ buildPlan.output).eval _ +} } else { (r: InternalRow) => true } diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/OuterJoinSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/OuterJoinSuite.scala index 9f7e0a1..238d37a 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/OuterJoinSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/OuterJoinSuite.scala @@ -104,18 +104,16 @@ class OuterJoinSuite extends SparkPlanTest with SharedSparkSession { ExtractEquiJoinKeys.unapply(join) } -if (joinType != FullOuter) { - test(s"$testName using ShuffledHashJoin") { -extractJoinParts().foreach { case (_, leftKeys, rightKeys, boundCondition, _, _, _) => - withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "1") { -val buildSide = if (joinType == LeftOuter) BuildRight else BuildLeft -checkAnswer2(leftRows, rightRows, (left: SparkPlan, right: SparkPlan) => - EnsureRequirements.apply( -ShuffledHashJoinExec( - leftKeys, rightKeys, joinType, buildSide, boundCondition, left, right)), - expectedAnswer.map(Row.fromTuple), - sortAnswers = true) - } +test(s"$testName using ShuffledHashJoin") { + extractJoinParts().foreach { case (_, leftKeys, rightKeys, boundCondition, _, _, _) => +withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "1") { + val buildSide = if (joinType == LeftOuter) BuildRight else BuildLeft + checkAnswer2(leftRow
[spark] branch master updated (48377d5 -> a916690)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 48377d5 [SPARK-34676][SQL][TEST] TableCapabilityCheckSuite should not inherit all tests from AnalysisSuite add a916690 [SPARK-34681][SQL] Fix bug for full outer shuffled hash join when building left side with non-equal condition No new revisions were added by this update. Summary of changes: .../spark/sql/execution/joins/HashJoin.scala | 8 +++- .../spark/sql/execution/joins/OuterJoinSuite.scala | 22 ++ 2 files changed, 17 insertions(+), 13 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated: [SPARK-34676][SQL][TEST] TableCapabilityCheckSuite should not inherit all tests from AnalysisSuite
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 7502624 [SPARK-34676][SQL][TEST] TableCapabilityCheckSuite should not inherit all tests from AnalysisSuite 7502624 is described below commit 75026248632df8804aeeb439a5f0f0b3729ef6b3 Author: Wenchen Fan AuthorDate: Tue Mar 9 09:02:31 2021 -0800 [SPARK-34676][SQL][TEST] TableCapabilityCheckSuite should not inherit all tests from AnalysisSuite ### What changes were proposed in this pull request? Fixes a mistake in `TableCapabilityCheckSuite`, which runs some tests repeatedly. ### Why are the changes needed? code cleanup ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A Closes #31788 from cloud-fan/minor. Authored-by: Wenchen Fan Signed-off-by: Dongjoon Hyun (cherry picked from commit 48377d5bd9544baf7df928aa315df2504c062ac2) Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/sql/connector/TableCapabilityCheckSuite.scala| 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/connector/TableCapabilityCheckSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/connector/TableCapabilityCheckSuite.scala index 23e4c293..1b1737f 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/connector/TableCapabilityCheckSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/connector/TableCapabilityCheckSuite.scala @@ -22,7 +22,7 @@ import java.util import scala.collection.JavaConverters._ import org.apache.spark.sql.{AnalysisException, DataFrame, SQLContext} -import org.apache.spark.sql.catalyst.analysis.{AnalysisSuite, NamedRelation} +import org.apache.spark.sql.catalyst.analysis.{AnalysisTest, NamedRelation} import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, Literal} import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.connector.catalog.{CatalogPlugin, Identifier, Table, TableCapability, TableProvider} @@ -35,7 +35,7 @@ import org.apache.spark.sql.test.SharedSparkSession import org.apache.spark.sql.types.{LongType, StringType, StructType} import org.apache.spark.sql.util.CaseInsensitiveStringMap -class TableCapabilityCheckSuite extends AnalysisSuite with SharedSparkSession { +class TableCapabilityCheckSuite extends AnalysisTest with SharedSparkSession { private val emptyMap = CaseInsensitiveStringMap.empty private def createStreamingRelation(table: Table, v1Relation: Option[StreamingRelation]) = { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.1 updated: [SPARK-34676][SQL][TEST] TableCapabilityCheckSuite should not inherit all tests from AnalysisSuite
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.1 by this push: new 525aa13 [SPARK-34676][SQL][TEST] TableCapabilityCheckSuite should not inherit all tests from AnalysisSuite 525aa13 is described below commit 525aa136d29b520d6dbc5df9962a13eb316d12a5 Author: Wenchen Fan AuthorDate: Tue Mar 9 09:02:31 2021 -0800 [SPARK-34676][SQL][TEST] TableCapabilityCheckSuite should not inherit all tests from AnalysisSuite ### What changes were proposed in this pull request? Fixes a mistake in `TableCapabilityCheckSuite`, which runs some tests repeatedly. ### Why are the changes needed? code cleanup ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A Closes #31788 from cloud-fan/minor. Authored-by: Wenchen Fan Signed-off-by: Dongjoon Hyun (cherry picked from commit 48377d5bd9544baf7df928aa315df2504c062ac2) Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/sql/connector/TableCapabilityCheckSuite.scala| 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/connector/TableCapabilityCheckSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/connector/TableCapabilityCheckSuite.scala index bad21aa..ce94d3b 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/connector/TableCapabilityCheckSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/connector/TableCapabilityCheckSuite.scala @@ -22,7 +22,7 @@ import java.util import scala.collection.JavaConverters._ import org.apache.spark.sql.{AnalysisException, DataFrame, SQLContext} -import org.apache.spark.sql.catalyst.analysis.{AnalysisSuite, NamedRelation} +import org.apache.spark.sql.catalyst.analysis.{AnalysisTest, NamedRelation} import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, Literal} import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.catalyst.streaming.StreamingRelationV2 @@ -36,7 +36,7 @@ import org.apache.spark.sql.test.SharedSparkSession import org.apache.spark.sql.types.{LongType, StringType, StructType} import org.apache.spark.sql.util.CaseInsensitiveStringMap -class TableCapabilityCheckSuite extends AnalysisSuite with SharedSparkSession { +class TableCapabilityCheckSuite extends AnalysisTest with SharedSparkSession { private val emptyMap = CaseInsensitiveStringMap.empty private def createStreamingRelation(table: Table, v1Relation: Option[StreamingRelation]) = { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (2fd8517 -> 48377d5)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 2fd8517 [SPARK-34603][SQL] Support ADD ARCHIVE and LIST ARCHIVES command add 48377d5 [SPARK-34676][SQL][TEST] TableCapabilityCheckSuite should not inherit all tests from AnalysisSuite No new revisions were added by this update. Summary of changes: .../org/apache/spark/sql/connector/TableCapabilityCheckSuite.scala| 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.4 updated (eb4601e -> 191b24c)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git. from eb4601e [SPARK-32924][WEBUI] Make duration column in master UI sorted in the correct order add 191b24c [SPARK-34672][BUILD][2.4] Fix docker file for creating release No new revisions were added by this update. Summary of changes: dev/create-release/spark-rm/Dockerfile | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-34603][SQL] Support ADD ARCHIVE and LIST ARCHIVES command
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2fd8517 [SPARK-34603][SQL] Support ADD ARCHIVE and LIST ARCHIVES command 2fd8517 is described below commit 2fd85174e9423673edec5ecb1f1c402ec33472fe Author: Kousuke Saruta AuthorDate: Tue Mar 9 21:28:35 2021 +0900 [SPARK-34603][SQL] Support ADD ARCHIVE and LIST ARCHIVES command ### What changes were proposed in this pull request? This PR adds `ADD ARCHIVE` and `LIST ARCHIVES` commands to SQL and updates relevant documents. SPARK-33530 added `addArchive` and `listArchives` to `SparkContext` but it's not supported yet to add/list archives with SQL. ### Why are the changes needed? To complement features. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added new test and confirmed the generated HTML from the updated documents. Closes #31721 from sarutak/sql-archive. Authored-by: Kousuke Saruta Signed-off-by: HyukjinKwon --- docs/_data/menu-sql.yaml | 4 + ...ql-ref-syntax-aux-resource-mgmt-add-archive.md} | 28 +++ docs/sql-ref-syntax-aux-resource-mgmt-add-file.md | 3 +- docs/sql-ref-syntax-aux-resource-mgmt-add-jar.md | 2 + ...l-ref-syntax-aux-resource-mgmt-list-archive.md} | 33 docs/sql-ref-syntax-aux-resource-mgmt-list-file.md | 5 +- docs/sql-ref-syntax-aux-resource-mgmt-list-jar.md | 6 +- docs/sql-ref-syntax-aux-resource-mgmt.md | 2 + .../spark/sql/execution/SparkSqlParser.scala | 10 ++- .../spark/sql/execution/command/resources.scala| 31 +++ .../spark/sql/hive/execution/HiveQuerySuite.scala | 95 ++ 11 files changed, 183 insertions(+), 36 deletions(-) diff --git a/docs/_data/menu-sql.yaml b/docs/_data/menu-sql.yaml index a9ea6fe..5192422 100644 --- a/docs/_data/menu-sql.yaml +++ b/docs/_data/menu-sql.yaml @@ -263,7 +263,11 @@ url: sql-ref-syntax-aux-resource-mgmt-add-file.html - text: ADD JAR url: sql-ref-syntax-aux-resource-mgmt-add-jar.html +- text: ADD ARCHIVE + url: sql-ref-syntax-aux-resource-mgmt-add-archive.html - text: LIST FILE url: sql-ref-syntax-aux-resource-mgmt-list-file.html - text: LIST JAR url: sql-ref-syntax-aux-resource-mgmt-list-jar.html +- text: LIST ARCHIVE + url: sql-ref-syntax-aux-resource-mgmt-list-archive.html diff --git a/docs/sql-ref-syntax-aux-resource-mgmt-add-file.md b/docs/sql-ref-syntax-aux-resource-mgmt-add-archive.md similarity index 60% copy from docs/sql-ref-syntax-aux-resource-mgmt-add-file.md copy to docs/sql-ref-syntax-aux-resource-mgmt-add-archive.md index 9203293..fa86acd 100644 --- a/docs/sql-ref-syntax-aux-resource-mgmt-add-file.md +++ b/docs/sql-ref-syntax-aux-resource-mgmt-add-archive.md @@ -1,7 +1,7 @@ --- layout: global -title: ADD FILE -displayTitle: ADD FILE +title: ADD ARCHIVE +displayTitle: ADD ARCHIVE license: | Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with @@ -9,9 +9,9 @@ license: | The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at - + http://www.apache.org/licenses/LICENSE-2.0 - + Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. @@ -21,33 +21,33 @@ license: | ### Description -`ADD FILE` can be used to add a single file as well as a directory to the list of resources. The added resource can be listed using [LIST FILE](sql-ref-syntax-aux-resource-mgmt-list-file.html). +`ADD ARCHIVE` can be used to add an archive file to the list of resources. The given archive file should be one of .zip, .tar, .tar.gz, .tgz and .jar. The added archive file can be listed using [LIST ARCHIVE](sql-ref-syntax-aux-resource-mgmt-list-archive.html). ### Syntax ```sql -ADD FILE resource_name +ADD ARCHIVE file_name ``` ### Parameters -* **resource_name** +* **file_name** -The name of the file or directory to be added. +The name of the archive file to be added. It could be either on a local file system or a distributed file system. ### Examples ```sql -ADD FILE /tmp/test; -ADD FILE "/path/to/file/abc.txt"; -ADD FILE '/another/test.txt'; -ADD FILE "/path with space/abc.txt"; -ADD FILE "/path/to/some/directory"; +ADD ARCHIVE /tmp/test.t
[spark] branch branch-3.0 updated: [SPARK-34545][SQL][3.0] Fix issues with valueCompare feature of pyrolite
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new e52da94 [SPARK-34545][SQL][3.0] Fix issues with valueCompare feature of pyrolite e52da94 is described below commit e52da94926b2de7184936f92d454862ba4fff349 Author: Peter Toth AuthorDate: Tue Mar 9 06:16:18 2021 -0600 [SPARK-34545][SQL][3.0] Fix issues with valueCompare feature of pyrolite ### What changes were proposed in this pull request? pyrolite 4.21 introduced and enabled value comparison by default (`valueCompare=true`) during object memoization and serialization: https://github.com/irmen/Pyrolite/blob/pyrolite-4.21/java/src/main/java/net/razorvine/pickle/Pickler.java#L112-L122 This change has undesired effect when we serialize a row (actually `GenericRowWithSchema`) to be passed to python: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L60. A simple example is that ``` new GenericRowWithSchema(Array(1.0, 1.0), StructType(Seq(StructField("_1", DoubleType), StructField("_2", DoubleType ``` and ``` new GenericRowWithSchema(Array(1, 1), StructType(Seq(StructField("_1", IntegerType), StructField("_2", IntegerType ``` are currently equal and the second instance is replaced to the short code of the first one during serialization. ### Why are the changes needed? The above can cause nasty issues like the one in https://issues.apache.org/jira/browse/SPARK-34545 description: ``` >>> from pyspark.sql.functions import udf >>> from pyspark.sql.types import * >>> >>> def udf1(data_type): def u1(e): return e[0] return udf(u1, data_type) >>> >>> df = spark.createDataFrame([((1.0, 1.0), (1, 1))], ['c1', 'c2']) >>> >>> df = df.withColumn("c3", udf1(DoubleType())("c1")) >>> df = df.withColumn("c4", udf1(IntegerType())("c2")) >>> >>> df.select("c3").show() +---+ | c3| +---+ |1.0| +---+ >>> df.select("c4").show() +---+ | c4| +---+ | 1| +---+ >>> df.select("c3", "c4").show() +---++ | c3| c4| +---++ |1.0|null| +---++ ``` This is because during serialization from JVM to Python `GenericRowWithSchema(1.0, 1.0)` (`c1`) is memoized first and when `GenericRowWithSchema(1, 1)` (`c2`) comes next, it is replaced to some short code of the `c1` (instead of serializing `c2` out) as they are `equal()`. The python functions then runs but the return type of `c4` is expected to be `IntegerType` and if a different type (`DoubleType`) comes back from python then it is discarded: https://github.com/apache/spark/blob/bra [...] After this PR: ``` >>> df.select("c3", "c4").show() +---+---+ | c3| c4| +---+---+ |1.0| 1| +---+---+ ``` ### Does this PR introduce _any_ user-facing change? Yes, fixes a correctness issue. ### How was this patch tested? Added new UT + manual tests. Closes #31778 from peter-toth/SPARK-34545-fix-row-comparison-3.0. Authored-by: Peter Toth Signed-off-by: Sean Owen --- .../main/scala/org/apache/spark/api/python/SerDeUtil.scala | 9 ++--- .../org/apache/spark/mllib/api/python/PythonMLLibAPI.scala | 6 -- python/pyspark/sql/tests/test_udf.py| 11 +++ .../spark/sql/execution/python/BatchEvalPythonExec.scala| 13 - 4 files changed, 33 insertions(+), 6 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala b/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala index 01e64b6..0405615 100644 --- a/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala +++ b/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala @@ -146,7 +146,8 @@ private[spark] object SerDeUtil extends Logging { * Choose batch size based on size of objects */ private[spark] class AutoBatchedPickler(iter: Iterator[Any]) extends Iterator[Array[Byte]] { -private val pickle = new Pickler() +private val pickle = new Pickler(/* useMemo = */ true, + /* valueCompare = */ false) private var batch = 1 private val buffer = new mutable.ArrayBuffer[Any] @@ -199,7 +200,8 @@ private[spark] object SerDeUtil extends Logging { } private def checkPickle(t: (Any, Any)): (Boolean, Boolean) = { -val pickle = new Pickler +val pickle = new Pickler(/* useMemo = */ true, + /* valueCompare = */ false) val kt = Try { pickle.dumps(t._1) } @@ -250,7 +252,8 @@ private[spark] object SerDeUtil extends Logging { if (batchSize == 0) { new AutoBatchedPickler(cleaned)
[spark] branch master updated: [SPARK-34649][SQL][DOCS] org.apache.spark.sql.DataFrameNaFunctions.replace() fails for column name having a dot
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a9c1189 [SPARK-34649][SQL][DOCS] org.apache.spark.sql.DataFrameNaFunctions.replace() fails for column name having a dot a9c1189 is described below commit a9c11896a5db3cd6844d5e444ad59e65d9441e7c Author: Amandeep Sharma AuthorDate: Tue Mar 9 11:47:01 2021 + [SPARK-34649][SQL][DOCS] org.apache.spark.sql.DataFrameNaFunctions.replace() fails for column name having a dot ### What changes were proposed in this pull request? Use resolved attributes instead of data-frame fields for replacing values. ### Why are the changes needed? dataframe.na.replace() does not work for column having a dot in the name ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? Added unit tests for the same Closes #31769 from amandeep-sharma/master. Authored-by: Amandeep Sharma Signed-off-by: Wenchen Fan --- docs/sql-migration-guide.md| 2 + .../apache/spark/sql/DataFrameNaFunctions.scala| 42 - .../spark/sql/DataFrameNaFunctionsSuite.scala | 43 -- 3 files changed, 67 insertions(+), 20 deletions(-) diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md index 0e96c6d..5551d56 100644 --- a/docs/sql-migration-guide.md +++ b/docs/sql-migration-guide.md @@ -66,6 +66,8 @@ license: | - In Spark 3.2, the output schema of `SHOW TBLPROPERTIES` becomes `key: string, value: string` whether you specify the table property key or not. In Spark 3.1 and earlier, the output schema of `SHOW TBLPROPERTIES` is `value: string` when you specify the table property key. To restore the old schema with the builtin catalog, you can set `spark.sql.legacy.keepCommandOutputSchema` to `true`. - In Spark 3.2, we support typed literals in the partition spec of INSERT and ADD/DROP/RENAME PARTITION. For example, `ADD PARTITION(dt = date'2020-01-01')` adds a partition with date value `2020-01-01`. In Spark 3.1 and earlier, the partition value will be parsed as string value `date '2020-01-01'`, which is an illegal date value, and we add a partition with null value at the end. + + - In Spark 3.2, `DataFrameNaFunctions.replace()` no longer uses exact string match for the input column names, to match the SQL syntax and support qualified column names. Input column name having a dot in the name (not nested) needs to be escaped with backtick \`. Now, it throws `AnalysisException` if the column is not found in the data frame schema. It also throws `IllegalArgumentException` if the input column name is a nested column. In Spark 3.1 and earlier, it used to ignore invali [...] ## Upgrading from Spark SQL 3.0 to 3.1 diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala index 308bb96..91905f2 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala @@ -327,9 +327,9 @@ final class DataFrameNaFunctions private[sql](df: DataFrame) { */ def replace[T](col: String, replacement: Map[T, T]): DataFrame = { if (col == "*") { - replace0(df.columns, replacement) + replace0(df.logicalPlan.output, replacement) } else { - replace0(Seq(col), replacement) + replace(Seq(col), replacement) } } @@ -352,10 +352,21 @@ final class DataFrameNaFunctions private[sql](df: DataFrame) { * * @since 1.3.1 */ - def replace[T](cols: Seq[String], replacement: Map[T, T]): DataFrame = replace0(cols, replacement) + def replace[T](cols: Seq[String], replacement: Map[T, T]): DataFrame = { +val attrs = cols.map { colName => + // Check column name exists + val attr = df.resolve(colName) match { +case a: Attribute => a +case _ => throw new UnsupportedOperationException( + s"Nested field ${colName} is not supported.") + } + attr +} +replace0(attrs, replacement) + } - private def replace0[T](cols: Seq[String], replacement: Map[T, T]): DataFrame = { -if (replacement.isEmpty || cols.isEmpty) { + private def replace0[T](attrs: Seq[Attribute], replacement: Map[T, T]): DataFrame = { +if (replacement.isEmpty || attrs.isEmpty) { return df } @@ -379,15 +390,13 @@ final class DataFrameNaFunctions private[sql](df: DataFrame) { case _: String => StringType } -val columnEquals = df.sparkSession.sessionState.analyzer.resolver -val projections = df.schema.fields.map { f => - val shouldReplace = cols.exists(colName => columnEquals(colName, f.
[spark] branch master updated (43b23fd -> b5b1985)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 43b23fd [SPARK-33498][SQL][TESTS][FOLLOWUP] Remove SQLConf.withExistingConf in CastSuite add b5b1985 [SPARK-34620][SQL] Code-gen broadcast nested loop join (inner/cross) No new revisions were added by this update. Summary of changes: .../benchmarks/JoinBenchmark-jdk11-results.txt | 71 --- sql/core/benchmarks/JoinBenchmark-results.txt | 71 --- .../sql/execution/WholeStageCodegenExec.scala | 3 +- .../joins/BroadcastNestedLoopJoinExec.scala| 63 +- .../spark/sql/execution/joins/HashJoin.scala | 71 +-- .../sql/execution/joins/JoinCodegenSupport.scala | 96 + .../approved-plans-v1_4/q28.sf100/explain.txt | 92 - .../approved-plans-v1_4/q28.sf100/simplified.txt | 119 +-- .../approved-plans-v1_4/q28/explain.txt| 92 - .../approved-plans-v1_4/q28/simplified.txt | 119 +-- .../approved-plans-v1_4/q61.sf100/explain.txt | 34 ++-- .../approved-plans-v1_4/q61.sf100/simplified.txt | 127 ++-- .../approved-plans-v1_4/q61/explain.txt| 38 ++-- .../approved-plans-v1_4/q61/simplified.txt | 127 ++-- .../approved-plans-v1_4/q77.sf100/explain.txt | 66 +++--- .../approved-plans-v1_4/q77.sf100/simplified.txt | 51 +++-- .../approved-plans-v1_4/q77/explain.txt| 56 ++--- .../approved-plans-v1_4/q77/simplified.txt | 47 +++-- .../approved-plans-v1_4/q88.sf100/explain.txt | 226 ++--- .../approved-plans-v1_4/q88.sf100/simplified.txt | 201 +- .../approved-plans-v1_4/q88/explain.txt| 226 ++--- .../approved-plans-v1_4/q88/simplified.txt | 201 +- .../approved-plans-v1_4/q90.sf100/explain.txt | 38 ++-- .../approved-plans-v1_4/q90.sf100/simplified.txt | 81 .../approved-plans-v1_4/q90/explain.txt| 38 ++-- .../approved-plans-v1_4/q90/simplified.txt | 81 .../approved-plans-v2_7/q22.sf100/explain.txt | 16 +- .../approved-plans-v2_7/q22.sf100/simplified.txt | 75 --- .../approved-plans-v2_7/q22/explain.txt| 24 +-- .../approved-plans-v2_7/q22/simplified.txt | 57 +++--- .../approved-plans-v2_7/q77a.sf100/explain.txt | 80 .../approved-plans-v2_7/q77a.sf100/simplified.txt | 63 +++--- .../approved-plans-v2_7/q77a/explain.txt | 70 +++ .../approved-plans-v2_7/q77a/simplified.txt| 59 +++--- .../sql/execution/WholeStageCodegenSuite.scala | 42 +++- .../sql/execution/benchmark/JoinBenchmark.scala| 14 ++ 36 files changed, 1557 insertions(+), 1378 deletions(-) create mode 100644 sql/core/src/main/scala/org/apache/spark/sql/execution/joins/JoinCodegenSupport.scala - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org