date:20210309

[spark] branch branch-3.1 updated: [SPARK-34681][SQL] Fix bug for full outer shuffled hash join when building left side with non-equal condition

2021-03-09 Thread dongjoon

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 9e61043  [SPARK-34681][SQL] Fix bug for full outer shuffled hash join 
when building left side with non-equal condition
9e61043 is described below

commit 9e610438a6211aa8629637644c512a41332d12a5
Author: Cheng Su 
AuthorDate: Tue Mar 9 22:55:27 2021 -0800

[SPARK-34681][SQL] Fix bug for full outer shuffled hash join when building 
left side with non-equal condition

### What changes were proposed in this pull request?

For a full outer shuffled hash join that builds the hash map on the left side 
and has a non-equal condition, the join can produce wrong results.

The root cause is `boundCondition` in `HashJoin.scala` always assumes the 
left side row is `streamedPlan` and right side row is `buildPlan` 
([streamedPlan.output ++ 
buildPlan.output](https://github.com/apache/spark/blob/branch-3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala#L141)).
 This is a valid assumption, except for the full outer + build left case.

The fix is to correct `boundCondition` in `HashJoin.scala` to handle the full 
outer + build left case properly. See the reproduction in 
https://issues.apache.org/jira/browse/SPARK-32399?focusedCommentId=17298414&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17298414
 .
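
To illustrate the root cause, here is a minimal sketch in plain Python (not Spark code). It assumes a joined row is a flat tuple and that a condition is bound by mapping attribute names to ordinals; the helper `bind`, the attribute names, and the sample row are all hypothetical. It shows how binding against `streamedPlan.output ++ buildPlan.output` reads the wrong columns when the joined row is actually laid out left side first, as in the full outer + build left case.

```python
# Hypothetical attribute names for the two join sides.
left_output = ["l.k", "l.v"]
right_output = ["r.k", "r.v"]


def bind(condition, schema):
    """Bind a condition to a row layout by resolving names to ordinals."""
    ordinal = {name: i for i, name in enumerate(schema)}
    return lambda row: condition(lambda name: row[ordinal[name]])


def condition(get):
    # The non-equal join condition: l.v < r.v
    return get("l.v") < get("r.v")


# Full outer join building the left side: build = left, streamed = right.
build_output, streamed_output = left_output, right_output

# The joined row is laid out left side first (l.k, l.v, r.k, r.v).
joined_row = (1, 10, 1, 20)

# Binding with streamed ++ build assumes the right side comes first.
buggy = bind(condition, streamed_output + build_output)
# Binding with build ++ streamed matches the actual row layout.
fixed = bind(condition, build_output + streamed_output)

print(buggy(joined_row), fixed(joined_row))
```

With the mismatched binding, the predicate compares `r.v` and `l.v` in swapped positions, so a row that satisfies the condition is rejected.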

### Why are the changes needed?

Fix a data correctness bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Changed the test in `OuterJoinSuite.scala` to cover full outer shuffled 
hash join.
Before this change, the unit test `basic full outer join using 
ShuffledHashJoin` in `OuterJoinSuite.scala` failed.

Closes #31792 from c21/join-bugfix.

Authored-by: Cheng Su 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit a916690dd9aac40df38922dbea233785354a2f2a)
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/execution/joins/HashJoin.scala   |  8 +++-
 .../spark/sql/execution/joins/OuterJoinSuite.scala | 22 ++
 2 files changed, 17 insertions(+), 13 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
index 53bd591..42219ee 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
@@ -138,7 +138,13 @@ trait HashJoin extends BaseJoinExec with CodegenSupport {
 UnsafeProjection.create(streamedBoundKeys)
 
   @transient protected[this] lazy val boundCondition = if 
(condition.isDefined) {
-Predicate.create(condition.get, streamedPlan.output ++ 
buildPlan.output).eval _
+if (joinType == FullOuter && buildSide == BuildLeft) {
+  // Put join left side before right side. This is to be consistent with
+  // `ShuffledHashJoinExec.fullOuterJoin`.
+  Predicate.create(condition.get, buildPlan.output ++ 
streamedPlan.output).eval _
+} else {
+  Predicate.create(condition.get, streamedPlan.output ++ 
buildPlan.output).eval _
+}
   } else {
 (r: InternalRow) => true
   }
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/OuterJoinSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/OuterJoinSuite.scala
index 9f7e0a1..238d37a 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/OuterJoinSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/OuterJoinSuite.scala
@@ -104,18 +104,16 @@ class OuterJoinSuite extends SparkPlanTest with 
SharedSparkSession {
   ExtractEquiJoinKeys.unapply(join)
 }
 
-if (joinType != FullOuter) {
-  test(s"$testName using ShuffledHashJoin") {
-extractJoinParts().foreach { case (_, leftKeys, rightKeys, 
boundCondition, _, _, _) =>
-  withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "1") {
-val buildSide = if (joinType == LeftOuter) BuildRight else 
BuildLeft
-checkAnswer2(leftRows, rightRows, (left: SparkPlan, right: 
SparkPlan) =>
-  EnsureRequirements.apply(
-ShuffledHashJoinExec(
-  leftKeys, rightKeys, joinType, buildSide, boundCondition, 
left, right)),
-  expectedAnswer.map(Row.fromTuple),
-  sortAnswers = true)
-  }
+test(s"$testName using ShuffledHashJoin") {
+  extractJoinParts().foreach { case (_, leftKeys, rightKeys, 
boundCondition, _, _, _) =>
+withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "1") {
+  val buildSide = if (joinType == LeftOuter) BuildRight else BuildLeft
+  checkAnswer2(leftRow

[spark] branch master updated (48377d5 -> a916690)

2021-03-09 Thread dongjoon

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 48377d5  [SPARK-34676][SQL][TEST] TableCapabilityCheckSuite should not 
inherit all tests from AnalysisSuite
 add a916690  [SPARK-34681][SQL] Fix bug for full outer shuffled hash join 
when building left side with non-equal condition

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/execution/joins/HashJoin.scala   |  8 +++-
 .../spark/sql/execution/joins/OuterJoinSuite.scala | 22 ++
 2 files changed, 17 insertions(+), 13 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

[spark] branch branch-3.0 updated: [SPARK-34676][SQL][TEST] TableCapabilityCheckSuite should not inherit all tests from AnalysisSuite

2021-03-09 Thread dongjoon

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new 7502624  [SPARK-34676][SQL][TEST] TableCapabilityCheckSuite should not 
inherit all tests from AnalysisSuite
7502624 is described below

commit 75026248632df8804aeeb439a5f0f0b3729ef6b3
Author: Wenchen Fan 
AuthorDate: Tue Mar 9 09:02:31 2021 -0800

[SPARK-34676][SQL][TEST] TableCapabilityCheckSuite should not inherit all 
tests from AnalysisSuite

### What changes were proposed in this pull request?

Fixes a mistake in `TableCapabilityCheckSuite`, which runs some tests 
repeatedly.

### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #31788 from cloud-fan/minor.

Authored-by: Wenchen Fan 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 48377d5bd9544baf7df928aa315df2504c062ac2)
Signed-off-by: Dongjoon Hyun 
---
 .../org/apache/spark/sql/connector/TableCapabilityCheckSuite.scala| 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/connector/TableCapabilityCheckSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/connector/TableCapabilityCheckSuite.scala
index 23e4c293..1b1737f 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/connector/TableCapabilityCheckSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/connector/TableCapabilityCheckSuite.scala
@@ -22,7 +22,7 @@ import java.util
 import scala.collection.JavaConverters._
 
 import org.apache.spark.sql.{AnalysisException, DataFrame, SQLContext}
-import org.apache.spark.sql.catalyst.analysis.{AnalysisSuite, NamedRelation}
+import org.apache.spark.sql.catalyst.analysis.{AnalysisTest, NamedRelation}
 import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, 
Literal}
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.connector.catalog.{CatalogPlugin, Identifier, 
Table, TableCapability, TableProvider}
@@ -35,7 +35,7 @@ import org.apache.spark.sql.test.SharedSparkSession
 import org.apache.spark.sql.types.{LongType, StringType, StructType}
 import org.apache.spark.sql.util.CaseInsensitiveStringMap
 
-class TableCapabilityCheckSuite extends AnalysisSuite with SharedSparkSession {
+class TableCapabilityCheckSuite extends AnalysisTest with SharedSparkSession {
 
   private val emptyMap = CaseInsensitiveStringMap.empty
   private def createStreamingRelation(table: Table, v1Relation: 
Option[StreamingRelation]) = {



[spark] branch branch-3.1 updated: [SPARK-34676][SQL][TEST] TableCapabilityCheckSuite should not inherit all tests from AnalysisSuite

2021-03-09 Thread dongjoon

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 525aa13  [SPARK-34676][SQL][TEST] TableCapabilityCheckSuite should not 
inherit all tests from AnalysisSuite
525aa13 is described below

commit 525aa136d29b520d6dbc5df9962a13eb316d12a5
Author: Wenchen Fan 
AuthorDate: Tue Mar 9 09:02:31 2021 -0800

[SPARK-34676][SQL][TEST] TableCapabilityCheckSuite should not inherit all 
tests from AnalysisSuite

### What changes were proposed in this pull request?

Fixes a mistake in `TableCapabilityCheckSuite`, which runs some tests 
repeatedly.

### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #31788 from cloud-fan/minor.

Authored-by: Wenchen Fan 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 48377d5bd9544baf7df928aa315df2504c062ac2)
Signed-off-by: Dongjoon Hyun 
---
 .../org/apache/spark/sql/connector/TableCapabilityCheckSuite.scala| 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/connector/TableCapabilityCheckSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/connector/TableCapabilityCheckSuite.scala
index bad21aa..ce94d3b 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/connector/TableCapabilityCheckSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/connector/TableCapabilityCheckSuite.scala
@@ -22,7 +22,7 @@ import java.util
 import scala.collection.JavaConverters._
 
 import org.apache.spark.sql.{AnalysisException, DataFrame, SQLContext}
-import org.apache.spark.sql.catalyst.analysis.{AnalysisSuite, NamedRelation}
+import org.apache.spark.sql.catalyst.analysis.{AnalysisTest, NamedRelation}
 import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, 
Literal}
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.streaming.StreamingRelationV2
@@ -36,7 +36,7 @@ import org.apache.spark.sql.test.SharedSparkSession
 import org.apache.spark.sql.types.{LongType, StringType, StructType}
 import org.apache.spark.sql.util.CaseInsensitiveStringMap
 
-class TableCapabilityCheckSuite extends AnalysisSuite with SharedSparkSession {
+class TableCapabilityCheckSuite extends AnalysisTest with SharedSparkSession {
 
   private val emptyMap = CaseInsensitiveStringMap.empty
   private def createStreamingRelation(table: Table, v1Relation: 
Option[StreamingRelation]) = {



[spark] branch master updated (2fd8517 -> 48377d5)

2021-03-09 Thread dongjoon

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 2fd8517  [SPARK-34603][SQL] Support ADD ARCHIVE and LIST ARCHIVES 
command
 add 48377d5  [SPARK-34676][SQL][TEST] TableCapabilityCheckSuite should not 
inherit all tests from AnalysisSuite

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/connector/TableCapabilityCheckSuite.scala| 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)



[spark] branch branch-2.4 updated (eb4601e -> 191b24c)

2021-03-09 Thread gurwls223

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git.


from eb4601e  [SPARK-32924][WEBUI] Make duration column in master UI sorted 
in the correct order
 add 191b24c  [SPARK-34672][BUILD][2.4] Fix docker file for creating release

No new revisions were added by this update.

Summary of changes:
 dev/create-release/spark-rm/Dockerfile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)



[spark] branch master updated: [SPARK-34603][SQL] Support ADD ARCHIVE and LIST ARCHIVES command

2021-03-09 Thread gurwls223

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2fd8517  [SPARK-34603][SQL] Support ADD ARCHIVE and LIST ARCHIVES 
command
2fd8517 is described below

commit 2fd85174e9423673edec5ecb1f1c402ec33472fe
Author: Kousuke Saruta 
AuthorDate: Tue Mar 9 21:28:35 2021 +0900

[SPARK-34603][SQL] Support ADD ARCHIVE and LIST ARCHIVES command

### What changes were proposed in this pull request?

This PR adds `ADD ARCHIVE` and `LIST ARCHIVES` commands to SQL and updates 
relevant documents.
SPARK-33530 added `addArchive` and `listArchives` to `SparkContext`, but 
adding and listing archives through SQL was not yet supported.

### Why are the changes needed?

To complement features.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added new test and confirmed the generated HTML from the updated documents.

Closes #31721 from sarutak/sql-archive.

Authored-by: Kousuke Saruta 
Signed-off-by: HyukjinKwon 
---
 docs/_data/menu-sql.yaml   |  4 +
 ...ql-ref-syntax-aux-resource-mgmt-add-archive.md} | 28 +++
 docs/sql-ref-syntax-aux-resource-mgmt-add-file.md  |  3 +-
 docs/sql-ref-syntax-aux-resource-mgmt-add-jar.md   |  2 +
 ...l-ref-syntax-aux-resource-mgmt-list-archive.md} | 33 
 docs/sql-ref-syntax-aux-resource-mgmt-list-file.md |  5 +-
 docs/sql-ref-syntax-aux-resource-mgmt-list-jar.md  |  6 +-
 docs/sql-ref-syntax-aux-resource-mgmt.md   |  2 +
 .../spark/sql/execution/SparkSqlParser.scala   | 10 ++-
 .../spark/sql/execution/command/resources.scala| 31 +++
 .../spark/sql/hive/execution/HiveQuerySuite.scala  | 95 ++
 11 files changed, 183 insertions(+), 36 deletions(-)

diff --git a/docs/_data/menu-sql.yaml b/docs/_data/menu-sql.yaml
index a9ea6fe..5192422 100644
--- a/docs/_data/menu-sql.yaml
+++ b/docs/_data/menu-sql.yaml
@@ -263,7 +263,11 @@
   url: sql-ref-syntax-aux-resource-mgmt-add-file.html
 - text: ADD JAR
   url: sql-ref-syntax-aux-resource-mgmt-add-jar.html
+- text: ADD ARCHIVE
+  url: sql-ref-syntax-aux-resource-mgmt-add-archive.html
 - text: LIST FILE
   url: sql-ref-syntax-aux-resource-mgmt-list-file.html
 - text: LIST JAR
   url: sql-ref-syntax-aux-resource-mgmt-list-jar.html
+- text: LIST ARCHIVE
+  url: sql-ref-syntax-aux-resource-mgmt-list-archive.html
diff --git a/docs/sql-ref-syntax-aux-resource-mgmt-add-file.md 
b/docs/sql-ref-syntax-aux-resource-mgmt-add-archive.md
similarity index 60%
copy from docs/sql-ref-syntax-aux-resource-mgmt-add-file.md
copy to docs/sql-ref-syntax-aux-resource-mgmt-add-archive.md
index 9203293..fa86acd 100644
--- a/docs/sql-ref-syntax-aux-resource-mgmt-add-file.md
+++ b/docs/sql-ref-syntax-aux-resource-mgmt-add-archive.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: ADD FILE
-displayTitle: ADD FILE
+title: ADD ARCHIVE
+displayTitle: ADD ARCHIVE
 license: |
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
@@ -9,9 +9,9 @@ license: |
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at
- 
+
  http://www.apache.org/licenses/LICENSE-2.0
- 
+
   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@@ -21,33 +21,33 @@ license: |
 
 ### Description
 
-`ADD FILE` can be used to add a single file as well as a directory to the list 
of resources. The added resource can be listed using [LIST 
FILE](sql-ref-syntax-aux-resource-mgmt-list-file.html).
+`ADD ARCHIVE` can be used to add an archive file to the list of resources. The 
given archive file should be one of .zip, .tar, .tar.gz, .tgz and .jar. The 
added archive file can be listed using [LIST 
ARCHIVE](sql-ref-syntax-aux-resource-mgmt-list-archive.html).
 
 ### Syntax
 
 ```sql
-ADD FILE resource_name
+ADD ARCHIVE file_name
 ```
 
 ### Parameters
 
-* **resource_name**
+* **file_name**
 
-The name of the file or directory to be added.
+The name of the archive file to be added. It could be either on a local 
file system or a distributed file system.
 
 ### Examples
 
 ```sql
-ADD FILE /tmp/test;
-ADD FILE "/path/to/file/abc.txt";
-ADD FILE '/another/test.txt';
-ADD FILE "/path with space/abc.txt";
-ADD FILE "/path/to/some/directory";
+ADD ARCHIVE /tmp/test.t

[spark] branch branch-3.0 updated: [SPARK-34545][SQL][3.0] Fix issues with valueCompare feature of pyrolite

2021-03-09 Thread srowen

This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new e52da94  [SPARK-34545][SQL][3.0] Fix issues with valueCompare feature 
of pyrolite
e52da94 is described below

commit e52da94926b2de7184936f92d454862ba4fff349
Author: Peter Toth 
AuthorDate: Tue Mar 9 06:16:18 2021 -0600

[SPARK-34545][SQL][3.0] Fix issues with valueCompare feature of pyrolite

### What changes were proposed in this pull request?

pyrolite 4.21 introduced and enabled value comparison by default 
(`valueCompare=true`) during object memoization and serialization: 
https://github.com/irmen/Pyrolite/blob/pyrolite-4.21/java/src/main/java/net/razorvine/pickle/Pickler.java#L112-L122
This change has an undesired effect when we serialize a row (actually 
`GenericRowWithSchema`) to be passed to Python: 
https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L60.
 A simple example is that
```
new GenericRowWithSchema(Array(1.0, 1.0), StructType(Seq(StructField("_1", 
DoubleType), StructField("_2", DoubleType))))
```
and
```
new GenericRowWithSchema(Array(1, 1), StructType(Seq(StructField("_1", 
IntegerType), StructField("_2", IntegerType))))
```
are currently considered equal, so the second instance is replaced with the 
short code of the first one during serialization.
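
The effect can be sketched in plain Python, independent of pyrolite. The `serialize` and `deserialize` functions below are hypothetical and only model a memo table that is keyed either by value or by identity; they are not the pickle protocol. The sketch relies on the fact that in Python `(1.0, 1.0) == (1, 1)` and both tuples hash equally, much like the two rows above.

```python
def serialize(objs, value_compare):
    """Emit each object once; later matches become back-references."""
    memo = {}  # memo key -> position of the first occurrence in `out`
    out = []
    for obj in objs:
        # Value comparison keys the memo by equality, otherwise by identity.
        key = obj if value_compare else id(obj)
        if key in memo:
            out.append(("ref", memo[key]))
        else:
            memo[key] = len(out)
            out.append(("obj", obj))
    return out


def deserialize(stream):
    """Resolve back-references to the object they point at."""
    return [stream[payload][1] if tag == "ref" else payload
            for tag, payload in stream]


c1 = (1.0, 1.0)  # stands in for the DoubleType row
c2 = (1, 1)      # stands in for the IntegerType row

with_value_compare = deserialize(serialize([c1, c2], value_compare=True))
without_value_compare = deserialize(serialize([c1, c2], value_compare=False))

# With value comparison, c2 comes back as a copy of c1 (floats, not ints).
print(type(with_value_compare[1][0]).__name__)
print(type(without_value_compare[1][0]).__name__)
```

Keying the memo by identity keeps the two objects distinct, which corresponds to constructing the `Pickler` with `valueCompare = false`.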

### Why are the changes needed?
The above can cause nasty issues like the one described in 
https://issues.apache.org/jira/browse/SPARK-34545:

```
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import *
>>>
>>> def udf1(data_type):
        def u1(e):
            return e[0]
        return udf(u1, data_type)
>>>
>>> df = spark.createDataFrame([((1.0, 1.0), (1, 1))], ['c1', 'c2'])
>>>
>>> df = df.withColumn("c3", udf1(DoubleType())("c1"))
>>> df = df.withColumn("c4", udf1(IntegerType())("c2"))
>>>
>>> df.select("c3").show()
+---+
| c3|
+---+
|1.0|
+---+

>>> df.select("c4").show()
+---+
| c4|
+---+
|  1|
+---+

>>> df.select("c3", "c4").show()
+---+----+
| c3|  c4|
+---+----+
|1.0|null|
+---+----+
```
This is because, during serialization from the JVM to Python, 
`GenericRowWithSchema(1.0, 1.0)` (`c1`) is memoized first, and when 
`GenericRowWithSchema(1, 1)` (`c2`) comes next, it is replaced with a short 
code referring to `c1` (instead of serializing `c2` out) because they are 
`equal()`. The Python function then runs, but the return type of `c4` is 
expected to be `IntegerType`, and if a different type (`DoubleType`) comes back 
from Python then it is discarded: https://github.com/apache/spark/blob/bra [...]

After this PR:
```
>>> df.select("c3", "c4").show()
+---+---+
| c3| c4|
+---+---+
|1.0|  1|
+---+---+
```

### Does this PR introduce _any_ user-facing change?
Yes, fixes a correctness issue.

### How was this patch tested?
Added new UT + manual tests.

Closes #31778 from peter-toth/SPARK-34545-fix-row-comparison-3.0.

Authored-by: Peter Toth 
Signed-off-by: Sean Owen 
---
 .../main/scala/org/apache/spark/api/python/SerDeUtil.scala  |  9 ++---
 .../org/apache/spark/mllib/api/python/PythonMLLibAPI.scala  |  6 --
 python/pyspark/sql/tests/test_udf.py| 11 +++
 .../spark/sql/execution/python/BatchEvalPythonExec.scala| 13 -
 4 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala 
b/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala
index 01e64b6..0405615 100644
--- a/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala
@@ -146,7 +146,8 @@ private[spark] object SerDeUtil extends Logging {
* Choose batch size based on size of objects
*/
   private[spark] class AutoBatchedPickler(iter: Iterator[Any]) extends 
Iterator[Array[Byte]] {
-private val pickle = new Pickler()
+private val pickle = new Pickler(/* useMemo = */ true,
+  /* valueCompare = */ false)
 private var batch = 1
 private val buffer = new mutable.ArrayBuffer[Any]
 
@@ -199,7 +200,8 @@ private[spark] object SerDeUtil extends Logging {
   }
 
   private def checkPickle(t: (Any, Any)): (Boolean, Boolean) = {
-val pickle = new Pickler
+val pickle = new Pickler(/* useMemo = */ true,
+  /* valueCompare = */ false)
 val kt = Try {
   pickle.dumps(t._1)
 }
@@ -250,7 +252,8 @@ private[spark] object SerDeUtil extends Logging {
   if (batchSize == 0) {
 new AutoBatchedPickler(cleaned)

[spark] branch master updated: [SPARK-34649][SQL][DOCS] org.apache.spark.sql.DataFrameNaFunctions.replace() fails for column name having a dot

2021-03-09 Thread wenchen

This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a9c1189  [SPARK-34649][SQL][DOCS] 
org.apache.spark.sql.DataFrameNaFunctions.replace() fails for column name 
having a dot
a9c1189 is described below

commit a9c11896a5db3cd6844d5e444ad59e65d9441e7c
Author: Amandeep Sharma 
AuthorDate: Tue Mar 9 11:47:01 2021 +

[SPARK-34649][SQL][DOCS] 
org.apache.spark.sql.DataFrameNaFunctions.replace() fails for column name 
having a dot

### What changes were proposed in this pull request?

Use resolved attributes instead of data-frame fields for replacing values.

### Why are the changes needed?

`dataframe.na.replace()` does not work for a column that has a dot in its name.
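
The migration note added by this commit says that a column name containing a dot must be escaped with backticks. The sketch below is a simplified, hypothetical model of that rule in plain Python; `parse_column_name` and `resolve_top_level` are illustrative helpers, not Spark's parser or resolver, and the exception type is a stand-in.

```python
def parse_column_name(name):
    """Split a name on dots, treating backtick-quoted text as one part."""
    parts, current, in_backticks = [], [], False
    for ch in name:
        if ch == "`":
            in_backticks = not in_backticks
        elif ch == "." and not in_backticks:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if in_backticks:
        raise ValueError(f"unbalanced backtick in {name!r}")
    parts.append("".join(current))
    return parts


def resolve_top_level(columns, name):
    """Return the matching top-level column, or raise if there is none."""
    parts = parse_column_name(name)
    if len(parts) == 1 and parts[0] in columns:
        return parts[0]
    raise LookupError(f"cannot resolve {name!r} as a top-level column")


columns = ["x", "a.b"]  # "a.b" is one column whose name contains a dot

print(parse_column_name("a.b"))            # split into two parts
print(parse_column_name("`a.b`"))          # kept as a single part
print(resolve_top_level(columns, "`a.b`"))
```

Without the backticks, `a.b` is read as field `b` of a column `a`, which does not exist in this hypothetical schema, so the lookup fails.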

### Does this PR introduce _any_ user-facing change?

None

### How was this patch tested?

Added unit tests for this case.

Closes #31769 from amandeep-sharma/master.

Authored-by: Amandeep Sharma 
Signed-off-by: Wenchen Fan 
---
 docs/sql-migration-guide.md|  2 +
 .../apache/spark/sql/DataFrameNaFunctions.scala| 42 -
 .../spark/sql/DataFrameNaFunctionsSuite.scala  | 43 --
 3 files changed, 67 insertions(+), 20 deletions(-)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 0e96c6d..5551d56 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -66,6 +66,8 @@ license: |
   - In Spark 3.2, the output schema of `SHOW TBLPROPERTIES` becomes `key: 
string, value: string` whether you specify the table property key or not. In 
Spark 3.1 and earlier, the output schema of `SHOW TBLPROPERTIES` is `value: 
string` when you specify the table property key. To restore the old schema with 
the builtin catalog, you can set `spark.sql.legacy.keepCommandOutputSchema` to 
`true`.
 
   - In Spark 3.2, we support typed literals in the partition spec of INSERT 
and ADD/DROP/RENAME PARTITION. For example, `ADD PARTITION(dt = 
date'2020-01-01')` adds a partition with date value `2020-01-01`. In Spark 3.1 
and earlier, the partition value will be parsed as string value `date 
'2020-01-01'`, which is an illegal date value, and we add a partition with null 
value at the end.
+  
+  - In Spark 3.2, `DataFrameNaFunctions.replace()` no longer uses exact string 
match for the input column names, to match the SQL syntax and support qualified 
column names. Input column name having a dot in the name (not nested) needs to 
be escaped with backtick \`. Now, it throws `AnalysisException` if the column 
is not found in the data frame schema. It also throws 
`IllegalArgumentException` if the input column name is a nested column. In 
Spark 3.1 and earlier, it used to ignore invali [...]
 
 ## Upgrading from Spark SQL 3.0 to 3.1
 
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala
index 308bb96..91905f2 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala
@@ -327,9 +327,9 @@ final class DataFrameNaFunctions private[sql](df: 
DataFrame) {
*/
   def replace[T](col: String, replacement: Map[T, T]): DataFrame = {
 if (col == "*") {
-  replace0(df.columns, replacement)
+  replace0(df.logicalPlan.output, replacement)
 } else {
-  replace0(Seq(col), replacement)
+  replace(Seq(col), replacement)
 }
   }
 
@@ -352,10 +352,21 @@ final class DataFrameNaFunctions private[sql](df: 
DataFrame) {
*
* @since 1.3.1
*/
-  def replace[T](cols: Seq[String], replacement: Map[T, T]): DataFrame = 
replace0(cols, replacement)
+  def replace[T](cols: Seq[String], replacement: Map[T, T]): DataFrame = {
+val attrs = cols.map { colName =>
+  // Check column name exists
+  val attr = df.resolve(colName) match {
+case a: Attribute => a
+case _ => throw new UnsupportedOperationException(
+  s"Nested field ${colName} is not supported.")
+  }
+  attr
+}
+replace0(attrs, replacement)
+  }
 
-  private def replace0[T](cols: Seq[String], replacement: Map[T, T]): 
DataFrame = {
-if (replacement.isEmpty || cols.isEmpty) {
+  private def replace0[T](attrs: Seq[Attribute], replacement: Map[T, T]): 
DataFrame = {
+if (replacement.isEmpty || attrs.isEmpty) {
   return df
 }
 
@@ -379,15 +390,13 @@ final class DataFrameNaFunctions private[sql](df: 
DataFrame) {
   case _: String => StringType
 }
 
-val columnEquals = df.sparkSession.sessionState.analyzer.resolver
-val projections = df.schema.fields.map { f =>
-  val shouldReplace = cols.exists(colName => columnEquals(colName, f.

[spark] branch master updated (43b23fd -> b5b1985)

2021-03-09 Thread wenchen

This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 43b23fd  [SPARK-33498][SQL][TESTS][FOLLOWUP] Remove 
SQLConf.withExistingConf in CastSuite
 add b5b1985  [SPARK-34620][SQL] Code-gen broadcast nested loop join 
(inner/cross)

No new revisions were added by this update.

Summary of changes:
 .../benchmarks/JoinBenchmark-jdk11-results.txt |  71 ---
 sql/core/benchmarks/JoinBenchmark-results.txt  |  71 ---
 .../sql/execution/WholeStageCodegenExec.scala  |   3 +-
 .../joins/BroadcastNestedLoopJoinExec.scala|  63 +-
 .../spark/sql/execution/joins/HashJoin.scala   |  71 +--
 .../sql/execution/joins/JoinCodegenSupport.scala   |  96 +
 .../approved-plans-v1_4/q28.sf100/explain.txt  |  92 -
 .../approved-plans-v1_4/q28.sf100/simplified.txt   | 119 +--
 .../approved-plans-v1_4/q28/explain.txt|  92 -
 .../approved-plans-v1_4/q28/simplified.txt | 119 +--
 .../approved-plans-v1_4/q61.sf100/explain.txt  |  34 ++--
 .../approved-plans-v1_4/q61.sf100/simplified.txt   | 127 ++--
 .../approved-plans-v1_4/q61/explain.txt|  38 ++--
 .../approved-plans-v1_4/q61/simplified.txt | 127 ++--
 .../approved-plans-v1_4/q77.sf100/explain.txt  |  66 +++---
 .../approved-plans-v1_4/q77.sf100/simplified.txt   |  51 +++--
 .../approved-plans-v1_4/q77/explain.txt|  56 ++---
 .../approved-plans-v1_4/q77/simplified.txt |  47 +++--
 .../approved-plans-v1_4/q88.sf100/explain.txt  | 226 ++---
 .../approved-plans-v1_4/q88.sf100/simplified.txt   | 201 +-
 .../approved-plans-v1_4/q88/explain.txt| 226 ++---
 .../approved-plans-v1_4/q88/simplified.txt | 201 +-
 .../approved-plans-v1_4/q90.sf100/explain.txt  |  38 ++--
 .../approved-plans-v1_4/q90.sf100/simplified.txt   |  81 
 .../approved-plans-v1_4/q90/explain.txt|  38 ++--
 .../approved-plans-v1_4/q90/simplified.txt |  81 
 .../approved-plans-v2_7/q22.sf100/explain.txt  |  16 +-
 .../approved-plans-v2_7/q22.sf100/simplified.txt   |  75 ---
 .../approved-plans-v2_7/q22/explain.txt|  24 +--
 .../approved-plans-v2_7/q22/simplified.txt |  57 +++---
 .../approved-plans-v2_7/q77a.sf100/explain.txt |  80 
 .../approved-plans-v2_7/q77a.sf100/simplified.txt  |  63 +++---
 .../approved-plans-v2_7/q77a/explain.txt   |  70 +++
 .../approved-plans-v2_7/q77a/simplified.txt|  59 +++---
 .../sql/execution/WholeStageCodegenSuite.scala |  42 +++-
 .../sql/execution/benchmark/JoinBenchmark.scala|  14 ++
 36 files changed, 1557 insertions(+), 1378 deletions(-)
 create mode 100644 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/JoinCodegenSupport.scala


