[spark] branch master updated (61c4057c3fe -> 99739ae068d)

2023-01-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 61c4057c3fe [SPARK-41905][CONNECT] Support name as strings in slice
 add 99739ae068d [SPARK-41840][CONNECT][TESTS] Remove the invalid JIRA in 
the comment

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/tests/connect/test_parity_dataframe.py |  9 +++--
 python/pyspark/sql/tests/connect/test_parity_functions.py | 14 ++
 2 files changed, 9 insertions(+), 14 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (b7dbfa2c376 -> 61c4057c3fe)

2023-01-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from b7dbfa2c376 [SPARK-41921][CONNECT][TESTS] Enable doctests in 
connect.column and connect.functions
 add 61c4057c3fe [SPARK-41905][CONNECT] Support name as strings in slice

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/functions.py   | 10 +-
 python/pyspark/sql/tests/connect/test_parity_functions.py |  5 -
 2 files changed, 5 insertions(+), 10 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (ec60670818e -> b7dbfa2c376)

2023-01-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from ec60670818e [SPARK-41906][CONNECT][TESTS] Reenable rand test in Spark 
Connect
 add b7dbfa2c376 [SPARK-41921][CONNECT][TESTS] Enable doctests in 
connect.column and connect.functions

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/column.py| 4 
 python/pyspark/sql/connect/functions.py | 3 ---
 2 files changed, 7 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (b5d162bb59c -> ec60670818e)

2023-01-05 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from b5d162bb59c [SPARK-41869][CONNECT] Reject single string in 
dropDuplicates
 add ec60670818e [SPARK-41906][CONNECT][TESTS] Reenable rand test in Spark 
Connect

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/tests/connect/test_parity_functions.py | 5 -
 python/pyspark/sql/tests/test_functions.py| 3 +--
 2 files changed, 1 insertion(+), 7 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (3a3bc77f3de -> b5d162bb59c)

2023-01-05 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 3a3bc77f3de Revert "[SPARK-41677][CORE][SQL][SS][UI] Add Protobuf 
serializer for StreamingQueryProgressWrapper"
 add b5d162bb59c [SPARK-41869][CONNECT] Reject single string in 
dropDuplicates

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/dataframe.py   | 3 +++
 python/pyspark/sql/tests/connect/test_parity_dataframe.py | 5 -
 2 files changed, 3 insertions(+), 5 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: Revert "[SPARK-41677][CORE][SQL][SS][UI] Add Protobuf serializer for StreamingQueryProgressWrapper"

2023-01-05 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 3a3bc77f3de Revert "[SPARK-41677][CORE][SQL][SS][UI] Add Protobuf 
serializer for StreamingQueryProgressWrapper"
3a3bc77f3de is described below

commit 3a3bc77f3dea368ca0b434a3f8a9629b5d69a5ca
Author: Gengliang Wang 
AuthorDate: Thu Jan 5 20:28:55 2023 -0800

Revert "[SPARK-41677][CORE][SQL][SS][UI] Add Protobuf serializer for 
StreamingQueryProgressWrapper"

### What changes were proposed in this pull request?

This reverts commit 915e9c67a9581a1f66e70321879092d854c9fb3b.

### Why are the changes needed?

When running end-to-end tests, there are 5 NPE errors from string fields:

- SourceProgress.latestOffset
- SourceProgress.endOffset
- SourceProgress.startOffset
- StreamingQueryData.name
- StreamingQueryProgress.name

After fixing them, there is the following error:
```
java.lang.UnsupportedOperationException
at 
java.base/java.util.Collections$UnmodifiableMap.remove(Collections.java:1460)
at 
org.apache.spark.sql.streaming.ui.StreamingQueryStatisticsPage.$anonfun$generateStatTable$27(StreamingQueryStatisticsPage.scala:401)
```
The deserialized map `StreamingQueryProgress.durationMs` needs to be 
mutable.

Given that the StreamingQueryProgressWrapper contains nullable fields and a
mutable map, I suggest using the default JSON serializer for this class.
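As a side note, here is a minimal standalone sketch of the mutability issue (the
map contents are illustrative only, not taken from the failing test):

```scala
import java.util.Collections
import scala.collection.mutable

// An unmodifiable Java map throws on mutation -- the same
// UnsupportedOperationException the streaming UI page hit above.
val javaMap = new java.util.HashMap[String, java.lang.Long]()
javaMap.put("triggerExecution", java.lang.Long.valueOf(100L))
val frozen = Collections.unmodifiableMap(javaMap)
// frozen.remove("triggerExecution")  // would throw UnsupportedOperationException

// Deserializing into a mutable Scala map instead keeps removal legal,
// which is what consumers of StreamingQueryProgress.durationMs rely on.
val durationMs = mutable.Map[String, Long]("triggerExecution" -> 100L)
durationMs.remove("triggerExecution")  // Some(100)
```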

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GA tests

Closes #39416 from gengliangwang/revertSS.

Authored-by: Gengliang Wang 
Signed-off-by: Gengliang Wang 
---
 .../apache/spark/status/protobuf/store_types.proto |  51 ---
 .../org.apache.spark.status.protobuf.ProtobufSerDe |   1 -
 .../org/apache/spark/sql/streaming/progress.scala  |   8 +-
 .../ui/StreamingQueryStatusListener.scala  |   2 +-
 .../protobuf/sql/SinkProgressSerializer.scala  |  42 -
 .../protobuf/sql/SourceProgressSerializer.scala|  65 
 .../sql/StateOperatorProgressSerializer.scala  |  75 -
 .../sql/StreamingQueryProgressSerializer.scala |  89 ---
 .../StreamingQueryProgressWrapperSerializer.scala  |  40 -
 .../sql/KVStoreProtobufSerializerSuite.scala   | 170 +
 10 files changed, 6 insertions(+), 537 deletions(-)

diff --git 
a/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto 
b/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto
index 1c3e5bfc49a..499fda34174 100644
--- a/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto
+++ b/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto
@@ -686,54 +686,3 @@ message ExecutorPeakMetricsDistributions {
   repeated double quantiles = 1;
   repeated ExecutorMetrics executor_metrics = 2;
 }
-
-message StateOperatorProgress {
-  string operator_name = 1;
-  int64 num_rows_total = 2;
-  int64 num_rows_updated = 3;
-  int64 all_updates_time_ms = 4;
-  int64 num_rows_removed = 5;
-  int64 all_removals_time_ms = 6;
-  int64 commit_time_ms = 7;
-  int64 memory_used_bytes = 8;
-  int64 num_rows_dropped_by_watermark = 9;
-  int64 num_shuffle_partitions = 10;
-  int64 num_state_store_instances = 11;
-  map custom_metrics = 12;
-}
-
-message SourceProgress {
-  string description = 1;
-  string start_offset = 2;
-  string end_offset = 3;
-  string latest_offset = 4;
-  int64 num_input_rows = 5;
-  double input_rows_per_second = 6;
-  double processed_rows_per_second = 7;
-  map metrics = 8;
-}
-
-message SinkProgress {
-  string description = 1;
-  int64 num_output_rows = 2;
-  map metrics = 3;
-}
-
-message StreamingQueryProgress {
-  string id = 1;
-  string run_id = 2;
-  string name = 3;
-  string timestamp = 4;
-  int64 batch_id = 5;
-  int64 batch_duration = 6;
-  map duration_ms = 7;
-  map event_time = 8;
-  repeated StateOperatorProgress state_operators = 9;
-  repeated SourceProgress sources = 10;
-  SinkProgress sink = 11;
-  map observed_metrics = 12;
-}
-
-message StreamingQueryProgressWrapper {
-  StreamingQueryProgress progress = 1;
-}
diff --git 
a/sql/core/src/main/resources/META-INF/services/org.apache.spark.status.protobuf.ProtobufSerDe
 
b/sql/core/src/main/resources/META-INF/services/org.apache.spark.status.protobuf.ProtobufSerDe
index e907d559349..7beff87d7ec 100644
--- 
a/sql/core/src/main/resources/META-INF/services/org.apache.spark.status.protobuf.ProtobufSerDe
+++ 
b/sql/core/src/main/resources/META-INF/services/org.apache.spark.status.protobuf.ProtobufSerDe
@@ -18,4 +18,3 @@
 org.apache.spark.status.protobuf.sql.SQLExecutionUIDataSerializer
 org.apache.spark.status.protobuf.sql.SparkPlanGraphWrapperSerializer
 

[spark] branch master updated: [SPARK-41708][SQL] Pull v1write information to `WriteFiles`

2023-01-05 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 27e20fe9eb1 [SPARK-41708][SQL] Pull v1write information to `WriteFiles`
27e20fe9eb1 is described below

commit 27e20fe9eb1b1ef1b3d32e180de55931f31fc345
Author: ulysses-you 
AuthorDate: Fri Jan 6 12:13:30 2023 +0800

[SPARK-41708][SQL] Pull v1write information to `WriteFiles`

### What changes were proposed in this pull request?

This PR pulls the v1 write information out of `V1WriteCommand` into
`WriteFiles`:
```scala
case class WriteFiles(child: LogicalPlan)

=>

case class WriteFiles(
child: LogicalPlan,
fileFormat: FileFormat,
partitionColumns: Seq[Attribute],
bucketSpec: Option[BucketSpec],
options: Map[String, String],
staticPartitions: TablePartitionSpec)
```

Also, this PR cleans up `WriteSpec`, which is no longer needed.

### Why are the changes needed?

After this PR, `WriteFiles` holds the v1 write information and makes it
directly available to developers.
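As a hedged illustration only (a sketch, not code from this patch), a
developer-side rule or extension could now read the write metadata straight off
the node:

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.datasources.WriteFiles

// Sketch: summarize the write information carried by WriteFiles after this PR.
def describeWrite(plan: LogicalPlan): String = plan match {
  case w: WriteFiles =>
    s"format=${w.fileFormat.getClass.getSimpleName}, " +
      s"partitionColumns=${w.partitionColumns.map(_.name).mkString(",")}, " +
      s"bucketed=${w.bucketSpec.isDefined}, " +
      s"staticPartitions=${w.staticPartitions}"
  case _ => "no WriteFiles node"
}
```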

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Pass CI

Closes #39277 from ulysses-you/SPARK-41708.

Authored-by: ulysses-you 
Signed-off-by: Wenchen Fan 
---
 .../org/apache/spark/sql/internal/WriteSpec.java   |  33 ---
 .../org/apache/spark/sql/execution/SparkPlan.scala |   9 +-
 .../spark/sql/execution/SparkStrategies.scala  |   5 +-
 .../spark/sql/execution/datasources/V1Writes.scala |  24 ++-
 .../sql/execution/datasources/WriteFiles.scala |  26 ++-
 .../org/apache/spark/sql/hive/HiveStrategies.scala |   3 +-
 .../{SaveAsHiveFile.scala => HiveTempPath.scala}   | 204 ++-
 .../hive/execution/InsertIntoHiveDirCommand.scala  |  13 +-
 .../sql/hive/execution/InsertIntoHiveTable.scala   |  88 +---
 .../spark/sql/hive/execution/SaveAsHiveFile.scala  | 221 +
 .../sql/hive/execution/V1WritesHiveUtils.scala |  33 ++-
 .../org/apache/spark/sql/hive/InsertSuite.scala|  15 +-
 12 files changed, 224 insertions(+), 450 deletions(-)

diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/internal/WriteSpec.java 
b/sql/core/src/main/java/org/apache/spark/sql/internal/WriteSpec.java
deleted file mode 100644
index c51a3ed7dc6..000
--- a/sql/core/src/main/java/org/apache/spark/sql/internal/WriteSpec.java
+++ /dev/null
@@ -1,33 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.sql.internal;
-
-import java.io.Serializable;
-
-/**
- * Write spec is a input parameter of
- * {@link org.apache.spark.sql.execution.SparkPlan#executeWrite}.
- *
- * 
- * This is an empty interface, the concrete class which implements
- * {@link org.apache.spark.sql.execution.SparkPlan#doExecuteWrite}
- * should define its own class and use it.
- *
- * @since 3.4.0
- */
-public interface WriteSpec extends Serializable {}
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala
index 401302e5bde..5ca36a8a216 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala
@@ -36,8 +36,9 @@ import org.apache.spark.sql.catalyst.plans.physical._
 import org.apache.spark.sql.catalyst.trees.{BinaryLike, LeafLike, TreeNodeTag, 
UnaryLike}
 import org.apache.spark.sql.connector.write.WriterCommitMessage
 import org.apache.spark.sql.errors.QueryExecutionErrors
+import org.apache.spark.sql.execution.datasources.WriteFilesSpec
 import org.apache.spark.sql.execution.metric.SQLMetric
-import org.apache.spark.sql.internal.{SQLConf, WriteSpec}
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.vectorized.ColumnarBatch
 import org.apache.spark.util.NextIterator
 import org.apache.spark.util.io.{ChunkedByteBuffer, 
ChunkedByteBufferOutputStream}
@@ -230,11 +231,11 @@ abstract class 

[spark] branch branch-3.2 updated: [SPARK-41162][SQL][3.3] Fix anti- and semi-join for self-join with aggregations

2023-01-05 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 7eca60d4f30 [SPARK-41162][SQL][3.3] Fix anti- and semi-join for 
self-join with aggregations
7eca60d4f30 is described below

commit 7eca60d4f304d4a1a66add9fd04166d8eed1dd4f
Author: Enrico Minack 
AuthorDate: Fri Jan 6 11:32:45 2023 +0800

[SPARK-41162][SQL][3.3] Fix anti- and semi-join for self-join with 
aggregations

### What changes were proposed in this pull request?
Backport #39131 to branch-3.3.

Rule `PushDownLeftSemiAntiJoin` should not push an anti-join below an 
`Aggregate` when the join condition references an attribute that exists in its 
right plan and its left plan's child. This usually happens when the anti-join / 
semi-join is a self-join while `DeduplicateRelations` cannot deduplicate those 
attributes (in this example due to the projection of `value` to `id`).

This behaviour already exists for `Project` and `Union`, but `Aggregate` 
lacks this safety guard.

### Why are the changes needed?
Without this change, the optimizer creates an incorrect plan.

This example fails with `distinct()` (an aggregation), and succeeds without 
`distinct()`, but both queries are identical:
```scala
val ids = Seq(1, 2, 3).toDF("id").distinct()
val result = ids.withColumn("id", $"id" + 1).join(ids, Seq("id"), 
"left_anti").collect()
assert(result.length == 1)
```
With `distinct()`, rule `PushDownLeftSemiAntiJoin` creates a join condition 
`(value#907 + 1) = value#907`, which can never be true. This effectively 
removes the anti-join.

**Before this PR:**
The anti-join is fully removed from the plan.
```
== Physical Plan ==
AdaptiveSparkPlan (16)
+- == Final Plan ==
   LocalTableScan (1)

(16) AdaptiveSparkPlan
Output [1]: [id#900]
Arguments: isFinalPlan=true
```

This is caused by `PushDownLeftSemiAntiJoin` adding join condition
`(value#907 + 1) = value#907`, which is wrong because `id#910` in
`(id#910 + 1) AS id#912` exists in the right child of the join as well as in
the left grandchild:
```
=== Applying Rule 
org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin ===
!Join LeftAnti, (id#912 = id#910)  Aggregate [id#910], 
[(id#910 + 1) AS id#912]
!:- Aggregate [id#910], [(id#910 + 1) AS id#912]   +- Project [value#907 AS 
id#910]
!:  +- Project [value#907 AS id#910]  +- Join LeftAnti, 
((value#907 + 1) = value#907)
!: +- LocalRelation [value#907]  :- LocalRelation 
[value#907]
!+- Aggregate [id#910], [id#910] +- Aggregate 
[id#910], [id#910]
!   +- Project [value#914 AS id#910]+- Project 
[value#914 AS id#910]
!  +- LocalRelation [value#914]+- 
LocalRelation [value#914]
```

The right child of the join and the left grandchild would become the
children of the pushed-down join, which creates an invalid join condition.

**After this PR:**
The join condition `(id#910 + 1) AS id#912` is recognized as ambiguous because
both sides of the prospective join contain `id#910`. Hence, the join is not
pushed down, and the rule is no longer applied.

The final plan contains the anti-join:
```
== Physical Plan ==
AdaptiveSparkPlan (24)
+- == Final Plan ==
   * BroadcastHashJoin LeftSemi BuildRight (14)
   :- * HashAggregate (7)
   :  +- AQEShuffleRead (6)
   : +- ShuffleQueryStage (5), Statistics(sizeInBytes=48.0 B, 
rowCount=3)
   :+- Exchange (4)
   :   +- * HashAggregate (3)
   :  +- * Project (2)
   : +- * LocalTableScan (1)
   +- BroadcastQueryStage (13), Statistics(sizeInBytes=1024.0 KiB, 
rowCount=3)
  +- BroadcastExchange (12)
 +- * HashAggregate (11)
+- AQEShuffleRead (10)
   +- ShuffleQueryStage (9), Statistics(sizeInBytes=48.0 B, 
rowCount=3)
  +- ReusedExchange (8)

(8) ReusedExchange [Reuses operator id: 4]
Output [1]: [id#898]

(24) AdaptiveSparkPlan
Output [1]: [id#900]
Arguments: isFinalPlan=true
```

### Does this PR introduce _any_ user-facing change?
It fixes a correctness bug.

### How was this patch tested?
Unit tests in `DataFrameJoinSuite` and `LeftSemiAntiJoinPushDownSuite`.

Closes #39409 from EnricoMi/branch-antijoin-selfjoin-fix-3.3.

Authored-by: Enrico Minack 
Signed-off-by: Wenchen Fan 
(cherry picked from commit b97f79da04acc9bde1cb4def7dc33c22cfc11372)
Signed-off-by: Wenchen Fan 
---
 .../optimizer/PushDownLeftSemiAntiJoin.scala   

[spark] branch branch-3.3 updated: [SPARK-41162][SQL][3.3] Fix anti- and semi-join for self-join with aggregations

2023-01-05 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new b97f79da04a [SPARK-41162][SQL][3.3] Fix anti- and semi-join for 
self-join with aggregations
b97f79da04a is described below

commit b97f79da04acc9bde1cb4def7dc33c22cfc11372
Author: Enrico Minack 
AuthorDate: Fri Jan 6 11:32:45 2023 +0800

[SPARK-41162][SQL][3.3] Fix anti- and semi-join for self-join with 
aggregations

### What changes were proposed in this pull request?
Backport #39131 to branch-3.3.

Rule `PushDownLeftSemiAntiJoin` should not push an anti-join below an 
`Aggregate` when the join condition references an attribute that exists in its 
right plan and its left plan's child. This usually happens when the anti-join / 
semi-join is a self-join while `DeduplicateRelations` cannot deduplicate those 
attributes (in this example due to the projection of `value` to `id`).

This behaviour already exists for `Project` and `Union`, but `Aggregate` 
lacks this safety guard.

### Why are the changes needed?
Without this change, the optimizer creates an incorrect plan.

This example fails with `distinct()` (an aggregation), and succeeds without 
`distinct()`, but both queries are identical:
```scala
val ids = Seq(1, 2, 3).toDF("id").distinct()
val result = ids.withColumn("id", $"id" + 1).join(ids, Seq("id"), 
"left_anti").collect()
assert(result.length == 1)
```
With `distinct()`, rule `PushDownLeftSemiAntiJoin` creates a join condition 
`(value#907 + 1) = value#907`, which can never be true. This effectively 
removes the anti-join.

**Before this PR:**
The anti-join is fully removed from the plan.
```
== Physical Plan ==
AdaptiveSparkPlan (16)
+- == Final Plan ==
   LocalTableScan (1)

(16) AdaptiveSparkPlan
Output [1]: [id#900]
Arguments: isFinalPlan=true
```

This is caused by `PushDownLeftSemiAntiJoin` adding join condition
`(value#907 + 1) = value#907`, which is wrong because `id#910` in
`(id#910 + 1) AS id#912` exists in the right child of the join as well as in
the left grandchild:
```
=== Applying Rule 
org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin ===
!Join LeftAnti, (id#912 = id#910)  Aggregate [id#910], 
[(id#910 + 1) AS id#912]
!:- Aggregate [id#910], [(id#910 + 1) AS id#912]   +- Project [value#907 AS 
id#910]
!:  +- Project [value#907 AS id#910]  +- Join LeftAnti, 
((value#907 + 1) = value#907)
!: +- LocalRelation [value#907]  :- LocalRelation 
[value#907]
!+- Aggregate [id#910], [id#910] +- Aggregate 
[id#910], [id#910]
!   +- Project [value#914 AS id#910]+- Project 
[value#914 AS id#910]
!  +- LocalRelation [value#914]+- 
LocalRelation [value#914]
```

The right child of the join and the left grandchild would become the
children of the pushed-down join, which creates an invalid join condition.

**After this PR:**
The join condition `(id#910 + 1) AS id#912` is recognized as ambiguous because
both sides of the prospective join contain `id#910`. Hence, the join is not
pushed down, and the rule is no longer applied.

The final plan contains the anti-join:
```
== Physical Plan ==
AdaptiveSparkPlan (24)
+- == Final Plan ==
   * BroadcastHashJoin LeftSemi BuildRight (14)
   :- * HashAggregate (7)
   :  +- AQEShuffleRead (6)
   : +- ShuffleQueryStage (5), Statistics(sizeInBytes=48.0 B, 
rowCount=3)
   :+- Exchange (4)
   :   +- * HashAggregate (3)
   :  +- * Project (2)
   : +- * LocalTableScan (1)
   +- BroadcastQueryStage (13), Statistics(sizeInBytes=1024.0 KiB, 
rowCount=3)
  +- BroadcastExchange (12)
 +- * HashAggregate (11)
+- AQEShuffleRead (10)
   +- ShuffleQueryStage (9), Statistics(sizeInBytes=48.0 B, 
rowCount=3)
  +- ReusedExchange (8)

(8) ReusedExchange [Reuses operator id: 4]
Output [1]: [id#898]

(24) AdaptiveSparkPlan
Output [1]: [id#900]
Arguments: isFinalPlan=true
```

### Does this PR introduce _any_ user-facing change?
It fixes a correctness bug.

### How was this patch tested?
Unit tests in `DataFrameJoinSuite` and `LeftSemiAntiJoinPushDownSuite`.

Closes #39409 from EnricoMi/branch-antijoin-selfjoin-fix-3.3.

Authored-by: Enrico Minack 
Signed-off-by: Wenchen Fan 
---
 .../optimizer/PushDownLeftSemiAntiJoin.scala   | 13 ++---
 .../optimizer/LeftSemiAntiJoinPushDownSuite.scala  | 57 ++
 

[spark] branch master updated: [SPARK-41912][SQL] Subquery should not validate CTE

2023-01-05 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 89666d44a39 [SPARK-41912][SQL] Subquery should not validate CTE
89666d44a39 is described below

commit 89666d44a39c48df841a0102ff6f54eaeb4c6140
Author: Rui Wang 
AuthorDate: Fri Jan 6 11:30:48 2023 +0800

[SPARK-41912][SQL] Subquery should not validate CTE

### What changes were proposed in this pull request?

The commit https://github.com/apache/spark/pull/38029 actually intended to do
the right thing: it checks CTEs more aggressively, even when a CTE is not used,
which is fine. However, it exposes an existing issue: a subquery validates
itself, and if that subquery references a CTE defined outside of it, the check
fails because the CTE cannot be found (e.g. key not found).

In short:

1. The commit checks more, so in the repro example every CTE is now checked
(previously only used CTEs were checked).

2. One of the CTEs that is now checked contains a subquery.

3. The subquery references another CTE that is defined outside of the subquery.

4. The subquery validates itself and therefore fails with "CTE not found".

This PR fixes the issue by removing the subquery self-validation in the CTE
case.

### Why are the changes needed?

This fixes a regression where the following query
```
val df = sql("""
   |WITH
   |cte1 as (SELECT 1 col1),
   |cte2 as (SELECT (SELECT MAX(col1) FROM cte1))
   |SELECT * FROM cte1
   |""".stripMargin
)
checkAnswer(df, Row(1) :: Nil)
```

no longer passes the analyzer.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

Closes #39414 from amaliujia/fix_subquery_validate.

Authored-by: Rui Wang 
Signed-off-by: Wenchen Fan 
---
 .../apache/spark/sql/catalyst/analysis/CheckAnalysis.scala|  2 +-
 .../src/test/scala/org/apache/spark/sql/SubquerySuite.scala   | 11 +++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index 8309186d566..4dc0bf98a54 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -923,7 +923,7 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
 }
 
 // Validate the subquery plan.
-checkAnalysis(expr.plan)
+checkAnalysis0(expr.plan)
 
 // Check if there is outer attribute that cannot be found from the plan.
 checkOuterReference(plan, expr)
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala
index 3d4a629f7a9..86a0c4d1799 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala
@@ -1019,6 +1019,17 @@ class SubquerySuite extends QueryTest
 }
   }
 
+  test("SPARK-41912: Subquery does not validate CTE") {
+val df = sql("""
+   |WITH
+   |cte1 as (SELECT 1 col1),
+   |cte2 as (SELECT (SELECT MAX(col1) FROM cte1))
+   |SELECT * FROM cte1
+   |""".stripMargin
+)
+checkAnswer(df, Row(1) :: Nil)
+  }
+
   test("SPARK-21835: Join in correlated subquery should be duplicateResolved: 
case 1") {
 withTable("t1") {
   withTempPath { path =>


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (729c4bff167 -> 80ee385cc38)

2023-01-05 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 729c4bff167 [SPARK-41861][SQL] Make v2 ScanBuilders' build() return 
typed scan
 add 80ee385cc38 [SPARK-41831][CONNECT][PYTHON] Make `DataFrame.select` 
accept column list

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/dataframe.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (eba31a8de3f -> 729c4bff167)

2023-01-05 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from eba31a8de3f [SPARK-41806][SQL] Use AppendData.byName for SQL INSERT 
INTO by name for DSV2
 add 729c4bff167 [SPARK-41861][SQL] Make v2 ScanBuilders' build() return 
typed scan

No new revisions were added by this update.

Summary of changes:
 .../src/main/scala/org/apache/spark/sql/v2/avro/AvroScanBuilder.scala | 3 +--
 .../spark/sql/execution/datasources/v2/csv/CSVScanBuilder.scala   | 3 +--
 .../spark/sql/execution/datasources/v2/jdbc/JDBCScanBuilder.scala | 4 ++--
 .../spark/sql/execution/datasources/v2/json/JsonScanBuilder.scala | 3 +--
 .../spark/sql/execution/datasources/v2/orc/OrcScanBuilder.scala   | 4 ++--
 .../sql/execution/datasources/v2/parquet/ParquetScanBuilder.scala | 4 ++--
 .../spark/sql/execution/datasources/v2/text/TextScanBuilder.scala | 3 +--
 7 files changed, 10 insertions(+), 14 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41806][SQL] Use AppendData.byName for SQL INSERT INTO by name for DSV2

2023-01-05 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new eba31a8de3f [SPARK-41806][SQL] Use AppendData.byName for SQL INSERT 
INTO by name for DSV2
eba31a8de3f is described below

commit eba31a8de3fb79f96255a0feb58db19842c9d16d
Author: Allison Portis 
AuthorDate: Fri Jan 6 10:42:16 2023 +0800

[SPARK-41806][SQL] Use AppendData.byName for SQL INSERT INTO by name for 
DSV2

### What changes were proposed in this pull request?

Use DSv2 AppendData.byName for INSERT INTO by name instead of reordering 
and converting to AppendData.byOrdinal

### Why are the changes needed?

Currently, for INSERT INTO by name, we reorder the value list and convert it
to INSERT INTO by ordinal. Since DSv2 logical nodes have the `isByName` flag, we
don't need to do this. The current approach is limiting in that:

- Users must provide the full list of table columns (this limits the
functionality for features like generated columns; see
[SPARK-41290](https://issues.apache.org/jira/browse/SPARK-41290))
- It allows ambiguous queries such as `INSERT OVERWRITE t PARTITION (c='1')
(c) VALUES ('2')`, where the user provides both the static partition column 'c'
and the column 'c' in the column list. We should check that the static
partition column is not in the column list. See the added test for a more
detailed example.

### Does this PR introduce _any_ user-facing change?

For versions 3.3 and below:
```sql
CREATE TABLE t(i STRING, c string) USING PARQUET PARTITIONED BY (c);
INSERT OVERWRITE t PARTITION (c='1') (c) VALUES ('2')
SELECT * FROM t
```
```
+---+---+
|  i|  c|
+---+---+
|  2|  1|
+---+---+
```
For versions 3.4 and above:
```sql
CREATE TABLE t(i STRING, c string) USING PARQUET PARTITIONED BY (c);
INSERT OVERWRITE t PARTITION (c='1') (c) VALUES ('2')
```
```
AnalysisException: [STATIC_PARTITION_COLUMN_IN_COLUMN_LIST] Static 
partition column c is also specified in the column list.
```

### How was this patch tested?

Unit tests are added.

Closes #39334 from allisonport-db/insert-into-by-name.

Authored-by: Allison Portis 
Signed-off-by: Wenchen Fan 
---
 core/src/main/resources/error/error-classes.json   |  5 ++
 .../spark/sql/catalyst/analysis/Analyzer.scala | 99 +++---
 .../spark/sql/errors/QueryCompilationErrors.scala  |  6 ++
 .../org/apache/spark/sql/SQLInsertTestSuite.scala  | 16 +++-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 96 +
 .../execution/command/PlanResolutionSuite.scala| 30 ++-
 6 files changed, 239 insertions(+), 13 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json 
b/core/src/main/resources/error/error-classes.json
index 29cafdcc1b6..1d1952dce1b 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -1145,6 +1145,11 @@
   "Star (*) is not allowed in a select list when GROUP BY an ordinal 
position is used."
 ]
   },
+  "STATIC_PARTITION_COLUMN_IN_INSERT_COLUMN_LIST" : {
+"message" : [
+  "Static partition column  is also specified in the column 
list."
+]
+  },
   "STREAM_FAILED" : {
 "message" : [
   "Query [id = , runId = ] terminated with exception: "
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index 1ebbfb9a39a..8fff0d41add 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -1291,28 +1291,92 @@ class Analyzer(override val catalogManager: 
CatalogManager)
 }
   }
 
+  /** Handle INSERT INTO for DSv2 */
   object ResolveInsertInto extends Rule[LogicalPlan] {
+
+/** Add a project to use the table column names for INSERT INTO BY NAME */
+private def createProjectForByNameQuery(i: InsertIntoStatement): 
LogicalPlan = {
+  SchemaUtils.checkColumnNameDuplication(i.userSpecifiedCols, resolver)
+
+  if (i.userSpecifiedCols.size != i.query.output.size) {
+throw QueryCompilationErrors.writeTableWithMismatchedColumnsError(
+  i.userSpecifiedCols.size, i.query.output.size, i.query)
+  }
+  val projectByName = i.userSpecifiedCols.zip(i.query.output)
+.map { case (userSpecifiedCol, queryOutputCol) =>
+  val resolvedCol = i.table.resolve(Seq(userSpecifiedCol), resolver)
+.getOrElse(
+  throw QueryCompilationErrors.unresolvedAttributeError(
+"UNRESOLVED_COLUMN", userSpecifiedCol, 
i.table.output.map(_.name), 

[spark] branch master updated (ca8bd4ec49f -> b5347737767)

2023-01-05 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from ca8bd4ec49f [SPARK-41827][CONNECT][PYTHON] Make `GroupBy` accept 
column list
 add b5347737767 [SPARK-41567][BUILD] Move configuration of 
`versions-maven-plugin` to parent pom

No new revisions were added by this update.

Summary of changes:
 dev/test-dependencies.sh | 5 ++---
 pom.xml  | 6 ++
 2 files changed, 8 insertions(+), 3 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (334bc188387 -> ca8bd4ec49f)

2023-01-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 334bc188387 [SPARK-41849][CONNECT] Implement DataFrameReader.text
 add ca8bd4ec49f [SPARK-41827][CONNECT][PYTHON] Make `GroupBy` accept 
column list

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/dataframe.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (542644075b3 -> 334bc188387)

2023-01-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 542644075b3 [SPARK-41892][CONNECT][TESTS] 
pyspark.sql.tests.test_functions - Add JIRAs or messages for skipped messages
 add 334bc188387 [SPARK-41849][CONNECT] Implement DataFrameReader.text

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/functions.py|  3 ---
 python/pyspark/sql/connect/readwriter.py   | 24 ++
 python/pyspark/sql/readwriter.py   |  3 +++
 .../sql/tests/connect/test_connect_basic.py| 10 +
 4 files changed, 37 insertions(+), 3 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41892][CONNECT][TESTS] pyspark.sql.tests.test_functions - Add JIRAs or messages for skipped messages

2023-01-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 542644075b3 [SPARK-41892][CONNECT][TESTS] 
pyspark.sql.tests.test_functions - Add JIRAs or messages for skipped messages
542644075b3 is described below

commit 542644075b3fe92ca2b6675111237fd2fc177ba1
Author: Sandeep Singh 
AuthorDate: Fri Jan 6 09:42:44 2023 +0900

[SPARK-41892][CONNECT][TESTS] pyspark.sql.tests.test_functions - Add JIRAs 
or messages for skipped messages

### What changes were proposed in this pull request?
This PR enables the reused PySpark tests in Spark Connect that pass now, and
adds JIRAs/messages to the skipped ones.

### Why are the changes needed?
To make sure of the test coverage.

### Does this PR introduce any user-facing change?
No, test-only.

### How was this patch tested?
Enabling tests

Closes #39412 from techaddict/SPARK-41892.

Authored-by: Sandeep Singh 
Signed-off-by: Hyukjin Kwon 
---
 .../sql/tests/connect/test_parity_functions.py | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/python/pyspark/sql/tests/connect/test_parity_functions.py 
b/python/pyspark/sql/tests/connect/test_parity_functions.py
index 3c616b5c864..a85acf7f6ec 100644
--- a/python/pyspark/sql/tests/connect/test_parity_functions.py
+++ b/python/pyspark/sql/tests/connect/test_parity_functions.py
@@ -44,106 +44,132 @@ class FunctionsParityTests(ReusedSQLTestCase, 
FunctionsTestsMixin):
 cls.spark = cls._spark.stop()
 del os.environ["SPARK_REMOTE"]
 
+# TODO(SPARK-41897): Parity in Error types between pyspark and connect 
functions
 @unittest.skip("Fails in Spark Connect, should enable.")
 def test_assert_true(self):
 super().test_assert_true()
 
+# Spark Connect does not support Spark Context but the test depends on 
that.
 @unittest.skip("Fails in Spark Connect, should enable.")
 def test_basic_functions(self):
 super().test_basic_functions()
 
+# TODO(SPARK-41899): DataFrame.createDataFrame converting int to bigint
 @unittest.skip("Fails in Spark Connect, should enable.")
 def test_date_add_function(self):
 super().test_date_add_function()
 
+# TODO(SPARK-41899): DataFrame.createDataFrame converting int to bigint
 @unittest.skip("Fails in Spark Connect, should enable.")
 def test_date_sub_function(self):
 super().test_date_sub_function()
 
+# TODO(SPARK-41847): DataFrame mapfield,structlist invalid type
 @unittest.skip("Fails in Spark Connect, should enable.")
 def test_explode(self):
 super().test_explode()
 
+# Spark Connect does not support Spark Context but the test depends on 
that.
 @unittest.skip("Fails in Spark Connect, should enable.")
 def test_function_parity(self):
 super().test_function_parity()
 
+# Spark Connect does not support Spark Context, _jdf but the test depends 
on that.
 @unittest.skip("Fails in Spark Connect, should enable.")
 def test_functions_broadcast(self):
 super().test_functions_broadcast()
 
+# Spark Connect does not support Spark Context but the test depends on 
that.
 @unittest.skip("Fails in Spark Connect, should enable.")
 def test_input_file_name_reset_for_rdd(self):
 super().test_input_file_name_reset_for_rdd()
 
+# TODO(SPARK-41849): Implement DataFrameReader.text
 @unittest.skip("Fails in Spark Connect, should enable.")
 def test_input_file_name_udf(self):
 super().test_input_file_name_udf()
 
+# TODO(SPARK-41901): Parity in String representation of Column
 @unittest.skip("Fails in Spark Connect, should enable.")
 def test_inverse_trig_functions(self):
 super().test_inverse_trig_functions()
 
+# TODO(SPARK-41834): Implement SparkSession.conf
 @unittest.skip("Fails in Spark Connect, should enable.")
 def test_lit_list(self):
 super().test_lit_list()
 
+# TODO(SPARK-41900): support Data Type int8
 @unittest.skip("Fails in Spark Connect, should enable.")
 def test_lit_np_scalar(self):
 super().test_lit_np_scalar()
 
+# TODO(SPARK-41902): Fix String representation of maps created by 
`map_from_arrays`
 @unittest.skip("Fails in Spark Connect, should enable.")
 def test_map_functions(self):
 super().test_map_functions()
 
+# TODO(SPARK-41903): Support data type ndarray
 @unittest.skip("Fails in Spark Connect, should enable.")
 def test_ndarray_input(self):
 super().test_ndarray_input()
 
+# TODO(SPARK-41902): Parity in String representation of 
higher_order_function's output
 @unittest.skip("Fails in Spark Connect, should enable.")
 def test_nested_higher_order_function(self):
 

[spark] branch master updated: [SPARK-41893][BUILD] Publish SBOM artifacts

2023-01-05 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f123179c0fe [SPARK-41893][BUILD] Publish SBOM artifacts
f123179c0fe is described below

commit f123179c0fe5517ebe3ed3f9668c3970fb491064
Author: Dongjoon Hyun 
AuthorDate: Thu Jan 5 16:22:48 2023 -0800

[SPARK-41893][BUILD] Publish SBOM artifacts

### What changes were proposed in this pull request?

This PR aims to publish `SBOM` artifacts.

### Why are the changes needed?

Here is an article to give some context.
- 
https://www.activestate.com/blog/why-the-us-government-is-mandating-software-bill-of-materials-sbom/

Software Bill of Materials (SBOM) are additional artifacts containing the 
aggregate of all direct and transitive dependencies of a project. The US 
Government (based on NIST recommendations) currently accepts only the three 
most popular SBOM standards as valid, namely: 
[CycloneDX](https://cyclonedx.org/), [Software Identification (SWID) 
tag](https://csrc.nist.gov/projects/Software-Identification-SWID), [Software 
Package Data Exchange® (SPDX)](https://spdx.dev/).

This PR uses [CycloneDX maven 
plugin](https://github.com/CycloneDX/cyclonedx-maven-plugin), a lightweight 
software bill of materials (SBOM) standard designed for use in application 
security contexts and supply chain component analysis.

For example, `spark-tags_2.12-3.4.0-SNAPSHOT-cyclonedx.xml` and 
`spark-tags_2.12-3.4.0-SNAPSHOT-cyclonedx.json` files are attached to 
`spark-tags_2.12-3.4.0-SNAPSHOT.jar`.
```
$ ls -al ~/.m2/repository/org/apache/spark/spark-tags_2.12/3.4.0-SNAPSHOT
total 2488
drwxr-xr-x  12 dongjoon  staff  384 Jan  4 23:36 .
drwxr-xr-x   4 dongjoon  staff  128 Jan  4 23:36 ..
-rw-r--r--   1 dongjoon  staff  492 Jan  4 23:36 _remote.repositories
-rw-r--r--   1 dongjoon  staff 1955 Jan  4 23:36 
maven-metadata-local.xml
-rw-r--r--   1 dongjoon  staff16310 Jan  4 23:36 
spark-tags_2.12-3.4.0-SNAPSHOT-cyclonedx.json
-rw-r--r--   1 dongjoon  staff14045 Jan  4 23:36 
spark-tags_2.12-3.4.0-SNAPSHOT-cyclonedx.xml
-rw-r--r--   1 dongjoon  staff  1162027 Jan  4 23:36 
spark-tags_2.12-3.4.0-SNAPSHOT-javadoc.jar
-rw-r--r--   1 dongjoon  staff16272 Jan  4 23:36 
spark-tags_2.12-3.4.0-SNAPSHOT-sources.jar
-rw-r--r--   1 dongjoon  staff12453 Jan  4 23:36 
spark-tags_2.12-3.4.0-SNAPSHOT-test-sources.jar
-rw-r--r--   1 dongjoon  staff10387 Jan  4 23:36 
spark-tags_2.12-3.4.0-SNAPSHOT-tests.jar
-rw-r--r--   1 dongjoon  staff15181 Jan  4 23:36 
spark-tags_2.12-3.4.0-SNAPSHOT.jar
-rw-r--r--   1 dongjoon  staff 5822 Jan  4 23:36 
spark-tags_2.12-3.4.0-SNAPSHOT.pom
```

### Does this PR introduce _any_ user-facing change?

Yes, but dev-only changes.

### How was this patch tested?

Manually test.
```
$ mvn install -DskipTests
...
[INFO] 

[INFO] Reactor Summary for Spark Project Parent POM 3.4.0-SNAPSHOT:
[INFO]
[INFO] Spark Project Parent POM ... SUCCESS [ 
10.501 s]
[INFO] Spark Project Tags . SUCCESS [ 
12.900 s]
[INFO] Spark Project Sketch ... SUCCESS [ 
24.315 s]
[INFO] Spark Project Local DB . SUCCESS [ 
25.406 s]
[INFO] Spark Project Networking ... SUCCESS [ 
36.217 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [ 
31.532 s]
[INFO] Spark Project Unsafe ... SUCCESS [ 
33.338 s]
[INFO] Spark Project Launcher . SUCCESS [ 
19.204 s]
[INFO] Spark Project Core . SUCCESS [05:24 
min]
[INFO] Spark Project ML Local Library . SUCCESS [01:20 
min]
[INFO] Spark Project GraphX ... SUCCESS [01:41 
min]
[INFO] Spark Project Streaming  SUCCESS [02:36 
min]
[INFO] Spark Project Catalyst . SUCCESS [06:44 
min]
[INFO] Spark Project SQL .. SUCCESS [07:10 
min]
[INFO] Spark Project ML Library ... SUCCESS [05:48 
min]
[INFO] Spark Project Tools  SUCCESS [ 
17.132 s]
[INFO] Spark Project Hive . SUCCESS [02:49 
min]
[INFO] Spark Project REPL . SUCCESS [ 
50.149 s]
[INFO] Spark Project Assembly . SUCCESS [  
6.706 s]
[INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [ 
44.131 s]

[spark] branch master updated: [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs

2023-01-05 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 545cf87dc72 [SPARK-41882][CORE][SQL][UI] Add tests for 
`SQLAppStatusStore` with RocksDB backend and fix some bugs
545cf87dc72 is described below

commit 545cf87dc723342fd0f7f1a222c1a94d4b4c91a0
Author: yangjie01 
AuthorDate: Thu Jan 5 10:12:41 2023 -0800

[SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB 
backend and fix some bugs

### What changes were proposed in this pull request?
This PR adds the following new test suites for `SQLAppStatusStore` with the
RocksDB backend:

- `SQLAppStatusListenerWithRocksDBBackendSuite`, based on
`SQLAppStatusListenerSuite`
- `AllExecutionsPageWithRocksDBBackendSuite`, based on
`AllExecutionsPageSuite`

and fixes bugs in `SQLExecutionUIDataSerializer` and
`SparkPlanGraphWrapperSerializer` to make the new tests pass.

It also adds protection to
`SparkPlanGraphWrapperSerializer#serializeSparkPlanGraphNodeWrapper` to avoid
throwing an NPE.
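For illustration, here is a standalone sketch of the null-vs-empty-map
distinction that the new `metric_values_is_null` field encodes (illustrative
types only; proto3 map fields cannot themselves be null, hence the extra flag
shown in the diff below):

```scala
// Sketch: record whether the original map was null, since the serialized
// map field can only ever be empty, never null.
case class SerializedMetrics(metricValuesIsNull: Boolean, metricValues: Map[Long, String])

def serialize(metricValues: Map[Long, String]): SerializedMetrics =
  if (metricValues == null) SerializedMetrics(metricValuesIsNull = true, metricValues = Map.empty)
  else SerializedMetrics(metricValuesIsNull = false, metricValues = metricValues)

def deserialize(s: SerializedMetrics): Map[Long, String] =
  if (s.metricValuesIsNull) null else s.metricValues
```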

### Why are the changes needed?
Add more tests for `SQLAppStatusStore` with the RocksDB backend and fix bugs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass GitHub Actions
- Add new tests

Closes #39385 from LuciferYang/SPARK-41432-FOLLOWUP.

Lead-authored-by: yangjie01 
Co-authored-by: Gengliang Wang 
Signed-off-by: Gengliang Wang 
---
 .../apache/spark/status/protobuf/store_types.proto |  3 +-
 .../sql/SQLExecutionUIDataSerializer.scala | 15 ++--
 .../sql/SparkPlanGraphWrapperSerializer.scala  |  5 ++-
 .../sql/execution/ui/AllExecutionsPageSuite.scala  | 41 +
 .../execution/ui/SQLAppStatusListenerSuite.scala   | 43 +-
 .../sql/KVStoreProtobufSerializerSuite.scala   |  4 +-
 6 files changed, 84 insertions(+), 27 deletions(-)

diff --git 
a/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto 
b/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto
index 2a45b5da1d8..1c3e5bfc49a 100644
--- a/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto
+++ b/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto
@@ -395,7 +395,8 @@ message SQLExecutionUIData {
   optional string error_message = 9;
   map jobs = 10;
   repeated int64 stages = 11;
-  map metric_values = 12;
+  bool metric_values_is_null = 12;
+  map metric_values = 13;
 }
 
 message SparkPlanGraphNode {
diff --git 
a/sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SQLExecutionUIDataSerializer.scala
 
b/sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SQLExecutionUIDataSerializer.scala
index 7a4a3e2a55d..09cef9663c0 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SQLExecutionUIDataSerializer.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SQLExecutionUIDataSerializer.scala
@@ -49,7 +49,10 @@ class SQLExecutionUIDataSerializer extends ProtobufSerDe {
 }
 ui.stages.foreach(stageId => builder.addStages(stageId.toLong))
 val metricValues = ui.metricValues
-if (metricValues != null) {
+if (metricValues == null) {
+  builder.setMetricValuesIsNull(true)
+} else {
+  builder.setMetricValuesIsNull(false)
   metricValues.foreach {
 case (k, v) => builder.putMetricValues(k, v)
   }
@@ -67,9 +70,13 @@ class SQLExecutionUIDataSerializer extends ProtobufSerDe {
 val jobs = ui.getJobsMap.asScala.map {
   case (jobId, status) => jobId.toInt -> 
JobExecutionStatusSerializer.deserialize(status)
 }.toMap
-val metricValues = ui.getMetricValuesMap.asScala.map {
-  case (k, v) => k.toLong -> v
-}.toMap
+val metricValues = if (ui.getMetricValuesIsNull) {
+  null
+} else {
+  ui.getMetricValuesMap.asScala.map {
+case (k, v) => k.toLong -> v
+  }.toMap
+}
 
 new SQLExecutionUIData(
   executionId = ui.getExecutionId,
diff --git 
a/sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SparkPlanGraphWrapperSerializer.scala
 
b/sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SparkPlanGraphWrapperSerializer.scala
index 49debedbb68..a8f715564fc 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SparkPlanGraphWrapperSerializer.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SparkPlanGraphWrapperSerializer.scala
@@ -53,8 +53,9 @@ class SparkPlanGraphWrapperSerializer extends ProtobufSerDe {
 StoreTypes.SparkPlanGraphNodeWrapper = {
 
 val builder = StoreTypes.SparkPlanGraphNodeWrapper.newBuilder()
-builder.setNode(serializeSparkPlanGraphNode(input.node))
-

[spark] branch master updated: [SPARK-41883][BUILD] Upgrade dropwizard metrics 4.2.15

2023-01-05 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new eee9428ea76 [SPARK-41883][BUILD] Upgrade dropwizard metrics 4.2.15
eee9428ea76 is described below

commit eee9428ea76f8f5603117d8be58028b11d75ff24
Author: yangjie01 
AuthorDate: Thu Jan 5 07:32:00 2023 -0600

[SPARK-41883][BUILD] Upgrade dropwizard metrics 4.2.15

### What changes were proposed in this pull request?
This PR aims to upgrade dropwizard metrics to 4.2.15.

### Why are the changes needed?
The release notes are as follows:

- https://github.com/dropwizard/metrics/releases/tag/v4.2.14
- https://github.com/dropwizard/metrics/releases/tag/v4.2.15

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass Github Actions

Closes #39391 from LuciferYang/SPARK-41883.

Authored-by: yangjie01 
Signed-off-by: Sean Owen 
---
 dev/deps/spark-deps-hadoop-2-hive-2.3 | 10 +-
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 10 +-
 pom.xml   |  2 +-
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2-hive-2.3 
b/dev/deps/spark-deps-hadoop-2-hive-2.3
index e1141fbc558..a1fd06003bb 100644
--- a/dev/deps/spark-deps-hadoop-2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2-hive-2.3
@@ -194,11 +194,11 @@ log4j-slf4j2-impl/2.19.0//log4j-slf4j2-impl-2.19.0.jar
 logging-interceptor/3.12.12//logging-interceptor-3.12.12.jar
 lz4-java/1.8.0//lz4-java-1.8.0.jar
 mesos/1.4.3/shaded-protobuf/mesos-1.4.3-shaded-protobuf.jar
-metrics-core/4.2.13//metrics-core-4.2.13.jar
-metrics-graphite/4.2.13//metrics-graphite-4.2.13.jar
-metrics-jmx/4.2.13//metrics-jmx-4.2.13.jar
-metrics-json/4.2.13//metrics-json-4.2.13.jar
-metrics-jvm/4.2.13//metrics-jvm-4.2.13.jar
+metrics-core/4.2.15//metrics-core-4.2.15.jar
+metrics-graphite/4.2.15//metrics-graphite-4.2.15.jar
+metrics-jmx/4.2.15//metrics-jmx-4.2.15.jar
+metrics-json/4.2.15//metrics-json-4.2.15.jar
+metrics-jvm/4.2.15//metrics-jvm-4.2.15.jar
 minlog/1.3.0//minlog-1.3.0.jar
 netty-all/4.1.86.Final//netty-all-4.1.86.Final.jar
 netty-buffer/4.1.86.Final//netty-buffer-4.1.86.Final.jar
diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index d4157917e43..adf9ec9452b 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -178,11 +178,11 @@ log4j-slf4j2-impl/2.19.0//log4j-slf4j2-impl-2.19.0.jar
 logging-interceptor/3.12.12//logging-interceptor-3.12.12.jar
 lz4-java/1.8.0//lz4-java-1.8.0.jar
 mesos/1.4.3/shaded-protobuf/mesos-1.4.3-shaded-protobuf.jar
-metrics-core/4.2.13//metrics-core-4.2.13.jar
-metrics-graphite/4.2.13//metrics-graphite-4.2.13.jar
-metrics-jmx/4.2.13//metrics-jmx-4.2.13.jar
-metrics-json/4.2.13//metrics-json-4.2.13.jar
-metrics-jvm/4.2.13//metrics-jvm-4.2.13.jar
+metrics-core/4.2.15//metrics-core-4.2.15.jar
+metrics-graphite/4.2.15//metrics-graphite-4.2.15.jar
+metrics-jmx/4.2.15//metrics-jmx-4.2.15.jar
+metrics-json/4.2.15//metrics-json-4.2.15.jar
+metrics-jvm/4.2.15//metrics-jvm-4.2.15.jar
 minlog/1.3.0//minlog-1.3.0.jar
 netty-all/4.1.86.Final//netty-all-4.1.86.Final.jar
 netty-buffer/4.1.86.Final//netty-buffer-4.1.86.Final.jar
diff --git a/pom.xml b/pom.xml
index 20a334e8c4b..e2ae0631f80 100644
--- a/pom.xml
+++ b/pom.xml
@@ -152,7 +152,7 @@
 If you changes codahale.metrics.version, you also need to change
 the link to metrics.dropwizard.io in docs/monitoring.md.
 -->
-4.2.13
+4.2.15
 
 1.11.1
 1.12.0


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (737eecded4d -> 3dc881afcfc)

2023-01-05 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 737eecded4d [SPARK-41162][SQL] Fix anti- and semi-join for self-join 
with aggregations
 add 3dc881afcfc [SPARK-41791] Add new file source metadata column types

No new revisions were added by this update.

Summary of changes:
 .../catalyst/expressions/namedExpressions.scala| 95 +++---
 .../org/apache/spark/sql/types/StructField.scala   |  4 +
 .../org/apache/spark/sql/types/StructType.scala|  3 +-
 .../spark/sql/execution/DataSourceScanExec.scala   | 34 
 .../sql/execution/datasources/FileFormat.scala |  9 --
 .../execution/datasources/FileSourceStrategy.scala | 68 +---
 .../datasources/PartitioningAwareFileIndex.scala   |  3 +-
 .../datasources/FileMetadataStructSuite.scala  | 10 +--
 8 files changed, 145 insertions(+), 81 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41162][SQL] Fix anti- and semi-join for self-join with aggregations

2023-01-05 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 737eecded4d [SPARK-41162][SQL] Fix anti- and semi-join for self-join 
with aggregations
737eecded4d is described below

commit 737eecded4dc2a828c978147a396f8808b09566f
Author: Enrico Minack 
AuthorDate: Thu Jan 5 18:55:11 2023 +0800

[SPARK-41162][SQL] Fix anti- and semi-join for self-join with aggregations

### What changes were proposed in this pull request?
Rule `PushDownLeftSemiAntiJoin` should not push an anti-join below an 
`Aggregate` when the join condition references an attribute that exists in its 
right plan and its left plan's child. This usually happens when the anti-join / 
semi-join is a self-join while `DeduplicateRelations` cannot deduplicate those 
attributes (in this example due to the projection of `value` to `id`).

This behaviour already exists for `Project` and `Union`, but `Aggregate` 
lacks this safety guard.

### Why are the changes needed?
Without this change, the optimizer creates an incorrect plan.

This example fails with `distinct()` (an aggregation), and succeeds without 
`distinct()`, but both queries are identical:
```scala
val ids = Seq(1, 2, 3).toDF("id").distinct()
val result = ids.withColumn("id", $"id" + 1).join(ids, Seq("id"), 
"left_anti").collect()
assert(result.length == 1)
```
With `distinct()`, rule `PushDownLeftSemiAntiJoin` creates a join condition 
`(value#907 + 1) = value#907`, which can never be true. This effectively 
removes the anti-join.

**Before this PR:**
The anti-join is fully removed from the plan.
```
== Physical Plan ==
AdaptiveSparkPlan (16)
+- == Final Plan ==
   LocalTableScan (1)

(16) AdaptiveSparkPlan
Output [1]: [id#900]
Arguments: isFinalPlan=true
```

This is caused by `PushDownLeftSemiAntiJoin` adding join condition
`(value#907 + 1) = value#907`, which is wrong because `id#910` in
`(id#910 + 1) AS id#912` exists in the right child of the join as well as in
the left grandchild:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin ===
!Join LeftAnti, (id#912 = id#910)                   Aggregate [id#910], [(id#910 + 1) AS id#912]
!:- Aggregate [id#910], [(id#910 + 1) AS id#912]    +- Project [value#907 AS id#910]
!:  +- Project [value#907 AS id#910]                   +- Join LeftAnti, ((value#907 + 1) = value#907)
!:     +- LocalRelation [value#907]                       :- LocalRelation [value#907]
!+- Aggregate [id#910], [id#910]                          +- Aggregate [id#910], [id#910]
!   +- Project [value#914 AS id#910]                         +- Project [value#914 AS id#910]
!      +- LocalRelation [value#914]                             +- LocalRelation [value#914]
```

The right child of the join and the left grandchild would become the children of the pushed-down join, which would create an invalid join condition.

**After this PR:**
The join condition over `(id#910 + 1) AS id#912` is now recognized as ambiguous, because both sides of the prospective pushed-down join contain `id#910`. Hence, the join is not pushed down and the rule is no longer applied.

The final plan contains the anti-join:
```
== Physical Plan ==
AdaptiveSparkPlan (24)
+- == Final Plan ==
   * BroadcastHashJoin LeftSemi BuildRight (14)
   :- * HashAggregate (7)
   :  +- AQEShuffleRead (6)
   :     +- ShuffleQueryStage (5), Statistics(sizeInBytes=48.0 B, rowCount=3)
   :        +- Exchange (4)
   :           +- * HashAggregate (3)
   :              +- * Project (2)
   :                 +- * LocalTableScan (1)
   +- BroadcastQueryStage (13), Statistics(sizeInBytes=1024.0 KiB, rowCount=3)
      +- BroadcastExchange (12)
         +- * HashAggregate (11)
            +- AQEShuffleRead (10)
               +- ShuffleQueryStage (9), Statistics(sizeInBytes=48.0 B, rowCount=3)
                  +- ReusedExchange (8)

(8) ReusedExchange [Reuses operator id: 4]
Output [1]: [id#898]

(24) AdaptiveSparkPlan
Output [1]: [id#900]
Arguments: isFinalPlan=true
```

### Does this PR introduce _any_ user-facing change?
Yes, it fixes a correctness issue.

### How was this patch tested?
Unit tests in `DataFrameJoinSuite` and `LeftSemiAntiJoinPushDownSuite`.
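
As a rough sketch (the test name and assertions here are illustrative, not necessarily the exact test added by this commit; it assumes the suite's implicits, e.g. `spark.implicits._`, are in scope), such a regression test can reuse the reproduction above:

```scala
test("SPARK-41162: anti-join must not be pushed below an aggregate when it becomes ambiguous") {
  // Without the fix the anti-join is dropped and all three shifted rows survive;
  // with the fix only the unmatched row id = 4 remains.
  val ids = Seq(1, 2, 3).toDF("id").distinct()
  val result = ids.withColumn("id", $"id" + 1)
    .join(ids, Seq("id"), "left_anti")
    .collect()
  assert(result.length == 1)
  assert(result.head.getInt(0) == 4)
}
```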

Closes #39131 from EnricoMi/branch-antijoin-selfjoin-fix.

Authored-by: Enrico Minack 
Signed-off-by: Wenchen Fan 
---
 .../optimizer/PushDownLeftSemiAntiJoin.scala   | 13 ++---
 .../optimizer/LeftSemiAntiJoinPushDownSuite.scala  | 57 ++
 .../org/apache/spark/sql/DataFrameJoinSuite.scala  | 18 +++
 3 files 

[spark] branch master updated: [SPARK-41842][CONNECT][PYTHON][TESTS] Enable doctests for time functions

2023-01-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 272634b4240 [SPARK-41842][CONNECT][PYTHON][TESTS] Enable doctests for 
time functions
272634b4240 is described below

commit 272634b42406aa60d3a6fb818427aa3078ec9c00
Author: Ruifeng Zheng 
AuthorDate: Thu Jan 5 19:20:35 2023 +0900

[SPARK-41842][CONNECT][PYTHON][TESTS] Enable doctests for time functions

### What changes were proposed in this pull request?
Enable doctests for time functions

### Why are the changes needed?
for test coverage

### Does this PR introduce _any_ user-facing change?
no, test-only

### How was this patch tested?
enable tests

Closes #39407 from zhengruifeng/connect_fix_41842.

Authored-by: Ruifeng Zheng 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/connect/functions.py | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/python/pyspark/sql/connect/functions.py b/python/pyspark/sql/connect/functions.py
index f2603d477cb..d9665cd1a8e 100644
--- a/python/pyspark/sql/connect/functions.py
+++ b/python/pyspark/sql/connect/functions.py
@@ -2368,13 +2368,6 @@ def _test() -> None:
 # TODO(SPARK-41757): Fix String representation for Column class
 del pyspark.sql.connect.functions.col.__doc__
 
-# TODO(SPARK-41842): support data type: Timestamp(NANOSECOND, null)
-del pyspark.sql.connect.functions.hour.__doc__
-del pyspark.sql.connect.functions.minute.__doc__
-del pyspark.sql.connect.functions.second.__doc__
-del pyspark.sql.connect.functions.window.__doc__
-del pyspark.sql.connect.functions.window_time.__doc__
-
 # TODO(SPARK-41838): fix dataset.show
 del pyspark.sql.connect.functions.posexplode_outer.__doc__
 del pyspark.sql.connect.functions.explode_outer.__doc__


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41830][CONNECT][PYTHON] Make `DataFrame.sample` accept the same parameters as PySpark

2023-01-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 44d165113dd [SPARK-41830][CONNECT][PYTHON] Make `DataFrame.sample` 
accept the same parameters as PySpark
44d165113dd is described below

commit 44d165113ddce621f0090d89624bcff554ae49bb
Author: Ruifeng Zheng 
AuthorDate: Thu Jan 5 19:19:00 2023 +0900

[SPARK-41830][CONNECT][PYTHON] Make `DataFrame.sample` accept the same 
parameters as PySpark

### What changes were proposed in this pull request?
Make `DataFrame.sample` accept the same parameters as PySpark.

### Why are the changes needed?
For consistency

### Does this PR introduce _any_ user-facing change?
Yes, `DataFrame.sample` in the Python Spark Connect client now accepts the same parameters as PySpark's `DataFrame.sample`.

### How was this patch tested?
enabled doctests

Closes #39403 from zhengruifeng/connect_fix_41830.

Authored-by: Ruifeng Zheng 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/connect/dataframe.py | 55 -
 1 file changed, 41 insertions(+), 14 deletions(-)

diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py
index 13a421ca72a..639e3faa748 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -405,26 +405,56 @@ class DataFrame:
 
 def sample(
 self,
-fraction: float,
-*,
-withReplacement: bool = False,
+withReplacement: Optional[Union[float, bool]] = None,
+fraction: Optional[Union[int, float]] = None,
 seed: Optional[int] = None,
 ) -> "DataFrame":
-if not isinstance(fraction, float):
-raise TypeError(f"'fraction' must be float, but got {type(fraction).__name__}")
-if not isinstance(withReplacement, bool):
+
+# For the cases below:
+#   sample(True, 0.5 [, seed])
+#   sample(True, fraction=0.5 [, seed])
+#   sample(withReplacement=False, fraction=0.5 [, seed])
+is_withReplacement_set = type(withReplacement) == bool and isinstance(fraction, float)
+
+# For the case below:
+#   sample(faction=0.5 [, seed])
+is_withReplacement_omitted_kwargs = withReplacement is None and isinstance(fraction, float)
+
+# For the case below:
+#   sample(0.5 [, seed])
+is_withReplacement_omitted_args = isinstance(withReplacement, float)
+
+if not (
+is_withReplacement_set
+or is_withReplacement_omitted_kwargs
+or is_withReplacement_omitted_args
+):
+argtypes = [
+str(type(arg)) for arg in [withReplacement, fraction, seed] if arg is not None
+]
 raise TypeError(
-f"'withReplacement' must be bool, but got {type(withReplacement).__name__}"
+"withReplacement (optional), fraction (required) and seed (optional)"
+" should be a bool, float and number; however, "
+"got [%s]." % ", ".join(argtypes)
 )
-if seed is not None and not isinstance(seed, int):
-raise TypeError(f"'seed' must be None or int, but got {type(seed).__name__}")
+
+if is_withReplacement_omitted_args:
+if fraction is not None:
+seed = cast(int, fraction)
+fraction = withReplacement
+withReplacement = None
+
+if withReplacement is None:
+withReplacement = False
+
+seed = int(seed) if seed is not None else None
 
 return DataFrame.withPlan(
 plan.Sample(
 child=self._plan,
 lower_bound=0.0,
-upper_bound=fraction,
-with_replacement=withReplacement,
+upper_bound=fraction,  # type: ignore[arg-type]
+with_replacement=withReplacement,  # type: ignore[arg-type]
 seed=seed,
 ),
 session=self._session,
@@ -1485,9 +1515,6 @@ def _test() -> None:
 # TODO(SPARK-41827): groupBy requires all cols be Column or str
 del pyspark.sql.connect.dataframe.DataFrame.groupBy.__doc__
 
-# TODO(SPARK-41830): fix sample parameters
-del pyspark.sql.connect.dataframe.DataFrame.sample.__doc__
-
 # TODO(SPARK-41831): fix transform to accept ColumnReference
 del pyspark.sql.connect.dataframe.DataFrame.transform.__doc__
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41790][SQL] Set TRANSFORM reader and writer's format correctly

2023-01-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ede8c52de88 [SPARK-41790][SQL] Set TRANSFORM reader and writer's 
format correctly
ede8c52de88 is described below

commit ede8c52de8878cbcd098284d5c632ea8fa4ebf67
Author: maming 
AuthorDate: Thu Jan 5 17:52:19 2023 +0900

[SPARK-41790][SQL] Set TRANSFORM reader and writer's format correctly

### What changes were proposed in this pull request?
We get wrong data when TRANSFORM specifies only the reader's or only the writer's ROW FORMAT DELIMITED. The reason is that the wrong format is currently used to feed data to, and fetch data from, the running script; we should set the formats correctly.

Currently in Spark:
```sql
spark-sql> CREATE TABLE t1 (a string, b string);

spark-sql> INSERT OVERWRITE t1 VALUES("1", "2"), ("3", "4");

spark-sql> SELECT TRANSFORM(a, b)
 >   ROW FORMAT DELIMITED
 >   FIELDS TERMINATED BY ','
 >   USING 'cat'
 >   AS (c)
 > FROM t1;
c

spark-sql> SELECT TRANSFORM(a, b)
 >   USING 'cat'
 >   AS (c)
 >   ROW FORMAT DELIMITED
 >   FIELDS TERMINATED BY ','
 > FROM t1;
c
1234
```

The same sql in hive:
```sql
hive> SELECT TRANSFORM(a, b)
>   ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY ','
>   USING 'cat'
>   AS (c)
> FROM t1;
c
1,2
3,4

hive> SELECT TRANSFORM(a, b)
>   USING 'cat'
>   AS (c)
>   ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY ','
> FROM t1;
c
12
34
```
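
For reference, a minimal Scala reproduction of the first query (assuming a Hive-enabled `SparkSession` named `spark` and the table `t1` created above); after this fix, Spark should return the same rows as Hive for this query:

```scala
// With the corrected writer format, `cat` receives the comma-delimited rows
// "1,2" and "3,4", so the single output column c contains exactly those strings.
val rows = spark.sql(
  """SELECT TRANSFORM(a, b)
    |  ROW FORMAT DELIMITED
    |  FIELDS TERMINATED BY ','
    |  USING 'cat'
    |  AS (c)
    |FROM t1""".stripMargin).collect()
assert(rows.map(_.getString(0)).sorted.sameElements(Array("1,2", "3,4")))
```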

### Why are the changes needed?
Fix TRANSFORM's writer format and reader format.

### Does this PR introduce _any_ user-facing change?
Yes. Queries that set TRANSFORM's ROW FORMAT DELIMITED in SQL could previously return wrong data; after this change they return correct results.

### How was this patch tested?
New tests.

Closes #39315 from mattshma/SPARK-41790.

Lead-authored-by: maming 
Co-authored-by: mattshma 
Signed-off-by: Hyukjin Kwon 
---
 .../apache/spark/sql/execution/SparkSqlParser.scala  | 15 +--
 .../execution/HiveScriptTransformationSuite.scala| 20 
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala
index ad0599775de..e67ffa606ef 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala
@@ -751,15 +751,18 @@ class SparkSqlAstBuilder extends AstBuilder {
   (Nil, Option(name), props, recordHandler)
   }
 
-  val (inFormat, inSerdeClass, inSerdeProps, reader) =
+  // The Writer uses inFormat to feed input data into the running script and
+  // the reader uses outFormat to read the output from the running script,
+  // this behavior is same with hive.
+  val (inFormat, inSerdeClass, inSerdeProps, writer) =
 format(
-  inRowFormat, "hive.script.recordreader",
-  "org.apache.hadoop.hive.ql.exec.TextRecordReader")
+  inRowFormat, "hive.script.recordwriter",
+  "org.apache.hadoop.hive.ql.exec.TextRecordWriter")
 
-  val (outFormat, outSerdeClass, outSerdeProps, writer) =
+  val (outFormat, outSerdeClass, outSerdeProps, reader) =
 format(
-  outRowFormat, "hive.script.recordwriter",
-  "org.apache.hadoop.hive.ql.exec.TextRecordWriter")
+  outRowFormat, "hive.script.recordreader",
+  "org.apache.hadoop.hive.ql.exec.TextRecordReader")
 
   ScriptInputOutputSchema(
 inFormat, outFormat,
diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala
index ad4a311528a..4e8a62acddd 100644
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala
@@ -638,4 +638,24 @@ class HiveScriptTransformationSuite extends BaseScriptTransformationSuite with T
 Row("1") :: Row("2") :: Row("3") :: Nil)
 }
   }
+
+  test("SPARK-41790: Set TRANSFORM reader and writer's format correctly") {
+withTempView("v") {
+  val df = Seq(
+(1, 2)
+  ).toDF("a", "b")
+  df.createTempView("v")
+
+  checkAnswer(
+sql(
+  s"""
+ |SELECT TRANSFORM(a, b)
+ |  ROW FORMAT DELIMITED
+ |