[spark] branch branch-3.3 updated (5fe895a65a4 -> 5a23f628061)

2022-10-05 Thread yumwang
This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a change to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


from 5fe895a65a4 [SPARK-40660][SQL][3.3] Switch to XORShiftRandom to distribute elements
 add 7c465bc3154 Preparing Spark release v3.3.1-rc3
 new 5a23f628061 Preparing development version 3.3.2-SNAPSHOT

The 1 revision listed above as "new" is entirely new to this
repository and will be described in a separate email. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:





[spark] 01/01: Preparing development version 3.3.2-SNAPSHOT

2022-10-05 Thread yumwang
This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git

commit 5a23f62806109425869752de9be1b4ab012f9af8
Author: Yuming Wang 
AuthorDate: Thu Oct 6 05:15:21 2022 +

Preparing development version 3.3.2-SNAPSHOT
---
 R/pkg/DESCRIPTION  | 2 +-
 assembly/pom.xml   | 2 +-
 common/kvstore/pom.xml | 2 +-
 common/network-common/pom.xml  | 2 +-
 common/network-shuffle/pom.xml | 2 +-
 common/network-yarn/pom.xml| 2 +-
 common/sketch/pom.xml  | 2 +-
 common/tags/pom.xml| 2 +-
 common/unsafe/pom.xml  | 2 +-
 core/pom.xml   | 2 +-
 docs/_config.yml   | 6 +++---
 examples/pom.xml   | 2 +-
 external/avro/pom.xml  | 2 +-
 external/docker-integration-tests/pom.xml  | 2 +-
 external/kafka-0-10-assembly/pom.xml   | 2 +-
 external/kafka-0-10-sql/pom.xml| 2 +-
 external/kafka-0-10-token-provider/pom.xml | 2 +-
 external/kafka-0-10/pom.xml| 2 +-
 external/kinesis-asl-assembly/pom.xml  | 2 +-
 external/kinesis-asl/pom.xml   | 2 +-
 external/spark-ganglia-lgpl/pom.xml| 2 +-
 graphx/pom.xml | 2 +-
 hadoop-cloud/pom.xml   | 2 +-
 launcher/pom.xml   | 2 +-
 mllib-local/pom.xml| 2 +-
 mllib/pom.xml  | 2 +-
 pom.xml| 2 +-
 python/pyspark/version.py  | 2 +-
 repl/pom.xml   | 2 +-
 resource-managers/kubernetes/core/pom.xml  | 2 +-
 resource-managers/kubernetes/integration-tests/pom.xml | 2 +-
 resource-managers/mesos/pom.xml| 2 +-
 resource-managers/yarn/pom.xml | 2 +-
 sql/catalyst/pom.xml   | 2 +-
 sql/core/pom.xml   | 2 +-
 sql/hive-thriftserver/pom.xml  | 2 +-
 sql/hive/pom.xml   | 2 +-
 streaming/pom.xml  | 2 +-
 tools/pom.xml  | 2 +-
 39 files changed, 41 insertions(+), 41 deletions(-)

diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION
index 0e449e841cf..c1e490df26f 100644
--- a/R/pkg/DESCRIPTION
+++ b/R/pkg/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: SparkR
 Type: Package
-Version: 3.3.1
+Version: 3.3.2
 Title: R Front End for 'Apache Spark'
 Description: Provides an R Front end for 'Apache Spark' 
.
 Authors@R:
diff --git a/assembly/pom.xml b/assembly/pom.xml
index 32126a5e138..eff5e3419be 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.3.1
+3.3.2-SNAPSHOT
 ../pom.xml
   
 
diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml
index 21bf5609450..8834464f7f6 100644
--- a/common/kvstore/pom.xml
+++ b/common/kvstore/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.3.1
+3.3.2-SNAPSHOT
 ../../pom.xml
   
 
diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml
index 43740354d84..bfadba306c5 100644
--- a/common/network-common/pom.xml
+++ b/common/network-common/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.3.1
+3.3.2-SNAPSHOT
 ../../pom.xml
   
 
diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml
index 46c875dcb0a..287355ac07d 100644
--- a/common/network-shuffle/pom.xml
+++ b/common/network-shuffle/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.3.1
+3.3.2-SNAPSHOT
 ../../pom.xml
   
 
diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml
index d6d28fe4ec6..14d41802a8b 100644
--- a/common/network-yarn/pom.xml
+++ b/common/network-yarn/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.3.1
+3.3.2-SNAPSHOT
 ../../pom.xml
   
 
diff --git a/common/sketch/pom.xml b/common/sketch/pom.xml
index a37bc21ca6e..f6f26a262fd 100644
--- a/common/sketch/pom.xml
+++ b/common/sketch/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.3.1
+3.3.2-SNAPSHOT
 ../../pom.xml
   
 
diff --git a/common/tags/pom.xml b/common/tags/pom.xml
index 

[spark] 01/01: Preparing Spark release v3.3.1-rc3

2022-10-05 Thread yumwang
This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a commit to tag v3.3.1-rc3
in repository https://gitbox.apache.org/repos/asf/spark.git

commit 7c465bc3154cdd0d578f837c9b82e4289caf0b14
Author: Yuming Wang 
AuthorDate: Thu Oct 6 05:15:03 2022 +

Preparing Spark release v3.3.1-rc3
---
 R/pkg/DESCRIPTION  | 2 +-
 assembly/pom.xml   | 2 +-
 common/kvstore/pom.xml | 2 +-
 common/network-common/pom.xml  | 2 +-
 common/network-shuffle/pom.xml | 2 +-
 common/network-yarn/pom.xml| 2 +-
 common/sketch/pom.xml  | 2 +-
 common/tags/pom.xml| 2 +-
 common/unsafe/pom.xml  | 2 +-
 core/pom.xml   | 2 +-
 docs/_config.yml   | 6 +++---
 examples/pom.xml   | 2 +-
 external/avro/pom.xml  | 2 +-
 external/docker-integration-tests/pom.xml  | 2 +-
 external/kafka-0-10-assembly/pom.xml   | 2 +-
 external/kafka-0-10-sql/pom.xml| 2 +-
 external/kafka-0-10-token-provider/pom.xml | 2 +-
 external/kafka-0-10/pom.xml| 2 +-
 external/kinesis-asl-assembly/pom.xml  | 2 +-
 external/kinesis-asl/pom.xml   | 2 +-
 external/spark-ganglia-lgpl/pom.xml| 2 +-
 graphx/pom.xml | 2 +-
 hadoop-cloud/pom.xml   | 2 +-
 launcher/pom.xml   | 2 +-
 mllib-local/pom.xml| 2 +-
 mllib/pom.xml  | 2 +-
 pom.xml| 2 +-
 python/pyspark/version.py  | 2 +-
 repl/pom.xml   | 2 +-
 resource-managers/kubernetes/core/pom.xml  | 2 +-
 resource-managers/kubernetes/integration-tests/pom.xml | 2 +-
 resource-managers/mesos/pom.xml| 2 +-
 resource-managers/yarn/pom.xml | 2 +-
 sql/catalyst/pom.xml   | 2 +-
 sql/core/pom.xml   | 2 +-
 sql/hive-thriftserver/pom.xml  | 2 +-
 sql/hive/pom.xml   | 2 +-
 streaming/pom.xml  | 2 +-
 tools/pom.xml  | 2 +-
 39 files changed, 41 insertions(+), 41 deletions(-)

diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION
index c1e490df26f..0e449e841cf 100644
--- a/R/pkg/DESCRIPTION
+++ b/R/pkg/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: SparkR
 Type: Package
-Version: 3.3.2
+Version: 3.3.1
 Title: R Front End for 'Apache Spark'
 Description: Provides an R Front end for 'Apache Spark' 
.
 Authors@R:
diff --git a/assembly/pom.xml b/assembly/pom.xml
index eff5e3419be..32126a5e138 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.3.2-SNAPSHOT
+3.3.1
 ../pom.xml
   
 
diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml
index 8834464f7f6..21bf5609450 100644
--- a/common/kvstore/pom.xml
+++ b/common/kvstore/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.3.2-SNAPSHOT
+3.3.1
 ../../pom.xml
   
 
diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml
index bfadba306c5..43740354d84 100644
--- a/common/network-common/pom.xml
+++ b/common/network-common/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.3.2-SNAPSHOT
+3.3.1
 ../../pom.xml
   
 
diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml
index 287355ac07d..46c875dcb0a 100644
--- a/common/network-shuffle/pom.xml
+++ b/common/network-shuffle/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.3.2-SNAPSHOT
+3.3.1
 ../../pom.xml
   
 
diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml
index 14d41802a8b..d6d28fe4ec6 100644
--- a/common/network-yarn/pom.xml
+++ b/common/network-yarn/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.3.2-SNAPSHOT
+3.3.1
 ../../pom.xml
   
 
diff --git a/common/sketch/pom.xml b/common/sketch/pom.xml
index f6f26a262fd..a37bc21ca6e 100644
--- a/common/sketch/pom.xml
+++ b/common/sketch/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.3.2-SNAPSHOT
+3.3.1
 ../../pom.xml
   
 
diff --git a/common/tags/pom.xml b/common/tags/pom.xml
index 

[spark] tag v3.3.1-rc3 created (now 7c465bc3154)

2022-10-05 Thread yumwang
This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a change to tag v3.3.1-rc3
in repository https://gitbox.apache.org/repos/asf/spark.git


  at 7c465bc3154 (commit)
This tag includes the following new commits:

 new 7c465bc3154 Preparing Spark release v3.3.1-rc3

The 1 revision listed above as "new" is entirely new to this
repository and will be described in a separate email. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.






[spark] branch branch-3.2 updated: [SPARK-40660][SQL][3.3] Switch to XORShiftRandom to distribute elements

2022-10-05 Thread yumwang
This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new ad2fb0ec5f1 [SPARK-40660][SQL][3.3] Switch to XORShiftRandom to distribute elements
ad2fb0ec5f1 is described below

commit ad2fb0ec5f1054a2734de818d30be4606c624fbd
Author: Yuming Wang 
AuthorDate: Thu Oct 6 13:02:22 2022 +0800

[SPARK-40660][SQL][3.3] Switch to XORShiftRandom to distribute elements

### What changes were proposed in this pull request?

Cherry-picked from #38106 and reverted changes in RDD.scala:

https://github.com/apache/spark/blob/d2952b671a3579759ad9ce326ed8389f5270fd9f/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L507

### Why are the changes needed?

The number of output files has changed since SPARK-40407. [Some downstream 
projects](https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDataFilesProcedure.java#L578-L579)
 use repartition to determine the number of output files in the test.
```
bin/spark-shell --master "local[2]"

spark.range(10).repartition(10).write.mode("overwrite").parquet("/tmp/spark/repartition")
```
Before this PR and after SPARK-40407, the number of output files is 8. After this PR or before SPARK-40407, the number of output files is 10.
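
As a quick local check of that difference, a small snippet like the one below (not part of the patch; it assumes the same `/tmp/spark/repartition` output path as the example above) counts the part files produced by the write:

```scala
// Hypothetical verification snippet, assuming the /tmp/spark/repartition path used above.
// Paste it into the same spark-shell session after the write finishes.
val outputFiles = new java.io.File("/tmp/spark/repartition")
  .listFiles()
  .count(_.getName.startsWith("part-"))
println(s"part files written: $outputFiles") // 10 with this change, 8 without it
```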

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #38110 from wangyum/branch-3.3-SPARK-40660.

Authored-by: Yuming Wang 
Signed-off-by: Yuming Wang 
(cherry picked from commit 5fe895a65a4a9d65f81d43af473b5e3a855ed8c8)
Signed-off-by: Yuming Wang 
---
 .../spark/sql/execution/exchange/ShuffleExchangeExec.scala |  5 ++---
 .../src/test/scala/org/apache/spark/sql/DatasetSuite.scala | 14 +-
 .../sql/execution/adaptive/AdaptiveQueryExecSuite.scala|  4 ++--
 3 files changed, 17 insertions(+), 6 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
index bc8416c5b2e..f11ae0ed207 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
@@ -17,11 +17,9 @@
 
 package org.apache.spark.sql.execution.exchange
 
-import java.util.Random
 import java.util.function.Supplier
 
 import scala.concurrent.Future
-import scala.util.hashing
 
 import org.apache.spark._
 import org.apache.spark.internal.config
@@ -40,6 +38,7 @@ import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types.StructType
 import org.apache.spark.util.MutablePair
 import org.apache.spark.util.collection.unsafe.sort.{PrefixComparators, 
RecordComparator}
+import org.apache.spark.util.random.XORShiftRandom
 
 /**
  * Common trait for all shuffle exchange implementations to facilitate pattern 
matching.
@@ -314,7 +313,7 @@ object ShuffleExchangeExec {
 // end up being almost the same regardless of the index. substantially 
scrambling the
 // seed by hashing will help. Refer to SPARK-21782 for more details.
 val partitionId = TaskContext.get().partitionId()
-var position = new 
Random(hashing.byteswap32(partitionId)).nextInt(numPartitions)
+var position = new XORShiftRandom(partitionId).nextInt(numPartitions)
 (row: InternalRow) => {
   // The HashPartitioner will handle the `mod` by the number of 
partitions
   position += 1
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
index 617314eff4e..6cd4fe54f42 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
@@ -20,6 +20,7 @@ package org.apache.spark.sql
 import java.io.{Externalizable, ObjectInput, ObjectOutput}
 import java.sql.{Date, Timestamp}
 
+import org.apache.hadoop.fs.{Path, PathFilter}
 import org.scalatest.Assertions._
 import org.scalatest.exceptions.TestFailedException
 import org.scalatest.prop.TableDrivenPropertyChecks._
@@ -2064,7 +2065,18 @@ class DatasetSuite extends QueryTest
   test("SPARK-40407: repartition should not result in severe data skew") {
 val df = spark.range(0, 100, 1, 50).repartition(4)
 val result = df.mapPartitions(iter => 
Iterator.single(iter.length)).collect()
-assert(result.sorted.toSeq === Seq(19, 25, 25, 31))
+assert(result.sorted.toSeq === Seq(23, 25, 25, 27))
+  }
+
+  test("SPARK-40660: Switch to XORShiftRandom to distribute elements") {

[spark] branch branch-3.3 updated: [SPARK-40660][SQL][3.3] Switch to XORShiftRandom to distribute elements

2022-10-05 Thread yumwang
This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 5fe895a65a4 [SPARK-40660][SQL][3.3] Switch to XORShiftRandom to distribute elements
5fe895a65a4 is described below

commit 5fe895a65a4a9d65f81d43af473b5e3a855ed8c8
Author: Yuming Wang 
AuthorDate: Thu Oct 6 13:02:22 2022 +0800

[SPARK-40660][SQL][3.3] Switch to XORShiftRandom to distribute elements

### What changes were proposed in this pull request?

Cherry-picked from #38106 and reverted changes in RDD.scala:

https://github.com/apache/spark/blob/d2952b671a3579759ad9ce326ed8389f5270fd9f/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L507

### Why are the changes needed?

The number of output files has changed since SPARK-40407. [Some downstream 
projects](https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDataFilesProcedure.java#L578-L579)
 use repartition to determine the number of output files in the test.
```
bin/spark-shell --master "local[2]"

spark.range(10).repartition(10).write.mode("overwrite").parquet("/tmp/spark/repartition")
```
Before this PR and after SPARK-40407, the number of output files is 8. 
After this PR or before  SPARK-40407, the number of output files is 10.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #38110 from wangyum/branch-3.3-SPARK-40660.

Authored-by: Yuming Wang 
Signed-off-by: Yuming Wang 
---
 .../spark/sql/execution/exchange/ShuffleExchangeExec.scala |  5 ++---
 .../src/test/scala/org/apache/spark/sql/DatasetSuite.scala | 14 +-
 .../sql/execution/adaptive/AdaptiveQueryExecSuite.scala|  4 ++--
 3 files changed, 17 insertions(+), 6 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
index 9800a781402..964f1d6d518 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
@@ -17,11 +17,9 @@
 
 package org.apache.spark.sql.execution.exchange
 
-import java.util.Random
 import java.util.function.Supplier
 
 import scala.concurrent.Future
-import scala.util.hashing
 
 import org.apache.spark._
 import org.apache.spark.internal.config
@@ -40,6 +38,7 @@ import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types.StructType
 import org.apache.spark.util.MutablePair
 import org.apache.spark.util.collection.unsafe.sort.{PrefixComparators, 
RecordComparator}
+import org.apache.spark.util.random.XORShiftRandom
 
 /**
  * Common trait for all shuffle exchange implementations to facilitate pattern 
matching.
@@ -314,7 +313,7 @@ object ShuffleExchangeExec {
 // end up being almost the same regardless of the index. substantially 
scrambling the
 // seed by hashing will help. Refer to SPARK-21782 for more details.
 val partitionId = TaskContext.get().partitionId()
-var position = new 
Random(hashing.byteswap32(partitionId)).nextInt(numPartitions)
+var position = new XORShiftRandom(partitionId).nextInt(numPartitions)
 (row: InternalRow) => {
   // The HashPartitioner will handle the `mod` by the number of 
partitions
   position += 1
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
index c65ae966ef6..f5e736621eb 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
@@ -20,6 +20,7 @@ package org.apache.spark.sql
 import java.io.{Externalizable, ObjectInput, ObjectOutput}
 import java.sql.{Date, Timestamp}
 
+import org.apache.hadoop.fs.{Path, PathFilter}
 import org.scalatest.Assertions._
 import org.scalatest.exceptions.TestFailedException
 import org.scalatest.prop.TableDrivenPropertyChecks._
@@ -2138,7 +2139,18 @@ class DatasetSuite extends QueryTest
   test("SPARK-40407: repartition should not result in severe data skew") {
 val df = spark.range(0, 100, 1, 50).repartition(4)
 val result = df.mapPartitions(iter => 
Iterator.single(iter.length)).collect()
-assert(result.sorted.toSeq === Seq(19, 25, 25, 31))
+assert(result.sorted.toSeq === Seq(23, 25, 25, 27))
+  }
+
+  test("SPARK-40660: Switch to XORShiftRandom to distribute elements") {
+withTempDir { dir =>
+  

[spark] branch master updated (e2176338c9b -> 03b055fa4e5)

2022-10-05 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from e2176338c9b [SPARK-40665][CONNECT] Avoid embedding Spark Connect in the Apache Spark binary release
 add 03b055fa4e5 [SPARK-40643][PS] Implement `min_count` in `GroupBy.last`

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/groupby.py| 48 +
 python/pyspark/pandas/tests/test_groupby.py | 17 ++
 2 files changed, 59 insertions(+), 6 deletions(-)





[spark] branch dependabot/maven/connector/connect/com.google.protobuf-protobuf-java-3.21.7 created (now b391e8af868)

2022-10-05 Thread github-bot
This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a change to branch dependabot/maven/connector/connect/com.google.protobuf-protobuf-java-3.21.7
in repository https://gitbox.apache.org/repos/asf/spark.git


  at b391e8af868 Bump protobuf-java from 3.21.1 to 3.21.7 in /connector/connect

No new revisions were added by this update.





[spark] branch master updated: [SPARK-40665][CONNECT] Avoid embedding Spark Connect in the Apache Spark binary release

2022-10-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e2176338c9b [SPARK-40665][CONNECT] Avoid embedding Spark Connect in the Apache Spark binary release
e2176338c9b is described below

commit e2176338c9b4020b9d5dcd831038d350ce03137f
Author: Hyukjin Kwon 
AuthorDate: Thu Oct 6 12:27:15 2022 +0900

[SPARK-40665][CONNECT] Avoid embedding Spark Connect in the Apache Spark 
binary release

### What changes were proposed in this pull request?

This PR proposes

1. Move `connect` to `connector/connect` to be consistent with Kafka and 
Avro.
2. Do not include this in the default Apache Spark release binary.
3. Fix the module dependency in `modules.py`.
4. Fix the usages in `README.md` with cleaning up.
5. Cleanup PySpark test structure to be consistent with other PySpark tests.

### Why are the changes needed?

To make it consistent with Avro or Kafka, see also 
https://github.com/apache/spark/pull/37710/files#r978291019

### Does this PR introduce _any_ user-facing change?

No, this isn't released yet.

The usage of this project would be changed from:

```bash
./bin/spark-shell \
  --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
```

to

```bash
./bin/spark-shell \
  --packages org.apache.spark:spark-connect_2.12:3.4.0 \
  --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
```

### How was this patch tested?

CI in the PR should verify this.

Closes #38109 from HyukjinKwon/SPARK-40665.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 .github/labeler.yml|  2 +-
 .github/workflows/build_and_test.yml   |  2 +-
 assembly/pom.xml   |  5 -
 {connect => connector/connect}/.gitignore  |  0
 .../connect}/dev/generate_protos.sh|  4 ++--
 {connect => connector/connect}/pom.xml |  2 +-
 .../connect}/src/main/buf.gen.yaml |  0
 .../connect}/src/main/buf.work.yaml|  0
 .../connect}/src/main/protobuf/buf.yaml|  0
 .../src/main/protobuf/spark/connect/base.proto |  0
 .../src/main/protobuf/spark/connect/commands.proto |  0
 .../main/protobuf/spark/connect/expressions.proto  |  0
 .../main/protobuf/spark/connect/relations.proto|  0
 .../src/main/protobuf/spark/connect/types.proto|  0
 .../spark/sql/connect/SparkConnectPlugin.scala |  0
 .../command/SparkConnectCommandPlanner.scala   |  0
 .../apache/spark/sql/connect/config/Connect.scala  |  0
 .../sql/connect/planner/SparkConnectPlanner.scala  |  0
 .../sql/connect/service/SparkConnectService.scala  |  0
 .../service/SparkConnectStreamHandler.scala|  0
 .../connect}/src/test/resources/log4j2.properties  |  0
 .../connect/planner/SparkConnectPlannerSuite.scala |  0
 dev/deps/spark-deps-hadoop-2-hive-2.3  | 16 --
 dev/deps/spark-deps-hadoop-3-hive-2.3  | 16 --
 dev/sparktestsupport/modules.py| 25 --
 dev/sparktestsupport/utils.py  | 20 -
 pom.xml|  2 +-
 python/mypy.ini|  9 +++-
 python/pyspark/sql/connect/README.md   | 17 ---
 python/pyspark/sql/tests/connect/__init__.py   | 16 --
 python/pyspark/sql/tests/connect/utils/__init__.py | 20 -
 ...test_spark_connect.py => test_connect_basic.py} |  8 +--
 ...sions.py => test_connect_column_expressions.py} |  5 ++---
 ...test_plan_only.py => test_connect_plan_only.py} |  4 ++--
 ...st_select_ops.py => test_connect_select_ops.py} |  4 ++--
 .../connectutils.py}   | 21 ++
 python/run-tests.py| 12 +--
 37 files changed, 80 insertions(+), 130 deletions(-)

diff --git a/.github/labeler.yml b/.github/labeler.yml
index 0d04244f882..cf1d2a71172 100644
--- a/.github/labeler.yml
+++ b/.github/labeler.yml
@@ -152,6 +152,6 @@ WEB UI:
 DEPLOY:
   - "sbin/**/*"
 CONNECT:
-  - "connect/**/*"
+  - "connector/connect/**/*"
   - "**/sql/sparkconnect/**/*"
   - "python/pyspark/sql/**/connect/**/*"
diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index b0847187dff..b7f8b10c00f 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -334,7 +334,7 @@ jobs:
   - >-
 pyspark-pandas-slow
   - >-
-pyspark-sql-connect
+pyspark-connect
 env:
   MODULES_TO_TEST: ${{ matrix.modules 

[spark] branch branch-3.3 updated: [SPARK-40669][SQL][TESTS] Parameterize `rowsNum` in `InMemoryColumnarBenchmark`

2022-10-05 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 5dc9ba0d227 [SPARK-40669][SQL][TESTS] Parameterize `rowsNum` in `InMemoryColumnarBenchmark`
5dc9ba0d227 is described below

commit 5dc9ba0d22741173bd122afb387c54d7ca4bfb6d
Author: Dongjoon Hyun 
AuthorDate: Wed Oct 5 18:01:55 2022 -0700

[SPARK-40669][SQL][TESTS] Parameterize `rowsNum` in 
`InMemoryColumnarBenchmark`

This PR aims to parameterize `InMemoryColumnarBenchmark` to accept 
`rowsNum`.

This enables us to benchmark more flexibly.
```
build/sbt "sql/test:runMain 
org.apache.spark.sql.execution.columnar.InMemoryColumnarBenchmark 100"
...
[info] Running benchmark: Int In-Memory scan
[info]   Running case: columnar deserialization + columnar-to-row
[info]   Stopped after 3 iterations, 444 ms
[info]   Running case: row-based deserialization
[info]   Stopped after 3 iterations, 462 ms
[info] OpenJDK 64-Bit Server VM 17.0.4+8-LTS on Mac OS X 12.6
[info] Apple M1 Max
[info] Int In-Memory scan:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
[info] ---------------------------------------------------------------------------------------------------------------------------
[info] columnar deserialization + columnar-to-row            119             148          26         8.4         118.5       1.0X
[info] row-based deserialization                              119             154          32         8.4         119.5       1.0X
```

```
$ build/sbt "sql/test:runMain 
org.apache.spark.sql.execution.columnar.InMemoryColumnarBenchmark 1000"
...
[info] Running benchmark: Int In-Memory scan
[info]   Running case: columnar deserialization + columnar-to-row
[info]   Stopped after 3 iterations, 3855 ms
[info]   Running case: row-based deserialization
[info]   Stopped after 3 iterations, 4250 ms
[info] OpenJDK 64-Bit Server VM 17.0.4+8-LTS on Mac OS X 12.6
[info] Apple M1 Max
[info] Int In-Memory scan:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
[info] ---------------------------------------------------------------------------------------------------------------------------
[info] columnar deserialization + columnar-to-row           1082            1285         199         9.2         108.2       1.0X
[info] row-based deserialization                             1057            1417         335         9.5         105.7       1.0X
```

```
$ build/sbt "sql/test:runMain 
org.apache.spark.sql.execution.columnar.InMemoryColumnarBenchmark 2000"
[info] Running benchmark: Int In-Memory scan
[info]   Running case: columnar deserialization + columnar-to-row
[info]   Stopped after 3 iterations, 8482 ms
[info]   Running case: row-based deserialization
[info]   Stopped after 3 iterations, 7534 ms
[info] OpenJDK 64-Bit Server VM 17.0.4+8-LTS on Mac OS X 12.6
[info] Apple M1 Max
[info] Int In-Memory scan:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
[info] ---------------------------------------------------------------------------------------------------------------------------
[info] columnar deserialization + columnar-to-row           2261            2828         555         8.8         113.1       1.0X
[info] row-based deserialization                             1788            2511        1187        11.2          89.4       1.3X
```
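
The general pattern is just reading an optional row count from the benchmark's arguments; a minimal sketch of that idea (the object name, argument handling, and default value below are assumptions for illustration, not the exact code in the diff further down) looks like:

```scala
// Illustrative sketch only: read an optional rowsNum argument, falling back to a default.
// The real change lives in InMemoryColumnarBenchmark (see the diff below).
object RowsNumArgSketch {
  def main(args: Array[String]): Unit = {
    val rowsNum: Long = args.headOption.map(_.toLong).getOrElse(1000000L)
    println(s"running Int In-Memory scan with rowsNum = $rowsNum")
    // ... create the cached relation with spark.range(rowsNum) and run both cases ...
  }
}
```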

No. This is a benchmark test code.

Manually.

Closes #38114 from dongjoon-hyun/SPARK-40669.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 95cfdc694d3e0b68979cd06b78b52e107aa58a9f)
Signed-off-by: Dongjoon Hyun 
---
 .../sql/execution/columnar/InMemoryColumnarBenchmark.scala | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarBenchmark.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarBenchmark.scala
index b975451e135..55d9fb27317 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarBenchmark.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarBenchmark.scala
@@ -26,14 +26,15 @@ import 
org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark
  * {{{
  *   1. without sbt:
  *  bin/spark-submit --class 
- *--jars  
- *   2. build/sbt "sql/test:runMain "
- *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt 

[spark] branch master updated (04142136f18 -> 95cfdc694d3)

2022-10-05 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 04142136f18 [SPARK-40537][CONNECT] Enable mypy for Spark Connect Python Client
 add 95cfdc694d3 [SPARK-40669][SQL][TESTS] Parameterize `rowsNum` in `InMemoryColumnarBenchmark`

No new revisions were added by this update.

Summary of changes:
 .../sql/execution/columnar/InMemoryColumnarBenchmark.scala | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)





[spark] branch master updated (34d5272663c -> e8fdf8e9994)

2022-10-05 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 34d5272663c [SPARK-40607][CORE][SQL][MLLIB][SS] Remove redundant string interpolator operations
 add e8fdf8e9994 [SPARK-40651][INFRA] Drop Hadoop2 binary distribution from release process

No new revisions were added by this update.

Summary of changes:
 dev/create-release/release-build.sh | 1 -
 1 file changed, 1 deletion(-)





[spark] branch master updated: [SPARK-40607][CORE][SQL][MLLIB][SS] Remove redundant string interpolator operations

2022-10-05 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 34d5272663c [SPARK-40607][CORE][SQL][MLLIB][SS] Remove redundant string interpolator operations
34d5272663c is described below

commit 34d5272663ce4852ca5b2daa665983a321b42060
Author: yangjie01 
AuthorDate: Wed Oct 5 18:05:12 2022 -0500

[SPARK-40607][CORE][SQL][MLLIB][SS] Remove redundant string interpolator 
operations

### What changes were proposed in this pull request?
This PR removes redundant string interpolator operations in Spark code; the change does not cover code related to logs, exceptions, and `configurations.doc`.
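
As a minimal illustration of the pattern being cleaned up (mirroring the `TaskEndReason` hunk further down; the surrounding object and values here are made up):

```scala
object RedundantInterpolatorExample {
  def main(args: Array[String]): Unit = {
    val jobID = 1
    val partitionID = 2
    // Redundant: the first literal interpolates nothing, so the `s` prefix does no work.
    val before = s"TaskCommitDenied (Driver denied task commit)" + s" for job: $jobID, partition: $partitionID"
    // Cleaned up: keep the interpolator only on the piece that actually splices values.
    val after = "TaskCommitDenied (Driver denied task commit)" + s" for job: $jobID, partition: $partitionID"
    assert(before == after) // behavior is unchanged; only the unnecessary call is removed
  }
}
```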

### Why are the changes needed?
Clean up unnecessary function calls

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass Github Actions

Closes #38043 from LuciferYang/unused-s.

Authored-by: yangjie01 
Signed-off-by: Sean Owen 
---
 core/src/main/scala/org/apache/spark/TaskEndReason.scala   |  2 +-
 core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala   | 10 +-
 .../org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala |  4 ++--
 .../scala/org/apache/spark/sql/catalyst/expressions/Cast.scala |  4 ++--
 .../spark/sql/catalyst/expressions/collectionOperations.scala  |  2 +-
 .../spark/sql/catalyst/expressions/datetimeExpressions.scala   |  2 +-
 .../org/apache/spark/sql/catalyst/expressions/literals.scala   |  2 +-
 .../sql/catalyst/optimizer/PullOutGroupingExpressions.scala|  2 +-
 .../org/apache/spark/sql/catalyst/parser/AstBuilder.scala  |  2 +-
 .../main/scala/org/apache/spark/sql/execution/HiveResult.scala |  2 +-
 .../spark/sql/execution/aggregate/HashMapGenerator.scala   |  2 +-
 .../sql/execution/aggregate/RowBasedHashMapGenerator.scala |  2 +-
 .../apache/spark/sql/execution/basicPhysicalOperators.scala|  2 +-
 .../spark/sql/execution/joins/BroadcastHashJoinExec.scala  |  2 +-
 .../spark/sql/execution/streaming/ResolveWriteToStream.scala   |  2 +-
 .../src/main/scala/org/apache/spark/sql/jdbc/H2Dialect.scala   |  2 +-
 .../main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala |  2 +-
 17 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/TaskEndReason.scala 
b/core/src/main/scala/org/apache/spark/TaskEndReason.scala
index 5dc70e9834b..f1ce302a05d 100644
--- a/core/src/main/scala/org/apache/spark/TaskEndReason.scala
+++ b/core/src/main/scala/org/apache/spark/TaskEndReason.scala
@@ -242,7 +242,7 @@ case class TaskCommitDenied(
 jobID: Int,
 partitionID: Int,
 attemptNumber: Int) extends TaskFailedReason {
-  override def toErrorString: String = s"TaskCommitDenied (Driver denied task 
commit)" +
+  override def toErrorString: String = "TaskCommitDenied (Driver denied task 
commit)" +
 s" for job: $jobID, partition: $partitionID, attemptNumber: $attemptNumber"
   /**
* If a task failed because its attempt to commit was denied, do not count 
this failure
diff --git a/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala 
b/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala
index 8106eec847e..1934e9e58e6 100644
--- a/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala
+++ b/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala
@@ -360,7 +360,7 @@ private[ui] class StagePage(parent: StagesTab, store: 
AppStatusStore) extends We
|'content': '
+ |data-title="${"Task " + index + " (attempt " + attempt + 
")"}
  |Status: ${taskInfo.status}
  |Launch Time: ${UIUtils.formatDate(new Date(launchTime))}
  |${
@@ -416,7 +416,7 @@ private[ui] class StagePage(parent: StagesTab, store: 
AppStatusStore) extends We
   Enable zooming
 
 
-  
 
 
 . Show 
 
@@ -445,7 +445,7 @@ private[ui] class StagePage(parent: StagesTab, store: 
AppStatusStore) extends We
   {TIMELINE_LEGEND}
  ++
 
-  {Unparsed(s"drawTaskAssignmentTimeline(" +
+  {Unparsed("drawTaskAssignmentTimeline(" +
   s"$groupArrayStr, $executorsArrayStr, $minLaunchTime, $maxFinishTime, " +
 s"${UIUtils.getTimeZoneOffset()})")}
 
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala
index 2f6b9c1e11a..c61aa14edca 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala
@@ -144,8 +144,8 @@ private[shared] object SharedParamsCodeGen {
 case _ if c == classOf[Float] => 

[GitHub] [spark-website] srowen closed pull request #414: CVE version update

2022-10-05 Thread GitBox


srowen closed pull request #414: CVE version update
URL: https://github.com/apache/spark-website/pull/414





[GitHub] [spark-website] srowen opened a new pull request, #414: CVE version update

2022-10-05 Thread GitBox


srowen opened a new pull request, #414:
URL: https://github.com/apache/spark-website/pull/414

   See the mailing list discussion. The idea is to give a 'resolved by' version for older CVEs that are advisory only or affected only the build.





[spark] branch master updated: [SPARK-40663][SQL] Migrate execution errors onto error classes: _LEGACY_ERROR_TEMP_2000-2025

2022-10-05 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 1e305643b3b [SPARK-40663][SQL] Migrate execution errors onto error classes: _LEGACY_ERROR_TEMP_2000-2025
1e305643b3b is described below

commit 1e305643b3b4adfbfdab8ed181f55ea2e6b74f0d
Author: itholic 
AuthorDate: Wed Oct 5 13:14:19 2022 +0300

[SPARK-40663][SQL] Migrate execution errors onto error classes: 
_LEGACY_ERROR_TEMP_2000-2025

### What changes were proposed in this pull request?

This PR proposes to migrate 26 execution errors onto temporary error classes, `_LEGACY_ERROR_TEMP_2000` through `_LEGACY_ERROR_TEMP_2025`.

The `_LEGACY_ERROR_TEMP_` prefix marks dev-facing error messages that won't be exposed to end users.

### Why are the changes needed?

To speed-up the error class migration.

Migrating onto temporary error classes allows us to analyze the errors, so we can detect the most popular error classes.
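
For intuition only, the mechanism can be pictured as a template lookup plus placeholder substitution. The sketch below is a generic illustration, not Spark's actual `SparkThrowable` machinery, and the placeholder names are invented (the archived JSON diff below appears to have its placeholder tokens stripped):

```scala
// Generic illustration of an error-class registry: message templates keyed by class name,
// with named placeholders filled in when the error is raised. Placeholder names here are
// assumptions; they are not taken from the real error-classes.json.
object ErrorClassSketch {
  private val templates = Map(
    "_LEGACY_ERROR_TEMP_2007" ->
      "Regex group count is <groupCount>, but the specified group index is <groupIndex>"
  )

  def render(errorClass: String, params: Map[String, String]): String =
    params.foldLeft(templates(errorClass)) { case (msg, (name, value)) =>
      msg.replace(s"<$name>", value)
    }

  def main(args: Array[String]): Unit =
    println(render("_LEGACY_ERROR_TEMP_2007", Map("groupCount" -> "2", "groupIndex" -> "5")))
}
```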

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
$ build/sbt "test:testOnly *SQLQuerySuite"
$ build/sbt -Phadoop-3 -Phive-thriftserver catalyst/test 
hive-thriftserver/test
```

Closes #38104 from itholic/SPARK-40540-2000.

Authored-by: itholic 
Signed-off-by: Max Gekk 
---
 core/src/main/resources/error/error-classes.json   | 130 ++
 .../spark/sql/errors/QueryExecutionErrors.scala| 186 ++---
 .../expressions/aggregate/PercentileSuite.scala|  18 +-
 .../resources/sql-tests/results/ansi/date.sql.out  |  29 +++-
 .../sql-tests/results/ansi/timestamp.sql.out   |  30 +++-
 .../sql-tests/results/postgreSQL/date.sql.out  |  30 +++-
 .../sql-tests/results/regexp-functions.sql.out | 104 +---
 .../results/timestampNTZ/timestamp-ansi.sql.out|  30 +++-
 8 files changed, 438 insertions(+), 119 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json 
b/core/src/main/resources/error/error-classes.json
index 653c4a6938f..d27a2cbde97 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -3002,5 +3002,135 @@
 "message" : [
   "Failed to execute command because subquery expressions are not allowed 
in DEFAULT values."
 ]
+  },
+  "_LEGACY_ERROR_TEMP_2000" : {
+"message" : [
+  ". If necessary set  to false to bypass this error."
+]
+  },
+  "_LEGACY_ERROR_TEMP_2001" : {
+"message" : [
+  " If necessary set  to false to bypass this error"
+]
+  },
+  "_LEGACY_ERROR_TEMP_2002" : {
+"message" : [
+  ""
+]
+  },
+  "_LEGACY_ERROR_TEMP_2003" : {
+"message" : [
+  "Unsuccessful try to zip maps with  unique keys due to exceeding 
the array size limit "
+]
+  },
+  "_LEGACY_ERROR_TEMP_2004" : {
+"message" : [
+  "no default for type "
+]
+  },
+  "_LEGACY_ERROR_TEMP_2005" : {
+"message" : [
+  "Type  does not support ordered operations"
+]
+  },
+  "_LEGACY_ERROR_TEMP_2006" : {
+"message" : [
+  "The specified group index cannot be less than zero"
+]
+  },
+  "_LEGACY_ERROR_TEMP_2007" : {
+"message" : [
+  "Regex group count is , but the specified group index is 
"
+]
+  },
+  "_LEGACY_ERROR_TEMP_2008" : {
+"message" : [
+  "Find an invalid url string . If necessary set  to 
false to bypass this error."
+]
+  },
+  "_LEGACY_ERROR_TEMP_2009" : {
+"message" : [
+  "dataType"
+]
+  },
+  "_LEGACY_ERROR_TEMP_2010" : {
+"message" : [
+  "Window Functions do not support merging."
+]
+  },
+  "_LEGACY_ERROR_TEMP_2011" : {
+"message" : [
+  "Unexpected data type "
+]
+  },
+  "_LEGACY_ERROR_TEMP_2012" : {
+"message" : [
+  "Unexpected type "
+]
+  },
+  "_LEGACY_ERROR_TEMP_2013" : {
+"message" : [
+  "Negative values found in "
+]
+  },
+  "_LEGACY_ERROR_TEMP_2014" : {
+"message" : [
+  " is not matched at addNewFunction"
+]
+  },
+  "_LEGACY_ERROR_TEMP_2015" : {
+"message" : [
+  "Cannot generate  code for incomparable type: "
+]
+  },
+  "_LEGACY_ERROR_TEMP_2016" : {
+"message" : [
+  "Can not interpolate  into code block."
+]
+  },
+  "_LEGACY_ERROR_TEMP_2017" : {
+"message" : [
+  "not resolved"
+]
+  },
+  "_LEGACY_ERROR_TEMP_2018" : {
+"message" : [
+  "class `` is not supported by `MapObjects` as resulting collection."
+]
+  },
+  "_LEGACY_ERROR_TEMP_2019" : {
+"message" : [
+  "Cannot use null as map key!"
+]
+  },
+  "_LEGACY_ERROR_TEMP_2020" : {
+"message" : [
+  "Couldn't find a valid 

[spark] branch master updated (e6bebb66651 -> 5600bef0ee6)

2022-10-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from e6bebb66651 [SPARK-40660][CORE][SQL] Switch to XORShiftRandom to distribute elements
 add 5600bef0ee6 [SPARK-40635][YARN][TESTS] Fix `yarn` module daily test failed with `hadoop2`

No new revisions were added by this update.

Summary of changes:
 .../main/scala/org/apache/spark/deploy/yarn/Client.scala| 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)





[spark] branch master updated: [SPARK-40660][CORE][SQL] Switch to XORShiftRandom to distribute elements

2022-10-05 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e6bebb66651 [SPARK-40660][CORE][SQL] Switch to XORShiftRandom to distribute elements
e6bebb66651 is described below

commit e6bebb66651a1ff06f821bd4ee2b7b52bd532c01
Author: Yuming Wang 
AuthorDate: Wed Oct 5 14:00:55 2022 +0800

[SPARK-40660][CORE][SQL] Switch to XORShiftRandom to distribute elements

### What changes were proposed in this pull request?

This PR replaces `Random(hashing.byteswap32(index))` with `XORShiftRandom(index)` to distribute elements evenly across output partitions.

### Why are the changes needed?

It seems that the distribution using `XORShiftRandom` is better. For 
example:

1. The number of output files has changed since SPARK-40407. [Some 
downstream 
projects](https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDataFilesProcedure.java#L578-L579)
 use repartition to determine the number of output files in the test.
   ```
   bin/spark-shell --master "local[2]"
   
spark.range(10).repartition(10).write.mode("overwrite").parquet("/tmp/spark/repartition")
   ```
   Before this PR and after SPARK-40407, the number of output files is 8. 
After this PR or before  SPARK-40407, the number of output files is 10.

2. The distribution using `XORShiftRandom` seems better (a simplified xorshift sketch follows after the output below).
   ```scala
   import java.util.Random
   import org.apache.spark.util.random.XORShiftRandom
   import scala.util.hashing

   def distribution(count: Int, partition: Int) = {
     println((1 to count).map(partitionId => new Random(partitionId).nextInt(partition))
       .groupBy(f => f)
       .map(_._2.size).mkString(". "))

     println((1 to count).map(partitionId => new Random(hashing.byteswap32(partitionId)).nextInt(partition))
       .groupBy(f => f)
       .map(_._2.size).mkString(". "))

     println((1 to count).map(partitionId => new XORShiftRandom(partitionId).nextInt(partition))
       .groupBy(f => f)
       .map(_._2.size).mkString(". "))
   }

   distribution(200, 4)
   ```
   The output:
   ```
   200
   50. 60. 46. 44
   55. 48. 43. 54
   ```
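
For readers unfamiliar with xorshift generators, the sketch below shows the basic update step using one classic 64-bit shift triple. It is illustrative only and is not Spark's `XORShiftRandom` (which uses its own constants and additionally scrambles the initial seed before use):

```scala
// Illustrative xorshift64 step (Marsaglia's 13/7/17 shift triple), not Spark's implementation.
final class TinyXorShift(seed: Long) {
  private var state = if (seed != 0L) seed else 1L // xorshift state must be non-zero
  def nextLong(): Long = {
    state ^= state << 13
    state ^= state >>> 7
    state ^= state << 17
    state
  }
  def nextInt(bound: Int): Int = ((nextLong() >>> 1) % bound).toInt
}

object TinyXorShiftDemo {
  def main(args: Array[String]): Unit = {
    // Bucket the first draw for seeds 1..200, analogous to distribution(200, 4) above.
    println((1 to 200).map(i => new TinyXorShift(i).nextInt(4)).groupBy(identity).map(_._2.size).mkString(". "))
  }
}
```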

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #38106 from wangyum/SPARK-40660.

Authored-by: Yuming Wang 
Signed-off-by: Wenchen Fan 
---
 core/src/main/scala/org/apache/spark/rdd/RDD.scala |  5 ++---
 .../spark/sql/execution/exchange/ShuffleExchangeExec.scala |  5 ++---
 .../src/test/scala/org/apache/spark/sql/DatasetSuite.scala | 14 +-
 .../sql/execution/adaptive/AdaptiveQueryExecSuite.scala|  4 ++--
 4 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/rdd/RDD.scala 
b/core/src/main/scala/org/apache/spark/rdd/RDD.scala
index d12804fc12b..18f3f87f30f 100644
--- a/core/src/main/scala/org/apache/spark/rdd/RDD.scala
+++ b/core/src/main/scala/org/apache/spark/rdd/RDD.scala
@@ -25,7 +25,6 @@ import scala.io.Codec
 import scala.language.implicitConversions
 import scala.ref.WeakReference
 import scala.reflect.{classTag, ClassTag}
-import scala.util.hashing
 
 import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus
 import org.apache.hadoop.io.{BytesWritable, NullWritable, Text}
@@ -50,7 +49,7 @@ import org.apache.spark.util.Utils
 import org.apache.spark.util.collection.{ExternalAppendOnlyMap, OpenHashMap,
   Utils => collectionUtils}
 import org.apache.spark.util.random.{BernoulliCellSampler, BernoulliSampler, 
PoissonSampler,
-  SamplingUtils}
+  SamplingUtils, XORShiftRandom}
 
 /**
  * A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. 
Represents an immutable,
@@ -505,7 +504,7 @@ abstract class RDD[T: ClassTag](
 if (shuffle) {
   /** Distributes elements evenly across output partitions, starting from 
a random partition. */
   val distributePartition = (index: Int, items: Iterator[T]) => {
-var position = new 
Random(hashing.byteswap32(index)).nextInt(numPartitions)
+var position = new XORShiftRandom(index).nextInt(numPartitions)
 items.map { t =>
   // Note that the hash code of the key will just be the key itself. 
The HashPartitioner
   // will mod it with the number of total partitions.
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
index 6f287028f74..806a048b244 100644
---