(spark) branch master updated: [SPARK-46808][PYTHON] Refine error classes in Python with automatic sorting function

2024-01-22 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new eeed19158ea5 [SPARK-46808][PYTHON] Refine error classes in Python with 
automatic sorting function
eeed19158ea5 is described below

commit eeed19158ea53510cef768136d95021c8e39f247
Author: Hyukjin Kwon 
AuthorDate: Tue Jan 23 16:33:23 2024 +0900

[SPARK-46808][PYTHON] Refine error classes in Python with automatic sorting 
function

### What changes were proposed in this pull request?

This PR proposes:
- Add an automated way of writing the `error_classes.py` file: `from pyspark.errors.exceptions import _write_self; _write_self()` (see the sketch after this list)
- Fix the formatting of the JSON file to be consistent
- Fix typos within the error messages
- Fix parameter names to be consistent (it fixes some, not all)
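
For reference, a minimal sketch of what such a sorting/writing helper could look like (illustrative only; the actual `_write_self()` added to `pyspark/errors/exceptions/__init__.py` may differ, for example in how it preserves the license header):

```python
import json
from pathlib import Path

import pyspark.errors.error_classes as error_classes


def _write_self() -> None:
    # Parse the embedded JSON and re-emit it with sorted keys and a consistent
    # 2-space indent, so newly added error classes never need manual sorting.
    errors = json.loads(error_classes.ERROR_CLASSES_JSON)
    body = json.dumps(errors, sort_keys=True, indent=2)
    # Simplified: the real helper would also keep the license header and the
    # rest of error_classes.py intact around the JSON literal.
    Path(error_classes.__file__).write_text(
        "import json\n\n\nERROR_CLASSES_JSON = '''\n%s\n'''\n" % body
    )
```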

### Why are the changes needed?

- Now, it is difficult to add a new error class because the file enforces alphabetical order of error classes, etc. When you add multiple error classes, you have to manually fix and move them around, which is troublesome.
- In addition, the current JSON format isn't very consistent.
- For consistency, this PR also includes changes to *some* of the parameter naming.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes a couple of typos.

### How was this patch tested?

Existing unit tests were updated accordingly.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44848 from HyukjinKwon/SPARK-46808.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/errors/error_classes.py | 747 +++--
 python/pyspark/errors/exceptions/__init__.py   |  40 ++
 python/pyspark/sql/connect/dataframe.py|   2 +-
 python/pyspark/sql/connect/session.py  |   2 +-
 python/pyspark/sql/dataframe.py|   6 +-
 python/pyspark/sql/session.py  |   2 +-
 .../sql/tests/connect/test_connect_basic.py|   2 +-
 python/pyspark/sql/tests/test_dataframe.py |   4 +-
 python/pyspark/sql/tests/test_functions.py |   2 +-
 python/pyspark/sql/types.py|   2 +-
 python/pyspark/worker.py   |   2 +-
 11 files changed, 429 insertions(+), 382 deletions(-)

diff --git a/python/pyspark/errors/error_classes.py 
b/python/pyspark/errors/error_classes.py
index 7d627a508d12..a643615803c2 100644
--- a/python/pyspark/errors/error_classes.py
+++ b/python/pyspark/errors/error_classes.py
@@ -14,13 +14,18 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
+
+# NOTE: Automatically sort this file via
+# - cd $SPARK_HOME
+# - bin/pyspark
+# - from pyspark.errors.exceptions import _write_self; _write_self()
 import json
 
 
-ERROR_CLASSES_JSON = """
+ERROR_CLASSES_JSON = '''
 {
-  "APPLICATION_NAME_NOT_SET" : {
-"message" : [
+  "APPLICATION_NAME_NOT_SET": {
+"message": [
   "An application name must be set in your configuration."
 ]
   },
@@ -34,18 +39,18 @@ ERROR_CLASSES_JSON = """
   "Arrow legacy IPC format is not supported in PySpark, please unset 
ARROW_PRE_0_15_IPC_FORMAT."
 ]
   },
-  "ATTRIBUTE_NOT_CALLABLE" : {
-"message" : [
+  "ATTRIBUTE_NOT_CALLABLE": {
+"message": [
   "Attribute `` in provided object `` is not 
callable."
 ]
   },
-  "ATTRIBUTE_NOT_SUPPORTED" : {
-"message" : [
+  "ATTRIBUTE_NOT_SUPPORTED": {
+"message": [
   "Attribute `` is not supported."
 ]
   },
-  "AXIS_LENGTH_MISMATCH" : {
-"message" : [
+  "AXIS_LENGTH_MISMATCH": {
+"message": [
   "Length mismatch: Expected axis has  element, new 
values have  elements."
 ]
   },
@@ -116,12 +121,12 @@ ERROR_CLASSES_JSON = """
   },
   "CANNOT_INFER_ARRAY_TYPE": {
 "message": [
-  "Can not infer Array Type from an list with None as the first element."
+  "Can not infer Array Type from a list with None as the first element."
 ]
   },
   "CANNOT_INFER_EMPTY_SCHEMA": {
 "message": [
-  "Can not infer schema from empty dataset."
+  "Can not infer schema from an empty dataset."
 ]
   },
   "CANNOT_INFER_SCHEMA_FOR_TYPE": {
@@ -151,7 +156,7 @@ ERROR_CLASSES_JSON = """
   },
   "CANNOT_PROVIDE_METADATA": {
 "message": [
-  "metadata can only be provided for a single column."
+  "Metadata can only be provided for a single column."
 ]
   },
   "CANNOT_SET_TOGETHER": {
@@ -174,43 +179,43 @@ ERROR_CLASSES_JSON = """
   "`` does not allow a Column in a list."
 ]
   },
-  "CONNECT_URL_ALREADY_DEFINED" : {
-"message" : [
+  "CONNECT_URL_ALREADY_DEFINED": {
+"message": [
   "Only one Spark Connect client URL can be set; 

(spark) branch master updated: [SPARK-46807][DOCS] Add automation notice to generated SQL error class docs

2024-01-22 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e7fb0ad68f73 [SPARK-46807][DOCS] Add automation notice to generated 
SQL error class docs
e7fb0ad68f73 is described below

commit e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4
Author: Nicholas Chammas 
AuthorDate: Tue Jan 23 15:41:28 2024 +0900

[SPARK-46807][DOCS] Add automation notice to generated SQL error class docs

### What changes were proposed in this pull request?

Add an HTML comment to all generated SQL error class documents warning the 
reader that they should not edit the file.
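
For illustration, a hedged sketch (in Python; the real change is driven by the Scala `SparkThrowableSuite` used in the command below, and the exact wording of the notice may differ) of prepending such a do-not-edit comment to a generated Markdown file:

```python
from pathlib import Path

# Hypothetical wording; the comment text actually added by this PR may differ.
NOTICE = (
    "<!--\n"
    "  This file is automatically generated. Do NOT edit it directly;\n"
    "  regenerate it via SparkThrowableSuite instead.\n"
    "-->\n\n"
)


def add_notice(md_file: Path) -> None:
    text = md_file.read_text()
    if "automatically generated" not in text:  # avoid inserting the notice twice
        md_file.write_text(NOTICE + text)
```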

### Why are the changes needed?

This should help prevent problems like the one in #44825.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I ran the following command and reviewed the modified documents:

```sh
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "core/testOnly *SparkThrowableSuite -- -t \"Error classes match with document\""
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44847 from nchammas/SPARK-46807-sql-error-automation-notice.

Authored-by: Nicholas Chammas 
Signed-off-by: Hyukjin Kwon 
---
 core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala | 7 ++-
 docs/sql-error-conditions-as-of-join-error-class.md| 5 +
 ...error-conditions-cannot-create-data-source-table-error-class.md | 5 +
 docs/sql-error-conditions-cannot-load-state-store-error-class.md   | 5 +
 docs/sql-error-conditions-cannot-update-field-error-class.md   | 5 +
 docs/sql-error-conditions-cannot-write-state-store-error-class.md  | 5 +
 ...-error-conditions-collection-size-limit-exceeded-error-class.md | 7 +++
 ...-conditions-complex-expression-unsupported-input-error-class.md | 5 +
 docs/sql-error-conditions-connect-error-class.md   | 5 +
 ...ror-conditions-create-view-column-arity-mismatch-error-class.md | 5 +
 docs/sql-error-conditions-datatype-mismatch-error-class.md | 5 +
 ...onditions-duplicate-routine-parameter-assignment-error-class.md | 5 +
 docs/sql-error-conditions-expect-table-not-view-error-class.md | 5 +
 docs/sql-error-conditions-expect-view-not-table-error-class.md | 5 +
 docs/sql-error-conditions-failed-jdbc-error-class.md   | 5 +
 ...sql-error-conditions-incompatible-data-for-table-error-class.md | 5 +
 .../sql-error-conditions-incomplete-type-definition-error-class.md | 5 +
 ...r-conditions-inconsistent-behavior-cross-version-error-class.md | 5 +
 ...ql-error-conditions-insert-column-arity-mismatch-error-class.md | 5 +
 ...sql-error-conditions-insufficient-table-property-error-class.md | 5 +
 ...error-conditions-internal-error-metadata-catalog-error-class.md | 5 +
 docs/sql-error-conditions-invalid-boundary-error-class.md  | 5 +
 docs/sql-error-conditions-invalid-cursor-error-class.md| 5 +
 docs/sql-error-conditions-invalid-default-value-error-class.md | 5 +
 docs/sql-error-conditions-invalid-format-error-class.md| 5 +
 docs/sql-error-conditions-invalid-handle-error-class.md| 5 +
 docs/sql-error-conditions-invalid-inline-table-error-class.md  | 5 +
 ...conditions-invalid-inverse-distribution-function-error-class.md | 5 +
 ...ql-error-conditions-invalid-lambda-function-call-error-class.md | 5 +
 ...l-error-conditions-invalid-limit-like-expression-error-class.md | 5 +
 docs/sql-error-conditions-invalid-observed-metrics-error-class.md  | 5 +
 docs/sql-error-conditions-invalid-options-error-class.md   | 5 +
 docs/sql-error-conditions-invalid-parameter-value-error-class.md   | 5 +
 ...sql-error-conditions-invalid-partition-operation-error-class.md | 5 +
 docs/sql-error-conditions-invalid-schema-error-class.md| 5 +
 docs/sql-error-conditions-invalid-sql-syntax-error-class.md| 5 +
 ...sql-error-conditions-invalid-subquery-expression-error-class.md | 5 +
 ...or-conditions-invalid-time-travel-timestamp-expr-error-class.md | 5 +
 .../sql-error-conditions-invalid-write-distribution-error-class.md | 5 +
 ...sql-error-conditions-malformed-record-in-parsing-error-class.md | 5 +
 docs/sql-error-conditions-missing-attributes-error-class.md| 5 +
 docs/sql-error-conditions-not-a-constant-string-error-class.md | 5 +
 docs/sql-error-conditions-not-allowed-in-from-error-class.md   | 5 +
 ...l-error-conditions-not-null-constraint-violation-error-class.md | 5 +
 ...l-error-conditions-not-supported-in-jdbc-catalog-error-class.md | 5 +
 

(spark) branch master updated: [SPARK-46718][BUILD] Upgrade Arrow to 15.0.0

2024-01-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8a7609f1cb2d [SPARK-46718][BUILD] Upgrade Arrow to 15.0.0
8a7609f1cb2d is described below

commit 8a7609f1cb2dd92ee30ec8172a1c1501d5810dae
Author: yangjie01 
AuthorDate: Mon Jan 22 21:25:21 2024 -0800

[SPARK-46718][BUILD] Upgrade Arrow to 15.0.0

### What changes were proposed in this pull request?
This PR aims to upgrade Arrow from 14.0.2 to 15.0.0. This version fixes the compatibility issue with Netty 4.1.104.Final (GH-39265).

Additionally, since the `arrow-vector` module now uses `eclipse-collections` in place of `netty-common` as a compile-level dependency, Apache Spark adds a dependency on `eclipse-collections` after upgrading to Arrow 15.0.0.

### Why are the changes needed?
The new version brings the following major changes:

Bug Fixes
- GH-34610 - [Java] Fix valueCount and field name when loading/transferring NullVector
- GH-38242 - [Java] Fix incorrect internal struct accounting for DenseUnionVector#getBufferSizeFor
- GH-38254 - [Java] Add reusable buffer getters to char/binary vectors
- GH-38366 - [Java] Fix Murmur hash on buffers less than 4 bytes
- GH-38387 - [Java] Fix JDK8 compilation issue with TestAllTypes
- GH-38614 - [Java] Add VarBinary and VarCharWriter helper methods to more writers
- GH-38725 - [Java] decompression in Lz4CompressionCodec.java does not set writer index

New Features and Improvements
- GH-38511 - [Java] Add getTransferPair(Field, BufferAllocator, CallBack) for StructVector and MapVector
- GH-14936 - [Java] Remove netty dependency from arrow-vector
- GH-38990 - [Java] Upgrade to flatc version 23.5.26
- GH-39265 - [Java] Make it run well with the netty newest version 4.1.104

The full release notes are as follows:

- https://arrow.apache.org/release/15.0.0.html

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44797 from LuciferYang/SPARK-46718.

Authored-by: yangjie01 
Signed-off-by: Dongjoon Hyun 
---
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 12 +++-
 pom.xml   |  2 +-
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 6220626069af..4ee0f5a41191 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -16,10 +16,10 @@ antlr4-runtime/4.13.1//antlr4-runtime-4.13.1.jar
 aopalliance-repackaged/2.6.1//aopalliance-repackaged-2.6.1.jar
 arpack/3.0.3//arpack-3.0.3.jar
 arpack_combined_all/0.1//arpack_combined_all-0.1.jar
-arrow-format/14.0.2//arrow-format-14.0.2.jar
-arrow-memory-core/14.0.2//arrow-memory-core-14.0.2.jar
-arrow-memory-netty/14.0.2//arrow-memory-netty-14.0.2.jar
-arrow-vector/14.0.2//arrow-vector-14.0.2.jar
+arrow-format/15.0.0//arrow-format-15.0.0.jar
+arrow-memory-core/15.0.0//arrow-memory-core-15.0.0.jar
+arrow-memory-netty/15.0.0//arrow-memory-netty-15.0.0.jar
+arrow-vector/15.0.0//arrow-vector-15.0.0.jar
 audience-annotations/0.12.0//audience-annotations-0.12.0.jar
 avro-ipc/1.11.3//avro-ipc-1.11.3.jar
 avro-mapred/1.11.3//avro-mapred-1.11.3.jar
@@ -63,7 +63,9 @@ derby/10.16.1.1//derby-10.16.1.1.jar
 derbyshared/10.16.1.1//derbyshared-10.16.1.1.jar
 derbytools/10.16.1.1//derbytools-10.16.1.1.jar
 
dropwizard-metrics-hadoop-metrics2-reporter/0.1.2//dropwizard-metrics-hadoop-metrics2-reporter-0.1.2.jar
-flatbuffers-java/1.12.0//flatbuffers-java-1.12.0.jar
+eclipse-collections-api/11.1.0//eclipse-collections-api-11.1.0.jar
+eclipse-collections/11.1.0//eclipse-collections-11.1.0.jar
+flatbuffers-java/23.5.26//flatbuffers-java-23.5.26.jar
 gcs-connector/hadoop3-2.2.18/shaded/gcs-connector-hadoop3-2.2.18-shaded.jar
 gmetric4j/1.0.10//gmetric4j-1.0.10.jar
 gson/2.2.4//gson-2.2.4.jar
diff --git a/pom.xml b/pom.xml
index e290273543c6..5f33dd7d8ebc 100644
--- a/pom.xml
+++ b/pom.xml
@@ -230,7 +230,7 @@
 If you are changing Arrow version specification, please check
 ./python/pyspark/sql/pandas/utils.py, and ./python/setup.py too.
 -->
-14.0.2
+15.0.0
 2.5.11
 
 





(spark) branch master updated: [SPARK-46805][BUILD] Upgrade `scalafmt` to 3.7.17

2024-01-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 34beb02827ff [SPARK-46805][BUILD] Upgrade `scalafmt` to 3.7.17
34beb02827ff is described below

commit 34beb02827ffe14e3ed0407bed3f434098340ce4
Author: panbingkun 
AuthorDate: Mon Jan 22 21:23:10 2024 -0800

[SPARK-46805][BUILD] Upgrade `scalafmt` to 3.7.17

### What changes were proposed in this pull request?
This PR aims to upgrade `scalafmt` from `3.7.13` to `3.7.17`.

### Why are the changes needed?
- Regular upgrade; the last upgrade occurred 5 months ago.

- The full release notes:
  https://github.com/scalameta/scalafmt/releases/tag/v3.7.17
  https://github.com/scalameta/scalafmt/releases/tag/v3.7.16
  https://github.com/scalameta/scalafmt/releases/tag/v3.7.15
  https://github.com/scalameta/scalafmt/releases/tag/v3.7.14

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Pass GA.
- Manually test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44845 from panbingkun/SPARK-46805.

Authored-by: panbingkun 
Signed-off-by: Dongjoon Hyun 
---
 dev/.scalafmt.conf | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dev/.scalafmt.conf b/dev/.scalafmt.conf
index 721dec289900..b3a43a03651a 100644
--- a/dev/.scalafmt.conf
+++ b/dev/.scalafmt.conf
@@ -32,4 +32,4 @@ fileOverride {
 runner.dialect = scala213
   }
 }
-version = 3.7.13
+version = 3.7.17





(spark) branch master updated: [SPARK-46806][PYTHON] Improve error message for spark.table when argument type is wrong

2024-01-22 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ae2d43f279d5 [SPARK-46806][PYTHON] Improve error message for 
spark.table when argument type is wrong
ae2d43f279d5 is described below

commit ae2d43f279d5d27b63db3356abaf7d64755f3f5c
Author: Hyukjin Kwon 
AuthorDate: Tue Jan 23 12:35:52 2024 +0900

[SPARK-46806][PYTHON] Improve error message for spark.table when argument 
type is wrong

### What changes were proposed in this pull request?

This PR improves the error message for `spark.table` when the argument type is wrong:

```python
spark.table(None)
```

**Before:**

```
Traceback (most recent call last):
  File "", line 1, in 
  File "/.../spark/python/pyspark/sql/session.py", line 1710, in table
return DataFrame(self._jsparkSession.table(tableName), self)
 
  File "/.../spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", 
line 1322, in __call__
  File "/.../spark/python/pyspark/errors/exceptions/captured.py", line 215, 
in deco
return f(*a, **kw)
   ^^^
  File "/.../spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 
326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o27.table.
: java.lang.NullPointerException: Cannot invoke "String.length()" because 
"s" is null
at org.antlr.v4.runtime.CharStreams.fromString(CharStreams.java:222)
at org.antlr.v4.runtime.CharStreams.fromString(CharStreams.java:212)
at 
org.apache.spark.sql.catalyst.parser.AbstractParser.parse(parsers.scala:58)
at 
org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:55)
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseMultipartIdentifier(AbstractSqlParser.scala:54)
at 
org.apache.spark.sql.DataFrameReader.table(DataFrameReader.scala:681)
at org.apache.spark.sql.SparkSession.table(SparkSession.scala:619)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at 
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
```

**After:**

```
Traceback (most recent call last):
  File "", line 1, in 
  File "/.../spark/python/pyspark/sql/session.py", line 1711, in table
raise PySparkTypeError(
pyspark.errors.exceptions.base.PySparkTypeError: [NOT_STR] Argument `tableName` should be a str, got NoneType.
```

### Why are the changes needed?

For better error messages to the end users.
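
For illustration, a minimal sketch of how the improved behavior surfaces in user code (assuming the `NOT_STR` error class used in the diff below; this snippet is not part of the patch):

```python
from pyspark.errors import PySparkTypeError
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

try:
    spark.table(None)  # wrong argument type: None instead of a table name
except PySparkTypeError as e:
    # The failure is now a clear PySparkTypeError raised on the Python side
    # instead of a Py4J-wrapped NullPointerException from the JVM.
    print(e.getErrorClass())
```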

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the user-facing error messages.

### How was this patch tested?

Unittest was added.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44846 from HyukjinKwon/SPARK-46806.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/connect/session.py  |  6 ++
 python/pyspark/sql/session.py  |  6 ++
 python/pyspark/sql/tests/test_dataframe.py | 10 ++
 3 files changed, 22 insertions(+)

diff --git a/python/pyspark/sql/connect/session.py 
b/python/pyspark/sql/connect/session.py
index 5cbcb4ab5c35..1c53e460c196 100644
--- a/python/pyspark/sql/connect/session.py
+++ b/python/pyspark/sql/connect/session.py
@@ -287,6 +287,12 @@ class SparkSession:
 active.__doc__ = PySparkSession.active.__doc__
 
 def table(self, tableName: str) -> DataFrame:
+if not isinstance(tableName, str):
+raise PySparkTypeError(
+error_class="NOT_STR",
+message_parameters={"arg_name": "tableName", "arg_type": 
type(tableName).__name__},
+)
+
 return self.read.table(tableName)
 
 table.__doc__ = PySparkSession.table.__doc__
diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py
index fef834b9f0a0..7d0d9dc113f2 100644
--- a/python/pyspark/sql/session.py
+++ b/python/pyspark/sql/session.py
@@ -1707,6 +1707,12 @@ class SparkSession(SparkConversionMixin):
 |  4|
 +---+
 """
+if not isinstance(tableName, str):
+raise PySparkTypeError(
+  

(spark) branch master updated (00a9b94d1827 -> 31ffb2d99900)

2024-01-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 00a9b94d1827 [SPARK-46804][DOCS][TESTS] Recover the generated documents
 add 31ffb2d99900 [SPARK-46800][CORE] Support 
`spark.deploy.spreadOutDrivers`

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/deploy/master/Master.scala| 66 ++
 .../org/apache/spark/internal/config/Deploy.scala  |  5 ++
 .../apache/spark/deploy/master/MasterSuite.scala   | 38 +
 docs/spark-standalone.md   | 10 
 4 files changed, 96 insertions(+), 23 deletions(-)





(spark) branch master updated: [SPARK-46804][DOCS][TESTS] Recover the generated documents

2024-01-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 00a9b94d1827 [SPARK-46804][DOCS][TESTS] Recover the generated documents
00a9b94d1827 is described below

commit 00a9b94d18279cc75259c46b67cbb3da0078327b
Author: Dongjoon Hyun 
AuthorDate: Mon Jan 22 17:57:05 2024 -0800

[SPARK-46804][DOCS][TESTS] Recover the generated documents

### What changes were proposed in this pull request?

This PR regenerated the documents with the following.
```
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "core/testOnly *SparkThrowableSuite -- -t \"Error classes match with document\""
```

### Why are the changes needed?

The following PR broke the CIs by manually fixing the generated docs instead of regenerating them.

- #44825

Currently, CI is broken like the following.
- https://github.com/apache/spark/actions/runs/7619269448/job/20752056653
- https://github.com/apache/spark/actions/runs/7619199659/job/20751858197

```
[info] - Error classes match with document *** FAILED *** (24 milliseconds)
[info]   "...lstates.html#class-0[A]-feature-not-support..." did not equal 
"...lstates.html#class-0[a]-feature-not-support..." The error class document is 
not up to date. Please regenerate it. (SparkThrowableSuite.scala:322)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually checked:
```
$ build/sbt "core/testOnly *.SparkThrowableSuite"
...
[info] SparkThrowableSuite:
[info] - No duplicate error classes (31 milliseconds)
[info] - Error classes are correctly formatted (47 milliseconds)
[info] - SQLSTATE is mandatory (2 milliseconds)
[info] - SQLSTATE invariants (26 milliseconds)
[info] - Message invariants (8 milliseconds)
[info] - Message format invariants (7 milliseconds)
[info] - Error classes match with document (65 milliseconds)
[info] - Round trip (28 milliseconds)
[info] - Error class names should contain only capital letters, numbers and 
underscores (7 milliseconds)
[info] - Check if error class is missing (15 milliseconds)
[info] - Check if message parameters match message format (4 milliseconds)
[info] - Error message is formatted (1 millisecond)
[info] - Error message does not do substitution on values (1 millisecond)
[info] - Try catching legacy SparkError (0 milliseconds)
[info] - Try catching SparkError with error class (1 millisecond)
[info] - Try catching internal SparkError (0 milliseconds)
[info] - Get message in the specified format (6 milliseconds)
[info] - overwrite error classes (61 milliseconds)
[info] - prohibit dots in error class names (23 milliseconds)
[info] Run completed in 1 second, 357 milliseconds.
[info] Total number of tests run: 19
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 19, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44843 from dongjoon-hyun/SPARK-46804.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 ...r-conditions-cannot-update-field-error-class.md |   4 +-
 ...ions-insufficient-table-property-error-class.md |   4 +-
 ...-internal-error-metadata-catalog-error-class.md |   4 +-
 ...-error-conditions-invalid-cursor-error-class.md |   4 +-
 ...-error-conditions-invalid-handle-error-class.md |   4 +-
 ...or-conditions-missing-attributes-error-class.md |   6 +-
 ...ns-not-supported-in-jdbc-catalog-error-class.md |   4 +-
 ...-conditions-unsupported-add-file-error-class.md |   4 +-
 ...itions-unsupported-default-value-error-class.md |   4 +-
 ...ditions-unsupported-deserializer-error-class.md |   4 +-
 ...r-conditions-unsupported-feature-error-class.md |   4 +-
 ...conditions-unsupported-save-mode-error-class.md |   4 +-
 ...ted-subquery-expression-category-error-class.md |   4 +-
 docs/sql-error-conditions.md   | 102 ++---
 14 files changed, 92 insertions(+), 64 deletions(-)

diff --git a/docs/sql-error-conditions-cannot-update-field-error-class.md 
b/docs/sql-error-conditions-cannot-update-field-error-class.md
index 3d7152e499c9..42f952a403be 100644
--- a/docs/sql-error-conditions-cannot-update-field-error-class.md
+++ b/docs/sql-error-conditions-cannot-update-field-error-class.md
@@ -19,7 +19,7 @@ license: |
   limitations under the License.
 ---
 
-[SQLSTATE: 
0A000](sql-error-conditions-sqlstates.html#class-0a-feature-not-supported)
+[SQLSTATE: 
0A000](sql-error-conditions-sqlstates.html#class-0A-feature-not-supported)
 
 Cannot update `` field `` type:
 
@@ -44,3 +44,5 @@ Update a struct by updating its fields.
 ## 

(spark) branch master updated: [SPARK-46802][PYTHON][TESTS] Clean up obsolete code in PySpark coverage script

2024-01-22 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 227611d70d52 [SPARK-46802][PYTHON][TESTS] Clean up obsolete code in 
PySpark coverage script
227611d70d52 is described below

commit 227611d70d5293bbb5d67b62af649e3bf36eaec6
Author: Hyukjin Kwon 
AuthorDate: Tue Jan 23 10:55:01 2024 +0900

[SPARK-46802][PYTHON][TESTS] Clean up obsolete code in PySpark coverage 
script

### What changes were proposed in this pull request?

This PR cleans up obsolete code in the PySpark coverage script.

### Why are the changes needed?

We used to use `coverage_daemon.py` to track coverage on the Python worker side (e.g., coverage within Python UDFs), added in https://github.com/apache/spark/pull/20204. However, it seems it no longer works; in fact, it stopped working multiple years ago. The approach of replacing the Python worker itself was a somewhat hacky workaround. We should just get rid of it first, and find a proper way later.

This should also deflake the scheduled jobs, and speed up the build.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested via:

```sh
./run-tests-with-coverage --python-executables=python3 --testname="pyspark.sql.functions.builtin"
```

```
...
Finished test(python3): pyspark.sql.tests.test_functions (87s)
Tests passed in 87 seconds
Combining collected coverage data under
...
Creating XML report file at python/coverage.xml
Wrote XML report to coverage.xml
Reporting the coverage data at 
/.../spark/python/test_coverage/coverage_data/coverage
NameStmts   Miss Branch BrPart  Cover
-
pyspark/__init__.py48  7 10  376%
pyspark/_globals.py16  3  4  275%
pyspark/accumulators.py   123 38 26  566%
pyspark/broadcast.py  121 79 40  333%
pyspark/conf.py99 33 50  564%
pyspark/context.py451216151 2651%
pyspark/errors/__init__.py  3  0  0  0   100%
pyspark/errors/error_classes.py 3  0  0  0   100%
pyspark/errors/exceptions/__init__.py   0  0  0  0   100%
pyspark/errors/exceptions/base.py  91 15 24  483%
pyspark/errors/exceptions/captured.py 168 81 57 1748%
pyspark/errors/utils.py34  8  6  270%
pyspark/files.py   34 15 12  357%
pyspark/find_spark_home.py 30 24 12  219%
pyspark/java_gateway.py   114 31 30 1269%
pyspark/join.py66 58 58  0 6%
pyspark/profiler.py   244182 92  322%
pyspark/rdd.py   1064741378  927%
pyspark/rddsampler.py  68 50 32  018%
pyspark/resource/__init__.py5  0  0  0   100%
pyspark/resource/information.py11  4  4  073%
pyspark/resource/profile.py   110 82 58  127%
pyspark/resource/requests.py  139 90 70  035%
pyspark/resultiterable.py  14  6  2  156%
pyspark/serializers.py349185 90 1343%
pyspark/shuffle.py397322180  113%
pyspark/sql/__init__.py14  0  0  0   100%
pyspark/sql/catalog.py203127 66  230%
pyspark/sql/column.py 268 78 64 1267%
pyspark/sql/conf.py40 16 10  358%
pyspark/sql/context.py170 95 58  247%
pyspark/sql/dataframe.py  900475459 4045%
pyspark/sql/functions/__init__.py   3  0  0  0   100%
pyspark/sql/functions/builtin.py 1741542   1126 2676%
pyspark/sql/functions/partitioning.py  41 19 18  359%
pyspark/sql/group.py   81 30 32  365%
pyspark/sql/observation.py 54 37 22  126%
pyspark/sql/pandas/__init__.py  

(spark) branch master updated: [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script

2024-01-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 52b62921cadb [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as 
a test failure in Python testing script
52b62921cadb is described below

commit 52b62921cadb05da5b1183f979edf7d608256f2e
Author: Hyukjin Kwon 
AuthorDate: Mon Jan 22 17:06:59 2024 -0800

[SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in 
Python testing script

### What changes were proposed in this pull request?

This PR proposes to avoid treating exit code 5 as a test failure in the Python testing script.

### Why are the changes needed?

```
...

Running PySpark tests

Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log
Will test against the following Python executables: ['python3.12']
Will test the following Python modules: ['pyspark-core', 
'pyspark-streaming', 'pyspark-errors']
python3.12 python_implementation is CPython
python3.12 version is: Python 3.12.1
Starting test(python3.12): pyspark.streaming.tests.test_context (temp 
output: 
/__w/spark/spark/python/target/8674ed86-36bd-47d1-863b-abb0405557f6/python3.12__pyspark.streaming.tests.test_context__umu69c3v.log)
Finished test(python3.12): pyspark.streaming.tests.test_context (12s)
Starting test(python3.12): pyspark.streaming.tests.test_dstream (temp 
output: 
/__w/spark/spark/python/target/847eb56b-3c5f-49ab-8a83-3326bb96bc5d/python3.12__pyspark.streaming.tests.test_dstream__rorhk0lc.log)
Finished test(python3.12): pyspark.streaming.tests.test_dstream (102s)
Starting test(python3.12): pyspark.streaming.tests.test_kinesis (temp 
output: 
/__w/spark/spark/python/target/78f23c83-c24d-4fa1-abbd-edb90f48dff1/python3.12__pyspark.streaming.tests.test_kinesis__q5l1pv0h.log)
test_kinesis_stream 
(pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream) 
... skipped "Skipping all Kinesis Python tests as environmental variable 
'ENABLE_KINESIS_TESTS' was not set."
test_kinesis_stream_api 
(pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream_api)
 ... skipped "Skipping all Kinesis Python tests as environmental variable 
'ENABLE_KINESIS_TESTS' was not set."

--
Ran 0 tests in 0.000s

NO TESTS RAN (skipped=2)

Had test failures in pyspark.streaming.tests.test_kinesis with python3.12; 
see logs.
Error:  running /__w/spark/spark/python/run-tests 
--modules=pyspark-core,pyspark-streaming,pyspark-errors --parallelism=1 
--python-executables=python3.12 ; received return code 255
Error: Process completed with exit code 19.
```

The scheduled job fails because of exit code 5 (see https://github.com/pytest-dev/pytest/issues/2393), which pytest returns when no tests were collected; this isn't a test failure.
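
For context, a small sketch of the exit-code semantics involved (assuming pytest >= 5.0, where the codes are exposed as `pytest.ExitCode`; this is illustrative, not part of the patch):

```python
import pytest

# Exit code 5 means "no tests were collected", which is exactly what the
# all-skipped Kinesis module above produces, so it should not fail the build.
assert int(pytest.ExitCode.NO_TESTS_COLLECTED) == 5


def is_real_failure(retcode: int) -> bool:
    # Mirrors the intent of the run-tests.py change below: any non-zero exit
    # code other than 5 is still treated as a test failure.
    return retcode not in (0, int(pytest.ExitCode.NO_TESTS_COLLECTED))
```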

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44841 from HyukjinKwon/SPARK-46801.

Authored-by: Hyukjin Kwon 
Signed-off-by: Dongjoon Hyun 
---
 python/run-tests.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/python/run-tests.py b/python/run-tests.py
index 97fbf9be320b..4cd3569efce3 100755
--- a/python/run-tests.py
+++ b/python/run-tests.py
@@ -147,8 +147,8 @@ def run_individual_python_test(target_dir, test_name, 
pyspark_python, keep_test_
 # this code is invoked from a thread other than the main thread.
 os._exit(1)
 duration = time.time() - start_time
-# Exit on the first failure.
-if retcode != 0:
+# Exit on the first failure but exclude the code 5 for no test ran, see 
SPARK-46801.
+if retcode != 0 and retcode != 5:
 try:
 with FAILURE_REPORTING_LOCK:
 with open(LOG_FILE, 'ab') as log_file:





(spark) branch branch-3.4 updated: [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script

2024-01-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 2621882da3ef [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as 
a test failure in Python testing script
2621882da3ef is described below

commit 2621882da3effe2c9e0b3aedbcb26942e165a09f
Author: Hyukjin Kwon 
AuthorDate: Mon Jan 22 17:06:59 2024 -0800

[SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in 
Python testing script

### What changes were proposed in this pull request?

This PR proposes to avoid treating exit code 5 as a test failure in the Python testing script.

### Why are the changes needed?

```
...

Running PySpark tests

Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log
Will test against the following Python executables: ['python3.12']
Will test the following Python modules: ['pyspark-core', 
'pyspark-streaming', 'pyspark-errors']
python3.12 python_implementation is CPython
python3.12 version is: Python 3.12.1
Starting test(python3.12): pyspark.streaming.tests.test_context (temp 
output: 
/__w/spark/spark/python/target/8674ed86-36bd-47d1-863b-abb0405557f6/python3.12__pyspark.streaming.tests.test_context__umu69c3v.log)
Finished test(python3.12): pyspark.streaming.tests.test_context (12s)
Starting test(python3.12): pyspark.streaming.tests.test_dstream (temp 
output: 
/__w/spark/spark/python/target/847eb56b-3c5f-49ab-8a83-3326bb96bc5d/python3.12__pyspark.streaming.tests.test_dstream__rorhk0lc.log)
Finished test(python3.12): pyspark.streaming.tests.test_dstream (102s)
Starting test(python3.12): pyspark.streaming.tests.test_kinesis (temp 
output: 
/__w/spark/spark/python/target/78f23c83-c24d-4fa1-abbd-edb90f48dff1/python3.12__pyspark.streaming.tests.test_kinesis__q5l1pv0h.log)
test_kinesis_stream 
(pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream) 
... skipped "Skipping all Kinesis Python tests as environmental variable 
'ENABLE_KINESIS_TESTS' was not set."
test_kinesis_stream_api 
(pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream_api)
 ... skipped "Skipping all Kinesis Python tests as environmental variable 
'ENABLE_KINESIS_TESTS' was not set."

--
Ran 0 tests in 0.000s

NO TESTS RAN (skipped=2)

Had test failures in pyspark.streaming.tests.test_kinesis with python3.12; 
see logs.
Error:  running /__w/spark/spark/python/run-tests 
--modules=pyspark-core,pyspark-streaming,pyspark-errors --parallelism=1 
--python-executables=python3.12 ; received return code 255
Error: Process completed with exit code 19.
```

The scheduled job fails because of exit code 5 (see https://github.com/pytest-dev/pytest/issues/2393), which pytest returns when no tests were collected; this isn't a test failure.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44841 from HyukjinKwon/SPARK-46801.

Authored-by: Hyukjin Kwon 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 52b62921cadb05da5b1183f979edf7d608256f2e)
Signed-off-by: Dongjoon Hyun 
---
 python/run-tests.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/python/run-tests.py b/python/run-tests.py
index 19e39c822cbb..b9031765d943 100755
--- a/python/run-tests.py
+++ b/python/run-tests.py
@@ -147,8 +147,8 @@ def run_individual_python_test(target_dir, test_name, 
pyspark_python, keep_test_
 # this code is invoked from a thread other than the main thread.
 os._exit(1)
 duration = time.time() - start_time
-# Exit on the first failure.
-if retcode != 0:
+# Exit on the first failure but exclude the code 5 for no test ran, see 
SPARK-46801.
+if retcode != 0 and retcode != 5:
 try:
 with FAILURE_REPORTING_LOCK:
 with open(LOG_FILE, 'ab') as log_file:





(spark) branch branch-3.5 updated: [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script

2024-01-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new a6869b25fb9a [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as 
a test failure in Python testing script
a6869b25fb9a is described below

commit a6869b25fb9a7ac0e7e5015d342435e5c1b5f044
Author: Hyukjin Kwon 
AuthorDate: Mon Jan 22 17:06:59 2024 -0800

[SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in 
Python testing script

### What changes were proposed in this pull request?

This PR proposes to avoid treating exit code 5 as a test failure in the Python testing script.

### Why are the changes needed?

```
...

Running PySpark tests

Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log
Will test against the following Python executables: ['python3.12']
Will test the following Python modules: ['pyspark-core', 
'pyspark-streaming', 'pyspark-errors']
python3.12 python_implementation is CPython
python3.12 version is: Python 3.12.1
Starting test(python3.12): pyspark.streaming.tests.test_context (temp 
output: 
/__w/spark/spark/python/target/8674ed86-36bd-47d1-863b-abb0405557f6/python3.12__pyspark.streaming.tests.test_context__umu69c3v.log)
Finished test(python3.12): pyspark.streaming.tests.test_context (12s)
Starting test(python3.12): pyspark.streaming.tests.test_dstream (temp 
output: 
/__w/spark/spark/python/target/847eb56b-3c5f-49ab-8a83-3326bb96bc5d/python3.12__pyspark.streaming.tests.test_dstream__rorhk0lc.log)
Finished test(python3.12): pyspark.streaming.tests.test_dstream (102s)
Starting test(python3.12): pyspark.streaming.tests.test_kinesis (temp 
output: 
/__w/spark/spark/python/target/78f23c83-c24d-4fa1-abbd-edb90f48dff1/python3.12__pyspark.streaming.tests.test_kinesis__q5l1pv0h.log)
test_kinesis_stream 
(pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream) 
... skipped "Skipping all Kinesis Python tests as environmental variable 
'ENABLE_KINESIS_TESTS' was not set."
test_kinesis_stream_api 
(pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream_api)
 ... skipped "Skipping all Kinesis Python tests as environmental variable 
'ENABLE_KINESIS_TESTS' was not set."

--
Ran 0 tests in 0.000s

NO TESTS RAN (skipped=2)

Had test failures in pyspark.streaming.tests.test_kinesis with python3.12; 
see logs.
Error:  running /__w/spark/spark/python/run-tests 
--modules=pyspark-core,pyspark-streaming,pyspark-errors --parallelism=1 
--python-executables=python3.12 ; received return code 255
Error: Process completed with exit code 19.
```

The scheduled job fails because of exit code 5 (see https://github.com/pytest-dev/pytest/issues/2393), which pytest returns when no tests were collected; this isn't a test failure.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44841 from HyukjinKwon/SPARK-46801.

Authored-by: Hyukjin Kwon 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 52b62921cadb05da5b1183f979edf7d608256f2e)
Signed-off-by: Dongjoon Hyun 
---
 python/run-tests.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/python/run-tests.py b/python/run-tests.py
index 19e39c822cbb..b9031765d943 100755
--- a/python/run-tests.py
+++ b/python/run-tests.py
@@ -147,8 +147,8 @@ def run_individual_python_test(target_dir, test_name, 
pyspark_python, keep_test_
 # this code is invoked from a thread other than the main thread.
 os._exit(1)
 duration = time.time() - start_time
-# Exit on the first failure.
-if retcode != 0:
+# Exit on the first failure but exclude the code 5 for no test ran, see 
SPARK-46801.
+if retcode != 0 and retcode != 5:
 try:
 with FAILURE_REPORTING_LOCK:
 with open(LOG_FILE, 'ab') as log_file:





(spark) branch master updated: [SPARK-46799][CORE][TESTS] Improve `MasterSuite` to use nanoTime-based appIDs and workerIDs

2024-01-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f9feddfbc9de [SPARK-46799][CORE][TESTS] Improve `MasterSuite` to use 
nanoTime-based appIDs and workerIDs
f9feddfbc9de is described below

commit f9feddfbc9de8e87f7a2e9d8abade7e687335b84
Author: Dongjoon Hyun 
AuthorDate: Mon Jan 22 16:34:26 2024 -0800

[SPARK-46799][CORE][TESTS] Improve `MasterSuite` to use nanoTime-based 
appIDs and workerIDs

### What changes were proposed in this pull request?

This PR aims to improve `MasterSuite` to use nanoTime-based appIDs and 
workerIDs.

### Why are the changes needed?

During testing, I hit a case where two workers had the same ID. This PR prevents duplicated IDs for apps and workers in `MasterSuite`.
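
As a language-agnostic illustration (the actual change is in the Scala `MasterSuite`, and the ID format below is hypothetical), deriving test IDs from a nanosecond clock makes collisions during rapid test setup far less likely:

```python
import time


def make_test_worker_id(host: str, port: int) -> str:
    # Hypothetical helper: second-resolution timestamps (or hard-coded IDs) can
    # collide when two workers are created in the same instant, while a
    # nanosecond timestamp effectively cannot.
    return f"worker-{time.time_ns()}-{host}-{port}"
```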

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

```
$ build/sbt "core/testOnly *.MasterSuite"
[info] MasterSuite:
[info] - can use a custom recovery mode factory (443 milliseconds)
[info] - SPARK-46664: master should recover quickly in case of zero workers 
and apps (38 milliseconds)
[info] - master correctly recover the application (41 milliseconds)
[info] - SPARK-46205: Recovery with Kryo Serializer (27 milliseconds)
[info] - SPARK-46216: Recovery without compression (19 milliseconds)
[info] - SPARK-46216: Recovery with compression (20 milliseconds)
[info] - SPARK-46258: Recovery with RocksDB (306 milliseconds)
[info] - master/worker web ui available (197 milliseconds)
[info] - master/worker web ui available with reverseProxy (30 seconds, 123 
milliseconds)
[info] - master/worker web ui available behind front-end reverseProxy (30 
seconds, 113 milliseconds)
[info] - basic scheduling - spread out (23 milliseconds)
[info] - basic scheduling - no spread out (14 milliseconds)
[info] - basic scheduling with more memory - spread out (10 milliseconds)
[info] - basic scheduling with more memory - no spread out (10 milliseconds)
[info] - scheduling with max cores - spread out (9 milliseconds)
[info] - scheduling with max cores - no spread out (9 milliseconds)
[info] - scheduling with cores per executor - spread out (9 milliseconds)
[info] - scheduling with cores per executor - no spread out (8 milliseconds)
[info] - scheduling with cores per executor AND max cores - spread out (8 
milliseconds)
[info] - scheduling with cores per executor AND max cores - no spread out 
(7 milliseconds)
[info] - scheduling with executor limit - spread out (8 milliseconds)
[info] - scheduling with executor limit - no spread out (7 milliseconds)
[info] - scheduling with executor limit AND max cores - spread out (8 
milliseconds)
[info] - scheduling with executor limit AND max cores - no spread out (9 
milliseconds)
[info] - scheduling with executor limit AND cores per executor - spread out 
(8 milliseconds)
[info] - scheduling with executor limit AND cores per executor - no spread 
out (13 milliseconds)
[info] - scheduling with executor limit AND cores per executor AND max 
cores - spread out (8 milliseconds)
[info] - scheduling with executor limit AND cores per executor AND max 
cores - no spread out (7 milliseconds)
[info] - scheduling for app with multiple resource profiles (44 
milliseconds)
[info] - scheduling for app with multiple resource profiles with max cores 
(37 milliseconds)
[info] - SPARK-45174: scheduling with max drivers (9 milliseconds)
[info] - SPARK-13604: Master should ask Worker kill unknown executors and 
drivers (15 milliseconds)
[info] - SPARK-20529: Master should reply the address received from worker 
(20 milliseconds)
[info] - SPARK-27510: Master should avoid dead loop while launching 
executor failed in Worker (34 milliseconds)
[info] - All workers on a host should be decommissioned (28 milliseconds)
[info] - No workers should be decommissioned with invalid host (25 
milliseconds)
[info] - Only worker on host should be decommissioned (19 milliseconds)
[info] - SPARK-19900: there should be a corresponding driver for the app 
after relaunching driver (2 seconds, 60 milliseconds)
[info] - assign/recycle resources to/from driver (33 milliseconds)
[info] - assign/recycle resources to/from executor (27 milliseconds)
[info] - resource description with multiple resource profiles (1 
millisecond)
[info] - SPARK-45753: Support driver id pattern (7 milliseconds)
[info] - SPARK-45753: Prevent invalid driver id patterns (6 milliseconds)
[info] - SPARK-45754: Support app id pattern (7 milliseconds)
[info] - SPARK-45754: Prevent invalid app id patterns (7 milliseconds)
[info] - SPARK-45785: Rotate app 

(spark) branch master updated: [SPARK-46797][CORE] Rename `spark.deploy.spreadOut` to `spark.deploy.spreadOutApps`

2024-01-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8d1212837538 [SPARK-46797][CORE] Rename `spark.deploy.spreadOut` to 
`spark.deploy.spreadOutApps`
8d1212837538 is described below

commit 8d121283753894d4969d8ff9e09bb487f76e82e1
Author: Dongjoon Hyun 
AuthorDate: Mon Jan 22 16:26:43 2024 -0800

[SPARK-46797][CORE] Rename `spark.deploy.spreadOut` to 
`spark.deploy.spreadOutApps`

### What changes were proposed in this pull request?

This PR aims to rename `spark.deploy.spreadOut` to 
`spark.deploy.spreadOutApps`.

### Why are the changes needed?

Although the Apache Spark documentation clearly says it's about `applications`, the old name still misleads users into forgetting that `Driver` JVMs are always spread out independently of this configuration.


https://github.com/apache/spark/blob/b80e8cb4552268b771fc099457b9186807081c4a/docs/spark-standalone.md?plain=1#L282-L285

### Does this PR introduce _any_ user-facing change?

No, the behavior is the same; it only shows warnings for usages of the old config name.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44838 from dongjoon-hyun/SPARK-46797.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 core/src/main/scala/org/apache/spark/internal/config/Deploy.scala | 3 ++-
 docs/spark-standalone.md  | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/internal/config/Deploy.scala 
b/core/src/main/scala/org/apache/spark/internal/config/Deploy.scala
index 6585d62b3b9c..31ac07621176 100644
--- a/core/src/main/scala/org/apache/spark/internal/config/Deploy.scala
+++ b/core/src/main/scala/org/apache/spark/internal/config/Deploy.scala
@@ -97,8 +97,9 @@ private[spark] object Deploy {
 .intConf
 .createWithDefault(10)
 
-  val SPREAD_OUT_APPS = ConfigBuilder("spark.deploy.spreadOut")
+  val SPREAD_OUT_APPS = ConfigBuilder("spark.deploy.spreadOutApps")
 .version("0.6.1")
+.withAlternative("spark.deploy.spreadOut")
 .booleanConf
 .createWithDefault(true)
 
diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md
index b9e3bb5d3f7f..6e454dff1bde 100644
--- a/docs/spark-standalone.md
+++ b/docs/spark-standalone.md
@@ -279,7 +279,7 @@ SPARK_MASTER_OPTS supports the following system properties:
   1.1.0
 
 
-  spark.deploy.spreadOut
+  spark.deploy.spreadOutApps
   true
   
 Whether the standalone cluster manager should spread applications out 
across nodes or try





(spark) branch master updated: [SPARK-46781][PYTHON][TESTS] Test custom data source and input partition (pyspark.sql.datasource)

2024-01-22 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b1b29d9eeb76 [SPARK-46781][PYTHON][TESTS] Test custom data source and 
input partition (pyspark.sql.datasource)
b1b29d9eeb76 is described below

commit b1b29d9eeb76951a0129529f2075046cde91937a
Author: Xinrong Meng 
AuthorDate: Tue Jan 23 09:14:31 2024 +0900

[SPARK-46781][PYTHON][TESTS] Test custom data source and input partition 
(pyspark.sql.datasource)

### What changes were proposed in this pull request?
Test custom data source and input partition (pyspark.sql.datasource)

### Why are the changes needed?
A subtask of [SPARK-46041](https://issues.apache.org/jira/browse/SPARK-46041) to improve test coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Test change only.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44808 from xinrong-meng/test_datasource.

Authored-by: Xinrong Meng 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/tests/test_datasources.py | 50 
 1 file changed, 50 insertions(+)

diff --git a/python/pyspark/sql/tests/test_datasources.py 
b/python/pyspark/sql/tests/test_datasources.py
index 8c16904544b2..ece4839d88a8 100644
--- a/python/pyspark/sql/tests/test_datasources.py
+++ b/python/pyspark/sql/tests/test_datasources.py
@@ -21,7 +21,9 @@ import uuid
 import os
 
 from pyspark.sql import Row
+from pyspark.sql.datasource import InputPartition, DataSource
 from pyspark.sql.types import IntegerType, StructField, StructType, LongType, 
StringType
+from pyspark.errors import PySparkNotImplementedError
 from pyspark.testing.sqlutils import ReusedSQLTestCase
 
 
@@ -283,6 +285,54 @@ class DataSourcesTestsMixin:
 url=f"{url};drop=true", dbtable=dbtable
 ).load().collect()
 
+def test_custom_data_source(self):
+class MyCustomDataSource(DataSource):
+pass
+
+custom_data_source = MyCustomDataSource(options={"path": 
"/path/to/custom/data"})
+
+with self.assertRaises(PySparkNotImplementedError) as pe:
+custom_data_source.schema()
+
+self.check_error(
+exception=pe.exception,
+error_class="NOT_IMPLEMENTED",
+message_parameters={"feature": "schema"},
+)
+
+with self.assertRaises(PySparkNotImplementedError) as pe:
+custom_data_source.reader(schema=None)
+
+self.check_error(
+exception=pe.exception,
+error_class="NOT_IMPLEMENTED",
+message_parameters={"feature": "reader"},
+)
+
+with self.assertRaises(PySparkNotImplementedError) as pe:
+custom_data_source.writer(schema=None, overwrite=False)
+
+self.check_error(
+exception=pe.exception,
+error_class="NOT_IMPLEMENTED",
+message_parameters={"feature": "writer"},
+)
+
+def test_input_partition(self):
+partition = InputPartition(1)
+expected_repr = "InputPartition(value=1)"
+actual_repr = repr(partition)
+self.assertEqual(expected_repr, actual_repr)
+
+class RangeInputPartition(InputPartition):
+def __init__(self, start, end):
+super().__init__((start, end))
+
+partition = RangeInputPartition(1, 3)
+expected_repr = "RangeInputPartition(value=(1, 3))"
+actual_repr = repr(partition)
+self.assertEqual(expected_repr, actual_repr)
+
 
 class DataSourcesTests(DataSourcesTestsMixin, ReusedSQLTestCase):
 pass





(spark) branch master updated: [MINOR][DOCS] Miscellaneous link and anchor fixes

2024-01-22 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0a68e6ef1c54 [MINOR][DOCS] Miscellaneous link and anchor fixes
0a68e6ef1c54 is described below

commit 0a68e6ef1c54f702a352ee6665f9a1f52accc419
Author: Nicholas Chammas 
AuthorDate: Tue Jan 23 09:12:34 2024 +0900

[MINOR][DOCS] Miscellaneous link and anchor fixes

### What changes were proposed in this pull request?

Fix a handful of links and link anchors.

In Safari at least, link anchors are case-sensitive.

### Why are the changes needed?

Minor documentation cleanup.

### Does this PR introduce _any_ user-facing change?

Yes, minor documentation tweaks.

### How was this patch tested?

No testing beyond building the docs successfully.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44824 from nchammas/minor-link-fixes.

Authored-by: Nicholas Chammas 
Signed-off-by: Hyukjin Kwon 
---
 docs/cloud-integration.md| 3 +--
 docs/ml-guide.md | 3 +--
 docs/mllib-evaluation-metrics.md | 2 +-
 docs/rdd-programming-guide.md| 4 ++--
 4 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/docs/cloud-integration.md b/docs/cloud-integration.md
index 52a7552fe8d4..7afbfef0b393 100644
--- a/docs/cloud-integration.md
+++ b/docs/cloud-integration.md
@@ -330,7 +330,7 @@ It is not available on Hadoop 3.3.4 or earlier.
 IBM provide the Stocator output committer for IBM Cloud Object Storage and 
OpenStack Swift.
 
 Source, documentation and releasea can be found at
-[https://github.com/CODAIT/stocator](Stocator - Storage Connector for Apache 
Spark).
+[Stocator - Storage Connector for Apache 
Spark](https://github.com/CODAIT/stocator).
 
 
 ## Cloud Committers and `INSERT OVERWRITE TABLE`
@@ -396,4 +396,3 @@ The Cloud Committer problem and hive-compatible solutions
 * [The Manifest Committer for Azure and Google Cloud 
Storage](https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/manifest_committer.md)
 * [A Zero-rename 
committer](https://github.com/steveloughran/zero-rename-committer/releases/).
 * [Stocator: A High Performance Object Store Connector for 
Spark](http://arxiv.org/abs/1709.01812)
-
diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index 572f61ef9735..132805e7bcd6 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -72,7 +72,7 @@ WARNING: Failed to load implementation 
from:dev.ludovic.netlib.blas.JNIBLAS
 To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 
1.4 or newer.
 
 [^1]: To learn more about the benefits and background of system optimised 
natives, you may wish to
-watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in 
Scala](http://fommil.github.io/scalax14/#/).
+watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in 
Scala](http://fommil.github.io/scalax14/).
 
 # Highlights in 3.0
 
@@ -103,4 +103,3 @@ release of Spark:
 # Migration Guide
 
 The migration guide is now archived [on this page](ml-migration-guide.html).
-
diff --git a/docs/mllib-evaluation-metrics.md b/docs/mllib-evaluation-metrics.md
index 30acc3dc634b..aa587b26dca6 100644
--- a/docs/mllib-evaluation-metrics.md
+++ b/docs/mllib-evaluation-metrics.md
@@ -460,7 +460,7 @@ $$rel_D(r) = \begin{cases}1 & \text{if $r \in D$}, \\ 0 & 
\text{otherwise}.\end{
 $p(k)=\frac{1}{M} \sum_{i=0}^{M-1} {\frac{1}{k} 
\sum_{j=0}^{\text{min}(Q_i, k) - 1} rel_{D_i}(R_i(j))}$
   
   
-<a href="https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Precision_at_K">Precision at k</a> is a measure of
+<a href="https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Precision_at_k">Precision at k</a> is a measure of
  how many of the first k recommended documents are in the set of true 
relevant documents averaged across all
  users. In this metric, the order of the recommendations is not taken 
into account.
   
diff --git a/docs/rdd-programming-guide.md b/docs/rdd-programming-guide.md
index b92b3da09c5c..2e0f9d3bd6ef 100644
--- a/docs/rdd-programming-guide.md
+++ b/docs/rdd-programming-guide.md
@@ -776,7 +776,7 @@ for other languages.
 
 
 
-### Understanding closures 
+### Understanding closures
 One of the harder things about Spark is understanding the scope and life cycle 
of variables and methods when executing code across a cluster. RDD operations 
that modify variables outside of their scope can be a frequent source of 
confusion. In the example below we'll look at code that uses `foreach()` to 
increment a counter, but similar issues can occur for other operations as well.
 
  Example
@@ -1120,7 

(spark) branch master updated: [MINOR][DOCS] Fix SQL Error links and link anchors

2024-01-22 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 3f3cbae8680e [MINOR][DOCS] Fix SQL Error links and link anchors
3f3cbae8680e is described below

commit 3f3cbae8680eb447e1a66d2358755aac446d1265
Author: Nicholas Chammas 
AuthorDate: Tue Jan 23 09:12:19 2024 +0900

[MINOR][DOCS] Fix SQL Error links and link anchors

### What changes were proposed in this pull request?

Link anchors are case sensitive (at least on Safari). Many of the links in 
the SQL error pages use the incorrect case, so the anchor doesn't jump to the 
correct heading. This PR fixes those anchors.

There is also a bad link in the menu, which this PR also fixes.

### Why are the changes needed?

Links should point to their intended target.

### Does this PR introduce _any_ user-facing change?

Yes, user-facing documentation.

### How was this patch tested?

I built the SQL docs with:

```sh
SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 bundle exec jekyll build
```

And I clicked around a bunch of the SQL error pages to confirm the link 
anchors now work correctly.

I also confirmed the menu link also works correctly now.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44825 from nchammas/sql-error-links.

Authored-by: Nicholas Chammas 
Signed-off-by: Hyukjin Kwon 
---
 docs/_data/menu-sql.yaml   |   2 +-
 ...r-conditions-cannot-update-field-error-class.md |   4 +-
 ...ions-insufficient-table-property-error-class.md |   4 +-
 ...-internal-error-metadata-catalog-error-class.md |   4 +-
 ...-error-conditions-invalid-cursor-error-class.md |   4 +-
 ...-error-conditions-invalid-handle-error-class.md |   4 +-
 ...or-conditions-missing-attributes-error-class.md |   6 +-
 ...ns-not-supported-in-jdbc-catalog-error-class.md |   4 +-
 ...-conditions-unsupported-add-file-error-class.md |   4 +-
 ...itions-unsupported-default-value-error-class.md |   4 +-
 ...ditions-unsupported-deserializer-error-class.md |   4 +-
 ...r-conditions-unsupported-feature-error-class.md |   4 +-
 ...conditions-unsupported-save-mode-error-class.md |   4 +-
 ...ted-subquery-expression-category-error-class.md |   4 +-
 docs/sql-error-conditions.md   | 102 ++---
 15 files changed, 65 insertions(+), 93 deletions(-)

diff --git a/docs/_data/menu-sql.yaml b/docs/_data/menu-sql.yaml
index 95833849cc59..0accc01d51c8 100644
--- a/docs/_data/menu-sql.yaml
+++ b/docs/_data/menu-sql.yaml
@@ -116,7 +116,7 @@
 - text: DATATYPE_MISMATCH error class
   url: sql-error-conditions-datatype-mismatch-error-class.html
 - text: INCOMPATIBLE_DATA_FOR_TABLE error class
-  url: sql-error-conditions-incompatible-data-to-table-error-class.html
+  url: sql-error-conditions-incompatible-data-for-table-error-class.html
 - text: INCOMPLETE_TYPE_DEFINITION error class
   url: sql-error-conditions-incomplete-type-definition-error-class.html
 - text: INCONSISTENT_BEHAVIOR_CROSS_VERSION error class
diff --git a/docs/sql-error-conditions-cannot-update-field-error-class.md 
b/docs/sql-error-conditions-cannot-update-field-error-class.md
index 42f952a403be..3d7152e499c9 100644
--- a/docs/sql-error-conditions-cannot-update-field-error-class.md
+++ b/docs/sql-error-conditions-cannot-update-field-error-class.md
@@ -19,7 +19,7 @@ license: |
   limitations under the License.
 ---
 
-[SQLSTATE: 
0A000](sql-error-conditions-sqlstates.html#class-0A-feature-not-supported)
+[SQLSTATE: 
0A000](sql-error-conditions-sqlstates.html#class-0a-feature-not-supported)
 
 Cannot update `` field `` type:
 
@@ -44,5 +44,3 @@ Update a struct by updating its fields.
 ## USER_DEFINED_TYPE
 
 Update a UserDefinedType[``] by updating its fields.
-
-
diff --git 
a/docs/sql-error-conditions-insufficient-table-property-error-class.md 
b/docs/sql-error-conditions-insufficient-table-property-error-class.md
index 94175ccaf4c8..8f60d0901bff 100644
--- a/docs/sql-error-conditions-insufficient-table-property-error-class.md
+++ b/docs/sql-error-conditions-insufficient-table-property-error-class.md
@@ -19,7 +19,7 @@ license: |
   limitations under the License.
 ---
 
-[SQLSTATE: XXKUC](sql-error-conditions-sqlstates.html#class-XX-internal-error)
+[SQLSTATE: XXKUC](sql-error-conditions-sqlstates.html#class-xx-internal-error)
 
 Can't find table property:
 
@@ -32,5 +32,3 @@ This error class has the following derived error classes:
 ## MISSING_KEY_PART
 
 ``, `` parts are expected.
-
-
diff --git 
a/docs/sql-error-conditions-internal-error-metadata-catalog-error-class.md 
b/docs/sql-error-conditions-internal-error-metadata-catalog-error-class.md
index e45116561281..ba410d65e4a9 

(spark) branch branch-3.4 updated: [SPARK-46779][SQL] `InMemoryRelation` instances of the same cached plan should be semantically equivalent

2024-01-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 97536c6673bb [SPARK-46779][SQL] `InMemoryRelation` instances of the 
same cached plan should be semantically equivalent
97536c6673bb is described below

commit 97536c6673bb08ba8741a6a6f697b6880ca629ce
Author: Bruce Robbins 
AuthorDate: Mon Jan 22 11:09:01 2024 -0800

[SPARK-46779][SQL] `InMemoryRelation` instances of the same cached plan 
should be semantically equivalent

When canonicalizing `output` in `InMemoryRelation`, use `output` itself as 
the schema for determining the ordinals, rather than `cachedPlan.output`.

`InMemoryRelation.output` and `InMemoryRelation.cachedPlan.output` don't 
necessarily use the same exprIds. E.g.:
```
+- InMemoryRelation [c1#340, c2#341], StorageLevel(disk, memory, 
deserialized, 1 replicas)
  +- LocalTableScan [c1#254, c2#255]

```
Because of this, `InMemoryRelation` will sometimes fail to fully 
canonicalize, resulting in cases where two semantically equivalent 
`InMemoryRelation` instances appear to be semantically nonequivalent.

Example:
```
create or replace temp view data(c1, c2) as values
(1, 2),
(1, 3),
(3, 7),
(4, 5);

cache table data;

select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) 
from data d2 group by all;
```
If plan change validation checking is on (i.e., 
`spark.sql.planChangeValidation=true`), the failure is:
```
[PLAN_VALIDATION_FAILED_RULE_EXECUTOR] The input plan of 
org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$2 is invalid: 
Aggregate: Aggregate [c1#78, scalar-subquery#77 [c1#78]], [c1#78, 
scalar-subquery#77 [c1#78] AS scalarsubquery(c1)#90L, count(c2#79) AS 
count(c2)#83L]
...
is not a valid aggregate expression: 
[SCALAR_SUBQUERY_IS_IN_GROUP_BY_OR_AGGREGATE_FUNCTION] The correlated scalar 
subquery '"scalarsubquery(c1)"' is neither present in GROUP BY, nor in an 
aggregate function.
```
If plan change validation checking is off, the failure is more mysterious:
```
[INTERNAL_ERROR] Couldn't find count(1)#163L in 
[c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find 
count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
```
If you remove the cache command, the query succeeds.

The above failures happen because the subquery in the aggregate expressions 
and the subquery in the grouping expressions seem semantically nonequivalent 
since the `InMemoryRelation` in one of the subquery plans failed to completely 
canonicalize.

In `CacheManager#useCachedData`, two lookups for the same cached plan may 
create `InMemoryRelation` instances that have different exprIds in `output`. 
That's because the plan fragments used as lookup keys  may have been 
deduplicated by `DeduplicateRelations`, and thus have different exprIds in 
their respective output schemas. When `CacheManager#useCachedData` creates an 
`InMemoryRelation` instance, it borrows the output schema of the plan fragment 
used as the lookup key.

The failure to fully canonicalize has other effects. For example, this 
query fails to reuse the exchange:
```
create or replace temp view data(c1, c2) as values
(1, 2),
(1, 3),
(2, 4),
(3, 7),
(7, 22);

cache table data;

set spark.sql.autoBroadcastJoinThreshold=-1;
set spark.sql.adaptive.enabled=false;

select *
from data l
join data r
on l.c1 = r.c1;
```

No.

New tests.

No.

Closes #44806 from bersprockets/plan_validation_issue.

Authored-by: Bruce Robbins 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit b80e8cb4552268b771fc099457b9186807081c4a)
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/execution/columnar/InMemoryRelation.scala   |  2 +-
 .../org/apache/spark/sql/DataFrameAggregateSuite.scala| 15 +++
 .../sql/execution/columnar/InMemoryRelationSuite.scala|  7 +++
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
index 4df9915dc96e..119e9e0a188f 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
@@ -391,7 +391,7 @@ case class InMemoryRelation(
   }
 
   override def doCanonicalize(): logical.LogicalPlan =
-copy(output = output.map(QueryPlan.normalizeExpressions(_, 
cachedPlan.output)),
+
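The change itself is a one-liner: the canonicalized `output` is normalized against `output` itself rather than `cachedPlan.output`. A quick way to observe the effect is the exchange-reuse reproduction from the description above; the sketch below assumes a `spark-shell` session (so `spark` is in scope) and is illustrative only, not part of the patch:

```scala
// Illustrative sketch only (not part of the patch). Assumes `spark` from spark-shell.
import org.apache.spark.sql.execution.exchange.ReusedExchangeExec

spark.sql(
  "create or replace temp view data(c1, c2) as values (1, 2), (1, 3), (2, 4), (3, 7), (7, 22)")
spark.sql("cache table data")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
spark.conf.set("spark.sql.adaptive.enabled", "false")

val joined = spark.sql("select * from data l join data r on l.c1 = r.c1")
// Before the fix the two cached sides could canonicalize differently and the
// shuffle exchange was not reused; with the fix a ReusedExchangeExec should appear.
val reused = joined.queryExecution.executedPlan.collect { case r: ReusedExchangeExec => r }
println(s"Reused exchanges found: ${reused.size}")
```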

(spark) branch branch-3.5 updated: [SPARK-46779][SQL] `InMemoryRelation` instances of the same cached plan should be semantically equivalent

2024-01-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 68d9f353300e [SPARK-46779][SQL] `InMemoryRelation` instances of the 
same cached plan should be semantically equivalent
68d9f353300e is described below

commit 68d9f353300ed7de0b47c26cb30236bada896d25
Author: Bruce Robbins 
AuthorDate: Mon Jan 22 11:09:01 2024 -0800

[SPARK-46779][SQL] `InMemoryRelation` instances of the same cached plan 
should be semantically equivalent

When canonicalizing `output` in `InMemoryRelation`, use `output` itself as 
the schema for determining the ordinals, rather than `cachedPlan.output`.

`InMemoryRelation.output` and `InMemoryRelation.cachedPlan.output` don't 
necessarily use the same exprIds. E.g.:
```
+- InMemoryRelation [c1#340, c2#341], StorageLevel(disk, memory, 
deserialized, 1 replicas)
  +- LocalTableScan [c1#254, c2#255]

```
Because of this, `InMemoryRelation` will sometimes fail to fully 
canonicalize, resulting in cases where two semantically equivalent 
`InMemoryRelation` instances appear to be semantically nonequivalent.

Example:
```
create or replace temp view data(c1, c2) as values
(1, 2),
(1, 3),
(3, 7),
(4, 5);

cache table data;

select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) 
from data d2 group by all;
```
If plan change validation checking is on (i.e., 
`spark.sql.planChangeValidation=true`), the failure is:
```
[PLAN_VALIDATION_FAILED_RULE_EXECUTOR] The input plan of 
org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$2 is invalid: 
Aggregate: Aggregate [c1#78, scalar-subquery#77 [c1#78]], [c1#78, 
scalar-subquery#77 [c1#78] AS scalarsubquery(c1)#90L, count(c2#79) AS 
count(c2)#83L]
...
is not a valid aggregate expression: 
[SCALAR_SUBQUERY_IS_IN_GROUP_BY_OR_AGGREGATE_FUNCTION] The correlated scalar 
subquery '"scalarsubquery(c1)"' is neither present in GROUP BY, nor in an 
aggregate function.
```
If plan change validation checking is off, the failure is more mysterious:
```
[INTERNAL_ERROR] Couldn't find count(1)#163L in 
[c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find 
count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
```
If you remove the cache command, the query succeeds.

The above failures happen because the subquery in the aggregate expressions 
and the subquery in the grouping expressions seem semantically nonequivalent 
since the `InMemoryRelation` in one of the subquery plans failed to completely 
canonicalize.

In `CacheManager#useCachedData`, two lookups for the same cached plan may 
create `InMemoryRelation` instances that have different exprIds in `output`. 
That's because the plan fragments used as lookup keys  may have been 
deduplicated by `DeduplicateRelations`, and thus have different exprIds in 
their respective output schemas. When `CacheManager#useCachedData` creates an 
`InMemoryRelation` instance, it borrows the output schema of the plan fragment 
used as the lookup key.

The failure to fully canonicalize has other effects. For example, this 
query fails to reuse the exchange:
```
create or replace temp view data(c1, c2) as values
(1, 2),
(1, 3),
(2, 4),
(3, 7),
(7, 22);

cache table data;

set spark.sql.autoBroadcastJoinThreshold=-1;
set spark.sql.adaptive.enabled=false;

select *
from data l
join data r
on l.c1 = r.c1;
```

No.

New tests.

No.

Closes #44806 from bersprockets/plan_validation_issue.

Authored-by: Bruce Robbins 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit b80e8cb4552268b771fc099457b9186807081c4a)
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/execution/columnar/InMemoryRelation.scala   |  2 +-
 .../org/apache/spark/sql/DataFrameAggregateSuite.scala| 15 +++
 .../sql/execution/columnar/InMemoryRelationSuite.scala|  7 +++
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
index 65f7835b42cf..5bab8e53eb16 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
@@ -405,7 +405,7 @@ case class InMemoryRelation(
   }
 
   override def doCanonicalize(): logical.LogicalPlan =
-copy(output = output.map(QueryPlan.normalizeExpressions(_, 
cachedPlan.output)),
+

(spark) branch master updated (77062021c5d0 -> b80e8cb45522)

2024-01-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 77062021c5d0 [SPARK-46758][INFRA] Upgrade github cache action to v4
 add b80e8cb45522 [SPARK-46779][SQL] `InMemoryRelation` instances of the 
same cached plan should be semantically equivalent

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/execution/columnar/InMemoryRelation.scala   |  2 +-
 .../org/apache/spark/sql/DataFrameAggregateSuite.scala| 15 +++
 .../sql/execution/columnar/InMemoryRelationSuite.scala|  7 +++
 3 files changed, 23 insertions(+), 1 deletion(-)





(spark) branch master updated: [SPARK-46758][INFRA] Upgrade github cache action to v4

2024-01-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 77062021c5d0 [SPARK-46758][INFRA] Upgrade github cache action to v4
77062021c5d0 is described below

commit 77062021c5d0c7b445730d47d82cc60910468976
Author: panbingkun 
AuthorDate: Mon Jan 22 10:34:15 2024 -0800

[SPARK-46758][INFRA] Upgrade github cache action to v4

### What changes were proposed in this pull request?
The PR aims to upgrade the `github cache action` to v4.

### Why are the changes needed?
- V4 release notes: https://github.com/actions/cache/releases/tag/v4.0.0
- Version 4 of this action updated the [runtime to Node.js 
20](https://docs.github.com/en/actions/creating-actions/metadata-syntax-for-github-actions#runs-for-javascript-actions),
 moving the action from `node16` to `node20`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44782 from panbingkun/SPARK-46758.

Authored-by: panbingkun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/benchmark.yml| 12 ++--
 .github/workflows/build_and_test.yml   | 36 +-
 .github/workflows/maven_test.yml   |  4 ++--
 .github/workflows/publish_snapshot.yml |  2 +-
 4 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/.github/workflows/benchmark.yml b/.github/workflows/benchmark.yml
index 257c80022110..2d7defa6db2d 100644
--- a/.github/workflows/benchmark.yml
+++ b/.github/workflows/benchmark.yml
@@ -70,7 +70,7 @@ jobs:
 with:
   fetch-depth: 0
   - name: Cache Scala, SBT and Maven
-uses: actions/cache@v3
+uses: actions/cache@v4
 with:
   path: |
 build/apache-maven-*
@@ -81,7 +81,7 @@ jobs:
   restore-keys: |
 build-
   - name: Cache Coursier local repository
-uses: actions/cache@v3
+uses: actions/cache@v4
 with:
   path: ~/.cache/coursier
   key: benchmark-coursier-${{ github.event.inputs.jdk }}-${{ 
hashFiles('**/pom.xml', '**/plugins.sbt') }}
@@ -89,7 +89,7 @@ jobs:
 benchmark-coursier-${{ github.event.inputs.jdk }}
   - name: Cache TPC-DS generated data
 id: cache-tpcds-sf-1
-uses: actions/cache@v3
+uses: actions/cache@v4
 with:
   path: ./tpcds-sf-1
   key: tpcds-${{ hashFiles('.github/workflows/benchmark.yml', 
'sql/core/src/test/scala/org/apache/spark/sql/TPCDSSchema.scala') }}
@@ -139,7 +139,7 @@ jobs:
   with:
 fetch-depth: 0
 - name: Cache Scala, SBT and Maven
-  uses: actions/cache@v3
+  uses: actions/cache@v4
   with:
 path: |
   build/apache-maven-*
@@ -150,7 +150,7 @@ jobs:
 restore-keys: |
   build-
 - name: Cache Coursier local repository
-  uses: actions/cache@v3
+  uses: actions/cache@v4
   with:
 path: ~/.cache/coursier
 key: benchmark-coursier-${{ github.event.inputs.jdk }}-${{ 
hashFiles('**/pom.xml', '**/plugins.sbt') }}
@@ -164,7 +164,7 @@ jobs:
 - name: Cache TPC-DS generated data
   if: contains(github.event.inputs.class, 'TPCDSQueryBenchmark') || 
contains(github.event.inputs.class, '*')
   id: cache-tpcds-sf-1
-  uses: actions/cache@v3
+  uses: actions/cache@v4
   with:
 path: ./tpcds-sf-1
 key: tpcds-${{ hashFiles('.github/workflows/benchmark.yml', 
'sql/core/src/test/scala/org/apache/spark/sql/TPCDSSchema.scala') }}
diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 69636629ca9d..4038f63fb0dc 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -214,7 +214,7 @@ jobs:
 git -c user.name='Apache Spark Test Account' -c 
user.email='sparktest...@gmail.com' commit -m "Merged commit" --allow-empty
 # Cache local repositories. Note that GitHub Actions cache has a 2G limit.
 - name: Cache Scala, SBT and Maven
-  uses: actions/cache@v3
+  uses: actions/cache@v4
   with:
 path: |
   build/apache-maven-*
@@ -225,7 +225,7 @@ jobs:
 restore-keys: |
   build-
 - name: Cache Coursier local repository
-  uses: actions/cache@v3
+  uses: actions/cache@v4
   with:
 path: ~/.cache/coursier
 key: ${{ matrix.java }}-${{ matrix.hadoop }}-coursier-${{ 
hashFiles('**/pom.xml', '**/plugins.sbt') }}
@@ -397,7 +397,7 @@ jobs:
 git -c user.name='Apache Spark Test Account' -c 
user.email='sparktest...@gmail.com' commit -m "Merged commit" --allow-empty
 # Cache local repositories. Note that GitHub Actions cache has a 2G 

(spark) branch master updated: [SPARK-46791][SQL] Support Java Set in JavaTypeInference

2024-01-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 667c0a9dbbe0 [SPARK-46791][SQL] Support Java Set in JavaTypeInference
667c0a9dbbe0 is described below

commit 667c0a9dbbe045c73842a345c1b3897b155564d4
Author: Liang-Chi Hsieh 
AuthorDate: Mon Jan 22 02:13:12 2024 -0800

[SPARK-46791][SQL] Support Java Set in JavaTypeInference

### What changes were proposed in this pull request?

This patch adds support for Java `Set` as a bean field in 
`JavaTypeInference`.

### Why are the changes needed?

Scala `Set` (`scala.collection.Set`) is supported in `ScalaReflection`, so 
users can encode a Scala `Set` in a Dataset. But Java `Set` is not supported in 
the bean encoder (i.e., `JavaTypeInference`). This inconsistency means Java 
users cannot use `Set` the way Scala users can.

### Does this PR introduce _any_ user-facing change?

Yes. A Java `Set` can now be part of a Java bean when encoding with the 
bean encoder.

### How was this patch tested?

Added tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44828 from viirya/java_set.

Authored-by: Liang-Chi Hsieh 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/catalyst/JavaTypeInference.scala |  6 ++-
 .../sql/catalyst/expressions/objects/objects.scala | 50 ++
 .../sql/catalyst/JavaTypeInferenceSuite.scala  | 26 +--
 .../expressions/ObjectExpressionsSuite.scala   |  5 ++-
 .../org/apache/spark/sql/JavaDatasetSuite.java | 45 +++
 .../scala/org/apache/spark/sql/DatasetSuite.scala  |  9 
 6 files changed, 136 insertions(+), 5 deletions(-)

diff --git 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
index a945cb720b01..f85e96da2be1 100644
--- 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
+++ 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
@@ -18,7 +18,7 @@ package org.apache.spark.sql.catalyst
 
 import java.beans.{Introspector, PropertyDescriptor}
 import java.lang.reflect.{ParameterizedType, Type, TypeVariable}
-import java.util.{List => JList, Map => JMap}
+import java.util.{List => JList, Map => JMap, Set => JSet}
 import javax.annotation.Nonnull
 
 import scala.jdk.CollectionConverters._
@@ -112,6 +112,10 @@ object JavaTypeInference {
   val element = encoderFor(c.getTypeParameters.array(0), seenTypeSet, 
typeVariables)
   IterableEncoder(ClassTag(c), element, element.nullable, 
lenientSerialization = false)
 
+case c: Class[_] if classOf[JSet[_]].isAssignableFrom(c) =>
+  val element = encoderFor(c.getTypeParameters.array(0), seenTypeSet, 
typeVariables)
+  IterableEncoder(ClassTag(c), element, element.nullable, 
lenientSerialization = false)
+
 case c: Class[_] if classOf[JMap[_, _]].isAssignableFrom(c) =>
   val keyEncoder = encoderFor(c.getTypeParameters.array(0), seenTypeSet, 
typeVariables)
   val valueEncoder = encoderFor(c.getTypeParameters.array(1), seenTypeSet, 
typeVariables)
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
index bae2922cf921..a684ca18435e 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
@@ -907,6 +907,8 @@ case class MapObjects private(
   _.asInstanceOf[Array[_]].toImmutableArraySeq
 case ObjectType(cls) if classOf[java.util.List[_]].isAssignableFrom(cls) =>
   _.asInstanceOf[java.util.List[_]].asScala.toSeq
+case ObjectType(cls) if classOf[java.util.Set[_]].isAssignableFrom(cls) =>
+  _.asInstanceOf[java.util.Set[_]].asScala.toSeq
 case ObjectType(cls) if cls == classOf[Object] =>
   (inputCollection) => {
 if (inputCollection.getClass.isArray) {
@@ -982,6 +984,34 @@ case class MapObjects private(
   builder
 }
   }
+case Some(cls) if classOf[java.util.Set[_]].isAssignableFrom(cls) =>
+  // Java set
+  if (cls == classOf[java.util.Set[_]] || cls == 
classOf[java.util.AbstractSet[_]]) {
+// Specifying non concrete implementations of `java.util.Set`
+executeFuncOnCollection(_).toSet.asJava
+  } else {
+val constructors = cls.getConstructors()
+val intParamConstructor = constructors.find { constructor =>
+  constructor.getParameterCount == 1 && 
constructor.getParameterTypes()(0) == classOf[Int]
+}
+val 
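From the user side, what this enables is encoding a bean whose field is a `java.util.Set` with `Encoders.bean`. The sketch below is illustrative only: the bean class, field names and session setup are hypothetical (they are not taken from the patch's test suites), and a Scala class with `@BeanProperty` is assumed to stand in for a Java bean.

```scala
import java.util.{List => JList, Set => JSet}
import scala.beans.BeanProperty

import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical bean; the class name and fields are illustrative, not from the patch.
// Assumption: a Scala class with @BeanProperty getters/setters is introspected
// like a regular Java bean.
class TagsBean extends Serializable {
  @BeanProperty var id: Int = 0
  @BeanProperty var tags: JSet[String] = _
}

object JavaSetBeanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("java-set-bean").getOrCreate()

    val bean = new TagsBean
    bean.setId(1)
    bean.setTags(JSet.of("a", "b"))

    // With SPARK-46791, the bean encoder can infer an encoder for the Set-typed field.
    val ds = spark.createDataset(JList.of(bean))(Encoders.bean(classOf[TagsBean]))
    ds.show()

    spark.stop()
  }
}
```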

(spark) branch master updated: [SPARK-46777][SS] Refactor `StreamingDataSourceV2Relation` catalyst structure to be more on-par with the batch version

2024-01-22 Thread kabhwan
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 02533d71806e [SPARK-46777][SS] Refactor 
`StreamingDataSourceV2Relation` catalyst structure to be more on-par with the 
batch version
02533d71806e is described below

commit 02533d71806ec0be97ec793d680189093c9a0ecb
Author: jackierwzhang 
AuthorDate: Mon Jan 22 18:58:55 2024 +0900

[SPARK-46777][SS] Refactor `StreamingDataSourceV2Relation` catalyst 
structure to be more on-par with the batch version

### What changes were proposed in this pull request?
This PR refactors `StreamingDataSourceV2Relation` into 
`StreamingDataSourceV2Relation` and `StreamingDataSourceV2ScanRelation` to 
achieve better parity with the batch version. This prepares the codebase so that 
certain V2 optimization rules (e.g. `V2ScanRelationPushDown`) can later be 
extended to streaming.

### Why are the changes needed?
As described above, we would like to start reusing certain V2 batch 
optimization rules for streaming relations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
This is a pure refactoring, existing tests should be sufficient.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44818 from jackierwzhang/spark-46777.

Authored-by: jackierwzhang 
Signed-off-by: Jungtaek Lim 
---
 .../sql/kafka010/KafkaMicroBatchSourceSuite.scala  |  7 +-
 .../catalyst/streaming/StreamingRelationV2.scala   |  4 +-
 .../datasources/v2/DataSourceV2Relation.scala  | 83 --
 .../datasources/v2/DataSourceV2Strategy.scala  |  4 +-
 .../execution/streaming/MicroBatchExecution.scala  | 12 ++--
 .../sql/execution/streaming/ProgressReporter.scala |  4 +-
 .../streaming/continuous/ContinuousExecution.scala | 14 ++--
 .../sources/RateStreamProviderSuite.scala  |  4 +-
 .../streaming/sources/TextSocketStreamSuite.scala  |  4 +-
 .../apache/spark/sql/streaming/StreamSuite.scala   |  8 +--
 .../apache/spark/sql/streaming/StreamTest.scala|  4 +-
 .../sql/streaming/StreamingQueryManagerSuite.scala |  4 +-
 .../streaming/test/DataStreamTableAPISuite.scala   |  2 +-
 13 files changed, 99 insertions(+), 55 deletions(-)

diff --git 
a/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala
 
b/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala
index cee0d9a3dd72..fb5e71a1e7b8 100644
--- 
a/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala
+++ 
b/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala
@@ -40,7 +40,7 @@ import org.apache.spark.TestUtils
 import org.apache.spark.sql.{Dataset, ForeachWriter, Row, SparkSession}
 import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
 import org.apache.spark.sql.connector.read.streaming.SparkDataStream
-import 
org.apache.spark.sql.execution.datasources.v2.StreamingDataSourceV2Relation
+import 
org.apache.spark.sql.execution.datasources.v2.StreamingDataSourceV2ScanRelation
 import org.apache.spark.sql.execution.exchange.ReusedExchangeExec
 import org.apache.spark.sql.execution.streaming._
 import 
org.apache.spark.sql.execution.streaming.AsyncProgressTrackingMicroBatchExecution.{ASYNC_PROGRESS_TRACKING_CHECKPOINTING_INTERVAL_MS,
 ASYNC_PROGRESS_TRACKING_ENABLED}
@@ -125,7 +125,8 @@ abstract class KafkaSourceTest extends StreamTest with 
SharedSparkSession with K
   val sources: Seq[SparkDataStream] = {
 query.get.logicalPlan.collect {
   case StreamingExecutionRelation(source: KafkaSource, _, _) => source
-  case r: StreamingDataSourceV2Relation if 
r.stream.isInstanceOf[KafkaMicroBatchStream] ||
+  case r: StreamingDataSourceV2ScanRelation
+if r.stream.isInstanceOf[KafkaMicroBatchStream] ||
   r.stream.isInstanceOf[KafkaContinuousStream] =>
 r.stream
 }
@@ -1654,7 +1655,7 @@ class KafkaMicroBatchV2SourceSuite extends 
KafkaMicroBatchSourceSuiteBase {
   makeSureGetOffsetCalled,
   AssertOnQuery { query =>
 query.logicalPlan.exists {
-  case r: StreamingDataSourceV2Relation => 
r.stream.isInstanceOf[KafkaMicroBatchStream]
+  case r: StreamingDataSourceV2ScanRelation => 
r.stream.isInstanceOf[KafkaMicroBatchStream]
   case _ => false
 }
   }
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/streaming/StreamingRelationV2.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/streaming/StreamingRelationV2.scala
index ab0352b606e5..c1d7daa6cfcf 100644
--- 

(spark) branch branch-3.5 updated: [SPARK-46789][K8S][TESTS] Add `VolumeSuite` to K8s IT

2024-01-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 04d32493fde7 [SPARK-46789][K8S][TESTS] Add `VolumeSuite` to K8s IT
04d32493fde7 is described below

commit 04d32493fde779021871c88709dbbae32f18e512
Author: Dongjoon Hyun 
AuthorDate: Mon Jan 22 16:19:39 2024 +0800

[SPARK-46789][K8S][TESTS] Add `VolumeSuite` to K8s IT

### What changes were proposed in this pull request?

This PR aims to add `VolumeSuite` to K8s IT.

### Why are the changes needed?

To improve the test coverage on various K8s volume use cases.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44827 from dongjoon-hyun/SPARK-46789.

Authored-by: Dongjoon Hyun 
Signed-off-by: Kent Yao 
---
 .../k8s/integrationtest/KubernetesSuite.scala  |   4 +-
 .../deploy/k8s/integrationtest/VolumeSuite.scala   | 173 +
 2 files changed, 175 insertions(+), 2 deletions(-)

diff --git 
a/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala
 
b/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala
index f52af87a745c..54ef1f6cee30 100644
--- 
a/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala
+++ 
b/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala
@@ -45,8 +45,8 @@ import org.apache.spark.internal.config._
 class KubernetesSuite extends SparkFunSuite
   with BeforeAndAfterAll with BeforeAndAfter with BasicTestsSuite with 
SparkConfPropagateSuite
   with SecretsTestsSuite with PythonTestsSuite with ClientModeTestsSuite with 
PodTemplateSuite
-  with PVTestsSuite with DepsTestsSuite with DecommissionSuite with 
RTestsSuite with Logging
-  with Eventually with Matchers {
+  with VolumeSuite with PVTestsSuite with DepsTestsSuite with 
DecommissionSuite with RTestsSuite
+  with Logging with Eventually with Matchers {
 
 
   import KubernetesSuite._
diff --git 
a/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/VolumeSuite.scala
 
b/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/VolumeSuite.scala
new file mode 100644
index ..c57e4b4578d6
--- /dev/null
+++ 
b/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/VolumeSuite.scala
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.deploy.k8s.integrationtest
+
+import scala.jdk.CollectionConverters._
+
+import io.fabric8.kubernetes.api.model._
+import org.scalatest.concurrent.PatienceConfiguration
+import org.scalatest.time.{Seconds, Span}
+
+import org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite._
+import 
org.apache.spark.deploy.k8s.integrationtest.backend.minikube.MinikubeTestBackend
+
+private[spark] trait VolumeSuite { k8sSuite: KubernetesSuite =>
+  val IGNORE = Some((Some(PatienceConfiguration.Interval(Span(0, Seconds))), 
None))
+
+  private def checkDisk(pod: Pod, path: String, expected: String) = {
+eventually(PatienceConfiguration.Timeout(Span(10, Seconds)), INTERVAL) {
+  implicit val podName: String = pod.getMetadata.getName
+  implicit val components: KubernetesTestComponents = 
kubernetesTestComponents
+  assert(Utils.executeCommand("df", path).contains(expected))
+}
+  }
+
+  test("A driver-only Spark job with a tmpfs-backed localDir volume", 
k8sTestTag) {
+sparkAppConf
+  .set("spark.kubernetes.driver.master", "local[10]")
+  .set("spark.kubernetes.local.dirs.tmpfs", "true")
+

(spark) branch master updated: [SPARK-46789][K8S][TESTS] Add `VolumeSuite` to K8s IT

2024-01-22 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b07bdea3616f [SPARK-46789][K8S][TESTS] Add `VolumeSuite` to K8s IT
b07bdea3616f is described below

commit b07bdea3616fc582a1242d3b47b465cd406c13c4
Author: Dongjoon Hyun 
AuthorDate: Mon Jan 22 16:19:39 2024 +0800

[SPARK-46789][K8S][TESTS] Add `VolumeSuite` to K8s IT

### What changes were proposed in this pull request?

This PR aims to add `VolumeSuite` to K8s IT.

### Why are the changes needed?

To improve the test coverage on various K8s volume use cases.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44827 from dongjoon-hyun/SPARK-46789.

Authored-by: Dongjoon Hyun 
Signed-off-by: Kent Yao 
---
 .../k8s/integrationtest/KubernetesSuite.scala  |   4 +-
 .../deploy/k8s/integrationtest/VolumeSuite.scala   | 173 +
 2 files changed, 175 insertions(+), 2 deletions(-)

diff --git 
a/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala
 
b/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala
index 2039b59b0ab5..868461fd5b9e 100644
--- 
a/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala
+++ 
b/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala
@@ -45,8 +45,8 @@ import org.apache.spark.internal.config._
 class KubernetesSuite extends SparkFunSuite
   with BeforeAndAfterAll with BeforeAndAfter with BasicTestsSuite with 
SparkConfPropagateSuite
   with SecretsTestsSuite with PythonTestsSuite with ClientModeTestsSuite with 
PodTemplateSuite
-  with PVTestsSuite with DepsTestsSuite with DecommissionSuite with 
RTestsSuite with Logging
-  with Eventually with Matchers {
+  with VolumeSuite with PVTestsSuite with DepsTestsSuite with 
DecommissionSuite with RTestsSuite
+  with Logging with Eventually with Matchers {
 
 
   import KubernetesSuite._
diff --git 
a/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/VolumeSuite.scala
 
b/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/VolumeSuite.scala
new file mode 100644
index ..c57e4b4578d6
--- /dev/null
+++ 
b/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/VolumeSuite.scala
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.deploy.k8s.integrationtest
+
+import scala.jdk.CollectionConverters._
+
+import io.fabric8.kubernetes.api.model._
+import org.scalatest.concurrent.PatienceConfiguration
+import org.scalatest.time.{Seconds, Span}
+
+import org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite._
+import 
org.apache.spark.deploy.k8s.integrationtest.backend.minikube.MinikubeTestBackend
+
+private[spark] trait VolumeSuite { k8sSuite: KubernetesSuite =>
+  val IGNORE = Some((Some(PatienceConfiguration.Interval(Span(0, Seconds))), 
None))
+
+  private def checkDisk(pod: Pod, path: String, expected: String) = {
+eventually(PatienceConfiguration.Timeout(Span(10, Seconds)), INTERVAL) {
+  implicit val podName: String = pod.getMetadata.getName
+  implicit val components: KubernetesTestComponents = 
kubernetesTestComponents
+  assert(Utils.executeCommand("df", path).contains(expected))
+}
+  }
+
+  test("A driver-only Spark job with a tmpfs-backed localDir volume", 
k8sTestTag) {
+sparkAppConf
+  .set("spark.kubernetes.driver.master", "local[10]")
+  .set("spark.kubernetes.local.dirs.tmpfs", "true")
+runSparkApplicationAndVerifyCompletion(
+  

(spark) branch master updated: [SPARK-46673][PYTHON][DOCS] Refine docstring `aes_encrypt/aes_decrypt/try_aes_decrypt`

2024-01-22 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9556da8834b0 [SPARK-46673][PYTHON][DOCS] Refine docstring 
`aes_encrypt/aes_decrypt/try_aes_decrypt`
9556da8834b0 is described below

commit 9556da8834b0b6ef6d4237a46a62cadd839c88e7
Author: panbingkun 
AuthorDate: Mon Jan 22 11:18:40 2024 +0300

[SPARK-46673][PYTHON][DOCS] Refine docstring 
`aes_encrypt/aes_decrypt/try_aes_decrypt`

### What changes were proposed in this pull request?
The PR aims to refine the docstrings of 
`aes_encrypt/aes_decrypt/try_aes_decrypt`.

### Why are the changes needed?
To improve PySpark documentation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Pass GA.
- Manually test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44750 from panbingkun/SPARK-46673.

Authored-by: panbingkun 
Signed-off-by: Max Gekk 
---
 python/pyspark/sql/functions/builtin.py | 246 ++--
 1 file changed, 201 insertions(+), 45 deletions(-)

diff --git a/python/pyspark/sql/functions/builtin.py 
b/python/pyspark/sql/functions/builtin.py
index ca2efde0b3c2..d3a94fe4b9e9 100644
--- a/python/pyspark/sql/functions/builtin.py
+++ b/python/pyspark/sql/functions/builtin.py
@@ -18836,6 +18836,8 @@ def nvl2(col1: "ColumnOrName", col2: "ColumnOrName", 
col3: "ColumnOrName") -> Co
 return _invoke_function_over_columns("nvl2", col1, col2, col3)
 
 
+# TODO(SPARK-46738) Re-enable testing that includes the 'Cast' operation after
+#  fixing the display difference between Regular Spark and Spark Connect on 
`Cast`.
 @_try_remote_functions
 def aes_encrypt(
 input: "ColumnOrName",
@@ -18877,50 +18879,96 @@ def aes_encrypt(
 Optional additional authenticated data. Only supported for GCM mode. 
This can be any
 free-form input and must be provided for both encryption and 
decryption.
 
+Returns
+---
+:class:`~pyspark.sql.Column`
+A new column that contains an encrypted value.
+
 Examples
 
+
+Example 1: Encrypt data with key, mode, padding, iv and aad.
+
+>>> import pyspark.sql.functions as sf
 >>> df = spark.createDataFrame([(
 ... "Spark", "abcdefghijklmnop12345678ABCDEFGH", "GCM", "DEFAULT",
 ... "", "This is an AAD mixed into the 
input",)],
 ... ["input", "key", "mode", "padding", "iv", "aad"]
 ... )
->>> df.select(base64(aes_encrypt(
-... df.input, df.key, df.mode, df.padding, to_binary(df.iv, 
lit("hex")), df.aad)
-... ).alias('r')).collect()
-[Row(r='QiYi+sTLm7KD9UcZ2nlRdYDe/PX4')]
+>>> df.select(sf.base64(sf.aes_encrypt(
+... df.input, df.key, df.mode, df.padding, sf.to_binary(df.iv, 
sf.lit("hex")), df.aad)
+... )).show(truncate=False)
++---+
+|base64(aes_encrypt(input, key, mode, padding, to_binary(iv, hex), aad))|
++---+
+|QiYi+sTLm7KD9UcZ2nlRdYDe/PX4   |
++---+
 
->>> df.select(base64(aes_encrypt(
-... df.input, df.key, df.mode, df.padding, to_binary(df.iv, 
lit("hex")))
-... ).alias('r')).collect()
-[Row(r='QiYi+sRNYDAOTjdSEcYBFsAWPL1f')]
+Example 2: Encrypt data with key, mode, padding and iv.
 
+>>> import pyspark.sql.functions as sf
+>>> df = spark.createDataFrame([(
+... "Spark", "abcdefghijklmnop12345678ABCDEFGH", "GCM", "DEFAULT",
+... "", "This is an AAD mixed into the 
input",)],
+... ["input", "key", "mode", "padding", "iv", "aad"]
+... )
+>>> df.select(sf.base64(sf.aes_encrypt(
+... df.input, df.key, df.mode, df.padding, sf.to_binary(df.iv, 
sf.lit("hex")))
+... )).show(truncate=False)
+++
+|base64(aes_encrypt(input, key, mode, padding, to_binary(iv, hex), ))|
+++
+|QiYi+sRNYDAOTjdSEcYBFsAWPL1f|
+++
+
+Example 3: Encrypt data with key, mode and padding.
+
+>>> import pyspark.sql.functions as sf
 >>> df = spark.createDataFrame([(
 ... "Spark SQL", "1234567890abcdef", "ECB", "PKCS",)],
 ... ["input", "key", "mode", "padding"]
 ... )
->>> df.select(aes_decrypt(aes_encrypt(df.input, df.key, df.mode, 
df.padding),
-...
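The refined examples above are doctests; the same round trip can also be sanity-checked in one SQL statement. The sketch below assumes a `spark-shell` session (`spark` in scope) and reuses the key/mode/padding values from the docstring example; it is illustrative only.

```scala
// Round-trip sketch mirroring the docstring example (assumes `spark` from spark-shell).
val df = spark.sql(
  """SELECT cast(
    |         aes_decrypt(
    |           aes_encrypt('Spark SQL', '1234567890abcdef', 'ECB', 'PKCS'),
    |           '1234567890abcdef', 'ECB', 'PKCS')
    |       AS STRING) AS roundtrip""".stripMargin)
df.show()  // expected: a single row with the value "Spark SQL"
```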