(spark) branch master updated: [SPARK-46300][PYTHON][CONNECT] Match minor behaviour matching in Column with full test coverage

2023-12-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 19fa431ef611 [SPARK-46300][PYTHON][CONNECT] Match minor behaviour 
matching in Column with full test coverage
19fa431ef611 is described below

commit 19fa431ef61181bd9bfe96a74f6d977b720d281e
Author: Hyukjin Kwon 
AuthorDate: Thu Dec 7 15:50:11 2023 +0900

[SPARK-46300][PYTHON][CONNECT] Match minor behaviour matching in Column 
with full test coverage

### What changes were proposed in this pull request?

This PR matches corner-case behaviours in `Column` between Spark Connect 
and non-Spark Connect, adding unit tests for full test coverage within 
`pyspark.sql.column`.
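
For orientation, a minimal hedged sketch of one of the corner cases being 
aligned; the DataFrame below is illustrative and the exact error class is not 
shown in the quoted diff:

```python
from pyspark.sql import SparkSession
from pyspark.errors import PySparkTypeError

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], ["age", "name"])

# substr() accepts either two ints or two Columns; mixing the two should raise
# the same PySparkTypeError on Spark Connect and non-Spark Connect alike.
try:
    df.select(df.name.substr(df.age, 3)).collect()
except PySparkTypeError as e:
    print(e.getErrorClass())  # error class for mismatched argument types
```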

### Why are the changes needed?

- For feature parity.
- To improve the test coverage.
See 
https://app.codecov.io/gh/apache/spark/commit/1a651753f4e760643d719add3b16acd311454c76/blob/python/pyspark/sql/column.py

This is not being tested.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually ran the new unittest.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44228 from HyukjinKwon/SPARK-46300.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/column.py   | 16 +--
 python/pyspark/sql/connect/column.py   |  2 +-
 python/pyspark/sql/connect/expressions.py  |  5 
 .../sql/tests/connect/test_connect_column.py   |  2 +-
 python/pyspark/sql/tests/test_column.py| 32 +-
 python/pyspark/sql/tests/test_functions.py | 14 +-
 python/pyspark/sql/tests/test_types.py | 12 
 7 files changed, 76 insertions(+), 7 deletions(-)

diff --git a/python/pyspark/sql/column.py b/python/pyspark/sql/column.py
index 9357b4842bbd..198dd9ff3e40 100644
--- a/python/pyspark/sql/column.py
+++ b/python/pyspark/sql/column.py
@@ -75,7 +75,7 @@ def _to_java_expr(col: "ColumnOrName") -> JavaObject:
 
 @overload
 def _to_seq(sc: SparkContext, cols: Iterable[JavaObject]) -> JavaObject:
-pass
+...
 
 
 @overload
@@ -84,7 +84,7 @@ def _to_seq(
 cols: Iterable["ColumnOrName"],
 converter: Optional[Callable[["ColumnOrName"], JavaObject]],
 ) -> JavaObject:
-pass
+...
 
 
 def _to_seq(
@@ -924,10 +924,20 @@ class Column:
 
 Examples
 
+
+Example 1. Using integers for the input arguments.
+
 >>> df = spark.createDataFrame(
 ...  [(2, "Alice"), (5, "Bob")], ["age", "name"])
 >>> df.select(df.name.substr(1, 3).alias("col")).collect()
 [Row(col='Ali'), Row(col='Bob')]
+
+Example 2. Using columns for the input arguments.
+
+>>> df = spark.createDataFrame(
+...  [(3, 4, "Alice"), (2, 3, "Bob")], ["sidx", "eidx", "name"])
+>>> df.select(df.name.substr(df.sidx, df.eidx).alias("col")).collect()
+[Row(col='ice'), Row(col='ob')]
 """
 if type(startPos) != type(length):
 raise PySparkTypeError(
@@ -1199,7 +1209,7 @@ class Column:
 else:
 return Column(getattr(self._jc, "as")(alias[0]))
 else:
-if metadata:
+if metadata is not None:
 raise PySparkValueError(
 error_class="ONLY_ALLOWED_FOR_SINGLE_COLUMN",
 message_parameters={"arg_name": "metadata"},
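 
As a hedged aside on this hunk: with `is not None`, an explicit but empty 
`metadata` dict passed together with multiple alias names is now rejected 
instead of being silently ignored. A small illustrative sketch (DataFrame and 
names are made up):

```python
from pyspark.sql import SparkSession
from pyspark.errors import PySparkValueError

spark = SparkSession.builder.getOrCreate()
df = spark.range(1).selectExpr("id AS a")

try:
    # metadata is only allowed when aliasing to a single name; with this fix
    # even metadata={} triggers the error below.
    df.select(df.a.alias("x", "y", metadata={}))
except PySparkValueError as e:
    print(e.getErrorClass())  # ONLY_ALLOWED_FOR_SINGLE_COLUMN
```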
diff --git a/python/pyspark/sql/connect/column.py 
b/python/pyspark/sql/connect/column.py
index a6d9ca8a2ff4..13b00fd83d8b 100644
--- a/python/pyspark/sql/connect/column.py
+++ b/python/pyspark/sql/connect/column.py
@@ -256,7 +256,7 @@ class Column:
 else:
 raise PySparkTypeError(
 error_class="NOT_COLUMN_OR_INT",
-message_parameters={"arg_name": "length", "arg_type": 
type(length).__name__},
+message_parameters={"arg_name": "startPos", "arg_type": 
type(length).__name__},
 )
 return Column(UnresolvedFunction("substr", [self._expr, start_expr, 
length_expr]))
 
diff --git a/python/pyspark/sql/connect/expressions.py 
b/python/pyspark/sql/connect/expressions.py
index 88c4f4d267b3..384422eed7d1 100644
--- a/python/pyspark/sql/connect/expressions.py
+++ b/python/pyspark/sql/connect/expressions.py
@@ -97,6 +97,11 @@ class Expression:
 
 def alias(self, *alias: str, **kwargs: Any) -> "ColumnAlias":
 metadata = kwargs.pop("metadata", None)
+if len(alias) > 1 and metadata is not None:
+raise PySparkValueError(
+error_class="ONLY_ALLOWED_FOR_SINGLE_COLUMN",
+message_parameters={"arg_name": "metadata"},
+)
   

(spark) branch master updated: [SPARK-45515][CORE][SQL][FOLLOWUP] Use enhanced switch expressions to replace the regular switch statement

2023-12-06 Thread beliefer
This is an automated email from the ASF dual-hosted git repository.

beliefer pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b3c56cffdb6 [SPARK-45515][CORE][SQL][FOLLOWUP] Use enhanced switch 
expressions to replace the regular switch statement
b3c56cffdb6 is described below

commit b3c56cffdb6f731a1c8677a6bc896be0144ac0fc
Author: Jiaan Geng 
AuthorDate: Thu Dec 7 13:50:44 2023 +0800

[SPARK-45515][CORE][SQL][FOLLOWUP] Use enhanced switch expressions to 
replace the regular switch statement

### What changes were proposed in this pull request?
This PR follows up https://github.com/apache/spark/pull/43349.

This PR also does not include parts of the hive and hive-thriftserver 
modules.

### Why are the changes needed?
Please see https://github.com/apache/spark/pull/43349.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
GA

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #44183 from beliefer/SPARK-45515_followup.

Authored-by: Jiaan Geng 
Signed-off-by: Jiaan Geng 
---
 .../org/apache/spark/network/util/DBProvider.java  |  16 +--
 .../apache/spark/network/RpcIntegrationSuite.java  |  10 +-
 .../java/org/apache/spark/network/StreamSuite.java |  19 ++-
 .../shuffle/checksum/ShuffleChecksumHelper.java|  15 +-
 .../apache/spark/unsafe/UnsafeAlignedOffset.java   |  10 +-
 .../apache/spark/launcher/SparkLauncherSuite.java  |  13 +-
 .../apache/spark/launcher/CommandBuilderUtils.java |  98 +++--
 .../spark/launcher/SparkClassCommandBuilder.java   |  45 +++---
 .../spark/launcher/SparkSubmitCommandBuilder.java  |  55 +++-
 .../spark/launcher/InProcessLauncherSuite.java |  43 +++---
 .../sql/connector/util/V2ExpressionSQLBuilder.java | 154 +
 .../parquet/ParquetVectorUpdaterFactory.java   |  39 +++---
 .../parquet/VectorizedColumnReader.java|  50 +++
 .../parquet/VectorizedRleValuesReader.java |  89 +---
 14 files changed, 251 insertions(+), 405 deletions(-)

diff --git 
a/common/network-common/src/main/java/org/apache/spark/network/util/DBProvider.java
 
b/common/network-common/src/main/java/org/apache/spark/network/util/DBProvider.java
index 1adb9cfe5d3..5a25bdda233 100644
--- 
a/common/network-common/src/main/java/org/apache/spark/network/util/DBProvider.java
+++ 
b/common/network-common/src/main/java/org/apache/spark/network/util/DBProvider.java
@@ -38,17 +38,17 @@ public class DBProvider {
 StoreVersion version,
 ObjectMapper mapper) throws IOException {
   if (dbFile != null) {
-switch (dbBackend) {
-  case LEVELDB:
+return switch (dbBackend) {
+  case LEVELDB -> {
 org.iq80.leveldb.DB levelDB = LevelDBProvider.initLevelDB(dbFile, 
version, mapper);
 logger.warn("The LEVELDB is deprecated. Please use ROCKSDB 
instead.");
-return levelDB != null ? new LevelDB(levelDB) : null;
-  case ROCKSDB:
+yield levelDB != null ? new LevelDB(levelDB) : null;
+  }
+  case ROCKSDB -> {
 org.rocksdb.RocksDB rocksDB = RocksDBProvider.initRockDB(dbFile, 
version, mapper);
-return rocksDB != null ? new RocksDB(rocksDB) : null;
-  default:
-throw new IllegalArgumentException("Unsupported DBBackend: " + 
dbBackend);
-}
+yield rocksDB != null ? new RocksDB(rocksDB) : null;
+  }
+};
   }
   return null;
 }
diff --git 
a/common/network-common/src/test/java/org/apache/spark/network/RpcIntegrationSuite.java
 
b/common/network-common/src/test/java/org/apache/spark/network/RpcIntegrationSuite.java
index 55a0cc73f8b..40495d6912c 100644
--- 
a/common/network-common/src/test/java/org/apache/spark/network/RpcIntegrationSuite.java
+++ 
b/common/network-common/src/test/java/org/apache/spark/network/RpcIntegrationSuite.java
@@ -67,14 +67,10 @@ public class RpcIntegrationSuite {
 String msg = JavaUtils.bytesToString(message);
 String[] parts = msg.split("/");
 switch (parts[0]) {
-  case "hello":
+  case "hello" ->
 callback.onSuccess(JavaUtils.stringToBytes("Hello, " + parts[1] + 
"!"));
-break;
-  case "return error":
-callback.onFailure(new RuntimeException("Returned: " + parts[1]));
-break;
-  case "throw error":
-throw new RuntimeException("Thrown: " + parts[1]);
+  case "return error" -> callback.onFailure(new 
RuntimeException("Returned: " + parts[1]));
+  case "throw error" -> throw new RuntimeException("Thrown: " + 
parts[1]);
 }
   }
 
diff --git 

(spark) branch master updated: [SPARK-46298][PYTHON][CONNECT] Match deprecation warning, test case, and error of Catalog.createExternalTable

2023-12-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2ddb6be4724 [SPARK-46298][PYTHON][CONNECT] Match deprecation warning, 
test case, and error of Catalog.createExternalTable
2ddb6be4724 is described below

commit 2ddb6be472431feceecd3daece8bafc8c80d7eb1
Author: Hyukjin Kwon 
AuthorDate: Thu Dec 7 14:00:41 2023 +0900

[SPARK-46298][PYTHON][CONNECT] Match deprecation warning, test case, and 
error of Catalog.createExternalTable

### What changes were proposed in this pull request?

This PR adds tests for catalog error cases for `createExternalTable`.

Also, this PR includes several minor cleanups:
- Show a deprecation warning for `spark.catalog.createExternalTable` (to match 
non-Spark Connect behavior); a sketch of the resulting behavior follows this list.
- Remove `_reset` from `Catalog`, which is not used anywhere.
- Switch the implementation of Spark Connect 
`spark.catalog.createExternalTable` to directly call 
`spark.catalog.createTable`, and remove the corresponding Python protobuf 
definition.
  - This PR does not remove the protobuf message definition itself, for 
potential compatibility concerns.
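
A minimal hedged sketch of the aligned deprecation behavior (table name and 
path below are placeholders, not from the patch):

```python
import warnings
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write a small Parquet dataset so the external table has a location to point at.
spark.range(3).write.mode("overwrite").parquet("/tmp/demo_external_data")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Deprecated entry point: with this change, Spark Connect also warns and
    # simply delegates to spark.catalog.createTable.
    spark.catalog.createExternalTable("demo_ext", path="/tmp/demo_external_data")

print([str(w.message) for w in caught])  # expect a deprecation message pointing to createTable
spark.sql("DROP TABLE IF EXISTS demo_ext")
```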

### Why are the changes needed?

- For feature parity.
- To improve the test coverage.
See 
https://app.codecov.io/gh/apache/spark/commit/1a651753f4e760643d719add3b16acd311454c76/blob/python/pyspark/sql/catalog.py

This is not being tested.

### Does this PR introduce _any_ user-facing change?

Virtually no (except the ones described above).

### How was this patch tested?

Manually ran the new unittest.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44226 from HyukjinKwon/SPARK-46298.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/catalog.py|  8 
 python/pyspark/sql/connect/catalog.py| 22 -
 python/pyspark/sql/connect/plan.py   | 34 
 python/pyspark/sql/tests/test_catalog.py |  9 -
 4 files changed, 21 insertions(+), 52 deletions(-)

diff --git a/python/pyspark/sql/catalog.py b/python/pyspark/sql/catalog.py
index b5337734b3b..6595659a4da 100644
--- a/python/pyspark/sql/catalog.py
+++ b/python/pyspark/sql/catalog.py
@@ -1237,14 +1237,6 @@ class Catalog:
 """
 self._jcatalog.refreshByPath(path)
 
-def _reset(self) -> None:
-"""(Internal use only) Drop all existing databases (except "default"), 
tables,
-partitions and functions, and set the current database to "default".
-
-This is mainly used for tests.
-"""
-self._jsparkSession.sessionState().catalog().reset()
-
 
 def _test() -> None:
 import os
diff --git a/python/pyspark/sql/connect/catalog.py 
b/python/pyspark/sql/connect/catalog.py
index 9143a03d324..ef1bff9d28c 100644
--- a/python/pyspark/sql/connect/catalog.py
+++ b/python/pyspark/sql/connect/catalog.py
@@ -14,6 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
+from pyspark.errors import PySparkTypeError
 from pyspark.sql.connect.utils import check_dependencies
 
 check_dependencies(__name__)
@@ -215,16 +216,11 @@ class Catalog:
 schema: Optional[StructType] = None,
 **options: str,
 ) -> DataFrame:
-catalog = plan.CreateExternalTable(
-table_name=tableName,
-path=path,  # type: ignore[arg-type]
-source=source,
-schema=schema,
-options=options,
+warnings.warn(
+"createExternalTable is deprecated since Spark 4.0, please use 
createTable instead.",
+FutureWarning,
 )
-df = DataFrame(catalog, session=self._sparkSession)
-df._to_table()  # Eager execution.
-return df
+return self.createTable(tableName, path, source, schema, **options)
 
 createExternalTable.__doc__ = PySparkCatalog.createExternalTable.__doc__
 
@@ -237,6 +233,14 @@ class Catalog:
 description: Optional[str] = None,
 **options: str,
 ) -> DataFrame:
+if schema is not None and not isinstance(schema, StructType):
+raise PySparkTypeError(
+error_class="NOT_STRUCT",
+message_parameters={
+"arg_name": "schema",
+"arg_type": type(schema).__name__,
+},
+)
 catalog = plan.CreateTable(
 table_name=tableName,
 path=path,  # type: ignore[arg-type]
diff --git a/python/pyspark/sql/connect/plan.py 
b/python/pyspark/sql/connect/plan.py
index 67a33c2b6cf..cdc06b0f31c 100644
--- a/python/pyspark/sql/connect/plan.py
+++ b/python/pyspark/sql/connect/plan.py
@@ 

(spark) branch master updated: [SPARK-46296][PYTHON][TESTS] Test missing test coverage for captured errors (pyspark.errors.exceptions)

2023-12-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6337412973a [SPARK-46296][PYTHON][TESTS] Test missing test coverage 
for captured errors (pyspark.errors.exceptions)
6337412973a is described below

commit 6337412973af9431502d411679525c518214457a
Author: Hyukjin Kwon 
AuthorDate: Thu Dec 7 13:32:31 2023 +0900

[SPARK-46296][PYTHON][TESTS] Test missing test coverage for captured errors 
(pyspark.errors.exceptions)

### What changes were proposed in this pull request?

This PR adds tests for negative cases of `getErrorClass` and `getSqlState`, 
and a test case for `getMessageParameters` on captured errors.
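
For reference, a hedged sketch of the accessors being exercised; the failing 
query is an arbitrary example and the printed values are the ones expected for it:

```python
from pyspark.sql import SparkSession
from pyspark.errors import AnalysisException

spark = SparkSession.builder.getOrCreate()

try:
    spark.sql("SELECT a")  # unresolved column, no FROM clause
except AnalysisException as e:
    print(e.getErrorClass())         # UNRESOLVED_COLUMN.WITHOUT_SUGGESTION
    print(e.getSqlState())           # 42703
    print(e.getMessageParameters())  # now a plain Python dict, e.g. {'objectName': '`a`'}
```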

### Why are the changes needed?

To improve the test coverage.

See 
https://app.codecov.io/gh/apache/spark/commit/1a651753f4e760643d719add3b16acd311454c76/blob/python/pyspark/errors/exceptions/captured.py
 This is not being tested.

### Does this PR introduce _any_ user-facing change?

No, test-only

### How was this patch tested?

Manually ran the new unittest.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44224 from HyukjinKwon/SPARK-46296.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/errors/exceptions/captured.py | 2 +-
 python/pyspark/sql/tests/test_utils.py   | 8 
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/errors/exceptions/captured.py 
b/python/pyspark/errors/exceptions/captured.py
index df23be8a979..ec987e0854e 100644
--- a/python/pyspark/errors/exceptions/captured.py
+++ b/python/pyspark/errors/exceptions/captured.py
@@ -104,7 +104,7 @@ class CapturedException(PySparkException):
 if self._origin is not None and is_instance_of(
 gw, self._origin, "org.apache.spark.SparkThrowable"
 ):
-return self._origin.getMessageParameters()
+return dict(self._origin.getMessageParameters())
 else:
 return None
 
diff --git a/python/pyspark/sql/tests/test_utils.py 
b/python/pyspark/sql/tests/test_utils.py
index b99ca41e84f..f633837002e 100644
--- a/python/pyspark/sql/tests/test_utils.py
+++ b/python/pyspark/sql/tests/test_utils.py
@@ -1749,6 +1749,14 @@ class UtilsTests(ReusedSQLTestCase, UtilsTestsMixin):
 except AnalysisException as e:
 self.assertEqual(e.getErrorClass(), 
"UNRESOLVED_COLUMN.WITHOUT_SUGGESTION")
 self.assertEqual(e.getSqlState(), "42703")
+self.assertEqual(e.getMessageParameters(), {"objectName": "`a`"})
+
+try:
+self.spark.sql("""SELECT assert_true(FALSE)""")
+except AnalysisException as e:
+self.assertIsNone(e.getErrorClass())
+self.assertIsNone(e.getSqlState())
+self.assertEqual(e.getMessageParameters(), {})
 
 
 if __name__ == "__main__":





(spark) branch branch-3.3 updated: [SPARK-45580][SQL][3.3] Handle case where a nested subquery becomes an existence join

2023-12-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 6a4488f2f48 [SPARK-45580][SQL][3.3] Handle case where a nested 
subquery becomes an existence join
6a4488f2f48 is described below

commit 6a4488f2f4861df41025480cceda643e9e74484e
Author: Bruce Robbins 
AuthorDate: Wed Dec 6 19:24:13 2023 -0800

[SPARK-45580][SQL][3.3] Handle case where a nested subquery becomes an 
existence join

### What changes were proposed in this pull request?

This is a back-port of https://github.com/apache/spark/pull/44193.

In `RewritePredicateSubquery`, prune existence flags from the final join 
when `rewriteExistentialExpr` returns an existence join. This change prunes the 
flags (attributes with the name "exists") by adding a `Project` node.

For example:
```
Join LeftSemi, ((a#13 = c1#15) OR exists#19)
:- Join ExistenceJoin(exists#19), (a#13 = col1#17)
:  :- LocalRelation [a#13]
:  +- LocalRelation [col1#17]
+- LocalRelation [c1#15]
```
becomes
```
Project [a#13]
+- Join LeftSemi, ((a#13 = c1#15) OR exists#19)
   :- Join ExistenceJoin(exists#19), (a#13 = col1#17)
   :  :- LocalRelation [a#13]
   :  +- LocalRelation [col1#17]
   +- LocalRelation [c1#15]
```
This change always adds the `Project` node, whether 
`rewriteExistentialExpr` returns an existence join or not. In the case when 
`rewriteExistentialExpr` does not return an existence join, 
`RemoveNoopOperators` will remove the unneeded `Project` node.

### Why are the changes needed?

This query returns an extraneous boolean column when run in spark-sql:
```
create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);

select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);

1   false
2   false
3   true
```
(Note: the above query will not have the extraneous boolean column when run 
from the Dataset API. That is because the Dataset API truncates the rows based 
on the schema of the analyzed plan. The bug occurs during optimization).

This query fails when run in either spark-sql or using the Dataset API:
```
select (
  select *
  from t1
  where exists (
select c1
from t2
where a = c1
or a in (select col1 from t3)
  )
  limit 1
)
from range(1);

java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
something went wrong in analysis
```

### Does this PR introduce _any_ user-facing change?

No, except for the removal of the extraneous boolean flag and the fix to 
the error condition.

### How was this patch tested?

New unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44223 from bersprockets/schema_change_br33.

Authored-by: Bruce Robbins 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/catalyst/optimizer/subquery.scala|  9 +++--
 .../scala/org/apache/spark/sql/SubquerySuite.scala | 46 ++
 2 files changed, 52 insertions(+), 3 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
index 7ef5ef55fab..ff198c798b9 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
@@ -113,16 +113,19 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] 
with PredicateHelper {
   withSubquery.foldLeft(newFilter) {
 case (p, Exists(sub, _, _, conditions)) =>
   val (joinCond, outerPlan) = rewriteExistentialExpr(conditions, p)
-  buildJoin(outerPlan, sub, LeftSemi, joinCond)
+  val join = buildJoin(outerPlan, sub, LeftSemi, joinCond)
+  Project(p.output, join)
 case (p, Not(Exists(sub, _, _, conditions))) =>
   val (joinCond, outerPlan) = rewriteExistentialExpr(conditions, p)
-  buildJoin(outerPlan, sub, LeftAnti, joinCond)
+  val join = buildJoin(outerPlan, sub, LeftAnti, joinCond)
+  Project(p.output, join)
 case (p, InSubquery(values, ListQuery(sub, _, _, _, conditions))) =>
   // Deduplicate conflicting attributes if any.
   val newSub = dedupSubqueryOnSelfJoin(p, sub, Some(values))
   val inConditions = values.zip(newSub.output).map(EqualTo.tupled)
   val (joinCond, 

(spark) branch branch-3.4 updated: [SPARK-45580][SQL][3.4] Handle case where a nested subquery becomes an existence join

2023-12-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 8e40ec6fa52 [SPARK-45580][SQL][3.4] Handle case where a nested 
subquery becomes an existence join
8e40ec6fa52 is described below

commit 8e40ec6fa525420c1da5ce3b8846ef9f540b9d49
Author: Bruce Robbins 
AuthorDate: Wed Dec 6 19:23:19 2023 -0800

[SPARK-45580][SQL][3.4] Handle case where a nested subquery becomes an 
existence join

### What changes were proposed in this pull request?

This is a back-port of https://github.com/apache/spark/pull/44193.

In `RewritePredicateSubquery`, prune existence flags from the final join 
when `rewriteExistentialExpr` returns an existence join. This change prunes the 
flags (attributes with the name "exists") by adding a `Project` node.

For example:
```
Join LeftSemi, ((a#13 = c1#15) OR exists#19)
:- Join ExistenceJoin(exists#19), (a#13 = col1#17)
:  :- LocalRelation [a#13]
:  +- LocalRelation [col1#17]
+- LocalRelation [c1#15]
```
becomes
```
Project [a#13]
+- Join LeftSemi, ((a#13 = c1#15) OR exists#19)
   :- Join ExistenceJoin(exists#19), (a#13 = col1#17)
   :  :- LocalRelation [a#13]
   :  +- LocalRelation [col1#17]
   +- LocalRelation [c1#15]
```
This change always adds the `Project` node, whether 
`rewriteExistentialExpr` returns an existence join or not. In the case when 
`rewriteExistentialExpr` does not return an existence join, 
`RemoveNoopOperators` will remove the unneeded `Project` node.

### Why are the changes needed?

This query returns an extraneous boolean column when run in spark-sql:
```
create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);

select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);

1   false
2   false
3   true
```
(Note: the above query will not have the extraneous boolean column when run 
from the Dataset API. That is because the Dataset API truncates the rows based 
on the schema of the analyzed plan. The bug occurs during optimization).

This query fails when run in either spark-sql or using the Dataset API:
```
select (
  select *
  from t1
  where exists (
select c1
from t2
where a = c1
or a in (select col1 from t3)
  )
  limit 1
)
from range(1);

java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
something went wrong in analysis
```

### Does this PR introduce _any_ user-facing change?

No, except for the removal of the extraneous boolean flag and the fix to 
the error condition.

### How was this patch tested?

New unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44219 from bersprockets/schema_change_br34.

Authored-by: Bruce Robbins 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/catalyst/optimizer/subquery.scala|  9 +++--
 .../scala/org/apache/spark/sql/SubquerySuite.scala | 46 ++
 2 files changed, 52 insertions(+), 3 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
index 1d2f5602630..861f2f2fabf 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
@@ -118,16 +118,19 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] 
with PredicateHelper {
   withSubquery.foldLeft(newFilter) {
 case (p, Exists(sub, _, _, conditions, subHint)) =>
   val (joinCond, outerPlan) = rewriteExistentialExpr(conditions, p)
-  buildJoin(outerPlan, sub, LeftSemi, joinCond, subHint)
+  val join = buildJoin(outerPlan, sub, LeftSemi, joinCond, subHint)
+  Project(p.output, join)
 case (p, Not(Exists(sub, _, _, conditions, subHint))) =>
   val (joinCond, outerPlan) = rewriteExistentialExpr(conditions, p)
-  buildJoin(outerPlan, sub, LeftAnti, joinCond, subHint)
+  val join = buildJoin(outerPlan, sub, LeftAnti, joinCond, subHint)
+  Project(p.output, join)
 case (p, InSubquery(values, ListQuery(sub, _, _, _, conditions, 
subHint))) =>
   // Deduplicate conflicting attributes if any.
   val newSub = dedupSubqueryOnSelfJoin(p, sub, Some(values))
   val inConditions = 

(spark) branch master updated: [SPARK-46299][DOCS] Make `spark.deploy.recovery*` docs up-to-date

2023-12-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 43ca0b929ab [SPARK-46299][DOCS] Make `spark.deploy.recovery*` docs 
up-to-date
43ca0b929ab is described below

commit 43ca0b929ab3c2f10d1879e5df622195564f8885
Author: Dongjoon Hyun 
AuthorDate: Wed Dec 6 19:19:41 2023 -0800

[SPARK-46299][DOCS] Make `spark.deploy.recovery*` docs up-to-date

### What changes were proposed in this pull request?

This PR aims to update `Spark Standalone` cluster recovery configurations.

### Why are the changes needed?

We need to document
- #44173
- #44129
- #44113

![Screenshot 2023-12-06 at 7 15 24 
PM](https://github.com/apache/spark/assets/9700541/04f0be6f-cdfb-4d87-b1b5-c4bf131f460a)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44227 from dongjoon-hyun/SPARK-46299.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 docs/spark-standalone.md | 26 +++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md
index 7a89c8124bd..25d2fba47ce 100644
--- a/docs/spark-standalone.md
+++ b/docs/spark-standalone.md
@@ -735,18 +735,38 @@ In order to enable this recovery mode, you can set 
SPARK_DAEMON_JAVA_OPTS in spa
   
 spark.deploy.recoveryMode
 NONE
-The recovery mode setting to recover submitted Spark jobs with cluster 
mode when it failed and relaunches.
-  Set to FILESYSTEM to enable single-node recovery mode, ZOOKEEPER to use 
Zookeeper-based recovery mode, and
+The recovery mode setting to recover submitted Spark jobs with cluster 
mode when it failed and relaunches. Set to
+  FILESYSTEM to enable file-system-based single-node recovery mode,
+  ROCKSDB to enable RocksDB-based single-node recovery mode,
+  ZOOKEEPER to use Zookeeper-based recovery mode, and
   CUSTOM to provide a customer provider class via additional 
`spark.deploy.recoveryMode.factory` configuration.
+  NONE is the default value which disables this recovery mode.
 
 0.8.1
   
   
 spark.deploy.recoveryDirectory
 ""
-The directory in which Spark will store recovery state, accessible 
from the Master's perspective.
+The directory in which Spark will store recovery state, accessible 
from the Master's perspective.
+  Note that the directory should be cleared manually if 
spark.deploy.recoveryMode,
+  spark.deploy.recoverySerializer, or 
spark.deploy.recoveryCompressionCodec is changed.
+
 0.8.1
   
+  
+spark.deploy.recoverySerializer
+JAVA
+A serializer for writing/reading objects to/from persistence engines; 
JAVA (default) or KRYO.
+  Java serializer has been the default mode since Spark 0.8.1.
+  Kryo serializer is a new fast and compact mode from Spark 4.0.0.
+4.0.0
+  
+  
+spark.deploy.recoveryCompressionCodec
+(none)
+A compression codec for persistence engines. none (default), lz4, lzf, 
snappy, and zstd. Currently, only FILESYSTEM mode supports this 
configuration.
+4.0.0
+  
   
 spark.deploy.recoveryMode.factory
 ""





(spark) branch master updated: [SPARK-46297][PYTHON][INFRA] Exclude generated files from the code coverage report

2023-12-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d02674713ac [SPARK-46297][PYTHON][INFRA] Exclude generated files from 
the code coverage report
d02674713ac is described below

commit d02674713ac2183a622f280b7f979678ab35a920
Author: Hyukjin Kwon 
AuthorDate: Thu Dec 7 12:05:51 2023 +0900

[SPARK-46297][PYTHON][INFRA] Exclude generated files from the code coverage 
report

### What changes were proposed in this pull request?

This PR proposes to exclude generated files from the code coverage report, 
`pyspark/sql/connect/proto/*`.

### Why are the changes needed?

For a correct test coverage report, and to make it easier to read.


https://app.codecov.io/gh/apache/spark/commit/1a651753f4e760643d719add3b16acd311454c76/tree/python/pyspark/sql/connect/proto

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44225 from HyukjinKwon/SPARK-46297.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 python/run-tests-with-coverage | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/python/run-tests-with-coverage b/python/run-tests-with-coverage
index 0ca054dcdf1..33996409618 100755
--- a/python/run-tests-with-coverage
+++ b/python/run-tests-with-coverage
@@ -60,10 +60,10 @@ find $COVERAGE_DIR/coverage_data -size 0 -print0 | xargs -0 
rm -fr
 echo "Combining collected coverage data under $COVERAGE_DIR/coverage_data"
 $COV_EXEC combine
 echo "Creating XML report file at python/coverage.xml"
-$COV_EXEC xml --ignore-errors --include "pyspark/*" --omit 
"pyspark/cloudpickle/*"
+$COV_EXEC xml --ignore-errors --include "pyspark/*" --omit 
"pyspark/cloudpickle/*" --omit "pyspark/sql/connect/proto/*"
 echo "Reporting the coverage data at $COVERAGE_DIR/coverage_data/coverage"
-$COV_EXEC report --include "pyspark/*" --omit "pyspark/cloudpickle/*"
+$COV_EXEC report --include "pyspark/*" --omit "pyspark/cloudpickle/*" --omit 
"pyspark/sql/connect/proto/*"
 echo "Generating HTML files for PySpark coverage under $COVERAGE_DIR/htmlcov"
-$COV_EXEC html --ignore-errors --include "pyspark/*" --directory 
"$COVERAGE_DIR/htmlcov" --omit "pyspark/cloudpickle/*"
+$COV_EXEC html --ignore-errors --include "pyspark/*" --directory 
"$COVERAGE_DIR/htmlcov" --omit "pyspark/cloudpickle/*" --omit 
"pyspark/sql/connect/proto/*"
 
 popd





(spark) branch master updated: [SPARK-46058][CORE] Add separate flag for privateKeyPassword

2023-12-06 Thread mridulm80
This is an automated email from the ASF dual-hosted git repository.

mridulm80 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9c9020ebafd [SPARK-46058][CORE] Add separate flag for 
privateKeyPassword
9c9020ebafd is described below

commit 9c9020ebafd88684d7f10a2b871f9bc14ebba8b4
Author: Hasnain Lakhani 
AuthorDate: Wed Dec 6 19:58:51 2023 -0600

[SPARK-46058][CORE] Add separate flag for privateKeyPassword

### What changes were proposed in this pull request?

This PR adds a separate way of configuring the private key for RPC SSL 
support when using openssl.

### Why are the changes needed?

Right now with config inheritance we support:

* JKS with password A, PEM with password B
* JKS with no password, PEM with password A
* JKS and PEM with no password

But we do not support the case where JKS has a password and PEM does not. 
If we set `keyPassword` we will attempt to use it, and cannot set 
`spark.ssl.rpc.keyPassword` to null to override the password. So let's make it 
a separate flag as the easiest workaround.
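
For illustration, a hedged sketch of the configuration shape this enables. 
The `spark.ssl.rpc.privateKeyPassword` key name is an assumption inferred from 
the diff below, and all paths and secrets are placeholders:

```python
from pyspark import SparkConf

# Goal: a password-protected JKS keystore plus an *unencrypted* PEM key for the
# openssl provider. Previously keyPassword was reused for the PEM key too; the
# separate flag lets the PEM key carry no password at all.
conf = (
    SparkConf()
    .set("spark.ssl.rpc.enabled", "true")
    .set("spark.ssl.rpc.keyStore", "/path/to/keystore.jks")            # placeholder
    .set("spark.ssl.rpc.keyPassword", "jks-key-password")              # JKS path only now
    .set("spark.ssl.rpc.privateKey", "/path/to/unencrypted-key.pem")   # placeholder
    .set("spark.ssl.rpc.certChain", "/path/to/certchain.pem")          # placeholder
    # Assumed new key; left unset here because the PEM key has no password:
    # .set("spark.ssl.rpc.privateKeyPassword", "pem-password")
)
```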

This was noticed while migrating some existing deployments to the RPC SSL 
support where we use openssl support for RPC and use a key with no password

### Does this PR introduce _any_ user-facing change?

Yes, this affects how the (currently unreleased) RPC SSL feature is 
configured going forward

### How was this patch tested?

Updated test configs to match the issue I saw, which would fail 
`SSLFactory.init()` saying key was invalid. Tests now pass.

```
build/sbt
> project network-common
> testOnly
> project network-shuffle
> testOnly
> project core
> test *Ssl*
```

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43998 from hasnain-db/new-flag.

Authored-by: Hasnain Lakhani 
Signed-off-by: Mridul Muralidharan gmail.com>
---
 .../org/apache/spark/network/TransportContext.java |  1 +
 .../org/apache/spark/network/ssl/SSLFactory.java   | 18 +++---
 .../apache/spark/network/util/TransportConf.java   | 11 +++--
 .../apache/spark/network/ssl/SslSampleConfigs.java | 22 +++--
 .../src/test/resources/unencrypted-certchain.pem   | 21 
 .../src/test/resources/unencrypted-key.pem | 28 ++
 .../src/test/resources/unencrypted-certchain.pem   | 21 
 .../src/test/resources/unencrypted-key.pem | 28 ++
 .../main/scala/org/apache/spark/SSLOptions.scala   | 23 ++
 .../scala/org/apache/spark/SecurityManager.scala   |  2 ++
 core/src/test/resources/unencrypted-certchain.pem  | 21 
 core/src/test/resources/unencrypted-key.pem| 28 ++
 .../scala/org/apache/spark/SSLOptionsSuite.scala   |  2 ++
 docs/security.md   |  8 +++
 .../src/test/resources/unencrypted-certchain.pem   | 21 
 .../yarn/src/test/resources/unencrypted-key.pem| 28 ++
 16 files changed, 266 insertions(+), 17 deletions(-)

diff --git 
a/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java
 
b/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java
index 90ca4f4c46a..9f3b9c59256 100644
--- 
a/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java
+++ 
b/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java
@@ -262,6 +262,7 @@ public class TransportContext implements Closeable {
   .requestedCiphers(conf.sslRpcRequestedCiphers())
   .keyStore(conf.sslRpcKeyStore(), conf.sslRpcKeyStorePassword())
   .privateKey(conf.sslRpcPrivateKey())
+  .privateKeyPassword(conf.sslRpcPrivateKeyPassword())
   .keyPassword(conf.sslRpcKeyPassword())
   .certChain(conf.sslRpcCertChain())
   .trustStore(
diff --git 
a/common/network-common/src/main/java/org/apache/spark/network/ssl/SSLFactory.java
 
b/common/network-common/src/main/java/org/apache/spark/network/ssl/SSLFactory.java
index 19c19ec2820..0ae83eb5fd6 100644
--- 
a/common/network-common/src/main/java/org/apache/spark/network/ssl/SSLFactory.java
+++ 
b/common/network-common/src/main/java/org/apache/spark/network/ssl/SSLFactory.java
@@ -106,7 +106,7 @@ public class SSLFactory {
   .build();
 
 nettyServerSslContext = SslContextBuilder
-  .forServer(b.certChain, b.privateKey, b.keyPassword)
+  .forServer(b.certChain, b.privateKey, b.privateKeyPassword)
   .sslProvider(getSslProvider(b))
   .build();
   }
@@ -160,6 +160,7 @@ public class SSLFactory {
 private File keyStore;
 private String keyStorePassword;
 private File privateKey;
+

(spark) branch master updated: [SPARK-46292][CORE][UI] Show a summary of workers in MasterPage

2023-12-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e2441c41de4 [SPARK-46292][CORE][UI] Show a summary of workers in 
MasterPage
e2441c41de4 is described below

commit e2441c41de476b09542db60836d7d853d47f6158
Author: Dongjoon Hyun 
AuthorDate: Wed Dec 6 17:49:37 2023 -0800

[SPARK-46292][CORE][UI] Show a summary of workers in MasterPage

### What changes were proposed in this pull request?

This PR aims to show a summary of workers in MasterPage.

### Why are the changes needed?

Although `Alive Workers` is useful information, it is insufficient for 
analyzing the whole cluster status because we don't know how many workers are 
in other states. This is especially useful during the recovery process of a 
Spark Master HA setup.

In short, this helps users identify issues intuitively.

```
- Alive Workers: 1
+ Workers: 1 Alive, 1 Dead, 0 Decommissioned, 0 Unknown
```

Here is a screenshot.

![Screenshot 2023-12-06 at 3 13 43 
PM](https://github.com/apache/spark/assets/9700541/f078b6ae-ab22-4721-8c67-661121bb9807)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44218 from dongjoon-hyun/SPARK-46292.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala   | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git 
a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala 
b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
index b2f35984d37..f25e3495d79 100644
--- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
@@ -144,7 +144,11 @@ private[ui] class MasterPage(parent: MasterWebUI) extends 
WebUIPage("") {
   
 }.getOrElse { Seq.empty }
   }
-  Alive Workers: {aliveWorkers.length}
+  Workers: {aliveWorkers.length} Alive,
+{workers.count(_.state == WorkerState.DEAD)} Dead,
+{workers.count(_.state == WorkerState.DECOMMISSIONED)} 
Decommissioned,
+{workers.count(_.state == WorkerState.UNKNOWN)} Unknown
+  
   Cores in use: 
{aliveWorkers.map(_.cores).sum} Total,
 {aliveWorkers.map(_.coresUsed).sum} Used
   Memory in use:





(spark) branch master updated: [SPARK-46290][PYTHON] Change saveMode to a boolean flag for DataSourceWriter

2023-12-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9da1e4ca7bb [SPARK-46290][PYTHON] Change saveMode to a boolean flag 
for DataSourceWriter
9da1e4ca7bb is described below

commit 9da1e4ca7bb89a8b5730d9e496c378c8357e003a
Author: allisonwang-db 
AuthorDate: Thu Dec 7 09:42:04 2023 +0900

[SPARK-46290][PYTHON] Change saveMode to a boolean flag for DataSourceWriter

### What changes were proposed in this pull request?

This PR updates the `writer` method in the Python data source API from
```
def writer(self, schema: StructType, saveMode: str)
```
to
```
def writer(self, schema: StructType, overwrite: bool)
```
The motivation here is that `saveMode` offers four modes: append, 
overwrite, error, and ignore, but practically speaking, only append and 
overwrite are meaningful. DSv2 also only supports the append and overwrite 
modes, and Python data sources should be consistent with that.
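
A minimal hedged sketch of a Python data source adopting the new signature, 
assuming the `DataSourceWriter` and `WriterCommitMessage` classes from 
`pyspark.sql.datasource`; the class and format names are made up for illustration:

```python
from pyspark.sql.datasource import DataSource, DataSourceWriter, WriterCommitMessage
from pyspark.sql.types import StructType


class LoggingWriter(DataSourceWriter):
    def __init__(self, overwrite: bool):
        self.overwrite = overwrite

    def write(self, iterator):
        # A real writer would push rows to the external system here; the
        # boolean tells it whether to truncate existing data first.
        for _ in iterator:
            pass
        return WriterCommitMessage()


class LoggingDataSource(DataSource):
    @classmethod
    def name(cls):
        return "logging"

    def writer(self, schema: StructType, overwrite: bool) -> DataSourceWriter:
        # New signature: a boolean flag instead of the old saveMode string.
        return LoggingWriter(overwrite)
```

Registration and use would then look roughly like 
`spark.dataSource.register(LoggingDataSource)` followed by 
`df.write.format("logging").mode("overwrite").save()`, at which point Spark 
passes `overwrite=True` into `writer()`.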

### Why are the changes needed?

To make the API simpler.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44216 from allisonwang-db/spark-46290-overwrite.

Authored-by: allisonwang-db 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/datasource.py | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/python/pyspark/sql/datasource.py b/python/pyspark/sql/datasource.py
index 4713ca5366a..e20d44039a6 100644
--- a/python/pyspark/sql/datasource.py
+++ b/python/pyspark/sql/datasource.py
@@ -130,7 +130,7 @@ class DataSource(ABC):
 message_parameters={"feature": "reader"},
 )
 
-def writer(self, schema: StructType, saveMode: str) -> "DataSourceWriter":
+def writer(self, schema: StructType, overwrite: bool) -> 
"DataSourceWriter":
 """
 Returns a ``DataSourceWriter`` instance for writing data.
 
@@ -140,9 +140,8 @@ class DataSource(ABC):
 --
 schema : StructType
 The schema of the data to be written.
-saveMode : str
-A string identifies the save mode. It can be one of the following:
-`append`, `overwrite`, `error`, `ignore`.
+overwrite : bool
+A flag indicating whether to overwrite existing data when writing 
to the data source.
 
 Returns
 ---





(spark) branch master updated: [SPARK-46231][PYTHON][FOLLOWUP] Cleanup duplicated test

2023-12-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e013c4e859e [SPARK-46231][PYTHON][FOLLOWUP] Cleanup duplicated test
e013c4e859e is described below

commit e013c4e859e5df4fc23592bb77007c5b41c3c72d
Author: Haejoon Lee 
AuthorDate: Thu Dec 7 08:58:09 2023 +0900

[SPARK-46231][PYTHON][FOLLOWUP] Cleanup duplicated test

### What changes were proposed in this pull request?

This PR follows up https://github.com/apache/spark/pull/44148 to remove a 
duplicated test.

### Why are the changes needed?

Cleanup duplicated test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The existing CI should pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44217 from itholic/SPARK-46231-followup.

Authored-by: Haejoon Lee 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/tests/pandas/test_pandas_udf_grouped_agg.py | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/python/pyspark/sql/tests/pandas/test_pandas_udf_grouped_agg.py 
b/python/pyspark/sql/tests/pandas/test_pandas_udf_grouped_agg.py
index 455bb09a7dc..b500be7a969 100644
--- a/python/pyspark/sql/tests/pandas/test_pandas_udf_grouped_agg.py
+++ b/python/pyspark/sql/tests/pandas/test_pandas_udf_grouped_agg.py
@@ -720,9 +720,6 @@ class GroupedAggPandasUDFTestsMixin:
 
 
 class GroupedAggPandasUDFTests(GroupedAggPandasUDFTestsMixin, 
ReusedSQLTestCase):
-def test_unsupported_types(self):
-super().test_unsupported_types()
-
 pass
 
 





(spark) branch master updated: [MINOR][PYTHON][DOCS] Rename "Koalas" > "Pandas-on-Spark" for testing util doc

2023-12-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9815bb19ef1 [MINOR][PYTHON][DOCS] Rename "Koalas" > "Pandas-on-Spark" 
for testing util doc
9815bb19ef1 is described below

commit 9815bb19ef109e47471a36bbb98b7cf440622087
Author: Haejoon Lee 
AuthorDate: Thu Dec 7 08:57:29 2023 +0900

[MINOR][PYTHON][DOCS] Rename "Koalas" > "Pandas-on-Spark" for testing util 
doc

### What changes were proposed in this pull request?

This PR proposes to rename Koalas to Pandas-on-Spark for testing util 
documentation.

### Why are the changes needed?

To get rid of legacy documentation.

### Does this PR introduce _any_ user-facing change?

No API changes.

### How was this patch tested?

The existing CI should pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44220 from itholic/minor_doc_koalas_removal.

Authored-by: Haejoon Lee 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/testing/pandasutils.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/testing/pandasutils.py 
b/python/pyspark/testing/pandasutils.py
index 9abffefdbe7..6c6d10a04db 100644
--- a/python/pyspark/testing/pandasutils.py
+++ b/python/pyspark/testing/pandasutils.py
@@ -480,8 +480,8 @@ class PandasOnSparkTestUtils:
 check_row_order: bool = True,
 ):
 """
-Asserts if two arbitrary objects are equal or not. If given objects 
are Koalas DataFrame
-or Series, they are converted into pandas' and compared.
+Asserts if two arbitrary objects are equal or not. If given objects are
+Pandas-on-Spark DataFrame or Series, they are converted into pandas' 
and compared.
 
 :param left: object to compare
 :param right: object to compare





(spark) branch branch-3.5 updated: [SPARK-45580][SQL][3.5] Handle case where a nested subquery becomes an existence join

2023-12-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new dbb61981b80 [SPARK-45580][SQL][3.5] Handle case where a nested 
subquery becomes an existence join
dbb61981b80 is described below

commit dbb61981b804dbc03cf140c7c76653348e2ac740
Author: Bruce Robbins 
AuthorDate: Wed Dec 6 15:24:48 2023 -0800

[SPARK-45580][SQL][3.5] Handle case where a nested subquery becomes an 
existence join

### What changes were proposed in this pull request?

This is a back-port of #44193.

In `RewritePredicateSubquery`, prune existence flags from the final join 
when `rewriteExistentialExpr` returns an existence join. This change prunes the 
flags (attributes with the name "exists") by adding a `Project` node.

For example:
```
Join LeftSemi, ((a#13 = c1#15) OR exists#19)
:- Join ExistenceJoin(exists#19), (a#13 = col1#17)
:  :- LocalRelation [a#13]
:  +- LocalRelation [col1#17]
+- LocalRelation [c1#15]
```
becomes
```
Project [a#13]
+- Join LeftSemi, ((a#13 = c1#15) OR exists#19)
   :- Join ExistenceJoin(exists#19), (a#13 = col1#17)
   :  :- LocalRelation [a#13]
   :  +- LocalRelation [col1#17]
   +- LocalRelation [c1#15]
```
This change always adds the `Project` node, whether 
`rewriteExistentialExpr` returns an existence join or not. In the case when 
`rewriteExistentialExpr` does not return an existence join, 
`RemoveNoopOperators` will remove the unneeded `Project` node.

### Why are the changes needed?

This query returns an extraneous boolean column when run in spark-sql:
```
create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);

select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);

1   false
2   false
3   true
```
(Note: the above query will not have the extraneous boolean column when run 
from the Dataset API. That is because the Dataset API truncates the rows based 
on the schema of the analyzed plan. The bug occurs during optimization).

This query fails when run in either spark-sql or using the Dataset API:
```
select (
  select *
  from t1
  where exists (
select c1
from t2
where a = c1
or a in (select col1 from t3)
  )
  limit 1
)
from range(1);

java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
something went wrong in analysis
```

### Does this PR introduce _any_ user-facing change?

No, except for the removal of the extraneous boolean flag and the fix to 
the error condition.

### How was this patch tested?

New unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44215 from bersprockets/schema_change_br35.

Authored-by: Bruce Robbins 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/catalyst/optimizer/subquery.scala|  9 +++--
 .../scala/org/apache/spark/sql/SubquerySuite.scala | 46 ++
 2 files changed, 52 insertions(+), 3 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
index 91cd838ad61..ee200531578 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
@@ -118,16 +118,19 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] 
with PredicateHelper {
   withSubquery.foldLeft(newFilter) {
 case (p, Exists(sub, _, _, conditions, subHint)) =>
   val (joinCond, outerPlan) = rewriteExistentialExpr(conditions, p)
-  buildJoin(outerPlan, sub, LeftSemi, joinCond, subHint)
+  val join = buildJoin(outerPlan, sub, LeftSemi, joinCond, subHint)
+  Project(p.output, join)
 case (p, Not(Exists(sub, _, _, conditions, subHint))) =>
   val (joinCond, outerPlan) = rewriteExistentialExpr(conditions, p)
-  buildJoin(outerPlan, sub, LeftAnti, joinCond, subHint)
+  val join = buildJoin(outerPlan, sub, LeftAnti, joinCond, subHint)
+  Project(p.output, join)
 case (p, InSubquery(values, ListQuery(sub, _, _, _, conditions, 
subHint))) =>
   // Deduplicate conflicting attributes if any.
   val newSub = dedupSubqueryOnSelfJoin(p, sub, Some(values))
   val inConditions = values.zip(newSub.output).map(EqualTo.tupled)

(spark) branch branch-3.5 updated: [SPARK-46274][SQL] Fix Range operator computeStats() to check long validity before converting

2023-12-06 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new a697725d99a [SPARK-46274][SQL] Fix Range operator computeStats() to 
check long validity before converting
a697725d99a is described below

commit a697725d99a0177a2b1fbb0607e859ac10af1c4e
Author: Nick Young 
AuthorDate: Wed Dec 6 15:20:19 2023 -0800

[SPARK-46274][SQL] Fix Range operator computeStats() to check long validity 
before converting

### What changes were proposed in this pull request?

Range operator's `computeStats()` function unsafely casts from `BigInt` to 
`Long` and causes issues downstream with statistics estimation. Adds bounds 
checking to avoid crashing.

### Why are the changes needed?

Downstream statistics estimation will crash and fail loudly; to avoid this 
and help maintain clean code we should fix this.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

UT

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44191 from n-young-db/range-compute-stats.

Authored-by: Nick Young 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 9fd575ae46f8a4dbd7da18887a44c693d8788332)
Signed-off-by: Wenchen Fan 
---
 .../catalyst/plans/logical/basicLogicalOperators.scala   | 12 +++-
 .../statsEstimation/BasicStatsEstimationSuite.scala  | 16 
 2 files changed, 23 insertions(+), 5 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
index b4d7716a566..58c03ee72d6 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
@@ -1063,10 +1063,12 @@ case class Range(
 if (numElements == 0) {
   Statistics(sizeInBytes = 0, rowCount = Some(0))
 } else {
-  val (minVal, maxVal) = if (step > 0) {
-(start, start + (numElements - 1) * step)
+  val (minVal, maxVal) = if (!numElements.isValidLong) {
+(None, None)
+  } else if (step > 0) {
+(Some(start), Some(start + (numElements.toLong - 1) * step))
   } else {
-(start + (numElements - 1) * step, start)
+(Some(start + (numElements.toLong - 1) * step), Some(start))
   }
 
   val histogram = if (conf.histogramEnabled) {
@@ -1077,8 +1079,8 @@ case class Range(
 
   val colStat = ColumnStat(
 distinctCount = Some(numElements),
-max = Some(maxVal),
-min = Some(minVal),
+max = maxVal,
+min = minVal,
 nullCount = Some(0),
 avgLen = Some(LongType.defaultSize),
 maxLen = Some(LongType.defaultSize),
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
index 33e521eb65a..d1276615c5f 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
@@ -176,6 +176,22 @@ class BasicStatsEstimationSuite extends PlanTest with 
StatsEstimationTestBase {
 expectedStatsCboOff = rangeStats, extraConfig)
   }
 
+test("range with invalid long value") {
+  val numElements = BigInt(Long.MaxValue) - BigInt(Long.MinValue)
+  val range = Range(Long.MinValue, Long.MaxValue, 1, None)
+  val rangeAttrs = AttributeMap(range.output.map(attr =>
+(attr, ColumnStat(
+  distinctCount = Some(numElements),
+  nullCount = Some(0),
+  maxLen = Some(LongType.defaultSize),
+  avgLen = Some(LongType.defaultSize)
+  val rangeStats = Statistics(
+sizeInBytes = numElements * 8,
+rowCount = Some(numElements),
+attributeStats = rangeAttrs)
+  checkStats(range, rangeStats, rangeStats)
+}
+
   test("windows") {
 val windows = plan.window(Seq(min(attribute).as("sum_attr")), 
Seq(attribute), Nil)
 val windowsStats = Statistics(sizeInBytes = plan.size.get * (4 + 4 + 8) / 
(4 + 8))





(spark) branch master updated: [SPARK-46274][SQL] Fix Range operator computeStats() to check long validity before converting

2023-12-06 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9fd575ae46f [SPARK-46274][SQL] Fix Range operator computeStats() to 
check long validity before converting
9fd575ae46f is described below

commit 9fd575ae46f8a4dbd7da18887a44c693d8788332
Author: Nick Young 
AuthorDate: Wed Dec 6 15:20:19 2023 -0800

[SPARK-46274][SQL] Fix Range operator computeStats() to check long validity 
before converting

### What changes were proposed in this pull request?

Range operator's `computeStats()` function unsafely casts from `BigInt` to 
`Long` and causes issues downstream with statistics estimation. Adds bounds 
checking to avoid crashing.
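
A hedged PySpark sketch of the kind of range that trips this: the element 
count below is 2^64 - 1, which does not fit in a signed 64-bit Long (the 
session and config usage are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.sql.cbo.enabled", "true").getOrCreate()

# numElements = end - start = 2**64 - 1, larger than Long.MaxValue.
df = spark.range(start=-(2**63), end=2**63 - 1, step=1)

# Before the fix, computing statistics for this plan could fail while narrowing
# the BigInt element count to a Long; now min/max are simply left undefined
# when the count is not a valid Long.
df.explain(mode="cost")
```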

### Why are the changes needed?

Downstream statistics estimation will crash and fail loudly; to avoid this 
and help maintain clean code we should fix this.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

UT

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44191 from n-young-db/range-compute-stats.

Authored-by: Nick Young 
Signed-off-by: Wenchen Fan 
---
 .../catalyst/plans/logical/basicLogicalOperators.scala   | 12 +++-
 .../statsEstimation/BasicStatsEstimationSuite.scala  | 16 
 2 files changed, 23 insertions(+), 5 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
index c66ead30ab3..497f485b67f 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
@@ -1083,10 +1083,12 @@ case class Range(
 if (numElements == 0) {
   Statistics(sizeInBytes = 0, rowCount = Some(0))
 } else {
-  val (minVal, maxVal) = if (step > 0) {
-(start, start + (numElements - 1) * step)
+  val (minVal, maxVal) = if (!numElements.isValidLong) {
+(None, None)
+  } else if (step > 0) {
+(Some(start), Some(start + (numElements.toLong - 1) * step))
   } else {
-(start + (numElements - 1) * step, start)
+(Some(start + (numElements.toLong - 1) * step), Some(start))
   }
 
   val histogram = if (conf.histogramEnabled) {
@@ -1097,8 +1099,8 @@ case class Range(
 
   val colStat = ColumnStat(
 distinctCount = Some(numElements),
-max = Some(maxVal),
-min = Some(minVal),
+max = maxVal,
+min = minVal,
 nullCount = Some(0),
 avgLen = Some(LongType.defaultSize),
 maxLen = Some(LongType.defaultSize),
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
index e421d5f3929..63410447948 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
@@ -177,6 +177,22 @@ class BasicStatsEstimationSuite extends PlanTest with 
StatsEstimationTestBase {
 expectedStatsCboOff = rangeStats, extraConfig)
   }
 
+test("range with invalid long value") {
+  val numElements = BigInt(Long.MaxValue) - BigInt(Long.MinValue)
+  val range = Range(Long.MinValue, Long.MaxValue, 1, None)
+  val rangeAttrs = AttributeMap(range.output.map(attr =>
+(attr, ColumnStat(
+  distinctCount = Some(numElements),
+  nullCount = Some(0),
+  maxLen = Some(LongType.defaultSize),
+  avgLen = Some(LongType.defaultSize)))))
+  val rangeStats = Statistics(
+sizeInBytes = numElements * 8,
+rowCount = Some(numElements),
+attributeStats = rangeAttrs)
+  checkStats(range, rangeStats, rangeStats)
+}
+
   test("windows") {
 val windows = plan.window(Seq(min(attribute).as("sum_attr")), 
Seq(attribute), Nil)
 val windowsStats = Statistics(sizeInBytes = plan.size.get * (4 + 4 + 8) / 
(4 + 8))





(spark) branch master updated: [SPARK-46230][PYTHON] Migrate `RetriesExceeded` into PySpark error

2023-12-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 31a48381f51 [SPARK-46230][PYTHON] Migrate `RetriesExceeded` into 
PySpark error
31a48381f51 is described below

commit 31a48381f5139a51045a10df344df3ce7ad1adb7
Author: Haejoon Lee 
AuthorDate: Wed Dec 6 11:00:43 2023 -0800

[SPARK-46230][PYTHON] Migrate `RetriesExceeded` into PySpark error

### What changes were proposed in this pull request?

This PR proposes to migrate `RetriesExceeded` into PySpark error.

### Why are the changes needed?

All errors defined in PySpark should inherit `PySparkException` to keep error messages generated from PySpark consistent.
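
As a rough sketch of the effect (hypothetical client code, not part of this patch), the exception can now be handled like any other PySpark error and carries an error class:
```
from pyspark.errors import PySparkException, RetriesExceeded

def flaky_rpc():
    # Placeholder standing in for a Spark Connect call whose retries ran out.
    raise RetriesExceeded(error_class="RETRIES_EXCEEDED", message_parameters={})

try:
    flaky_rpc()
except RetriesExceeded as e:
    assert isinstance(e, PySparkException)  # now part of the common hierarchy
    print(e.getErrorClass())                # RETRIES_EXCEEDED
```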

### Does this PR introduce _any_ user-facing change?

No, it's internal refactoring for better error handling.

### How was this patch tested?

The existing CI should pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44147 from itholic/retires_exception.

Authored-by: Haejoon Lee 
Signed-off-by: Dongjoon Hyun 
---
 python/docs/source/reference/pyspark.errors.rst|  1 +
 python/pyspark/errors/__init__.py  |  2 ++
 python/pyspark/errors/error_classes.py |  5 +
 python/pyspark/errors/exceptions/base.py   |  7 +++
 python/pyspark/sql/connect/client/retries.py   | 11 ++-
 python/pyspark/sql/tests/connect/client/test_client.py |  2 +-
 python/pyspark/sql/tests/connect/test_connect_basic.py |  3 ++-
 7 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/python/docs/source/reference/pyspark.errors.rst 
b/python/docs/source/reference/pyspark.errors.rst
index a4997506b41..270a8a8c716 100644
--- a/python/docs/source/reference/pyspark.errors.rst
+++ b/python/docs/source/reference/pyspark.errors.rst
@@ -48,6 +48,7 @@ Classes
 PySparkIndexError
 PythonException
 QueryExecutionException
+RetriesExceeded
 SessionNotSameException
 SparkRuntimeException
 SparkUpgradeException
diff --git a/python/pyspark/errors/__init__.py 
b/python/pyspark/errors/__init__.py
index 07033d21643..a4f64e85f87 100644
--- a/python/pyspark/errors/__init__.py
+++ b/python/pyspark/errors/__init__.py
@@ -46,6 +46,7 @@ from pyspark.errors.exceptions.base import (  # noqa: F401
 PySparkAssertionError,
 PySparkNotImplementedError,
 PySparkPicklingError,
+RetriesExceeded,
 PySparkKeyError,
 )
 
@@ -78,5 +79,6 @@ __all__ = [
 "PySparkAssertionError",
 "PySparkNotImplementedError",
 "PySparkPicklingError",
+"RetriesExceeded",
 "PySparkKeyError",
 ]
diff --git a/python/pyspark/errors/error_classes.py 
b/python/pyspark/errors/error_classes.py
index c93ffa94149..965fd04a913 100644
--- a/python/pyspark/errors/error_classes.py
+++ b/python/pyspark/errors/error_classes.py
@@ -813,6 +813,11 @@ ERROR_CLASSES_JSON = """
   "Columns do not match in their data type: ."
 ]
   },
+  "RETRIES_EXCEEDED" : {
+"message" : [
+  "The maximum number of retries has been exceeded."
+]
+  },
   "SCHEMA_MISMATCH_FOR_PANDAS_UDF" : {
 "message" : [
   "Result vector from pandas_udf was not the required length: expected 
, got ."
diff --git a/python/pyspark/errors/exceptions/base.py 
b/python/pyspark/errors/exceptions/base.py
index b7d8ed88ec0..b60800da3ff 100644
--- a/python/pyspark/errors/exceptions/base.py
+++ b/python/pyspark/errors/exceptions/base.py
@@ -260,6 +260,13 @@ class PySparkPicklingError(PySparkException, 
PicklingError):
 """
 
 
+class RetriesExceeded(PySparkException):
+"""
+Represents an exception which is considered retriable, but retry limits
+were exceeded
+"""
+
+
 class PySparkKeyError(PySparkException, KeyError):
 """
 Wrapper class for KeyError to support error classes.
diff --git a/python/pyspark/sql/connect/client/retries.py 
b/python/pyspark/sql/connect/client/retries.py
index 88fc3fe1ffd..44e5e1834a2 100644
--- a/python/pyspark/sql/connect/client/retries.py
+++ b/python/pyspark/sql/connect/client/retries.py
@@ -22,7 +22,7 @@ import typing
 from typing import Optional, Callable, Generator, List, Type
 from types import TracebackType
 from pyspark.sql.connect.client.logging import logger
-from pyspark.errors import PySparkRuntimeError
+from pyspark.errors import PySparkRuntimeError, RetriesExceeded
 
 """
 This module contains retry system. The system is designed to be
@@ -233,7 +233,7 @@ class Retrying:
 
 # Exceeded retries
 logger.debug(f"Given up on retrying. error: {repr(exception)}")
-raise RetriesExceeded from exception
+raise RetriesExceeded(error_class="RETRIES_EXCEEDED", 
message_parameters={}) from exception
 
 def __iter__(self) -> 

(spark) branch master updated: [SPARK-46270][SQL][CORE][SS] Use java16 instanceof expressions to replace the java8 instanceof statement

2023-12-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 231d89f89ed [SPARK-46270][SQL][CORE][SS] Use java16 instanceof 
expressions to replace the java8 instanceof statement
231d89f89ed is described below

commit 231d89f89ede2cac6cad596f2a3b36673ad0b2f3
Author: Jiaan Geng 
AuthorDate: Wed Dec 6 10:59:39 2023 -0800

[SPARK-46270][SQL][CORE][SS] Use java16 instanceof expressions to replace 
the java8 instanceof statement

### What changes were proposed in this pull request?
This PR uses the Java 16 `instanceof` pattern matching (JEP 394) to replace the Java 8 style `instanceof` statements.
For example:
```
if (obj instanceof String) {
String s = (String) obj;// grr...
...
}
```
We can change it to
```
if (obj instanceof String s) {
// Let pattern matching do the work!
...
}
```

### Why are the changes needed?
Using [`[JEP 394: Pattern Matching for instanceof]` 
](https://openjdk.org/jeps/394)can bring the following benefits:

1. **More concise syntax**: Pattern matching allows the desired "shape" of 
an object to be expressed concisely (the pattern), and for various statements 
and expressions to test that "shape" against their input (the matching).

2. **Safer**: The motto is: "A pattern variable is in scope where it has 
definitely matched". This allows for the safe reuse of pattern variables and is 
both intuitive and familiar, since Java developers are already used to flow 
sensitive analyses.

3. **Avoid explicit casts**: The use of pattern matching in instanceof 
should significantly reduce the overall number of explicit casts in Java 
programs. Type test patterns are particularly useful when writing equality 
methods.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
GA

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #44187 from beliefer/SPARK-46270.

Authored-by: Jiaan Geng 
Signed-off-by: Dongjoon Hyun 
---
 .../apache/spark/util/kvstore/ArrayWrappers.java   | 28 
 .../spark/util/kvstore/KVStoreSerializer.java  |  4 +-
 .../apache/spark/util/kvstore/LevelDBTypeInfo.java | 12 ++--
 .../apache/spark/util/kvstore/RocksDBTypeInfo.java | 12 ++--
 .../spark/network/client/StreamInterceptor.java|  4 +-
 .../network/client/TransportResponseHandler.java   | 21 ++
 .../protocol/EncryptedMessageWithHeader.java   | 14 ++--
 .../spark/network/protocol/MessageWithHeader.java  |  8 +--
 .../spark/network/protocol/SslMessageEncoder.java  |  7 +-
 .../spark/network/sasl/SaslClientBootstrap.java|  4 +-
 .../network/server/TransportChannelHandler.java|  8 +--
 .../network/server/TransportRequestHandler.java| 28 
 .../network/ssl/ReloadingX509TrustManager.java |  4 +-
 .../org/apache/spark/network/ssl/SSLFactory.java   |  4 +-
 .../org/apache/spark/network/util/NettyLogger.java | 13 ++--
 .../apache/spark/network/TestManagedBuffer.java|  4 +-
 .../spark/network/crypto/AuthIntegrationSuite.java |  4 +-
 .../apache/spark/network/shuffle/ErrorHandler.java |  4 +-
 .../network/shuffle/ExternalBlockHandler.java  | 27 +++-
 .../shuffle/RetryingBlockTransferorSuite.java  |  8 +--
 .../network/yarn/YarnShuffleServiceMetrics.java| 24 +++
 .../apache/spark/util/sketch/BloomFilterImpl.java  | 16 ++---
 .../spark/util/sketch/CountMinSketchImpl.java  | 16 ++---
 .../java/org/apache/spark/util/sketch/Utils.java   | 16 ++---
 .../org/apache/spark/unsafe/types/UTF8String.java  |  6 +-
 .../org/apache/spark/io/ReadAheadInputStream.java  |  4 +-
 .../unsafe/sort/UnsafeExternalSorter.java  |  4 +-
 .../unsafe/sort/UnsafeInMemorySorter.java  |  5 +-
 .../org/apache/spark/launcher/LauncherServer.java  |  4 +-
 .../expressions/SpecializedGettersReader.java  |  8 +--
 .../sql/catalyst/expressions/UnsafeDataUtils.java  |  6 +-
 .../spark/sql/catalyst/expressions/UnsafeRow.java  | 12 ++--
 .../spark/sql/connector/read/streaming/Offset.java |  4 +-
 .../sql/connector/util/V2ExpressionSQLBuilder.java | 17 ++---
 .../spark/sql/vectorized/ArrowColumnVector.java| 76 +++---
 .../spark/sql/vectorized/ColumnarBatchRow.java |  8 +--
 .../apache/spark/sql/vectorized/ColumnarRow.java   |  4 +-
 .../datasources/orc/OrcAtomicColumnVector.java | 20 +++---
 .../execution/datasources/orc/OrcFooterReader.java | 14 ++--
 .../parquet/ParquetVectorUpdaterFactory.java   |  9 ++-
 .../parquet/VectorizedColumnReader.java|  7 +-
 .../execution/vectorized/ConstantColumnVector.java |  4 +-
 .../execution/vectorized/MutableColumnarRow.java   |  4 +-
 .../JavaAdvancedDataSourceV2WithV2Filter.java  |  4 +-
 

(spark) branch master updated: [SPARK-46186][CONNECT][TESTS][FOLLOWUP] Remove flakiness of `ReattachableExecuteSuite`

2023-12-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 982c7268f4c [SPARK-46186][CONNECT][TESTS][FOLLOWUP] Remove flakiness 
of `ReattachableExecuteSuite`
982c7268f4c is described below

commit 982c7268f4c7ea1fa03ea679146f9e83f31bece7
Author: Juliusz Sompolski 
AuthorDate: Wed Dec 6 10:57:45 2023 -0800

[SPARK-46186][CONNECT][TESTS][FOLLOWUP] Remove flakiness of 
`ReattachableExecuteSuite`

### What changes were proposed in this pull request?

The test added in https://github.com/apache/spark/pull/44095 could be flaky because `MEDIUM_RESULTS_QUERY` could finish very quickly, before the interrupt was sent. Replace it with a query that sleeps for 30 seconds, so that the interrupt is sure to run before the query finishes.

### Why are the changes needed?

Remove test flakiness.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Rerun ReattachableExecuteSuite 100+ times to check it isn't flaky.

### Was this patch authored or co-authored using generative AI tooling?

Github Copilot was assisting in some boilerplate auto-completion.

Generated-by: Github Copilot

Closes #44189 from juliuszsompolski/SPARK-46186-followup.

Authored-by: Juliusz Sompolski 
Signed-off-by: Dongjoon Hyun 
---
 .../sql/connect/execution/ReattachableExecuteSuite.scala | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git 
a/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/execution/ReattachableExecuteSuite.scala
 
b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/execution/ReattachableExecuteSuite.scala
index 02b75f04495..f80229c6198 100644
--- 
a/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/execution/ReattachableExecuteSuite.scala
+++ 
b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/execution/ReattachableExecuteSuite.scala
@@ -298,6 +298,15 @@ class ReattachableExecuteSuite extends 
SparkConnectServerTest {
   }
 
   test("SPARK-46186 interrupt directly after query start") {
+// register a sleep udf in the session
+val serverSession =
+  SparkConnectService.getOrCreateIsolatedSession(defaultUserId, 
defaultSessionId).session
+serverSession.udf.register(
+  "sleep",
+  ((ms: Int) => {
+Thread.sleep(ms);
+ms
+  }))
 // This test depends on fast timing.
 // If something is wrong, it can fail only from time to time.
 withRawBlockingStub { stub =>
@@ -309,12 +318,13 @@ class ReattachableExecuteSuite extends 
SparkConnectServerTest {
 .setOperationId(operationId)
 .build()
   val iter = stub.executePlan(
-buildExecutePlanRequest(buildPlan(MEDIUM_RESULTS_QUERY), operationId = 
operationId))
+buildExecutePlanRequest(buildPlan("select sleep(3) as s"), 
operationId = operationId))
   // wait for execute holder to exist, but the execute thread may not have 
started yet.
   Eventually.eventually(timeout(eventuallyTimeout)) {
 assert(SparkConnectService.executionManager.listExecuteHolders.length 
== 1)
   }
   stub.interrupt(interruptRequest)
+  // make sure the client gets the OPERATION_CANCELED error
   val e = intercept[StatusRuntimeException] {
 while (iter.hasNext) iter.next()
   }





(spark) branch master updated: [SPARK-45580][SQL] Handle case where a nested subquery becomes an existence join

2023-12-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c96fef2ea55 [SPARK-45580][SQL] Handle case where a nested subquery 
becomes an existence join
c96fef2ea55 is described below

commit c96fef2ea55ee85ac66905584e9dee31471de9f1
Author: Bruce Robbins 
AuthorDate: Wed Dec 6 10:55:15 2023 -0800

[SPARK-45580][SQL] Handle case where a nested subquery becomes an existence 
join

### What changes were proposed in this pull request?

In `RewritePredicateSubquery`, prune existence flags from the final join 
when `rewriteExistentialExpr` returns an existence join. This change prunes the 
flags (attributes with the name "exists") by adding a `Project` node.

For example:
```
Join LeftSemi, ((a#13 = c1#15) OR exists#19)
:- Join ExistenceJoin(exists#19), (a#13 = col1#17)
:  :- LocalRelation [a#13]
:  +- LocalRelation [col1#17]
+- LocalRelation [c1#15]
```
becomes
```
Project [a#13]
+- Join LeftSemi, ((a#13 = c1#15) OR exists#19)
   :- Join ExistenceJoin(exists#19), (a#13 = col1#17)
   :  :- LocalRelation [a#13]
   :  +- LocalRelation [col1#17]
   +- LocalRelation [c1#15]
```
This change always adds the `Project` node, whether 
`rewriteExistentialExpr` returns an existence join or not. In the case when 
`rewriteExistentialExpr` does not return an existence join, 
`RemoveNoopOperators` will remove the unneeded `Project` node.

### Why are the changes needed?

This query returns an extraneous boolean column when run in spark-sql:
```
create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);

select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);

1   false
2   false
3   true
```
(Note: the above query will not have the extraneous boolean column when run 
from the Dataset API. That is because the Dataset API truncates the rows based 
on the schema of the analyzed plan. The bug occurs during optimization).

This query fails when run in either spark-sql or using the Dataset API:
```
select (
  select *
  from t1
  where exists (
select c1
from t2
where a = c1
or a in (select col1 from t3)
  )
  limit 1
)
from range(1);

java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
something went wrong in analysis
```

### Does this PR introduce _any_ user-facing change?

No, except for the removal of the extraneous boolean flag and the fix to 
the error condition.

### How was this patch tested?

New unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44193 from bersprockets/schema_change.

Authored-by: Bruce Robbins 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/catalyst/optimizer/subquery.scala|  9 +++--
 .../scala/org/apache/spark/sql/SubquerySuite.scala | 46 ++
 2 files changed, 52 insertions(+), 3 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
index 1f1a16e9093..6ca2cb79aaf 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
@@ -132,19 +132,22 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] 
with PredicateHelper {
   withSubquery.foldLeft(newFilter) {
 case (p, Exists(sub, _, _, conditions, subHint)) =>
   val (joinCond, outerPlan) = rewriteExistentialExpr(conditions, p)
-  buildJoin(outerPlan, rewriteDomainJoinsIfPresent(outerPlan, sub, 
joinCond),
+  val join = buildJoin(outerPlan, 
rewriteDomainJoinsIfPresent(outerPlan, sub, joinCond),
 LeftSemi, joinCond, subHint)
+  Project(p.output, join)
 case (p, Not(Exists(sub, _, _, conditions, subHint))) =>
   val (joinCond, outerPlan) = rewriteExistentialExpr(conditions, p)
-  buildJoin(outerPlan, rewriteDomainJoinsIfPresent(outerPlan, sub, 
joinCond),
+  val join = buildJoin(outerPlan, 
rewriteDomainJoinsIfPresent(outerPlan, sub, joinCond),
 LeftAnti, joinCond, subHint)
+  Project(p.output, join)
 case (p, InSubquery(values, ListQuery(sub, _, _, _, conditions, 
subHint))) =>
   // Deduplicate conflicting attributes if any.
   val newSub = 

(spark) branch master updated: [SPARK-46232][PYTHON][FOLLOWUP] Migrate `ValueError` into `PySparkValueError`

2023-12-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 467df65ceba [SPARK-46232][PYTHON][FOLLOWUP] Migrate `ValueError` into 
`PySparkValueError`
467df65ceba is described below

commit 467df65ceba5f6a8957ca7d72f5537434bf32e81
Author: Haejoon Lee 
AuthorDate: Wed Dec 6 10:51:22 2023 -0800

[SPARK-46232][PYTHON][FOLLOWUP] Migrate `ValueError` into 
`PySparkValueError`

### What changes were proposed in this pull request?

This PR follows up on https://github.com/apache/spark/pull/44149 to address a missing case.

### Why are the changes needed?

To improve error handling.

### Does this PR introduce _any_ user-facing change?

No API changes.

### How was this patch tested?

The existing CI should pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44202 from itholic/SPARK-46232-followup.

Authored-by: Haejoon Lee 
Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/sql/pandas/serializers.py | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/pandas/serializers.py 
b/python/pyspark/sql/pandas/serializers.py
index 8b2b583ddaa..834f22c86c0 100644
--- a/python/pyspark/sql/pandas/serializers.py
+++ b/python/pyspark/sql/pandas/serializers.py
@@ -738,8 +738,9 @@ class 
CogroupPandasUDFSerializer(ArrowStreamPandasUDFSerializer):
 )
 
 elif dataframes_in_group != 0:
-raise ValueError(
-"Invalid number of pandas.DataFrames in group 
{0}".format(dataframes_in_group)
+raise PySparkValueError(
+error_class="INVALID_NUMBER_OF_DATAFRAMES_IN_GROUP",
+message_parameters={"dataframes_in_group": 
str(dataframes_in_group)},
 )
 
 





(spark) branch master updated: [SPARK-46283][INFRA] Remove `streaming-kinesis-asl` module from `MODULES_TO_TEST` for branch-3.x daily tests

2023-12-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 809cec01205 [SPARK-46283][INFRA] Remove `streaming-kinesis-asl` module 
from `MODULES_TO_TEST` for branch-3.x daily tests
809cec01205 is described below

commit 809cec012055d6f15987f338122d2fdb5bdd5c92
Author: yangjie01 
AuthorDate: Wed Dec 6 10:50:27 2023 -0800

[SPARK-46283][INFRA] Remove `streaming-kinesis-asl` module from 
`MODULES_TO_TEST` for branch-3.x daily tests

### What changes were proposed in this pull request?

After the merge of https://github.com/apache/spark/pull/43736, the master 
branch began testing the `streaming-kinesis-asl` module.

At the same time, because the daily tests reuse `build_and_test.yml`, the branch-3.x daily tests also began testing `streaming-kinesis-asl`.

However, in branch-3.x, the env `ENABLE_KINESIS_TESTS` is hard-coded as 1 
in `dev/sparktestsupport/modules.py`:


https://github.com/apache/spark/blob/1321b4e64deaa1e58bf297c25b72319083056568/dev/sparktestsupport/modules.py#L332-L346

which leads to the failure of the daily test of branch-3.x:

- branch-3.3: https://github.com/apache/spark/actions/runs/7111246311
- branch-3.4: https://github.com/apache/spark/actions/runs/7098435892
- branch-3.5: https://github.com/apache/spark/actions/runs/7099811235

```
[info] 
org.apache.spark.streaming.kinesis.WithoutAggregationKinesisStreamSuite *** 
ABORTED *** (1 second, 14 milliseconds)
[info]   java.lang.Exception: Kinesis tests enabled using environment 
variable ENABLE_KINESIS_TESTS
[info] but could not find AWS credentials. Please follow instructions in 
AWS documentation
[info] to set the credentials in your system such that the 
DefaultAWSCredentialsProviderChain
[info] can find the credentials.
[info]   at 
org.apache.spark.streaming.kinesis.KinesisTestUtils$.getAWSCredentials(KinesisTestUtils.scala:258)
[info]   at 
org.apache.spark.streaming.kinesis.KinesisTestUtils.kinesisClient$lzycompute(KinesisTestUtils.scala:58)
[info]   at 
org.apache.spark.streaming.kinesis.KinesisTestUtils.kinesisClient(KinesisTestUtils.scala:57)
[info]   at 
org.apache.spark.streaming.kinesis.KinesisTestUtils.describeStream(KinesisTestUtils.scala:168)
[info]   at 
org.apache.spark.streaming.kinesis.KinesisTestUtils.findNonExistentStreamName(KinesisTestUtils.scala:181)
[info]   at 
org.apache.spark.streaming.kinesis.KinesisTestUtils.createStream(KinesisTestUtils.scala:84)
[info]   at 
org.apache.spark.streaming.kinesis.KinesisStreamTests.$anonfun$beforeAll$1(KinesisStreamSuite.scala:61)
[info]   at 
org.apache.spark.streaming.kinesis.KinesisFunSuite.runIfTestsEnabled(KinesisFunSuite.scala:41)
[info]   at 
org.apache.spark.streaming.kinesis.KinesisFunSuite.runIfTestsEnabled$(KinesisFunSuite.scala:39)
[info]   at 
org.apache.spark.streaming.kinesis.KinesisStreamTests.runIfTestsEnabled(KinesisStreamSuite.scala:42)
[info]   at 
org.apache.spark.streaming.kinesis.KinesisStreamTests.beforeAll(KinesisStreamSuite.scala:59)
[info]   at 
org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
[info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info]   at 
org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info]   at 
org.apache.spark.streaming.kinesis.KinesisStreamTests.org$scalatest$BeforeAndAfter$$super$run(KinesisStreamSuite.scala:42)
[info]   at org.scalatest.BeforeAndAfter.run(BeforeAndAfter.scala:273)
[info]   at org.scalatest.BeforeAndAfter.run$(BeforeAndAfter.scala:271)
[info]   at 
org.apache.spark.streaming.kinesis.KinesisStreamTests.run(KinesisStreamSuite.scala:42)
[info]   at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
[info]   at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517)
[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info]   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info]   at java.lang.Thread.run(Thread.java:750)
[info] Test run 
org.apache.spark.streaming.kinesis.JavaKinesisInputDStreamBuilderSuite started
[info] Test 
org.apache.spark.streaming.kinesis.JavaKinesisInputDStreamBuilderSuite.testJavaKinesisDStreamBuilderOldApi
 started
[info] Test 
org.apache.spark.streaming.kinesis.JavaKinesisInputDStreamBuilderSuite.testJavaKinesisDStreamBuilder
 started
[info] Test run 
org.apache.spark.streaming.kinesis.JavaKinesisInputDStreamBuilderSuite 
finished: 0 failed, 0 

(spark) branch branch-3.3 updated: [SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled`

2023-12-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 37d10ec3644 [SPARK-46286][DOCS] Document 
`spark.io.compression.zstd.bufferPool.enabled`
37d10ec3644 is described below

commit 37d10ec3644d41396cd7378fdc3fe405b680203c
Author: Kent Yao 
AuthorDate: Wed Dec 6 10:46:31 2023 -0800

[SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled`

This PR adds spark.io.compression.zstd.bufferPool.enabled to documentation

- Missing docs
- https://github.com/apache/spark/pull/31502#issuecomment-774792276 
potential regression

no

doc build

no

Closes #44207 from yaooqinn/SPARK-46286.

Authored-by: Kent Yao 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 6b6980de451e655ef4b9f63d502b73c09a513d4c)
Signed-off-by: Dongjoon Hyun 
---
 docs/configuration.md | 8 
 1 file changed, 8 insertions(+)

diff --git a/docs/configuration.md b/docs/configuration.md
index b96defb2adb..2a205522989 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1545,6 +1545,14 @@ Apart from these, the following properties are also 
available, and may be useful
   
   2.3.0
 
+
+  spark.io.compression.zstd.bufferPool.enabled
+  true
+  
+If true, enable buffer pool of ZSTD JNI library.
+  
+  3.2.0
+
 
   spark.kryo.classesToRegister
   (none)





(spark) branch branch-3.4 updated: [SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled`

2023-12-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 93fef098a0d [SPARK-46286][DOCS] Document 
`spark.io.compression.zstd.bufferPool.enabled`
93fef098a0d is described below

commit 93fef098a0d5d6c95205a46ebf9c959e325c9d7e
Author: Kent Yao 
AuthorDate: Wed Dec 6 10:46:31 2023 -0800

[SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled`

This PR adds spark.io.compression.zstd.bufferPool.enabled to documentation

- Missing docs
- https://github.com/apache/spark/pull/31502#issuecomment-774792276 
potential regression

no

doc build

no

Closes #44207 from yaooqinn/SPARK-46286.

Authored-by: Kent Yao 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 6b6980de451e655ef4b9f63d502b73c09a513d4c)
Signed-off-by: Dongjoon Hyun 
---
 docs/configuration.md | 8 
 1 file changed, 8 insertions(+)

diff --git a/docs/configuration.md b/docs/configuration.md
index 198a6dd4b2b..6bd49f398d9 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1727,6 +1727,14 @@ Apart from these, the following properties are also 
available, and may be useful
   
   2.3.0
 
+
+  spark.io.compression.zstd.bufferPool.enabled
+  true
+  
+If true, enable buffer pool of ZSTD JNI library.
+  
+  3.2.0
+
 
   spark.kryo.classesToRegister
   (none)





(spark) branch branch-3.5 updated: [SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled`

2023-12-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new b5cbe1fcdb4 [SPARK-46286][DOCS] Document 
`spark.io.compression.zstd.bufferPool.enabled`
b5cbe1fcdb4 is described below

commit b5cbe1fcdb464fc064ffb5fbef3edfa408d6638f
Author: Kent Yao 
AuthorDate: Wed Dec 6 10:46:31 2023 -0800

[SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled`

This PR adds spark.io.compression.zstd.bufferPool.enabled to documentation

- Missing docs
- https://github.com/apache/spark/pull/31502#issuecomment-774792276 
potential regression

no

doc build

no

Closes #44207 from yaooqinn/SPARK-46286.

Authored-by: Kent Yao 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 6b6980de451e655ef4b9f63d502b73c09a513d4c)
Signed-off-by: Dongjoon Hyun 
---
 docs/configuration.md | 8 
 1 file changed, 8 insertions(+)

diff --git a/docs/configuration.md b/docs/configuration.md
index 248f9333c9a..f79406c5b6d 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1752,6 +1752,14 @@ Apart from these, the following properties are also 
available, and may be useful
   
   2.3.0
 
+
+  spark.io.compression.zstd.bufferPool.enabled
+  true
+  
+If true, enable buffer pool of ZSTD JNI library.
+  
+  3.2.0
+
 
   spark.kryo.classesToRegister
   (none)





(spark) branch master updated: [SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled`

2023-12-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6b6980de451 [SPARK-46286][DOCS] Document 
`spark.io.compression.zstd.bufferPool.enabled`
6b6980de451 is described below

commit 6b6980de451e655ef4b9f63d502b73c09a513d4c
Author: Kent Yao 
AuthorDate: Wed Dec 6 10:46:31 2023 -0800

[SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled`

### What changes were proposed in this pull request?

This PR adds spark.io.compression.zstd.bufferPool.enabled to documentation
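
For context only (an illustrative snippet, not part of this docs-only change), the flag is a regular Spark conf and can be set like any other when building a session:
```
# Illustrative sketch; the app name is made up and the documented default is true.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("zstd-buffer-pool-demo")
    .config("spark.io.compression.zstd.bufferPool.enabled", "false")
    .getOrCreate()
)
print(spark.sparkContext.getConf().get("spark.io.compression.zstd.bufferPool.enabled"))
```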

### Why are the changes needed?

- Missing docs
- https://github.com/apache/spark/pull/31502#issuecomment-774792276 
potential regression

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

doc build

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #44207 from yaooqinn/SPARK-46286.

Authored-by: Kent Yao 
Signed-off-by: Dongjoon Hyun 
---
 docs/configuration.md | 8 
 1 file changed, 8 insertions(+)

diff --git a/docs/configuration.md b/docs/configuration.md
index 2ad07cf59f7..f261e3b2deb 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1760,6 +1760,14 @@ Apart from these, the following properties are also 
available, and may be useful
   
   2.3.0
 
+
+  spark.io.compression.zstd.bufferPool.enabled
+  true
+  
+If true, enable buffer pool of ZSTD JNI library.
+  
+  3.2.0
+
 
   spark.io.compression.zstd.workers
   0





(spark) branch master updated: [SPARK-46287][PYTHON][CONNECT] `DataFrame.isEmpty` should work with all datatypes

2023-12-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f4e41e0e318 [SPARK-46287][PYTHON][CONNECT] `DataFrame.isEmpty` should 
work with all datatypes
f4e41e0e318 is described below

commit f4e41e0e318ea1269de5991f4635637e6e5233f3
Author: Ruifeng Zheng 
AuthorDate: Wed Dec 6 10:45:12 2023 -0800

[SPARK-46287][PYTHON][CONNECT] `DataFrame.isEmpty` should work with all 
datatypes

### What changes were proposed in this pull request?
`DataFrame.isEmpty` should work with all datatypes

The schema may not be compatible with Arrow, so `collect`/`take` should not be used to check `isEmpty`.
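
A minimal sketch of the idea (illustrative; the one-line fix is in the diff below): checking emptiness on a zero-column projection means Arrow never has to convert the unsupported column types:
```
# Illustrative sketch: drop all columns before take(1), so types that Arrow
# cannot convert (e.g. year-month intervals) are never materialized.
def is_empty(df):
    return len(df.select().take(1)) == 0

# e.g. is_empty(spark.sql("SELECT INTERVAL '10-8' YEAR TO MONTH AS interval")) -> False
```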

### Why are the changes needed?
bugfix

### Does this PR introduce _any_ user-facing change?
before:
```
In [1]: spark.sql("SELECT INTERVAL '10-8' YEAR TO MONTH AS 
interval").isEmpty()
23/12/06 20:39:58 WARN CheckAllocator: More than one 
DefaultAllocationManager on classpath. Choosing first found
--- 
/ 1]
KeyError  Traceback (most recent call last)
Cell In[1], line 1
> 1 spark.sql("SELECT INTERVAL '10-8' YEAR TO MONTH AS 
interval").isEmpty()

File ~/Dev/spark/python/pyspark/sql/connect/dataframe.py:181, in 
DataFrame.isEmpty(self)
180 def isEmpty(self) -> bool:
--> 181 return len(self.take(1)) == 0

...

File 
~/.dev/miniconda3/envs/spark_dev_311/lib/python3.11/site-packages/pyarrow/public-api.pxi:208,
 in pyarrow.lib.pyarrow_wrap_array()

File 
~/.dev/miniconda3/envs/spark_dev_311/lib/python3.11/site-packages/pyarrow/array.pxi:3659,
 in pyarrow.lib.get_array_class_from_type()

KeyError: 21
```

after
```
In [1]: spark.sql("SELECT INTERVAL '10-8' YEAR TO MONTH AS 
interval").isEmpty()
23/12/06 20:40:26 WARN CheckAllocator: More than one 
DefaultAllocationManager on classpath. Choosing first found
Out[1]: False
```

### How was this patch tested?
added ut

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44209 from zhengruifeng/py_connect_df_isempty.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/sql/connect/dataframe.py| 2 +-
 python/pyspark/sql/tests/connect/test_connect_basic.py | 5 +
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index 6a1d4571216..66059ad96eb 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -178,7 +178,7 @@ class DataFrame:
 write.__doc__ = PySparkDataFrame.write.__doc__
 
 def isEmpty(self) -> bool:
-return len(self.take(1)) == 0
+return len(self.select().take(1)) == 0
 
 isEmpty.__doc__ = PySparkDataFrame.isEmpty.__doc__
 
diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py 
b/python/pyspark/sql/tests/connect/test_connect_basic.py
index fb5eaece7f4..5e0cf535391 100755
--- a/python/pyspark/sql/tests/connect/test_connect_basic.py
+++ b/python/pyspark/sql/tests/connect/test_connect_basic.py
@@ -2004,6 +2004,11 @@ class SparkConnectBasicTests(SparkConnectSQLTestCase):
 self.assertFalse(self.connect.sql("SELECT 1 AS X").isEmpty())
 self.assertTrue(self.connect.sql("SELECT 1 AS X LIMIT 0").isEmpty())
 
+def test_is_empty_with_unsupported_types(self):
+df = self.spark.sql("SELECT INTERVAL '10-8' YEAR TO MONTH AS interval")
+self.assertEqual(df.count(), 1)
+self.assertFalse(df.isEmpty())
+
 def test_session(self):
 self.assertEqual(self.connect, self.connect.sql("SELECT 
1").sparkSession)
 





(spark) branch master updated: [SPARK-46288][PS][TESTS] Remove unused code in `pyspark.pandas.tests.frame.*`

2023-12-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f6861c3918b [SPARK-46288][PS][TESTS] Remove unused code in 
`pyspark.pandas.tests.frame.*`
f6861c3918b is described below

commit f6861c3918bdedf5d8d89dbecced3317cc9dc490
Author: Ruifeng Zheng 
AuthorDate: Wed Dec 6 10:44:04 2023 -0800

[SPARK-46288][PS][TESTS] Remove unused code in 
`pyspark.pandas.tests.frame.*`

### What changes were proposed in this pull request?
Remove unused code in `pyspark.pandas.tests.frame.*`

### Why are the changes needed?
code clean up

### Does this PR introduce _any_ user-facing change?
no, test-only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44212 from zhengruifeng/ps_frame_cleanup.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/pandas/tests/frame/test_conversion.py  |  6 --
 python/pyspark/pandas/tests/frame/test_reindexing.py  | 13 -
 python/pyspark/pandas/tests/frame/test_spark.py   |  6 --
 python/pyspark/pandas/tests/frame/test_take.py| 14 --
 python/pyspark/pandas/tests/frame/test_time_series.py | 13 -
 python/pyspark/pandas/tests/frame/test_truncate.py| 14 --
 6 files changed, 66 deletions(-)

diff --git a/python/pyspark/pandas/tests/frame/test_conversion.py 
b/python/pyspark/pandas/tests/frame/test_conversion.py
index 116a7d31c11..eefb461239e 100644
--- a/python/pyspark/pandas/tests/frame/test_conversion.py
+++ b/python/pyspark/pandas/tests/frame/test_conversion.py
@@ -34,12 +34,6 @@ class FrameConversionMixin:
 index=np.random.rand(9),
 )
 
-@property
-def df_pair(self):
-pdf = self.pdf
-psdf = ps.from_pandas(pdf)
-return pdf, psdf
-
 def test_astype(self):
 psdf = self.psdf
 
diff --git a/python/pyspark/pandas/tests/frame/test_reindexing.py 
b/python/pyspark/pandas/tests/frame/test_reindexing.py
index 606efd95188..b3639945391 100644
--- a/python/pyspark/pandas/tests/frame/test_reindexing.py
+++ b/python/pyspark/pandas/tests/frame/test_reindexing.py
@@ -30,19 +30,6 @@ from pyspark.testing.sqlutils import SQLTestUtils
 # This file contains test cases for 'Reindexing / Selection / Label 
manipulation'
 # 
https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/frame.html#reindexing-selection-label-manipulation
 class FrameReindexingMixin:
-@property
-def pdf(self):
-return pd.DataFrame(
-{"a": [1, 2, 3, 4, 5, 6, 7, 8, 9], "b": [4, 5, 6, 3, 2, 1, 0, 0, 
0]},
-index=np.random.rand(9),
-)
-
-@property
-def df_pair(self):
-pdf = self.pdf
-psdf = ps.from_pandas(pdf)
-return pdf, psdf
-
 def test_add_prefix(self):
 pdf = pd.DataFrame({"A": [1, 2, 3, 4], "B": [3, 4, 5, 6]}, 
index=np.random.rand(4))
 psdf = ps.from_pandas(pdf)
diff --git a/python/pyspark/pandas/tests/frame/test_spark.py 
b/python/pyspark/pandas/tests/frame/test_spark.py
index 4413279e32f..36466695c30 100644
--- a/python/pyspark/pandas/tests/frame/test_spark.py
+++ b/python/pyspark/pandas/tests/frame/test_spark.py
@@ -43,12 +43,6 @@ class FrameSparkMixin:
 index=np.random.rand(9),
 )
 
-@property
-def df_pair(self):
-pdf = self.pdf
-psdf = ps.from_pandas(pdf)
-return pdf, psdf
-
 def test_empty_dataframe(self):
 pdf = pd.DataFrame({"a": pd.Series([], dtype="i1"), "b": pd.Series([], 
dtype="str")})
 
diff --git a/python/pyspark/pandas/tests/frame/test_take.py 
b/python/pyspark/pandas/tests/frame/test_take.py
index 28d20e9bd99..3654436848b 100644
--- a/python/pyspark/pandas/tests/frame/test_take.py
+++ b/python/pyspark/pandas/tests/frame/test_take.py
@@ -16,7 +16,6 @@
 #
 import unittest
 
-import numpy as np
 import pandas as pd
 
 from pyspark import pandas as ps
@@ -25,19 +24,6 @@ from pyspark.testing.sqlutils import SQLTestUtils
 
 
 class FrameTakeMixin:
-@property
-def pdf(self):
-return pd.DataFrame(
-{"a": [1, 2, 3, 4, 5, 6, 7, 8, 9], "b": [4, 5, 6, 3, 2, 1, 0, 0, 
0]},
-index=np.random.rand(9),
-)
-
-@property
-def df_pair(self):
-pdf = self.pdf
-psdf = ps.from_pandas(pdf)
-return pdf, psdf
-
 def test_take(self):
 pdf = pd.DataFrame(
 {"A": range(0, 5), "B": range(10, 0, -2), "C": 
range(10, 5, -1)}
diff --git a/python/pyspark/pandas/tests/frame/test_time_series.py 
b/python/pyspark/pandas/tests/frame/test_time_series.py
index eed9086ada7..61dc095d6ba 100644
--- a/python/pyspark/pandas/tests/frame/test_time_series.py
+++ 

(spark) branch master updated: [SPARK-46173][SQL] Skipping trimAll call during date parsing

2023-12-06 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 348d88161a9 [SPARK-46173][SQL] Skipping trimAll call during date 
parsing
348d88161a9 is described below

commit 348d88161a9bc98b37c0e097c81c9ecabeb2d76a
Author: Aleksandar Tomic 
AuthorDate: Wed Dec 6 07:15:53 2023 -0800

[SPARK-46173][SQL] Skipping trimAll call during date parsing

### What changes were proposed in this pull request?

This PR is a response to a user complaint that casts from string to date are too slow if the input string has many whitespace/ISO control characters as its prefix or suffix.

The current behaviour is that the StringToDate function first calls trimAll to remove any whitespace/ISO control characters. trimAll allocates a new string, and we then work against that copy. This is not really needed, since StringToDate can simply skip those characters and do all the processing in a single loop without extra allocations. The proposed fix does exactly that.
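
As an illustration of the approach (a Python sketch only; the patch itself changes the Scala/Java code shown below), the parser can move two indices inward instead of allocating a trimmed copy:
```
# Illustrative sketch: find the payload bounds with two indices, no new string.
def is_ws_or_iso_control(ch):
    cp = ord(ch)
    return ch.isspace() or cp <= 0x1F or 0x7F <= cp <= 0x9F

def trimmed_bounds(s):
    i, j = 0, len(s)
    while i < j and is_ws_or_iso_control(s[i]):
        i += 1
    while j > i and is_ws_or_iso_control(s[j - 1]):
        j -= 1
    return i, j  # parse the date only from s[i:j]

# trimmed_bounds("   2023-12-06 \t") -> (3, 13)
```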

### Why are the changes needed?

These changes should drastically improve the edge case where the input string of a cast to date has many whitespace characters as a prefix/suffix.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Extending existing tests to cover both prefix and suffix cases with whitespace/control characters.

The change also includes a small unit benchmark for this particular case. 
Looking for feedback if we want to keep the benchmark, given that this is a 
rather esoteric edge case.

The benchmark measures the time needed to do 10k calls of the stringToDate function against input strings of the form <65k whitespace><date><65k whitespace> and <date><65k whitespace>. Here are the results:

| input example | prior to change | after the change |
| -- | -- | - |
| 65k whitespace prefix + 65k suffix | 3250ms | 484ms |
| 65k whitespace suffix | 1572ms | <1ms |

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44110 from dbatomic/SPARK-46173-trim-all-skip.

Authored-by: Aleksandar Tomic 
Signed-off-by: Wenchen Fan 
---
 .../org/apache/spark/unsafe/types/UTF8String.java  | 16 
 .../sql/catalyst/util/SparkDateTimeUtils.scala | 22 ++
 .../sql/catalyst/util/DateTimeUtilsSuite.scala | 20 +++-
 3 files changed, 45 insertions(+), 13 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java 
b/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
index e362a13eb5f..a2dd03ff18b 100644
--- a/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
+++ b/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
@@ -148,6 +148,14 @@ public final class UTF8String implements 
Comparable, Externalizable,
 return fromBytes(spaces);
   }
 
+  /**
+   * Determines if the specified character (Unicode code point) is white space 
or an ISO control
+   * character according to Java.
+   */
+  public static boolean isWhitespaceOrISOControl(int codePoint) {
+return Character.isWhitespace(codePoint) || 
Character.isISOControl(codePoint);
+  }
+
   private UTF8String(Object base, long offset, int numBytes) {
 this.base = base;
 this.offset = offset;
@@ -498,14 +506,6 @@ public final class UTF8String implements 
Comparable, Externalizable,
 return UTF8String.fromBytes(newBytes);
   }
 
-  /**
-   * Determines if the specified character (Unicode code point) is white space 
or an ISO control
-   * character according to Java.
-   */
-  private boolean isWhitespaceOrISOControl(int codePoint) {
-return Character.isWhitespace(codePoint) || 
Character.isISOControl(codePoint);
-  }
-
   /**
* Trims space characters (ASCII 32) from both ends of this string.
*
diff --git 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
index 5e9fb0dd25f..35118b449e2 100644
--- 
a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
+++ 
b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala
@@ -305,21 +305,35 @@ trait SparkDateTimeUtils {
   (segment == 0 && digits >= 4 && digits <= maxDigitsYear) ||
 (segment != 0 && digits > 0 && digits <= 2)
 }
-if (s == null || s.trimAll().numBytes() == 0) {
+if (s == null) {
   return None
 }
+
 val segments: Array[Int] = Array[Int](1, 1, 1)
 var sign = 1
 var i = 0
 var currentSegmentValue = 0
 var currentSegmentDigits = 0
-val bytes = 

(spark) branch master updated: [SPARK-45888][SS] Apply error class framework to State (Metadata) Data Source

2023-12-06 Thread kabhwan
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 59777222e72 [SPARK-45888][SS] Apply error class framework to State 
(Metadata) Data Source
59777222e72 is described below

commit 59777222e726c63cbd9077a2c76f762e06f6a5b3
Author: Jungtaek Lim 
AuthorDate: Wed Dec 6 22:38:40 2023 +0900

[SPARK-45888][SS] Apply error class framework to State (Metadata) Data 
Source

### What changes were proposed in this pull request?

This PR proposes to apply the error class framework to the new data source, the State (Metadata) Data Source.

### Why are the changes needed?

The error class framework is the standard way to represent all exceptions in Spark.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified UT.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44025 from HeartSaVioR/SPARK-45888.

Authored-by: Jungtaek Lim 
Signed-off-by: Jungtaek Lim 
---
 common/utils/src/main/resources/error/README.md|   1 +
 .../src/main/resources/error/error-classes.json|  75 ++
 ...itions-stds-invalid-option-value-error-class.md |  40 ++
 docs/sql-error-conditions.md   |  60 
 .../datasources/v2/state/StateDataSource.scala |  33 +++--
 .../v2/state/StateDataSourceErrors.scala   | 160 +
 .../datasources/v2/state/StateScanBuilder.scala|   3 +-
 .../datasources/v2/state/StateTable.scala  |   9 +-
 .../StreamStreamJoinStatePartitionReader.scala |   2 +-
 .../v2/state/metadata/StateMetadataSource.scala|   4 +-
 .../v2/state/StateDataSourceReadSuite.scala|  33 +++--
 .../state/OperatorStateMetadataSuite.scala |   6 +-
 12 files changed, 389 insertions(+), 37 deletions(-)

diff --git a/common/utils/src/main/resources/error/README.md 
b/common/utils/src/main/resources/error/README.md
index 556a634e992..b062c773907 100644
--- a/common/utils/src/main/resources/error/README.md
+++ b/common/utils/src/main/resources/error/README.md
@@ -636,6 +636,7 @@ The following SQLSTATEs are collated from:
 |42613|42   |Syntax Error or Access Rule Violation |613 
|Clauses are mutually exclusive. |DB2|N 
  |DB2  
   |
 |42614|42   |Syntax Error or Access Rule Violation |614 |A 
duplicate keyword or clause is invalid.   |DB2|N
   |DB2 
|
 |42615|42   |Syntax Error or Access Rule Violation |615 
|An invalid alternative was detected.|DB2|N 
  |DB2  
   |
+|42616|42   |Syntax Error or Access Rule Violation |616 
|Invalid options specified   |DB2|N 
  |DB2  
   |
 |42617|42   |Syntax Error or Access Rule Violation |617 
|The statement string is blank or empty. |DB2|N 
  |DB2  
   |
 |42618|42   |Syntax Error or Access Rule Violation |618 |A 
variable is not allowed.  |DB2|N
   |DB2 
|
 |42620|42   |Syntax Error or Access Rule Violation |620 
|Read-only SCROLL was specified with the UPDATE clause.  |DB2|N 
  |DB2  
   |
diff --git a/common/utils/src/main/resources/error/error-classes.json 
b/common/utils/src/main/resources/error/error-classes.json
index e54d346e1bc..7a672fa5e55 100644
--- a/common/utils/src/main/resources/error/error-classes.json
+++ b/common/utils/src/main/resources/error/error-classes.json
@@ -3066,6 +3066,81 @@
 ],
 "sqlState" : "42713"
   },
+  "STDS_COMMITTED_BATCH_UNAVAILABLE" : {
+"message" : [
+  "No committed batch found, checkpoint location: . 
Ensure that the query has run and committed any microbatch before stopping."
+],
+"sqlState" : "KD006"
+  },
+  "STDS_CONFLICT_OPTIONS" : {
+"message" : [
+  "The options  cannot be specified together. Please specify the 
one."
+],
+"sqlState" : "42613"
+  },
+  "STDS_FAILED_TO_READ_STATE_SCHEMA" : {
+"message" : [
+  "Failed to read the state schema. Either the file does not exist, or the 
file 

(spark) branch master updated: [SPARK-46284][PYTHON][CONNECT] Add `session_user` function to Python

2023-12-06 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ea8b1392d84 [SPARK-46284][PYTHON][CONNECT] Add `session_user` function 
to Python
ea8b1392d84 is described below

commit ea8b1392d84757cdab03e40c2c3efea1d8ef3c82
Author: Ruifeng Zheng 
AuthorDate: Wed Dec 6 18:30:44 2023 +0800

[SPARK-46284][PYTHON][CONNECT] Add `session_user` function to Python

### What changes were proposed in this pull request?
The `session_user` function was added in Scala in https://github.com/apache/spark/pull/42549; this PR adds it to Python.

### Why are the changes needed?
for parity

### Does this PR introduce _any_ user-facing change?
yes
```
>>> import pyspark.sql.functions as sf
>>> spark.range(1).select(sf.session_user()).show() # doctest: +SKIP
+--+
|current_user()|
+--+
| ruifeng.zheng|
+--+
```

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44205 from zhengruifeng/connect_session_user.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../docs/source/reference/pyspark.sql/functions.rst   |  1 +
 python/pyspark/sql/connect/functions/builtin.py   |  7 +++
 python/pyspark/sql/functions/builtin.py   | 19 +++
 python/pyspark/sql/tests/test_functions.py|  1 -
 4 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/python/docs/source/reference/pyspark.sql/functions.rst 
b/python/docs/source/reference/pyspark.sql/functions.rst
index 3b6a55a2a6f..d1dba5f2bed 100644
--- a/python/docs/source/reference/pyspark.sql/functions.rst
+++ b/python/docs/source/reference/pyspark.sql/functions.rst
@@ -585,6 +585,7 @@ Misc Functions
 monotonically_increasing_id
 raise_error
 reflect
+session_user
 spark_partition_id
 try_aes_decrypt
 try_reflect
diff --git a/python/pyspark/sql/connect/functions/builtin.py 
b/python/pyspark/sql/connect/functions/builtin.py
index 882fbbccf63..48a7a223e6e 100644
--- a/python/pyspark/sql/connect/functions/builtin.py
+++ b/python/pyspark/sql/connect/functions/builtin.py
@@ -3613,6 +3613,13 @@ def user() -> Column:
 user.__doc__ = pysparkfuncs.user.__doc__
 
 
+def session_user() -> Column:
+return _invoke_function("session_user")
+
+
+session_user.__doc__ = pysparkfuncs.session_user.__doc__
+
+
 def assert_true(col: "ColumnOrName", errMsg: Optional[Union[Column, str]] = 
None) -> Column:
 if errMsg is None:
 return _invoke_function_over_columns("assert_true", col)
diff --git a/python/pyspark/sql/functions/builtin.py 
b/python/pyspark/sql/functions/builtin.py
index ac237f10c2e..87ae84c4e2d 100644
--- a/python/pyspark/sql/functions/builtin.py
+++ b/python/pyspark/sql/functions/builtin.py
@@ -8914,6 +8914,25 @@ def user() -> Column:
 return _invoke_function("user")
 
 
+@_try_remote_functions
+def session_user() -> Column:
+"""Returns the user name of current execution context.
+
+.. versionadded:: 4.0.0
+
+Examples
+
+>>> import pyspark.sql.functions as sf
+>>> spark.range(1).select(sf.session_user()).show() # doctest: +SKIP
++--+
+|current_user()|
++--+
+| ruifeng.zheng|
++--+
+"""
+return _invoke_function("session_user")
+
+
 @_try_remote_functions
 def crc32(col: "ColumnOrName") -> Column:
 """
diff --git a/python/pyspark/sql/tests/test_functions.py 
b/python/pyspark/sql/tests/test_functions.py
index 7d8acbb2b18..2bdcfa6085f 100644
--- a/python/pyspark/sql/tests/test_functions.py
+++ b/python/pyspark/sql/tests/test_functions.py
@@ -66,7 +66,6 @@ class FunctionsTestsMixin:
 "random",  # namespace conflict with python built-in module
 "uuid",  # namespace conflict with python built-in module
 "chr",  # namespace conflict with python built-in function
-"session_user",  # Scala only for now, needs implementation
 "partitioning$",  # partitioning expressions for DSv2
 ]
 





(spark) branch master updated: [SPARK-46281][PS][TESTS] Remove unused code in `pyspark.pandas.tests.computation.*`

2023-12-06 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b9cc813adf0 [SPARK-46281][PS][TESTS] Remove unused code in 
`pyspark.pandas.tests.computation.*`
b9cc813adf0 is described below

commit b9cc813adf0ba592f020bc9778b373800ad81b8f
Author: Ruifeng Zheng 
AuthorDate: Wed Dec 6 17:06:16 2023 +0800

[SPARK-46281][PS][TESTS] Remove unused code in 
`pyspark.pandas.tests.computation.*`

### What changes were proposed in this pull request?
Remove unused code in `pyspark.pandas.tests.computation.*`

### Why are the changes needed?
code clean up

### Does this PR introduce _any_ user-facing change?
no, test-only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44200 from zhengruifeng/ps_test_comp_cleanup.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/pandas/tests/computation/test_any_all.py | 13 -
 python/pyspark/pandas/tests/computation/test_apply_func.py  |  6 --
 python/pyspark/pandas/tests/computation/test_binary_ops.py  |  6 --
 python/pyspark/pandas/tests/computation/test_combine.py | 13 -
 python/pyspark/pandas/tests/computation/test_cumulative.py  | 13 -
 python/pyspark/pandas/tests/computation/test_eval.py|  8 
 python/pyspark/pandas/tests/computation/test_melt.py| 13 -
 .../pyspark/pandas/tests/computation/test_missing_data.py   | 13 -
 python/pyspark/pandas/tests/computation/test_pivot.py   | 13 -
 9 files changed, 98 deletions(-)

diff --git a/python/pyspark/pandas/tests/computation/test_any_all.py 
b/python/pyspark/pandas/tests/computation/test_any_all.py
index 6c120aead4e..5e946be7b08 100644
--- a/python/pyspark/pandas/tests/computation/test_any_all.py
+++ b/python/pyspark/pandas/tests/computation/test_any_all.py
@@ -25,19 +25,6 @@ from pyspark.testing.sqlutils import SQLTestUtils
 
 
 class FrameAnyAllMixin:
-@property
-def pdf(self):
-return pd.DataFrame(
-{"a": [1, 2, 3, 4, 5, 6, 7, 8, 9], "b": [4, 5, 6, 3, 2, 1, 0, 0, 
0]},
-index=np.random.rand(9),
-)
-
-@property
-def df_pair(self):
-pdf = self.pdf
-psdf = ps.from_pandas(pdf)
-return pdf, psdf
-
 def test_all(self):
 pdf = pd.DataFrame(
 {
diff --git a/python/pyspark/pandas/tests/computation/test_apply_func.py 
b/python/pyspark/pandas/tests/computation/test_apply_func.py
index 00b14441991..de82c061b58 100644
--- a/python/pyspark/pandas/tests/computation/test_apply_func.py
+++ b/python/pyspark/pandas/tests/computation/test_apply_func.py
@@ -40,12 +40,6 @@ class FrameApplyFunctionMixin:
 index=np.random.rand(9),
 )
 
-@property
-def df_pair(self):
-pdf = self.pdf
-psdf = ps.from_pandas(pdf)
-return pdf, psdf
-
 def test_apply(self):
 pdf = pd.DataFrame(
 {
diff --git a/python/pyspark/pandas/tests/computation/test_binary_ops.py 
b/python/pyspark/pandas/tests/computation/test_binary_ops.py
index 99a612576aa..09de7d6d015 100644
--- a/python/pyspark/pandas/tests/computation/test_binary_ops.py
+++ b/python/pyspark/pandas/tests/computation/test_binary_ops.py
@@ -35,12 +35,6 @@ class FrameBinaryOpsMixin:
 index=np.random.rand(9),
 )
 
-@property
-def df_pair(self):
-pdf = self.pdf
-psdf = ps.from_pandas(pdf)
-return pdf, psdf
-
 def test_binary_operators(self):
 pdf = pd.DataFrame(
 {"A": [0, 2, 4], "B": [4, 2, 0], "X": [-1, 10, 0]}, 
index=np.random.rand(3)
diff --git a/python/pyspark/pandas/tests/computation/test_combine.py 
b/python/pyspark/pandas/tests/computation/test_combine.py
index 9df4fd7ef2b..4de4dd1cccb 100644
--- a/python/pyspark/pandas/tests/computation/test_combine.py
+++ b/python/pyspark/pandas/tests/computation/test_combine.py
@@ -27,19 +27,6 @@ from pyspark.testing.sqlutils import SQLTestUtils
 # This file contains test cases for 'Combining / joining / merging'
 # 
https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/frame.html#combining-joining-merging
 class FrameCombineMixin:
-@property
-def pdf(self):
-return pd.DataFrame(
-{"a": [1, 2, 3, 4, 5, 6, 7, 8, 9], "b": [4, 5, 6, 3, 2, 1, 0, 0, 
0]},
-index=np.random.rand(9),
-)
-
-@property
-def df_pair(self):
-pdf = self.pdf
-psdf = ps.from_pandas(pdf)
-return pdf, psdf
-
 def test_concat(self):
 pdf = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB"))
 psdf = ps.from_pandas(pdf)
diff --git 

(spark) branch master updated: [SPARK-46280][PS][TESTS] Move test_parity_frame_resample and test_parity_series_resample to `pyspark.pandas.tests.connect.resample`

2023-12-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d6f61d4e9d5 [SPARK-46280][PS][TESTS] Move test_parity_frame_resample 
and test_parity_series_resample to `pyspark.pandas.tests.connect.resample`
d6f61d4e9d5 is described below

commit d6f61d4e9d58c9806e345932482ecafa77c1a405
Author: Ruifeng Zheng 
AuthorDate: Wed Dec 6 17:24:05 2023 +0900

[SPARK-46280][PS][TESTS] Move test_parity_frame_resample and 
test_parity_series_resample to `pyspark.pandas.tests.connect.resample`

### What changes were proposed in this pull request?
Move test_parity_frame_resample and test_parity_series_resample to 
`pyspark.pandas.tests.connect.resample`

### Why are the changes needed?
re-org resampling tests, move them to the right place

### Does this PR introduce _any_ user-facing change?
no, test only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44199 from zhengruifeng/ps_test_move_df_resample.

Authored-by: Ruifeng Zheng 
Signed-off-by: Hyukjin Kwon 
---
 dev/sparktestsupport/modules.py   | 8 
 .../test_parity_frame.py} | 8 
 .../test_parity_series.py}| 8 
 .../tests/{test_frame_resample.py => resample/test_frame.py}  | 6 +++---
 .../tests/{test_series_resample.py => resample/test_series.py}| 6 +++---
 5 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index d204fcf8295..834b3bd235a 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -743,10 +743,10 @@ pyspark_pandas = Module(
 "pyspark.pandas.tests.test_repr",
 "pyspark.pandas.tests.resample.test_on",
 "pyspark.pandas.tests.resample.test_error",
+"pyspark.pandas.tests.resample.test_frame",
 "pyspark.pandas.tests.resample.test_missing",
+"pyspark.pandas.tests.resample.test_series",
 "pyspark.pandas.tests.resample.test_timezone",
-"pyspark.pandas.tests.test_frame_resample",
-"pyspark.pandas.tests.test_series_resample",
 "pyspark.pandas.tests.test_reshape",
 "pyspark.pandas.tests.test_rolling",
 "pyspark.pandas.tests.test_scalars",
@@ -1109,8 +1109,8 @@ pyspark_pandas_connect_part2 = Module(
 "pyspark.pandas.tests.connect.indexes.test_parity_datetime_property",
 "pyspark.pandas.tests.connect.test_parity_frame_interpolate",
 "pyspark.pandas.tests.connect.test_parity_series_interpolate",
-"pyspark.pandas.tests.connect.test_parity_frame_resample",
-"pyspark.pandas.tests.connect.test_parity_series_resample",
+"pyspark.pandas.tests.connect.resample.test_parity_frame",
+"pyspark.pandas.tests.connect.resample.test_parity_series",
 "pyspark.pandas.tests.connect.test_parity_ewm",
 "pyspark.pandas.tests.connect.test_parity_rolling",
 "pyspark.pandas.tests.connect.test_parity_expanding",
diff --git a/python/pyspark/pandas/tests/connect/test_parity_frame_resample.py 
b/python/pyspark/pandas/tests/connect/resample/test_parity_frame.py
similarity index 82%
rename from python/pyspark/pandas/tests/connect/test_parity_frame_resample.py
rename to python/pyspark/pandas/tests/connect/resample/test_parity_frame.py
index b4f4d5aceea..8e12f6be51e 100644
--- a/python/pyspark/pandas/tests/connect/test_parity_frame_resample.py
+++ b/python/pyspark/pandas/tests/connect/resample/test_parity_frame.py
@@ -16,19 +16,19 @@
 #
 import unittest
 
-from pyspark.pandas.tests.test_frame_resample import FrameResampleTestsMixin
+from pyspark.pandas.tests.resample.test_frame import ResampleFrameMixin
 from pyspark.testing.connectutils import ReusedConnectTestCase
 from pyspark.testing.pandasutils import PandasOnSparkTestUtils, TestUtils
 
 
-class FrameResampleParityTests(
-FrameResampleTestsMixin, PandasOnSparkTestUtils, TestUtils, 
ReusedConnectTestCase
+class ResampleParityFrameTests(
+ResampleFrameMixin, PandasOnSparkTestUtils, TestUtils, 
ReusedConnectTestCase
 ):
 pass
 
 
 if __name__ == "__main__":
-from pyspark.pandas.tests.connect.test_parity_frame_resample import *  # 
noqa: F401
+from pyspark.pandas.tests.connect.resample.test_parity_frame import *  # 
noqa: F401
 
 try:
 import xmlrunner  # type: ignore[import]
diff --git a/python/pyspark/pandas/tests/connect/test_parity_series_resample.py 
b/python/pyspark/pandas/tests/connect/resample/test_parity_series.py
similarity index 81%
rename from python/pyspark/pandas/tests/connect/test_parity_series_resample.py
rename to 

(spark) branch master updated: [SPARK-46276][PYTHON][TESTS] Improve test coverage of pyspark utils

2023-12-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e7616dc9502 [SPARK-46276][PYTHON][TESTS] Improve test coverage of 
pyspark utils
e7616dc9502 is described below

commit e7616dc95021471b42e9dfcf2ef99b4e3f726e52
Author: Xinrong Meng 
AuthorDate: Wed Dec 6 17:21:01 2023 +0900

[SPARK-46276][PYTHON][TESTS] Improve test coverage of pyspark utils

### What changes were proposed in this pull request?
Improve test coverage of pyspark utils

### Why are the changes needed?
Subtasks of 
[SPARK-46041](https://issues.apache.org/jira/browse/SPARK-46041) to improve 
test coverage

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Test changes only.
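
For quick context, the behaviour the two new assertions pin down (expected values taken from the diff below; anything beyond them is an assumption):

```python
from pyspark.util import _parse_memory
from pyspark.loose_version import LooseVersion

# "1g" is parsed into mebibytes.
assert _parse_memory("1g") == 1024

# LooseVersion compares against plain strings and other LooseVersion objects.
assert LooseVersion("1.2.3") == "1.2.3"
assert LooseVersion("1.2.3") <= LooseVersion("1.2.4")
```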

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44192 from xinrong-meng/improve_test_util.

Authored-by: Xinrong Meng 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/tests/test_util.py | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/python/pyspark/tests/test_util.py 
b/python/pyspark/tests/test_util.py
index af104d683aa..ad0b106d229 100644
--- a/python/pyspark/tests/test_util.py
+++ b/python/pyspark/tests/test_util.py
@@ -20,6 +20,8 @@ import unittest
 from py4j.protocol import Py4JJavaError
 
 from pyspark import keyword_only
+from pyspark.util import _parse_memory
+from pyspark.loose_version import LooseVersion
 from pyspark.testing.utils import PySparkTestCase, eventually
 from pyspark.find_spark_home import _find_spark_home
 
@@ -105,6 +107,23 @@ class UtilTests(PySparkTestCase):
 lambda: self.assertTrue(random.random() < 0.1)
 )()
 
+def test_loose_version(self):
+v1 = LooseVersion("1.2.3")
+self.assertEqual(str(v1), "1.2.3")
+self.assertEqual(repr(v1), "LooseVersion ('1.2.3')")
+v2 = "1.2.3"
+self.assertTrue(v1 == v2)
+v3 = 1.1
+with self.assertRaises(TypeError):
+v1 > v3
+v4 = LooseVersion("1.2.4")
+self.assertTrue(v1 <= v4)
+
+def test_parse_memory(self):
+self.assertEqual(_parse_memory("1g"), 1024)
+with self.assertRaisesRegex(ValueError, "invalid format"):
+_parse_memory("2gs")
+
 
 if __name__ == "__main__":
 from pyspark.tests.test_util import *  # noqa: F401


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-46260][PYTHON][SQL] `DataFrame.withColumnsRenamed` should respect the dict ordering

2023-12-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 032e78297b0 [SPARK-46260][PYTHON][SQL] `DataFrame.withColumnsRenamed` should respect the dict ordering
032e78297b0 is described below

commit 032e78297b02adb4266818776b55e09057705084
Author: Ruifeng Zheng 
AuthorDate: Wed Dec 6 17:16:07 2023 +0900

[SPARK-46260][PYTHON][SQL] `DataFrame.withColumnsRenamed` should respect the dict ordering

### What changes were proposed in this pull request?
Make `DataFrame.withColumnsRenamed` respect the dict ordering

### Why are the changes needed?
the ordering in `withColumnsRenamed` matters

in scala
```
scala> val df = spark.range(1000)
val df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.withColumnsRenamed(Map("id" -> "a", "a" -> "b"))
val res0: org.apache.spark.sql.DataFrame = [b: bigint]

scala> df.withColumnsRenamed(Map("a" -> "b", "id" -> "a"))
val res1: org.apache.spark.sql.DataFrame = [a: bigint]
```

However, in py4j the Python `dict` -> JVM `map` conversion cannot guarantee the ordering.
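
A minimal standalone sketch of the approach the fix takes on the Python side (mirroring the `dataframe.py` hunk below): split the mapping into two parallel lists that keep the dict's insertion order rather than handing py4j a `dict`. The helper name `split_rename_map` is made up for illustration:

```python
from typing import Dict, List, Tuple

def split_rename_map(cols_map: Dict[str, str]) -> Tuple[List[str], List[str]]:
    # Python dicts preserve insertion order, so iterating items() keeps the
    # user's intended rename order; a JVM HashMap built by py4j would not.
    col_names: List[str] = []
    new_col_names: List[str] = []
    for old, new in cols_map.items():
        col_names.append(old)
        new_col_names.append(new)
    return col_names, new_col_names

print(split_rename_map({"id": "a", "a": "b"}))  # (['id', 'a'], ['a', 'b'])
print(split_rename_map({"a": "b", "id": "a"}))  # (['a', 'id'], ['b', 'a'])
```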

### Does this PR introduce _any_ user-facing change?
yes, behavior change

before this PR
```
In [1]: df = spark.range(10)

In [2]: df.withColumnsRenamed({"id": "a", "a": "b"})
Out[2]: DataFrame[a: bigint]

In [3]: df.withColumnsRenamed({"a": "b", "id": "a"})
Out[3]: DataFrame[a: bigint]
```

after this PR
```
In [1]: df = spark.range(10)

In [2]: df.withColumnsRenamed({"id": "a", "a": "b"})
Out[2]: DataFrame[b: bigint]

In [3]: df.withColumnsRenamed({"a": "b", "id": "a"})
Out[3]: DataFrame[a: bigint]
```

### How was this patch tested?
added ut

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44177 from zhengruifeng/sql_withColumnsRenamed_sql.

Authored-by: Ruifeng Zheng 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/dataframe.py| 13 ++-
 .../sql/tests/connect/test_parity_dataframe.py |  5 
 python/pyspark/sql/tests/test_dataframe.py |  9 
 .../main/scala/org/apache/spark/sql/Dataset.scala  | 27 +++---
 .../org/apache/spark/sql/DataFrameSuite.scala  |  7 ++
 5 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 5211d874ba3..1419d1f3cb6 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -6272,7 +6272,18 @@ class DataFrame(PandasMapOpsMixin, 
PandasConversionMixin):
 message_parameters={"arg_name": "colsMap", "arg_type": 
type(colsMap).__name__},
 )
 
-return DataFrame(self._jdf.withColumnsRenamed(colsMap), 
self.sparkSession)
+col_names: List[str] = []
+new_col_names: List[str] = []
+for k, v in colsMap.items():
+col_names.append(k)
+new_col_names.append(v)
+
+return DataFrame(
+self._jdf.withColumnsRenamed(
+_to_seq(self._sc, col_names), _to_seq(self._sc, new_col_names)
+),
+self.sparkSession,
+)
 
 def withMetadata(self, columnName: str, metadata: Dict[str, Any]) -> 
"DataFrame":
 """Returns a new :class:`DataFrame` by updating an existing column 
with metadata.
diff --git a/python/pyspark/sql/tests/connect/test_parity_dataframe.py 
b/python/pyspark/sql/tests/connect/test_parity_dataframe.py
index b7b4fdcd287..fbef282e0b9 100644
--- a/python/pyspark/sql/tests/connect/test_parity_dataframe.py
+++ b/python/pyspark/sql/tests/connect/test_parity_dataframe.py
@@ -77,6 +77,11 @@ class DataFrameParityTests(DataFrameTestsMixin, 
ReusedConnectTestCase):
 def test_toDF_with_string(self):
 super().test_toDF_with_string()
 
+# TODO(SPARK-46261): Python Client withColumnsRenamed should respect the 
dict ordering
+@unittest.skip("Fails in Spark Connect, should enable.")
+def test_ordering_of_with_columns_renamed(self):
+super().test_ordering_of_with_columns_renamed()
+
 
 if __name__ == "__main__":
 import unittest
diff --git a/python/pyspark/sql/tests/test_dataframe.py 
b/python/pyspark/sql/tests/test_dataframe.py
index 52806f4f4a3..c25fe60ad17 100644
--- a/python/pyspark/sql/tests/test_dataframe.py
+++ b/python/pyspark/sql/tests/test_dataframe.py
@@ -163,6 +163,15 @@ class DataFrameTestsMixin:
 message_parameters={"arg_name": "colsMap", "arg_type": "tuple"},
 )
 
+def test_ordering_of_with_columns_renamed(self):
+df = self.spark.range(10)
+
+df1 = df.withColumnsRenamed({"id": "a", "a": "b"})
+

(spark) branch master updated: [SPARK-46213][PYTHON] Introduce `PySparkImportError` for error framework

2023-12-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 35a99a87c51 [SPARK-46213][PYTHON] Introduce `PySparkImportError` for 
error framework
35a99a87c51 is described below

commit 35a99a87c51c504c0231715e14bdbcc89a6b63d0
Author: Haejoon Lee 
AuthorDate: Wed Dec 6 17:15:38 2023 +0900

[SPARK-46213][PYTHON] Introduce `PySparkImportError` for error framework

### What changes were proposed in this pull request?

This PR proposes to introduce `PySparkImportError` for the error framework.

**NOTE**: This PR was originally merged as https://github.com/apache/spark/pull/44123, but was reverted because special characters were not parsed properly. So this PR also includes changes that fix `python/pyspark/errors/utils.py`.

### Why are the changes needed?

For better error handling.

### Does this PR introduce _any_ user-facing change?

No API changes, but it improves the user-facing error messages.
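
As a rough sketch of how the new class is meant to be used (modelled on the `pyspark/sql/pandas/utils.py` change in the diff below), raise it with an error class and message parameters instead of a hand-written `ImportError`; the helper name, default version, and parameter keys here are assumptions for illustration:

```python
from pyspark.errors import PySparkImportError
from pyspark.loose_version import LooseVersion

def require_minimum_pandas(minimum_version: str = "1.0.5") -> None:
    # Hypothetical helper: fail with a class-based ImportError if pandas is too old.
    import pandas

    if LooseVersion(pandas.__version__) < LooseVersion(minimum_version):
        raise PySparkImportError(
            error_class="UNSUPPORTED_PACKAGE_VERSION",
            message_parameters={
                "package_name": "pandas",            # parameter keys assumed from the
                "minimum_version": minimum_version,  # UNSUPPORTED_PACKAGE_VERSION template
                "current_version": str(pandas.__version__),
            },
        )
```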

### How was this patch tested?

The existing CI should pass

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44176 from itholic/import_error_followup.

Authored-by: Haejoon Lee 
Signed-off-by: Hyukjin Kwon 
---
 python/docs/source/reference/pyspark.errors.rst |  1 +
 python/pyspark/errors/__init__.py   |  2 ++
 python/pyspark/errors/error_classes.py  |  5 
 python/pyspark/errors/exceptions/base.py|  6 
 python/pyspark/errors/utils.py  | 11 +--
 python/pyspark/sql/connect/utils.py | 36 ---
 python/pyspark/sql/pandas/utils.py  | 39 +
 7 files changed, 75 insertions(+), 25 deletions(-)

diff --git a/python/docs/source/reference/pyspark.errors.rst 
b/python/docs/source/reference/pyspark.errors.rst
index 56fdde2584c..a4997506b41 100644
--- a/python/docs/source/reference/pyspark.errors.rst
+++ b/python/docs/source/reference/pyspark.errors.rst
@@ -44,6 +44,7 @@ Classes
 PySparkRuntimeError
 PySparkTypeError
 PySparkValueError
+PySparkImportError
 PySparkIndexError
 PythonException
 QueryExecutionException
diff --git a/python/pyspark/errors/__init__.py 
b/python/pyspark/errors/__init__.py
index 0a55084a4a5..07033d21643 100644
--- a/python/pyspark/errors/__init__.py
+++ b/python/pyspark/errors/__init__.py
@@ -39,6 +39,7 @@ from pyspark.errors.exceptions.base import (  # noqa: F401
 SparkNoSuchElementException,
 PySparkTypeError,
 PySparkValueError,
+PySparkImportError,
 PySparkIndexError,
 PySparkAttributeError,
 PySparkRuntimeError,
@@ -70,6 +71,7 @@ __all__ = [
 "SparkNoSuchElementException",
 "PySparkTypeError",
 "PySparkValueError",
+"PySparkImportError",
 "PySparkIndexError",
 "PySparkAttributeError",
 "PySparkRuntimeError",
diff --git a/python/pyspark/errors/error_classes.py 
b/python/pyspark/errors/error_classes.py
index 7dd5cd92705..c93ffa94149 100644
--- a/python/pyspark/errors/error_classes.py
+++ b/python/pyspark/errors/error_classes.py
@@ -1018,6 +1018,11 @@ ERROR_CLASSES_JSON = """
   " is not supported."
 ]
   },
+  "UNSUPPORTED_PACKAGE_VERSION" : {
+"message" : [
+  " >=  must be installed; however, your 
version is ."
+]
+  },
   "UNSUPPORTED_PARAM_TYPE_FOR_HIGHER_ORDER_FUNCTION" : {
 "message" : [
   "Function `` should use only POSITIONAL or POSITIONAL OR 
KEYWORD arguments."
diff --git a/python/pyspark/errors/exceptions/base.py 
b/python/pyspark/errors/exceptions/base.py
index e7f1e4386d7..b7d8ed88ec0 100644
--- a/python/pyspark/errors/exceptions/base.py
+++ b/python/pyspark/errors/exceptions/base.py
@@ -264,3 +264,9 @@ class PySparkKeyError(PySparkException, KeyError):
 """
 Wrapper class for KeyError to support error classes.
 """
+
+
+class PySparkImportError(PySparkException, ImportError):
+"""
+Wrapper class for ImportError to support error classes.
+"""
diff --git a/python/pyspark/errors/utils.py b/python/pyspark/errors/utils.py
index a4894dcb1a6..e1f249506dd 100644
--- a/python/pyspark/errors/utils.py
+++ b/python/pyspark/errors/utils.py
@@ -16,7 +16,7 @@
 #
 
 import re
-from typing import Dict
+from typing import Dict, Match
 
 from pyspark.errors.error_classes import ERROR_CLASSES_MAP
 
@@ -40,9 +40,14 @@ class ErrorClassesReader:
 f"Undefined error message parameter for error class: 
{error_class}. "
 f"Parameters: {message_parameters}"
 )
-table = str.maketrans("<>", "{}")
 
-return message_template.translate(table).format(**message_parameters)
+def replace_match(match: Match[str]) -> str:
+return 

(spark) branch master updated: [SPARK-46252][PYTHON][TESTS] Improve test coverage of memory_profiler.py

2023-12-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 44036f2be3d [SPARK-46252][PYTHON][TESTS] Improve test coverage of 
memory_profiler.py
44036f2be3d is described below

commit 44036f2be3d5da7d8e20ee5082868dd70eb35884
Author: Xinrong Meng 
AuthorDate: Wed Dec 6 17:15:07 2023 +0900

[SPARK-46252][PYTHON][TESTS] Improve test coverage of memory_profiler.py

### What changes were proposed in this pull request?
Improve test coverage of memory_profiler.py

### Why are the changes needed?
Subtasks of 
[SPARK-46041](https://issues.apache.org/jira/browse/SPARK-46041) to improve 
test coverage

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Test changes only.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44167 from xinrong-meng/test_profiler.

Authored-by: Xinrong Meng 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/tests/test_memory_profiler.py | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/python/pyspark/tests/test_memory_profiler.py 
b/python/pyspark/tests/test_memory_profiler.py
index 31ad62e11d9..c2c185b9863 100644
--- a/python/pyspark/tests/test_memory_profiler.py
+++ b/python/pyspark/tests/test_memory_profiler.py
@@ -45,6 +45,32 @@ class MemoryProfilerTests(PySparkTestCase):
 self.sc = SparkContext("local[4]", class_name, conf=conf)
 self.spark = SparkSession(sparkContext=self.sc)
 
+def test_code_map(self):
+from pyspark.profiler import CodeMapForUDF
+
+code_map = CodeMapForUDF(include_children=False, backend="psutil")
+
+def f(x):
+return x + 1
+
+code = f.__code__
+code_map.add(code)
+code_map.add(code)  # no-op, will return directly
+
+self.assertIn(code, code_map)
+self.assertEqual(len(code_map._toplevel), 1)
+
+def test_udf_line_profiler(self):
+from pyspark.profiler import UDFLineProfiler
+
+profiler = UDFLineProfiler()
+
+def f(x):
+return x + 1
+
+profiler.add_function(f)
+self.assertTrue(profiler.code_map)
+
 def test_memory_profiler(self):
 self.exec_python_udf()
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org