[spark] branch master updated: [SPARK-44289][SPARK-43874][SPARK-43869][SPARK-43607][PS] Support `indexer_between_time` for pandas 2.0.0 & enabling more tests

2023-08-17 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 960c027fb91b [SPARK-44289][SPARK-43874][SPARK-43869][SPARK-43607][PS] 
Support `indexer_between_time` for pandas 2.0.0 & enabling more tests
960c027fb91b is described below

commit 960c027fb91b5b198cb858a89458cf94a7b1f05f
Author: itholic 
AuthorDate: Fri Aug 18 12:53:53 2023 +0800

[SPARK-44289][SPARK-43874][SPARK-43869][SPARK-43607][PS] Support 
`indexer_between_time` for pandas 2.0.0 & enabling more tests

### What changes were proposed in this pull request?

This PR proposes to support `DatetimeIndex.indexer_between_time` for pandas 2.0.0 and above. See https://github.com/pandas-dev/pandas/issues/43248 for more detail.

This PR also enables a bunch of tests for `Series`, `Index`, and `GroupBy`.
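
For context, a minimal pandas-on-Spark sketch of the now-supported call, mirroring the updated doctest (assumes a Spark session with `pyspark.pandas` available):

```python
import pyspark.pandas as ps

# Build a minute-frequency DatetimeIndex and locate positions between two times of day.
psidx = ps.date_range("2000-01-01", periods=3, freq="T")

print(psidx.indexer_between_time("00:01", "00:02").sort_values())
# Index([1, 2], dtype='int64')

print(psidx.indexer_between_time("00:01", "00:02", include_end=False))
# Index([1], dtype='int64')
```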

### Why are the changes needed?

To match the behavior of the latest pandas.

### Does this PR introduce _any_ user-facing change?

`DatetimeIndex.indexer_between_time` now has the same behavior as the latest pandas.

### How was this patch tested?

Enabling & updating the existing UTs and doctests.

Closes #42533 from itholic/enable-many-tests.

Authored-by: itholic 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/pandas/groupby.py   |  7 +--
 python/pyspark/pandas/indexes/datetimes.py | 20 ++---
 .../pyspark/pandas/tests/computation/test_cov.py   | 51 --
 .../pandas/tests/data_type_ops/test_date_ops.py| 20 -
 .../pyspark/pandas/tests/groupby/test_aggregate.py | 12 +
 .../pandas/tests/groupby/test_apply_func.py| 11 ++---
 .../pyspark/pandas/tests/groupby/test_groupby.py   |  7 ++-
 python/pyspark/pandas/tests/indexes/test_base.py   | 24 --
 .../pyspark/pandas/tests/indexes/test_datetime.py  |  5 ---
 .../pyspark/pandas/tests/indexes/test_reindex.py   |  9 ++--
 python/pyspark/pandas/tests/series/test_as_type.py | 34 +--
 11 files changed, 58 insertions(+), 142 deletions(-)

diff --git a/python/pyspark/pandas/groupby.py b/python/pyspark/pandas/groupby.py
index 2de328177937..f9d93299e3e8 100644
--- a/python/pyspark/pandas/groupby.py
+++ b/python/pyspark/pandas/groupby.py
@@ -4165,6 +4165,7 @@ class SeriesGroupBy(GroupBy[Series]):
 
 Examples
 
+>>> import numpy as np
 >>> df = ps.DataFrame({'A': [1, 2, 2, 3, 3, 3],
 ...'B': [1, 1, 2, 3, 3, np.nan]},
 ...   columns=['A', 'B'])
@@ -4183,7 +4184,7 @@ class SeriesGroupBy(GroupBy[Series]):
 2  1.01
2.01
 3  3.02
-Name: B, dtype: int64
+Name: count, dtype: int64
 
 Don't include counts of NaN when dropna is False.
 
@@ -4195,7 +4196,7 @@ class SeriesGroupBy(GroupBy[Series]):
2.01
 3  3.02
NaN1
-Name: B, dtype: int64
+Name: count, dtype: int64
 """
 warnings.warn(
 "The resulting Series will have a fixed name of 'count' from 
4.0.0.",
@@ -4232,7 +4233,7 @@ class SeriesGroupBy(GroupBy[Series]):
 psser._internal.data_fields[0].copy(name=name)
 for psser, name in zip(groupkeys, groupkey_names)
 ],
-column_labels=[self._agg_columns[0]._column_label],
+column_labels=[("count",)],
 data_spark_columns=[scol_for(sdf, agg_column)],
 )
 return first_series(DataFrame(internal))
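
For context, a minimal pandas-on-Spark sketch of the `value_counts` behavior touched in the hunk above, where the result is now named `count` per the updated doctest (assumes a Spark session with `pyspark.pandas` available):

```python
import numpy as np
import pyspark.pandas as ps

df = ps.DataFrame({"A": [1, 2, 2, 3, 3, 3],
                   "B": [1, 1, 2, 3, 3, np.nan]})

# Per the updated doctest, the resulting Series is named 'count' rather than 'B'.
print(df.groupby("A")["B"].value_counts().sort_index())
```
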
diff --git a/python/pyspark/pandas/indexes/datetimes.py 
b/python/pyspark/pandas/indexes/datetimes.py
index 1971d90a7427..f37661cb6a0a 100644
--- a/python/pyspark/pandas/indexes/datetimes.py
+++ b/python/pyspark/pandas/indexes/datetimes.py
@@ -730,24 +730,32 @@ class DatetimeIndex(Index):
 
 Examples
 
->>> psidx = ps.date_range("2000-01-01", periods=3, freq="T")  # 
doctest: +SKIP
->>> psidx  # doctest: +SKIP
+>>> psidx = ps.date_range("2000-01-01", periods=3, freq="T")
+>>> psidx
 DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 00:01:00',
'2000-01-01 00:02:00'],
   dtype='datetime64[ns]', freq=None)
 
->>> psidx.indexer_between_time("00:01", "00:02").sort_values()  # 
doctest: +SKIP
+>>> psidx.indexer_between_time("00:01", "00:02").sort_values()
 Index([1, 2], dtype='int64')
 
->>> psidx.indexer_between_time("00:01", "00:02", include_end=False)  # 
doctest: +SKIP
+>>> psidx.indexer_between_time("00:01", "00:02", include_end=False)
 Index([1], dtype='int64')
 
->>> psidx.indexer_between_time("00:01", "00:02", include_start=False)  
# doctest: +SKIP
+>>> 

[spark] branch master updated: [SPARK-43205][DOC] identifier clause docs

2023-08-17 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 02a07cd6adc [SPARK-43205][DOC] identifier clause docs
02a07cd6adc is described below

commit 02a07cd6adc7f0674bc673e3f917d71d9b290199
Author: srielau 
AuthorDate: Fri Aug 18 09:26:32 2023 +0800

[SPARK-43205][DOC] identifier clause docs

### What changes were proposed in this pull request?

Document the IDENTIFIER() clause
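
For context, a minimal PySpark sketch of the clause being documented, adapted from the examples on the new page (assumes Spark 3.4+ parameterized `spark.sql` and an active `spark` session; the table name is illustrative):

```python
# Create and inspect a table whose name is supplied via a named parameter marker.
spark.sql("CREATE TABLE IDENTIFIER(:mytab)(c1 INT)", args={"mytab": "tab1"})
spark.sql("DESCRIBE IDENTIFIER(:mytab)", args={"mytab": "tab1"}).show()

# A parameterized, back-tick-qualified table reference in a query.
spark.sql("SELECT * FROM IDENTIFIER(:mytab)", args={"mytab": "`default`.`tab1`"}).show()
```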

### Why are the changes needed?

Docs are good!

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

See attached
https://github.com/apache/spark/assets/3514644/55823375-8d1a-4473-bf19-74796d273416

https://github.com/apache/spark/assets/3514644/0ee852a9-6a11-4c87-bed9-43531c55fc31

Closes #42506 from srielau/SPARK-43205-3.5-IDENTIFIER-clause-docs.

Authored-by: srielau 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 7786d0b2f359eccd570461a399da0fca84e515c1)
Signed-off-by: Wenchen Fan 
---
 docs/sql-ref-identifier-clause.md | 106 ++
 docs/sql-ref.md   |   1 +
 2 files changed, 107 insertions(+)

diff --git a/docs/sql-ref-identifier-clause.md 
b/docs/sql-ref-identifier-clause.md
new file mode 100644
index 000..694731109f8
--- /dev/null
+++ b/docs/sql-ref-identifier-clause.md
@@ -0,0 +1,106 @@
+---
+layout: global
+title: Identifier clause
+displayTitle: IDENTIFIER clause
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+### Description
+
+Converts a constant `STRING` expression into a SQL object name.
+The purpose of this clause is to allow for templating of identifiers in SQL 
statements without opening up the risk of SQL injection attacks.
+Typically, this clause is used with a parameter marker as argument.
+
+### Syntax
+
+```sql
+IDENTIFIER ( strExpr )
+```
+
+### Parameters
+
+- **strExpr**: A constant `STRING` expression. Typically, the expression 
includes a parameter marker.
+
+### Returns
+
+A (qualified) identifier which can be used as a:
+
+- qualified table name
+- namespace name
+- function name
+- qualified column or attribute reference
+
+### Examples
+
+These examples use named parameter markers to templatize queries.
+
+```scala
+// Creation of a table using parameter marker.
+spark.sql("CREATE TABLE IDENTIFIER(:mytab)(c1 INT)", args = Map("mytab" -> 
"tab1")).show()
+
+spark.sql("DESCRIBE IDENTIFIER(:mytab)", args = Map("mytab" -> "tab1")).show()
++--------+---------+-------+
+|col_name|data_type|comment|
++--------+---------+-------+
+|      c1|      int|   NULL|
++--------+---------+-------+
+
+// Altering a table with a fixed schema and a parameterized table name. 
+spark.sql("ALTER TABLE IDENTIFIER('default.' || :mytab) ADD COLUMN c2 INT", 
args = Map("mytab" -> "tab1")).show()
+
+spark.sql("DESCRIBE IDENTIFIER(:mytab)", args = Map("mytab" -> 
"default.tab1")).show()
++--------+---------+-------+
+|col_name|data_type|comment|
++--------+---------+-------+
+|      c1|      int|   NULL|
+|      c2|      int|   NULL|
++--------+---------+-------+
+
+// A parameterized reference to a table in a query. This table name is 
qualified and uses back-ticks.
+spark.sql("SELECT * FROM IDENTIFIER(:mytab)", args = Map("mytab" -> 
"`default`.`tab1`")).show()
++---+---+
+| c1| c2|
++---+---+
++---+---+
+
+
+// You cannot qualify the IDENTIFIER clause or use it as a qualifier itself.
+spark.sql("SELECT * FROM myschema.IDENTIFIER(:mytab)", args = Map("mytab" -> 
"`tab1`")).show()
+[INVALID_SQL_SYNTAX.INVALID_TABLE_VALUED_FUNC_NAME] `myschema`.`IDENTIFIER`.
+
+spark.sql("SELECT * FROM IDENTIFIER(:myschema).mytab", args = Map("mychema" -> 
"`default`")).show()
+[PARSE_SYNTAX_ERROR]
+
+// Dropping a table with separate schema and table parameters.
+spark.sql("DROP TABLE IDENTIFIER(:myschema || '.' || :mytab)", args = 
Map("myschema" -> "default", "mytab" -> "tab1")).show()
+
+// A parameterized column reference
+spark.sql("SELECT IDENTIFIER(:col) FROM VALUES(1) AS T(c1)", args = Map("col" 

[spark] branch branch-3.5 updated: [SPARK-43205][DOC] identifier clause docs

2023-08-17 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 7786d0b2f35 [SPARK-43205][DOC] identifier clause docs
7786d0b2f35 is described below

commit 7786d0b2f359eccd570461a399da0fca84e515c1
Author: srielau 
AuthorDate: Fri Aug 18 09:26:32 2023 +0800

[SPARK-43205][DOC] identifier clause docs

### What changes were proposed in this pull request?

Document the IDENTIFIER() clause

### Why are the changes needed?

Docs are good!

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

See attached
https://github.com/apache/spark/assets/3514644/55823375-8d1a-4473-bf19-74796d273416

https://github.com/apache/spark/assets/3514644/0ee852a9-6a11-4c87-bed9-43531c55fc31

Closes #42506 from srielau/SPARK-43205-3.5-IDENTIFIER-clause-docs.

Authored-by: srielau 
Signed-off-by: Wenchen Fan 
---
 docs/sql-ref-identifier-clause.md | 106 ++
 docs/sql-ref.md   |   1 +
 2 files changed, 107 insertions(+)

diff --git a/docs/sql-ref-identifier-clause.md 
b/docs/sql-ref-identifier-clause.md
new file mode 100644
index 000..694731109f8
--- /dev/null
+++ b/docs/sql-ref-identifier-clause.md
@@ -0,0 +1,106 @@
+---
+layout: global
+title: Identifier clause
+displayTitle: IDENTIFIER clause
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+### Description
+
+Converts a constant `STRING` expression into a SQL object name.
+The purpose of this clause is to allow for templating of identifiers in SQL 
statements without opening up the risk of SQL injection attacks.
+Typically, this clause is used with a parameter marker as argument.
+
+### Syntax
+
+```sql
+IDENTIFIER ( strExpr )
+```
+
+### Parameters
+
+- **strExpr**: A constant `STRING` expression. Typically, the expression 
includes a parameter marker.
+
+### Returns
+
+A (qualified) identifier which can be used as a:
+
+- qualified table name
+- namespace name
+- function name
+- qualified column or attribute reference
+
+### Examples
+
+These examples use named parameter markers to templatize queries.
+
+```scala
+// Creation of a table using parameter marker.
+spark.sql("CREATE TABLE IDENTIFIER(:mytab)(c1 INT)", args = Map("mytab" -> 
"tab1")).show()
+
+spark.sql("DESCRIBE IDENTIFIER(:mytab)", args = Map("mytab" -> "tab1")).show()
++--------+---------+-------+
+|col_name|data_type|comment|
++--------+---------+-------+
+|      c1|      int|   NULL|
++--------+---------+-------+
+
+// Altering a table with a fixed schema and a parameterized table name. 
+spark.sql("ALTER TABLE IDENTIFIER('default.' || :mytab) ADD COLUMN c2 INT", 
args = Map("mytab" -> "tab1")).show()
+
+spark.sql("DESCRIBE IDENTIFIER(:mytab)", args = Map("mytab" -> 
"default.tab1")).show()
++--------+---------+-------+
+|col_name|data_type|comment|
++--------+---------+-------+
+|      c1|      int|   NULL|
+|      c2|      int|   NULL|
++--------+---------+-------+
+
+// A parameterized reference to a table in a query. This table name is 
qualified and uses back-ticks.
+spark.sql("SELECT * FROM IDENTIFIER(:mytab)", args = Map("mytab" -> 
"`default`.`tab1`")).show()
++---+---+
+| c1| c2|
++---+---+
++---+---+
+
+
+// You cannot qualify the IDENTIFIER clause or use it as a qualifier itself.
+spark.sql("SELECT * FROM myschema.IDENTIFIER(:mytab)", args = Map("mytab" -> 
"`tab1`")).show()
+[INVALID_SQL_SYNTAX.INVALID_TABLE_VALUED_FUNC_NAME] `myschema`.`IDENTIFIER`.
+
+spark.sql("SELECT * FROM IDENTIFIER(:myschema).mytab", args = Map("mychema" -> 
"`default`")).show()
+[PARSE_SYNTAX_ERROR]
+
+// Dropping a table with separate schema and table parameters.
+spark.sql("DROP TABLE IDENTIFIER(:myschema || '.' || :mytab)", args = 
Map("myschema" -> "default", "mytab" -> "tab1")).show()
+
+// A parameterized column reference
+spark.sql("SELECT IDENTIFIER(:col) FROM VALUES(1) AS T(c1)", args = Map("col" 
-> "t.c1")).show()
++---+
+| c1|
++---+
+|  1|
++---+
+
+// Passing in a function name as a 

[spark] branch branch-3.3 updated: [SPARK-44857][CORE][UI] Fix `getBaseURI` error in Spark Worker LogPage UI buttons

2023-08-17 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 3d27f201224 [SPARK-44857][CORE][UI] Fix `getBaseURI` error in Spark 
Worker LogPage UI buttons
3d27f201224 is described below

commit 3d27f201224719af7c2dc7b9cfc2f428e771286b
Author: Dongjoon Hyun 
AuthorDate: Thu Aug 17 18:07:07 2023 -0700

[SPARK-44857][CORE][UI] Fix `getBaseURI` error in Spark Worker LogPage UI 
buttons

### What changes were proposed in this pull request?

This PR aims to fix `getBaseURI` errors when we click Spark Worker LogPage UI buttons in Apache Spark 3.2.0+.

### Why are the changes needed?

Run a Spark job and open the Spark Worker UI, http://localhost:8081 .
```
$ sbin/start-master.sh
$ sbin/start-worker.sh spark://127.0.0.1:7077
$ bin/spark-shell --master spark://127.0.0.1:7077
```

Click `stderr` and the `Load New` button. The button is currently out of order due to the following error, because `getBaseURI` is defined in `utils.js`.

![Screenshot 2023-08-17 at 2 38 45 
PM](https://github.com/apache/spark/assets/9700541/c2358ae3-46d2-43fe-9cc1-ce343725ce4c)

### Does this PR introduce _any_ user-facing change?

This will make the buttons work.

### How was this patch tested?

Manual.

Closes #42546 from dongjoon-hyun/SPARK-44857.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit f807bd239aebe182a04f5452d6efdf458e44143c)
Signed-off-by: Dongjoon Hyun 
---
 core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala | 1 +
 1 file changed, 1 insertion(+)

diff --git 
a/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala 
b/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala
index bda95f323f3..6800a19b549 100644
--- a/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala
@@ -102,6 +102,7 @@ private[ui] class LogPage(parent: WorkerWebUI) extends 
WebUIPage("logPage") with
   s"initLogPage('$logParams', $curLogLength, $startByte, $endByte, 
$logLength, $byteLength);"
 
 val content =
+   ++
   
 {linkToMaster}
 {range}





[spark] branch branch-3.5 updated: [SPARK-44857][CORE][UI] Fix `getBaseURI` error in Spark Worker LogPage UI buttons

2023-08-17 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 321ee6a489b [SPARK-44857][CORE][UI] Fix `getBaseURI` error in Spark 
Worker LogPage UI buttons
321ee6a489b is described below

commit 321ee6a489b45bb2e9779d618d06226c32411474
Author: Dongjoon Hyun 
AuthorDate: Thu Aug 17 18:07:07 2023 -0700

[SPARK-44857][CORE][UI] Fix `getBaseURI` error in Spark Worker LogPage UI 
buttons

### What changes were proposed in this pull request?

This PR aims to fix `getBaseURI` errors when we click Spark Worker LogPage UI buttons in Apache Spark 3.2.0+.

### Why are the changes needed?

Run a Spark job and open the Spark Worker UI, http://localhost:8081 .
```
$ sbin/start-master.sh
$ sbin/start-worker.sh spark://127.0.0.1:7077
$ bin/spark-shell --master spark://127.0.0.1:7077
```

Click `stderr` and the `Load New` button. The button is currently out of order due to the following error, because `getBaseURI` is defined in `utils.js`.

![Screenshot 2023-08-17 at 2 38 45 
PM](https://github.com/apache/spark/assets/9700541/c2358ae3-46d2-43fe-9cc1-ce343725ce4c)

### Does this PR introduce _any_ user-facing change?

This will make the buttons work.

### How was this patch tested?

Manual.

Closes #42546 from dongjoon-hyun/SPARK-44857.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit f807bd239aebe182a04f5452d6efdf458e44143c)
Signed-off-by: Dongjoon Hyun 
---
 core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala | 1 +
 1 file changed, 1 insertion(+)

diff --git 
a/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala 
b/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala
index bda95f323f3..6800a19b549 100644
--- a/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala
@@ -102,6 +102,7 @@ private[ui] class LogPage(parent: WorkerWebUI) extends 
WebUIPage("logPage") with
   s"initLogPage('$logParams', $curLogLength, $startByte, $endByte, 
$logLength, $byteLength);"
 
 val content =
+   ++
   
 {linkToMaster}
 {range}





[spark] branch branch-3.4 updated: [SPARK-44857][CORE][UI] Fix `getBaseURI` error in Spark Worker LogPage UI buttons

2023-08-17 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new a5e91757014 [SPARK-44857][CORE][UI] Fix `getBaseURI` error in Spark 
Worker LogPage UI buttons
a5e91757014 is described below

commit a5e91757014cbaa4c7bdbd031d13b1d55c8a6b4b
Author: Dongjoon Hyun 
AuthorDate: Thu Aug 17 18:07:07 2023 -0700

[SPARK-44857][CORE][UI] Fix `getBaseURI` error in Spark Worker LogPage UI 
buttons

### What changes were proposed in this pull request?

This PR aims to fix `getBaseURI` errors when we click Spark Worker LogPage UI buttons in Apache Spark 3.2.0+.

### Why are the changes needed?

Run a Spark job and open the Spark Worker UI, http://localhost:8081 .
```
$ sbin/start-master.sh
$ sbin/start-worker.sh spark://127.0.0.1:7077
$ bin/spark-shell --master spark://127.0.0.1:7077
```

Click `stderr` and the `Load New` button. The button is currently out of order due to the following error, because `getBaseURI` is defined in `utils.js`.

![Screenshot 2023-08-17 at 2 38 45 
PM](https://github.com/apache/spark/assets/9700541/c2358ae3-46d2-43fe-9cc1-ce343725ce4c)

### Does this PR introduce _any_ user-facing change?

This will make the buttons work.

### How was this patch tested?

Manual.

Closes #42546 from dongjoon-hyun/SPARK-44857.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit f807bd239aebe182a04f5452d6efdf458e44143c)
Signed-off-by: Dongjoon Hyun 
---
 core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala | 1 +
 1 file changed, 1 insertion(+)

diff --git 
a/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala 
b/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala
index bda95f323f3..6800a19b549 100644
--- a/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala
@@ -102,6 +102,7 @@ private[ui] class LogPage(parent: WorkerWebUI) extends 
WebUIPage("logPage") with
   s"initLogPage('$logParams', $curLogLength, $startByte, $endByte, 
$logLength, $byteLength);"
 
 val content =
+   ++
   
 {linkToMaster}
 {range}





[spark] branch master updated: [SPARK-44857][CORE][UI] Fix `getBaseURI` error in Spark Worker LogPage UI buttons

2023-08-17 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f807bd239ae [SPARK-44857][CORE][UI] Fix `getBaseURI` error in Spark 
Worker LogPage UI buttons
f807bd239ae is described below

commit f807bd239aebe182a04f5452d6efdf458e44143c
Author: Dongjoon Hyun 
AuthorDate: Thu Aug 17 18:07:07 2023 -0700

[SPARK-44857][CORE][UI] Fix `getBaseURI` error in Spark Worker LogPage UI 
buttons

### What changes were proposed in this pull request?

This PR aims to fix `getBaseURI` errors when we click Spark Worker LogPage UI buttons in Apache Spark 3.2.0+.

### Why are the changes needed?

Run a Spark job and open the Spark Worker UI, http://localhost:8081 .
```
$ sbin/start-master.sh
$ sbin/start-worker.sh spark://127.0.0.1:7077
$ bin/spark-shell --master spark://127.0.0.1:7077
```

Click `stderr` and the `Load New` button. The button is currently out of order due to the following error, because `getBaseURI` is defined in `utils.js`.

![Screenshot 2023-08-17 at 2 38 45 
PM](https://github.com/apache/spark/assets/9700541/c2358ae3-46d2-43fe-9cc1-ce343725ce4c)

### Does this PR introduce _any_ user-facing change?

This will make the buttons work.

### How was this patch tested?

Manual.

Closes #42546 from dongjoon-hyun/SPARK-44857.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala | 1 +
 1 file changed, 1 insertion(+)

diff --git 
a/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala 
b/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala
index bda95f323f3..6800a19b549 100644
--- a/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala
@@ -102,6 +102,7 @@ private[ui] class LogPage(parent: WorkerWebUI) extends 
WebUIPage("logPage") with
   s"initLogPage('$logParams', $curLogLength, $startByte, $endByte, 
$logLength, $byteLength);"
 
 val content =
+   ++
   
 {linkToMaster}
 {range}





[spark] branch master updated: [SPARK-43875][PS][TESTS] Enabling Categorical tests for Pandas 2.0.0 and above

2023-08-17 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c321b3dd66f [SPARK-43875][PS][TESTS] Enabling Categorical tests for 
Pandas 2.0.0 and above
c321b3dd66f is described below

commit c321b3dd66f6b02a9dd39672c58fc20405fb40e4
Author: itholic 
AuthorDate: Fri Aug 18 08:39:05 2023 +0800

[SPARK-43875][PS][TESTS] Enabling Categorical tests for Pandas 2.0.0 and 
above

### What changes were proposed in this pull request?

This PR proposes to enable Categorical tests for pandas 2.0.0 and above. 
See https://pandas.pydata.org/docs/whatsnew/v2.0.0.html for more detail.
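
For context, a minimal pandas sketch of the 2.0 change these tests now follow: the in-place `categories` setter is gone, so the tests switch to `rename_categories` (see the hunks below):

```python
import pandas as pd

idx = pd.CategoricalIndex(["x", "y", "z"], categories=["z", "y", "x"])

# pandas 2.0 removed `idx.categories = [...]`; rename_categories returns a new object.
idx = idx.rename_categories(["c", "b", "a"])
print(idx.categories.tolist())  # ['c', 'b', 'a']
```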

### Why are the changes needed?

To match the behavior of pandas 2.0.0 and above.

### Does this PR introduce _any_ user-facing change?

No, this is test-only.

### How was this patch tested?

Enabling & updating the existing UTs.

Closes #42530 from itholic/pandas_categorical_test.

Authored-by: itholic 
Signed-off-by: Ruifeng Zheng 
---
 .../pyspark/pandas/tests/indexes/test_category.py  | 26 --
 python/pyspark/pandas/tests/test_categorical.py| 17 --
 2 files changed, 8 insertions(+), 35 deletions(-)

diff --git a/python/pyspark/pandas/tests/indexes/test_category.py 
b/python/pyspark/pandas/tests/indexes/test_category.py
index 6aa92b7e1e3..d2405f6adb3 100644
--- a/python/pyspark/pandas/tests/indexes/test_category.py
+++ b/python/pyspark/pandas/tests/indexes/test_category.py
@@ -75,10 +75,6 @@ class CategoricalIndexTestsMixin:
 ):
 ps.CategoricalIndex([1, 2, 3]).all()
 
-@unittest.skipIf(
-LooseVersion(pd.__version__) >= LooseVersion("2.0.0"),
-"TODO(SPARK-43568): Enable 
CategoricalIndexTests.test_categories_setter for pandas 2.0.0.",
-)
 def test_categories_setter(self):
 pdf = pd.DataFrame(
 {
@@ -92,20 +88,10 @@ class CategoricalIndexTestsMixin:
 pidx = pdf.index
 psidx = psdf.index
 
-pidx.categories = ["z", "y", "x"]
-psidx.categories = ["z", "y", "x"]
-# Pandas deprecated all the in-place category-setting behaviors, 
dtypes also not be
-# refreshed in categories.setter since Pandas 1.4+, we should also 
consider to clean up
-# this test when in-place category-setting removed:
-# https://github.com/pandas-dev/pandas/issues/46820
-if LooseVersion("1.4") >= LooseVersion(pd.__version__) >= 
LooseVersion("1.1"):
-self.assert_eq(pidx, psidx)
-self.assert_eq(pdf, psdf)
-else:
-pidx = pidx.set_categories(pidx.categories)
-pdf.index = pidx
-self.assert_eq(pidx, psidx)
-self.assert_eq(pdf, psdf)
+pidx = pidx.rename_categories(["z", "y", "x"])
+psidx = psidx.rename_categories(["z", "y", "x"])
+self.assert_eq(pidx, psidx)
+self.assert_eq(pdf, psdf)
 
 with self.assertRaises(ValueError):
 psidx.categories = [1, 2, 3, 4]
@@ -122,10 +108,6 @@ class CategoricalIndexTestsMixin:
 self.assertRaises(ValueError, lambda: psidx.add_categories(3))
 self.assertRaises(ValueError, lambda: psidx.add_categories([4, 4]))
 
-@unittest.skipIf(
-LooseVersion(pd.__version__) >= LooseVersion("2.0.0"),
-"TODO(SPARK-43633): Enable 
CategoricalIndexTests.test_remove_categories for pandas 2.0.0.",
-)
 def test_remove_categories(self):
 pidx = pd.CategoricalIndex([1, 2, 3], categories=[3, 2, 1])
 psidx = ps.from_pandas(pidx)
diff --git a/python/pyspark/pandas/tests/test_categorical.py 
b/python/pyspark/pandas/tests/test_categorical.py
index c45e063d6f4..ba361ab565c 100644
--- a/python/pyspark/pandas/tests/test_categorical.py
+++ b/python/pyspark/pandas/tests/test_categorical.py
@@ -198,10 +198,6 @@ class CategoricalTestsMixin:
 
 self.assert_eq(pscser.astype(str), pcser.astype(str))
 
-@unittest.skipIf(
-LooseVersion(pd.__version__) >= LooseVersion("2.0.0"),
-"TODO(SPARK-43564): Enable CategoricalTests.test_factorize for pandas 
2.0.0.",
-)
 def test_factorize(self):
 pser = pd.Series(["a", "b", "c", None], dtype=CategoricalDtype(["c", 
"a", "d", "b"]))
 psser = ps.from_pandas(pser)
@@ -212,8 +208,8 @@ class CategoricalTestsMixin:
 self.assert_eq(kcodes.tolist(), pcodes.tolist())
 self.assert_eq(kuniques, puniques)
 
-pcodes, puniques = pser.factorize(na_sentinel=-2)
-kcodes, kuniques = psser.factorize(na_sentinel=-2)
+pcodes, puniques = pser.factorize(use_na_sentinel=-2)
+kcodes, kuniques = psser.factorize(use_na_sentinel=-2)
 
 self.assert_eq(kcodes.tolist(), pcodes.tolist())
 self.assert_eq(kuniques, puniques)
@@ 

[spark] branch branch-3.5 updated: [SPARK-44831][PYTHON][DOCS] Refine DocString of `DataFrame.{union, unionAll, unionByName}`

2023-08-17 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 3b3ceeaaee0 [SPARK-44831][PYTHON][DOCS] Refine DocString of 
`DataFrame.{union, unionAll, unionByName}`
3b3ceeaaee0 is described below

commit 3b3ceeaaee01e57872b392e7c660e70e6b1eb86a
Author: Ruifeng Zheng 
AuthorDate: Fri Aug 18 08:11:54 2023 +0800

[SPARK-44831][PYTHON][DOCS] Refine DocString of `DataFrame.{union, 
unionAll, unionByName}`

### What changes were proposed in this pull request?
Refine DocString of `Union*`:

1. fix minor grammar mistakes
2. add more examples
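
For context, a minimal PySpark sketch of the behavior the refined docstring spells out: `union` follows SQL `UNION ALL`, resolving columns by position and keeping duplicates, with `distinct()` providing set semantics (assumes an active `spark` session):

```python
df1 = spark.createDataFrame([(1, "A"), (2, "B")], ["id", "value"])
df2 = spark.createDataFrame([(2, "B"), (3, "C")], ["id", "value"])

# UNION ALL semantics: columns matched by position, duplicates kept.
df1.union(df2).show()

# Chain distinct() for a SQL-style set union.
df1.union(df2).distinct().orderBy("id").show()
```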

### Why are the changes needed?
to improve the docs

### Does this PR introduce _any_ user-facing change?
yes

### How was this patch tested?
CI

Closes #42515 from zhengruifeng/doc_refince_union.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
(cherry picked from commit 48faaa8ee73d8005d2ed0668b4d0e860fc92ca4d)
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/dataframe.py | 140 ++--
 1 file changed, 105 insertions(+), 35 deletions(-)

diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 8be2c224265..932c29910bb 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -3741,7 +3741,7 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
 )
 
 def union(self, other: "DataFrame") -> "DataFrame":
-"""Return a new :class:`DataFrame` containing union of rows in this 
and another
+"""Return a new :class:`DataFrame` containing the union of rows in 
this and another
 :class:`DataFrame`.
 
 .. versionadded:: 2.0.0
@@ -3752,11 +3752,12 @@ class DataFrame(PandasMapOpsMixin, 
PandasConversionMixin):
 Parameters
 --
 other : :class:`DataFrame`
-Another :class:`DataFrame` that needs to be unioned
+Another :class:`DataFrame` that needs to be unioned.
 
 Returns
 ---
 :class:`DataFrame`
+A new :class:`DataFrame` containing the combined rows with 
corresponding columns.
 
 See Also
 
@@ -3764,34 +3765,81 @@ class DataFrame(PandasMapOpsMixin, 
PandasConversionMixin):
 
 Notes
 -
-This is equivalent to `UNION ALL` in SQL. To do a SQL-style set union
-(that does deduplication of elements), use this function followed by 
:func:`distinct`.
+This method performs a SQL-style set union of the rows from both 
`DataFrame` objects,
+with no automatic deduplication of elements.
 
-Also as standard in SQL, this function resolves columns by position 
(not by name).
+Use the `distinct()` method to perform deduplication of rows.
+
+The method resolves columns by position (not by name), following the 
standard behavior
+in SQL.
 
 Examples
 
->>> df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
->>> df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])
->>> df1.union(df2).show()
-+----+----+----+
-|col0|col1|col2|
-+----+----+----+
-|   1|   2|   3|
-|   4|   5|   6|
-+----+----+----+
->>> df1.union(df1).show()
-+----+----+----+
-|col0|col1|col2|
-+----+----+----+
-|   1|   2|   3|
-|   1|   2|   3|
-+----+----+----+
+Example 1: Combining two DataFrames with the same schema
+
+>>> df1 = spark.createDataFrame([(1, 'A'), (2, 'B')], ['id', 'value'])
+>>> df2 = spark.createDataFrame([(3, 'C'), (4, 'D')], ['id', 'value'])
+>>> df3 = df1.union(df2)
+>>> df3.show()
++---+-----+
+| id|value|
++---+-----+
+|  1|    A|
+|  2|    B|
+|  3|    C|
+|  4|    D|
++---+-----+
+
+Example 2: Combining two DataFrames with different schemas
+
+>>> from pyspark.sql.functions import lit
+>>> df1 = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["name", 
"id"])
+>>> df2 = spark.createDataFrame([(3, "Charlie"), (4, "Dave")], ["id", 
"name"])
+>>> df1 = df1.withColumn("age", lit(30))
+>>> df2 = df2.withColumn("age", lit(40))
+>>> df3 = df1.union(df2)
+>>> df3.show()
++-----+-------+---+
+| name|     id|age|
++-----+-------+---+
+|Alice|      1| 30|
+|  Bob|      2| 30|
+|    3|Charlie| 40|
+|    4|   Dave| 40|
++-----+-------+---+
+
+Example 3: Combining two DataFrames with mismatched columns
+
+>>> df1 = spark.createDataFrame([(1, 2)], ["A", "B"])
+   

[spark] branch master updated: [SPARK-44831][PYTHON][DOCS] Refine DocString of `DataFrame.{union, unionAll, unionByName}`

2023-08-17 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 48faaa8ee73 [SPARK-44831][PYTHON][DOCS] Refine DocString of 
`DataFrame.{union, unionAll, unionByName}`
48faaa8ee73 is described below

commit 48faaa8ee73d8005d2ed0668b4d0e860fc92ca4d
Author: Ruifeng Zheng 
AuthorDate: Fri Aug 18 08:11:54 2023 +0800

[SPARK-44831][PYTHON][DOCS] Refine DocString of `DataFrame.{union, 
unionAll, unionByName}`

### What changes were proposed in this pull request?
Refine DocString of `Union*`:

1. fix minor grammar mistakes
2. add more examples

### Why are the changes needed?
to improve the docs

### Does this PR introduce _any_ user-facing change?
yes

### How was this patch tested?
CI

Closes #42515 from zhengruifeng/doc_refince_union.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/dataframe.py | 140 ++--
 1 file changed, 105 insertions(+), 35 deletions(-)

diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 8be2c224265..932c29910bb 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -3741,7 +3741,7 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
 )
 
 def union(self, other: "DataFrame") -> "DataFrame":
-"""Return a new :class:`DataFrame` containing union of rows in this 
and another
+"""Return a new :class:`DataFrame` containing the union of rows in 
this and another
 :class:`DataFrame`.
 
 .. versionadded:: 2.0.0
@@ -3752,11 +3752,12 @@ class DataFrame(PandasMapOpsMixin, 
PandasConversionMixin):
 Parameters
 --
 other : :class:`DataFrame`
-Another :class:`DataFrame` that needs to be unioned
+Another :class:`DataFrame` that needs to be unioned.
 
 Returns
 ---
 :class:`DataFrame`
+A new :class:`DataFrame` containing the combined rows with 
corresponding columns.
 
 See Also
 
@@ -3764,34 +3765,81 @@ class DataFrame(PandasMapOpsMixin, 
PandasConversionMixin):
 
 Notes
 -
-This is equivalent to `UNION ALL` in SQL. To do a SQL-style set union
-(that does deduplication of elements), use this function followed by 
:func:`distinct`.
+This method performs a SQL-style set union of the rows from both 
`DataFrame` objects,
+with no automatic deduplication of elements.
 
-Also as standard in SQL, this function resolves columns by position 
(not by name).
+Use the `distinct()` method to perform deduplication of rows.
+
+The method resolves columns by position (not by name), following the 
standard behavior
+in SQL.
 
 Examples
 
->>> df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
->>> df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])
->>> df1.union(df2).show()
-+----+----+----+
-|col0|col1|col2|
-+----+----+----+
-|   1|   2|   3|
-|   4|   5|   6|
-+----+----+----+
->>> df1.union(df1).show()
-+----+----+----+
-|col0|col1|col2|
-+----+----+----+
-|   1|   2|   3|
-|   1|   2|   3|
-+----+----+----+
+Example 1: Combining two DataFrames with the same schema
+
+>>> df1 = spark.createDataFrame([(1, 'A'), (2, 'B')], ['id', 'value'])
+>>> df2 = spark.createDataFrame([(3, 'C'), (4, 'D')], ['id', 'value'])
+>>> df3 = df1.union(df2)
+>>> df3.show()
++---+-----+
+| id|value|
++---+-----+
+|  1|    A|
+|  2|    B|
+|  3|    C|
+|  4|    D|
++---+-----+
+
+Example 2: Combining two DataFrames with different schemas
+
+>>> from pyspark.sql.functions import lit
+>>> df1 = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["name", 
"id"])
+>>> df2 = spark.createDataFrame([(3, "Charlie"), (4, "Dave")], ["id", 
"name"])
+>>> df1 = df1.withColumn("age", lit(30))
+>>> df2 = df2.withColumn("age", lit(40))
+>>> df3 = df1.union(df2)
+>>> df3.show()
++-----+-------+---+
+| name|     id|age|
++-----+-------+---+
+|Alice|      1| 30|
+|  Bob|      2| 30|
+|    3|Charlie| 40|
+|    4|   Dave| 40|
++-----+-------+---+
+
+Example 3: Combining two DataFrames with mismatched columns
+
+>>> df1 = spark.createDataFrame([(1, 2)], ["A", "B"])
+>>> df2 = spark.createDataFrame([(3, 4)], ["C", "D"])
+>>> df3 = df1.union(df2)
+>>> df3.show()
+ 

[spark] branch branch-3.4 updated: [SPARK-44859][SS] Fix incorrect property name in structured streaming doc

2023-08-17 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 6c1107e449e [SPARK-44859][SS] Fix incorrect property name in 
structured streaming doc
6c1107e449e is described below

commit 6c1107e449edcb80fe6fc1d29c95c850aab587aa
Author: Liang-Chi Hsieh 
AuthorDate: Thu Aug 17 15:52:00 2023 -0700

[SPARK-44859][SS] Fix incorrect property name in structured streaming doc

### What changes were proposed in this pull request?

We found that one structured streaming property for asynchronous progress tracking does not match the codebase.
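
For context, a hedged sketch of where the corrected option name is set; `df` is assumed to be a streaming DataFrame, and the sink, servers, and paths below are illustrative only (see the guide's Limitations section for supported sinks):

```python
query = (
    df.writeStream
      .format("kafka")                                  # illustrative sink
      .option("kafka.bootstrap.servers", "host:9092")   # illustrative
      .option("topic", "out")                           # illustrative
      .option("checkpointLocation", "/tmp/checkpoint")  # illustrative
      .option("asyncProgressTrackingEnabled", "true")
      # Corrected name; the doc previously said asyncProgressCheckpointingInterval.
      .option("asyncProgressTrackingCheckpointIntervalMs", "1000")
      .start()
)
```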

### Why are the changes needed?

Fix incorrect property name in structured streaming document.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

None, doc change only.

Closes #42544 from viirya/minor_doc_fix.

Authored-by: Liang-Chi Hsieh 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 07f70227f8bf81928d98101a88fd2885784451f5)
Signed-off-by: Dongjoon Hyun 
---
 docs/structured-streaming-programming-guide.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/structured-streaming-programming-guide.md 
b/docs/structured-streaming-programming-guide.md
index 206b887ae74..bf8302b82de 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -3714,7 +3714,7 @@ The table below describes the configurations for this 
feature and default values
 | Option| Value   | Default | Description   | 
 |-|-||-|
 |asyncProgressTrackingEnabled|true/false|false|enable or disable asynchronous 
progress tracking|
-|asyncProgressCheckpointingInterval|minutes|1|the interval in which we commit 
offsets and completion commits|
+|asyncProgressTrackingCheckpointIntervalMs|millisecond|1000|the interval in 
which we commit offsets and completion commits|
 
 ## Limitations
 The initial version of the feature has the following limitations:
@@ -3735,7 +3735,7 @@ Also the following error message may be printed in the 
driver logs:
 The offset log for batch x doesn't exist, which is required to restart the 
query from the latest batch x from the offset log. Please ensure there are two 
subsequent offset logs available for the latest batch via manually deleting the 
offset file(s). Please also ensure the latest batch for commit log is equal or 
one batch earlier than the latest batch for offset log.
 ```
 
-This is caused by the fact that when async progress tracking is enabled, the 
framework will not checkpoint progress for every batch as would be done if 
async progress tracking is not used. To solve this problem simply re-enable 
“asyncProgressTrackingEnabled” and set “asyncProgressCheckpointingInterval” to 
0 and run the streaming query until at least two micro-batches have been 
processed. Async progress tracking can be now safely disabled and restarting 
query should proceed normally.
+This is caused by the fact that when async progress tracking is enabled, the 
framework will not checkpoint progress for every batch as would be done if 
async progress tracking is not used. To solve this problem simply re-enable 
“asyncProgressTrackingEnabled” and set 
“asyncProgressTrackingCheckpointIntervalMs” to 0 and run the streaming query 
until at least two micro-batches have been processed. Async progress tracking 
can be now safely disabled and restarting query should proceed normally.
 
 # Continuous Processing
 ## [Experimental]





[spark] branch branch-3.5 updated: [SPARK-44859][SS] Fix incorrect property name in structured streaming doc

2023-08-17 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new baa891f3af1 [SPARK-44859][SS] Fix incorrect property name in 
structured streaming doc
baa891f3af1 is described below

commit baa891f3af1fae09198971b253ceb9d6ff11ed5d
Author: Liang-Chi Hsieh 
AuthorDate: Thu Aug 17 15:52:00 2023 -0700

[SPARK-44859][SS] Fix incorrect property name in structured streaming doc

### What changes were proposed in this pull request?

We found that one structured streaming property for asynchronous progress tracking does not match the codebase.

### Why are the changes needed?

Fix incorrect property name in structured streaming document.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

None, doc change only.

Closes #42544 from viirya/minor_doc_fix.

Authored-by: Liang-Chi Hsieh 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 07f70227f8bf81928d98101a88fd2885784451f5)
Signed-off-by: Dongjoon Hyun 
---
 docs/structured-streaming-programming-guide.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/structured-streaming-programming-guide.md 
b/docs/structured-streaming-programming-guide.md
index 53d5919d4dc..dc25adbdfd3 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -4093,7 +4093,7 @@ The table below describes the configurations for this 
feature and default values
 | Option| Value   | Default | Description   |
 |-|-||-|
 |asyncProgressTrackingEnabled|true/false|false|enable or disable asynchronous 
progress tracking|
-|asyncProgressCheckpointingInterval|minutes|1|the interval in which we commit 
offsets and completion commits|
+|asyncProgressTrackingCheckpointIntervalMs|millisecond|1000|the interval in 
which we commit offsets and completion commits|
 
 ## Limitations
 The initial version of the feature has the following limitations:
@@ -4114,7 +4114,7 @@ Also the following error message may be printed in the 
driver logs:
 The offset log for batch x doesn't exist, which is required to restart the 
query from the latest batch x from the offset log. Please ensure there are two 
subsequent offset logs available for the latest batch via manually deleting the 
offset file(s). Please also ensure the latest batch for commit log is equal or 
one batch earlier than the latest batch for offset log.
 ```
 
-This is caused by the fact that when async progress tracking is enabled, the 
framework will not checkpoint progress for every batch as would be done if 
async progress tracking is not used. To solve this problem simply re-enable 
“asyncProgressTrackingEnabled” and set “asyncProgressCheckpointingInterval” to 
0 and run the streaming query until at least two micro-batches have been 
processed. Async progress tracking can be now safely disabled and restarting 
query should proceed normally.
+This is caused by the fact that when async progress tracking is enabled, the 
framework will not checkpoint progress for every batch as would be done if 
async progress tracking is not used. To solve this problem simply re-enable 
“asyncProgressTrackingEnabled” and set 
“asyncProgressTrackingCheckpointIntervalMs” to 0 and run the streaming query 
until at least two micro-batches have been processed. Async progress tracking 
can be now safely disabled and restarting query should proceed normally.
 
 # Continuous Processing
 ## [Experimental]





[spark] branch master updated: [SPARK-44859][SS] Fix incorrect property name in structured streaming doc

2023-08-17 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 07f70227f8b [SPARK-44859][SS] Fix incorrect property name in 
structured streaming doc
07f70227f8b is described below

commit 07f70227f8bf81928d98101a88fd2885784451f5
Author: Liang-Chi Hsieh 
AuthorDate: Thu Aug 17 15:52:00 2023 -0700

[SPARK-44859][SS] Fix incorrect property name in structured streaming doc

### What changes were proposed in this pull request?

We found that one structured streaming property for asynchronous progress tracking does not match the codebase.

### Why are the changes needed?

Fix incorrect property name in structured streaming document.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

None, doc change only.

Closes #42544 from viirya/minor_doc_fix.

Authored-by: Liang-Chi Hsieh 
Signed-off-by: Dongjoon Hyun 
---
 docs/structured-streaming-programming-guide.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/structured-streaming-programming-guide.md 
b/docs/structured-streaming-programming-guide.md
index 53d5919d4dc..dc25adbdfd3 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -4093,7 +4093,7 @@ The table below describes the configurations for this 
feature and default values
 | Option| Value   | Default | Description   |
 |-|-||-|
 |asyncProgressTrackingEnabled|true/false|false|enable or disable asynchronous 
progress tracking|
-|asyncProgressCheckpointingInterval|minutes|1|the interval in which we commit 
offsets and completion commits|
+|asyncProgressTrackingCheckpointIntervalMs|millisecond|1000|the interval in 
which we commit offsets and completion commits|
 
 ## Limitations
 The initial version of the feature has the following limitations:
@@ -4114,7 +4114,7 @@ Also the following error message may be printed in the 
driver logs:
 The offset log for batch x doesn't exist, which is required to restart the 
query from the latest batch x from the offset log. Please ensure there are two 
subsequent offset logs available for the latest batch via manually deleting the 
offset file(s). Please also ensure the latest batch for commit log is equal or 
one batch earlier than the latest batch for offset log.
 ```
 
-This is caused by the fact that when async progress tracking is enabled, the 
framework will not checkpoint progress for every batch as would be done if 
async progress tracking is not used. To solve this problem simply re-enable 
“asyncProgressTrackingEnabled” and set “asyncProgressCheckpointingInterval” to 
0 and run the streaming query until at least two micro-batches have been 
processed. Async progress tracking can be now safely disabled and restarting 
query should proceed normally.
+This is caused by the fact that when async progress tracking is enabled, the 
framework will not checkpoint progress for every batch as would be done if 
async progress tracking is not used. To solve this problem simply re-enable 
“asyncProgressTrackingEnabled” and set 
“asyncProgressTrackingCheckpointIntervalMs” to 0 and run the streaming query 
until at least two micro-batches have been processed. Async progress tracking 
can be now safely disabled and restarting query should proceed normally.
 
 # Continuous Processing
 ## [Experimental]





[spark] branch master updated (be04ac1ace9 -> 6771d9d757f)

2023-08-17 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from be04ac1ace9 [SPARK-44834][PYTHON][SQL][TESTS] Add SQL query tests for 
Python UDTFs
 add 6771d9d757f [SPARK-44433][3.5X] Terminate foreach batch runner when 
streaming query terminates

No new revisions were added by this update.

Summary of changes:
 .../sql/connect/planner/SparkConnectPlanner.scala  |  37 +--
 .../planner/StreamingForeachBatchHelper.scala  | 109 ++---
 .../spark/sql/connect/service/SessionHolder.scala  |   9 +-
 .../service/SparkConnectStreamingQueryCache.scala  |  21 ++--
 .../planner/StreamingForeachBatchHelperSuite.scala |  80 +++
 .../spark/api/python/PythonWorkerFactory.scala |   2 +-
 .../spark/api/python/StreamingPythonRunner.scala   |   9 +-
 7 files changed, 231 insertions(+), 36 deletions(-)
 create mode 100644 
connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/StreamingForeachBatchHelperSuite.scala





[spark] branch branch-3.5 updated: [SPARK-44834][PYTHON][SQL][TESTS] Add SQL query tests for Python UDTFs

2023-08-17 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new e473d892bf3 [SPARK-44834][PYTHON][SQL][TESTS] Add SQL query tests for 
Python UDTFs
e473d892bf3 is described below

commit e473d892bf3c9b5d4ff6bdb192553b44b2277279
Author: allisonwang-db 
AuthorDate: Thu Aug 17 10:38:01 2023 -0700

[SPARK-44834][PYTHON][SQL][TESTS] Add SQL query tests for Python UDTFs

### What changes were proposed in this pull request?

This PR adds a new SQL query test suite for running Python UDTFs in SQL. You can trigger the test using
```
 SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite 
-- -z udtf/udtf.sql"
```
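
For context, a hedged PySpark sketch of the kind of Python UDTF these SQL tests exercise; the real `TestUDTF` lives in `IntegratedUDFTestUtils`, and the class below is illustrative only:

```python
from pyspark.sql.functions import udtf

@udtf(returnType="x: int, y: int")
class SimpleUDTF:
    def eval(self, a: int, b: int):
        # Emit a single row built from the two inputs.
        yield a, b

spark.udtf.register("udtf", SimpleUDTF)
spark.sql("SELECT * FROM udtf(1, 2)").show()
```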

### Why are the changes needed?

To add more test cases for Python UDTFs.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added new golden file tests.

Closes #42517 from allisonwang-db/spark-44834-udtf-sql-test.

Authored-by: allisonwang-db 
Signed-off-by: Takuya UESHIN 
(cherry picked from commit be04ac1ace91f6da34b08a1510e41d3ab6f0377b)
Signed-off-by: Takuya UESHIN 
---
 .../sql-tests/analyzer-results/udtf/udtf.sql.out   | 96 ++
 .../test/resources/sql-tests/inputs/udtf/udtf.sql  | 18 
 .../resources/sql-tests/results/udtf/udtf.sql.out  | 85 +++
 .../apache/spark/sql/IntegratedUDFTestUtils.scala  | 40 +
 .../org/apache/spark/sql/SQLQueryTestSuite.scala   | 28 +++
 .../thriftserver/ThriftServerQueryTestSuite.scala  |  2 +
 6 files changed, 269 insertions(+)

diff --git 
a/sql/core/src/test/resources/sql-tests/analyzer-results/udtf/udtf.sql.out 
b/sql/core/src/test/resources/sql-tests/analyzer-results/udtf/udtf.sql.out
new file mode 100644
index 000..acf96794378
--- /dev/null
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/udtf/udtf.sql.out
@@ -0,0 +1,96 @@
+-- Automatically generated by SQLQueryTestSuite
+-- !query
+CREATE OR REPLACE TEMPORARY VIEW t1 AS VALUES (0, 1), (1, 2) t(c1, c2)
+-- !query analysis
+CreateViewCommand `t1`, VALUES (0, 1), (1, 2) t(c1, c2), false, true, 
LocalTempView, true
+   +- SubqueryAlias t
+  +- LocalRelation [c1#x, c2#x]
+
+
+-- !query
+SELECT * FROM udtf(1, 2)
+-- !query analysis
+Project [x#x, y#x]
++- Generate TestUDTF(1, 2)#x, false, [x#x, y#x]
+   +- OneRowRelation
+
+
+-- !query
+SELECT * FROM udtf(-1, 0)
+-- !query analysis
+Project [x#x, y#x]
++- Generate TestUDTF(-1, 0)#x, false, [x#x, y#x]
+   +- OneRowRelation
+
+
+-- !query
+SELECT * FROM udtf(0, -1)
+-- !query analysis
+Project [x#x, y#x]
++- Generate TestUDTF(0, -1)#x, false, [x#x, y#x]
+   +- OneRowRelation
+
+
+-- !query
+SELECT * FROM udtf(0, 0)
+-- !query analysis
+Project [x#x, y#x]
++- Generate TestUDTF(0, 0)#x, false, [x#x, y#x]
+   +- OneRowRelation
+
+
+-- !query
+SELECT a, b FROM udtf(1, 2) t(a, b)
+-- !query analysis
+Project [a#x, b#x]
++- SubqueryAlias t
+   +- Project [x#x AS a#x, y#x AS b#x]
+  +- Generate TestUDTF(1, 2)#x, false, [x#x, y#x]
+ +- OneRowRelation
+
+
+-- !query
+SELECT * FROM t1, LATERAL udtf(c1, c2)
+-- !query analysis
+Project [c1#x, c2#x, x#x, y#x]
++- LateralJoin lateral-subquery#x [c1#x && c2#x], Inner
+   :  +- Generate TestUDTF(outer(c1#x), outer(c2#x))#x, false, [x#x, y#x]
+   : +- OneRowRelation
+   +- SubqueryAlias t1
+  +- View (`t1`, [c1#x,c2#x])
+ +- Project [cast(c1#x as int) AS c1#x, cast(c2#x as int) AS c2#x]
++- SubqueryAlias t
+   +- LocalRelation [c1#x, c2#x]
+
+
+-- !query
+SELECT * FROM t1 LEFT JOIN LATERAL udtf(c1, c2)
+-- !query analysis
+Project [c1#x, c2#x, x#x, y#x]
++- LateralJoin lateral-subquery#x [c1#x && c2#x], LeftOuter
+   :  +- Generate TestUDTF(outer(c1#x), outer(c2#x))#x, false, [x#x, y#x]
+   : +- OneRowRelation
+   +- SubqueryAlias t1
+  +- View (`t1`, [c1#x,c2#x])
+ +- Project [cast(c1#x as int) AS c1#x, cast(c2#x as int) AS c2#x]
++- SubqueryAlias t
+   +- LocalRelation [c1#x, c2#x]
+
+
+-- !query
+SELECT * FROM udtf(1, 2) t(c1, c2), LATERAL udtf(c1, c2)
+-- !query analysis
+Project [c1#x, c2#x, x#x, y#x]
++- LateralJoin lateral-subquery#x [c1#x && c2#x], Inner
+   :  +- Generate TestUDTF(outer(c1#x), outer(c2#x))#x, false, [x#x, y#x]
+   : +- OneRowRelation
+   +- SubqueryAlias t
+  +- Project [x#x AS c1#x, y#x AS c2#x]
+ +- Generate TestUDTF(1, 2)#x, false, [x#x, y#x]
++- OneRowRelation
+
+
+-- !query
+SELECT * FROM udtf(cast(rand(0) AS int) + 1, 1)
+-- !query analysis
+[Analyzer test output redacted due to nondeterminism]
diff --git a/sql/core/src/test/resources/sql-tests/inputs/udtf/udtf.sql 
b/sql/core/src/test/resources/sql-tests/inputs/udtf/udtf.sql
new file mode 100644
index 

[spark] branch master updated (047b2247879 -> be04ac1ace9)

2023-08-17 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 047b2247879 [SPARK-44849] Expose 
SparkConnectExecutionManager.listActiveExecutions
 add be04ac1ace9 [SPARK-44834][PYTHON][SQL][TESTS] Add SQL query tests for 
Python UDTFs

No new revisions were added by this update.

Summary of changes:
 .../sql-tests/analyzer-results/udtf/udtf.sql.out   | 96 ++
 .../test/resources/sql-tests/inputs/udtf/udtf.sql  | 18 
 .../resources/sql-tests/results/udtf/udtf.sql.out  | 85 +++
 .../apache/spark/sql/IntegratedUDFTestUtils.scala  | 40 +
 .../org/apache/spark/sql/SQLQueryTestSuite.scala   | 28 +++
 .../thriftserver/ThriftServerQueryTestSuite.scala  |  2 +
 6 files changed, 269 insertions(+)
 create mode 100644 
sql/core/src/test/resources/sql-tests/analyzer-results/udtf/udtf.sql.out
 create mode 100644 sql/core/src/test/resources/sql-tests/inputs/udtf/udtf.sql
 create mode 100644 
sql/core/src/test/resources/sql-tests/results/udtf/udtf.sql.out





[spark] branch branch-3.5 updated: [SPARK-44849] Expose SparkConnectExecutionManager.listActiveExecutions

2023-08-17 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 893940db795 [SPARK-44849] Expose 
SparkConnectExecutionManager.listActiveExecutions
893940db795 is described below

commit 893940db79500d7faf3263cd75b96a814aa0f279
Author: Juliusz Sompolski 
AuthorDate: Thu Aug 17 17:49:28 2023 +0200

[SPARK-44849] Expose SparkConnectExecutionManager.listActiveExecutions

### What changes were proposed in this pull request?

Make list of active executions accessible via 
SparkConnectService.listActiveExecutions

### Why are the changes needed?

Some internal components outside `connect` would like to have access to 
that. Add a method to expose it.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI

Closes #42535 from juliuszsompolski/SPARK-44849.

Authored-by: Juliusz Sompolski 
Signed-off-by: Herman van Hovell 
(cherry picked from commit 047b2247879cc3f5fd9b78366a73edbf62994811)
Signed-off-by: Herman van Hovell 
---
 .../org/apache/spark/sql/connect/service/SparkConnectService.scala  | 6 ++
 1 file changed, 6 insertions(+)

diff --git 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectService.scala
 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectService.scala
index 15a7782c367..e8af2acfd2e 100644
--- 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectService.scala
+++ 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectService.scala
@@ -347,6 +347,12 @@ object SparkConnectService extends Logging {
 userSessionMapping.get((userId, sessionId), default)
   }
 
+  /**
+   * If there are no executions, return Left with System.currentTimeMillis of 
last active
+   * execution. Otherwise return Right with list of ExecuteInfo of all 
executions.
+   */
+  def listActiveExecutions: Either[Long, Seq[ExecuteInfo]] = 
executionManager.listActiveExecutions
+
   /**
* Used for testing
*/





[spark] branch master updated: [SPARK-44849] Expose SparkConnectExecutionManager.listActiveExecutions

2023-08-17 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 047b2247879 [SPARK-44849] Expose 
SparkConnectExecutionManager.listActiveExecutions
047b2247879 is described below

commit 047b2247879cc3f5fd9b78366a73edbf62994811
Author: Juliusz Sompolski 
AuthorDate: Thu Aug 17 17:49:28 2023 +0200

[SPARK-44849] Expose SparkConnectExecutionManager.listActiveExecutions

### What changes were proposed in this pull request?

Make list of active executions accessible via 
SparkConnectService.listActiveExecutions

### Why are the changes needed?

Some internal components outside `connect` would like to have access to 
that. Add a method to expose it.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI

Closes #42535 from juliuszsompolski/SPARK-44849.

Authored-by: Juliusz Sompolski 
Signed-off-by: Herman van Hovell 
---
 .../org/apache/spark/sql/connect/service/SparkConnectService.scala  | 6 ++
 1 file changed, 6 insertions(+)

diff --git 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectService.scala
 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectService.scala
index fe773e4b704..269e47609db 100644
--- 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectService.scala
+++ 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectService.scala
@@ -348,6 +348,12 @@ object SparkConnectService extends Logging {
 userSessionMapping.get((userId, sessionId), default)
   }
 
+  /**
+   * If there are no executions, return Left with System.currentTimeMillis of 
last active
+   * execution. Otherwise return Right with list of ExecuteInfo of all 
executions.
+   */
+  def listActiveExecutions: Either[Long, Seq[ExecuteInfo]] = 
executionManager.listActiveExecutions
+
   /**
* Used for testing
*/





[spark] branch branch-3.5 updated: [SPARK-44806][CONNECT] Move internal client spark-connect-common to be able to test real in-process server with a real RPC client

2023-08-17 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 15bb95564b6 [SPARK-44806][CONNECT] Move internal client 
spark-connect-common to be able to test real in-process server with a real RPC 
client
15bb95564b6 is described below

commit 15bb95564b6f2a3e76996062155ea029438ab4bb
Author: Juliusz Sompolski 
AuthorDate: Thu Aug 17 17:05:58 2023 +0200

[SPARK-44806][CONNECT] Move internal client spark-connect-common to be able 
to test real in-process server with a real RPC client

### What changes were proposed in this pull request?

Currently, Spark Connect has the following types of integration tests:
* Client unit tests with a mock in-process server (DummySparkConnectService 
with GRPC InProcessServerBuilder)
* Client unit tests with a mock RPC server (DummySparkConnectService with 
GRPC NettyServerBuilder)
* Server unit tests with in-process server and using various mocks
* E2E tests of real client with a server started in another process (using 
RemoteSparkSession)

What is lacking are E2E tests with an in-process server (so that server 
state can be inspected and asserted) and a real RPC client. This is impossible, 
because classes from the `spark-connect-client-jvm` module include the client API, 
which duplicates the Spark SQL APIs of SparkSession, Dataset, etc. When trying to 
pull a real client into the server module for testing, these classes clash.

Move the `org.apache.spark.sql.connect.client` code into the 
`spark-connect-common` module, so that the internal SparkConnectClient code is 
separated from the client's public API and can be pulled into the server's tests. 
The only class kept in `spark-connect-client-jvm` is `AmmoniteClassFinder`, 
to avoid pulling the Ammonite dependency into the common module.

An alternative approach was tried in https://github.com/apache/spark/pull/42465. 
It doesn't work, because it also reorders the Maven build so that the client is 
built before the server, while the client actually requires the server to be built 
first in order to use tests with `RemoteSparkSession`.
Another alternative, depending on a shaded/relocated version of the client, was 
tried in https://github.com/apache/spark/pull/42461, but that is simply not 
possible in either Maven or sbt.
A third alternative, creating a client-jvm-internal module, was tried in 
https://github.com/apache/spark/pull/42441; reviewers preferred moving things to 
connect-common over introducing a new module.
The code was moved together with its tests in 
https://github.com/apache/spark/pull/42501, but moving the tests isn't really needed.

### Why are the changes needed?

To be able to use the internal client for testing an in-process server with an 
in-process client, while communicating over real RPC.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

All modules build, and their tests pass.

Closes #42523 from juliuszsompolski/sc-client-common-mainonly.

Authored-by: Juliusz Sompolski 
Signed-off-by: Herman van Hovell 
(cherry picked from commit 72a466835a4490257eec0c9af2bbc9291c46de1e)
Signed-off-by: Herman van Hovell 
---
 ...ClassFinder.scala => AmmoniteClassFinder.scala} | 41 ++
 connector/connect/common/pom.xml   | 12 +++
 .../client/arrow/ScalaCollectionUtils.scala|  0
 .../client/arrow/ScalaCollectionUtils.scala|  0
 .../spark/sql/connect/client/ArtifactManager.scala |  0
 .../spark/sql/connect/client/ClassFinder.scala | 27 +-
 .../sql/connect/client/CloseableIterator.scala |  0
 .../client/CustomSparkConnectBlockingStub.scala|  0
 .../connect/client/CustomSparkConnectStub.scala|  0
 .../ExecutePlanResponseReattachableIterator.scala  |  0
 .../connect/client/GrpcExceptionConverter.scala|  2 +-
 .../sql/connect/client/GrpcRetryHandler.scala  |  0
 .../sql/connect/client/SparkConnectClient.scala|  0
 .../connect/client/SparkConnectClientParser.scala  |  0
 .../spark/sql/connect/client/SparkResult.scala |  0
 .../connect/client/arrow/ArrowDeserializer.scala   |  0
 .../connect/client/arrow/ArrowEncoderUtils.scala   |  0
 .../sql/connect/client/arrow/ArrowSerializer.scala |  0
 .../connect/client/arrow/ArrowVectorReader.scala   |  0
 .../arrow/ConcatenatingArrowStreamReader.scala |  0
 .../apache/spark/sql/connect/client/package.scala  |  0
 .../spark/sql/connect/client/util/Cleaner.scala|  0
 22 files changed, 17 insertions(+), 65 deletions(-)

diff --git 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/ClassFinder.scala
 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/AmmoniteClassFinder.scala
similarity index 58%
copy 

[spark] branch master updated (026aa4fdbdc -> 72a466835a4)

2023-08-17 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 026aa4fdbdc [SPARK-43462][SPARK-43871][PS][TESTS] Enable 
`SeriesDateTimeTests` for pandas 2.0.0 and above
 add 72a466835a4 [SPARK-44806][CONNECT] Move internal client 
spark-connect-common to be able to test real in-process server with a real RPC 
client

No new revisions were added by this update.

Summary of changes:
 ...ClassFinder.scala => AmmoniteClassFinder.scala} | 41 ++
 connector/connect/common/pom.xml   | 12 +++
 .../client/arrow/ScalaCollectionUtils.scala|  0
 .../client/arrow/ScalaCollectionUtils.scala|  0
 .../spark/sql/connect/client/ArtifactManager.scala |  0
 .../spark/sql/connect/client/ClassFinder.scala | 27 +-
 .../sql/connect/client/CloseableIterator.scala |  0
 .../client/CustomSparkConnectBlockingStub.scala|  0
 .../connect/client/CustomSparkConnectStub.scala|  0
 .../ExecutePlanResponseReattachableIterator.scala  |  0
 .../connect/client/GrpcExceptionConverter.scala|  2 +-
 .../sql/connect/client/GrpcRetryHandler.scala  |  0
 .../sql/connect/client/SparkConnectClient.scala|  0
 .../connect/client/SparkConnectClientParser.scala  |  0
 .../spark/sql/connect/client/SparkResult.scala |  0
 .../connect/client/arrow/ArrowDeserializer.scala   |  0
 .../connect/client/arrow/ArrowEncoderUtils.scala   |  0
 .../sql/connect/client/arrow/ArrowSerializer.scala |  0
 .../connect/client/arrow/ArrowVectorReader.scala   |  0
 .../arrow/ConcatenatingArrowStreamReader.scala |  0
 .../apache/spark/sql/connect/client/package.scala  |  0
 .../spark/sql/connect/client/util/Cleaner.scala|  0
 22 files changed, 17 insertions(+), 65 deletions(-)
 copy 
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/{ClassFinder.scala
 => AmmoniteClassFinder.scala} (58%)
 rename connector/connect/{client/jvm => 
common}/src/main/scala-2.12/org/apache/spark/sql/connect/client/arrow/ScalaCollectionUtils.scala
 (100%)
 rename connector/connect/{client/jvm => 
common}/src/main/scala-2.13/org/apache/spark/sql/connect/client/arrow/ScalaCollectionUtils.scala
 (100%)
 rename connector/connect/{client/jvm => 
common}/src/main/scala/org/apache/spark/sql/connect/client/ArtifactManager.scala
 (100%)
 rename connector/connect/{client/jvm => 
common}/src/main/scala/org/apache/spark/sql/connect/client/ClassFinder.scala 
(66%)
 rename connector/connect/{client/jvm => 
common}/src/main/scala/org/apache/spark/sql/connect/client/CloseableIterator.scala
 (100%)
 rename connector/connect/{client/jvm => 
common}/src/main/scala/org/apache/spark/sql/connect/client/CustomSparkConnectBlockingStub.scala
 (100%)
 rename connector/connect/{client/jvm => 
common}/src/main/scala/org/apache/spark/sql/connect/client/CustomSparkConnectStub.scala
 (100%)
 rename connector/connect/{client/jvm => 
common}/src/main/scala/org/apache/spark/sql/connect/client/ExecutePlanResponseReattachableIterator.scala
 (100%)
 rename connector/connect/{client/jvm => 
common}/src/main/scala/org/apache/spark/sql/connect/client/GrpcExceptionConverter.scala
 (98%)
 rename connector/connect/{client/jvm => 
common}/src/main/scala/org/apache/spark/sql/connect/client/GrpcRetryHandler.scala
 (100%)
 rename connector/connect/{client/jvm => 
common}/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClient.scala
 (100%)
 rename connector/connect/{client/jvm => 
common}/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClientParser.scala
 (100%)
 rename connector/connect/{client/jvm => 
common}/src/main/scala/org/apache/spark/sql/connect/client/SparkResult.scala 
(100%)
 rename connector/connect/{client/jvm => 
common}/src/main/scala/org/apache/spark/sql/connect/client/arrow/ArrowDeserializer.scala
 (100%)
 rename connector/connect/{client/jvm => 
common}/src/main/scala/org/apache/spark/sql/connect/client/arrow/ArrowEncoderUtils.scala
 (100%)
 rename connector/connect/{client/jvm => 
common}/src/main/scala/org/apache/spark/sql/connect/client/arrow/ArrowSerializer.scala
 (100%)
 rename connector/connect/{client/jvm => 
common}/src/main/scala/org/apache/spark/sql/connect/client/arrow/ArrowVectorReader.scala
 (100%)
 rename connector/connect/{client/jvm => 
common}/src/main/scala/org/apache/spark/sql/connect/client/arrow/ConcatenatingArrowStreamReader.scala
 (100%)
 rename connector/connect/{client/jvm => 
common}/src/main/scala/org/apache/spark/sql/connect/client/package.scala (100%)
 rename connector/connect/{client/jvm => 
common}/src/main/scala/org/apache/spark/sql/connect/client/util/Cleaner.scala 
(100%)





[spark] branch master updated: [SPARK-43462][SPARK-43871][PS][TESTS] Enable `SeriesDateTimeTests` for pandas 2.0.0 and above

2023-08-17 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 026aa4fdbdc [SPARK-43462][SPARK-43871][PS][TESTS] Enable 
`SeriesDateTimeTests` for pandas 2.0.0 and above
026aa4fdbdc is described below

commit 026aa4fdbdcad218d29d047f99852b64e12939b4
Author: itholic 
AuthorDate: Thu Aug 17 18:02:49 2023 +0800

[SPARK-43462][SPARK-43871][PS][TESTS] Enable `SeriesDateTimeTests` for 
pandas 2.0.0 and above

### What changes were proposed in this pull request?

This PR proposes to enable `SeriesDateTimeTests`.

### Why are the changes needed?

To increase the test coverage with pandas 2.0.0.

### Does this PR introduce _any_ user-facing change?

No, it's test-only.

### How was this patch tested?

Enabling & updating the existing tests.

Closes #42527 from itholic/pandas_series_dt_test.

Authored-by: itholic 
Signed-off-by: Ruifeng Zheng 
---
 .../pyspark/pandas/tests/test_series_datetime.py   | 62 ++
 1 file changed, 3 insertions(+), 59 deletions(-)

diff --git a/python/pyspark/pandas/tests/test_series_datetime.py 
b/python/pyspark/pandas/tests/test_series_datetime.py
index 918176b634b..7e05364ca5f 100644
--- a/python/pyspark/pandas/tests/test_series_datetime.py
+++ b/python/pyspark/pandas/tests/test_series_datetime.py
@@ -116,27 +116,23 @@ class SeriesDateTimeTestsMixin:
 self.assertRaisesRegex(TypeError, expected_err_msg, lambda: psser - 
other)
 self.assertRaises(NotImplementedError, lambda: py_datetime - psser)
 
-@unittest.skipIf(
-LooseVersion(pd.__version__) >= LooseVersion("2.0.0"),
-"TODO(SPARK-43462): Enable SeriesDateTimeTests.test_date_subtraction 
for pandas 2.0.0.",
-)
 def test_date_subtraction(self):
 pdf = self.pdf1
 psdf = ps.from_pandas(pdf)
 
 self.assert_eq(
 psdf["end_date"].dt.date - psdf["start_date"].dt.date,
-(pdf["end_date"].dt.date - pdf["start_date"].dt.date).dt.days,
+(pdf["end_date"].dt.date - pdf["start_date"].dt.date).apply(lambda 
x: x.days),
 )
 
 self.assert_eq(
 psdf["end_date"].dt.date - datetime.date(2012, 1, 1),
-(pdf["end_date"].dt.date - datetime.date(2012, 1, 1)).dt.days,
+(pdf["end_date"].dt.date - datetime.date(2012, 1, 1)).apply(lambda 
x: x.days),
 )
 
 self.assert_eq(
 datetime.date(2013, 3, 11) - psdf["start_date"].dt.date,
-(datetime.date(2013, 3, 11) - pdf["start_date"].dt.date).dt.days,
+(datetime.date(2013, 3, 11) - 
pdf["start_date"].dt.date).apply(lambda x: x.days),
 )
 
 psdf = ps.DataFrame(
@@ -176,52 +172,24 @@ class SeriesDateTimeTestsMixin:
 with self.assertRaises(NotImplementedError):
 self.check_func(lambda x: x.dt.timetz)
 
-@unittest.skipIf(
-LooseVersion(pd.__version__) >= LooseVersion("2.0.0"),
-"TODO(SPARK-43736): Enable SeriesDateTimeTests.test_year for pandas 
2.0.0.",
-)
 def test_year(self):
 self.check_func(lambda x: x.dt.year)
 
-@unittest.skipIf(
-LooseVersion(pd.__version__) >= LooseVersion("2.0.0"),
-"TODO(SPARK-43731): Enable SeriesDateTimeTests.test_month for pandas 
2.0.0.",
-)
 def test_month(self):
 self.check_func(lambda x: x.dt.month)
 
-@unittest.skipIf(
-LooseVersion(pd.__version__) >= LooseVersion("2.0.0"),
-"TODO(SPARK-43722): Enable SeriesDateTimeTests.test_day for pandas 
2.0.0.",
-)
 def test_day(self):
 self.check_func(lambda x: x.dt.day)
 
-@unittest.skipIf(
-LooseVersion(pd.__version__) >= LooseVersion("2.0.0"),
-"TODO(SPARK-43728): Enable SeriesDateTimeTests.test_hour for pandas 
2.0.0.",
-)
 def test_hour(self):
 self.check_func(lambda x: x.dt.hour)
 
-@unittest.skipIf(
-LooseVersion(pd.__version__) >= LooseVersion("2.0.0"),
-"TODO(SPARK-43730): Enable SeriesDateTimeTests.test_minute for pandas 
2.0.0.",
-)
 def test_minute(self):
 self.check_func(lambda x: x.dt.minute)
 
-@unittest.skipIf(
-LooseVersion(pd.__version__) >= LooseVersion("2.0.0"),
-"TODO(SPARK-43733): Enable SeriesDateTimeTests.test_second for pandas 
2.0.0.",
-)
 def test_second(self):
 self.check_func(lambda x: x.dt.second)
 
-@unittest.skipIf(
-LooseVersion(pd.__version__) >= LooseVersion("2.0.0"),
-"TODO(SPARK-43729): Enable SeriesDateTimeTests.test_microsecond for 
pandas 2.0.0.",
-)
 def test_microsecond(self):
 self.check_func(lambda x: x.dt.microsecond)
 
@@ -243,31 +211,15 @@ class SeriesDateTimeTestsMixin:
 def test_weekofyear(self):
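
The test_date_subtraction change above replaces `.dt.days` with 
`.apply(lambda x: x.days)`: subtracting `.dt.date` columns produces an 
object-dtype Series of datetime.timedelta values on pandas 2.x (per the updated 
test), which has no `.dt` accessor. A small stand-alone sketch of the pattern, 
assuming pandas 2.x is installed:

    import pandas as pd

    pdf = pd.DataFrame(
        {
            "start_date": pd.to_datetime(["2012-01-01", "2012-02-01"]),
            "end_date": pd.to_datetime(["2013-03-11", "2013-04-11"]),
        }
    )

    # .dt.date yields object-dtype columns of datetime.date values.
    diff = pdf["end_date"].dt.date - pdf["start_date"].dt.date

    # Extract the day counts element-wise instead of via the .dt accessor.
    days = diff.apply(lambda x: x.days)
    print(days)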
 

[spark] branch master updated: [SPARK-44841][PS] Support `value_counts` for pandas 2.0.0 and above

2023-08-17 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a960e71905e [SPARK-44841][PS] Support `value_counts` for pandas 2.0.0 
and above
a960e71905e is described below

commit a960e71905e35aee4b7152baec587b23c9183694
Author: itholic 
AuthorDate: Thu Aug 17 16:50:38 2023 +0800

[SPARK-44841][PS] Support `value_counts` for pandas 2.0.0 and above

### What changes were proposed in this pull request?

This PR proposes to support object.value_counts for pandas 2.0.0 by matching 
its behavior. See 
https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#value-counts-sets-the-resulting-name-to-count
for more detail.

### Why are the changes needed?

We should match the behavior with the latest pandas.

### Does this PR introduce _any_ user-facing change?

Yes, the behavior now follows pandas 2.0.0 and above.

### How was this patch tested?

Enabling the existing UT

Closes #42525 from itholic/pandas_value_counts.

Authored-by: itholic 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/pandas/base.py   | 49 -
 python/pyspark/pandas/tests/series/test_stat.py |  4 --
 2 files changed, 31 insertions(+), 22 deletions(-)

diff --git a/python/pyspark/pandas/base.py b/python/pyspark/pandas/base.py
index 0685af76987..1cb17de89e8 100644
--- a/python/pyspark/pandas/base.py
+++ b/python/pyspark/pandas/base.py
@@ -1317,26 +1317,29 @@ class IndexOpsMixin(object, metaclass=ABCMeta):
 
 >>> df = ps.DataFrame({'x':[0, 0, 1, 1, 1, np.nan]})
 >>> df.x.value_counts()  # doctest: +NORMALIZE_WHITESPACE
+x
 1.03
 0.02
-Name: x, dtype: int64
+Name: count, dtype: int64
 
 With `normalize` set to `True`, returns the relative frequency by
 dividing all values by the sum of values.
 
 >>> df.x.value_counts(normalize=True)  # doctest: +NORMALIZE_WHITESPACE
+x
 1.00.6
 0.00.4
-Name: x, dtype: float64
+Name: proportion, dtype: float64
 
 **dropna**
 With `dropna` set to `False` we can also see NaN index values.
 
 >>> df.x.value_counts(dropna=False)  # doctest: +NORMALIZE_WHITESPACE
+x
 1.03
 0.02
 NaN1
-Name: x, dtype: int64
+Name: count, dtype: int64
 
 For Index
 
@@ -1349,7 +1352,7 @@ class IndexOpsMixin(object, metaclass=ABCMeta):
 2.01
 3.02
 4.01
-dtype: int64
+Name: count, dtype: int64
 
 **sort**
 
@@ -1360,7 +1363,7 @@ class IndexOpsMixin(object, metaclass=ABCMeta):
 2.01
 3.02
 4.01
-dtype: int64
+Name: count, dtype: int64
 
 **normalize**
 
@@ -1372,7 +1375,7 @@ class IndexOpsMixin(object, metaclass=ABCMeta):
 2.00.2
 3.00.4
 4.00.2
-dtype: float64
+Name: proportion, dtype: float64
 
 **dropna**
 
@@ -1411,7 +1414,7 @@ class IndexOpsMixin(object, metaclass=ABCMeta):
 (falcon, length)2
 (falcon, weight)1
 (lama, weight)  3
-dtype: int64
+Name: count, dtype: int64
 
 >>> s.index.value_counts(normalize=True).sort_index()
 (cow, length)   0.11
@@ -1419,31 +1422,37 @@ class IndexOpsMixin(object, metaclass=ABCMeta):
 (falcon, length)0.22
 (falcon, weight)0.11
 (lama, weight)  0.33
-dtype: float64
+Name: proportion, dtype: float64
 
 If Index has name, keep the name up.
 
 >>> idx = ps.Index([0, 0, 0, 1, 1, 2, 3], name='pandas-on-Spark')
 >>> idx.value_counts().sort_index()
+pandas-on-Spark
 03
 12
 21
 31
-Name: pandas-on-Spark, dtype: int64
+Name: count, dtype: int64
 """
-from pyspark.pandas.series import first_series, Series
-
-if isinstance(self, Series):
-warnings.warn(
-"The resulting Series will have a fixed name of 'count' from 
4.0.0.",
-FutureWarning,
-)
+from pyspark.pandas.series import first_series
+from pyspark.pandas.indexes.multi import MultiIndex
 
 if bins is not None:
 raise NotImplementedError("value_counts currently does not support 
bins")
 
 if dropna:
-sdf_dropna = 
self._internal.spark_frame.select(self.spark.column).dropna()
+if isinstance(self, MultiIndex):
+# If even one StructField is null, that row should be dropped.
+index_spark_column_names = 
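
The doctest updates above reflect the pandas 2.0 change in which `value_counts` 
names the resulting Series 'count' (or 'proportion' when normalize=True) and 
moves the original name onto the index. A small illustrative example, assuming 
pandas 2.x:

    import numpy as np
    import pandas as pd

    s = pd.Series([0, 0, 1, 1, 1, np.nan], name="x")

    # pandas 2.x: the result is named 'count' and its index is named 'x'.
    print(s.value_counts())

    # With normalize=True the result is named 'proportion' instead.
    print(s.value_counts(normalize=True))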

[spark] branch branch-3.5 updated: [SPARK-44721][CONNECT] Revamp retry logic and make retries run for 10 minutes

2023-08-17 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 122d982143f [SPARK-44721][CONNECT] Revamp retry logic and make retries 
run for 10 minutes
122d982143f is described below

commit 122d982143f1ae1f2447701c6a7877cbf8bef4f0
Author: Alice Sayutina 
AuthorDate: Thu Aug 17 09:56:57 2023 +0200

[SPARK-44721][CONNECT] Revamp retry logic and make retries run for 10 
minutes

### What changes were proposed in this pull request?

Change the retry logic. With the existing retry logic, the maximum allowed wait 
time can be extremely low, and with small probability even zero.

This happens because each attempt waits random(0, T) for the T produced by 
exponentialBackoff(). Revamp the logic to guarantee a minimum total wait time of 
10 minutes. Also synchronize the retry behavior between the Python and Scala 
clients.

### Why are the changes needed?

This avoids a certain class of client errors where the client simply doesn't 
wait long enough.

### Does this PR introduce _any_ user-facing change?
Changes are small from the user's perspective: retries run for longer and more 
smoothly.

### How was this patch tested?
UT

Closes #42399 from cdkrot/revamp_retry_logic.

Lead-authored-by: Alice Sayutina 
Co-authored-by: Alice Sayutina 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 4952a03fdc22b36c1fb5bead09c5e2cc8b4602b8)
Signed-off-by: Hyukjin Kwon 
---
 .../ExecutePlanResponseReattachableIterator.scala  |   4 +-
 .../sql/connect/client/GrpcRetryHandler.scala  |  84 -
 .../connect/client/SparkConnectClientSuite.scala   |  26 +-
 python/pyspark/errors/error_classes.py |   5 -
 python/pyspark/sql/connect/client/core.py  | 102 ++---
 .../sql/tests/connect/client/test_client.py|  24 +
 .../sql/tests/connect/test_connect_basic.py|  12 +++
 7 files changed, 170 insertions(+), 87 deletions(-)

diff --git 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/ExecutePlanResponseReattachableIterator.scala
 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/ExecutePlanResponseReattachableIterator.scala
index 5ef1151682b..aeb452faecf 100644
--- 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/ExecutePlanResponseReattachableIterator.scala
+++ 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/ExecutePlanResponseReattachableIterator.scala
@@ -301,6 +301,6 @@ class ExecutePlanResponseReattachableIterator(
   /**
* Retries the given function with exponential backoff according to the 
client's retryPolicy.
*/
-  private def retry[T](fn: => T, currentRetryNum: Int = 0): T =
-GrpcRetryHandler.retry(retryPolicy)(fn, currentRetryNum)
+  private def retry[T](fn: => T): T =
+GrpcRetryHandler.retry(retryPolicy)(fn)
 }
diff --git 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/GrpcRetryHandler.scala
 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/GrpcRetryHandler.scala
index 6dad5b4b3a9..8b6f070b8f5 100644
--- 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/GrpcRetryHandler.scala
+++ 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/GrpcRetryHandler.scala
@@ -17,8 +17,8 @@
 
 package org.apache.spark.sql.connect.client
 
-import scala.annotation.tailrec
-import scala.concurrent.duration.FiniteDuration
+import scala.concurrent.duration.{Duration, FiniteDuration}
+import scala.util.Random
 import scala.util.control.NonFatal
 
 import io.grpc.{Status, StatusRuntimeException}
@@ -26,13 +26,15 @@ import io.grpc.stub.StreamObserver
 
 import org.apache.spark.internal.Logging
 
-private[client] class GrpcRetryHandler(private val retryPolicy: 
GrpcRetryHandler.RetryPolicy) {
+private[client] class GrpcRetryHandler(
+private val retryPolicy: GrpcRetryHandler.RetryPolicy,
+private val sleep: Long => Unit = Thread.sleep) {
 
   /**
* Retries the given function with exponential backoff according to the 
client's retryPolicy.
*/
-  def retry[T](fn: => T, currentRetryNum: Int = 0): T =
-GrpcRetryHandler.retry(retryPolicy)(fn, currentRetryNum)
+  def retry[T](fn: => T): T =
+GrpcRetryHandler.retry(retryPolicy, sleep)(fn)
 
   /**
* Generalizes the retry logic for RPC calls that return an iterator.
@@ -148,37 +150,62 @@ private[client] object GrpcRetryHandler extends Logging {
 
   /**
* Retries the given function with exponential backoff according to the 
client's retryPolicy.
+   *
* @param retryPolicy
*   The retry policy
+   * @param sleep
+   *   The function which sleeps (takes number of milliseconds to sleep)
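
For illustration only (not the actual GrpcRetryHandler or Python client code): a 
minimal sketch of the revised idea, where each attempt sleeps the full capped 
backoff plus a small random jitter rather than a uniform random fraction of the 
backoff, so the total wait time has a guaranteed lower bound. All parameter values 
below are made up, not Spark's defaults:

    import random
    import time

    def retry_with_backoff(fn, max_retries=15, initial_ms=50, multiplier=4.0,
                           max_backoff_ms=60_000, jitter_ms=500, sleep=time.sleep):
        backoff_ms = initial_ms
        for attempt in range(max_retries):
            try:
                return fn()
            except Exception:
                if attempt == max_retries - 1:
                    raise
                # Sleep the (capped) backoff plus jitter, never a random
                # fraction of it, so waits cannot collapse to ~zero.
                wait_ms = min(backoff_ms, max_backoff_ms) + random.uniform(0, jitter_ms)
                sleep(wait_ms / 1000.0)
                backoff_ms *= multiplier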
 

[spark] branch master updated (fce83d49993 -> 4952a03fdc2)

2023-08-17 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from fce83d49993 [SPARK-44822][PYTHON][SQL] Make Python UDTFs by default 
non-deterministic
 add 4952a03fdc2 [SPARK-44721][CONNECT] Revamp retry logic and make retries 
run for 10 minutes

No new revisions were added by this update.

Summary of changes:
 .../ExecutePlanResponseReattachableIterator.scala  |   4 +-
 .../sql/connect/client/GrpcRetryHandler.scala  |  84 -
 .../connect/client/SparkConnectClientSuite.scala   |  26 +-
 python/pyspark/errors/error_classes.py |   5 -
 python/pyspark/sql/connect/client/core.py  | 102 ++---
 .../sql/tests/connect/client/test_client.py|  24 +
 .../sql/tests/connect/test_connect_basic.py|  12 +++
 7 files changed, 170 insertions(+), 87 deletions(-)





[spark-docker] branch master updated: [SPARK-44494] Pin minikube to v1.30.1 to fix spark-docker K8s CI

2023-08-17 Thread yikun
This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-docker.git


The following commit(s) were added to refs/heads/master by this push:
 new 6fd201e  [SPARK-44494] Pin minikube to v1.30.1 to fix spark-docker K8s 
CI
6fd201e is described below

commit 6fd201e7c6e6a36c7a18e3b5877c3616081a05cf
Author: Yikun Jiang 
AuthorDate: Thu Aug 17 15:30:59 2023 +0800

[SPARK-44494] Pin minikube to v1.30.1 to fix spark-docker K8s CI

### What changes were proposed in this pull request?
Pin minikube to v1.30.1 to fix spark-docker K8s CI.

### Why are the changes needed?
Pin minikube to v1.30.1 to fix spark-docker K8s CI

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Closes #53 from Yikun/minikube.

Authored-by: Yikun Jiang 
Signed-off-by: Yikun Jiang 
---
 .github/workflows/main.yml | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index 870c8c7..fe755ed 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -243,7 +243,9 @@ jobs:
   - name: Test - Start minikube
 run: |
   # See more in "Installation" https://minikube.sigs.k8s.io/docs/start/
-  curl -LO 
https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
+  # curl -LO 
https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
+  # TODO(SPARK-44495): Resume to use the latest minikube for 
k8s-integration-tests.
+  curl -LO 
https://storage.googleapis.com/minikube/releases/v1.30.1/minikube-linux-amd64
   sudo install minikube-linux-amd64 /usr/local/bin/minikube
   # Github Action limit cpu:2, memory: 6947MB, limit to 2U6G for 
better resource statistic
   minikube start --cpus 2 --memory 6144

