(spark) branch master updated: [SPARK-46871][PS][TESTS] Clean up the imports in `pyspark.pandas.tests.computation.*`

2024-01-25 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f9f413e5ff6a [SPARK-46871][PS][TESTS] Clean up the imports in 
`pyspark.pandas.tests.computation.*`
f9f413e5ff6a is described below

commit f9f413e5ff6abe00a664e2dc75fb0ade2ff2986a
Author: Ruifeng Zheng 
AuthorDate: Thu Jan 25 22:40:35 2024 -0800

[SPARK-46871][PS][TESTS] Clean up the imports in 
`pyspark.pandas.tests.computation.*`

### What changes were proposed in this pull request?
Clean up the imports in `pyspark.pandas.tests.computation.*`

### Why are the changes needed?
1. Remove unused imports.
2. Define the test dataset on the vanilla side, so that it does not need to be 
defined again in the parity tests (a minimal sketch of this pattern follows).
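
The parity tests share their cases with the vanilla tests through mixins, so moving the 
dataset definition to the vanilla side removes the duplication. A minimal sketch of the 
pattern (the class names below are placeholders, not the actual test classes in 
`pyspark.pandas.tests.computation.*`):

```
# Illustrative sketch only -- placeholder class names.
import pandas as pd

from pyspark import pandas as ps
from pyspark.testing.pandasutils import PandasOnSparkTestCase


class FrameExampleMixin:
    # The test dataset is defined once, on the vanilla (non-Connect) side.
    @property
    def pdf(self):
        return pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

    @property
    def psdf(self):
        return ps.from_pandas(self.pdf)

    def test_sum(self):
        self.assert_eq(self.psdf.sum(), self.pdf.sum())


# Vanilla test: mixes the shared cases into the regular test base.
class FrameExampleTests(FrameExampleMixin, PandasOnSparkTestCase):
    pass


# A Connect parity test reuses the same mixin unchanged, e.g.
#   class FrameExampleParityTests(FrameExampleMixin, ReusedConnectTestCase):
#       pass
# so it no longer has to redefine the dataset.
```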

### Does this PR introduce _any_ user-facing change?
no, test-only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44895 from zhengruifeng/ps_test_comput_cleanup.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/pandas/tests/computation/test_any_all.py  |  8 ++--
 python/pyspark/pandas/tests/computation/test_apply_func.py   | 12 ++--
 python/pyspark/pandas/tests/computation/test_binary_ops.py   | 12 ++--
 python/pyspark/pandas/tests/computation/test_combine.py  |  8 ++--
 python/pyspark/pandas/tests/computation/test_compute.py  |  8 ++--
 python/pyspark/pandas/tests/computation/test_corr.py |  6 +-
 python/pyspark/pandas/tests/computation/test_corrwith.py |  8 ++--
 python/pyspark/pandas/tests/computation/test_cov.py  |  8 ++--
 python/pyspark/pandas/tests/computation/test_cumulative.py   |  8 ++--
 python/pyspark/pandas/tests/computation/test_describe.py |  8 ++--
 python/pyspark/pandas/tests/computation/test_eval.py |  8 ++--
 python/pyspark/pandas/tests/computation/test_melt.py |  8 ++--
 python/pyspark/pandas/tests/computation/test_missing_data.py |  8 ++--
 python/pyspark/pandas/tests/computation/test_pivot.py|  4 ++--
 python/pyspark/pandas/tests/computation/test_pivot_table.py  |  4 ++--
 .../pyspark/pandas/tests/computation/test_pivot_table_adv.py |  4 ++--
 .../pandas/tests/computation/test_pivot_table_multi_idx.py   |  4 ++--
 .../tests/computation/test_pivot_table_multi_idx_adv.py  |  4 ++--
 python/pyspark/pandas/tests/computation/test_stats.py|  6 +-
 .../pandas/tests/connect/computation/test_parity_any_all.py  | 11 ++-
 .../tests/connect/computation/test_parity_apply_func.py  |  9 -
 .../tests/connect/computation/test_parity_binary_ops.py  | 11 ++-
 .../pandas/tests/connect/computation/test_parity_combine.py  |  6 +-
 .../pandas/tests/connect/computation/test_parity_compute.py  |  6 +-
 .../pandas/tests/connect/computation/test_parity_corr.py |  7 +--
 .../pandas/tests/connect/computation/test_parity_corrwith.py | 11 ++-
 .../pandas/tests/connect/computation/test_parity_cov.py  | 11 ++-
 .../tests/connect/computation/test_parity_cumulative.py  |  9 -
 .../pandas/tests/connect/computation/test_parity_describe.py |  5 +
 .../pandas/tests/connect/computation/test_parity_eval.py | 11 ++-
 .../pandas/tests/connect/computation/test_parity_melt.py | 11 ++-
 .../tests/connect/computation/test_parity_missing_data.py|  9 -
 32 files changed, 164 insertions(+), 89 deletions(-)

diff --git a/python/pyspark/pandas/tests/computation/test_any_all.py 
b/python/pyspark/pandas/tests/computation/test_any_all.py
index 5e946be7b08b..784e355f3b58 100644
--- a/python/pyspark/pandas/tests/computation/test_any_all.py
+++ b/python/pyspark/pandas/tests/computation/test_any_all.py
@@ -20,7 +20,7 @@ import numpy as np
 import pandas as pd
 
 from pyspark import pandas as ps
-from pyspark.testing.pandasutils import ComparisonTestBase
+from pyspark.testing.pandasutils import PandasOnSparkTestCase
 from pyspark.testing.sqlutils import SQLTestUtils
 
 
@@ -149,7 +149,11 @@ class FrameAnyAllMixin:
 psdf.any(axis=1)
 
 
-class FrameAnyAllTests(FrameAnyAllMixin, ComparisonTestBase, SQLTestUtils):
+class FrameAnyAllTests(
+FrameAnyAllMixin,
+PandasOnSparkTestCase,
+SQLTestUtils,
+):
 pass
 
 
diff --git a/python/pyspark/pandas/tests/computation/test_apply_func.py 
b/python/pyspark/pandas/tests/computation/test_apply_func.py
index de82c061b58c..ad43a2f2b270 100644
--- a/python/pyspark/pandas/tests/computation/test_apply_func.py
+++ b/python/pyspark/pandas/tests/computation/test_apply_func.py
@@ -25,7 +25,7 @@ import pandas as pd
 from pyspark import pandas as ps
 from pyspark.loose_version import LooseVersion
 

(spark) branch branch-3.4 updated: [SPARK-46855][INFRA][3.4] Add `sketch` to the dependencies of the `catalyst` in `module.py`

2024-01-25 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 441c33da0dbb [SPARK-46855][INFRA][3.4] Add `sketch` to the 
dependencies of the `catalyst` in `module.py`
441c33da0dbb is described below

commit 441c33da0dbba26c54d6a46805f8902605472007
Author: yangjie01 
AuthorDate: Thu Jan 25 22:36:32 2024 -0800

[SPARK-46855][INFRA][3.4] Add `sketch` to the dependencies of the 
`catalyst` in `module.py`

### What changes were proposed in this pull request?
This PR adds `sketch` to the dependencies of the `catalyst` module in 
`modules.py`, because `sketch` is a direct dependency of the `catalyst` module.

### Why are the changes needed?
Ensure that when the `sketch` module is modified, tests are triggered for 
`catalyst` and its cascading (downstream) modules, as illustrated by the sketch below.
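
As a rough illustration (a self-contained sketch, not the actual `dev/run-tests` 
selection code), a change under `common/sketch/` now selects `catalyst` and everything 
downstream of it through the dependency graph declared in `dev/sparktestsupport/modules.py`:

```
# Self-contained sketch of transitive test selection. Module definitions
# reference earlier Module objects directly, which is why the `sketch`
# definition is moved above `catalyst` in modules.py.
class Module:
    def __init__(self, name, dependencies=()):
        self.name = name
        self.dependencies = list(dependencies)


tags = Module("tags")
sketch = Module("sketch", [tags])
core = Module("core", [tags])
catalyst = Module("catalyst", [tags, sketch, core])
sql = Module("sql", [catalyst])
all_modules = [tags, sketch, core, catalyst, sql]  # topological order


def modules_to_test(changed):
    selected = {changed}
    for m in all_modules:
        if any(dep in selected for dep in m.dependencies):
            selected.add(m)
    return selected


print(sorted(m.name for m in modules_to_test(sketch)))
# ['catalyst', 'sketch', 'sql']
```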

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44894 from LuciferYang/SPARK-46855-34.

Authored-by: yangjie01 
Signed-off-by: Dongjoon Hyun 
---
 dev/sparktestsupport/modules.py | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index ac24ea19d0e7..100dd236c81d 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -168,6 +168,15 @@ launcher = Module(
 ],
 )
 
+sketch = Module(
+name="sketch",
+dependencies=[tags],
+source_file_regexes=[
+"common/sketch/",
+],
+sbt_test_goals=["sketch/test"],
+)
+
 core = Module(
 name="core",
 dependencies=[kvstore, network_common, network_shuffle, unsafe, launcher],
@@ -181,7 +190,7 @@ core = Module(
 
 catalyst = Module(
 name="catalyst",
-dependencies=[tags, core],
+dependencies=[tags, sketch, core],
 source_file_regexes=[
 "sql/catalyst/",
 ],
@@ -295,15 +304,6 @@ protobuf = Module(
 ],
 )
 
-sketch = Module(
-name="sketch",
-dependencies=[tags],
-source_file_regexes=[
-"common/sketch/",
-],
-sbt_test_goals=["sketch/test"],
-)
-
 graphx = Module(
 name="graphx",
 dependencies=[tags, core],





(spark) branch branch-3.5 updated: [SPARK-46855][INFRA][3.5] Add `sketch` to the dependencies of the `catalyst` in `module.py`

2024-01-25 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new e5a654e818b4 [SPARK-46855][INFRA][3.5] Add `sketch` to the 
dependencies of the `catalyst` in `module.py`
e5a654e818b4 is described below

commit e5a654e818b4698260807a081e5cf3d71480ac13
Author: yangjie01 
AuthorDate: Thu Jan 25 22:35:38 2024 -0800

[SPARK-46855][INFRA][3.5] Add `sketch` to the dependencies of the 
`catalyst` in `module.py`

### What changes were proposed in this pull request?
This PR adds `sketch` to the dependencies of the `catalyst` module in 
`modules.py`, because `sketch` is a direct dependency of the `catalyst` module.

### Why are the changes needed?
Ensure that when the `sketch` module is modified, tests are triggered for 
`catalyst` and its cascading (downstream) modules.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44893 from LuciferYang/SPARK-46855-35.

Authored-by: yangjie01 
Signed-off-by: Dongjoon Hyun 
---
 dev/sparktestsupport/modules.py | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index 33d253a47ea0..d29fc8726018 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -168,6 +168,15 @@ launcher = Module(
 ],
 )
 
+sketch = Module(
+name="sketch",
+dependencies=[tags],
+source_file_regexes=[
+"common/sketch/",
+],
+sbt_test_goals=["sketch/test"],
+)
+
 core = Module(
 name="core",
 dependencies=[kvstore, network_common, network_shuffle, unsafe, launcher],
@@ -181,7 +190,7 @@ core = Module(
 
 catalyst = Module(
 name="catalyst",
-dependencies=[tags, core],
+dependencies=[tags, sketch, core],
 source_file_regexes=[
 "sql/catalyst/",
 ],
@@ -295,15 +304,6 @@ connect = Module(
 ],
 )
 
-sketch = Module(
-name="sketch",
-dependencies=[tags],
-source_file_regexes=[
-"common/sketch/",
-],
-sbt_test_goals=["sketch/test"],
-)
-
 graphx = Module(
 name="graphx",
 dependencies=[tags, core],





(spark) branch master updated: [SPARK-46872][CORE] Recover `log-view.js` to be non-module

2024-01-25 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 1eedb7507ae2 [SPARK-46872][CORE] Recover `log-view.js` to be non-module
1eedb7507ae2 is described below

commit 1eedb7507ae23d069e65a40c202173a709c5e94d
Author: Dongjoon Hyun 
AuthorDate: Thu Jan 25 22:31:56 2024 -0800

[SPARK-46872][CORE] Recover `log-view.js` to be non-module

### What changes were proposed in this pull request?

This PR aims to recover `log-view.js` as a non-module script to fix a loading issue.

### Why are the changes needed?

- #43903

![Screenshot 2024-01-25 at 9 08 48 
PM](https://github.com/apache/spark/assets/9700541/830fadc8-ab1c-4cf4-9e56-493f9553b3ae)

### Does this PR introduce _any_ user-facing change?

No. This is a recovery of the status before SPARK-46003, which has not been 
released yet.

### How was this patch tested?

Manually.

- Checkout SPARK-46003 commit and build.
- Start Master and Worker.
- Open `Incognito` or `Private` mode browser and go to Worker Log.
- Check `initLogPage` error via the developer tools

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44896 from dongjoon-hyun/SPARK-46872.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 core/src/main/resources/org/apache/spark/ui/static/log-view.js | 4 +---
 core/src/main/scala/org/apache/spark/ui/UIUtils.scala  | 2 +-
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/core/src/main/resources/org/apache/spark/ui/static/log-view.js 
b/core/src/main/resources/org/apache/spark/ui/static/log-view.js
index eaf7130e974b..0b917ee5c8d8 100644
--- a/core/src/main/resources/org/apache/spark/ui/static/log-view.js
+++ b/core/src/main/resources/org/apache/spark/ui/static/log-view.js
@@ -17,8 +17,6 @@
 
 /* global $ */
 
-import {getBaseURI} from "./utils.js";
-
 var baseParams;
 
 var curLogLength;
@@ -60,7 +58,7 @@ function getRESTEndPoint() {
   // If the worker is served from the master through a proxy (see doc on 
spark.ui.reverseProxy), 
   // we need to retain the leading ../proxy// part of the URL when 
making REST requests.
   // Similar logic is contained in executorspage.js function 
createRESTEndPoint.
-  var words = getBaseURI().split('/');
+  var words = (document.baseURI || document.URL).split('/');
   var ind = words.indexOf("proxy");
   if (ind > 0) {
 return words.slice(0, ind + 2).join('/') + "/log";
diff --git a/core/src/main/scala/org/apache/spark/ui/UIUtils.scala 
b/core/src/main/scala/org/apache/spark/ui/UIUtils.scala
index d124717ea85a..14255d276d66 100644
--- a/core/src/main/scala/org/apache/spark/ui/UIUtils.scala
+++ b/core/src/main/scala/org/apache/spark/ui/UIUtils.scala
@@ -220,7 +220,7 @@ private[spark] object UIUtils extends Logging {
 
 
 
-
+
 
 setUIRoot('{UIUtils.uiRoot(request)}')
   }





(spark) branch master updated: [SPARK-46870][CORE] Support Spark Master Log UI

2024-01-25 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d6fc06bd4515 [SPARK-46870][CORE] Support Spark Master Log UI
d6fc06bd4515 is described below

commit d6fc06bd451586edc5e55068aabecb3dc7ec5849
Author: Dongjoon Hyun 
AuthorDate: Thu Jan 25 21:15:30 2024 -0800

[SPARK-46870][CORE] Support Spark Master Log UI

### What changes were proposed in this pull request?

This PR aims to support `Spark Master` Log UI.

### Why are the changes needed?

This is a new feature that allows users to access the master log, as shown below. 
The `Status` value, e.g. `ALIVE`, now links to the log UI.

**BEFORE**

![Screenshot 2024-01-25 at 7 30 07 
PM](https://github.com/apache/spark/assets/9700541/2c263944-ebfa-49bb-955f-d9a022e23cba)

**AFTER**

![Screenshot 2024-01-25 at 7 28 59 
PM](https://github.com/apache/spark/assets/9700541/8d096261-3a31-4746-b52b-e01cfcdf3237)

![Screenshot 2024-01-25 at 7 29 21 
PM](https://github.com/apache/spark/assets/9700541/fc4d3c10-8695-4529-a92b-6ab477c961da)

### Does this PR introduce _any_ user-facing change?

No. This is a new link and UI.

### How was this patch tested?

Manually.

```
$ sbin/start-master.sh
```

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44890 from dongjoon-hyun/SPARK-46870.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../apache/spark/deploy/master/ui/LogPage.scala| 125 +
 .../apache/spark/deploy/master/ui/MasterPage.scala |   4 +-
 .../spark/deploy/master/ui/MasterWebUI.scala   |   1 +
 3 files changed, 129 insertions(+), 1 deletion(-)

diff --git 
a/core/src/main/scala/org/apache/spark/deploy/master/ui/LogPage.scala 
b/core/src/main/scala/org/apache/spark/deploy/master/ui/LogPage.scala
new file mode 100644
index ..9da05025e1a3
--- /dev/null
+++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/LogPage.scala
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy.master.ui
+
+import java.io.File
+import javax.servlet.http.HttpServletRequest
+
+import scala.xml.{Node, Unparsed}
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.ui.{UIUtils, WebUIPage}
+import org.apache.spark.util.Utils
+import org.apache.spark.util.logging.RollingFileAppender
+
+private[ui] class LogPage(parent: MasterWebUI) extends WebUIPage("logPage") 
with Logging {
+  private val defaultBytes = 100 * 1024
+
+  def render(request: HttpServletRequest): Seq[Node] = {
+val logDir = sys.env.getOrElse("SPARK_LOG_DIR", "logs/")
+val logType = request.getParameter("logType")
+val offset = Option(request.getParameter("offset")).map(_.toLong)
+val byteLength = Option(request.getParameter("byteLength")).map(_.toInt)
+  .getOrElse(defaultBytes)
+val (logText, startByte, endByte, logLength) = getLog(logDir, logType, 
offset, byteLength)
+val curLogLength = endByte - startByte
+val range =
+  
+Showing {curLogLength} Bytes: {startByte.toString} - 
{endByte.toString} of {logLength}
+  
+
+val moreButton =
+  
+Load More
+  
+
+val newButton =
+  
+Load New
+  
+
+val alert =
+  
+End of Log
+  
+
+val logParams = "?self=%s".format(logType)
+val jsOnload = "window.onload = " +
+  s"initLogPage('$logParams', $curLogLength, $startByte, $endByte, 
$logLength, $byteLength);"
+
+val content =
+   ++
+  
+Back to Master
+{range}
+
+  {moreButton}
+  {logText}
+  {alert}
+  {newButton}
+
+{Unparsed(jsOnload)}
+  
+
+UIUtils.basicSparkPage(request, content, logType + " log page for master")
+  }
+
+  /** Get the part of the log files given the offset and desired length of 
bytes */
+  private def getLog(
+  logDirectory: 

(spark) branch master updated: [MINOR][INFRA] Update the location for error class README.md

2024-01-25 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new de71884273a3 [MINOR][INFRA] Update the location for error class 
README.md
de71884273a3 is described below

commit de71884273a3945317f90eb1fa79f8dbc6ec51f4
Author: panbingkun 
AuthorDate: Fri Jan 26 13:05:49 2024 +0900

[MINOR][INFRA] Update the location for error class README.md

### What changes were proposed in this pull request?
The PR aims to update the location of the error class `README.md`.

### Why are the changes needed?
Fix the outdated `README.md` file path.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44891 from panbingkun/update_error_readme_path.

Authored-by: panbingkun 
Signed-off-by: Hyukjin Kwon 
---
 .github/PULL_REQUEST_TEMPLATE | 2 +-
 .../src/main/scala/org/apache/spark/ErrorClassesJSONReader.scala  | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/.github/PULL_REQUEST_TEMPLATE b/.github/PULL_REQUEST_TEMPLATE
index a80bf21312a4..019c259594e2 100644
--- a/.github/PULL_REQUEST_TEMPLATE
+++ b/.github/PULL_REQUEST_TEMPLATE
@@ -9,7 +9,7 @@ Thanks for sending a pull request!  Here are some tips for you:
   7. If you want to add a new configuration, please read the guideline first 
for naming configurations in
  'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
   8. If you want to add or modify an error type or message, please read the 
guideline first in
- 'core/src/main/resources/error/README.md'.
+ 'common/utils/src/main/resources/error/README.md'.
 -->
 
 ### What changes were proposed in this pull request?
diff --git 
a/common/utils/src/main/scala/org/apache/spark/ErrorClassesJSONReader.scala 
b/common/utils/src/main/scala/org/apache/spark/ErrorClassesJSONReader.scala
index 083064bfe238..ead395cd39cb 100644
--- a/common/utils/src/main/scala/org/apache/spark/ErrorClassesJSONReader.scala
+++ b/common/utils/src/main/scala/org/apache/spark/ErrorClassesJSONReader.scala
@@ -32,8 +32,8 @@ import org.apache.spark.annotation.DeveloperApi
 
 /**
  * A reader to load error information from one or more JSON files. Note that, 
if one error appears
- * in more than one JSON files, the latter wins. Please read 
core/src/main/resources/error/README.md
- * for more details.
+ * in more than one JSON files, the latter wins.
+ * Please read common/utils/src/main/resources/error/README.md for more 
details.
  */
 @DeveloperApi
 class ErrorClassesJsonReader(jsonFileURLs: Seq[URL]) {





(spark) branch master updated: [SPARK-46868][CORE] Support Spark Worker Log UI

2024-01-25 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 48cd8604f953 [SPARK-46868][CORE] Support Spark Worker Log UI
48cd8604f953 is described below

commit 48cd8604f953dc82cadb6c076914d4d5c69b8126
Author: Dongjoon Hyun 
AuthorDate: Thu Jan 25 19:31:26 2024 -0800

[SPARK-46868][CORE] Support Spark Worker Log UI

### What changes were proposed in this pull request?

This PR aims to support the `Spark Worker Log UI` when `SPARK_LOG_DIR` is under 
the work directory.

### Why are the changes needed?

This is a new feature that allows users to access the worker log, as shown 
below.

**BEFORE**

![Screenshot 2024-01-25 at 3 04 20 
PM](https://github.com/apache/spark/assets/9700541/73ef33d5-9b56-4cca-83c2-9fd2e8ab5201)

**AFTER**

- Worker Page (Worker ID provides a new hyperlink for Log UI)
![Screenshot 2024-01-25 at 2 58 44 
PM](https://github.com/apache/spark/assets/9700541/1de66eee-7b73-4be3-a12c-e008442b7b6c)

- Log UI
![Screenshot 2024-01-25 at 6 00 25 
PM](https://github.com/apache/spark/assets/9700541/e20fde05-ce5e-42cb-9112-4a8d2ec69418)

### Does this PR introduce _any_ user-facing change?

To provide a better UX.

### How was this patch tested?

Manually.

```
$ sbin/start-master.sh
$ SPARK_LOG_DIR=$PWD/work/logs sbin/start-worker.sh spark://$(hostname):7077
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44888 from dongjoon-hyun/SPARK-46868.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../apache/spark/deploy/worker/ui/LogPage.scala| 29 --
 .../apache/spark/deploy/worker/ui/WorkerPage.scala |  6 -
 2 files changed, 26 insertions(+), 9 deletions(-)

diff --git 
a/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala 
b/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala
index dd714cdc4437..991c791cc79e 100644
--- a/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala
@@ -30,23 +30,26 @@ import org.apache.spark.util.logging.RollingFileAppender
 private[ui] class LogPage(parent: WorkerWebUI) extends WebUIPage("logPage") 
with Logging {
   private val worker = parent.worker
   private val workDir = new File(parent.workDir.toURI.normalize().getPath)
-  private val supportedLogTypes = Set("stderr", "stdout")
+  private val supportedLogTypes = Set("stderr", "stdout", "out")
   private val defaultBytes = 100 * 1024
 
   def renderLog(request: HttpServletRequest): String = {
 val appId = Option(request.getParameter("appId"))
 val executorId = Option(request.getParameter("executorId"))
 val driverId = Option(request.getParameter("driverId"))
+val self = Option(request.getParameter("self"))
 val logType = request.getParameter("logType")
 val offset = Option(request.getParameter("offset")).map(_.toLong)
 val byteLength = Option(request.getParameter("byteLength")).map(_.toInt)
   .getOrElse(defaultBytes)
 
-val logDir = (appId, executorId, driverId) match {
-  case (Some(a), Some(e), None) =>
+val logDir = (appId, executorId, driverId, self) match {
+  case (Some(a), Some(e), None, None) =>
 s"${workDir.getPath}/$a/$e/"
-  case (None, None, Some(d)) =>
+  case (None, None, Some(d), None) =>
 s"${workDir.getPath}/$d/"
+  case (None, None, None, Some(_)) =>
+s"${sys.env.getOrElse("SPARK_LOG_DIR", workDir.getPath)}/"
   case _ =>
 throw new Exception("Request must specify either application or driver 
identifiers")
 }
@@ -60,16 +63,19 @@ private[ui] class LogPage(parent: WorkerWebUI) extends 
WebUIPage("logPage") with
 val appId = Option(request.getParameter("appId"))
 val executorId = Option(request.getParameter("executorId"))
 val driverId = Option(request.getParameter("driverId"))
+val self = Option(request.getParameter("self"))
 val logType = request.getParameter("logType")
 val offset = Option(request.getParameter("offset")).map(_.toLong)
 val byteLength = Option(request.getParameter("byteLength")).map(_.toInt)
   .getOrElse(defaultBytes)
 
-val (logDir, params, pageName) = (appId, executorId, driverId) match {
-  case (Some(a), Some(e), None) =>
+val (logDir, params, pageName) = (appId, executorId, driverId, self) match 
{
+  case (Some(a), Some(e), None, None) =>
 (s"${workDir.getPath}/$a/$e/", s"appId=$a=$e", s"$a/$e")
-  case (None, None, Some(d)) =>
+  case (None, None, Some(d), None) =>
 (s"${workDir.getPath}/$d/", s"driverId=$d", d)
+  case (None, None, None, Some(_)) 

(spark) branch master updated: [SPARK-46863][DOCS] Cleanup custom CSS

2024-01-25 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 545489a60048 [SPARK-46863][DOCS] Cleanup custom CSS
545489a60048 is described below

commit 545489a60048e760c01dffea0c11c0c461030ac3
Author: Nicholas Chammas 
AuthorDate: Fri Jan 26 11:14:09 2024 +0800

[SPARK-46863][DOCS] Cleanup custom CSS

### What changes were proposed in this pull request?

- Remove rules for the following classes that we do not use:
  - `.code` (We use `code` _elements_, but not a `code` _class_.)
  - `.code-tab` (We _do_ have a `codetabs` class, however.)
  - `.jumbotron` (No idea where this came from, but we don't use it.)
- Remove some superfluous comments.
- Using [css-purge][1], merge rules for the same selectors that were 
previously scattered across several blocks, and remove overridden rules.
- Format the whole file with VS Code. This makes the spacing consistent, 
and puts each selector on its own line.

[1]: http://rbtech.github.io/css-purge/

### Why are the changes needed?

It's very difficult to make improvements to the presentation of the 
documentation due to the noise in this CSS file. This change will facilitate 
future improvements to the documentation.

### Does this PR introduce _any_ user-facing change?

No, it shouldn't. This change should be functionally neutral.

### How was this patch tested?

I built the docs and manually reviewed them. Code blocks and tables look 
the same to me.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44885 from nchammas/css-cleanup.

Authored-by: Nicholas Chammas 
Signed-off-by: Kent Yao 
---
 docs/css/custom.css | 561 +++-
 1 file changed, 199 insertions(+), 362 deletions(-)

diff --git a/docs/css/custom.css b/docs/css/custom.css
index 51e89066e4d5..1fabf7be3ac8 100644
--- a/docs/css/custom.css
+++ b/docs/css/custom.css
@@ -1,9 +1,6 @@
-
-
-
 body {
   color: #66;
-  font-family: 'DM Sans', sans-serif;
+  font-family: "DM Sans", sans-serif;
   font-style: normal;
   font-weight: 400;
   overflow-wrap: anywhere;
@@ -13,11 +10,13 @@ body {
 }
 
 a {
+  background: transparent;
   color: #2fa4e7;
   text-decoration: none;
 }
 
-a:hover, a:focus {
+a:hover,
+a:focus {
   color: #157ab5;
   text-decoration: underline;
 }
@@ -25,41 +24,43 @@ a:hover, a:focus {
 .navbar {
   border-radius: 0;
   z-index: ;
-}
-
-.navbar {
   transition: none !important;
 }
 
-.navbar .nav-item:hover .dropdown-menu, .navbar .nav-item .dropdown-menu, 
.navbar .dropdown-menu.fade-down, .navbar-toggler, .collapse, .collapsing {
+.navbar .nav-item:hover .dropdown-menu,
+.navbar .nav-item .dropdown-menu,
+.navbar .dropdown-menu.fade-down,
+.navbar-toggler,
+.collapse,
+.collapsing {
   transition: none !important;
   transform: none !important;
 }
 
 @media all and (min-width: 992px) {
   .navbar .nav-item .dropdown-menu {
-  display: block;
-  opacity: 0;
-  visibility: hidden;
-  transition: none !important;
-  margin-top: 0;
+display: block;
+opacity: 0;
+visibility: hidden;
+transition: none !important;
+margin-top: 0;
   }
 
   .navbar .dropdown-menu.fade-down {
-  top: 80%;
-  transform: none !important;
-  transform-origin: 0% 0%;
+top: 80%;
+transform: none !important;
+transform-origin: 0% 0%;
   }
 
   .navbar .dropdown-menu.fade-up {
-  top: 180%;
+top: 180%;
   }
 
   .navbar .nav-item:hover .dropdown-menu {
-  opacity: 1;
-  visibility: visible;
-  top: 100%;
-  font-size: calc(0.51rem + 0.55vw);
+opacity: 1;
+visibility: visible;
+top: 100%;
+font-size: calc(0.51rem + 0.55vw);
   }
 }
 
@@ -74,12 +75,17 @@ a:hover, a:focus {
   text-decoration: none !important;
 }
 
-.navbar-dark .navbar-nav .nav-link.active, .navbar-dark .navbar-nav .show > 
.nav-link, .navbar-dark .navbar-nav .nav-link {
+.navbar-dark .navbar-nav .nav-link.active,
+.navbar-dark .navbar-nav .show>.nav-link,
+.navbar-dark .navbar-nav .nav-link {
   color: #ff;
   border: 1px solid transparent;
 }
 
-.navbar-dark .navbar-nav .nav-link:focus, .navbar-dark .navbar-nav 
.nav-link:active, .navbar-dark .navbar-nav .nav-link:focus, .navbar-dark 
.navbar-nav .nav-link:hover {
+.navbar-dark .navbar-nav .nav-link:focus,
+.navbar-dark .navbar-nav .nav-link:active,
+.navbar-dark .navbar-nav .nav-link:focus,
+.navbar-dark .navbar-nav .nav-link:hover {
   color: #ff;
   background-color: #1b618e;
   border: 1px solid transparent;
@@ -151,9 +157,9 @@ section {
 }
 
 @media screen and (min-width: 1900px) {
-.apache-spark-motto {
-  font-size: 7.3rem;
-}
+  .apache-spark-motto {
+font-size: 7.3rem;
+  }
 }
 
 

(spark) branch master updated: [SPARK-46432][BUILD] Upgrade Netty to 4.1.106.Final

2024-01-25 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 44b163d281b9 [SPARK-46432][BUILD] Upgrade Netty to 4.1.106.Final
44b163d281b9 is described below

commit 44b163d281b9773cab9995e690ec3f4751c8be69
Author: panbingkun 
AuthorDate: Fri Jan 26 11:12:11 2024 +0800

[SPARK-46432][BUILD] Upgrade Netty to 4.1.106.Final

### What changes were proposed in this pull request?
The PR aims to upgrade `Netty` from `4.1.100.Final` to `4.1.106.Final`.

### Why are the changes needed?
- To bring the latest bug fixes:
  - Automatically close Http2StreamChannel when Http2FrameStreamException reaches 
end of ChannelPipeline ([#13651](https://github.com/netty/netty/pull/13651))
  - Symbol not found: _netty_jni_util_JNI_OnLoad 
([#13695](https://github.com/netty/netty/issues/13728))

- 4.1.106.Final release note: 
https://netty.io/news/2024/01/19/4-1-106-Final.html
- 4.1.105.Final release note: 
https://netty.io/news/2024/01/16/4-1-105-Final.html
- 4.1.104.Final release note: 
https://netty.io/news/2023/12/15/4-1-104-Final.html
- 4.1.103.Final release note: 
https://netty.io/news/2023/12/13/4-1-103-Final.html
- 4.1.101.Final release note: 
https://netty.io/news/2023/11/09/4-1-101-Final.html

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44384 from panbingkun/SPARK-46432.

Lead-authored-by: panbingkun 
Co-authored-by: panbingkun 
Signed-off-by: yangjie01 
---
 common/network-yarn/pom.xml   | 44 ++-
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 37 +++--
 pom.xml   |  2 +-
 3 files changed, 43 insertions(+), 40 deletions(-)

diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml
index c809bdfbbc1d..3f2ae21eeb3b 100644
--- a/common/network-yarn/pom.xml
+++ b/common/network-yarn/pom.xml
@@ -173,27 +173,29 @@
 unpack
 package
 
-
-
-
-
-
-
-
-
-
-
-
-
-
+  
+
+
+
+
+
+
+
+
+
+
+
+
+  
 
 
   run
diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 4ee0f5a41191..71f9ac8665b0 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -191,16 +191,16 @@ metrics-jmx/4.2.21//metrics-jmx-4.2.21.jar
 metrics-json/4.2.21//metrics-json-4.2.21.jar
 metrics-jvm/4.2.21//metrics-jvm-4.2.21.jar
 minlog/1.3.0//minlog-1.3.0.jar
-netty-all/4.1.100.Final//netty-all-4.1.100.Final.jar
-netty-buffer/4.1.100.Final//netty-buffer-4.1.100.Final.jar
-netty-codec-http/4.1.100.Final//netty-codec-http-4.1.100.Final.jar
-netty-codec-http2/4.1.100.Final//netty-codec-http2-4.1.100.Final.jar
-netty-codec-socks/4.1.100.Final//netty-codec-socks-4.1.100.Final.jar
-netty-codec/4.1.100.Final//netty-codec-4.1.100.Final.jar
-netty-common/4.1.100.Final//netty-common-4.1.100.Final.jar
-netty-handler-proxy/4.1.100.Final//netty-handler-proxy-4.1.100.Final.jar
-netty-handler/4.1.100.Final//netty-handler-4.1.100.Final.jar
-netty-resolver/4.1.100.Final//netty-resolver-4.1.100.Final.jar
+netty-all/4.1.106.Final//netty-all-4.1.106.Final.jar
+netty-buffer/4.1.106.Final//netty-buffer-4.1.106.Final.jar
+netty-codec-http/4.1.106.Final//netty-codec-http-4.1.106.Final.jar
+netty-codec-http2/4.1.106.Final//netty-codec-http2-4.1.106.Final.jar
+netty-codec-socks/4.1.106.Final//netty-codec-socks-4.1.106.Final.jar
+netty-codec/4.1.106.Final//netty-codec-4.1.106.Final.jar
+netty-common/4.1.106.Final//netty-common-4.1.106.Final.jar
+netty-handler-proxy/4.1.106.Final//netty-handler-proxy-4.1.106.Final.jar
+netty-handler/4.1.106.Final//netty-handler-4.1.106.Final.jar
+netty-resolver/4.1.106.Final//netty-resolver-4.1.106.Final.jar
 
netty-tcnative-boringssl-static/2.0.61.Final//netty-tcnative-boringssl-static-2.0.61.Final.jar
 
netty-tcnative-boringssl-static/2.0.61.Final/linux-aarch_64/netty-tcnative-boringssl-static-2.0.61.Final-linux-aarch_64.jar
 
netty-tcnative-boringssl-static/2.0.61.Final/linux-x86_64/netty-tcnative-boringssl-static-2.0.61.Final-linux-x86_64.jar
@@ -208,14 +208,15 @@ 

(spark) branch master updated: [SPARK-46869][K8S] Add `logrotate` to Spark docker files

2024-01-25 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 25df5dad6761 [SPARK-46869][K8S] Add `logrotate` to Spark docker files
25df5dad6761 is described below

commit 25df5dad67610357287bef075ee755c59acdb904
Author: Dongjoon Hyun 
AuthorDate: Thu Jan 25 16:30:55 2024 -0800

[SPARK-46869][K8S] Add `logrotate` to Spark docker files

### What changes were proposed in this pull request?

This PR aims to add `logrotate` to Spark docker files.

### Why are the changes needed?

To help users easily rotate the logs via configuration. Note that this is not 
for rigorous users who cannot tolerate log data loss; `logrotate` is simple to 
use but is known to allow log loss during rotation.

### Does this PR introduce _any_ user-facing change?

The image size change is negligible.
```
$ docker images spark
REPOSITORY   TAGIMAGE ID   CREATEDSIZE
sparklatest-logrotate   d843879458af   18 hours ago   657MB
sparklatest 0e281bd1fbe6   18 hours ago   657MB
```

### How was this patch tested?

Manually.
```
$ docker run -it --rm spark:latest-logrotate /usr/sbin/logrotate | tail -n7
logrotate 3.19.0 - Copyright (C) 1995-2001 Red Hat, Inc.
This may be freely redistributed under the terms of the GNU General Public 
License

Usage: logrotate [-dfv?] [-d|--debug] [-f|--force] [-m|--mail=command]
[-s|--state=statefile] [--skip-state-lock] [-v|--verbose]
[-l|--log=logfile] [--version] [-?|--help] [--usage]
[OPTION...] 
```

### Was this patch authored or co-authored using generative AI tooling?

Pass the CIs.

Closes #44889 from dongjoon-hyun/SPARK-46869.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../kubernetes/docker/src/main/dockerfiles/spark/Dockerfile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile 
b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
index b80e72c768c6..25d7e076169b 100644
--- a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
+++ b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
@@ -30,7 +30,7 @@ ARG spark_uid=185
 RUN set -ex && \
 apt-get update && \
 ln -s /lib /lib64 && \
-apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps 
net-tools && \
+apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps 
net-tools logrotate && \
 mkdir -p /opt/spark && \
 mkdir -p /opt/spark/examples && \
 mkdir -p /opt/spark/work-dir && \





(spark) branch master updated: [SPARK-46867][PYTHON][CONNECT][TESTS] Remove unnecessary dependency from test_mixed_udf_and_sql.py

2024-01-25 Thread xinrong
This is an automated email from the ASF dual-hosted git repository.

xinrong pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 79918028b142 [SPARK-46867][PYTHON][CONNECT][TESTS] Remove unnecessary 
dependency from test_mixed_udf_and_sql.py
79918028b142 is described below

commit 79918028b142685fe1c3871a3593e91100ab6bbf
Author: Xinrong Meng 
AuthorDate: Thu Jan 25 14:16:12 2024 -0800

[SPARK-46867][PYTHON][CONNECT][TESTS] Remove unnecessary dependency from 
test_mixed_udf_and_sql.py

### What changes were proposed in this pull request?
Remove unnecessary dependency from test_mixed_udf_and_sql.py.

### Why are the changes needed?
Otherwise, test_mixed_udf_and_sql.py depends on Spark Connect's dependency 
"grpc", possibly leading to conflicts or compatibility issues.

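Distilled, the change below injects the expected `Column` type instead of importing it 
in the shared module; a simplified sketch of that pattern (placeholder names with the 
test body elided -- the real classes appear in the diff):

```
# Simplified sketch of the dependency-injection pattern used below.
class ScalarUDFTestsMixinSketch:
    def _test_mixed_udf_and_sql(self, col_type):
        # Shared test body: wherever a column object flows through the UDF
        # chain, assert `type(x) == col_type`.
        ...

    def test_mixed_udf_and_sql(self):
        # Vanilla entry point: classic Column, so this module never imports
        # anything from pyspark.sql.connect (and hence never needs grpc).
        from pyspark.sql import Column
        self._test_mixed_udf_and_sql(Column)


class ParityTestsSketch(ScalarUDFTestsMixinSketch):
    def test_mixed_udf_and_sql(self):
        # Connect parity entry point: only this override imports the Connect
        # Column, keeping the grpc dependency on the Connect side.
        from pyspark.sql.connect.column import Column
        self._test_mixed_udf_and_sql(Column)
```
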
### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Test change only.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44886 from xinrong-meng/fix_dep.

Authored-by: Xinrong Meng 
Signed-off-by: Xinrong Meng 
---
 python/pyspark/sql/tests/connect/test_parity_pandas_udf_scalar.py | 4 
 python/pyspark/sql/tests/pandas/test_pandas_udf_scalar.py | 5 +++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/tests/connect/test_parity_pandas_udf_scalar.py 
b/python/pyspark/sql/tests/connect/test_parity_pandas_udf_scalar.py
index c950ca2e17c3..6a3d03246549 100644
--- a/python/pyspark/sql/tests/connect/test_parity_pandas_udf_scalar.py
+++ b/python/pyspark/sql/tests/connect/test_parity_pandas_udf_scalar.py
@@ -15,6 +15,7 @@
 # limitations under the License.
 #
 import unittest
+from pyspark.sql.connect.column import Column
 from pyspark.sql.tests.pandas.test_pandas_udf_scalar import 
ScalarPandasUDFTestsMixin
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
@@ -51,6 +52,9 @@ class PandasUDFScalarParityTests(ScalarPandasUDFTestsMixin, 
ReusedConnectTestCas
 def test_vectorized_udf_invalid_length(self):
 self.check_vectorized_udf_invalid_length()
 
+def test_mixed_udf_and_sql(self):
+self._test_mixed_udf_and_sql(Column)
+
 
 if __name__ == "__main__":
 from pyspark.sql.tests.connect.test_parity_pandas_udf_scalar import *  # 
noqa: F401
diff --git a/python/pyspark/sql/tests/pandas/test_pandas_udf_scalar.py 
b/python/pyspark/sql/tests/pandas/test_pandas_udf_scalar.py
index dfbab5c8b3cd..9f6bdb83caf7 100644
--- a/python/pyspark/sql/tests/pandas/test_pandas_udf_scalar.py
+++ b/python/pyspark/sql/tests/pandas/test_pandas_udf_scalar.py
@@ -1321,8 +1321,9 @@ class ScalarPandasUDFTestsMixin:
 self.assertEqual(expected_multi, df_multi_2.collect())
 
 def test_mixed_udf_and_sql(self):
-from pyspark.sql.connect.column import Column as ConnectColumn
+self._test_mixed_udf_and_sql(Column)
 
+def _test_mixed_udf_and_sql(self, col_type):
 df = self.spark.range(0, 1).toDF("v")
 
 # Test mixture of UDFs, Pandas UDFs and SQL expression.
@@ -1333,7 +1334,7 @@ class ScalarPandasUDFTestsMixin:
 return x + 1
 
 def f2(x):
-assert type(x) in (Column, ConnectColumn)
+assert type(x) == col_type
 return x + 10
 
 @pandas_udf("int")





(spark) branch branch-3.4 updated: [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler

2024-01-25 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 3130ac9276bd [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler
3130ac9276bd is described below

commit 3130ac9276bd43dd21aa1aa5e5ef920b00bc3aff
Author: fred-db 
AuthorDate: Thu Jan 25 08:34:37 2024 -0800

[SPARK-46861][CORE] Avoid Deadlock in DAGScheduler

* The DAGScheduler could currently run into a deadlock with another thread 
if both access the partitions of the same RDD at the same time.
* To make progress in getCacheLocs, we require both exclusive access to the 
RDD partitions and the location cache. We first lock on the location cache, and 
then on the RDD.
* When accessing partitions of an RDD, the RDD first acquires exclusive 
access on the partitions, and then might acquire exclusive access on the 
location cache.
* If thread 1 is able to acquire access on the RDD, while thread 2 holds 
the access to the location cache, we can run into a deadlock situation.
* To fix this, acquire locks in the same order. Change the DAGScheduler to 
first acquire the lock on the RDD, and then the lock on the location cache.

* This is a deadlock you can run into, which can prevent any progress on 
the cluster.

* No

* Unit test that reproduces the issue.

No

Closes #44882 from fred-db/fix-deadlock.

Authored-by: fred-db 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 617014cc92d933c70c9865a578fceb265883badd)
Signed-off-by: Dongjoon Hyun 
---
 core/src/main/scala/org/apache/spark/rdd/RDD.scala | 11 ---
 .../org/apache/spark/scheduler/DAGScheduler.scala  | 31 ++
 .../apache/spark/scheduler/DAGSchedulerSuite.scala | 38 +-
 3 files changed, 62 insertions(+), 18 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/rdd/RDD.scala 
b/core/src/main/scala/org/apache/spark/rdd/RDD.scala
index 407820b663a3..fc5a2089f43b 100644
--- a/core/src/main/scala/org/apache/spark/rdd/RDD.scala
+++ b/core/src/main/scala/org/apache/spark/rdd/RDD.scala
@@ -223,14 +223,17 @@ abstract class RDD[T: ClassTag](
* not use `this` because RDDs are user-visible, so users might have added 
their own locking on
* RDDs; sharing that could lead to a deadlock.
*
-   * One thread might hold the lock on many of these, for a chain of RDD 
dependencies; but
-   * because DAGs are acyclic, and we only ever hold locks for one path in 
that DAG, there is no
-   * chance of deadlock.
+   * One thread might hold the lock on many of these, for a chain of RDD 
dependencies. Deadlocks
+   * are possible if we try to lock another resource while holding the 
stateLock,
+   * and the lock acquisition sequence of these locks is not guaranteed to be 
the same.
+   * This can lead lead to a deadlock as one thread might first acquire the 
stateLock,
+   * and then the resource,
+   * while another thread might first acquire the resource, and then the 
stateLock.
*
* Executors may reference the shared fields (though they should never 
mutate them,
* that only happens on the driver).
*/
-  private val stateLock = new Serializable {}
+  private[spark] val stateLock = new Serializable {}
 
   // Our dependencies and partitions will be gotten by calling subclass's 
methods below, and will
   // be overwritten when we're checkpointed
diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
index 2a966fab6f02..26be8c72bbcb 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
@@ -173,6 +173,9 @@ private[spark] class DAGScheduler(
* locations where that RDD partition is cached.
*
* All accesses to this map should be guarded by synchronizing on it (see 
SPARK-4454).
+   * If you need to access any RDD while synchronizing on the cache locations,
+   * first synchronize on the RDD, and then synchronize on this map to avoid 
deadlocks. The RDD
+   * could try to access the cache locations after synchronizing on the RDD.
*/
   private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]]
 
@@ -408,22 +411,24 @@ private[spark] class DAGScheduler(
   }
 
   private[scheduler]
-  def getCacheLocs(rdd: RDD[_]): IndexedSeq[Seq[TaskLocation]] = 
cacheLocs.synchronized {
-// Note: this doesn't use `getOrElse()` because this method is called 
O(num tasks) times
-if (!cacheLocs.contains(rdd.id)) {
-  // Note: if the storage level is NONE, we don't need to get locations 
from block manager.
-  val locs: IndexedSeq[Seq[TaskLocation]] = if (rdd.getStorageLevel == 
StorageLevel.NONE) {
-IndexedSeq.fill(rdd.partitions.length)(Nil)

(spark) branch branch-3.5 updated: [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler

2024-01-25 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 125b2f87d453 [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler
125b2f87d453 is described below

commit 125b2f87d453a16325f24e7382707f2b365bba14
Author: fred-db 
AuthorDate: Thu Jan 25 08:34:37 2024 -0800

[SPARK-46861][CORE] Avoid Deadlock in DAGScheduler

* The DAGScheduler could currently run into a deadlock with another thread 
if both access the partitions of the same RDD at the same time.
* To make progress in getCacheLocs, we require both exclusive access to the 
RDD partitions and the location cache. We first lock on the location cache, and 
then on the RDD.
* When accessing partitions of an RDD, the RDD first acquires exclusive 
access on the partitions, and then might acquire exclusive access on the 
location cache.
* If thread 1 is able to acquire access on the RDD, while thread 2 holds 
the access to the location cache, we can run into a deadlock situation.
* To fix this, acquire locks in the same order. Change the DAGScheduler to 
first acquire the lock on the RDD, and then the lock on the location cache.

* This is a deadlock you can run into, which can prevent any progress on 
the cluster.

* No

* Unit test that reproduces the issue.

No

Closes #44882 from fred-db/fix-deadlock.

Authored-by: fred-db 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 617014cc92d933c70c9865a578fceb265883badd)
Signed-off-by: Dongjoon Hyun 
---
 core/src/main/scala/org/apache/spark/rdd/RDD.scala | 11 ---
 .../org/apache/spark/scheduler/DAGScheduler.scala  | 31 ++
 .../apache/spark/scheduler/DAGSchedulerSuite.scala | 38 +-
 3 files changed, 62 insertions(+), 18 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/rdd/RDD.scala 
b/core/src/main/scala/org/apache/spark/rdd/RDD.scala
index a21d2ae77396..f695b1020275 100644
--- a/core/src/main/scala/org/apache/spark/rdd/RDD.scala
+++ b/core/src/main/scala/org/apache/spark/rdd/RDD.scala
@@ -223,14 +223,17 @@ abstract class RDD[T: ClassTag](
* not use `this` because RDDs are user-visible, so users might have added 
their own locking on
* RDDs; sharing that could lead to a deadlock.
*
-   * One thread might hold the lock on many of these, for a chain of RDD 
dependencies; but
-   * because DAGs are acyclic, and we only ever hold locks for one path in 
that DAG, there is no
-   * chance of deadlock.
+   * One thread might hold the lock on many of these, for a chain of RDD 
dependencies. Deadlocks
+   * are possible if we try to lock another resource while holding the 
stateLock,
+   * and the lock acquisition sequence of these locks is not guaranteed to be 
the same.
+   * This can lead lead to a deadlock as one thread might first acquire the 
stateLock,
+   * and then the resource,
+   * while another thread might first acquire the resource, and then the 
stateLock.
*
* Executors may reference the shared fields (though they should never 
mutate them,
* that only happens on the driver).
*/
-  private val stateLock = new Serializable {}
+  private[spark] val stateLock = new Serializable {}
 
   // Our dependencies and partitions will be gotten by calling subclass's 
methods below, and will
   // be overwritten when we're checkpointed
diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
index d8adaae19b90..89d16e579348 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
@@ -174,6 +174,9 @@ private[spark] class DAGScheduler(
* locations where that RDD partition is cached.
*
* All accesses to this map should be guarded by synchronizing on it (see 
SPARK-4454).
+   * If you need to access any RDD while synchronizing on the cache locations,
+   * first synchronize on the RDD, and then synchronize on this map to avoid 
deadlocks. The RDD
+   * could try to access the cache locations after synchronizing on the RDD.
*/
   private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]]
 
@@ -420,22 +423,24 @@ private[spark] class DAGScheduler(
   }
 
   private[scheduler]
-  def getCacheLocs(rdd: RDD[_]): IndexedSeq[Seq[TaskLocation]] = 
cacheLocs.synchronized {
-// Note: this doesn't use `getOrElse()` because this method is called 
O(num tasks) times
-if (!cacheLocs.contains(rdd.id)) {
-  // Note: if the storage level is NONE, we don't need to get locations 
from block manager.
-  val locs: IndexedSeq[Seq[TaskLocation]] = if (rdd.getStorageLevel == 
StorageLevel.NONE) {
-IndexedSeq.fill(rdd.partitions.length)(Nil)

(spark) branch master updated: [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler

2024-01-25 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 617014cc92d9 [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler
617014cc92d9 is described below

commit 617014cc92d933c70c9865a578fceb265883badd
Author: fred-db 
AuthorDate: Thu Jan 25 08:34:37 2024 -0800

[SPARK-46861][CORE] Avoid Deadlock in DAGScheduler

### What changes were proposed in this pull request?

* The DAGScheduler could currently run into a deadlock with another thread 
if both access the partitions of the same RDD at the same time.
* To make progress in getCacheLocs, we require both exclusive access to the 
RDD partitions and the location cache. We first lock on the location cache, and 
then on the RDD.
* When accessing partitions of an RDD, the RDD first acquires exclusive 
access on the partitions, and then might acquire exclusive access on the 
location cache.
* If thread 1 is able to acquire access on the RDD, while thread 2 holds 
the access to the location cache, we can run into a deadlock situation.
* To fix this, acquire locks in the same order. Change the DAGScheduler to 
first acquire the lock on the RDD, and then the lock on the location cache 
(see the sketch after this list).
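
A minimal, self-contained illustration of the hazard and of the consistent lock 
ordering (plain Python stand-ins, not Spark code):

```
import threading

# Stand-ins for RDD.stateLock and the DAGScheduler's cacheLocs map.
rdd_lock = threading.Lock()
cache_locs_lock = threading.Lock()
cache_locs = {}  # rdd_id -> cached task locations


def get_cache_locs_deadlock_prone(rdd_id):
    # Old order: cacheLocs first, then the RDD. A second thread computing
    # rdd.partitions already holds the RDD lock and may then want the
    # cacheLocs lock -- the opposite order, so both threads can block forever.
    with cache_locs_lock:
        with rdd_lock:
            return cache_locs.get(rdd_id, [])


def get_cache_locs_fixed(rdd_id):
    # New order (what the patch does): always take the RDD lock first and the
    # cacheLocs lock second, so every thread acquires them in the same order.
    with rdd_lock:
        with cache_locs_lock:
            return cache_locs.get(rdd_id, [])
```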

### Why are the changes needed?

* This is a deadlock you can run into, which can prevent any progress on 
the cluster.

### Does this PR introduce _any_ user-facing change?

* No

### How was this patch tested?

* Unit test that reproduces the issue.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44882 from fred-db/fix-deadlock.

Authored-by: fred-db 
Signed-off-by: Dongjoon Hyun 
---
 core/src/main/scala/org/apache/spark/rdd/RDD.scala | 11 ---
 .../org/apache/spark/scheduler/DAGScheduler.scala  | 31 ++
 .../apache/spark/scheduler/DAGSchedulerSuite.scala | 38 +-
 3 files changed, 62 insertions(+), 18 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/rdd/RDD.scala 
b/core/src/main/scala/org/apache/spark/rdd/RDD.scala
index d73fb1b9bc3b..a48eaa253ad1 100644
--- a/core/src/main/scala/org/apache/spark/rdd/RDD.scala
+++ b/core/src/main/scala/org/apache/spark/rdd/RDD.scala
@@ -224,14 +224,17 @@ abstract class RDD[T: ClassTag](
* not use `this` because RDDs are user-visible, so users might have added 
their own locking on
* RDDs; sharing that could lead to a deadlock.
*
-   * One thread might hold the lock on many of these, for a chain of RDD 
dependencies; but
-   * because DAGs are acyclic, and we only ever hold locks for one path in 
that DAG, there is no
-   * chance of deadlock.
+   * One thread might hold the lock on many of these, for a chain of RDD 
dependencies. Deadlocks
+   * are possible if we try to lock another resource while holding the 
stateLock,
+   * and the lock acquisition sequence of these locks is not guaranteed to be 
the same.
+   * This can lead lead to a deadlock as one thread might first acquire the 
stateLock,
+   * and then the resource,
+   * while another thread might first acquire the resource, and then the 
stateLock.
*
* Executors may reference the shared fields (though they should never 
mutate them,
* that only happens on the driver).
*/
-  private val stateLock = new Serializable {}
+  private[spark] val stateLock = new Serializable {}
 
   // Our dependencies and partitions will be gotten by calling subclass's 
methods below, and will
   // be overwritten when we're checkpointed
diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
index e728d921d290..e74a3efac250 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
@@ -181,6 +181,9 @@ private[spark] class DAGScheduler(
* locations where that RDD partition is cached.
*
* All accesses to this map should be guarded by synchronizing on it (see 
SPARK-4454).
+   * If you need to access any RDD while synchronizing on the cache locations,
+   * first synchronize on the RDD, and then synchronize on this map to avoid 
deadlocks. The RDD
+   * could try to access the cache locations after synchronizing on the RDD.
*/
   private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]]
 
@@ -435,22 +438,24 @@ private[spark] class DAGScheduler(
   }
 
   private[scheduler]
-  def getCacheLocs(rdd: RDD[_]): IndexedSeq[Seq[TaskLocation]] = 
cacheLocs.synchronized {
-// Note: this doesn't use `getOrElse()` because this method is called 
O(num tasks) times
-if (!cacheLocs.contains(rdd.id)) {
-  // Note: if the storage level is NONE, we don't need to get locations 

(spark) branch master updated: [SPARK-46855][INFRA] Add `sketch` to the dependencies of the `catalyst` in `module.py`

2024-01-25 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 1bee07e39f1b [SPARK-46855][INFRA] Add `sketch` to the dependencies of 
the `catalyst` in `module.py`
1bee07e39f1b is described below

commit 1bee07e39f1b5aef6ce81e028207691f1dd1fc7c
Author: yangjie01 
AuthorDate: Thu Jan 25 08:26:13 2024 -0800

[SPARK-46855][INFRA] Add `sketch` to the dependencies of the `catalyst` in 
`module.py`

### What changes were proposed in this pull request?
This PR adds `sketch` to the dependencies of the `catalyst` module in 
`modules.py`, because `sketch` is a direct dependency of the `catalyst` module.

### Why are the changes needed?
Ensure that when the `sketch` module is modified, tests are triggered for 
`catalyst` and its cascading (downstream) modules.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44878 from LuciferYang/SPARK-46855.

Lead-authored-by: yangjie01 
Co-authored-by: YangJie 
Signed-off-by: Dongjoon Hyun 
---
 dev/sparktestsupport/modules.py | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index be3e798b0779..b9541c4be9b3 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -179,6 +179,15 @@ launcher = Module(
 ],
 )
 
+sketch = Module(
+name="sketch",
+dependencies=[tags],
+source_file_regexes=[
+"common/sketch/",
+],
+sbt_test_goals=["sketch/test"],
+)
+
 core = Module(
 name="core",
 dependencies=[kvstore, network_common, network_shuffle, unsafe, launcher, 
utils],
@@ -200,7 +209,7 @@ api = Module(
 
 catalyst = Module(
 name="catalyst",
-dependencies=[tags, core, api],
+dependencies=[tags, sketch, core, api],
 source_file_regexes=[
 "sql/catalyst/",
 ],
@@ -315,15 +324,6 @@ connect = Module(
 ],
 )
 
-sketch = Module(
-name="sketch",
-dependencies=[tags],
-source_file_regexes=[
-"common/sketch/",
-],
-sbt_test_goals=["sketch/test"],
-)
-
 graphx = Module(
 name="graphx",
 dependencies=[tags, core],





Re: [PR] docs: udpate third party projects [spark-website]

2024-01-25 Thread via GitHub


MrPowers commented on PR #497:
URL: https://github.com/apache/spark-website/pull/497#issuecomment-1910538353

   Thank you for reviewing @dongjoon-hyun and @srowen!





Re: [PR] docs: udpate third party projects [spark-website]

2024-01-25 Thread via GitHub


dongjoon-hyun commented on PR #497:
URL: https://github.com/apache/spark-website/pull/497#issuecomment-1910537370

   Thank you for updating, @MrPowers .





(spark-website) branch asf-site updated: docs: udpate third party projects (#497)

2024-01-25 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/spark-website.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new a6ce63fb9c docs: udpate third party projects (#497)
a6ce63fb9c is described below

commit a6ce63fb9c82dc8f25f42f377b487c0de2aff826
Author: Matthew Powers 
AuthorDate: Thu Jan 25 11:18:05 2024 -0500

docs: udpate third party projects (#497)
---
 site/third-party-projects.html | 79 ++
 third-party-projects.md| 77 
 2 files changed, 81 insertions(+), 75 deletions(-)

diff --git a/site/third-party-projects.html b/site/third-party-projects.html
index ba0911b733..a0f7a953f8 100644
--- a/site/third-party-projects.html
+++ b/site/third-party-projects.html
@@ -141,40 +141,57 @@
 
   This page tracks external software projects that supplement Apache 
Spark and add to its ecosystem.
 
-To add a project, open a pull request against the https://github.com/apache/spark-website;>spark-website 
-repository. Add an entry to 
-https://github.com/apache/spark-website/blob/asf-site/third-party-projects.md;>this
 markdown file, 
-then run jekyll 
build to generate the HTML too. Include
-both in your pull request. See the README in this repo for more 
information.
+Popular libraries with 
PySpark integrations
 
-Note that all project and product names should follow trademark guidelines.
+
+  https://github.com/great-expectations/great_expectations;>great-expectations
 - Always know what to expect from your data
+  https://github.com/apache/airflow;>Apache Airflow - A 
platform to programmatically author, schedule, and monitor workflows
+  https://github.com/dmlc/xgboost;>xgboost - Scalable, 
portable and distributed gradient boosting
+  https://github.com/shap/shap;>shap - A game theoretic 
approach to explain the output of any machine learning model
+  https://github.com/awslabs/python-deequ;>python-deequ - 
Measures data quality in large datasets
+  https://github.com/datahub-project/datahub;>datahub - 
Metadata platform for the modern data stack
+  https://github.com/dbt-labs/dbt-spark;>dbt-spark - Enables 
dbt to work with Apache Spark
+
 
-spark-packages.org
+Connectors
 
-https://spark-packages.org/;>spark-packages.org is an 
external, 
-community-managed list of third-party libraries, add-ons, and applications 
that work with 
-Apache Spark. You can add a package as long as you have a GitHub 
repository.
+
+  https://github.com/spark-redshift-community/spark-redshift;>spark-redshift
 - Performant Redshift data source for Apache Spark
+  https://github.com/microsoft/sql-spark-connector;>spark-sql-connector 
- Apache Spark Connector for SQL Server and Azure SQL
+  https://github.com/Azure/azure-cosmosdb-spark;>azure-cosmos-spark - 
Apache Spark Connector for Azure Cosmos DB
+  https://github.com/Azure/azure-event-hubs-spark;>azure-event-hubs-spark
 - Enables continuous data processing with Apache Spark and Azure Event 
Hubs
+  https://github.com/Azure/azure-kusto-spark;>azure-kusto-spark - 
Apache Spark connector for Azure Kusto
+  https://github.com/mongodb/mongo-spark;>mongo-spark - The 
MongoDB Spark connector
+  https://github.com/couchbase/couchbase-spark-connector;>couchbase-spark-connector
 - The Official Couchbase Spark connector
+  https://github.com/datastax/spark-cassandra-connector;>spark-cassandra-connector
 - DataStax connector for Apache Spark to Apache Cassandra
+  https://github.com/elastic/elasticsearch-hadoop;>elasticsearch-hadoop 
- Elasticsearch real-time search and analytics natively integrated with 
Spark
+  https://github.com/neo4j-contrib/neo4j-spark-connector;>neo4j-spark-connector
 - Neo4j Connector for Apache Spark
+  https://github.com/StarRocks/starrocks-connector-for-apache-spark;>starrocks-connector-for-apache-spark
 - StarRocks Apache Spark connector
+  https://github.com/pingcap/tispark;>tispark - TiSpark is 
built for running Apache Spark on top of TiDB/TiKV
+
+
+Open table formats
+
+
+  https://delta.io;>Delta Lake - Storage layer that provides 
ACID transactions and scalable metadata handling for Apache Spark workloads
+  https://github.com/apache/hudi;>Hudi: Upserts, Deletes And 
Incremental Processing on Big Data
+  https://github.com/apache/iceberg;>Iceberg - Open table 
format for analytic datasets
+
 
 Infrastructure projects
 
 
-  https://github.com/spark-jobserver/spark-jobserver;>REST Job 
Server for Apache Spark - 
-REST interface for managing and submitting Spark jobs on the same cluster.
-  http://mlbase.org/;>MLbase - Machine Learning research 
project on top of Spark
+  https://github.com/apache/kyuubi;>Kyuubi - Apache Kyuubi is 
a distributed and multi-tenant gateway to provide serverless SQL on data 
warehouses and lakehouses
+  https://github.com/spark-jobserver/spark-jobserver;>REST Job 
Server for Apache 

Re: [PR] docs: udpate third party projects [spark-website]

2024-01-25 Thread via GitHub


dongjoon-hyun merged PR #497:
URL: https://github.com/apache/spark-website/pull/497





[PR] docs: udpate third party projects [spark-website]

2024-01-25 Thread via GitHub


MrPowers opened a new pull request, #497:
URL: https://github.com/apache/spark-website/pull/497

   This PR makes a few changes to the [Third Party Projects 
page](https://spark.apache.org/third-party-projects.html).  
   
   * removes all projects that are archived or haven't been updated in the last 
5 years
   * adds connectors that are being actively maintained
   * Adds the most popular Python projects that depend on PySpark and have a 
PySpark integration (determined by number of downloads)
   
   Let me know what you think!
   





(spark) branch master updated: [SPARK-46850][SQL] Convert `_LEGACY_ERROR_TEMP_2102 ` to `UNSUPPORTED_DATATYPE`

2024-01-25 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 43adfa070a40 [SPARK-46850][SQL] Convert `_LEGACY_ERROR_TEMP_2102 ` to 
`UNSUPPORTED_DATATYPE`
43adfa070a40 is described below

commit 43adfa070a40832d8316be8db164e3aca8a4f593
Author: panbingkun 
AuthorDate: Thu Jan 25 18:04:17 2024 +0300

[SPARK-46850][SQL] Convert `_LEGACY_ERROR_TEMP_2102 ` to 
`UNSUPPORTED_DATATYPE`

### What changes were proposed in this pull request?
The PR aims to:
- convert `_LEGACY_ERROR_TEMP_2102` to `UNSUPPORTED_DATATYPE`;
- remove some outdated comments.

### Why are the changes needed?
The changes improve the error framework.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Add new UT
- Pass GA

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44871 from panbingkun/LEGACY_ERROR_TEMP_2102.

Lead-authored-by: panbingkun 
Co-authored-by: Maxim Gekk 
Signed-off-by: Max Gekk 
---
 .../src/main/resources/error/error-classes.json|  5 -
 .../spark/sql/catalyst/csv/UnivocityParser.scala   |  6 ++---
 .../spark/sql/catalyst/json/JacksonParser.scala|  4 ++--
 .../spark/sql/errors/QueryExecutionErrors.scala|  7 --
 .../spark/sql/execution/columnar/ColumnType.scala  |  4 ++--
 .../org/apache/spark/sql/CsvFunctionsSuite.scala   | 26 +-
 .../sql/execution/columnar/ColumnTypeSuite.scala   | 14 +++-
 7 files changed, 39 insertions(+), 27 deletions(-)

diff --git a/common/utils/src/main/resources/error/error-classes.json 
b/common/utils/src/main/resources/error/error-classes.json
index 1f3122a502c5..64d65fd4beed 100644
--- a/common/utils/src/main/resources/error/error-classes.json
+++ b/common/utils/src/main/resources/error/error-classes.json
@@ -5966,11 +5966,6 @@
   "Not support non-primitive type now."
 ]
   },
-  "_LEGACY_ERROR_TEMP_2102" : {
-"message" : [
-  "Unsupported type: ."
-]
-  },
   "_LEGACY_ERROR_TEMP_2103" : {
 "message" : [
   "Dictionary encoding should not be used because of dictionary overflow."
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
index b99ee630d4b2..eb7e120277bb 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
@@ -30,13 +30,12 @@ import org.apache.spark.sql.catalyst.expressions.{Cast, 
EmptyRow, ExprUtils, Gen
 import org.apache.spark.sql.catalyst.util._
 import org.apache.spark.sql.catalyst.util.LegacyDateFormats.FAST_DATE_FORMAT
 import org.apache.spark.sql.catalyst.util.ResolveDefaultColumns._
-import org.apache.spark.sql.errors.QueryExecutionErrors
+import org.apache.spark.sql.errors.{ExecutionErrors, QueryExecutionErrors}
 import org.apache.spark.sql.internal.{LegacyBehaviorPolicy, SQLConf}
 import org.apache.spark.sql.sources.Filter
 import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types.UTF8String
 
-
 /**
  * Constructs a parser for a given schema that translates CSV data to an 
[[InternalRow]].
  *
@@ -273,8 +272,7 @@ class UnivocityParser(
 case udt: UserDefinedType[_] =>
   makeConverter(name, udt.sqlType, nullable)
 
-// We don't actually hit this exception though, we keep it for 
understandability
-case _ => throw QueryExecutionErrors.unsupportedTypeError(dataType)
+case _ => throw ExecutionErrors.unsupportedDataTypeError(dataType)
   }
 
   private def nullSafeDatum(
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
index eace96ac8729..36f37888b084 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
@@ -32,7 +32,7 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.util._
 import org.apache.spark.sql.catalyst.util.LegacyDateFormats.FAST_DATE_FORMAT
 import org.apache.spark.sql.catalyst.util.ResolveDefaultColumns._
-import org.apache.spark.sql.errors.QueryExecutionErrors
+import org.apache.spark.sql.errors.{ExecutionErrors, QueryExecutionErrors}
 import org.apache.spark.sql.internal.{LegacyBehaviorPolicy, SQLConf}
 import org.apache.spark.sql.sources.Filter
 import org.apache.spark.sql.types._
@@ -381,7 +381,7 @@ class JacksonParser(
   }
 
 // We don't actually hit this exception though, we keep it for 
understandability
-case 

(spark) branch master updated: [SPARK-46620][PS][CONNECT] Introduce a basic fallback mechanism for frame methods

2024-01-25 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8e1fa5616068 [SPARK-46620][PS][CONNECT] Introduce a basic fallback 
mechanism for frame methods
8e1fa5616068 is described below

commit 8e1fa56160686219039d4cd24db867b982c3af25
Author: Ruifeng Zheng 
AuthorDate: Thu Jan 25 17:26:26 2024 +0800

[SPARK-46620][PS][CONNECT] Introduce a basic fallback mechanism for frame 
methods

### What changes were proposed in this pull request?
1. Introduce a basic fallback mechanism for frame methods, guarded by a new 
option `compute.pandas_fallback` (default: false);
2. Implement `Frame.asfreq` and `Frame.asof`.

### Why are the changes needed?
for pandas parity

### Does this PR introduce _any_ user-facing change?
yes

```
In [1]: import pyspark.pandas as ps
   ...: import pandas as pd
   ...:
   ...: index = pd.date_range('1/1/2000', periods=4, freq='min')
   ...: series = pd.Series([0.0, None, 2.0, 3.0], index=index)
   ...: pdf = pd.DataFrame({'s': series})
   ...: psdf = ps.from_pandas(pdf)

In [2]: psdf.asfreq(freq='30s')
---
PandasNotImplementedError Traceback (most recent call last)
Cell In[2], line 1
> 1 psdf.asfreq(freq='30s')

File ~/Dev/spark/python/pyspark/pandas/missing/__init__.py:23, in 
unsupported_function..unsupported_function(*args, **kwargs)
 22 def unsupported_function(*args, **kwargs):
---> 23 raise PandasNotImplementedError(
 24 class_name=class_name, method_name=method_name, 
reason=reason
 25 )

PandasNotImplementedError: The method `pd.DataFrame.asfreq()` is not 
implemented yet.

In [3]: ps.set_option("compute.pandas_fallback", True)

In [4]: psdf.asfreq(freq='30s')
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:1015: 
PandasAPIOnSparkAdviceWarning: `asfreq` is executed in fallback mode. It loads 
partial data into the driver's memory to infer the schema, and loads all data 
into one executor's memory to compute. It should only be used if the pandas 
DataFrame is expected to be small.
  warnings.warn(message, PandasAPIOnSparkAdviceWarning)
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:1015: 
PandasAPIOnSparkAdviceWarning: If the type hints is not specified for 
`groupby.apply`, it is expensive to infer the data type internally.
  warnings.warn(message, PandasAPIOnSparkAdviceWarning)
Out[4]:
   s
2000-01-01 00:00:00  0.0
2000-01-01 00:00:30  NaN
2000-01-01 00:01:00  NaN
2000-01-01 00:01:30  NaN
2000-01-01 00:02:00  2.0
2000-01-01 00:02:30  NaN
2000-01-01 00:03:00  3.0
```
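
Because the fallback collects data onto a single node, it is worth scoping the option narrowly. A small sketch (assuming the usual pandas-on-Spark option helpers and the same toy frame as above; `option_context` restores the previous value on exit):

```
import pandas as pd
import pyspark.pandas as ps

# Same toy data as in the example above.
index = pd.date_range("1/1/2000", periods=4, freq="min")
pdf = pd.DataFrame({"s": [0.0, None, 2.0, 3.0]}, index=index)
psdf = ps.from_pandas(pdf)

# Enable the fallback only around the call that needs it, instead of globally;
# the option reverts to its previous value when the block exits.
with ps.option_context("compute.pandas_fallback", True):
    resampled = psdf.asfreq(freq="30s")
```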

### How was this patch tested?
added ut

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44869 from zhengruifeng/ps_df_fallback.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 dev/sparktestsupport/modules.py|   6 ++
 .../source/user_guide/pandas_on_spark/options.rst  |   2 +
 python/pyspark/pandas/config.py|   9 ++
 python/pyspark/pandas/frame.py |  43 +
 .../tests/connect/frame/test_parity_asfreq.py  |  41 
 .../pandas/tests/connect/frame/test_parity_asof.py |  41 
 python/pyspark/pandas/tests/frame/test_asfreq.py   |  87 +
 python/pyspark/pandas/tests/frame/test_asof.py | 106 +
 8 files changed, 335 insertions(+)

diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index 0500bf38ea8e..be3e798b0779 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -819,6 +819,9 @@ pyspark_pandas = Module(
 "pyspark.pandas.tests.io.test_dataframe_conversion",
 "pyspark.pandas.tests.io.test_dataframe_spark_io",
 "pyspark.pandas.tests.io.test_series_conversion",
+# fallback
+"pyspark.pandas.tests.frame.test_asfreq",
+"pyspark.pandas.tests.frame.test_asof",
 ],
 excluded_python_implementations=[
 "PyPy"  # Skip these tests under PyPy since they require numpy, 
pandas, and pyarrow and
@@ -1200,6 +1203,9 @@ pyspark_pandas_connect_part1 = Module(
 "pyspark.pandas.tests.connect.reshape.test_parity_get_dummies_object",
 "pyspark.pandas.tests.connect.reshape.test_parity_get_dummies_prefix",
 "pyspark.pandas.tests.connect.reshape.test_parity_merge_asof",
+# fallback
+"pyspark.pandas.tests.connect.frame.test_parity_asfreq",
+"pyspark.pandas.tests.connect.frame.test_parity_asof",
 ],
 

(spark) branch master updated: [SPARK-46851][DOCS] Remove `buf` version information from the doc `contributing.rst`

2024-01-25 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 95ea2a6dd403 [SPARK-46851][DOCS] Remove `buf` version information from 
the doc `contributing.rst`
95ea2a6dd403 is described below

commit 95ea2a6dd40399370f75f56dffc66608f351e39e
Author: panbingkun 
AuthorDate: Thu Jan 25 17:14:22 2024 +0800

[SPARK-46851][DOCS] Remove `buf` version information from the doc 
`contributing.rst`

### What changes were proposed in this pull request?
The pr aims to remove `buf` version information from the doc 
`contributing.rst`.

### Why are the changes needed?
Since the master branch now uses `bufbuild/buf-setup-actionv1` to install `buf`, 
it uses the `latest` version by default; when a new version is released, the 
version stated in the document no longer matches the version actually used.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Manually test.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44877 from panbingkun/SPARK-46851.

Authored-by: panbingkun 
Signed-off-by: Ruifeng Zheng 
---
 python/docs/source/development/contributing.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/docs/source/development/contributing.rst 
b/python/docs/source/development/contributing.rst
index 3eabb96c..94e485c706e3 100644
--- a/python/docs/source/development/contributing.rst
+++ b/python/docs/source/development/contributing.rst
@@ -120,7 +120,7 @@ Prerequisite
 
 PySpark development requires to build Spark that needs a proper JDK installed, 
etc. See `Building Spark 
`_ for more details.
 
-Note that if you intend to contribute to Spark Connect in Python, ``buf`` 
version ``1.28.1`` is required, see `Buf Installation 
`_ for more details.
+Note that if you intend to contribute to Spark Connect in Python, ``buf`` is 
required, see `Buf Installation `_ for 
more details.
 
 Conda
 ~





(spark) branch master updated: [SPARK-46856][PS][TESTS] Apply approximate equality in ewm tests

2024-01-25 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 38a517de8e22 [SPARK-46856][PS][TESTS] Apply approximate equality in 
ewm tests
38a517de8e22 is described below

commit 38a517de8e22183ac92c693e0137bd5bfb3c88a4
Author: Ruifeng Zheng 
AuthorDate: Thu Jan 25 17:11:22 2024 +0800

[SPARK-46856][PS][TESTS] Apply approximate equality in ewm tests

### What changes were proposed in this pull request?
Apply approximate equality in ewm tests

### Why are the changes needed?
The `ewm` function in Spark is based on the `EWM` expression in Scala, so the 
results do not need to be compared exactly.

On various environments, some tests may fail like:
```
Traceback (most recent call last):
  File "/home/jenkins/python/pyspark/testing/pandasutils.py", line 91, in 
_assert_pandas_equal
assert_frame_equal(
  File 
"/databricks/python3/lib/python3.10/site-packages/pandas/_testing/asserters.py",
 line 1308, in assert_frame_equal
assert_series_equal(
  File 
"/databricks/python3/lib/python3.10/site-packages/pandas/_testing/asserters.py",
 line 1018, in assert_series_equal
assert_numpy_array_equal(
  File 
"/databricks/python3/lib/python3.10/site-packages/pandas/_testing/asserters.py",
 line 741, in assert_numpy_array_equal
_raise(left, right, err_msg)
  File 
"/databricks/python3/lib/python3.10/site-packages/pandas/_testing/asserters.py",
 line 735, in _raise
raise_assert_detail(obj, msg, left, right, index_values=index_values)
  File 
"/databricks/python3/lib/python3.10/site-packages/pandas/_testing/asserters.py",
 line 665, in raise_assert_detail
raise AssertionError(msg)
AssertionError: DataFrame.iloc[:, 1] (column name="b") are different
DataFrame.iloc[:, 1] (column name="b") values are different (25.0 %)
[index]: [0.9781772871933869, 0.6938842103849581, 0.05954110855254491, 
0.43191250286369276]
[left]:  [4.0, 2.4615384615384617, 2.848920863309352, 1.5441072688779112]
[right]: [4.0, 2.4615384615384617, 2.8489208633093526, 1.5441072688779112]
```
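
The patch addresses this by comparing with a tolerance instead of exact equality. A minimal sketch in plain pandas/NumPy (not the pyspark test utilities) of that kind of approximate check:

```
import numpy as np
import pandas as pd

pser = pd.Series([1.0, 2.0, 3.0, 4.0])
left = pser.ewm(com=0.2).mean()
# Simulate the tiny cross-environment difference seen in the failure above.
right = left + 1e-12

# An exact comparison would flag the mismatch; a relative tolerance accepts it.
pd.testing.assert_series_equal(left, right, check_exact=False, rtol=1e-7)
assert np.allclose(left, right)
```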

### Does this PR introduce _any_ user-facing change?
no, test only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44879 from zhengruifeng/ps_test_ewm_almost.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../pyspark/pandas/tests/window/test_ewm_mean.py   | 106 +
 1 file changed, 90 insertions(+), 16 deletions(-)

diff --git a/python/pyspark/pandas/tests/window/test_ewm_mean.py 
b/python/pyspark/pandas/tests/window/test_ewm_mean.py
index 00750b867610..5777445ecc57 100644
--- a/python/pyspark/pandas/tests/window/test_ewm_mean.py
+++ b/python/pyspark/pandas/tests/window/test_ewm_mean.py
@@ -25,56 +25,110 @@ class EWMMeanMixin:
 def _test_ewm_func(self, f):
 pser = pd.Series([1, 2, 3], index=np.random.rand(3), name="a")
 psser = ps.from_pandas(pser)
-self.assert_eq(getattr(psser.ewm(com=0.2), f)(), 
getattr(pser.ewm(com=0.2), f)())
 self.assert_eq(
-getattr(psser.ewm(com=0.2), f)().sum(), getattr(pser.ewm(com=0.2), 
f)().sum()
+getattr(psser.ewm(com=0.2), f)(),
+getattr(pser.ewm(com=0.2), f)(),
+almost=True,
 )
-self.assert_eq(getattr(psser.ewm(span=1.7), f)(), 
getattr(pser.ewm(span=1.7), f)())
 self.assert_eq(
-getattr(psser.ewm(span=1.7), f)().sum(), 
getattr(pser.ewm(span=1.7), f)().sum()
+getattr(psser.ewm(com=0.2), f)().sum(),
+getattr(pser.ewm(com=0.2), f)().sum(),
+almost=True,
 )
-self.assert_eq(getattr(psser.ewm(halflife=0.5), f)(), 
getattr(pser.ewm(halflife=0.5), f)())
 self.assert_eq(
-getattr(psser.ewm(halflife=0.5), f)().sum(), 
getattr(pser.ewm(halflife=0.5), f)().sum()
+getattr(psser.ewm(span=1.7), f)(),
+getattr(pser.ewm(span=1.7), f)(),
+almost=True,
 )
-self.assert_eq(getattr(psser.ewm(alpha=0.7), f)(), 
getattr(pser.ewm(alpha=0.7), f)())
 self.assert_eq(
-getattr(psser.ewm(alpha=0.7), f)().sum(), 
getattr(pser.ewm(alpha=0.7), f)().sum()
+getattr(psser.ewm(span=1.7), f)().sum(),
+getattr(pser.ewm(span=1.7), f)().sum(),
+almost=True,
+)
+self.assert_eq(
+getattr(psser.ewm(halflife=0.5), f)(),
+getattr(pser.ewm(halflife=0.5), f)(),
+almost=True,
+)
+self.assert_eq(
+getattr(psser.ewm(halflife=0.5), f)().sum(),
+getattr(pser.ewm(halflife=0.5), f)().sum(),
+almost=True,
+)
+