(spark) branch master updated: [SPARK-47039][TESTS] Add a checkstyle rule to ban `commons-lang` in Java code
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7e911cdd0344 [SPARK-47039][TESTS] Add a checkstyle rule to ban `commons-lang` in Java code 7e911cdd0344 is described below commit 7e911cdd0344f164cc6a2976fa832d50589b3a2c Author: Dongjoon Hyun AuthorDate: Wed Feb 14 09:41:09 2024 -0800 [SPARK-47039][TESTS] Add a checkstyle rule to ban `commons-lang` in Java code ### What changes were proposed in this pull request? This PR aims to add a checkstyle rule to ban `commons-lang` in Java code in favor of `commons-lang3`. ### Why are the changes needed? SPARK-16129 banned `commons-lang` in Scala code since Apache Spark 2.0.0. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45097 from dongjoon-hyun/SPARK-47039. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- dev/checkstyle-suppressions.xml | 2 ++ dev/checkstyle.xml | 1 + 2 files changed, 3 insertions(+) diff --git a/dev/checkstyle-suppressions.xml b/dev/checkstyle-suppressions.xml index 37c03759ad5e..7b20dfb6bce5 100644 --- a/dev/checkstyle-suppressions.xml +++ b/dev/checkstyle-suppressions.xml @@ -62,4 +62,6 @@ files="sql/api/src/main/java/org/apache/spark/sql/streaming/Trigger.java"/> + diff --git a/dev/checkstyle.xml b/dev/checkstyle.xml index 5af15318081a..b9997d2050d1 100644 --- a/dev/checkstyle.xml +++ b/dev/checkstyle.xml @@ -186,6 +186,7 @@ + - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
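The rule this commit added is not visible above — the XML inside the diff was stripped in transit. As an illustrative sketch only (not the exact rule SPARK-47039 added), a Checkstyle ban on `commons-lang` imports in favor of `commons-lang3` is commonly expressed with the `IllegalImport` module:

```xml
<!-- Illustrative sketch, not the rule from this commit: ban the old
     commons-lang package. Checkstyle's IllegalImport matches on package
     boundaries, so org.apache.commons.lang3 imports remain allowed. -->
<module name="IllegalImport">
  <property name="illegalPkgs" value="org.apache.commons.lang"/>
</module>
```

A regex-based rule (e.g. `RegexpSinglelineJava` over `import org\.apache\.commons\.lang\.`) achieves the same effect; which form the commit actually used cannot be recovered from the stripped diff.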
(spark) branch master updated: [SPARK-47040][CONNECT] Allow Spark Connect Server Script to wait
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a6bed5e9bcc5 [SPARK-47040][CONNECT] Allow Spark Connect Server Script to wait a6bed5e9bcc5 is described below commit a6bed5e9bcc54dac51421263d5ef73c0b6e0b12c Author: Martin Grund AuthorDate: Wed Feb 14 03:03:30 2024 -0800 [SPARK-47040][CONNECT] Allow Spark Connect Server Script to wait ### What changes were proposed in this pull request? Add an option to the command line of `./sbin/start-connect-server.sh` that leaves it running in the foreground for easier debugging. ``` ./sbin/start-connect-server.sh --wait ``` ### Why are the changes needed? Usability ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Manual ### Was this patch authored or co-authored using generative AI tooling? No Closes #45090 from grundprinzip/start_server_wait. Authored-by: Martin Grund Signed-off-by: Dongjoon Hyun --- sbin/start-connect-server.sh | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/sbin/start-connect-server.sh b/sbin/start-connect-server.sh index a347f43db8b1..fecda717eb34 100755 --- a/sbin/start-connect-server.sh +++ b/sbin/start-connect-server.sh @@ -38,4 +38,10 @@ fi . "${SPARK_HOME}/bin/load-spark-env.sh" -exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 --name "Spark Connect server" "$@" +if [ "$1" == "--wait" ]; then + shift + exec "${SPARK_HOME}"/bin/spark-submit --class $CLASS 1 --name "Spark Connect Server" "$@" +else + exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 --name "Spark Connect server" "$@" +fi +
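The new dispatch in `start-connect-server.sh` consumes the `--wait` flag with `shift` before forwarding the remaining arguments. That flag-handling pattern can be sketched as a standalone script (the function names here are illustrative stand-ins, not the real launchers):

```shell
#!/bin/sh
# Sketch of the --wait dispatch pattern: consume the flag with `shift`,
# then forward the remaining arguments untouched.
run_foreground() { echo "foreground: $*"; }   # stands in for spark-submit
run_daemon()     { echo "daemon: $*"; }       # stands in for spark-daemon.sh

start() {
  if [ "$1" = "--wait" ]; then
    shift                        # drop --wait so it is not passed through
    run_foreground "$@"
  else
    run_daemon "$@"
  fi
}

start --wait --conf spark.foo=1   # prints: foreground: --conf spark.foo=1
start --conf spark.foo=1          # prints: daemon: --conf spark.foo=1
```

Note that the real script launches via `exec`, replacing the shell process so signals reach the server directly when it runs in the foreground.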
(spark) branch branch-3.5 updated: [SPARK-47023][BUILD] Upgrade `aircompressor` to 1.26
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new a8c62d3f9a8d [SPARK-47023][BUILD] Upgrade `aircompressor` to 1.26 a8c62d3f9a8d is described below commit a8c62d3f9a8de22f92e0e0ca1a5770f373b0b142 Author: Dongjoon Hyun AuthorDate: Mon Feb 12 10:37:49 2024 -0800 [SPARK-47023][BUILD] Upgrade `aircompressor` to 1.26 This PR aims to upgrade `aircompressor` to 1.26. `aircompressor` v1.26 has the following bug fixes. - [Fix out of bounds read/write in Snappy decompressor](https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2) - [Fix ZstdOutputStream corruption on double close](https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2) No. Pass the CIs. No. Closes #45084 from dongjoon-hyun/SPARK-47023. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +- pom.xml | 5 + 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index 9ab51dfa011a..c76702cd0af0 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -4,7 +4,7 @@ JTransforms/3.1//JTransforms-3.1.jar RoaringBitmap/0.9.45//RoaringBitmap-0.9.45.jar ST4/4.0.4//ST4-4.0.4.jar activation/1.1.1//activation-1.1.1.jar -aircompressor/0.25//aircompressor-0.25.jar +aircompressor/0.26//aircompressor-0.26.jar algebra_2.12/2.0.1//algebra_2.12-2.0.1.jar aliyun-java-sdk-core/4.5.10//aliyun-java-sdk-core-4.5.10.jar aliyun-java-sdk-kms/2.11.0//aliyun-java-sdk-kms-2.11.0.jar diff --git a/pom.xml b/pom.xml index 52505e6e1200..5db3c78e00eb 100644 --- a/pom.xml +++ b/pom.xml @@ -2555,6 +2555,11 @@ + +io.airlift +aircompressor +0.26 + org.apache.orc orc-mapreduce
(spark) branch master updated: [SPARK-47030][TESTS] Add `WebBrowserTest`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 33a153a6bbcb [SPARK-47030][TESTS] Add `WebBrowserTest` 33a153a6bbcb is described below commit 33a153a6bbcba0d9b2ab20404c7d3b6db86d7b4a Author: Dongjoon Hyun AuthorDate: Mon Feb 12 17:01:35 2024 -0800 [SPARK-47030][TESTS] Add `WebBrowserTest` ### What changes were proposed in this pull request? This PR aims to add a new test tag, `WebBrowserTest`. ### Why are the changes needed? Currently, several browser-based tests exist in multiple modules like the following. It's difficult to find and run them. ``` common/tags/src/test/java/org/apache/spark/tags/WebBrowserTest.java core/src/test/scala/org/apache/spark/deploy/history/ChromeUIHistoryServerSuite.scala core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala core/src/test/scala/org/apache/spark/ui/ChromeUISeleniumSuite.scala core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala sql/core/src/test/scala/org/apache/spark/sql/execution/ui/UISeleniumSuite.scala sql/core/src/test/scala/org/apache/spark/sql/streaming/ui/UISeleniumSuite.scala sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/UISeleniumSuite.scala streaming/src/test/scala/org/apache/spark/streaming/UISeleniumSuite.scala ``` In addition, the previous `ChromeUITest` tag is designed to disable only the `ChromeUI*` suites and doesn't cover all `WebBrowser`-based test suites. ### Does this PR introduce _any_ user-facing change? No, this is a new test tag. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45089 from dongjoon-hyun/SPARK-47030. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../java/org/apache/spark/tags/WebBrowserTest.java | 36 +- .../history/ChromeUIHistoryServerSuite.scala | 4 ++- .../spark/deploy/history/HistoryServerSuite.scala | 4 ++- .../apache/spark/ui/ChromeUISeleniumSuite.scala| 3 +- .../org/apache/spark/ui/UISeleniumSuite.scala | 2 ++ .../spark/sql/execution/ui/UISeleniumSuite.scala | 2 ++ .../spark/sql/streaming/ui/UISeleniumSuite.scala | 4 ++- .../sql/hive/thriftserver/UISeleniumSuite.scala| 2 ++ .../apache/spark/streaming/UISeleniumSuite.scala | 2 ++ 9 files changed, 26 insertions(+), 33 deletions(-) diff --git a/core/src/test/scala/org/apache/spark/ui/ChromeUISeleniumSuite.scala b/common/tags/src/test/java/org/apache/spark/tags/WebBrowserTest.java similarity index 50% copy from core/src/test/scala/org/apache/spark/ui/ChromeUISeleniumSuite.scala copy to common/tags/src/test/java/org/apache/spark/tags/WebBrowserTest.java index 459af6748e0e..715dcbf3b747 100644 --- a/core/src/test/scala/org/apache/spark/ui/ChromeUISeleniumSuite.scala +++ b/common/tags/src/test/java/org/apache/spark/tags/WebBrowserTest.java @@ -15,35 +15,13 @@ * limitations under the License. */ -package org.apache.spark.ui +package org.apache.spark.tags; -import org.openqa.selenium.{JavascriptExecutor, WebDriver} -import org.openqa.selenium.chrome.{ChromeDriver, ChromeOptions} +import java.lang.annotation.*; -import org.apache.spark.tags.ChromeUITest +import org.scalatest.TagAnnotation; -/** - * Selenium tests for the Spark Web UI with Chrome. 
- */ -@ChromeUITest -class ChromeUISeleniumSuite extends RealBrowserUISeleniumSuite("webdriver.chrome.driver") { - - override var webDriver: WebDriver with JavascriptExecutor = _ - - override def beforeAll(): Unit = { -super.beforeAll() -val chromeOptions = new ChromeOptions -chromeOptions.addArguments("--headless", "--disable-gpu") -webDriver = new ChromeDriver(chromeOptions) - } - - override def afterAll(): Unit = { -try { - if (webDriver != null) { -webDriver.quit() - } -} finally { - super.afterAll() -} - } -} +@TagAnnotation +@Retention(RetentionPolicy.RUNTIME) +@Target({ElementType.METHOD, ElementType.TYPE}) +public @interface WebBrowserTest { } diff --git a/core/src/test/scala/org/apache/spark/deploy/history/ChromeUIHistoryServerSuite.scala b/core/src/test/scala/org/apache/spark/deploy/history/ChromeUIHistoryServerSuite.scala index ec910e9bf343..ec9278f81b6c 100644 --- a/core/src/test/scala/org/apache/spark/deploy/history/ChromeUIHistoryServerSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/history/ChromeUIHistoryServerSuite.scala @@ -21,7 +21,7 @@ import org.openqa.selenium.WebDriver import org.openqa.selenium.chrome.{ChromeDriver, ChromeOptions} import org.apache.spark.internal.config.History.HybridStoreD
(spark) branch master updated: [SPARK-47027][PYTHON][TESTS] Use temporary directories for profiler test outputs
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 24a9d25358f7 [SPARK-47027][PYTHON][TESTS] Use temporary directories for profiler test outputs 24a9d25358f7 is described below commit 24a9d25358f71e5634240aa29c600588b838edb2 Author: Takuya UESHIN AuthorDate: Mon Feb 12 13:35:45 2024 -0800 [SPARK-47027][PYTHON][TESTS] Use temporary directories for profiler test outputs ### What changes were proposed in this pull request? Use temporary directories for profiler test outputs instead of `tempfile.gettempdir()`. ### Why are the changes needed? Directly using `tempfile.gettempdir()` can leave the files there after each test. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45087 from ueshin/issues/SPARK-47027/tempdir. 
Authored-by: Takuya UESHIN Signed-off-by: Dongjoon Hyun --- python/pyspark/sql/tests/test_udf_profiler.py | 28 +-- python/pyspark/tests/test_memory_profiler.py | 6 +++--- python/pyspark/tests/test_profiler.py | 6 +++--- 3 files changed, 20 insertions(+), 20 deletions(-) diff --git a/python/pyspark/sql/tests/test_udf_profiler.py b/python/pyspark/sql/tests/test_udf_profiler.py index 4f767d274414..764a860026f8 100644 --- a/python/pyspark/sql/tests/test_udf_profiler.py +++ b/python/pyspark/sql/tests/test_udf_profiler.py @@ -82,20 +82,20 @@ class UDFProfilerTests(unittest.TestCase): finally: sys.stdout = old_stdout -d = tempfile.gettempdir() -self.sc.dump_profiles(d) - -for i, udf_name in enumerate(["add1", "add2", "add1", "add2"]): -id, profiler, _ = profilers[i] -with self.subTest(id=id, udf_name=udf_name): -stats = profiler.stats() -self.assertTrue(stats is not None) -width, stat_list = stats.get_print_list([]) -func_names = [func_name for fname, n, func_name in stat_list] -self.assertTrue(udf_name in func_names) - -self.assertTrue(udf_name in io.getvalue()) -self.assertTrue("udf_%d.pstats" % id in os.listdir(d)) +with tempfile.TemporaryDirectory() as d: +self.sc.dump_profiles(d) + +for i, udf_name in enumerate(["add1", "add2", "add1", "add2"]): +id, profiler, _ = profilers[i] +with self.subTest(id=id, udf_name=udf_name): +stats = profiler.stats() +self.assertTrue(stats is not None) +width, stat_list = stats.get_print_list([]) +func_names = [func_name for fname, n, func_name in stat_list] +self.assertTrue(udf_name in func_names) + +self.assertTrue(udf_name in io.getvalue()) +self.assertTrue("udf_%d.pstats" % id in os.listdir(d)) def test_custom_udf_profiler(self): class TestCustomProfiler(UDFBasicProfiler): diff --git a/python/pyspark/tests/test_memory_profiler.py b/python/pyspark/tests/test_memory_profiler.py index 536f38679c3e..aa3541620446 100644 --- a/python/pyspark/tests/test_memory_profiler.py +++ b/python/pyspark/tests/test_memory_profiler.py @@ -106,9 +106,9 
@@ class MemoryProfilerTests(PySparkTestCase): self.sc.show_profiles() self.assertTrue("plus_one" in fake_out.getvalue()) -d = tempfile.gettempdir() -self.sc.dump_profiles(d) -self.assertTrue("udf_%d_memory.txt" % id in os.listdir(d)) +with tempfile.TemporaryDirectory() as d: +self.sc.dump_profiles(d) +self.assertTrue("udf_%d_memory.txt" % id in os.listdir(d)) def test_profile_pandas_udf(self): udfs = [self.exec_pandas_udf_ser_to_ser, self.exec_pandas_udf_ser_to_scalar] diff --git a/python/pyspark/tests/test_profiler.py b/python/pyspark/tests/test_profiler.py index b7797ead2adb..a12bc99c54ae 100644 --- a/python/pyspark/tests/test_profiler.py +++ b/python/pyspark/tests/test_profiler.py @@ -54,9 +54,9 @@ class ProfilerTests(PySparkTestCase): self.assertTrue("heavy_foo" in io.getvalue()) sys.stdout = old_stdout -d = tempfile.gettempdir() -self.sc.dump_profiles(d) -self.assertTrue("rdd_%d.pstats" % id in os.listdir(d)) +with tempfile.TemporaryDirectory() as d: +self.sc.dump_profiles(d) +
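The difference the patch exploits: `tempfile.gettempdir()` merely names the shared system temp directory, so anything written there outlives the test, while `tempfile.TemporaryDirectory()` creates a fresh directory and deletes it, contents included, when the `with` block exits. A minimal illustration of the cleanup behavior:

```python
import os
import tempfile

# With gettempdir() the file survives the test and must be removed by hand.
leftover = os.path.join(tempfile.gettempdir(), "profile_demo.pstats")
with open(leftover, "w") as f:
    f.write("stats")
assert os.path.exists(leftover)   # still there -- leaks between test runs
os.remove(leftover)               # manual cleanup required

# TemporaryDirectory() removes everything when the block exits.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "profile_demo.pstats")
    with open(path, "w") as f:
        f.write("stats")
    assert os.path.exists(path)   # visible while the block is open
assert not os.path.exists(d)      # directory and contents are gone
```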
(spark) branch master updated: [SPARK-47023][BUILD] Upgrade `aircompressor` to 1.26
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b425cd866334 [SPARK-47023][BUILD] Upgrade `aircompressor` to 1.26 b425cd866334 is described below commit b425cd86633402b764ea90449853610e98963a54 Author: Dongjoon Hyun AuthorDate: Mon Feb 12 10:37:49 2024 -0800 [SPARK-47023][BUILD] Upgrade `aircompressor` to 1.26 ### What changes were proposed in this pull request? This PR aims to upgrade `aircompressor` to 1.26. ### Why are the changes needed? `aircompressor` v1.26 has the following bug fixes. - [Fix out of bounds read/write in Snappy decompressor](https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2) - [Fix ZstdOutputStream corruption on double close](https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45084 from dongjoon-hyun/SPARK-47023. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +- pom.xml | 5 + 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index dd8d74888c6a..0b619a249e96 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -4,7 +4,7 @@ JTransforms/3.1//JTransforms-3.1.jar RoaringBitmap/1.0.1//RoaringBitmap-1.0.1.jar ST4/4.0.4//ST4-4.0.4.jar activation/1.1.1//activation-1.1.1.jar -aircompressor/0.25//aircompressor-0.25.jar +aircompressor/0.26//aircompressor-0.26.jar algebra_2.13/2.8.0//algebra_2.13-2.8.0.jar aliyun-java-sdk-core/4.5.10//aliyun-java-sdk-core-4.5.10.jar aliyun-java-sdk-kms/2.11.0//aliyun-java-sdk-kms-2.11.0.jar diff --git a/pom.xml b/pom.xml index ed6f48262570..0b6a6955b18b 100644 --- a/pom.xml +++ b/pom.xml @@ -2596,6 +2596,11 @@ + +io.airlift +aircompressor +0.26 + org.apache.orc orc-mapreduce
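One of the two upstream fixes this upgrade picks up addresses a stream corrupting its output when `close()` is called twice. The usual defense is to make `close()` idempotent so the frame footer is written exactly once; a sketch of that pattern (in Python, not airlift's actual code):

```python
class FramedStream:
    """Toy framed output stream illustrating the double-close bug class."""
    def __init__(self, sink):
        self.sink = sink
        self.closed = False

    def write(self, data):
        if self.closed:
            raise ValueError("write after close")
        self.sink.append(data)

    def close(self):
        if self.closed:          # idempotent close: emit the footer only once
            return
        self.sink.append("FOOTER")
        self.closed = True

out = []
s = FramedStream(out)
s.write("data")
s.close()
s.close()                        # second close is a no-op, no duplicate footer
assert out == ["data", "FOOTER"]
```

Without the `if self.closed: return` guard, the second `close()` would append a second footer — the corruption the ZstdOutputStream fix removes.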
svn commit: r67309 - /dev/spark/v3.3.4-rc1-docs/
Author: dongjoon Date: Mon Feb 12 17:26:47 2024 New Revision: 67309 Log: Remove Apache Spark 3.3.4 RC1 docs after releasing Removed: dev/spark/v3.3.4-rc1-docs/
(spark) branch master updated: [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f5b1e37c9e6a [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit` f5b1e37c9e6a is described below commit f5b1e37c9e6a4ec2fd897f97cd4526415e6c0e49 Author: Dongjoon Hyun AuthorDate: Mon Feb 12 09:12:10 2024 -0800 [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit` ### What changes were proposed in this pull request? This PR aims to use `org.seleniumhq.selenium.htmlunit3-driver` 4.17.0 instead of the old `net.sourceforge.htmlunit` test dependencies and `org.seleniumhq.selenium.htmlunit-driver` test dependency: - Remove `net.sourceforge.htmlunit.htmlunit` `2.70.0`. - Remove `net.sourceforge.htmlunit.htmlunit-core-js` `2.70.0`. - Remove `org.seleniumhq.selenium.htmlunit-driver` `4.12.0`. - Remove `xml-apis:xml-apis` `1.4.01`. ### Why are the changes needed? To help browser-based test suites. ### Does this PR introduce _any_ user-facing change? No. This is a test-only dependency and code change. ### How was this patch tested? Manual tests. 
``` build/sbt -Dguava.version=32.1.2-jre -Dspark.test.webdriver.chrome.driver=/opt/homebrew/bin/chromedriver -Dtest.default.exclude.tags="" -Phive -Phive-thriftserver \ "core/testOnly *HistoryServerSuite" ``` ``` build/sbt -Dguava.version=32.1.2-jre -Dspark.test.webdriver.chrome.driver=/opt/homebrew/bin/chromedriver -Dtest.default.exclude.tags="" -Phive -Phive-thriftserver \ "core/testOnly *UISeleniumSuite" ``` ``` build/sbt -Dguava.version=32.1.2-jre -Dspark.test.webdriver.chrome.driver=/opt/homebrew/bin/chromedriver -Dtest.default.exclude.tags="" -Phive -Phive-thriftserver \ "sql/testOnly *UISeleniumSuite" ``` ``` build/sbt -Dguava.version=32.1.2-jre -Dspark.test.webdriver.chrome.driver=/opt/homebrew/bin/chromedriver -Dtest.default.exclude.tags="" -Phive -Phive-thriftserver \ "streaming/testOnly *UISeleniumSuite" ``` ``` build/sbt -Dguava.version=32.1.2-jre -Dspark.test.webdriver.chrome.driver=/opt/homebrew/bin/chromedriver -Dtest.default.exclude.tags="" -Phive -Phive-thriftserver \ "hive-thriftserver/testOnly *UISeleniumSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45079 from dongjoon-hyun/SPARK-44445. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- core/pom.xml | 19 +--- .../spark/deploy/history/HistoryServerSuite.scala | 4 +++- .../history/RealBrowserUIHistoryServerSuite.scala | 6 - .../org/apache/spark/ui/UISeleniumSuite.scala | 4 ++-- pom.xml| 26 +++--- sql/core/pom.xml | 2 +- .../spark/sql/execution/ui/UISeleniumSuite.scala | 8 --- sql/hive-thriftserver/pom.xml | 2 +- streaming/pom.xml | 2 +- 9 files changed, 22 insertions(+), 51 deletions(-) diff --git a/core/pom.xml b/core/pom.xml index f780551fb555..9b5297cb8543 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -354,18 +354,7 @@ org.seleniumhq.selenium - htmlunit-driver - test - - - - net.sourceforge.htmlunit - htmlunit - test - - - net.sourceforge.htmlunit - htmlunit-core-js + htmlunit3-driver test @@ -384,12 +373,6 @@ httpcore test - - - xml-apis - xml-apis - test - org.mockito mockito-core diff --git a/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala b/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala index 1ca1e8fefd06..b3d7315e169b 100644 --- a/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala @@ -561,7 +561,9 @@ abstract class HistoryServerSuite extends SparkFunSuite with BeforeAndAfter with // app is no longer incomplete listApplications(false) should not contain(appId) -assert(jobcount === getNumJobs("/jobs")) +eventually(stdTimeout, stdInterval) { + assert(jobcount === getNumJobs("/jobs")) +} // no need to retain the test dir now the tests complete ShutdownHookManager.registerShutdownDeleteDir(logDir) diff --git a/core
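The `HistoryServerSuite` hunk above wraps a formerly flaky assertion in ScalaTest's `eventually(stdTimeout, stdInterval)`, which retries the body until it passes or the timeout expires. The idea behind that wrapper, sketched in Python (names and timings here are illustrative):

```python
import time

def eventually(assertion, timeout=5.0, interval=0.1):
    """Retry `assertion` until it stops raising or `timeout` elapses.
    Mirrors the idea of ScalaTest's `eventually` used in the suite above."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return assertion()
        except AssertionError:
            if time.monotonic() >= deadline:
                raise            # give up: surface the last failure
            time.sleep(interval)

# Example: a value that only reaches the expected state after a few polls,
# like a job count that the history server publishes asynchronously.
state = {"jobs": 0}
def poll():
    state["jobs"] += 1           # stands in for getNumJobs("/jobs")
    assert state["jobs"] >= 3
    return state["jobs"]

assert eventually(poll, timeout=2.0, interval=0.01) == 3
```

The design point is that the assertion itself stays unchanged; only the retry policy (`timeout`, `interval`) is layered around it.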
(spark) branch master updated: [SPARK-46991][SQL] Replace `IllegalArgumentException` by `SparkIllegalArgumentException` in `catalyst`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f5b0de07eff4 [SPARK-46991][SQL] Replace `IllegalArgumentException` by `SparkIllegalArgumentException` in `catalyst` f5b0de07eff4 is described below commit f5b0de07eff49bc1d076c4a1dc59c8672beff99e Author: Max Gekk AuthorDate: Sun Feb 11 15:25:28 2024 -0800 [SPARK-46991][SQL] Replace `IllegalArgumentException` by `SparkIllegalArgumentException` in `catalyst` ### What changes were proposed in this pull request? In the PR, I propose to replace all `IllegalArgumentException` by `SparkIllegalArgumentException` in `Catalyst` code base, and introduce new legacy error classes with the `_LEGACY_ERROR_TEMP_` prefix. ### Why are the changes needed? To unify Spark SQL exception, and port Java exceptions on Spark exceptions with error classes. ### Does this PR introduce _any_ user-facing change? Yes, it can if user's code assumes some particular format of `IllegalArgumentException` messages. ### How was this patch tested? By running existing test suites like: ``` $ build/sbt "core/testOnly *SparkThrowableSuite" $ build/sbt "test:testOnly *BufferHolderSparkSubmitSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45033 from MaxGekk/migrate-IllegalArgumentException-catalyst. 
Authored-by: Max Gekk Signed-off-by: Dongjoon Hyun --- .../src/main/resources/error/error-classes.json| 255 + .../scala/org/apache/spark/SparkException.scala| 25 +- .../connect/client/GrpcExceptionConverter.scala| 1 + .../sql/catalyst/expressions/ExpressionInfo.java | 38 +-- .../catalyst/expressions/codegen/BufferHolder.java | 12 +- .../sql/catalyst/expressions/xml/UDFXPathUtil.java | 4 +- .../catalog/SupportsPartitionManagement.java | 5 +- .../sql/connector/util/V2ExpressionSQLBuilder.java | 4 +- .../spark/sql/util/CaseInsensitiveStringMap.java | 3 +- .../sql/catalyst/CatalystTypeConverters.scala | 75 -- .../spark/sql/catalyst/csv/CSVExprUtils.scala | 18 +- .../spark/sql/catalyst/csv/CSVHeaderChecker.scala | 5 +- .../spark/sql/catalyst/expressions/Cast.scala | 6 +- .../sql/catalyst/expressions/TimeWindow.scala | 6 +- .../expressions/codegen/CodeGenerator.scala| 6 +- .../expressions/collectionOperations.scala | 25 +- .../sql/catalyst/expressions/csvExpressions.scala | 7 +- .../catalyst/expressions/datetimeExpressions.scala | 14 +- .../sql/catalyst/expressions/xmlExpressions.scala | 7 +- .../ReplaceNullWithFalseInPredicate.scala | 11 +- .../spark/sql/catalyst/parser/AstBuilder.scala | 8 +- .../spark/sql/catalyst/plans/joinTypes.scala | 16 +- .../sql/catalyst/plans/logical/v2Commands.scala| 6 +- .../spark/sql/catalyst/util/DateTimeUtils.scala| 8 +- .../spark/sql/catalyst/util/IntervalUtils.scala| 39 ++-- .../spark/sql/catalyst/xml/StaxXmlGenerator.scala | 9 +- .../spark/sql/catalyst/xml/StaxXmlParser.scala | 22 +- .../spark/sql/catalyst/xml/XmlInferSchema.scala| 5 +- .../sql/connector/catalog/CatalogV2Util.scala | 32 ++- .../spark/sql/errors/QueryExecutionErrors.scala| 28 +-- .../sql/catalyst/CatalystTypeConvertersSuite.scala | 74 +++--- .../spark/sql/catalyst/csv/CSVExprUtilsSuite.scala | 42 ++-- .../sql/catalyst/expressions/TimeWindowSuite.scala | 18 +- .../codegen/BufferHolderSparkSubmitSuite.scala | 12 +- .../expressions/codegen/BufferHolderSuite.scala| 22 +- 
.../sql/catalyst/util/DateTimeUtilsSuite.scala | 24 +- .../sql/util/CaseInsensitiveStringMapSuite.scala | 11 +- .../execution/datasources/v2/AlterTableExec.scala | 3 +- .../resources/sql-tests/results/ansi/date.sql.out | 5 +- .../sql/expressions/ExpressionInfoSuite.scala | 84 --- 40 files changed, 729 insertions(+), 266 deletions(-) diff --git a/common/utils/src/main/resources/error/error-classes.json b/common/utils/src/main/resources/error/error-classes.json index 4fcf9248d3e2..5884c9267119 100644 --- a/common/utils/src/main/resources/error/error-classes.json +++ b/common/utils/src/main/resources/error/error-classes.json @@ -7512,6 +7512,261 @@ "Failed to create column family with reserved name=" ] }, + "_LEGACY_ERROR_TEMP_3198" : { +"message" : [ + "Cannot grow BufferHolder by size because the size is negative" +] + }, + "_LEGACY_ERROR_TEMP_3199" : { +"message" : [ + "Cannot grow BufferHolder by size because the size after growing exceeds size limitation " +] + }, + "_LEGACY_ERROR_TEMP
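The migration above attaches a named error class to each former `IllegalArgumentException`, so messages come from the central template registry (`error-classes.json`) instead of ad-hoc strings. A hedged sketch of the mechanism — illustrative Python, not Spark's actual API, and the real templates use angle-bracket `<param>` placeholders (dropped from the JSON excerpt above during archiving):

```python
# Toy registry keyed by error class; Spark keeps these in error-classes.json.
ERROR_CLASSES = {
    "_LEGACY_ERROR_TEMP_3198":
        "Cannot grow BufferHolder by size {size} because the size is negative",
}

class SparkIllegalArgumentException(ValueError):
    """Illustrative stand-in: carries an error class plus template parameters."""
    def __init__(self, error_class, **params):
        self.error_class = error_class
        super().__init__(ERROR_CLASSES[error_class].format(**params))

err = None
try:
    raise SparkIllegalArgumentException("_LEGACY_ERROR_TEMP_3198", size=-8)
except SparkIllegalArgumentException as e:
    err = e

assert err.error_class == "_LEGACY_ERROR_TEMP_3198"
assert "by size -8" in str(err)
```

Callers (and tests like `SparkThrowableSuite`) can then match on the stable error class rather than on message text, which is the point of the migration.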
(spark) branch branch-3.5 updated: [SPARK-47022][CONNECT][TESTS][3.5] Fix `connect/client/jvm` to have explicit `commons-(io|lang3)` test dependency
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 4e4d9f07d095 [SPARK-47022][CONNECT][TESTS][3.5] Fix `connect/client/jvm` to have explicit `commons-(io|lang3)` test dependency 4e4d9f07d095 is described below commit 4e4d9f07d0954357e85a6e2b0da47746a4b08501 Author: Dongjoon Hyun AuthorDate: Sun Feb 11 14:38:48 2024 -0800 [SPARK-47022][CONNECT][TESTS][3.5] Fix `connect/client/jvm` to have explicit `commons-(io|lang3)` test dependency ### What changes were proposed in this pull request? This PR aims to add `commons-io` and `commons-lang3` test dependency to `connector/client/jvm` module. ### Why are the changes needed? `connector/client/jvm` module uses `commons-io` and `commons-lang3` during testing like the following. https://github.com/apache/spark/blob/9700da7bfc1abb607f3cb916b96724d0fb8f2eba/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala#L26-L28 Currently, it's broken due to that. - https://github.com/apache/spark/actions?query=branch%3Abranch-3.5 ### Does this PR introduce _any_ user-facing change? No, this is a test-dependency only change. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45081 from dongjoon-hyun/SPARK-47022. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- connector/connect/client/jvm/pom.xml | 10 ++ 1 file changed, 10 insertions(+) diff --git a/connector/connect/client/jvm/pom.xml b/connector/connect/client/jvm/pom.xml index 236e5850b762..0c0d4cdad3a9 100644 --- a/connector/connect/client/jvm/pom.xml +++ b/connector/connect/client/jvm/pom.xml @@ -71,6 +71,16 @@ ${ammonite.version} provided + + commons-io + commons-io + test + + + org.apache.commons + commons-lang3 + test + org.scalacheck scalacheck_${scala.binary.version}
(spark) branch branch-3.4 updated: [SPARK-47021][BUILD][TESTS] Fix `kvstore` module to have explicit `commons-lang3` test dependency
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 29adf32acdac [SPARK-47021][BUILD][TESTS] Fix `kvstore` module to have explicit `commons-lang3` test dependency 29adf32acdac is described below commit 29adf32acdacb56fd399b8945d7e049db5810ca1 Author: Dongjoon Hyun AuthorDate: Sun Feb 11 10:38:00 2024 -0800 [SPARK-47021][BUILD][TESTS] Fix `kvstore` module to have explicit `commons-lang3` test dependency ### What changes were proposed in this pull request? This PR aims to fix `kvstore` module by adding explicit `commons-lang3` test dependency and excluding `htmlunit-driver` from `org.scalatestplus` to use Apache Spark's explicit declaration. https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/pom.xml#L711-L716 ### Why are the changes needed? Since Spark 3.3.0 (SPARK-37282), `kvstore` uses `commons-lang3` test dependency like the following, but we didn't declare it explicitly so far. https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java#L33 https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBIteratorSuite.java#L23 Previously, it was provided by some unused `htmlunit-driver`'s transitive dependency accidentally. This causes a weird situation which `kvstore` module starts to fail to compile when we upgrade `htmlunit-driver`. We need to fix this first. ``` $ mvn dependency:tree -pl common/kvstore ... [INFO] | \- org.seleniumhq.selenium:htmlunit-driver:jar:4.12.0:test ... [INFO] |+- org.apache.commons:commons-lang3:jar:3.14.0:test ``` ### Does this PR introduce _any_ user-facing change? No. This is only a test dependency fix. ### How was this patch tested? 
Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45080 from dongjoon-hyun/SPARK-47021. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit a926c7912a78f1a2fb71c5ffd21b5c2f723a0128) Signed-off-by: Dongjoon Hyun --- common/kvstore/pom.xml | 5 + pom.xml| 6 ++ 2 files changed, 11 insertions(+) diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml index 4184e499221a..c72e08056937 100644 --- a/common/kvstore/pom.xml +++ b/common/kvstore/pom.xml @@ -66,6 +66,11 @@ commons-io test + + org.apache.commons + commons-lang3 + test + org.apache.logging.log4j diff --git a/pom.xml b/pom.xml index acc23ab2d8ed..26f0b71a5114 100644 --- a/pom.xml +++ b/pom.xml @@ -1152,6 +1152,12 @@ selenium-4-7_${scala.binary.version} 3.2.15.0 test + + +org.seleniumhq.selenium +htmlunit-driver + + org.mockito
(spark) branch branch-3.5 updated: [SPARK-47021][BUILD][TESTS] Fix `kvstore` module to have explicit `commons-lang3` test dependency
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 9700da7bfc1a [SPARK-47021][BUILD][TESTS] Fix `kvstore` module to have explicit `commons-lang3` test dependency 9700da7bfc1a is described below commit 9700da7bfc1abb607f3cb916b96724d0fb8f2eba Author: Dongjoon Hyun AuthorDate: Sun Feb 11 10:38:00 2024 -0800 [SPARK-47021][BUILD][TESTS] Fix `kvstore` module to have explicit `commons-lang3` test dependency ### What changes were proposed in this pull request? This PR aims to fix the `kvstore` module by adding an explicit `commons-lang3` test dependency and excluding `htmlunit-driver` from `org.scalatestplus` to use Apache Spark's explicit declaration. https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/pom.xml#L711-L716 ### Why are the changes needed? Since Spark 3.3.0 (SPARK-37282), `kvstore` has used `commons-lang3` as a test dependency, as shown below, but we never declared it explicitly. https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java#L33 https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBIteratorSuite.java#L23 Previously, it was provided accidentally as a transitive dependency of the unused `htmlunit-driver`. This causes a weird situation in which the `kvstore` module starts to fail to compile when we upgrade `htmlunit-driver`. We need to fix this first. ``` $ mvn dependency:tree -pl common/kvstore ... [INFO] | \- org.seleniumhq.selenium:htmlunit-driver:jar:4.12.0:test ... [INFO] |+- org.apache.commons:commons-lang3:jar:3.14.0:test ``` ### Does this PR introduce _any_ user-facing change? No. This is only a test dependency fix. ### How was this patch tested? 
Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45080 from dongjoon-hyun/SPARK-47021. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit a926c7912a78f1a2fb71c5ffd21b5c2f723a0128) Signed-off-by: Dongjoon Hyun --- common/kvstore/pom.xml | 5 + pom.xml| 6 ++ 2 files changed, 11 insertions(+) diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml index 1b1a8d0066f8..7dece9de699c 100644 --- a/common/kvstore/pom.xml +++ b/common/kvstore/pom.xml @@ -66,6 +66,11 @@ commons-io test + + org.apache.commons + commons-lang3 + test + org.apache.logging.log4j diff --git a/pom.xml b/pom.xml index 9e945f8d959a..d0cfdaa1496b 100644 --- a/pom.xml +++ b/pom.xml @@ -1146,6 +1146,12 @@ selenium-4-9_${scala.binary.version} 3.2.16.0 test + + +org.seleniumhq.selenium +htmlunit-driver + + org.mockito - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47021][BUILD][TESTS] Fix `kvstore` module to have explicit `commons-lang3` test dependency
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a926c7912a78 [SPARK-47021][BUILD][TESTS] Fix `kvstore` module to have explicit `commons-lang3` test dependency a926c7912a78 is described below commit a926c7912a78f1a2fb71c5ffd21b5c2f723a0128 Author: Dongjoon Hyun AuthorDate: Sun Feb 11 10:38:00 2024 -0800 [SPARK-47021][BUILD][TESTS] Fix `kvstore` module to have explicit `commons-lang3` test dependency ### What changes were proposed in this pull request? This PR aims to fix the `kvstore` module by adding an explicit `commons-lang3` test dependency and excluding `htmlunit-driver` from `org.scalatestplus` to use Apache Spark's explicit declaration. https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/pom.xml#L711-L716 ### Why are the changes needed? Since Spark 3.3.0 (SPARK-37282), `kvstore` has used `commons-lang3` as a test dependency, as shown below, but we never declared it explicitly. https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java#L33 https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBIteratorSuite.java#L23 Previously, it was provided accidentally as a transitive dependency of the unused `htmlunit-driver`. This causes a weird situation in which the `kvstore` module starts to fail to compile when we upgrade `htmlunit-driver`. We need to fix this first. ``` $ mvn dependency:tree -pl common/kvstore ... [INFO] | \- org.seleniumhq.selenium:htmlunit-driver:jar:4.12.0:test ... [INFO] |+- org.apache.commons:commons-lang3:jar:3.14.0:test ``` ### Does this PR introduce _any_ user-facing change? No. This is only a test dependency fix. ### How was this patch tested? 
Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45080 from dongjoon-hyun/SPARK-47021. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- common/kvstore/pom.xml | 5 + pom.xml| 6 ++ 2 files changed, 11 insertions(+) diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml index a9b5a4634717..3820d1b8e395 100644 --- a/common/kvstore/pom.xml +++ b/common/kvstore/pom.xml @@ -70,6 +70,11 @@ commons-io test + + org.apache.commons + commons-lang3 + test + org.apache.logging.log4j diff --git a/pom.xml b/pom.xml index f0eb164d0c45..79d572f1b8bf 100644 --- a/pom.xml +++ b/pom.xml @@ -1182,6 +1182,12 @@ selenium-4-12_${scala.binary.version} 3.2.17.0 test + + +org.seleniumhq.selenium +htmlunit-driver + + org.mockito - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47020][CORE][TESTS] Fix `RealBrowserUISeleniumSuite`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new fa23d276e7e4 [SPARK-47020][CORE][TESTS] Fix `RealBrowserUISeleniumSuite` fa23d276e7e4 is described below commit fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5 Author: Dongjoon Hyun AuthorDate: Fri Feb 9 19:23:17 2024 -0800 [SPARK-47020][CORE][TESTS] Fix `RealBrowserUISeleniumSuite` ### What changes were proposed in this pull request? This PR aims to fix `RealBrowserUISeleniumSuite` which has been broken after SPARK-45274. - #43053 ### Why are the changes needed? To recover `RealBrowserUISeleniumSuite` according to the latest HTML structure. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual ``` $ build/sbt -Dguava.version=32.1.2-jre \ -Dspark.test.webdriver.chrome.driver=/opt/homebrew/bin/chromedriver \ -Dtest.default.exclude.tags="" -Phive -Phive-thriftserver \ "core/testOnly org.apache.spark.ui.ChromeUISeleniumSuite" ``` **BEFORE** ``` [info] ChromeUISeleniumSuite: [info] - SPARK-31534: text for tooltip should be escaped *** FAILED *** (12 seconds, 752 milliseconds) [info] - SPARK-31882: Link URL for Stage DAGs should not depend on paged table. (2 seconds, 363 milliseconds) [info] - SPARK-31886: Color barrier execution mode RDD correctly *** FAILED *** (12 seconds, 143 milliseconds) [info] - Search text for paged tables should not be saved (3 seconds, 47 milliseconds) [info] Run completed in 32 seconds, 54 milliseconds. 
[info] Total number of tests run: 4 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 2, failed 2, canceled 0, ignored 0, pending 0 [info] *** 2 TESTS FAILED *** [error] Failed tests: [error] org.apache.spark.ui.ChromeUISeleniumSuite [error] (core / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 42 s, completed Feb 9, 2024, 5:32:52 PM ``` **AFTER** ``` [info] ChromeUISeleniumSuite: [info] - SPARK-31534: text for tooltip should be escaped (3 seconds, 135 milliseconds) [info] - SPARK-31882: Link URL for Stage DAGs should not depend on paged table. (2 seconds, 395 milliseconds) [info] - SPARK-31886: Color barrier execution mode RDD correctly (2 seconds, 144 milliseconds) [info] - Search text for paged tables should not be saved (2 seconds, 958 milliseconds) [info] Run completed in 12 seconds, 377 milliseconds. [info] Total number of tests run: 4 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 22 s, completed Feb 9, 2024, 5:34:24 PM ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45078 from dongjoon-hyun/SPARK-47020. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../spark/ui/RealBrowserUISeleniumSuite.scala | 32 ++ 1 file changed, 15 insertions(+), 17 deletions(-) diff --git a/core/src/test/scala/org/apache/spark/ui/RealBrowserUISeleniumSuite.scala b/core/src/test/scala/org/apache/spark/ui/RealBrowserUISeleniumSuite.scala index b0f1fcab63be..709ee98be1e3 100644 --- a/core/src/test/scala/org/apache/spark/ui/RealBrowserUISeleniumSuite.scala +++ b/core/src/test/scala/org/apache/spark/ui/RealBrowserUISeleniumSuite.scala @@ -73,8 +73,8 @@ abstract class RealBrowserUISeleniumSuite(val driverProp: String) // Open DAG Viz. 
webDriver.findElement(By.id("job-dag-viz")).click() -val nodeDesc = webDriver.findElement(By.cssSelector("g[class='node_0 node']")) -nodeDesc.getAttribute("name") should include ("collect at <console>:25") +val nodeDesc = webDriver.findElement(By.cssSelector("g[id='node_0']")) +nodeDesc.getAttribute("innerHTML") should include ("collect at <console>:25") } } } @@ -109,22 +109,20 @@ abstract class RealBrowserUISeleniumSuite(val driverProp: String) goToUi(sc, "/jobs/job/?id=0") webDriver.findElement(By.id("job-dag-viz")).click() -val stage0 = webDriver.findElement(By.cssSelector("g[id='graph_0']")) -val stage1 = webDriver.findElement(By.cssSelector("g[id='graph_1']")) +val stage0 = webDriver.findElement(By.cssSelector("g[id='graph_stage_0']")) + .findElement(By.xpath("..")) +val stage1 = webDriver.findElement
(spark) branch master updated: [MINOR][DOCS] Remove outdated `antlr4` version comment in `pom.xml`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f80a83e98668 [MINOR][DOCS] Remove outdated `antlr4` version comment in `pom.xml` f80a83e98668 is described below commit f80a83e986682e1ac0dcada4f538f4e050728bbe Author: Dongjoon Hyun AuthorDate: Fri Feb 9 03:31:43 2024 -0800 [MINOR][DOCS] Remove outdated `antlr4` version comment in `pom.xml` ### What changes were proposed in this pull request? This PR aims to remove an outdated `antlr4` comment in `pom.xml`. ### Why are the changes needed? This was missed when SPARK-44366 upgraded `antlr4` from 4.9.3 to 4.13.1. - https://github.com/apache/spark/pull/43075 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45076 from dongjoon-hyun/SPARK_ANTLR. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- pom.xml | 1 - 1 file changed, 1 deletion(-) diff --git a/pom.xml b/pom.xml index 35452ba0d734..f0eb164d0c45 100644 --- a/pom.xml +++ b/pom.xml @@ -212,7 +212,6 @@ 3.5.2 3.0.0 0.12.0 - 4.13.1 1.1 4.12.1 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (3f5faaa24e3a -> d179f7564541)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 3f5faaa24e3a [SPARK-46641][SS] Add maxBytesPerTrigger threshold add d179f7564541 [SPARK-46355][SQL][TESTS][FOLLOWUP] Test to check number of open files No new revisions were added by this update. Summary of changes: .../sql/execution/datasources/xml/XmlSuite.scala | 92 +- 1 file changed, 91 insertions(+), 1 deletion(-)
(spark) branch master updated: [SPARK-46641][SS] Add maxBytesPerTrigger threshold
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 3f5faaa24e3a [SPARK-46641][SS] Add maxBytesPerTrigger threshold 3f5faaa24e3a is described below commit 3f5faaa24e3ab4d9cc8f996bd1938573dd057e20 Author: maxim_konstantinov AuthorDate: Thu Feb 8 23:16:17 2024 -0800 [SPARK-46641][SS] Add maxBytesPerTrigger threshold ### What changes were proposed in this pull request? This PR adds [Input Streaming Source's](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources) option `maxBytesPerTrigger` for limiting the total size of input files in a streaming batch. Semantics of `maxBytesPerTrigger` is very close to already existing one `maxFilesPerTrigger` option. How a feature was implemented? Because `maxBytesPerTrigger` is semantically close to `maxFilesPerTrigger` I used all the `maxFilesPerTrigger` usages in the whole repository as a potential places that requires changes, that includes: - Option paramater definition - Option related logic - Option related ScalaDoc and MD files - Option related test I went over the usage of all usages of `maxFilesPerTrigger` in `FileStreamSourceSuite` and implemented `maxBytesPerTrigger` in the same fashion as those two are pretty close in their nature. From the structure and elements of ReadLimit I've concluded that current design implies only one simple rule for ReadLimit, so I openly prohibited the setting of both maxFilesPerTrigger and maxBytesPerTrigger at the same time. ### Why are the changes needed? This feature is useful for our and our sister teams and we expect it will find a broad acceptance among Spark users. We have a use-case in a few of the Spark pipelines we support when we use Available-now trigger for periodic processing using Spark Streaming. 
We use `maxFilesPerTrigger` threshold for now, but this is not ideal as Input file size might change with the time which requires periodic configuration adjustment of `maxFilesPerTrigger`. Computational complexity of the job depe [...] ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? New unit tests were added or existing `maxFilesPerTrigger` test were extended. I searched `maxFilesPerTrigger` related test and added new tests or extended existing ones trying to minimize and simplify the changes. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44636 from MaxNevermind/streaming-add-maxBytesPerTrigger-option. Lead-authored-by: maxim_konstantinov Co-authored-by: Max Konstantinov Signed-off-by: Dongjoon Hyun --- .../spark/sql/streaming/DataStreamReader.scala | 24 +++- docs/structured-streaming-programming-guide.md | 8 +- .../sql/connector/read/streaming/ReadLimit.java| 2 + .../{ReadLimit.java => ReadMaxBytes.java} | 39 +++--- .../execution/streaming/FileStreamOptions.scala| 18 ++- .../sql/execution/streaming/FileStreamSource.scala | 87 +++--- .../spark/sql/streaming/DataStreamReader.scala | 12 ++ .../sql/streaming/FileStreamSourceSuite.scala | 133 +++-- 8 files changed, 247 insertions(+), 76 deletions(-) diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala index bc8e30cd300c..789425c9daea 100644 --- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala +++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala @@ -159,7 +159,9 @@ final class DataStreamReader private[sql] (sparkSession: SparkSession) extends L * schema in advance, use the version that specifies the schema to avoid the extra scan. 
* * You can set the following option(s): `maxFilesPerTrigger` (default: no max limit): - * sets the maximum number of new files to be considered in every trigger. + * sets the maximum number of new files to be considered in every trigger. + * `maxBytesPerTrigger` (default: no max limit): sets the maximum total size of new files to + * be considered in every trigger. * * You can find the JSON-specific options for reading JSON file stream in https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option";> @@ -179,7 +181,9 @@ final class DataStreamReader private[sql] (sparkSession: SparkSession) extends L * specify the schema explicitly using `schema`. * * You can set the following option(s): `maxFilesPerTrigger` (default: no max limit): - * sets the
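The `maxBytesPerTrigger` option introduced above is mutually exclusive with `maxFilesPerTrigger`: the commit notes that the current `ReadLimit` design implies only one simple rule per source, so setting both is openly prohibited. A minimal, self-contained Java sketch of that option-resolution rule — the class and method names here are illustrative, not Spark's actual API:

```java
import java.util.HashMap;
import java.util.Map;

public class ReadLimitSketch {
    // Hypothetical helper mirroring the rule described in the commit above:
    // a file stream source may set maxFilesPerTrigger or maxBytesPerTrigger,
    // but never both at the same time.
    static String resolveReadLimit(Map<String, String> options) {
        String maxFiles = options.get("maxFilesPerTrigger");
        String maxBytes = options.get("maxBytesPerTrigger");
        if (maxFiles != null && maxBytes != null) {
            throw new IllegalArgumentException(
                "Options maxFilesPerTrigger and maxBytesPerTrigger cannot be set at the same time");
        }
        if (maxFiles != null) return "ReadMaxFiles(" + maxFiles + ")";
        if (maxBytes != null) return "ReadMaxBytes(" + maxBytes + ")";
        return "ReadAllAvailable";
    }

    public static void main(String[] args) {
        Map<String, String> opts = new HashMap<>();
        opts.put("maxBytesPerTrigger", "10g");
        System.out.println(resolveReadLimit(opts)); // prints ReadMaxBytes(10g)
    }
}
```

Failing fast when both options are present mirrors the behavior the commit describes, rather than silently preferring one threshold over the other.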
(spark) branch master updated: [SPARK-46831][INFRA][FOLLOWUP] Fix a wrong JIRA ID in MimaExcludes
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8603ed58a34d [SPARK-46831][INFRA][FOLLOWUP] Fix a wrong JIRA ID in MimaExcludes 8603ed58a34d is described below commit 8603ed58a34d42dd14a82d8950ef5943114c3a8d Author: Dongjoon Hyun AuthorDate: Thu Feb 8 12:00:23 2024 -0800 [SPARK-46831][INFRA][FOLLOWUP] Fix a wrong JIRA ID in MimaExcludes ### What changes were proposed in this pull request? This is a follow-up of - #44901 ### Why are the changes needed? To fix the wrong JIRA ID information. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45071 from dongjoon-hyun/SPARK-46831. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- project/MimaExcludes.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala index 5674eba0bea0..3e1391317eab 100644 --- a/project/MimaExcludes.scala +++ b/project/MimaExcludes.scala @@ -76,7 +76,7 @@ object MimaExcludes { // SPARK-46410: Assign error classes/subclasses to JdbcUtils.classifyException ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.jdbc.JdbcDialect.classifyException"), -// [SPARK-464878][CORE][SQL] (false alert). Invalid rule for StringType extension. 
+// TODO(SPARK-46878): Invalid Mima report for StringType extension ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.types.StringType.this"), // SPARK-47011: Remove deprecated BinaryClassificationMetrics.scoreLabelsWeight ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.scoreLabelsWeight") - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47011][MLLIB] Remove deprecated `BinaryClassificationMetrics.scoreLabelsWeight`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a5e741e60ac9 [SPARK-47011][MLLIB] Remove deprecated `BinaryClassificationMetrics.scoreLabelsWeight` a5e741e60ac9 is described below commit a5e741e60ac97a395ce80d9fb39709e143ada721 Author: Dongjoon Hyun AuthorDate: Thu Feb 8 10:59:17 2024 -0800 [SPARK-47011][MLLIB] Remove deprecated `BinaryClassificationMetrics.scoreLabelsWeight` ### What changes were proposed in this pull request? This PR aims to a planned removal of the deprecated `BinaryClassificationMetrics.scoreLabelsWeight`. ### Why are the changes needed? Apache Spark 3.4.0 deprecated this via SPARK-39533 and announced the removal of this. - #36926 https://github.com/apache/spark/blob/b7edc5fac0f4e479cbc869d54a9490c553ba2613/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L49-L50 ### Does this PR introduce _any_ user-facing change? Yes, but this is a planned change. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45070 from dongjoon-hyun/SPARK-47011. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala | 4 +--- project/MimaExcludes.scala| 4 +++- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala index a74800f7b189..869fe7155a26 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala @@ -46,9 +46,7 @@ class BinaryClassificationMetrics @Since("3.0.0") ( @Since("1.3.0") val numBins: Int = 1000) extends Logging { - @deprecated("The variable `scoreLabelsWeight` should be private and " + -"will be removed in 4.0.0.", "3.4.0") - val scoreLabelsWeight: RDD[(Double, (Double, Double))] = scoreAndLabels.map { + private val scoreLabelsWeight: RDD[(Double, (Double, Double))] = scoreAndLabels.map { case (prediction: Double, label: Double, weight: Double) => require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0") (prediction, (label, weight)) diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala index 64c5599919a6..5674eba0bea0 100644 --- a/project/MimaExcludes.scala +++ b/project/MimaExcludes.scala @@ -77,7 +77,9 @@ object MimaExcludes { // SPARK-46410: Assign error classes/subclasses to JdbcUtils.classifyException ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.jdbc.JdbcDialect.classifyException"), // [SPARK-464878][CORE][SQL] (false alert). Invalid rule for StringType extension. 
- ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.types.StringType.this") + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.types.StringType.this"), +// SPARK-47011: Remove deprecated BinaryClassificationMetrics.scoreLabelsWeight + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.scoreLabelsWeight") ) // Default exclude rules - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46997][CORE] Enable `spark.worker.cleanup.enabled` by default
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 338bb31c2fac [SPARK-46997][CORE] Enable `spark.worker.cleanup.enabled` by default 338bb31c2fac is described below commit 338bb31c2fac79fbc3482c23310b77d5306bd6c8 Author: Dongjoon Hyun AuthorDate: Wed Feb 7 22:51:14 2024 -0800 [SPARK-46997][CORE] Enable `spark.worker.cleanup.enabled` by default ### What changes were proposed in this pull request? This PR aims to enable `spark.worker.cleanup.enabled` by default as a part of Apache Spark 4.0.0. ### Why are the changes needed? Apache Spark community has been recommending (from Apache Spark 3.0 to 3.5) to enable `spark.worker.cleanup.enabled` when `spark.shuffle.service.db.enabled` is true. And, `spark.shuffle.service.db.enabled` has been `true` since SPARK-26288. https://github.com/apache/spark/blob/dc73a8d7e96ead55053096971c838908b7c90527/docs/spark-standalone.md?plain=1#L443 https://github.com/apache/spark/blob/dc73a8d7e96ead55053096971c838908b7c90527/docs/spark-standalone.md?plain=1#L473 https://github.com/apache/spark/blob/dc73a8d7e96ead55053096971c838908b7c90527/core/src/main/scala/org/apache/spark/internal/config/package.scala#L718-L724 Although `spark.shuffle.service.enabled` is disabled by default, `spark.worker.cleanup.enabled` is crucial for long-standing Spark Standalone clusters to avoid the disk full situation. https://github.com/apache/spark/blob/dc73a8d7e96ead55053096971c838908b7c90527/core/src/main/scala/org/apache/spark/internal/config/package.scala#L692-L696 ### Does this PR introduce _any_ user-facing change? Yes, but this has been a long-standing recommended configuration in the real production-level Spark Standalone clusters. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? 
No Closes #45055 from dongjoon-hyun/SPARK-46997. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/internal/config/Worker.scala | 2 +- docs/core-migration-guide.md | 2 ++ docs/spark-standalone.md | 2 +- 3 files changed, 4 insertions(+), 2 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/internal/config/Worker.scala b/core/src/main/scala/org/apache/spark/internal/config/Worker.scala index c53e181df002..5a67f3398a7d 100644 --- a/core/src/main/scala/org/apache/spark/internal/config/Worker.scala +++ b/core/src/main/scala/org/apache/spark/internal/config/Worker.scala @@ -62,7 +62,7 @@ private[spark] object Worker { val WORKER_CLEANUP_ENABLED = ConfigBuilder("spark.worker.cleanup.enabled") .version("1.0.0") .booleanConf -.createWithDefault(false) +.createWithDefault(true) val WORKER_CLEANUP_INTERVAL = ConfigBuilder("spark.worker.cleanup.interval") .version("1.0.0") diff --git a/docs/core-migration-guide.md b/docs/core-migration-guide.md index 7a5b17397bec..26e6b0f1f444 100644 --- a/docs/core-migration-guide.md +++ b/docs/core-migration-guide.md @@ -28,6 +28,8 @@ license: | - Since Spark 4.0, Spark will compress event logs. To restore the behavior before Spark 4.0, you can set `spark.eventLog.compress` to `false`. +- Since Spark 4.0, Spark workers will clean up worker and stopped application directories periodically. To restore the behavior before Spark 4.0, you can set `spark.worker.cleanup.enabled` to `false`. + - Since Spark 4.0, `spark.shuffle.service.db.backend` is set to `ROCKSDB` by default which means Spark will use RocksDB store for shuffle service. To restore the behavior before Spark 4.0, you can set `spark.shuffle.service.db.backend` to `LEVELDB`. - In Spark 4.0, support for Apache Mesos as a resource manager was removed. 
diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md index fbc83180d6b6..1eab3158e2e5 100644 --- a/docs/spark-standalone.md +++ b/docs/spark-standalone.md @@ -436,7 +436,7 @@ SPARK_WORKER_OPTS supports the following system properties: spark.worker.cleanup.enabled - false + true Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46998][SQL] Deprecate the SQL config `spark.sql.legacy.allowZeroIndexInFormatString`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a8ad71dbe417 [SPARK-46998][SQL] Deprecate the SQL config `spark.sql.legacy.allowZeroIndexInFormatString` a8ad71dbe417 is described below commit a8ad71dbe417c16af13d46783e13fba0c2280268 Author: Max Gekk AuthorDate: Wed Feb 7 18:20:20 2024 -0800 [SPARK-46998][SQL] Deprecate the SQL config `spark.sql.legacy.allowZeroIndexInFormatString` ### What changes were proposed in this pull request? In the PR, I propose to deprecate the SQL config `spark.sql.legacy.allowZeroIndexInFormatString` and put it to the list `deprecatedSQLConfigs` in `SQLConf`. ### Why are the changes needed? After migration on JDK 17+, the SQL config `spark.sql.legacy.allowZeroIndexInFormatString` doesn't work anymore, and doesn't allow to use the zero index. Even users set the config to `true`, they get the error: ```java Illegal format argument index = 0 java.util.IllegalFormatArgumentIndexException: Illegal format argument index = 0 at java.base/java.util.Formatter$FormatSpecifier.index(Formatter.java:2808) at java.base/java.util.Formatter$FormatSpecifier.(Formatter.java:2879) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By running the modified test suite: ``` $ build/sbt "test:testOnly *QueryCompilationErrorsSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45057 from MaxGekk/deprecate-ALLOW_ZERO_INDEX_IN_FORMAT_STRING. 
Authored-by: Max Gekk Signed-off-by: Dongjoon Hyun --- docs/sql-migration-guide.md| 1 + .../src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 7 +-- .../org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala | 7 +++ 3 files changed, 13 insertions(+), 2 deletions(-) diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md index cb5e697f871c..3d0c7280496a 100644 --- a/docs/sql-migration-guide.md +++ b/docs/sql-migration-guide.md @@ -38,6 +38,7 @@ license: | - `spark.sql.avro.datetimeRebaseModeInRead` instead of `spark.sql.legacy.avro.datetimeRebaseModeInRead` - Since Spark 4.0, the default value of `spark.sql.orc.compression.codec` is changed from `snappy` to `zstd`. To restore the previous behavior, set `spark.sql.orc.compression.codec` to `snappy`. - Since Spark 4.0, `SELECT (*)` is equivalent to `SELECT struct(*)` instead of `SELECT *`. To restore the previous behavior, set `spark.sql.legacy.ignoreParenthesesAroundStar` to `true`. +- Since Spark 4.0, the SQL config `spark.sql.legacy.allowZeroIndexInFormatString` is deprecated. Consider to change `strfmt` of the `format_string` function to use 1-based indexes. The first argument must be referenced by "1$", the second by "2$", etc. 
## Upgrading from Spark SQL 3.4 to 3.5 diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala index 8f86a1c8a1f3..59db3e71a135 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala @@ -2339,7 +2339,8 @@ object SQLConf { .doc("When false, the `strfmt` in `format_string(strfmt, obj, ...)` and " + "`printf(strfmt, obj, ...)` will no longer support to use \"0$\" to specify the first " + "argument, the first argument should always reference by \"1$\" when use argument index " + -"to indicating the position of the argument in the argument list.") +"to indicating the position of the argument in the argument list. " + +"This config will be removed in the future releases.") .version("3.3") .booleanConf .createWithDefault(false) @@ -4718,7 +4719,9 @@ object SQLConf { DeprecatedConfig(ESCAPED_STRING_LITERALS.key, "4.0", "Use raw string literals with the `r` prefix instead. "), DeprecatedConfig("spark.connect.copyFromLocalToFs.allowDestLocal", "4.0", -s"Use '${ARTIFACT_COPY_FROM_LOCAL_TO_FS_ALLOW_DEST_LOCAL.key}' instead.") +s"Use '${ARTIFACT_COPY_FROM_LOCAL_TO_FS_ALLOW_DEST_LOCAL.key}' instead."), + DeprecatedConfig(ALLOW_ZERO_INDEX_IN_FORMAT_STRING.key, "4.0", "Increase indexes by 1 " + +"in `strfmt` of the `format_string` function. Refer to the first argument by \"1$\".")
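The deprecation above rests on a JDK behavior the commit quotes: on JDK 17 and later, `java.util.Formatter` rejects a zero argument index (`"0$"`) outright, so the legacy flag can no longer re-enable it. A small self-contained Java probe (illustrative names, not Spark code):

```java
import java.util.IllegalFormatException;

public class FormatIndexDemo {
    // 1-based positional indexes, the form the migration note recommends:
    // the first argument is "1$", the second "2$", and so on.
    static String oneBased() {
        return String.format("%1$s and %2$s", "first", "second");
    }

    // Probes whether the runtime accepts a "0$" index; on JDK 17+ this is
    // expected to throw (the stack trace in the commit shows
    // java.util.IllegalFormatArgumentIndexException).
    static boolean zeroIndexAccepted() {
        try {
            String.format("%0$s", "first");
            return true;
        } catch (IllegalFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(oneBased());
        System.out.println("zero index accepted: " + zeroIndexAccepted());
    }
}
```

Rewriting `strfmt` arguments from `"0$"` to 1-based indexes, as the migration guide entry suggests, works on every JDK and avoids the exception entirely.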
(spark) branch master updated: [SPARK-47003][K8S] Detect and fail on invalid volume sizes (< 1KiB) in K8s
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new cdd41a9f2c4f [SPARK-47003][K8S] Detect and fail on invalid volume sizes (< 1KiB) in K8s cdd41a9f2c4f is described below commit cdd41a9f2c4f278c5da7e1826c5e0ca0db7ec548 Author: Dongjoon Hyun AuthorDate: Wed Feb 7 14:58:20 2024 -0800 [SPARK-47003][K8S] Detect and fail on invalid volume sizes (< 1KiB) in K8s ### What changes were proposed in this pull request? This PR aims to detect and fail on invalid volume sizes. ### Why are the changes needed? This happens when the user forgets the unit of the volume size. For example, `100` instead of `100Gi`. ### Does this PR introduce _any_ user-facing change? For K8s volumes, the system tries to use the system default minimum volume size. However, it totally depends on the underlying system, and this misconfiguration misleads users in many cases because the job is started and runs in an unhealthy status. - First, the executor pods will be killed by the K8s control plane due to the out-of-disk situation. - Second, Spark tries to create new executors (still with small disks) and retries multiple times. We had better detect the missing-unit situation and make those jobs fail as early as possible. ### How was this patch tested? Pass the CIs with newly added test cases. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45061 from dongjoon-hyun/SPARK-47003. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../spark/deploy/k8s/KubernetesVolumeUtils.scala | 13 +++ .../deploy/k8s/KubernetesVolumeUtilsSuite.scala| 26 ++ 2 files changed, 39 insertions(+) diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesVolumeUtils.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesVolumeUtils.scala index 18fda708d9bb..baa519658c2e 100644 --- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesVolumeUtils.scala +++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesVolumeUtils.scala @@ -16,6 +16,8 @@ */ package org.apache.spark.deploy.k8s +import java.lang.Long.parseLong + import org.apache.spark.SparkConf import org.apache.spark.deploy.k8s.Config._ @@ -76,6 +78,7 @@ private[spark] object KubernetesVolumeUtils { s"$volumeType.$volumeName.$KUBERNETES_VOLUMES_OPTIONS_CLAIM_STORAGE_CLASS_KEY" val sizeLimitKey = s"$volumeType.$volumeName.$KUBERNETES_VOLUMES_OPTIONS_SIZE_LIMIT_KEY" verifyOptionKey(options, claimNameKey, KUBERNETES_VOLUMES_PVC_TYPE) +verifySize(options.get(sizeLimitKey)) KubernetesPVCVolumeConf( options(claimNameKey), options.get(storageClassKey), @@ -84,6 +87,7 @@ private[spark] object KubernetesVolumeUtils { case KUBERNETES_VOLUMES_EMPTYDIR_TYPE => val mediumKey = s"$volumeType.$volumeName.$KUBERNETES_VOLUMES_OPTIONS_MEDIUM_KEY" val sizeLimitKey = s"$volumeType.$volumeName.$KUBERNETES_VOLUMES_OPTIONS_SIZE_LIMIT_KEY" +verifySize(options.get(sizeLimitKey)) KubernetesEmptyDirVolumeConf(options.get(mediumKey), options.get(sizeLimitKey)) case KUBERNETES_VOLUMES_NFS_TYPE => @@ -105,4 +109,13 @@ private[spark] object KubernetesVolumeUtils { throw new NoSuchElementException(key + s" is required for $msg") } } + + private def verifySize(size: Option[String]): Unit = { +size.foreach { v => + if (v.forall(_.isDigit) && parseLong(v) < 1024) { +throw new 
IllegalArgumentException( + s"Volume size `$v` is smaller than 1KiB. Missing units?") + } +} + } } diff --git a/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/KubernetesVolumeUtilsSuite.scala b/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/KubernetesVolumeUtilsSuite.scala index 156740d7c8ae..fdc1aae0d410 100644 --- a/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/KubernetesVolumeUtilsSuite.scala +++ b/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/KubernetesVolumeUtilsSuite.scala @@ -182,4 +182,30 @@ class KubernetesVolumeUtilsSuite extends SparkFunSuite { } assert(e.getMessage.contains("nfs.volumeName.options.server")) } + + test("SPARK-47003: Check emptyDir volume size") { +val sparkConf = new SparkConf(false) +sparkConf.set("test.emptyDir.volumeName
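The heuristic in `verifySize` is small enough to restate outside Spark: a purely numeric size below 1024 almost certainly means the unit suffix was forgotten. A hedged Python sketch of the same check (the function name is mine, not Spark's):

```python
from typing import Optional

def verify_size(size: Optional[str]) -> None:
    # Mirrors KubernetesVolumeUtils.verifySize: if every character is a
    # digit (no unit suffix like Gi/Mi) and the value is below 1024,
    # fail fast instead of provisioning a uselessly tiny volume.
    if size and size.isdigit() and int(size) < 1024:
        raise ValueError(
            f"Volume size `{size}` is smaller than 1KiB. Missing units?")

verify_size("100Gi")  # has a unit suffix, so not all digits: accepted
verify_size("4096")   # unit-less but >= 1 KiB: accepted
verify_size(None)     # no size limit configured: accepted
```

A value like `"100"` is rejected, which is exactly the missed-unit case the commit message describes.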
(spark) branch master updated: [SPARK-47000][CORE] Use `getTotalMemorySize` in `WorkerArguments`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d8e402db2aa7 [SPARK-47000][CORE] Use `getTotalMemorySize` in `WorkerArguments` d8e402db2aa7 is described below commit d8e402db2aa71835d087f84173463c7346c7176b Author: Dongjoon Hyun AuthorDate: Wed Feb 7 14:32:41 2024 -0800 [SPARK-47000][CORE] Use `getTotalMemorySize` in `WorkerArguments` ### What changes were proposed in this pull request? This PR aims to use `getTotalMemorySize` instead of deprecated `getTotalPhysicalMemorySize` (OpenJDK) or `getTotalPhysicalMemory` (IBM Java) in `WorkerArguments`. ### Why are the changes needed? `getTotalPhysicalMemorySize` is deprecated at Java 14 in OpenJDK. - https://docs.oracle.com/en/java/javase/17/docs/api/jdk.management/com/sun/management/OperatingSystemMXBean.html#getTotalPhysicalMemorySize() `getTotalPhysicalMemory` is deprecated since 1.8 in IBM. - https://eclipse.dev/openj9/docs/api/jdk17/jdk.management/com/ibm/lang/management/OperatingSystemMXBean.html#getTotalPhysicalMemory() `getTotalMemorySize` is recommended in both environments for Apache Spark 4.0.0 because the minimum Java version is 17. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs with the existing test cases. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45060 from dongjoon-hyun/SPARK-47000. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/deploy/worker/WorkerArguments.scala| 13 +++-- 1 file changed, 3 insertions(+), 10 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/worker/WorkerArguments.scala b/core/src/main/scala/org/apache/spark/deploy/worker/WorkerArguments.scala index 42f684c0a197..94a27e1a3e6d 100644 --- a/core/src/main/scala/org/apache/spark/deploy/worker/WorkerArguments.scala +++ b/core/src/main/scala/org/apache/spark/deploy/worker/WorkerArguments.scala @@ -148,20 +148,13 @@ private[worker] class WorkerArguments(args: Array[String], conf: SparkConf) { } def inferDefaultMemory(): Int = { -val ibmVendor = System.getProperty("java.vendor").contains("IBM") var totalMb = 0 try { // scalastyle:off classforname val bean = ManagementFactory.getOperatingSystemMXBean() - if (ibmVendor) { -val beanClass = Class.forName("com.ibm.lang.management.OperatingSystemMXBean") -val method = beanClass.getDeclaredMethod("getTotalPhysicalMemory") -totalMb = (method.invoke(bean).asInstanceOf[Long] / 1024 / 1024).toInt - } else { -val beanClass = Class.forName("com.sun.management.OperatingSystemMXBean") -val method = beanClass.getDeclaredMethod("getTotalPhysicalMemorySize") -totalMb = (method.invoke(bean).asInstanceOf[Long] / 1024 / 1024).toInt - } + val beanClass = Class.forName("com.sun.management.OperatingSystemMXBean") + val method = beanClass.getDeclaredMethod("getTotalMemorySize") + totalMb = (method.invoke(bean).asInstanceOf[Long] / 1024 / 1024).toInt // scalastyle:on classforname } catch { case e: Exception => - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
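Whichever bean method is invoked, it returns the machine's total memory in bytes, and the code above integer-divides that down to whole mebibytes. A trivial model of that arithmetic (illustrative only):

```python
def bytes_to_mb(total_bytes: int) -> int:
    # Same conversion as the diff: (bytes / 1024 / 1024).toInt
    return total_bytes // 1024 // 1024

assert bytes_to_mb(8 * 1024**3) == 8192  # an 8 GiB machine
```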
(spark) branch master updated: [SPARK-46977][CORE] A failed request to obtain a token from one NameNode should not skip subsequent token requests
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 006c2dca6d87 [SPARK-46977][CORE] A failed request to obtain a token from one NameNode should not skip subsequent token requests 006c2dca6d87 is described below commit 006c2dca6d87e29a69e30124e8320c275859d148 Author: Cheng Pan AuthorDate: Mon Feb 5 12:18:20 2024 -0800 [SPARK-46977][CORE] A failed request to obtain a token from one NameNode should not skip subsequent token requests ### What changes were proposed in this pull request? This PR enhances the `HadoopFSDelegationTokenProvider` to tolerate failures when fetching tokens from multiple NameNodes. ### Why are the changes needed? Let's say we are going to access 3 HDFS, `nn-1`, `nn-2`, `nn-3` in YARN cluster mode with TGT cache, while the `nn-1` is the `defaultFs` which is used by YARN to store aggregated logs, and there are issues in `nn-2` which can not issue the token. ``` spark-submit \ --master yarn \ --deployMode cluster \ --conf spark.kerberos.access.hadoopFileSystems=hdfs://nn-1,hdfs://nn-2,hdfs://nn-3 \ ... ``` During the submitting phase, Spark is going to call `HadoopFSDelegationTokenProvider` to fetch tokens from all declared NameNodes one by one, in **indeterminate** order (`HadoopFSDelegationTokenProvider.hadoopFSsToAccess` process and return a `Set[FileSystem]`), so the order may not respect the user declared order in `spark.kerberos.access.hadoopFileSystems`. If the order is [`nn-1`, `nn-2`, `nn-3`], then we are going to request a token from `nn-1` successfully, but fail for `nn-2` with the below error, the left `nn-3` is going to be skipped. But such failure WON'T block the whole submitting process, the Spark app is going to be submitted with only `nn-1` token. 
``` 2024-01-03 12:41:36 [WARN] [main] org.apache.spark.deploy.security.HadoopFSDelegationTokenProvider#94 - Failed to get token from service hadoopfs org.apache.hadoop.ipc.RemoteException: at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1507) ~[hadoop-common-2.9.2.2.jar:?] ... at org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2604) ~[hadoop-hdfs-client-2.9.2.2.jar:?] at org.apache.spark.deploy.security.HadoopFSDelegationTokenProvider.$anonfun$fetchDelegationTokens$1(HadoopFSDelegationTokenProvider.scala:122) ~[spark-core_2.12-3.3.1.27.jar:3.3.1.27] at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:335) ~[scala-library-2.12.15.jar:?] at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:) ~[scala-library-2.12.15.jar:?] at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:) ~[scala-library-2.12.15.jar:?] at org.apache.spark.deploy.security.HadoopFSDelegationTokenProvider.fetchDelegationTokens(HadoopFSDelegationTokenProvider.scala:115) ~[spark-core_2.12-3.3.1.27.jar:3.3.1.27] ... at org.apache.spark.deploy.security.HadoopDelegationTokenManager.obtainDelegationTokens(HadoopDelegationTokenManager.scala:146) ~[spark-core_2.12-3.3.1.27.jar:3.3.1.27] at org.apache.spark.deploy.yarn.Client.setupSecurityToken(Client.scala:352) ~[spark-yarn_2.12-3.3.1.27.jar:3.3.1.27] at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:1140) ~[spark-yarn_2.12-3.3.1.27.jar:3.3.1.27] ... 
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) ~[spark-core_2.12-3.3.1.27.jar:3.3.1.27] ``` when the Spark app access `nn-2` and `nn-3`, it will fail with `o.a.h.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]` Things become worse if the FS order is [`nn-3`, `nn-2`, `nn-1`], the Spark app will be submitted to YARN with only `nn-3` token, it even has no chance to allow NodeManager to upload aggregated logs after the application exit because it requires the app to provide a token to access `nn-1`. the log from NodeManager ``` 2024-01-03 08:08:14,028 [3173570620] - WARN [NM ContainerManager dispatcher:Client$Connection1$772] - Exception encountered while connecting to the server Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:179) at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:392) ... at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1768) ...
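The behavior the fix aims for can be sketched outside Spark: fetch a token per filesystem, treating a failure against one NameNode as a recorded warning rather than aborting the loop, so tokens for the remaining filesystems are still obtained. (`fetch_one` is a hypothetical callable standing in for the per-filesystem token request; this is a sketch of the control flow, not Spark's actual code.)

```python
def fetch_delegation_tokens(filesystems, fetch_one):
    tokens, failures = {}, []
    for fs in filesystems:
        try:
            tokens[fs] = fetch_one(fs)   # may raise for a bad NameNode
        except Exception as e:
            failures.append((fs, e))     # warn and keep going
    return tokens, failures

def fake_fetch(fs):
    # Simulate the scenario above: nn-2 cannot issue a token.
    if fs == "hdfs://nn-2":
        raise RuntimeError("cannot issue token")
    return f"token-for-{fs}"

tokens, failures = fetch_delegation_tokens(
    ["hdfs://nn-1", "hdfs://nn-2", "hdfs://nn-3"], fake_fetch)
# tokens now covers nn-1 and nn-3 even though nn-2 failed
```

With the pre-fix behavior, an exception from `nn-2` would have ended the loop, leaving whichever filesystems happened to sort after it — possibly including the `defaultFs` needed for log aggregation — without tokens.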
(spark) branch master updated: [SPARK-46972][SQL] Fix asymmetrical replacement for char/varchar in V2SessionCatalog.createTable
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f8ff18223b36 [SPARK-46972][SQL] Fix asymmetrical replacement for char/varchar in V2SessionCatalog.createTable f8ff18223b36 is described below commit f8ff18223b365afa59ee077dc7535f1190073069 Author: Kent Yao AuthorDate: Mon Feb 5 12:01:08 2024 -0800 [SPARK-46972][SQL] Fix asymmetrical replacement for char/varchar in V2SessionCatalog.createTable ### What changes were proposed in this pull request? This PR removes the asymmetrical replacement for char/varchar in V2SessionCatalog.createTable ### Why are the changes needed? Replacement for char/varchar shall happen on both sides of the equation `DataType.equalsIgnoreNullability(tableSchema, schema)` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #45019 from yaooqinn/SPARK-46972. 
Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- .../spark/sql/execution/datasources/v2/V2SessionCatalog.scala| 9 +++-- .../org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala| 9 + 2 files changed, 12 insertions(+), 6 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala index e7445e970fa5..0cb3f8dca38f 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala @@ -27,7 +27,6 @@ import org.apache.spark.SparkUnsupportedOperationException import org.apache.spark.sql.catalyst.{FunctionIdentifier, SQLConfHelper, TableIdentifier} import org.apache.spark.sql.catalyst.analysis.{NoSuchDatabaseException, NoSuchTableException, TableAlreadyExistsException} import org.apache.spark.sql.catalyst.catalog.{CatalogDatabase, CatalogStorageFormat, CatalogTable, CatalogTableType, CatalogUtils, ClusterBySpec, SessionCatalog} -import org.apache.spark.sql.catalyst.util.CharVarcharUtils import org.apache.spark.sql.catalyst.util.TypeUtils._ import org.apache.spark.sql.connector.catalog.{CatalogManager, CatalogV2Util, Column, FunctionCatalog, Identifier, NamespaceChange, SupportsNamespaces, Table, TableCatalog, TableCatalogCapability, TableChange, V1Table} import org.apache.spark.sql.connector.catalog.NamespaceChange.RemoveProperty @@ -206,11 +205,9 @@ class V2SessionCatalog(catalog: SessionCatalog) } val table = tableProvider.getTable(schema, partitions, dsOptions) // Check if the schema of the created table matches the given schema. 
- val tableSchema = CharVarcharUtils.replaceCharVarcharWithStringInSchema( -table.columns().asSchema) - if (!DataType.equalsIgnoreNullability(tableSchema, schema)) { -throw QueryCompilationErrors.dataSourceTableSchemaMismatchError( - tableSchema, schema) + val tableSchema = table.columns().asSchema + if (!DataType.equalsIgnoreNullability(table.columns().asSchema, schema)) { +throw QueryCompilationErrors.dataSourceTableSchemaMismatchError(tableSchema, schema) } (schema, partitioning) } diff --git a/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala index f92a9a827b1c..2701738351b1 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala @@ -1735,6 +1735,15 @@ class DataSourceV2SQLSuiteV1Filter } } + test("SPARK-46972: asymmetrical replacement for char/varchar in V2SessionCatalog.createTable") { +// unset this config to use the default v2 session catalog. +spark.conf.unset(V2_SESSION_CATALOG_IMPLEMENTATION.key) +withTable("t") { + sql(s"CREATE TABLE t(c char(1), v varchar(2)) USING $v2Source") + assert(!spark.table("t").isEmpty) +} + } + test("ShowCurrentNamespace: basic tests") { def testShowCurrentNamespace(expectedCatalogName: String, expectedNamespace: String): Unit = { val schema = new StructType() - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46978][PYTHON][DOCS] Refine docstring of `sum_distinct/array_agg/count_if`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2e917fb91924 [SPARK-46978][PYTHON][DOCS] Refine docstring of `sum_distinct/array_agg/count_if` 2e917fb91924 is described below commit 2e917fb919244b421c5a2770403c0fd91336f65d Author: yangjie01 AuthorDate: Mon Feb 5 11:58:25 2024 -0800 [SPARK-46978][PYTHON][DOCS] Refine docstring of `sum_distinct/array_agg/count_if` ### What changes were proposed in this pull request? This pr refine docstring of `sum_distinct/array_agg/count_if` and add some new examples. ### Why are the changes needed? To improve PySpark documentation ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass Github Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #45031 from LuciferYang/agg-functions. Lead-authored-by: yangjie01 Co-authored-by: YangJie Signed-off-by: Dongjoon Hyun --- python/pyspark/sql/functions/builtin.py | 134 +--- 1 file changed, 123 insertions(+), 11 deletions(-) diff --git a/python/pyspark/sql/functions/builtin.py b/python/pyspark/sql/functions/builtin.py index 0932ac1c2843..cb872fdb8180 100644 --- a/python/pyspark/sql/functions/builtin.py +++ b/python/pyspark/sql/functions/builtin.py @@ -1472,13 +1472,51 @@ def sum_distinct(col: "ColumnOrName") -> Column: Examples ->>> df = spark.createDataFrame([(None,), (1,), (1,), (2,)], schema=["numbers"]) ->>> df.select(sum_distinct(col("numbers"))).show() +Example 1: Using sum_distinct function on a column with all distinct values + +>>> from pyspark.sql import functions as sf +>>> df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["numbers"]) +>>> df.select(sf.sum_distinct('numbers')).show() ++-+ +|sum(DISTINCT numbers)| ++-+ +| 10| ++-+ + +Example 2: Using sum_distinct function on a column with no distinct values + 
+>>> from pyspark.sql import functions as sf +>>> df = spark.createDataFrame([(1,), (1,), (1,), (1,)], ["numbers"]) +>>> df.select(sf.sum_distinct('numbers')).show() ++-+ +|sum(DISTINCT numbers)| ++-+ +|1| ++-+ + +Example 3: Using sum_distinct function on a column with null and duplicate values + +>>> from pyspark.sql import functions as sf +>>> df = spark.createDataFrame([(None,), (1,), (1,), (2,)], ["numbers"]) +>>> df.select(sf.sum_distinct('numbers')).show() +-+ |sum(DISTINCT numbers)| +-+ |3| +-+ + +Example 4: Using sum_distinct function on a column with all None values + +>>> from pyspark.sql import functions as sf +>>> from pyspark.sql.types import StructType, StructField, IntegerType +>>> schema = StructType([StructField("numbers", IntegerType(), True)]) +>>> df = spark.createDataFrame([(None,), (None,), (None,), (None,)], schema=schema) +>>> df.select(sf.sum_distinct('numbers')).show() ++-+ +|sum(DISTINCT numbers)| ++-+ +| NULL| ++-+ """ return _invoke_function_over_columns("sum_distinct", col) @@ -4122,9 +4160,49 @@ def array_agg(col: "ColumnOrName") -> Column: Examples +Example 1: Using array_agg function on an int column + +>>> from pyspark.sql import functions as sf >>> df = spark.createDataFrame([[1],[1],[2]], ["c"]) ->>> df.agg(array_agg('c').alias('r')).collect() -[Row(r=[1, 1, 2])] +>>> df.agg(sf.sort_array(sf.array_agg('c'))).show() ++-+ +|sort_array(collect_list(c), true)| ++-+ +|[1, 1, 2]| ++-+ + +Example 2: Using array_agg function on a string column + +>>> from pyspark.sql import functions as sf +>>> df = spark.createDataFrame([["apple"],["apple"],["banana"]], ["c"]) +>>> df.agg(sf.sort_array(sf.array_agg('c'))).show(truncate=Fa
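The semantics illustrated in those docstring examples can be modeled in a few lines of plain Python: `sum(DISTINCT col)` ignores NULLs, counts each distinct value once, and yields NULL when no non-NULL values remain. (A sketch of the SQL semantics, not PySpark's implementation.)

```python
def sum_distinct(values):
    distinct = {v for v in values if v is not None}  # drop NULLs, dedupe
    return sum(distinct) if distinct else None       # all-NULL -> NULL

assert sum_distinct([1, 2, 3, 4]) == 10     # Example 1: all distinct
assert sum_distinct([1, 1, 1, 1]) == 1      # Example 2: one distinct value
assert sum_distinct([None, 1, 1, 2]) == 3   # Example 3: NULLs and duplicates
assert sum_distinct([None] * 4) is None     # Example 4: all NULL
```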
(spark) branch master updated: [MINOR][TEST] Add output/exception to error message when schema not matched in `TPCDSQueryTestSuite`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e9697130a1ac [MINOR][TEST] Add output/exception to error message when schema not matched in `TPCDSQueryTestSuite` e9697130a1ac is described below commit e9697130a1acd2293d52ce72b9bddf6a203e3e8c Author: Liang-Chi Hsieh AuthorDate: Mon Feb 5 09:43:44 2024 -0800 [MINOR][TEST] Add output/exception to error message when schema not matched in `TPCDSQueryTestSuite` ### What changes were proposed in this pull request? This patch adds the output/exception string to the error message when the output schema does not match the expected schema in `TPCDSQueryTestSuite`. ### Why are the changes needed? We have used `TPCDSQueryTestSuite` for testing TPCDS query results. The test suite checks the output schema and then the output result. If any exception happens during query execution, it will handle the exception and return an empty schema and the exception class + message as output. So, when any exception happens, the test suite just fails on the schema check and never uses/shows the exception, e.g., ``` java.lang.Exception: Expected "struct<[count(1):bigint]>", but got "struct<[]>" Schema did not match ``` We cannot see from the log what exception happened there. It is somewhat inconvenient for debugging. ### Does this PR introduce _any_ user-facing change? No, test only. ### How was this patch tested? Existing tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #45025 from viirya/minor_ouput_exception. 
Authored-by: Liang-Chi Hsieh Signed-off-by: Dongjoon Hyun --- .../test/scala/org/apache/spark/sql/TPCDSQueryTestSuite.scala| 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQueryTestSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQueryTestSuite.scala index ef7bdc2b079e..bde615552987 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQueryTestSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQueryTestSuite.scala @@ -139,7 +139,14 @@ class TPCDSQueryTestSuite extends QueryTest with TPCDSBase with SQLQueryTestHelp (segments(1).trim, segments(2).replaceAll("\\s+$", "")) } -assertResult(expectedSchema, s"Schema did not match\n$queryString") { +val notMatchedSchemaOutput = if (schema == emptySchema) { + // There might be exception. See `handleExceptions`. + s"Schema did not match\n$queryString\nOutput/Exception: $outputString" +} else { + s"Schema did not match\n$queryString" +} + +assertResult(expectedSchema, notMatchedSchemaOutput) { schema } if (shouldSortResults) { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46970][CORE] Rewrite `OpenHashSet#hasher` with `pattern matching`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7ca355cbc225 [SPARK-46970][CORE] Rewrite `OpenHashSet#hasher` with `pattern matching` 7ca355cbc225 is described below commit 7ca355cbc225653b090020271117a763ec59536d Author: yangjie01 AuthorDate: Sat Feb 3 21:07:16 2024 -0800 [SPARK-46970][CORE] Rewrite `OpenHashSet#hasher` with `pattern matching` ### What changes were proposed in this pull request? The proposed changes in this pr involve refactoring the method of creating a `Hasher[T]` instance in the code. The original code used a series of if-else statements to check the class type of `T` and create the corresponding `Hasher[T]` instance. The proposed change simplifies this process by using Scala's pattern matching feature. The new code is more concise and easier to read. ### Why are the changes needed? The changes are needed for several reasons. Firstly, the use of pattern matching makes the code more idiomatic to Scala, which is beneficial for readability and maintainability. Secondly, the original code contains a comment about a bug in the Scala 2.9.x compiler that prevented the use of pattern matching in this context. However, Apache Spark 4.0 has switched to using Scala 2.13, and the new code has passed all tests, it appears that the bug no longer exists in the new version of Sc [...] ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #44998 from LuciferYang/openhashset-hasher. 
Lead-authored-by: yangjie01 Co-authored-by: YangJie Signed-off-by: Dongjoon Hyun --- .../apache/spark/util/collection/OpenHashSet.scala | 28 +- 1 file changed, 6 insertions(+), 22 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala b/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala index 6815e47a198d..faee9ce56a0a 100644 --- a/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala +++ b/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala @@ -62,28 +62,12 @@ class OpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag]( // specialization to work (specialized class extends the non-specialized one and needs access // to the "private" variables). - protected val hasher: Hasher[T] = { -// It would've been more natural to write the following using pattern matching. But Scala 2.9.x -// compiler has a bug when specialization is used together with this pattern matching, and -// throws: -// scala.tools.nsc.symtab.Types$TypeError: type mismatch; -// found : scala.reflect.AnyValManifest[Long] -// required: scala.reflect.ClassTag[Int] -// at scala.tools.nsc.typechecker.Contexts$Context.error(Contexts.scala:298) -// at scala.tools.nsc.typechecker.Infer$Inferencer.error(Infer.scala:207) -// ... 
-val mt = classTag[T] -if (mt == ClassTag.Long) { - (new LongHasher).asInstanceOf[Hasher[T]] -} else if (mt == ClassTag.Int) { - (new IntHasher).asInstanceOf[Hasher[T]] -} else if (mt == ClassTag.Double) { - (new DoubleHasher).asInstanceOf[Hasher[T]] -} else if (mt == ClassTag.Float) { - (new FloatHasher).asInstanceOf[Hasher[T]] -} else { - new Hasher[T] -} + protected val hasher: Hasher[T] = classTag[T] match { +case ClassTag.Long => new LongHasher().asInstanceOf[Hasher[T]] +case ClassTag.Int => new IntHasher().asInstanceOf[Hasher[T]] +case ClassTag.Double => new DoubleHasher().asInstanceOf[Hasher[T]] +case ClassTag.Float => new FloatHasher().asInstanceOf[Hasher[T]] +case _ => new Hasher[T] } protected var _capacity = nextPowerOf2(initialCapacity) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
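The same selection idea translates naturally to other languages. As a hedged Python analogue (the hasher class names are kept only as labels), the old if/else chain collapses into a single lookup with a generic fallback, just as the Scala pattern match does:

```python
def pick_hasher(element_type: type) -> str:
    # Map a "class tag" to its specialized hasher; anything unlisted
    # falls back to the generic Hasher, like the `case _` branch above.
    specialized = {int: "LongHasher", float: "DoubleHasher"}
    return specialized.get(element_type, "Hasher")

assert pick_hasher(int) == "LongHasher"
assert pick_hasher(float) == "DoubleHasher"
assert pick_hasher(str) == "Hasher"
```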
(spark) branch master updated: [SPARK-46967][CORE][UI] Hide `Thread Dump` and `Heap Histogram` of `Dead` executors in `Executors` UI
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 062522e96a50 [SPARK-46967][CORE][UI] Hide `Thread Dump` and `Heap Histogram` of `Dead` executors in `Executors` UI 062522e96a50 is described below commit 062522e96a50b8b46b313aae62668717ba88639f Author: Dongjoon Hyun AuthorDate: Sat Feb 3 19:17:33 2024 -0800 [SPARK-46967][CORE][UI] Hide `Thread Dump` and `Heap Histogram` of `Dead` executors in `Executors` UI ### What changes were proposed in this pull request? This PR aims to hide the `Thread Dump` and `Heap Histogram` links of `Dead` executors in the Spark Driver `Executors` UI. **BEFORE** ![Screenshot 2024-02-02 at 11 40 46 PM](https://github.com/apache/spark/assets/9700541/9fb45667-b25c-44cc-9c7c-c2ff981c5a2f) **AFTER** ![Screenshot 2024-02-02 at 11 40 03 PM](https://github.com/apache/spark/assets/9700541/9963452a-773c-4f8b-b025-9362853d3cae) ### Why are the changes needed? Since both `Thread Dump` and `Heap Histogram` require a live JVM, those links are broken and lead to the following pages. **Broken Thread Dump Link** ![Screenshot 2024-02-02 at 11 36 55 PM](https://github.com/apache/spark/assets/9700541/2cfff1b1-dc00-4fef-ab68-5e3fad5df7a0) **Broken Heap Histogram Link** ![Screenshot 2024-02-02 at 11 37 12 PM](https://github.com/apache/spark/assets/9700541/8450cb3e-3756-4755-896f-7ced682f09b0) We had better hide them. ### Does this PR introduce _any_ user-facing change? Yes, but this PR only hides the broken links. ### How was this patch tested? Manual. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45009 from dongjoon-hyun/SPARK-46967. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../main/resources/org/apache/spark/ui/static/executorspage.js | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/core/src/main/resources/org/apache/spark/ui/static/executorspage.js b/core/src/main/resources/org/apache/spark/ui/static/executorspage.js index 41164c7997bb..1b02fc0493e7 100644 --- a/core/src/main/resources/org/apache/spark/ui/static/executorspage.js +++ b/core/src/main/resources/org/apache/spark/ui/static/executorspage.js @@ -587,14 +587,16 @@ $(document).ready(function () { {name: 'executorLogsCol', data: 'executorLogs', render: formatLogsCells}, { name: 'threadDumpCol', - data: 'id', render: function (data, type) { -return type === 'display' ? ("Thread Dump" ) : data; + data: function (row) { return row.isActive ? row.id : '' }, + render: function (data, type) { +return data != '' && type === 'display' ? ("Thread Dump" ) : data; } }, { name: 'heapHistogramCol', - data: 'id', render: function (data, type) { -return type === 'display' ? ("Heap Histogram") : data; + data: function (row) { return row.isActive ? row.id : '' }, + render: function (data, type) { +return data != '' && type === 'display' ? ("Heap Histogram") : data; } }, { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [MINOR][DOCS] Remove Java 8/11 at `IgnoreUnrecognizedVMOptions` description
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 0154c059cddb [MINOR][DOCS] Remove Java 8/11 at `IgnoreUnrecognizedVMOptions` description 0154c059cddb is described below commit 0154c059cddba7cafe74243b3f9eedd9db367b72 Author: Dongjoon Hyun AuthorDate: Sat Feb 3 18:47:30 2024 -0800 [MINOR][DOCS] Remove Java 8/11 at `IgnoreUnrecognizedVMOptions` description ### What changes were proposed in this pull request? This PR aims to remove old Java 8 and Java 11 from `IgnoreUnrecognizedVMOptions` JVM option description. ### Why are the changes needed? From Apache Spark 4.0.0, we use `IgnoreUnrecognizedVMOptions` JVM option to be robust, not for Java 8 and Java 11 support. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45012 from dongjoon-hyun/IgnoreUnrecognizedVMOptions. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../src/main/java/org/apache/spark/launcher/JavaModuleOptions.java | 2 +- .../yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala | 3 +-- 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/launcher/src/main/java/org/apache/spark/launcher/JavaModuleOptions.java b/launcher/src/main/java/org/apache/spark/launcher/JavaModuleOptions.java index a7a6891746c2..8893f4bcb85a 100644 --- a/launcher/src/main/java/org/apache/spark/launcher/JavaModuleOptions.java +++ b/launcher/src/main/java/org/apache/spark/launcher/JavaModuleOptions.java @@ -20,7 +20,7 @@ package org.apache.spark.launcher; /** * This helper class is used to place the all `--add-opens` options * required by Spark when using Java 17. `DEFAULT_MODULE_OPTIONS` has added - * `-XX:+IgnoreUnrecognizedVMOptions` to be compatible with Java 8 and Java 11. 
+ * `-XX:+IgnoreUnrecognizedVMOptions` to be robust. * * @since 3.3.0 */ diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala index 22037ad5..6e3e0a1e644e 100644 --- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala +++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala @@ -1031,8 +1031,7 @@ private[spark] class Client( javaOpts += s"-Djava.net.preferIPv6Addresses=${Utils.preferIPv6}" // SPARK-37106: To start AM with Java 17, `JavaModuleOptions.defaultModuleOptions` -// is added by default. It will not affect Java 8 and Java 11 due to existence of -// `-XX:+IgnoreUnrecognizedVMOptions`. +// is added by default. javaOpts += JavaModuleOptions.defaultModuleOptions() // Set the environment variable through a command prefix - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-45276][INFRA][FOLLOWUP] Fix Java version comment from 11 to 17
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new f4525ff978a7 [SPARK-45276][INFRA][FOLLOWUP] Fix Java version comment from 11 to 17
f4525ff978a7 is described below

commit f4525ff978a7626d93311cb45425cbd591c0454e
Author: Dongjoon Hyun
AuthorDate: Sat Feb 3 18:33:59 2024 -0800

    [SPARK-45276][INFRA][FOLLOWUP] Fix Java version comment from 11 to 17

    ### What changes were proposed in this pull request?
    This is a follow-up of
    - #43076

    ### Why are the changes needed?
    To make the comment match the code.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Manual review.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #45013 from dongjoon-hyun/SPARK-45276.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 connector/docker/spark-test/base/Dockerfile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/connector/docker/spark-test/base/Dockerfile b/connector/docker/spark-test/base/Dockerfile
index 0e8593f8af5b..c397abc211e2 100644
--- a/connector/docker/spark-test/base/Dockerfile
+++ b/connector/docker/spark-test/base/Dockerfile
@@ -18,7 +18,7 @@ FROM ubuntu:20.04

 # Upgrade package index
-# install a few other useful packages plus Open Java 11
+# install a few other useful packages plus Open Java 17
 # Remove unneeded /var/lib/apt/lists/* after install to reduce the
 # docker image size (by ~30MB)
 RUN apt-get update && \
(spark) branch master updated: [SPARK-46968][SQL] Replace `UnsupportedOperationException` by `SparkUnsupportedOperationException` in `sql`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 6e60b232c769 [SPARK-46968][SQL] Replace `UnsupportedOperationException` by `SparkUnsupportedOperationException` in `sql` 6e60b232c769 is described below commit 6e60b232c7693738b1d005858e5dac24e7bafcaf Author: Max Gekk AuthorDate: Sat Feb 3 00:22:06 2024 -0800 [SPARK-46968][SQL] Replace `UnsupportedOperationException` by `SparkUnsupportedOperationException` in `sql` ### What changes were proposed in this pull request? In the PR, I propose to replace all `UnsupportedOperationException` by `SparkUnsupportedOperationException` in `sql` code base, and introduce new legacy error classes with the `_LEGACY_ERROR_TEMP_` prefix. ### Why are the changes needed? To unify Spark SQL exception, and port Java exceptions on Spark exceptions with error classes. ### Does this PR introduce _any_ user-facing change? Yes, it can if user's code assumes some particular format of `UnsupportedOperationException` messages. ### How was this patch tested? By running the affected test suites: ``` $ build/sbt "core/testOnly *SparkThrowableSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44937 from MaxGekk/migrate-UnsupportedOperationException-api. 
Authored-by: Max Gekk Signed-off-by: Dongjoon Hyun --- common/utils/src/main/resources/error/error-classes.json | 10 ++ .../org/apache/spark/sql/catalyst/trees/QueryContexts.scala | 12 ++-- .../scala/org/apache/spark/sql/catalyst/util/UDTUtils.scala | 3 ++- .../org/apache/spark/sql/execution/UnsafeRowSerializer.scala | 2 +- .../sql/execution/streaming/CompactibleFileStreamLog.scala | 4 ++-- .../spark/sql/execution/streaming/ValueStateImpl.scala | 2 -- .../streaming/state/HDFSBackedStateStoreProvider.scala | 5 ++--- .../apache/spark/sql/execution/streaming/state/RocksDB.scala | 7 --- 8 files changed, 27 insertions(+), 18 deletions(-) diff --git a/common/utils/src/main/resources/error/error-classes.json b/common/utils/src/main/resources/error/error-classes.json index 8399311cbfc4..ef9e81c98e05 100644 --- a/common/utils/src/main/resources/error/error-classes.json +++ b/common/utils/src/main/resources/error/error-classes.json @@ -7489,6 +7489,16 @@ "Datatype not supported " ] }, + "_LEGACY_ERROR_TEMP_3193" : { +"message" : [ + "Creating multiple column families with HDFSBackedStateStoreProvider is not supported" +] + }, + "_LEGACY_ERROR_TEMP_3197" : { +"message" : [ + "Failed to create column family with reserved name=" +] + }, "_LEGACY_ERROR_USER_RAISED_EXCEPTION" : { "message" : [ "" diff --git a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala index 57271e535afb..c716002ef35c 100644 --- a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala +++ b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala @@ -17,7 +17,7 @@ package org.apache.spark.sql.catalyst.trees -import org.apache.spark.{QueryContext, QueryContextType} +import org.apache.spark.{QueryContext, QueryContextType, SparkUnsupportedOperationException} /** The class represents error context of a SQL query. 
*/ case class SQLQueryContext( @@ -131,16 +131,16 @@ case class SQLQueryContext( originStartIndex.get <= originStopIndex.get } - override def callSite: String = throw new UnsupportedOperationException + override def callSite: String = throw SparkUnsupportedOperationException() } case class DataFrameQueryContext(stackTrace: Seq[StackTraceElement]) extends QueryContext { override val contextType = QueryContextType.DataFrame - override def objectType: String = throw new UnsupportedOperationException - override def objectName: String = throw new UnsupportedOperationException - override def startIndex: Int = throw new UnsupportedOperationException - override def stopIndex: Int = throw new UnsupportedOperationException + override def objectType: String = throw SparkUnsupportedOperationException() + override def objectName: String = throw SparkUnsupportedOperationException() + override def startIndex: Int = throw SparkUnsupportedOperationException() + override def stopIndex: Int = throw SparkUnsupportedOperationException() override val fragment: String = { stackTrace.headOption.map { firstElem => diff --git a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/UDTUtils.scala b/s
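For readers unfamiliar with Spark's error framework: each `SparkUnsupportedOperationException` carries an error class whose message template lives in `error-classes.json`, which is what keeps wording uniform across the code base. A minimal Python sketch of that template-lookup idea (illustrative only — the real Scala class also has a no-argument form and richer parameter handling):

```python
# Subset of error-classes.json templates, keyed by error class (illustrative).
ERROR_CLASSES = {
    "_LEGACY_ERROR_TEMP_3193":
        "Creating multiple column families with HDFSBackedStateStoreProvider "
        "is not supported",
    "_LEGACY_ERROR_TEMP_3197":
        "Failed to create column family with reserved name=<colFamilyName>",
}

class SparkUnsupportedOperationException(Exception):
    """Sketch of an error-class-based exception: the message is built from a
    shared template, substituting named parameters of the form <name>."""

    def __init__(self, error_class, message_parameters=None):
        self.error_class = error_class
        message = ERROR_CLASSES[error_class]
        for name, value in (message_parameters or {}).items():
            message = message.replace(f"<{name}>", value)
        super().__init__(f"[{error_class}] {message}")

err = SparkUnsupportedOperationException(
    "_LEGACY_ERROR_TEMP_3197", {"colFamilyName": "default"})
```

Because the template is shared, any caller that raises `_LEGACY_ERROR_TEMP_3197` produces an identical message shape, which is the point of porting plain Java exceptions onto error classes.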
(spark) branch master updated: [SPARK-46965][CORE] Check `logType` in `Utils.getLog`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 84387394c387 [SPARK-46965][CORE] Check `logType` in `Utils.getLog`
84387394c387 is described below

commit 84387394c387c7a6c171714f5d45d517b6bec7af
Author: Dongjoon Hyun
AuthorDate: Fri Feb 2 17:22:32 2024 -0800

    [SPARK-46965][CORE] Check `logType` in `Utils.getLog`

    ### What changes were proposed in this pull request?
    This PR aims to check `logType` in `Utils.getLog`.

    ### Why are the changes needed?
    To prevent a path-traversal security vulnerability.

    ### Does this PR introduce _any_ user-facing change?
    No. This is a new module which is not released yet.

    ### How was this patch tested?
    Manually.

    **BEFORE**
    ```
    $ sbin/start-master.sh
    $ curl -s 'http://localhost:8080/logPage/self?logType=../../../../../../etc/nfs.conf' | grep NFS
    # nfs.conf: the NFS configuration file
    ```

    **AFTER**
    ```
    $ sbin/start-master.sh
    $ curl -s 'http://localhost:8080/logPage/self?logType=../../../../../../etc/nfs.conf' | grep NFS
    ```

    For the `Spark History Server`, the same check applies on port 18080:
    ```
    $ curl -s 'http://localhost:18080/logPage/self?logType=../../../../../../../etc/nfs.conf' | grep NFS
    ```

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #45006 from dongjoon-hyun/SPARK-46965.
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/deploy/Utils.scala | 4 1 file changed, 4 insertions(+) diff --git a/core/src/main/scala/org/apache/spark/deploy/Utils.scala b/core/src/main/scala/org/apache/spark/deploy/Utils.scala index 9bbcc9f314b2..32328ae1e07a 100644 --- a/core/src/main/scala/org/apache/spark/deploy/Utils.scala +++ b/core/src/main/scala/org/apache/spark/deploy/Utils.scala @@ -32,6 +32,7 @@ import org.apache.spark.util.logging.RollingFileAppender */ private[deploy] object Utils extends Logging { val DEFAULT_BYTES = 100 * 1024 + val SUPPORTED_LOG_TYPES = Set("stderr", "stdout", "out") def addRenderLogHandler(page: WebUI, conf: SparkConf): Unit = { page.attachHandler(createServletHandler("/log", @@ -58,6 +59,9 @@ private[deploy] object Utils extends Logging { logType: String, offsetOption: Option[Long], byteLength: Int): (String, Long, Long, Long) = { +if (!SUPPORTED_LOG_TYPES.contains(logType)) { + return ("Error: Log type must be one of " + SUPPORTED_LOG_TYPES.mkString(", "), 0, 0, 0) +} try { // Find a log file name val fileName = if (logType.equals("out")) { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
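The patch boils down to an allow-list check performed before any file is resolved. A small Python sketch of the same idea (the set contents and error text mirror the Scala patch; everything else is illustrative):

```python
SUPPORTED_LOG_TYPES = {"stderr", "stdout", "out"}

def get_log(log_dir, log_type):
    # Reject anything outside the allow-list *before* any filesystem lookup,
    # so traversal values like "../../../../etc/nfs.conf" never reach a path.
    if log_type not in SUPPORTED_LOG_TYPES:
        return "Error: Log type must be one of " + ", ".join(sorted(SUPPORTED_LOG_TYPES))
    return f"{log_dir}/{log_type}"
```

An allow-list is preferable to trying to sanitize `..` sequences out of the input: the set of valid values is tiny and known, so everything else can simply be refused.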
(spark) branch master updated: [MINOR][SQL] Clean up outdated comments from `hash` function in `Metadata`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 704e9a0785f4 [MINOR][SQL] Clean up outdated comments from `hash` function in `Metadata` 704e9a0785f4 is described below commit 704e9a0785f4fc4dd86b950a649114e807a826a1 Author: yangjie01 AuthorDate: Thu Feb 1 09:31:24 2024 -0800 [MINOR][SQL] Clean up outdated comments from `hash` function in `Metadata` ### What changes were proposed in this pull request? This pr just clean up outdated comments from `hash` function in `Metadata` ### Why are the changes needed? Clean up outdated comments ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #44978 from LuciferYang/minior-remove-comments. Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- sql/api/src/main/scala/org/apache/spark/sql/types/Metadata.scala | 2 -- 1 file changed, 2 deletions(-) diff --git a/sql/api/src/main/scala/org/apache/spark/sql/types/Metadata.scala b/sql/api/src/main/scala/org/apache/spark/sql/types/Metadata.scala index 17be8cfa12b5..2ffd0f13ca10 100644 --- a/sql/api/src/main/scala/org/apache/spark/sql/types/Metadata.scala +++ b/sql/api/src/main/scala/org/apache/spark/sql/types/Metadata.scala @@ -208,8 +208,6 @@ object Metadata { /** Computes the hash code for the types we support. */ private def hash(obj: Any): Int = { obj match { - // `map.mapValues` return `Map` in Scala 2.12 and return `MapView` in Scala 2.13, call - // `toMap` for Scala version compatibility. case map: Map[_, _] => map.transform((_, v) => hash(v)).## case arr: Array[_] => - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
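The outdated comment was about Scala 2.12/2.13 differences; the surviving expression `map.transform((_, v) => hash(v)).##` hashes the transformed `Map` directly, which is order-independent. A rough Python analogue of that recursive hashing (illustrative, not the actual `Metadata.hash`):

```python
def hash_value(obj):
    # Dicts hash order-independently (like a Scala Map's ##); sequences
    # hash element-wise, so element order matters.
    if isinstance(obj, dict):
        return hash(frozenset((k, hash_value(v)) for k, v in obj.items()))
    if isinstance(obj, (list, tuple)):
        return hash(tuple(hash_value(v) for v in obj))
    return hash(obj)
```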
(spark) branch master updated: [SPARK-46940][CORE] Remove unused `updateSparkConfigFromProperties` and `isAbsoluteURI` in `o.a.s.u.Utils`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d5ca61692c34 [SPARK-46940][CORE] Remove unused `updateSparkConfigFromProperties` and `isAbsoluteURI` in `o.a.s.u.Utils` d5ca61692c34 is described below commit d5ca61692c34449bc602db6cf0919010ec5a50a3 Author: panbingkun AuthorDate: Thu Feb 1 09:30:07 2024 -0800 [SPARK-46940][CORE] Remove unused `updateSparkConfigFromProperties` and `isAbsoluteURI` in `o.a.s.u.Utils` ### What changes were proposed in this pull request? The pr aims to remove unused `updateSparkConfigFromProperties` and `isAbsoluteURI` in `o.a.s.u.Utils`. ### Why are the changes needed? Keep the code cleanly. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44979 from panbingkun/SPARK-46940. Authored-by: panbingkun Signed-off-by: Dongjoon Hyun --- .../main/scala/org/apache/spark/util/Utils.scala | 25 -- 1 file changed, 25 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/util/Utils.scala b/core/src/main/scala/org/apache/spark/util/Utils.scala index a55539c0a235..b49f97aed05e 100644 --- a/core/src/main/scala/org/apache/spark/util/Utils.scala +++ b/core/src/main/scala/org/apache/spark/util/Utils.scala @@ -1884,17 +1884,6 @@ private[spark] object Utils } } - /** Check whether a path is an absolute URI. */ - def isAbsoluteURI(path: String): Boolean = { -try { - val uri = new URI(path: String) - uri.isAbsolute -} catch { - case _: URISyntaxException => -false -} - } - /** Return all non-local paths from a comma-separated list of paths. 
*/ def nonLocalPaths(paths: String, testWindows: Boolean = false): Array[String] = { val windows = isWindows || testWindows @@ -1931,20 +1920,6 @@ private[spark] object Utils path } - /** - * Updates Spark config with properties from a set of Properties. - * Provided properties have the highest priority. - */ - def updateSparkConfigFromProperties( - conf: SparkConf, - properties: Map[String, String]) : Unit = { -properties.filter { case (k, v) => - k.startsWith("spark.") -}.foreach { case (k, v) => - conf.set(k, v) -} - } - /** * Implements the same logic as JDK `java.lang.String#trim` by removing leading and trailing * non-printable characters less or equal to '\u0020' (SPACE) but preserves natural line - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
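For reference, the removed `updateSparkConfigFromProperties` simply copied `spark.`-prefixed entries into the config, with the provided properties taking precedence. In Python terms (an illustrative translation, not a Spark API):

```python
def update_conf_from_properties(conf, properties):
    # Only spark.* keys are copied; provided properties override existing
    # entries, i.e. they have the highest priority.
    for key, value in properties.items():
        if key.startswith("spark."):
            conf[key] = value
    return conf
```

Since nothing in the code base called this helper (or `isAbsoluteURI`) anymore, deleting it is purely a cleanup with no behavior change.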
(spark) branch master updated: [SPARK-46882][SS][TEST] Replace unnecessary AtomicInteger with int
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new c63dea8f4235 [SPARK-46882][SS][TEST] Replace unnecessary AtomicInteger with int
c63dea8f4235 is described below

commit c63dea8f42357ecfd4fe41f04732e2cb0d0d53ae
Author: beliefer
AuthorDate: Wed Jan 31 17:47:50 2024 -0800

    [SPARK-46882][SS][TEST] Replace unnecessary AtomicInteger with int

    ### What changes were proposed in this pull request?
    This PR proposes to replace an unnecessary `AtomicInteger` with a plain int.

    ### Why are the changes needed?
    The variable `value` of `GetMaxCounter` is always accessed while holding the object's own lock (`synchronized`), so the `AtomicInteger` is unnecessary and can be replaced with an int.

    ### Does this PR introduce _any_ user-facing change?
    'No'.

    ### How was this patch tested?
    GA.

    ### Was this patch authored or co-authored using generative AI tooling?
    'No'.

    Closes #44907 from beliefer/SPARK-46882.
Authored-by: beliefer Signed-off-by: Dongjoon Hyun --- .../apache/spark/streaming/util/WriteAheadLogSuite.scala| 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/streaming/src/test/scala/org/apache/spark/streaming/util/WriteAheadLogSuite.scala b/streaming/src/test/scala/org/apache/spark/streaming/util/WriteAheadLogSuite.scala index 3a9fffec13cf..cf9d5b7387f7 100644 --- a/streaming/src/test/scala/org/apache/spark/streaming/util/WriteAheadLogSuite.scala +++ b/streaming/src/test/scala/org/apache/spark/streaming/util/WriteAheadLogSuite.scala @@ -20,7 +20,6 @@ import java.io._ import java.nio.ByteBuffer import java.util.{Iterator => JIterator} import java.util.concurrent.{CountDownLatch, RejectedExecutionException, ThreadPoolExecutor, TimeUnit} -import java.util.concurrent.atomic.AtomicInteger import scala.collection.mutable.ArrayBuffer import scala.concurrent._ @@ -238,14 +237,14 @@ class FileBasedWriteAheadLogSuite val executionContext = ExecutionContext.fromExecutorService(fpool) class GetMaxCounter { - private val value = new AtomicInteger() - @volatile private var max: Int = 0 + private var value = 0 + private var max: Int = 0 def increment(): Unit = synchronized { -val atInstant = value.incrementAndGet() -if (atInstant > max) max = atInstant +value = value + 1 +if (value > max) max = value } - def decrement(): Unit = synchronized { value.decrementAndGet() } - def get(): Int = synchronized { value.get() } + def decrement(): Unit = synchronized { value = value - 1 } + def get(): Int = synchronized { value } def getMax(): Int = synchronized { max } } try { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
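The key observation is that once every access happens inside the same `synchronized` block, a plain integer is enough — the atomic adds nothing. A Python port of the reworked `GetMaxCounter` (illustrative; the original is Scala test code in `WriteAheadLogSuite`):

```python
import threading

class GetMaxCounter:
    """A plain int guarded by one lock; tracks the high-water mark."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0
        self._max = 0

    def increment(self):
        with self._lock:
            self._value += 1
            self._max = max(self._max, self._value)

    def decrement(self):
        with self._lock:
            self._value -= 1

    def get(self):
        with self._lock:
            return self._value

    def get_max(self):
        with self._lock:
            return self._max
```

Mixing an atomic with a `@volatile` max (the old version) gave no extra safety here, because the increment-and-update-max step still had to be one critical section anyway.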
(spark) branch master updated: [SPARK-46931][PS] Implement `{Frame, Series}.to_hdf`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 88f121c47778 [SPARK-46931][PS] Implement `{Frame, Series}.to_hdf` 88f121c47778 is described below commit 88f121c47778f0755862046d09484a83932cb30b Author: Ruifeng Zheng AuthorDate: Wed Jan 31 08:41:21 2024 -0800 [SPARK-46931][PS] Implement `{Frame, Series}.to_hdf` ### What changes were proposed in this pull request? Implement `{Frame, Series}.to_hdf` ### Why are the changes needed? pandas parity ### Does this PR introduce _any_ user-facing change? yes ``` In [3]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c']) In [4]: df.to_hdf('/tmp/data.h5', key='df', mode='w') In [5]: psdf = ps.from_pandas(df) In [6]: psdf.to_hdf('/tmp/data2.h5', key='df', mode='w') /Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:1015: PandasAPIOnSparkAdviceWarning: `to_hdf` loads all data into the driver's memory. It should only be used if the resulting DataFrame is expected to be small. warnings.warn(message, PandasAPIOnSparkAdviceWarning) In [7]: !ls /tmp/*h5 /tmp/data.h5/tmp/data2.h5 In [8]: !ls -lh /tmp/*h5 -rw-r--r-- 1 ruifeng.zheng wheel 6.9K Jan 31 12:21 /tmp/data.h5 -rw-r--r-- 1 ruifeng.zheng wheel 6.9K Jan 31 12:21 /tmp/data2.h5 ``` ### How was this patch tested? manually test, `hdf` requires additional library `pytables` which in turn needs [many prerequisites](https://www.pytables.org/usersguide/installation.html#prerequisites) since `pytables` is just a optional dep of `Pandas`, so I think we can avoid adding it to CI first. ### Was this patch authored or co-authored using generative AI tooling? no Closes #44966 from zhengruifeng/ps_to_hdf. 
Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- .../docs/source/reference/pyspark.pandas/frame.rst | 1 + .../source/reference/pyspark.pandas/series.rst | 1 + python/pyspark/pandas/generic.py | 120 + python/pyspark/pandas/missing/frame.py | 1 - python/pyspark/pandas/missing/series.py| 1 - 5 files changed, 122 insertions(+), 2 deletions(-) diff --git a/python/docs/source/reference/pyspark.pandas/frame.rst b/python/docs/source/reference/pyspark.pandas/frame.rst index 12cf6e7db12f..77b60468b8fb 100644 --- a/python/docs/source/reference/pyspark.pandas/frame.rst +++ b/python/docs/source/reference/pyspark.pandas/frame.rst @@ -286,6 +286,7 @@ Serialization / IO / Conversion DataFrame.to_json DataFrame.to_dict DataFrame.to_excel + DataFrame.to_hdf DataFrame.to_clipboard DataFrame.to_markdown DataFrame.to_records diff --git a/python/docs/source/reference/pyspark.pandas/series.rst b/python/docs/source/reference/pyspark.pandas/series.rst index 88d1861c6ccf..5606fa93a5f3 100644 --- a/python/docs/source/reference/pyspark.pandas/series.rst +++ b/python/docs/source/reference/pyspark.pandas/series.rst @@ -486,6 +486,7 @@ Serialization / IO / Conversion Series.to_json Series.to_csv Series.to_excel + Series.to_hdf Series.to_frame Pandas-on-Spark specific diff --git a/python/pyspark/pandas/generic.py b/python/pyspark/pandas/generic.py index 77cefb53fe5d..ed2aeb8ea6af 100644 --- a/python/pyspark/pandas/generic.py +++ b/python/pyspark/pandas/generic.py @@ -1103,6 +1103,126 @@ class Frame(object, metaclass=ABCMeta): psdf._to_internal_pandas(), self.to_excel, f, args ) +def to_hdf( +self, +path_or_buf: Union[str, pd.HDFStore], +key: str, +mode: str = "a", +complevel: Optional[int] = None, +complib: Optional[str] = None, +append: bool = False, +format: Optional[str] = None, +index: bool = True, +min_itemsize: Optional[Union[int, Dict[str, int]]] = None, +nan_rep: Optional[Any] = None, +dropna: Optional[bool] = None, +data_columns: Optional[Union[bool, List[str]]] = None, 
+errors: str = "strict", +encoding: str = "UTF-8", +) -> None: +""" +Write the contained data to an HDF5 file using HDFStore. + +.. note:: This method should only be used if the resulting DataFrame is expected + to be small, as all the data is loaded into the driver's memory. + +.. versionadded:: 4.0.0 + +Parameters +-- +path_or_buf : str or pandas.HDFStore +File path or HDFStore object. +key : str +
(spark) branch master updated: [SPARK-46930][SQL] Add support for a custom prefix for Union type fields in Avro
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d49265a170fb [SPARK-46930][SQL] Add support for a custom prefix for Union type fields in Avro d49265a170fb is described below commit d49265a170fb7bb06471d97f4483139529939ecd Author: Ivan Sadikov AuthorDate: Wed Jan 31 08:39:46 2024 -0800 [SPARK-46930][SQL] Add support for a custom prefix for Union type fields in Avro ### What changes were proposed in this pull request? This PR enhances stable ids functionality in Avro by allowing users to configure a custom prefix for Union type member fields when `enableStableIdentifiersForUnionType` is enabled. Without the patch, the fields are generated with `member_` prefix, e.g. `member_int`, `member_string`. This could become difficult to change for complex schemas. The solution is to add a new option `stableIdentifierPrefixForUnionType` which defaults to `member_` and allows users to configure whatever prefix they require, e.g. `member`, `tmp_`, or even an empty string. ### Why are the changes needed? Allows to customise the prefix of stable ids in Avro without the need to rename all of the columns which could be cumbersome for complex schemas. ### Does this PR introduce _any_ user-facing change? Yes. The PR adds a new option in Avro: `stableIdentifierPrefixForUnionType`. ### How was this patch tested? Existing tests + a new unit test to verify different prefixes. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44964 from sadikovi/SPARK-46930. 
Authored-by: Ivan Sadikov Signed-off-by: Dongjoon Hyun --- .../apache/spark/sql/avro/AvroDataToCatalyst.scala | 12 +++-- .../apache/spark/sql/avro/AvroDeserializer.scala | 12 +++-- .../org/apache/spark/sql/avro/AvroFileFormat.scala | 3 +- .../org/apache/spark/sql/avro/AvroOptions.scala| 6 +++ .../org/apache/spark/sql/avro/AvroUtils.scala | 5 +- .../apache/spark/sql/avro/SchemaConverters.scala | 58 +++--- .../sql/v2/avro/AvroPartitionReaderFactory.scala | 3 +- .../sql/avro/AvroCatalystDataConversionSuite.scala | 7 +-- .../apache/spark/sql/avro/AvroFunctionsSuite.scala | 3 +- .../apache/spark/sql/avro/AvroRowReaderSuite.scala | 3 +- .../org/apache/spark/sql/avro/AvroSerdeSuite.scala | 3 +- .../org/apache/spark/sql/avro/AvroSuite.scala | 54 +++- docs/sql-data-sources-avro.md | 10 +++- 13 files changed, 133 insertions(+), 46 deletions(-) diff --git a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala index 9f31a2db55a5..7d80998d96eb 100644 --- a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala +++ b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala @@ -40,7 +40,9 @@ private[sql] case class AvroDataToCatalyst( override lazy val dataType: DataType = { val dt = SchemaConverters.toSqlType( - expectedSchema, avroOptions.useStableIdForUnionType).dataType + expectedSchema, + avroOptions.useStableIdForUnionType, + avroOptions.stableIdPrefixForUnionType).dataType parseMode match { // With PermissiveMode, the output Catalyst row might contain columns of null values for // corrupt records, even if some of the columns are not nullable in the user-provided schema. 
@@ -62,8 +64,12 @@ private[sql] case class AvroDataToCatalyst( @transient private lazy val reader = new GenericDatumReader[Any](actualSchema, expectedSchema) @transient private lazy val deserializer = -new AvroDeserializer(expectedSchema, dataType, - avroOptions.datetimeRebaseModeInRead, avroOptions.useStableIdForUnionType) +new AvroDeserializer( + expectedSchema, + dataType, + avroOptions.datetimeRebaseModeInRead, + avroOptions.useStableIdForUnionType, + avroOptions.stableIdPrefixForUnionType) @transient private var decoder: BinaryDecoder = _ diff --git a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala index 9e10fac8bb55..139c45adb442 100644 --- a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala +++ b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala @@ -50,20 +50,23 @@ private[sql] class AvroDeserializer( positionalFieldMatch: Boolean, datetimeRebaseSpec: RebaseSpec, filters: StructFilters, -useStableIdForUnionType: Boolean) { +useStableIdForUnionType: Boolean
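Conceptually, stable identifiers turn each Avro union member into a struct field named `<prefix><member type>`, and the new option only makes the prefix configurable. A simplified sketch (illustrative — the real `SchemaConverters` also de-duplicates colliding names and handles nested and logical types):

```python
def stable_union_field_names(member_type_names, prefix="member_"):
    # One struct field per union member; the default prefix reproduces the
    # historical member_int / member_string names.
    return [f"{prefix}{name}" for name in member_type_names]

stable_union_field_names(["int", "string"])            # -> ['member_int', 'member_string']
stable_union_field_names(["int", "string"], "tmp_")    # -> ['tmp_int', 'tmp_string']
```

Setting the prefix to an empty string yields bare type names, which is the kind of bulk rename that would otherwise require touching every column of a complex schema.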
(spark) branch master updated: [SPARK-46921][BUILD] Move `ProblemFilters` that do not belong to `defaultExcludes` to `v40excludes`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8e29c0d8c5cd [SPARK-46921][BUILD] Move `ProblemFilters` that do not belong to `defaultExcludes` to `v40excludes` 8e29c0d8c5cd is described below commit 8e29c0d8c5cdc87d0a7358e090af864c4f03b1a8 Author: yangjie01 AuthorDate: Tue Jan 30 08:44:34 2024 -0800 [SPARK-46921][BUILD] Move `ProblemFilters` that do not belong to `defaultExcludes` to `v40excludes` ### What changes were proposed in this pull request? This pr just move `ProblemFilters` that do not belong to `defaultExcludes` to `v40excludes`. ### Why are the changes needed? We should not arbitrarily add entries to `defaultExcludes`, as it represents never participating in the mima check ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Mima check passed ### Was this patch authored or co-authored using generative AI tooling? No Closes #44952 from LuciferYang/SPARK-46921. 
Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- project/MimaExcludes.scala | 38 +++--- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala index 43723742be97..64c5599919a6 100644 --- a/project/MimaExcludes.scala +++ b/project/MimaExcludes.scala @@ -59,7 +59,25 @@ object MimaExcludes { // [SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.SparkEnv.this"), // [SPARK-46480][CORE][SQL] Fix NPE when table cache task attempt - ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.TaskContext.isFailed") + ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.TaskContext.isFailed"), + +// SPARK-43299: Convert StreamingQueryException in Scala Client + ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.streaming.StreamingQueryException"), + +// SPARK-45856: Move ArtifactManager from Spark Connect into SparkSession (sql/core) + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.apply"), + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.userId"), + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.sessionId"), + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.copy"), + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.copy$default$3"), + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.this"), + ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.storage.CacheId$"), + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.apply"), + +// SPARK-46410: Assign error classes/subclasses to JdbcUtils.classifyException + 
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.jdbc.JdbcDialect.classifyException"), +// [SPARK-464878][CORE][SQL] (false alert). Invalid rule for StringType extension. + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.types.StringType.this") ) // Default exclude rules @@ -92,24 +110,6 @@ object MimaExcludes { ProblemFilters.exclude[Problem]("org.sparkproject.spark_protobuf.protobuf.*"), ProblemFilters.exclude[Problem]("org.apache.spark.sql.protobuf.utils.SchemaConverters.*"), -// SPARK-43299: Convert StreamingQueryException in Scala Client - ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.streaming.StreamingQueryException"), - -// SPARK-45856: Move ArtifactManager from Spark Connect into SparkSession (sql/core) - ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.apply"), - ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.userId"), - ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.sessionId"), - ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.copy"), - ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.copy$default$3"), - ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.this"), - ProblemFilters.exclude[MissingTypesPro
(spark) branch branch-3.4 updated: [SPARK-46893][UI] Remove inline scripts from UI descriptions
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new edaa0fd8d096 [SPARK-46893][UI] Remove inline scripts from UI descriptions edaa0fd8d096 is described below commit edaa0fd8d096a3e57918e4b6e437337fcfdc8276 Author: Willi Raschkowski AuthorDate: Mon Jan 29 22:43:21 2024 -0800 [SPARK-46893][UI] Remove inline scripts from UI descriptions ### What changes were proposed in this pull request? This PR prevents malicious users from injecting inline scripts via job and stage descriptions. Spark's Web UI [already checks the security of job and stage descriptions](https://github.com/apache/spark/blob/a368280708dd3c6eb90bd3b09a36a68bdd096222/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L528-L545) before rendering them as HTML (or treating them as plain text). The UI already disallows `
(spark) branch branch-3.5 updated: [SPARK-46893][UI] Remove inline scripts from UI descriptions
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 343ae8226161 [SPARK-46893][UI] Remove inline scripts from UI descriptions 343ae8226161 is described below commit 343ae822616185022570f1c14b151e54ff54e265 Author: Willi Raschkowski AuthorDate: Mon Jan 29 22:43:21 2024 -0800 [SPARK-46893][UI] Remove inline scripts from UI descriptions ### What changes were proposed in this pull request? This PR prevents malicious users from injecting inline scripts via job and stage descriptions. Spark's Web UI [already checks the security of job and stage descriptions](https://github.com/apache/spark/blob/a368280708dd3c6eb90bd3b09a36a68bdd096222/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L528-L545) before rendering them as HTML (or treating them as plain text). The UI already disallows `
(spark) branch master updated (41a1426e9ee3 -> abd9d27e87b9)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 41a1426e9ee3 [SPARK-46914][UI] Shorten app name in the summary table on the History Page add abd9d27e87b9 [SPARK-46893][UI] Remove inline scripts from UI descriptions No new revisions were added by this update. Summary of changes: core/src/main/scala/org/apache/spark/ui/UIUtils.scala | 12 +--- core/src/test/scala/org/apache/spark/ui/UIUtilsSuite.scala | 14 ++ 2 files changed, 23 insertions(+), 3 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
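The XSS hardening described in the emails above boils down to never letting user-supplied job or stage descriptions execute as live HTML. As a rough, hedged illustration (not Spark's actual `UIUtils` implementation, which validates the submitted markup rather than escaping everything), treating the description as plain text defuses an injected script:

```python
import html

def render_description(desc: str) -> str:
    # Simplified illustration only: escaping the description means an
    # injected <script> tag or inline onclick= handler is displayed as
    # text in the UI instead of being executed by the browser.
    return html.escape(desc)

print(render_description('<script>alert(1)</script>'))
# &lt;script&gt;alert(1)&lt;/script&gt;
```

Spark's real check is more permissive (it allows a safe subset of HTML in descriptions), but the escaping fallback above is the conservative baseline.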
(spark) branch master updated: [SPARK-46914][UI] Shorten app name in the summary table on the History Page
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 41a1426e9ee3 [SPARK-46914][UI] Shorten app name in the summary table on the History Page 41a1426e9ee3 is described below commit 41a1426e9ee318a9421fad11776eb6894bb1f04b Author: Kent Yao AuthorDate: Mon Jan 29 22:07:19 2024 -0800 [SPARK-46914][UI] Shorten app name in the summary table on the History Page ### What changes were proposed in this pull request? This Pull Request shortens long app names to prevent overflow in the app table. ### Why are the changes needed? better UX ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new js tests and built and tested locally: ![image](https://github.com/apache/spark/assets/8326978/f78bd580-74b1-4fe5-9d8b-f2d49ce85ed9) ![image](https://github.com/apache/spark/assets/8326978/10bca509-00e5-4d8f-bf11-324c1080190b) ### Was this patch authored or co-authored using generative AI tooling? no Closes #44944 from yaooqinn/SPARK-46914. 
Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- .../resources/org/apache/spark/ui/static/historypage.js | 12 .../main/resources/org/apache/spark/ui/static/utils.js| 15 ++- ui-test/tests/utils.test.js | 7 +++ 3 files changed, 29 insertions(+), 5 deletions(-) diff --git a/core/src/main/resources/org/apache/spark/ui/static/historypage.js b/core/src/main/resources/org/apache/spark/ui/static/historypage.js index 85cd5a554750..8961140a4019 100644 --- a/core/src/main/resources/org/apache/spark/ui/static/historypage.js +++ b/core/src/main/resources/org/apache/spark/ui/static/historypage.js @@ -17,7 +17,7 @@ /* global $, Mustache, jQuery, uiRoot */ -import {formatDuration, formatTimeMillis} from "./utils.js"; +import {formatDuration, formatTimeMillis, stringAbbreviate} from "./utils.js"; export {setAppLimit}; @@ -186,9 +186,13 @@ $(document).ready(function() { name: 'appId', type: "appid-numeric", data: 'id', -render: (id, type, row) => `${id}` +render: (id, type, row) => `${id}` + }, + { +name: 'appName', +data: 'name', +render: (name) => stringAbbreviate(name, 60) }, - {name: 'appName', data: 'name' }, { name: attemptIdColumnName, data: 'attemptId', @@ -200,7 +204,7 @@ $(document).ready(function() { name: durationColumnName, type: "title-numeric", data: 'duration', -render: (id, type, row) => `${row.duration}` +render: (id, type, row) => `${row.duration}` }, {name: 'user', data: 'sparkUser' }, {name: 'lastUpdated', data: 'lastUpdated' }, diff --git a/core/src/main/resources/org/apache/spark/ui/static/utils.js b/core/src/main/resources/org/apache/spark/ui/static/utils.js index 960640791fe5..2d4123bc75ab 100644 --- a/core/src/main/resources/org/apache/spark/ui/static/utils.js +++ b/core/src/main/resources/org/apache/spark/ui/static/utils.js @@ -20,7 +20,7 @@ export { errorMessageCell, errorSummary, formatBytes, formatDate, formatDuration, formatLogsCells, formatTimeMillis, getBaseURI, getStandAloneAppId, getTimeZone, - setDataTableDefaults + setDataTableDefaults, 
stringAbbreviate }; /* global $, uiRoot */ @@ -272,3 +272,16 @@ function errorMessageCell(errorMessage) { const details = detailsUINode(isMultiline, errorMessage); return summary + details; } + +function stringAbbreviate(content, limit) { + if (content && content.length > limit) { +const summary = content.substring(0, limit) + '...'; +// TODO: Reused stacktrace-details* style for convenience, but it's not really a stacktrace +// Consider creating a new style for this case if stacktrace-details is not appropriate in +// the future. +const details = detailsUINode(true, content); +return summary + details; + } else { +return content; + } +} diff --git a/ui-test/tests/utils.test.js b/ui-test/tests/utils.test.js index ad3e87b76641..a6815577bd82 100644 --- a/ui-test/tests/utils.test.js +++ b/ui-test/tests/utils.test.js @@ -67,3 +67,10 @@ test('errorSummary', function () { const e2 = "java.lang.RuntimeException: random text"; expect(utils.errorSummary(e2).toString()).toBe('java.lang.RuntimeException,true'); }); + +test('stringAbbreviat
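The truncation rule added in `stringAbbreviate` can be restated in Python for clarity (a sketch of the rule only; the real JavaScript helper additionally wraps the full text in an expandable details node):

```python
def string_abbreviate(content, limit):
    # Texts longer than `limit` keep the first `limit` characters plus an
    # ellipsis; shorter (or empty/None) texts pass through unchanged.
    if content and len(content) > limit:
        return content[:limit] + '...'
    return content

string_abbreviate('abcdefgh', 5)  # -> 'abcde...'
string_abbreviate('abc', 5)      # -> 'abc'
```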
(spark) branch master updated: [SPARK-46916][PS][TESTS] Clean up `pyspark.pandas.tests.indexes.*`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e3143c4c2806 [SPARK-46916][PS][TESTS] Clean up `pyspark.pandas.tests.indexes.*` e3143c4c2806 is described below commit e3143c4c28068b80865c4ed9780a5a4beec0a7e8 Author: Ruifeng Zheng AuthorDate: Mon Jan 29 22:05:12 2024 -0800 [SPARK-46916][PS][TESTS] Clean up `pyspark.pandas.tests.indexes.*` ### What changes were proposed in this pull request? Clean up `pyspark.pandas.tests.indexes.*`: 1, delete unused imports, variables; 2, avoid double definition of the testing datasets; ### Why are the changes needed? code clean up ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #44945 from zhengruifeng/ps_test_index_cleanup. Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- .../pandas/tests/connect/indexes/test_parity_align.py| 11 ++- .../pandas/tests/connect/indexes/test_parity_indexing.py | 11 ++- .../pandas/tests/connect/indexes/test_parity_reindex.py | 11 ++- .../pandas/tests/connect/indexes/test_parity_rename.py | 11 ++- .../pandas/tests/connect/indexes/test_parity_reset_index.py | 9 - python/pyspark/pandas/tests/indexes/test_align.py| 8 ++-- python/pyspark/pandas/tests/indexes/test_asof.py | 4 ++-- python/pyspark/pandas/tests/indexes/test_astype.py | 4 ++-- python/pyspark/pandas/tests/indexes/test_datetime.py | 6 +- python/pyspark/pandas/tests/indexes/test_delete.py | 4 ++-- python/pyspark/pandas/tests/indexes/test_diff.py | 4 ++-- python/pyspark/pandas/tests/indexes/test_drop.py | 4 ++-- python/pyspark/pandas/tests/indexes/test_indexing.py | 12 ++-- .../pandas/tests/indexes/test_indexing_loc_multi_idx.py | 1 - python/pyspark/pandas/tests/indexes/test_insert.py | 11 ++- 
python/pyspark/pandas/tests/indexes/test_map.py | 4 ++-- python/pyspark/pandas/tests/indexes/test_reindex.py | 8 ++-- python/pyspark/pandas/tests/indexes/test_rename.py | 8 ++-- python/pyspark/pandas/tests/indexes/test_reset_index.py | 8 ++-- python/pyspark/pandas/tests/indexes/test_sort.py | 4 ++-- python/pyspark/pandas/tests/indexes/test_symmetric_diff.py | 4 ++-- python/pyspark/pandas/tests/indexes/test_take.py | 4 ++-- python/pyspark/pandas/tests/indexes/test_timedelta.py| 6 +- 23 files changed, 92 insertions(+), 65 deletions(-) diff --git a/python/pyspark/pandas/tests/connect/indexes/test_parity_align.py b/python/pyspark/pandas/tests/connect/indexes/test_parity_align.py index 0bf84e6421f2..2bb56242ba34 100644 --- a/python/pyspark/pandas/tests/connect/indexes/test_parity_align.py +++ b/python/pyspark/pandas/tests/connect/indexes/test_parity_align.py @@ -16,16 +16,17 @@ # import unittest -from pyspark import pandas as ps from pyspark.pandas.tests.indexes.test_align import FrameAlignMixin from pyspark.testing.connectutils import ReusedConnectTestCase from pyspark.testing.pandasutils import PandasOnSparkTestUtils -class FrameParityAlignTests(FrameAlignMixin, PandasOnSparkTestUtils, ReusedConnectTestCase): -@property -def psdf(self): -return ps.from_pandas(self.pdf) +class FrameParityAlignTests( +FrameAlignMixin, +PandasOnSparkTestUtils, +ReusedConnectTestCase, +): +pass if __name__ == "__main__": diff --git a/python/pyspark/pandas/tests/connect/indexes/test_parity_indexing.py b/python/pyspark/pandas/tests/connect/indexes/test_parity_indexing.py index a76489314d25..5e52dd91474a 100644 --- a/python/pyspark/pandas/tests/connect/indexes/test_parity_indexing.py +++ b/python/pyspark/pandas/tests/connect/indexes/test_parity_indexing.py @@ -16,16 +16,17 @@ # import unittest -from pyspark import pandas as ps from pyspark.pandas.tests.indexes.test_indexing import FrameIndexingMixin from pyspark.testing.connectutils import ReusedConnectTestCase from 
pyspark.testing.pandasutils import PandasOnSparkTestUtils -class FrameParityIndexingTests(FrameIndexingMixin, PandasOnSparkTestUtils, ReusedConnectTestCase): -@property -def psdf(self): -return ps.from_pandas(self.pdf) +class FrameParityIndexingTests( +FrameIndexingMixin, +PandasOnSparkTestUtils, +ReusedConnectTestCase, +): +pass if __name__ == "__main__": diff --git a/python/pyspark/pandas/tests/connect/indexes/test_parity_reindex.py b/python/pyspark/pandas/tests/connect/indexes/test_parity_reindex.p
(spark) branch master updated: [SPARK-46907][CORE] Show driver log location in Spark History Server
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 29355c07580e [SPARK-46907][CORE] Show driver log location in Spark History Server 29355c07580e is described below commit 29355c07580e68d48546e9e210c876b69a8c10a2 Author: Dongjoon Hyun AuthorDate: Mon Jan 29 14:04:07 2024 -0800 [SPARK-46907][CORE] Show driver log location in Spark History Server ### What changes were proposed in this pull request? This PR aims to show `Driver Log Location` in Spark History Server UI if `spark.driver.log.dfsDir` is configured. ### Why are the changes needed? **BEFORE (or `spark.driver.log.dfsDir` is absent)** ![Screenshot 2024-01-29 at 10 11 06 AM](https://github.com/apache/spark/assets/9700541/6d709b4b-d002-422b-a1df-bb5e1b50b539) **AFTER** ![Screenshot 2024-01-29 at 10 10 25 AM](https://github.com/apache/spark/assets/9700541/83b35a7d-5fc9-443a-a6e5-7b6bd98dbdc6) ### Does this PR introduce _any_ user-facing change? No. This is additional UI information, shown only for users who set the `spark.driver.log.dfsDir` configuration. ### How was this patch tested? Manual. ``` $ mkdir /tmp/history $ mkdir /tmp/driver-logs $ SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=/tmp/history -Dspark.driver.log.dfsDir=/tmp/driver-logs" sbin/start-history-server.sh ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44936 from dongjoon-hyun/SPARK-46907. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../scala/org/apache/spark/deploy/history/FsHistoryProvider.scala | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala index 8f64de0847ec..7c888a07263a 100644 --- a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala +++ b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala @@ -381,7 +381,12 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) } else { Map() } -Map("Event log directory" -> logDir) ++ safeMode +val driverLog = if (conf.contains(DRIVER_LOG_DFS_DIR)) { + Map("Driver log directory" -> conf.get(DRIVER_LOG_DFS_DIR).get) +} else { + Map() +} +Map("Event log directory" -> logDir) ++ safeMode ++ driverLog } override def start(): Unit = { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
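The patch above uses a common pattern: each optional summary row is built as a separate, possibly empty map, and the maps are concatenated, so "Driver log directory" only appears when `spark.driver.log.dfsDir` is configured. A Python rendering of that idea (illustrative, not the Scala code):

```python
def history_summary(conf: dict, log_dir: str) -> dict:
    # Optional rows are separate (possibly empty) dicts merged into the
    # result, mirroring `Map(...) ++ safeMode ++ driverLog` in the patch.
    driver_log = {}
    if "spark.driver.log.dfsDir" in conf:
        driver_log["Driver log directory"] = conf["spark.driver.log.dfsDir"]
    return {"Event log directory": log_dir, **driver_log}
```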
(spark) branch master updated (e211dbdee42c -> c468c3d5c685)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from e211dbdee42c [SPARK-46831][SQL] Collations - Extending StringType and PhysicalStringType with collationId field add c468c3d5c685 [SPARK-46904][UI] Fix display issue of History UI summary No new revisions were added by this update. Summary of changes: .../apache/spark/deploy/history/HistoryPage.scala | 121 +++-- 1 file changed, 64 insertions(+), 57 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46903][CORE] Support Spark History Server Log UI
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8dd395b2eabd [SPARK-46903][CORE] Support Spark History Server Log UI 8dd395b2eabd is described below commit 8dd395b2eabd2815982022b38a5287dae7af8b82 Author: Dongjoon Hyun AuthorDate: Mon Jan 29 01:32:45 2024 -0800 [SPARK-46903][CORE] Support Spark History Server Log UI ### What changes were proposed in this pull request? This PR aims to make `Spark History Server` provide its server log view link and page. ### Why are the changes needed? To improve UX. - A `Show server log` link is added at the bottom of the page. ![Screenshot 2024-01-29 at 12 54 41 AM](https://github.com/apache/spark/assets/9700541/7e5cea9f-8ac8-4a60-a249-d1bb31f6e269) - The link opens the following log view page. ![Screenshot 2024-01-29 at 12 55 41 AM](https://github.com/apache/spark/assets/9700541/70cf0c77-fc67-4ad8-97db-b061fdd1ffd0) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44932 from dongjoon-hyun/SPARK-46903. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../apache/spark/deploy/history/HistoryPage.scala | 2 + .../spark/deploy/history/HistoryServer.scala | 1 + .../org/apache/spark/deploy/history/LogPage.scala | 126 + 3 files changed, 129 insertions(+) diff --git a/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala b/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala index 7ba9b2c54937..03d880f47306 100644 --- a/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala +++ b/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala @@ -94,6 +94,8 @@ private[history] class HistoryPage(parent: HistoryServer) extends WebUIPage("") } } + + Show server log UIUtils.basicSparkPage(request, content, "History Server", true) diff --git a/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala b/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala index 321f76923411..8ba610e0a13d 100644 --- a/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala +++ b/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala @@ -148,6 +148,7 @@ class HistoryServer( */ def initialize(): Unit = { attachPage(new HistoryPage(this)) +attachPage(new LogPage(conf)) attachHandler(ApiRootResource.getServletHandler(this)) diff --git a/core/src/main/scala/org/apache/spark/deploy/history/LogPage.scala b/core/src/main/scala/org/apache/spark/deploy/history/LogPage.scala new file mode 100644 index ..72d88e14a122 --- /dev/null +++ b/core/src/main/scala/org/apache/spark/deploy/history/LogPage.scala @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. 
+ * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.deploy.history + +import java.io.File +import javax.servlet.http.HttpServletRequest + +import scala.xml.{Node, Unparsed} + +import org.apache.spark.SparkConf +import org.apache.spark.internal.Logging +import org.apache.spark.ui.{UIUtils, WebUIPage} +import org.apache.spark.util.Utils +import org.apache.spark.util.logging.RollingFileAppender + +private[history] class LogPage(conf: SparkConf) extends WebUIPage("logPage") with Logging { + private val defaultBytes = 100 * 1024 + + def render(request: HttpServletRequest): Seq[Node] = { +val logDir = sys.env.getOrElse("SPARK_LOG_DIR", "logs/") +val logType = request.getParameter("logType") +val offset = Option(request.getParameter("offset")).map(_.toLong) +val byteLength = Option(request.getParameter("byteLength")).map(_.toInt) +
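The `LogPage` diff is cut off here, but pages of this kind typically take `offset` and `byteLength` request parameters, default to showing the tail of the log, and clamp the requested window to the file bounds. A hypothetical sketch of that clamping (the function name and clamping details are assumptions, not the committed `LogPage` code; only the 100 KiB default is visible above):

```python
def log_slice(log_text, offset=None, byte_length=None, default_bytes=100 * 1024):
    # Assumed behavior: with no offset, show the last `byte_length` bytes;
    # clamp start/end so the window never runs past either end of the log.
    total = len(log_text)
    length = byte_length if byte_length is not None else default_bytes
    start = offset if offset is not None else max(total - length, 0)
    start = min(max(start, 0), total)
    end = min(start + length, total)
    return log_text[start:end]
```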
(spark) branch master updated: [SPARK-46902][UI] Fix Spark History Server UI for using un-exported setAppLimit
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1386b52f3eb6 [SPARK-46902][UI] Fix Spark History Server UI for using un-exported setAppLimit 1386b52f3eb6 is described below commit 1386b52f3eb624331345611ef1f6ecc44047f80f Author: Kent Yao AuthorDate: Mon Jan 29 01:26:57 2024 -0800 [SPARK-46902][UI] Fix Spark History Server UI for using un-exported setAppLimit ### What changes were proposed in this pull request? Fix Spark History Server UI for using un-exported `setAppLimit` to render the dataTables of app list close #44930 ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Locally built and tested ![image](https://github.com/apache/spark/assets/8326978/6899b1a2-0232-4f85-9389-e5c18db8d9d3) ### Was this patch authored or co-authored using generative AI tooling? no Closes #44931 from yaooqinn/SPARK-46902. 
Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- .../main/resources/org/apache/spark/ui/static/historypage.js | 2 ++ .../scala/org/apache/spark/deploy/history/HistoryPage.scala | 11 +-- 2 files changed, 11 insertions(+), 2 deletions(-) diff --git a/core/src/main/resources/org/apache/spark/ui/static/historypage.js b/core/src/main/resources/org/apache/spark/ui/static/historypage.js index 08438e6eda61..85cd5a554750 100644 --- a/core/src/main/resources/org/apache/spark/ui/static/historypage.js +++ b/core/src/main/resources/org/apache/spark/ui/static/historypage.js @@ -19,6 +19,8 @@ import {formatDuration, formatTimeMillis} from "./utils.js"; +export {setAppLimit}; + var appLimit = -1; /* eslint-disable no-unused-vars */ diff --git a/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala b/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala index b8f064c68cdd..7ba9b2c54937 100644 --- a/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala +++ b/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala @@ -19,10 +19,11 @@ package org.apache.spark.deploy.history import javax.servlet.http.HttpServletRequest -import scala.xml.Node +import scala.xml.{Node, Unparsed} import org.apache.spark.status.api.v1.ApplicationInfo import org.apache.spark.ui.{UIUtils, WebUIPage} +import org.apache.spark.ui.UIUtils.formatImportJavaScript private[history] class HistoryPage(parent: HistoryServer) extends WebUIPage("") { @@ -63,12 +64,18 @@ private[history] class HistoryPage(parent: HistoryServer) extends WebUIPage("") { if (displayApplications) { + val js = +s""" + |${formatImportJavaScript(request, "/static/historypage.js", "setAppLimit")} + | + |setAppLimit(${parent.maxApplications}); + |""".stripMargin ++ ++ ++ -setAppLimit({parent.maxApplications}) +{Unparsed(js)} } else if (requestedIncomplete) { No incomplete applications found! 
} else if (eventLogsUnderProcessCount > 0) { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46900][BUILD] Upgrade slf4j to 2.0.11
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a368280708dd [SPARK-46900][BUILD] Upgrade slf4j to 2.0.11 a368280708dd is described below commit a368280708dd3c6eb90bd3b09a36a68bdd096222 Author: yangjie01 AuthorDate: Sun Jan 28 23:42:37 2024 -0800 [SPARK-46900][BUILD] Upgrade slf4j to 2.0.11 ### What changes were proposed in this pull request? This PR aims to upgrade slf4j from 2.0.10 to 2.0.11. ### Why are the changes needed? This release reinstates the `renderLevel()` method in SimpleLogger, which was removed by mistake. The full release notes are as follows: - https://www.slf4j.org/news.html#2.0.11 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #44928 from LuciferYang/SPARK-46900. 
Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- dev/deps/spark-deps-hadoop-3-hive-2.3 | 6 +++--- pom.xml | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index 09291de50350..06fb4d879db2 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -123,7 +123,7 @@ javassist/3.29.2-GA//javassist-3.29.2-GA.jar javax.jdo/3.2.0-m3//javax.jdo-3.2.0-m3.jar javolution/5.5.1//javolution-5.5.1.jar jaxb-runtime/2.3.2//jaxb-runtime-2.3.2.jar -jcl-over-slf4j/2.0.10//jcl-over-slf4j-2.0.10.jar +jcl-over-slf4j/2.0.11//jcl-over-slf4j-2.0.11.jar jdo-api/3.0.1//jdo-api-3.0.1.jar jdom2/2.0.6//jdom2-2.0.6.jar jersey-client/2.41//jersey-client-2.41.jar @@ -148,7 +148,7 @@ json4s-jackson_2.13/3.7.0-M11//json4s-jackson_2.13-3.7.0-M11.jar json4s-scalap_2.13/3.7.0-M11//json4s-scalap_2.13-3.7.0-M11.jar jsr305/3.0.0//jsr305-3.0.0.jar jta/1.1//jta-1.1.jar -jul-to-slf4j/2.0.10//jul-to-slf4j-2.0.10.jar +jul-to-slf4j/2.0.11//jul-to-slf4j-2.0.11.jar kryo-shaded/4.0.2//kryo-shaded-4.0.2.jar kubernetes-client-api/6.10.0//kubernetes-client-api-6.10.0.jar kubernetes-client/6.10.0//kubernetes-client-6.10.0.jar @@ -247,7 +247,7 @@ scala-parallel-collections_2.13/1.0.4//scala-parallel-collections_2.13-1.0.4.jar scala-parser-combinators_2.13/2.3.0//scala-parser-combinators_2.13-2.3.0.jar scala-reflect/2.13.12//scala-reflect-2.13.12.jar scala-xml_2.13/2.2.0//scala-xml_2.13-2.2.0.jar -slf4j-api/2.0.10//slf4j-api-2.0.10.jar +slf4j-api/2.0.11//slf4j-api-2.0.11.jar snakeyaml-engine/2.7//snakeyaml-engine-2.7.jar snakeyaml/2.2//snakeyaml-2.2.jar snappy-java/1.1.10.5//snappy-java-1.1.10.5.jar diff --git a/pom.xml b/pom.xml index a5f2b6f74b7a..b78f49499feb 100644 --- a/pom.xml +++ b/pom.xml @@ -119,7 +119,7 @@ 3.1.0 spark 9.6 -2.0.10 +2.0.11 2.22.1 3.3.6 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: 
commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46901][PYTHON] Upgrade `pyarrow` to 15.0.0
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 487cbc086a30 [SPARK-46901][PYTHON] Upgrade `pyarrow` to 15.0.0 487cbc086a30 is described below commit 487cbc086a30ec4d58695336acbe8037a3d5ebe7 Author: Ruifeng Zheng AuthorDate: Sun Jan 28 23:41:49 2024 -0800 [SPARK-46901][PYTHON] Upgrade `pyarrow` to 15.0.0 ### What changes were proposed in this pull request? Upgrade `pyarrow` to 15.0.0 ### Why are the changes needed? to support latest pyarrow ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #44924 from zhengruifeng/py_arrow_15. Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- dev/infra/Dockerfile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dev/infra/Dockerfile b/dev/infra/Dockerfile index 976f94251d7a..fc515d4478ad 100644 --- a/dev/infra/Dockerfile +++ b/dev/infra/Dockerfile @@ -94,7 +94,7 @@ RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3 RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.1.4' scipy coverage matplotlib lxml -ARG BASIC_PIP_PKGS="numpy pyarrow>=14.0.0 six==1.16.0 pandas<=2.1.4 scipy plotly>=4.8 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2" +ARG BASIC_PIP_PKGS="numpy pyarrow>=15.0.0 six==1.16.0 pandas<=2.1.4 scipy plotly>=4.8 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2" # Python deps for Spark Connect ARG CONNECT_PIP_PKGS="grpcio==1.59.3 grpcio-status==1.59.3 protobuf==4.25.1 googleapis-common-protos==1.56.4" - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46899][CORE] Remove `POST` APIs from `MasterWebUI` when `spark.ui.killEnabled` is `false`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 95a4abd5b5bc [SPARK-46899][CORE] Remove `POST` APIs from `MasterWebUI` when `spark.ui.killEnabled` is `false` 95a4abd5b5bc is described below commit 95a4abd5b5bcc36335be9af84b7bbddd7d0034ba Author: Dongjoon Hyun AuthorDate: Sun Jan 28 22:38:32 2024 -0800 [SPARK-46899][CORE] Remove `POST` APIs from `MasterWebUI` when `spark.ui.killEnabled` is `false` ### What changes were proposed in this pull request? This PR aims to remove `POST` APIs from `MasterWebUI` when `spark.ui.killEnabled` is false. ### Why are the changes needed? If `spark.ui.killEnabled` is false, we don't need to attach the `POST`-related redirect or servlet handlers in the first place, because they would be ignored in `MasterPage` anyway. https://github.com/apache/spark/blob/8cd0d1854da04334aff3188e4eca08a48f734579/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala#L64-L65 ### Does this PR introduce _any_ user-facing change? Previously, the user request was silently ignored after redirecting. Now, the server responds with the correct HTTP error code, 405 `Method Not Allowed`. ### How was this patch tested? Pass the CIs with the newly added test suite, `ReadOnlyMasterWebUISuite`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44926 from dongjoon-hyun/SPARK-46899. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../spark/deploy/master/ui/MasterWebUI.scala | 46 ++--- .../spark/deploy/master/ui/MasterWebUISuite.scala | 9 ++- .../master/ui/ReadOnlyMasterWebUISuite.scala | 75 ++ 3 files changed, 105 insertions(+), 25 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala index 3025c0bf468b..14ea6dbb3d20 100644 --- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala +++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala @@ -54,31 +54,33 @@ class MasterWebUI( attachPage(new LogPage(this)) attachPage(masterPage) addStaticHandler(MasterWebUI.STATIC_RESOURCE_DIR) -attachHandler(createRedirectHandler( - "/app/kill", "/", masterPage.handleAppKillRequest, httpMethods = Set("POST"))) -attachHandler(createRedirectHandler( - "/driver/kill", "/", masterPage.handleDriverKillRequest, httpMethods = Set("POST"))) -attachHandler(createServletHandler("/workers/kill", new HttpServlet { - override def doPost(req: HttpServletRequest, resp: HttpServletResponse): Unit = { -val hostnames: Seq[String] = Option(req.getParameterValues("host")) - .getOrElse(Array[String]()).toImmutableArraySeq -if (decommissionDisabled || !isDecommissioningRequestAllowed(req)) { - resp.sendError(HttpServletResponse.SC_METHOD_NOT_ALLOWED) -} else { - val removedWorkers = masterEndpointRef.askSync[Integer]( -DecommissionWorkersOnHosts(hostnames)) - logInfo(s"Decommissioning of hosts $hostnames decommissioned $removedWorkers workers") - if (removedWorkers > 0) { -resp.setStatus(HttpServletResponse.SC_OK) - } else if (removedWorkers == 0) { -resp.sendError(HttpServletResponse.SC_NOT_FOUND) +if (killEnabled) { + attachHandler(createRedirectHandler( +"/app/kill", "/", masterPage.handleAppKillRequest, httpMethods = Set("POST"))) + attachHandler(createRedirectHandler( +"/driver/kill", "/", 
masterPage.handleDriverKillRequest, httpMethods = Set("POST"))) + attachHandler(createServletHandler("/workers/kill", new HttpServlet { +override def doPost(req: HttpServletRequest, resp: HttpServletResponse): Unit = { + val hostnames: Seq[String] = Option(req.getParameterValues("host")) +.getOrElse(Array[String]()).toImmutableArraySeq + if (decommissionDisabled || !isDecommissioningRequestAllowed(req)) { +resp.sendError(HttpServletResponse.SC_METHOD_NOT_ALLOWED) } else { -// We shouldn't even see this case. -resp.setStatus(HttpServletResponse.SC_INTERNAL_SERVER_ERROR) +val removedWorkers = masterEndpointRef.askSync[Integer]( + DecommissionWorkersOnHosts(hostnames)) +logInfo(s"Decommissioning of hosts $hostnames decommissioned $removedWorkers workers") +
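The diff above is truncated, but the core idea of the change survives: when kill is disabled, the mutating `POST` endpoints are simply never registered, so a disabled server can fail the request fast rather than silently redirecting it. A simplified sketch of that registration guard (Python, not the Scala code):

```python
def attach_handlers(kill_enabled: bool):
    # Illustration of SPARK-46899: the POST routes only exist when kill
    # is enabled; with kill disabled, the routing layer itself can answer
    # 405 Method Not Allowed instead of ignoring the request.
    handlers = ["GET /"]
    if kill_enabled:
        handlers += ["POST /app/kill", "POST /driver/kill", "POST /workers/kill"]
    return handlers
```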
(spark) branch master updated: [SPARK-46897][PYTHON][DOCS] Refine docstring of `bit_and/bit_or/bit_xor`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 5056a17919ac [SPARK-46897][PYTHON][DOCS] Refine docstring of `bit_and/bit_or/bit_xor` 5056a17919ac is described below commit 5056a17919ac88d35475dd13ae4167e783f9504a Author: yangjie01 AuthorDate: Sun Jan 28 21:33:39 2024 -0800 [SPARK-46897][PYTHON][DOCS] Refine docstring of `bit_and/bit_or/bit_xor` ### What changes were proposed in this pull request? This pr refine docstring of `bit_and/bit_or/bit_xor` and add some new examples. ### Why are the changes needed? To improve PySpark documentation ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass Github Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #44923 from LuciferYang/SPARK-46897. Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- python/pyspark/sql/functions/builtin.py | 138 ++-- 1 file changed, 132 insertions(+), 6 deletions(-) diff --git a/python/pyspark/sql/functions/builtin.py b/python/pyspark/sql/functions/builtin.py index d3a94fe4b9e9..0932ac1c2843 100644 --- a/python/pyspark/sql/functions/builtin.py +++ b/python/pyspark/sql/functions/builtin.py @@ -3790,9 +3790,51 @@ def bit_and(col: "ColumnOrName") -> Column: Examples +Example 1: Bitwise AND with all non-null values + +>>> from pyspark.sql import functions as sf >>> df = spark.createDataFrame([[1],[1],[2]], ["c"]) ->>> df.select(bit_and("c")).first() -Row(bit_and(c)=0) +>>> df.select(sf.bit_and("c")).show() ++--+ +|bit_and(c)| ++--+ +| 0| ++--+ + +Example 2: Bitwise AND with null values + +>>> from pyspark.sql import functions as sf +>>> df = spark.createDataFrame([[1],[None],[2]], ["c"]) +>>> df.select(sf.bit_and("c")).show() ++--+ +|bit_and(c)| ++--+ +| 0| ++--+ + +Example 3: Bitwise AND with all null values + +>>> from 
pyspark.sql import functions as sf +>>> from pyspark.sql.types import IntegerType, StructType, StructField +>>> schema = StructType([StructField("c", IntegerType(), True)]) +>>> df = spark.createDataFrame([[None],[None],[None]], schema=schema) +>>> df.select(sf.bit_and("c")).show() ++--+ +|bit_and(c)| ++--+ +| NULL| ++--+ + +Example 4: Bitwise AND with single input value + +>>> from pyspark.sql import functions as sf +>>> df = spark.createDataFrame([[5]], ["c"]) +>>> df.select(sf.bit_and("c")).show() ++--+ +|bit_and(c)| ++--+ +| 5| ++--+ """ return _invoke_function_over_columns("bit_and", col) @@ -3816,9 +3858,51 @@ def bit_or(col: "ColumnOrName") -> Column: Examples +Example 1: Bitwise OR with all non-null values + +>>> from pyspark.sql import functions as sf >>> df = spark.createDataFrame([[1],[1],[2]], ["c"]) ->>> df.select(bit_or("c")).first() -Row(bit_or(c)=3) +>>> df.select(sf.bit_or("c")).show() ++-+ +|bit_or(c)| ++-+ +|3| ++-+ + +Example 2: Bitwise OR with some null values + +>>> from pyspark.sql import functions as sf +>>> df = spark.createDataFrame([[1],[None],[2]], ["c"]) +>>> df.select(sf.bit_or("c")).show() ++-+ +|bit_or(c)| ++-+ +|3| ++-+ + +Example 3: Bitwise OR with all null values + +>>> from pyspark.sql import functions as sf +>>> from pyspark.sql.types import IntegerType, StructType, StructField +>>> schema = StructType([StructField("c", IntegerType(), True)]) +>>> df = spark.createDataFrame([[None],[None],[None]], schema=schema) +>>> df.select(sf.bit_or("c")).show() ++-+ +|bit_or(c)| ++-+ +| NULL| ++-+ + +Example 4: Bitwise OR with single input value + +>>> from pyspark.sql import functions as sf +>>> df = spark.createDataFrame([[5]], ["c"]) +>>> df.select(sf.bit_or("c")).show() ++-+ +|bit_or(c)| ++-+
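The refined docstrings above illustrate how `bit_and`/`bit_or` handle NULLs: NULL inputs are skipped, and the result is NULL only when every input is NULL. As a plain-Python sketch of those aggregate semantics (this is not PySpark's implementation — the real functions delegate to Spark SQL on the JVM — just a model of the behavior the new examples document):

```python
from functools import reduce
import operator

def bit_and(values):
    """Bitwise-AND aggregate with SQL-style NULL handling: None inputs
    are skipped; the result is None only when every input is None."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None
    return reduce(operator.and_, non_null)

def bit_or(values):
    """Bitwise-OR aggregate with the same NULL handling as bit_and."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None
    return reduce(operator.or_, non_null)

# Mirrors Examples 1-4 from the refined docstrings:
print(bit_and([1, 1, 2]))          # 0  (1 & 1 & 2)
print(bit_and([1, None, 2]))       # 0  (None is skipped)
print(bit_and([None, None, None])) # None (all inputs NULL)
print(bit_or([1, None, 2]))        # 3  (1 | 2)
```

The single-value case behaves the same way: `bit_and([5])` returns `5`, matching Example 4 in the docstring.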
(spark) branch master updated (f078998df2f3 -> bb2195554e6d)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from f078998df2f3 [MINOR][DOCS] Miscellaneous documentation improvements add bb2195554e6d [SPARK-46874][PYTHON] Remove `pyspark.pandas` dependency from `assertDataFrameEqual` No new revisions were added by this update. Summary of changes: python/pyspark/testing/utils.py | 15 --- 1 file changed, 12 insertions(+), 3 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46892][BUILD] Upgrade dropwizard metrics 4.2.25
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d74aecd11dcd [SPARK-46892][BUILD] Upgrade dropwizard metrics 4.2.25 d74aecd11dcd is described below commit d74aecd11dcd1c8414b662457e49b6001395bb8d Author: panbingkun AuthorDate: Sun Jan 28 12:12:02 2024 -0800 [SPARK-46892][BUILD] Upgrade dropwizard metrics 4.2.25 ### What changes were proposed in this pull request? This PR aims to upgrade dropwizard metrics from `4.2.21` to `4.2.25`. ### Why are the changes needed? The last update occurred 3 months ago. - The new version brings some bug fixes: Fix IndexOutOfBoundsException in Jetty 9, 10, 11, 12 InstrumentedHandler https://github.com/dropwizard/metrics/pull/3912 - The full version release notes: https://github.com/dropwizard/metrics/releases/tag/v4.2.25 https://github.com/dropwizard/metrics/releases/tag/v4.2.24 https://github.com/dropwizard/metrics/releases/tag/v4.2.23 https://github.com/dropwizard/metrics/releases/tag/v4.2.22 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44918 from panbingkun/SPARK-46892.
Authored-by: panbingkun Signed-off-by: Dongjoon Hyun --- dev/deps/spark-deps-hadoop-3-hive-2.3 | 10 +- pom.xml | 2 +- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index 71f9ac8665b0..09291de50350 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -185,11 +185,11 @@ log4j-core/2.22.1//log4j-core-2.22.1.jar log4j-slf4j2-impl/2.22.1//log4j-slf4j2-impl-2.22.1.jar logging-interceptor/3.12.12//logging-interceptor-3.12.12.jar lz4-java/1.8.0//lz4-java-1.8.0.jar -metrics-core/4.2.21//metrics-core-4.2.21.jar -metrics-graphite/4.2.21//metrics-graphite-4.2.21.jar -metrics-jmx/4.2.21//metrics-jmx-4.2.21.jar -metrics-json/4.2.21//metrics-json-4.2.21.jar -metrics-jvm/4.2.21//metrics-jvm-4.2.21.jar +metrics-core/4.2.25//metrics-core-4.2.25.jar +metrics-graphite/4.2.25//metrics-graphite-4.2.25.jar +metrics-jmx/4.2.25//metrics-jmx-4.2.25.jar +metrics-json/4.2.25//metrics-json-4.2.25.jar +metrics-jvm/4.2.25//metrics-jvm-4.2.25.jar minlog/1.3.0//minlog-1.3.0.jar netty-all/4.1.106.Final//netty-all-4.1.106.Final.jar netty-buffer/4.1.106.Final//netty-buffer-4.1.106.Final.jar diff --git a/pom.xml b/pom.xml index d4e8a7db71de..a5f2b6f74b7a 100644 --- a/pom.xml +++ b/pom.xml @@ -156,7 +156,7 @@ If you change codahale.metrics.version, you also need to change the link to metrics.dropwizard.io in docs/monitoring.md. --> -4.2.21 +4.2.25 1.11.3 1.12.0 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.4 updated: [SPARK-46888][CORE] Fix `Master` to reject `/workers/kill/` requests if decommission is disabled
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 51b021fdf915 [SPARK-46888][CORE] Fix `Master` to reject `/workers/kill/` requests if decommission is disabled 51b021fdf915 is described below commit 51b021fdf915d4aab62056ee60e4098047bd9841 Author: Dongjoon Hyun AuthorDate: Sat Jan 27 20:24:15 2024 -0800 [SPARK-46888][CORE] Fix `Master` to reject `/workers/kill/` requests if decommission is disabled This PR aims to fix `Master` to reject `/workers/kill/` request if `spark.decommission.enabled` is `false` in order to fix the dangling worker issue. Currently, `spark.decommission.enabled` is `false` by default. So, when a user asks to decommission, only Master marked it `DECOMMISSIONED` while the worker is alive. ``` $ curl -XPOST http://localhost:8080/workers/kill/\?host\=127.0.0.1 ``` **Master UI** ![Screenshot 2024-01-27 at 6 19 18 PM](https://github.com/apache/spark/assets/9700541/443bfc32-b924-438a-8bf6-c64b9afbc4be) **Worker Log** ``` 24/01/27 18:18:06 WARN Worker: Receive decommission request, but decommission feature is disabled. ``` To be consistent with the existing `Worker` behavior which ignores the request. https://github.com/apache/spark/blob/1787a5261e87e0214a3f803f6534c5e52a0138e6/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L859-L868 No, this is a bug fix. Pass the CI with the newly added test case. No. Closes #44915 from dongjoon-hyun/SPARK-46888. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit 20b593811dc02c96c71978851e051d32bf8c3496) Signed-off-by: Dongjoon Hyun --- .../apache/spark/deploy/master/ui/MasterWebUI.scala | 4 +++- .../apache/spark/deploy/master/MasterSuite.scala| 21 + .../spark/deploy/master/ui/MasterWebUISuite.scala | 3 ++- 3 files changed, 26 insertions(+), 2 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala index af94bd6d9e0f..53e5c5ac2a8f 100644 --- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala +++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala @@ -23,6 +23,7 @@ import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse} import org.apache.spark.deploy.DeployMessages.{DecommissionWorkersOnHosts, MasterStateResponse, RequestMasterState} import org.apache.spark.deploy.master.Master import org.apache.spark.internal.Logging +import org.apache.spark.internal.config.DECOMMISSION_ENABLED import org.apache.spark.internal.config.UI.MASTER_UI_DECOMMISSION_ALLOW_MODE import org.apache.spark.internal.config.UI.UI_KILL_ENABLED import org.apache.spark.ui.{SparkUI, WebUI} @@ -40,6 +41,7 @@ class MasterWebUI( val masterEndpointRef = master.self val killEnabled = master.conf.get(UI_KILL_ENABLED) + val decommissionDisabled = !master.conf.get(DECOMMISSION_ENABLED) val decommissionAllowMode = master.conf.get(MASTER_UI_DECOMMISSION_ALLOW_MODE) initialize() @@ -58,7 +60,7 @@ class MasterWebUI( override def doPost(req: HttpServletRequest, resp: HttpServletResponse): Unit = { val hostnames: Seq[String] = Option(req.getParameterValues("host")) .getOrElse(Array[String]()).toSeq -if (!isDecommissioningRequestAllowed(req)) { +if (decommissionDisabled || !isDecommissioningRequestAllowed(req)) { resp.sendError(HttpServletResponse.SC_METHOD_NOT_ALLOWED) } else { val removedWorkers = 
masterEndpointRef.askSync[Integer]( diff --git a/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala b/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala index 1cec863b1e7f..37874de98766 100644 --- a/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala @@ -17,6 +17,7 @@ package org.apache.spark.deploy.master +import java.net.{HttpURLConnection, URL} import java.util.Date import java.util.concurrent.{ConcurrentLinkedQueue, CountDownLatch, TimeUnit} import java.util.concurrent.atomic.AtomicInteger @@ -325,6 +326,26 @@ class MasterSuite extends SparkFunSuite } } + test("SPARK-46888: master should reject worker kill request if decommision is disabled") { +implicit val formats = org.json4s.DefaultFormats +val conf = new SparkConf() + .set(DECOMMISSION_ENABLED, false) + .set(MASTER_UI_DECOMMISSION_ALLOW_MODE, "ALLOW") +val localCluster = LocalSparkCluster(1, 1, 512, conf) +localCluster.s
(spark) branch branch-3.5 updated: [SPARK-46888][CORE] Fix `Master` to reject `/workers/kill/` requests if decommission is disabled
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new accfb39e4ddf [SPARK-46888][CORE] Fix `Master` to reject `/workers/kill/` requests if decommission is disabled accfb39e4ddf is described below commit accfb39e4ddf7f7b54396bd0e35256a04461c693 Author: Dongjoon Hyun AuthorDate: Sat Jan 27 20:24:15 2024 -0800 [SPARK-46888][CORE] Fix `Master` to reject `/workers/kill/` requests if decommission is disabled This PR aims to fix `Master` to reject `/workers/kill/` request if `spark.decommission.enabled` is `false` in order to fix the dangling worker issue. Currently, `spark.decommission.enabled` is `false` by default. So, when a user asks to decommission, only Master marked it `DECOMMISSIONED` while the worker is alive. ``` $ curl -XPOST http://localhost:8080/workers/kill/\?host\=127.0.0.1 ``` **Master UI** ![Screenshot 2024-01-27 at 6 19 18 PM](https://github.com/apache/spark/assets/9700541/443bfc32-b924-438a-8bf6-c64b9afbc4be) **Worker Log** ``` 24/01/27 18:18:06 WARN Worker: Receive decommission request, but decommission feature is disabled. ``` To be consistent with the existing `Worker` behavior which ignores the request. https://github.com/apache/spark/blob/1787a5261e87e0214a3f803f6534c5e52a0138e6/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L859-L868 No, this is a bug fix. Pass the CI with the newly added test case. No. Closes #44915 from dongjoon-hyun/SPARK-46888. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit 20b593811dc02c96c71978851e051d32bf8c3496) Signed-off-by: Dongjoon Hyun --- .../apache/spark/deploy/master/ui/MasterWebUI.scala | 4 +++- .../apache/spark/deploy/master/MasterSuite.scala| 21 + .../spark/deploy/master/ui/MasterWebUISuite.scala | 3 ++- 3 files changed, 26 insertions(+), 2 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala index af94bd6d9e0f..53e5c5ac2a8f 100644 --- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala +++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala @@ -23,6 +23,7 @@ import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse} import org.apache.spark.deploy.DeployMessages.{DecommissionWorkersOnHosts, MasterStateResponse, RequestMasterState} import org.apache.spark.deploy.master.Master import org.apache.spark.internal.Logging +import org.apache.spark.internal.config.DECOMMISSION_ENABLED import org.apache.spark.internal.config.UI.MASTER_UI_DECOMMISSION_ALLOW_MODE import org.apache.spark.internal.config.UI.UI_KILL_ENABLED import org.apache.spark.ui.{SparkUI, WebUI} @@ -40,6 +41,7 @@ class MasterWebUI( val masterEndpointRef = master.self val killEnabled = master.conf.get(UI_KILL_ENABLED) + val decommissionDisabled = !master.conf.get(DECOMMISSION_ENABLED) val decommissionAllowMode = master.conf.get(MASTER_UI_DECOMMISSION_ALLOW_MODE) initialize() @@ -58,7 +60,7 @@ class MasterWebUI( override def doPost(req: HttpServletRequest, resp: HttpServletResponse): Unit = { val hostnames: Seq[String] = Option(req.getParameterValues("host")) .getOrElse(Array[String]()).toSeq -if (!isDecommissioningRequestAllowed(req)) { +if (decommissionDisabled || !isDecommissioningRequestAllowed(req)) { resp.sendError(HttpServletResponse.SC_METHOD_NOT_ALLOWED) } else { val removedWorkers = 
masterEndpointRef.askSync[Integer]( diff --git a/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala b/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala index 1cec863b1e7f..37874de98766 100644 --- a/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala @@ -17,6 +17,7 @@ package org.apache.spark.deploy.master +import java.net.{HttpURLConnection, URL} import java.util.Date import java.util.concurrent.{ConcurrentLinkedQueue, CountDownLatch, TimeUnit} import java.util.concurrent.atomic.AtomicInteger @@ -325,6 +326,26 @@ class MasterSuite extends SparkFunSuite } } + test("SPARK-46888: master should reject worker kill request if decommision is disabled") { +implicit val formats = org.json4s.DefaultFormats +val conf = new SparkConf() + .set(DECOMMISSION_ENABLED, false) + .set(MASTER_UI_DECOMMISSION_ALLOW_MODE, "ALLOW") +val localCluster = LocalSparkCluster(1, 1, 512, conf) +localCluster.s
(spark) branch master updated: [SPARK-46888][CORE] Fix `Master` to reject `/workers/kill/` requests if decommission is disabled
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 20b593811dc0 [SPARK-46888][CORE] Fix `Master` to reject `/workers/kill/` requests if decommission is disabled 20b593811dc0 is described below commit 20b593811dc02c96c71978851e051d32bf8c3496 Author: Dongjoon Hyun AuthorDate: Sat Jan 27 20:24:15 2024 -0800 [SPARK-46888][CORE] Fix `Master` to reject `/workers/kill/` requests if decommission is disabled ### What changes were proposed in this pull request? This PR aims to fix `Master` to reject `/workers/kill/` request if `spark.decommission.enabled` is `false` in order to fix the dangling worker issue. ### Why are the changes needed? Currently, `spark.decommission.enabled` is `false` by default. So, when a user asks to decommission, only Master marked it `DECOMMISSIONED` while the worker is alive. ``` $ curl -XPOST http://localhost:8080/workers/kill/\?host\=127.0.0.1 ``` **Master UI** ![Screenshot 2024-01-27 at 6 19 18 PM](https://github.com/apache/spark/assets/9700541/443bfc32-b924-438a-8bf6-c64b9afbc4be) **Worker Log** ``` 24/01/27 18:18:06 WARN Worker: Receive decommission request, but decommission feature is disabled. ``` To be consistent with the existing `Worker` behavior which ignores the request. https://github.com/apache/spark/blob/1787a5261e87e0214a3f803f6534c5e52a0138e6/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L859-L868 ### Does this PR introduce _any_ user-facing change? No, this is a bug fix. ### How was this patch tested? Pass the CI with the newly added test case. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44915 from dongjoon-hyun/SPARK-46888. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../apache/spark/deploy/master/ui/MasterWebUI.scala | 4 +++- .../apache/spark/deploy/master/MasterSuite.scala| 21 + .../spark/deploy/master/ui/MasterWebUISuite.scala | 3 ++- 3 files changed, 26 insertions(+), 2 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala index d71ef8b9e36e..3025c0bf468b 100644 --- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala +++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala @@ -23,6 +23,7 @@ import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse} import org.apache.spark.deploy.DeployMessages.{DecommissionWorkersOnHosts, MasterStateResponse, RequestMasterState} import org.apache.spark.deploy.master.Master import org.apache.spark.internal.Logging +import org.apache.spark.internal.config.DECOMMISSION_ENABLED import org.apache.spark.internal.config.UI.MASTER_UI_DECOMMISSION_ALLOW_MODE import org.apache.spark.internal.config.UI.UI_KILL_ENABLED import org.apache.spark.ui.{SparkUI, WebUI} @@ -41,6 +42,7 @@ class MasterWebUI( val masterEndpointRef = master.self val killEnabled = master.conf.get(UI_KILL_ENABLED) + val decommissionDisabled = !master.conf.get(DECOMMISSION_ENABLED) val decommissionAllowMode = master.conf.get(MASTER_UI_DECOMMISSION_ALLOW_MODE) initialize() @@ -60,7 +62,7 @@ class MasterWebUI( override def doPost(req: HttpServletRequest, resp: HttpServletResponse): Unit = { val hostnames: Seq[String] = Option(req.getParameterValues("host")) .getOrElse(Array[String]()).toImmutableArraySeq -if (!isDecommissioningRequestAllowed(req)) { +if (decommissionDisabled || !isDecommissioningRequestAllowed(req)) { resp.sendError(HttpServletResponse.SC_METHOD_NOT_ALLOWED) } else { val removedWorkers = masterEndpointRef.askSync[Integer]( diff --git 
a/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala b/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala index 6966a7f660b2..0db58ae0c834 100644 --- a/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala @@ -17,6 +17,7 @@ package org.apache.spark.deploy.master +import java.net.{HttpURLConnection, URL} import java.util.Date import java.util.concurrent.{ConcurrentLinkedQueue, CountDownLatch, TimeUnit} import java.util.concurrent.atomic.AtomicInteger @@ -444,6 +445,26 @@ class MasterSuite extends SparkFunSuite } } + test("SPARK-46888: master should reject worker kill request if decommision is disabled") { +implicit val formats = org.json4s.DefaultFormats +val conf = new Spa
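The one-line change to `MasterWebUI.doPost` above adds a guard so that a `POST /workers/kill/` request is rejected with HTTP 405 whenever decommissioning is disabled, instead of only when the allow-mode check fails. A minimal Python sketch of that decision logic (function and parameter names here are illustrative, not Spark's Scala API):

```python
METHOD_NOT_ALLOWED = 405  # HttpServletResponse.SC_METHOD_NOT_ALLOWED
OK = 200

def handle_worker_kill(decommission_enabled, request_allowed):
    """Sketch of the patched guard: reject when decommissioning is
    disabled OR when the decommission allow-mode check fails; only
    then ask the Master to decommission the named worker hosts."""
    if (not decommission_enabled) or (not request_allowed):
        return METHOD_NOT_ALLOWED
    return OK  # proceed with DecommissionWorkersOnHosts

# Before the fix the first condition was missing, so a Master running
# with spark.decommission.enabled=false would mark the worker
# DECOMMISSIONED while the Worker itself ignored the request -- the
# dangling-worker bug this commit fixes.
print(handle_worker_kill(False, True))  # 405
print(handle_worker_kill(True, True))   # 200
```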
(spark) branch master updated: [SPARK-46883][CORE][FOLLOWUP] Fix `clusterutilization` API to handle 0 worker case
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a96f399d5094 [SPARK-46883][CORE][FOLLOWUP] Fix `clusterutilization` API to handle 0 worker case a96f399d5094 is described below commit a96f399d5094a3473ffc0e55390105d013a3d22f Author: Dongjoon Hyun AuthorDate: Sat Jan 27 19:02:37 2024 -0800 [SPARK-46883][CORE][FOLLOWUP] Fix `clusterutilization` API to handle 0 worker case ### What changes were proposed in this pull request? This PR is a follow-up of #44908 to fix `clusterutilization` API to handle 0 worker case. ### Why are the changes needed? To fix `ArithmeticException` ``` $ curl http://localhost:8080/json/clusterutilization Error 500 java.lang.ArithmeticException: / by zero HTTP ERROR 500 java.lang.ArithmeticException: / by zero URI:/json/clusterutilization ``` ### Does this PR introduce _any_ user-facing change? No, this feature and bug is not released yet. ### How was this patch tested? Pass the CIs with the newly added test case. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44914 from dongjoon-hyun/SPARK-46883-2. 
Lead-authored-by: Dongjoon Hyun Co-authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/deploy/JsonProtocol.scala | 4 ++-- .../apache/spark/deploy/JsonProtocolSuite.scala| 24 +- 2 files changed, 25 insertions(+), 3 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/JsonProtocol.scala b/core/src/main/scala/org/apache/spark/deploy/JsonProtocol.scala index 9c73e84f4166..04302c77a398 100644 --- a/core/src/main/scala/org/apache/spark/deploy/JsonProtocol.scala +++ b/core/src/main/scala/org/apache/spark/deploy/JsonProtocol.scala @@ -312,9 +312,9 @@ private[deploy] object JsonProtocol { ("waitingDrivers" -> obj.activeDrivers.count(_.state == DriverState.SUBMITTED)) ~ ("cores" -> cores) ~ ("coresused" -> coresUsed) ~ -("coresutilization" -> 100 * coresUsed / cores) ~ +("coresutilization" -> (if (cores == 0) 100 else 100 * coresUsed / cores)) ~ ("memory" -> memory) ~ ("memoryused" -> memoryUsed) ~ -("memoryutilization" -> 100 * memoryUsed / memory) +("memoryutilization" -> (if (memory == 0) 100 else 100 * memoryUsed / memory)) } } diff --git a/core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala b/core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala index 6fca31234ee2..518a8c8b3d05 100644 --- a/core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala @@ -25,7 +25,7 @@ import org.json4s.jackson.JsonMethods import org.apache.spark.{JsonTestUtils, SparkFunSuite} import org.apache.spark.deploy.DeployMessages.{MasterStateResponse, WorkerStateResponse} -import org.apache.spark.deploy.master.{ApplicationInfo, RecoveryState} +import org.apache.spark.deploy.master.{ApplicationInfo, RecoveryState, WorkerInfo} import org.apache.spark.deploy.worker.ExecutorRunner class JsonProtocolSuite extends SparkFunSuite with JsonTestUtils { @@ -119,6 +119,21 @@ class JsonProtocolSuite extends SparkFunSuite with JsonTestUtils { 
assertValidDataInJson(output, JsonMethods.parse(JsonConstants.clusterUtilizationJsonStr)) } + test("SPARK-46883: writeClusterUtilization without workers") { +val workers = Array.empty[WorkerInfo] +val activeApps = Array(createAppInfo()) +val completedApps = Array.empty[ApplicationInfo] +val activeDrivers = Array(createDriverInfo()) +val completedDrivers = Array(createDriverInfo()) +val stateResponse = new MasterStateResponse( + "host", 8080, None, workers, activeApps, completedApps, + activeDrivers, completedDrivers, RecoveryState.ALIVE) +val output = JsonProtocol.writeClusterUtilization(stateResponse) +assertValidJson(output) +assertValidDataInJson(output, + JsonMethods.parse(JsonConstants.clusterUtilizationWithoutWorkersJsonStr)) + } + def assertValidJson(json: JValue): Unit = { try { JsonMethods.parse(JsonMethods.compact(json)) @@ -227,4 +242,11 @@ object JsonConstants { |"cores":8,"coresused":0,"coresutilization":0, |"memory":2468,"memoryused":0,"memoryutilization":0} """.stripMargin + + val clusterUtilizationWithoutWorkersJsonStr = +""" + |{"waitingDrivers":1, + |"cores":0,"
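The follow-up above guards the two percentage computations in `JsonProtocol.writeClusterUtilization` so that a cluster with zero workers (hence `cores == 0` and `memory == 0`) reports 100% utilization instead of throwing `ArithmeticException`. A Python stand-in for the patched Scala arithmetic (integer division matches Scala's `/` on `Int`):

```python
def cluster_utilization(cores, cores_used, memory, memory_used, waiting_drivers=0):
    """Mirror of the patched writeClusterUtilization fields: when there
    are no workers, utilization is defined as 100 rather than dividing
    by zero."""
    return {
        "waitingDrivers": waiting_drivers,
        "cores": cores,
        "coresused": cores_used,
        "coresutilization": 100 if cores == 0 else 100 * cores_used // cores,
        "memory": memory,
        "memoryused": memory_used,
        "memoryutilization": 100 if memory == 0 else 100 * memory_used // memory,
    }

# Zero-worker case that previously raised "/ by zero":
print(cluster_utilization(0, 0, 0, 0, waiting_drivers=1))
# Values matching the suite's clusterUtilizationJsonStr constant:
print(cluster_utilization(8, 0, 2468, 0))
```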
(spark) branch master updated: [SPARK-46887][DOCS] Document a few missed `spark.ui.*` configs to `Configuration` page
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1787a5261e87 [SPARK-46887][DOCS] Document a few missed `spark.ui.*` configs to `Configuration` page 1787a5261e87 is described below commit 1787a5261e87e0214a3f803f6534c5e52a0138e6 Author: Dongjoon Hyun AuthorDate: Sat Jan 27 16:48:26 2024 -0800 [SPARK-46887][DOCS] Document a few missed `spark.ui.*` configs to `Configuration` page ### What changes were proposed in this pull request? This PR aims to document a few missed `spark.ui.*` configurations for Apache Spark 4. This PR focuses only public configurations and excludes `internal` configuration like `spark.ui.jettyStopTimeout`. ### Why are the changes needed? To improve documentations. After this PR, I verified the following configurations are documented at least once in `Configuration` or `Security` page. ``` $ git grep 'ConfigBuilder("spark.ui.' 
core/src/main/scala/org/apache/spark/internal/config/Status.scala: val LIVE_ENTITY_UPDATE_PERIOD = ConfigBuilder("spark.ui.liveUpdate.period") core/src/main/scala/org/apache/spark/internal/config/Status.scala: val LIVE_ENTITY_UPDATE_MIN_FLUSH_PERIOD = ConfigBuilder("spark.ui.liveUpdate.minFlushPeriod") core/src/main/scala/org/apache/spark/internal/config/Status.scala: val MAX_RETAINED_JOBS = ConfigBuilder("spark.ui.retainedJobs") core/src/main/scala/org/apache/spark/internal/config/Status.scala: val MAX_RETAINED_STAGES = ConfigBuilder("spark.ui.retainedStages") core/src/main/scala/org/apache/spark/internal/config/Status.scala: val MAX_RETAINED_TASKS_PER_STAGE = ConfigBuilder("spark.ui.retainedTasks") core/src/main/scala/org/apache/spark/internal/config/Status.scala: val MAX_RETAINED_DEAD_EXECUTORS = ConfigBuilder("spark.ui.retainedDeadExecutors") core/src/main/scala/org/apache/spark/internal/config/Status.scala: val MAX_RETAINED_ROOT_NODES = ConfigBuilder("spark.ui.dagGraph.retainedRootRDDs") core/src/main/scala/org/apache/spark/internal/config/Status.scala: val LIVE_UI_LOCAL_STORE_DIR = ConfigBuilder("spark.ui.store.path") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_SHOW_CONSOLE_PROGRESS = ConfigBuilder("spark.ui.showConsoleProgress") core/src/main/scala/org/apache/spark/internal/config/UI.scala: ConfigBuilder("spark.ui.consoleProgress.update.interval") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_ENABLED = ConfigBuilder("spark.ui.enabled") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_PORT = ConfigBuilder("spark.ui.port") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_FILTERS = ConfigBuilder("spark.ui.filters") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_ALLOW_FRAMING_FROM = ConfigBuilder("spark.ui.allowFramingFrom") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_REVERSE_PROXY = ConfigBuilder("spark.ui.reverseProxy") 
core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_REVERSE_PROXY_URL = ConfigBuilder("spark.ui.reverseProxyUrl") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_KILL_ENABLED = ConfigBuilder("spark.ui.killEnabled") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_THREAD_DUMPS_ENABLED = ConfigBuilder("spark.ui.threadDumpsEnabled") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_FLAMEGRAPH_ENABLED = ConfigBuilder("spark.ui.threadDump.flamegraphEnabled") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_HEAP_HISTOGRAM_ENABLED = ConfigBuilder("spark.ui.heapHistogramEnabled") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_PROMETHEUS_ENABLED = ConfigBuilder("spark.ui.prometheus.enabled") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_X_XSS_PROTECTION = ConfigBuilder("spark.ui.xXssProtection") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_X_CONTENT_TYPE_OPTIONS = ConfigBuilder("spark.ui.xContentTypeOptions.enabled") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_STRICT_TRANSPORT_SECURITY = ConfigBuilder("spark.ui.strictTransportSecurity") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_REQUEST_HEADER_SIZE = ConfigBuilder("spark.ui.requestHeaderSize") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val
(spark) branch master updated: [SPARK-46886][CORE] Enable `spark.ui.prometheus.enabled` by default
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c8d116bcfde9 [SPARK-46886][CORE] Enable `spark.ui.prometheus.enabled` by default c8d116bcfde9 is described below commit c8d116bcfde938a9e4b33ace1e8257c2798000f4 Author: Dongjoon Hyun AuthorDate: Sat Jan 27 15:15:26 2024 -0800 [SPARK-46886][CORE] Enable `spark.ui.prometheus.enabled` by default ### What changes were proposed in this pull request? `spark.ui.prometheus.enabled` has been used since Apache Spark 3.0.0. - https://github.com/apache/spark/pull/25770 This PR aims to enable `spark.ui.prometheus.enabled` by default like the Driver `JSON` API in Apache Spark 4.0.0.

| | JSON End Point | Prometheus End Point |
| --- | --- | --- |
| Driver | /api/v1/applications/{id}/executors/ | /metrics/executors/prometheus/ |

### Why are the changes needed? **BEFORE** ``` $ bin/spark-shell $ curl -s http://localhost:4040/metrics/executors/prometheus | wc -l 0 ``` **AFTER** ``` $ bin/spark-shell $ curl -s http://localhost:4040/metrics/executors/prometheus | wc -l 20 ``` ### Does this PR introduce _any_ user-facing change? No, this is only a new endpoint. ### How was this patch tested? Pass the CIs and do manual test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44912 from dongjoon-hyun/SPARK-46886.
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/internal/config/UI.scala | 2 +- .../main/scala/org/apache/spark/status/api/v1/PrometheusResource.scala | 2 +- docs/monitoring.md | 1 - 3 files changed, 2 insertions(+), 3 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/internal/config/UI.scala b/core/src/main/scala/org/apache/spark/internal/config/UI.scala index 320808d5018c..086c83552732 100644 --- a/core/src/main/scala/org/apache/spark/internal/config/UI.scala +++ b/core/src/main/scala/org/apache/spark/internal/config/UI.scala @@ -114,7 +114,7 @@ private[spark] object UI { "For master/worker/driver metrics, you need to configure `conf/metrics.properties`.") .version("3.0.0") .booleanConf -.createWithDefault(false) +.createWithDefault(true) val UI_X_XSS_PROTECTION = ConfigBuilder("spark.ui.xXssProtection") .doc("Value for HTTP X-XSS-Protection response header") diff --git a/core/src/main/scala/org/apache/spark/status/api/v1/PrometheusResource.scala b/core/src/main/scala/org/apache/spark/status/api/v1/PrometheusResource.scala index 8cfed4a4bd39..c4e3bdc64ee3 100644 --- a/core/src/main/scala/org/apache/spark/status/api/v1/PrometheusResource.scala +++ b/core/src/main/scala/org/apache/spark/status/api/v1/PrometheusResource.scala @@ -31,7 +31,7 @@ import org.apache.spark.ui.SparkUI * :: Experimental :: * This aims to expose Executor metrics like REST API which is documented in * - *https://spark.apache.org/docs/3.0.0/monitoring.html#executor-metrics + *https://spark.apache.org/docs/latest/monitoring.html#executor-metrics * * Note that this is based on ExecutorSummary which is different from ExecutorSource. 
*/
diff --git a/docs/monitoring.md b/docs/monitoring.md
index 056543deb094..8d3dbe375b82 100644
--- a/docs/monitoring.md
+++ b/docs/monitoring.md
@@ -821,7 +821,6 @@ A list of the available metrics, with a short description:
Executor-level metrics are sent from each executor to the driver as part of the Heartbeat to describe the performance metrics of Executor itself like JVM heap memory, GC information. Executor metric values and their measured memory peak values per executor are exposed via the REST API in JSON format and in Prometheus format. The JSON end point is exposed at: `/applications/[app-id]/executors`, and the Prometheus endpoint at: `/metrics/executors/prometheus`.
-The Prometheus endpoint is conditional to a configuration parameter: `spark.ui.prometheus.enabled=true` (the default is `false`).
In addition, aggregated per-stage peak values of the executor memory metrics are written to the event log if `spark.eventLog.logStageExecutorMetrics` is true. Executor memory metrics are also exposed via the Spark metrics system based on the [Dropwizard metrics library](https://metrics.dropwizard.io/4.2.0).

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
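The AFTER example above shows the `/metrics/executors/prometheus` endpoint returning 20 lines of Prometheus text-format output. As an illustration of what a consumer of that output has to do, here is a minimal, hypothetical Python parser for the Prometheus exposition format; the sample payload and metric names below are invented for the example, not captured from a real driver.

```python
# Minimal sketch: parse Prometheus text-format lines, such as those served by
# /metrics/executors/prometheus, into (metric_name, labels, value) tuples.
def parse_prometheus(text):
    metrics = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank and HELP/TYPE comment lines
            continue
        name_part, value = line.rsplit(" ", 1)
        if "{" in name_part:
            name, labels = name_part.split("{", 1)
            labels = labels.rstrip("}")
        else:
            name, labels = name_part, ""
        metrics.append((name, labels, float(value)))
    return metrics

# Hypothetical sample payload in the exposition format.
sample = """\
# HELP metrics_executor_rddBlocks
metrics_executor_rddBlocks{application_id="app-1", executor_id="driver"} 0
metrics_executor_memoryUsed_bytes{application_id="app-1", executor_id="driver"} 1024
"""

for name, labels, value in parse_prometheus(sample):
    print(name, value)
```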
(spark) branch master updated: [SPARK-46883][CORE] Support `/json/clusterutilization` API
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new f75c7a7b5240 [SPARK-46883][CORE] Support `/json/clusterutilization` API
f75c7a7b5240 is described below

commit f75c7a7b52402e4c8faa39b2f88623e9f0bca916
Author: Dongjoon Hyun
AuthorDate: Sat Jan 27 09:21:17 2024 -0800

[SPARK-46883][CORE] Support `/json/clusterutilization` API

### What changes were proposed in this pull request?

This PR aims to support a new `/json/clusterutilization` API in the `Master` JSON endpoint.

### Why are the changes needed?

The user can get CPU/Memory/Waiting apps in a single API call.

```
# Start Spark Cluster and Spark Shell
$ sbin/start-master.sh
$ sbin/start-worker.sh spark://$(hostname):7077
$ bin/spark-shell --master spark://$(hostname):7077

# Check `Cluster Utilization API`
$ curl http://localhost:8080/json/clusterutilization
{
  "waitingDrivers" : 0,
  "cores" : 10,
  "coresused" : 10,
  "coresutilization" : 100,
  "memory" : 31744,
  "memoryused" : 1024,
  "memoryutilization" : 3
}
```

### Does this PR introduce _any_ user-facing change?

No. This is a newly added API.

### How was this patch tested?

Pass the CIs with the newly added test case.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44908 from dongjoon-hyun/SPARK-46883.
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/deploy/JsonProtocol.scala | 18 ++ .../apache/spark/deploy/master/ui/MasterPage.scala | 2 ++ .../org/apache/spark/deploy/JsonProtocolSuite.scala | 21 + 3 files changed, 41 insertions(+) diff --git a/core/src/main/scala/org/apache/spark/deploy/JsonProtocol.scala b/core/src/main/scala/org/apache/spark/deploy/JsonProtocol.scala index 8c356081b277..9c73e84f4166 100644 --- a/core/src/main/scala/org/apache/spark/deploy/JsonProtocol.scala +++ b/core/src/main/scala/org/apache/spark/deploy/JsonProtocol.scala @@ -299,4 +299,22 @@ private[deploy] object JsonProtocol { ("executors" -> obj.executors.map(writeExecutorRunner)) ~ ("finishedexecutors" -> obj.finishedExecutors.map(writeExecutorRunner)) } + + /** + * Export the cluster utilization based on the [[MasterStateResponse]] to a Json object. + */ + def writeClusterUtilization(obj: MasterStateResponse): JObject = { +val aliveWorkers = obj.workers.filter(_.isAlive()) +val cores = aliveWorkers.map(_.cores).sum +val coresUsed = aliveWorkers.map(_.coresUsed).sum +val memory = aliveWorkers.map(_.memory).sum +val memoryUsed = aliveWorkers.map(_.memoryUsed).sum +("waitingDrivers" -> obj.activeDrivers.count(_.state == DriverState.SUBMITTED)) ~ +("cores" -> cores) ~ +("coresused" -> coresUsed) ~ +("coresutilization" -> 100 * coresUsed / cores) ~ +("memory" -> memory) ~ +("memoryused" -> memoryUsed) ~ +("memoryutilization" -> 100 * memoryUsed / memory) + } } diff --git a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala index 36a79e060f01..cbeda23013ac 100644 --- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala +++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala @@ -41,6 +41,8 @@ private[ui] class MasterPage(parent: MasterWebUI) extends WebUIPage("") { override def renderJson(request: HttpServletRequest): JValue = { 
jsonFieldPattern.findFirstMatchIn(request.getRequestURI()) match { case None => JsonProtocol.writeMasterState(getMasterState) + case Some(m) if m.group(1) == "clusterutilization" => +JsonProtocol.writeClusterUtilization(getMasterState) case Some(m) => JsonProtocol.writeMasterState(getMasterState, Some(m.group(1))) } } diff --git a/core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala b/core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala index 4a6ace6facde..6fca31234ee2 100644 --- a/core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala @@ -105,6 +105,20 @@ class JsonProtocolSuite extends SparkFunSuite with JsonTestUtils { assertValidDataInJson(output, JsonMethods.parse(JsonConstants.workerStateJsonStr)) } + test("SPARK-46883: writeClusterUtilization") { +val workers = Array(createWorkerInfo(), create
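The `writeClusterUtilization` helper above sums cores and memory over alive workers and derives integer utilization percentages. The same arithmetic can be re-expressed in a few lines of Python; the worker dicts here are hypothetical stand-ins for `WorkerInfo`, and, as in the Scala code, integer division is used and at least one alive worker with non-zero cores and memory is assumed.

```python
# Sketch of the aggregation behind `/json/clusterutilization`.
def cluster_utilization(workers, waiting_drivers=0):
    alive = [w for w in workers if w["alive"]]
    cores = sum(w["cores"] for w in alive)
    cores_used = sum(w["cores_used"] for w in alive)
    memory = sum(w["memory"] for w in alive)
    memory_used = sum(w["memory_used"] for w in alive)
    return {
        "waitingDrivers": waiting_drivers,
        "cores": cores,
        "coresused": cores_used,
        "coresutilization": 100 * cores_used // cores,    # integer percent
        "memory": memory,
        "memoryused": memory_used,
        "memoryutilization": 100 * memory_used // memory,  # integer percent
    }

# Mirrors the example response in the commit message: one alive worker,
# 10/10 cores used and 1024 of 31744 MiB used.
workers = [{"alive": True, "cores": 10, "cores_used": 10,
            "memory": 31744, "memory_used": 1024}]
print(cluster_utilization(workers))
```

Note that, as written (in both the sketch and the Scala), a cluster with no alive workers would divide by zero; the Master presumably always reports this endpoint with at least the registered workers' totals.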
(spark) branch master updated (e014248434ac -> ecdacf8e14c8)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

from e014248434ac [SPARK-46880][PYTHON][CONNECT][TESTS] Improve and test warning for Arrow-optimized Python UDF
 add ecdacf8e14c8 [SPARK-46881][CORE] Support `spark.deploy.workerSelectionPolicy`

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/deploy/master/Master.scala    | 13 +-
 .../org/apache/spark/internal/config/Deploy.scala  | 18 +
 .../apache/spark/deploy/master/MasterSuite.scala   | 47 ++
 3 files changed, 76 insertions(+), 2 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46880][PYTHON][CONNECT][TESTS] Improve and test warning for Arrow-optimized Python UDF
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new e014248434ac [SPARK-46880][PYTHON][CONNECT][TESTS] Improve and test warning for Arrow-optimized Python UDF
e014248434ac is described below

commit e014248434ac241b9681aceff79f900f0c41dd28
Author: Xinrong Meng
AuthorDate: Fri Jan 26 15:43:32 2024 -0800

[SPARK-46880][PYTHON][CONNECT][TESTS] Improve and test warning for Arrow-optimized Python UDF

### What changes were proposed in this pull request?

Improve and test the warning for Arrow-optimized Python UDF.

### Why are the changes needed?

To improve usability and test coverage.

### Does this PR introduce _any_ user-facing change?

Only a user warning changed.

FROM
```
>>> udf(lambda: print("do"), useArrow=True)
UserWarning: Arrow optimization for Python UDFs cannot be enabled.
  warnings.warn(
<function <lambda> at ..>
```

TO
```
>>> udf(lambda: print("do"), useArrow=True)
UserWarning: Arrow optimization for Python UDFs cannot be enabled for functions without arguments.
  warnings.warn(
<function <lambda> at ..>
```

### How was this patch tested?

Unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44905 from xinrong-meng/arr_udf_warn.
Authored-by: Xinrong Meng Signed-off-by: Dongjoon Hyun --- python/pyspark/sql/connect/udf.py | 3 ++- python/pyspark/sql/tests/test_arrow_python_udf.py | 9 + python/pyspark/sql/udf.py | 3 ++- 3 files changed, 13 insertions(+), 2 deletions(-) diff --git a/python/pyspark/sql/connect/udf.py b/python/pyspark/sql/connect/udf.py index 5386398bdca8..1c42f4d74b7a 100644 --- a/python/pyspark/sql/connect/udf.py +++ b/python/pyspark/sql/connect/udf.py @@ -85,7 +85,8 @@ def _create_py_udf( eval_type = PythonEvalType.SQL_ARROW_BATCHED_UDF else: warnings.warn( -"Arrow optimization for Python UDFs cannot be enabled.", +"Arrow optimization for Python UDFs cannot be enabled for functions" +" without arguments.", UserWarning, ) diff --git a/python/pyspark/sql/tests/test_arrow_python_udf.py b/python/pyspark/sql/tests/test_arrow_python_udf.py index c59326edc31a..114fdf602223 100644 --- a/python/pyspark/sql/tests/test_arrow_python_udf.py +++ b/python/pyspark/sql/tests/test_arrow_python_udf.py @@ -188,6 +188,15 @@ class PythonUDFArrowTestsMixin(BaseUDFTestsMixin): }, ) +def test_warn_no_args(self): +with self.assertWarns(UserWarning) as w: +udf(lambda: print("do"), useArrow=True) +self.assertEqual( +str(w.warning), +"Arrow optimization for Python UDFs cannot be enabled for functions" +" without arguments.", +) + class PythonUDFArrowTests(PythonUDFArrowTestsMixin, ReusedSQLTestCase): @classmethod diff --git a/python/pyspark/sql/udf.py b/python/pyspark/sql/udf.py index ca38556431ad..0324bc678667 100644 --- a/python/pyspark/sql/udf.py +++ b/python/pyspark/sql/udf.py @@ -142,7 +142,8 @@ def _create_py_udf( eval_type = PythonEvalType.SQL_ARROW_BATCHED_UDF else: warnings.warn( -"Arrow optimization for Python UDFs cannot be enabled.", +"Arrow optimization for Python UDFs cannot be enabled for functions" +" without arguments.", UserWarning, ) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
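The diff above changes only the warning text; the surrounding logic in `_create_py_udf` falls back to a non-Arrow batched UDF when the function takes no arguments. A self-contained sketch of that branching follows; the eval-type names are simplified strings here, not the real `PythonEvalType` constants, and `choose_eval_type` is an illustrative helper, not a pyspark function.

```python
# Sketch: warn and fall back when Arrow optimization is requested for a
# zero-argument Python UDF, mirroring the branch changed in this commit.
import inspect
import warnings

def choose_eval_type(func, use_arrow):
    num_args = len(inspect.signature(func).parameters)
    if use_arrow and num_args == 0:
        warnings.warn(
            "Arrow optimization for Python UDFs cannot be enabled for functions"
            " without arguments.",
            UserWarning,
        )
        return "SQL_BATCHED_UDF"  # fall back to the non-Arrow path
    return "SQL_ARROW_BATCHED_UDF" if use_arrow else "SQL_BATCHED_UDF"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kind = choose_eval_type(lambda: None, use_arrow=True)
print(kind, len(caught))  # prints: SQL_BATCHED_UDF 1
```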
(spark) branch master updated: [SPARK-46849][SQL] Run optimizer on CREATE TABLE column defaults
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 7ab7509fa124 [SPARK-46849][SQL] Run optimizer on CREATE TABLE column defaults
7ab7509fa124 is described below

commit 7ab7509fa12418ff5f93782670b7e939c055703a
Author: Daniel Tenedorio
AuthorDate: Fri Jan 26 12:28:10 2024 -0800

[SPARK-46849][SQL] Run optimizer on CREATE TABLE column defaults

### What changes were proposed in this pull request?

This PR updates Catalyst to run the optimizer over `CREATE TABLE` column default expressions.

### Why are the changes needed?

This helps speed up future commands that assign default values within the table.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The functionality is covered by existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44876 from dtenedor/analyze-column-defaults.
Authored-by: Daniel Tenedorio Signed-off-by: Dongjoon Hyun --- .../spark/sql/catalyst/parser/AstBuilder.scala | 19 +++- .../sql/catalyst/plans/logical/v2Commands.scala| 18 ++- .../catalyst/util/ResolveDefaultColumnsUtil.scala | 15 + .../sql/connector/catalog/CatalogV2Util.scala | 26 ++ .../spark/sql/catalyst/parser/DDLParserSuite.scala | 2 +- .../catalyst/analysis/ResolveSessionCatalog.scala | 2 +- .../datasources/v2/DataSourceV2Strategy.scala | 16 +++-- 7 files changed, 83 insertions(+), 15 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala index 54c4343e7ff9..d147d22e4b13 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala @@ -20,6 +20,7 @@ package org.apache.spark.sql.catalyst.parser import java.util.Locale import java.util.concurrent.TimeUnit +import scala.collection.mutable import scala.collection.mutable.{ArrayBuffer, Set} import scala.jdk.CollectionConverters._ import scala.util.{Left, Right} @@ -3997,6 +3998,22 @@ class AstBuilder extends DataTypeAstBuilder with SQLConfHelper with Logging { val tableSpec = UnresolvedTableSpec(properties, provider, options, location, comment, serdeInfo, external) +// Parse column defaults from the table into separate expressions in the CREATE TABLE operator. 
+val specifiedDefaults: mutable.Map[Int, Expression] = mutable.Map.empty +Option(ctx.createOrReplaceTableColTypeList()).foreach { + _.createOrReplaceTableColType().asScala.zipWithIndex.foreach { case (typeContext, index) => +typeContext.colDefinitionOption().asScala.foreach { option => + Option(option.defaultExpression()).foreach { defaultExprContext => +specifiedDefaults.update(index, expression(defaultExprContext.expression())) + } +} + } +} +val defaultValueExpressions: Seq[Option[Expression]] = + (0 until columns.size).map { index: Int => +specifiedDefaults.get(index) + } + Option(ctx.query).map(plan) match { case Some(_) if columns.nonEmpty => operationNotAllowed( @@ -4018,7 +4035,7 @@ class AstBuilder extends DataTypeAstBuilder with SQLConfHelper with Logging { // with data type. val schema = StructType(columns ++ partCols) CreateTable(withIdentClause(identifierContext, UnresolvedIdentifier(_)), - schema, partitioning, tableSpec, ignoreIfExists = ifNotExists) + schema, partitioning, tableSpec, ignoreIfExists = ifNotExists, defaultValueExpressions) } } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala index b17926818900..30be30cb2e04 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala @@ -456,13 +456,16 @@ trait V2CreateTablePlan extends LogicalPlan { /** * Create a new table with a v2 catalog. + * The [[defaultValueExpressions]] hold optional default value expressions to use when creating the + * table, mapping 1:1 with the fields in [[tableSchema]]. */ case class CreateTable( name: LogicalPlan, tableSchema: StructType, partitioning: Seq[Transform], tableSpec: TableSpecBase, -ignoreIfExists: Boolean) +ignoreIfExists: Boolean, +defaultV
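The parser change above collects explicitly specified defaults into a sparse index-to-expression map (`specifiedDefaults`) and then expands it into one `Option[Expression]` per column. The same expansion in a few lines of Python, with `None` playing the role of Scala's `Option.empty`:

```python
# Sketch of the index expansion in AstBuilder: defaults given for some columns
# become a dense per-column list aligned 1:1 with the table schema.
def expand_defaults(num_columns, specified_defaults):
    # specified_defaults: dict mapping column index -> default expression
    return [specified_defaults.get(i) for i in range(num_columns)]

# e.g. CREATE TABLE t(a INT, b INT DEFAULT 42, c STRING)
print(expand_defaults(3, {1: "42"}))  # [None, '42', None]
```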
(spark) branch master updated: [SPARK-46871][PS][TESTS] Clean up the imports in `pyspark.pandas.tests.computation.*`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new f9f413e5ff6a [SPARK-46871][PS][TESTS] Clean up the imports in `pyspark.pandas.tests.computation.*`
f9f413e5ff6a is described below

commit f9f413e5ff6abe00a664e2dc75fb0ade2ff2986a
Author: Ruifeng Zheng
AuthorDate: Thu Jan 25 22:40:35 2024 -0800

[SPARK-46871][PS][TESTS] Clean up the imports in `pyspark.pandas.tests.computation.*`

### What changes were proposed in this pull request?

Clean up the imports in `pyspark.pandas.tests.computation.*`.

### Why are the changes needed?

1. Remove unused imports.
2. Define the test dataset on the vanilla side, so that it doesn't need to be defined again in the parity tests.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

CI

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44895 from zhengruifeng/ps_test_comput_cleanup.
Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- python/pyspark/pandas/tests/computation/test_any_all.py | 8 ++-- python/pyspark/pandas/tests/computation/test_apply_func.py | 12 ++-- python/pyspark/pandas/tests/computation/test_binary_ops.py | 12 ++-- python/pyspark/pandas/tests/computation/test_combine.py | 8 ++-- python/pyspark/pandas/tests/computation/test_compute.py | 8 ++-- python/pyspark/pandas/tests/computation/test_corr.py | 6 +- python/pyspark/pandas/tests/computation/test_corrwith.py | 8 ++-- python/pyspark/pandas/tests/computation/test_cov.py | 8 ++-- python/pyspark/pandas/tests/computation/test_cumulative.py | 8 ++-- python/pyspark/pandas/tests/computation/test_describe.py | 8 ++-- python/pyspark/pandas/tests/computation/test_eval.py | 8 ++-- python/pyspark/pandas/tests/computation/test_melt.py | 8 ++-- python/pyspark/pandas/tests/computation/test_missing_data.py | 8 ++-- python/pyspark/pandas/tests/computation/test_pivot.py| 4 ++-- python/pyspark/pandas/tests/computation/test_pivot_table.py | 4 ++-- .../pyspark/pandas/tests/computation/test_pivot_table_adv.py | 4 ++-- .../pandas/tests/computation/test_pivot_table_multi_idx.py | 4 ++-- .../tests/computation/test_pivot_table_multi_idx_adv.py | 4 ++-- python/pyspark/pandas/tests/computation/test_stats.py| 6 +- .../pandas/tests/connect/computation/test_parity_any_all.py | 11 ++- .../tests/connect/computation/test_parity_apply_func.py | 9 - .../tests/connect/computation/test_parity_binary_ops.py | 11 ++- .../pandas/tests/connect/computation/test_parity_combine.py | 6 +- .../pandas/tests/connect/computation/test_parity_compute.py | 6 +- .../pandas/tests/connect/computation/test_parity_corr.py | 7 +-- .../pandas/tests/connect/computation/test_parity_corrwith.py | 11 ++- .../pandas/tests/connect/computation/test_parity_cov.py | 11 ++- .../tests/connect/computation/test_parity_cumulative.py | 9 - .../pandas/tests/connect/computation/test_parity_describe.py | 5 + 
.../pandas/tests/connect/computation/test_parity_eval.py | 11 ++- .../pandas/tests/connect/computation/test_parity_melt.py | 11 ++- .../tests/connect/computation/test_parity_missing_data.py| 9 - 32 files changed, 164 insertions(+), 89 deletions(-) diff --git a/python/pyspark/pandas/tests/computation/test_any_all.py b/python/pyspark/pandas/tests/computation/test_any_all.py index 5e946be7b08b..784e355f3b58 100644 --- a/python/pyspark/pandas/tests/computation/test_any_all.py +++ b/python/pyspark/pandas/tests/computation/test_any_all.py @@ -20,7 +20,7 @@ import numpy as np import pandas as pd from pyspark import pandas as ps -from pyspark.testing.pandasutils import ComparisonTestBase +from pyspark.testing.pandasutils import PandasOnSparkTestCase from pyspark.testing.sqlutils import SQLTestUtils @@ -149,7 +149,11 @@ class FrameAnyAllMixin: psdf.any(axis=1) -class FrameAnyAllTests(FrameAnyAllMixin, ComparisonTestBase, SQLTestUtils): +class FrameAnyAllTests( +FrameAnyAllMixin, +PandasOnSparkTestCase, +SQLTestUtils, +): pass diff --git a/python/pyspark/pandas/tests/computation/test_apply_func.py b/python/pyspark/pandas/tests/computation/test_apply_func.py index de82c061b58c..ad43a2f2b270 100644 --- a/python/pyspark/pandas/tests/computation/test_apply_func.py +++ b/python/pyspark/pandas/tests/computation/test_apply_func.py @@ -25,7 +25,7 @@ import pandas as pd from pyspark import pandas as ps from pyspark.loose_version import LooseVe
(spark) branch branch-3.4 updated: [SPARK-46855][INFRA][3.4] Add `sketch` to the dependencies of the `catalyst` in `module.py`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.4 by this push:
     new 441c33da0dbb [SPARK-46855][INFRA][3.4] Add `sketch` to the dependencies of the `catalyst` in `module.py`
441c33da0dbb is described below

commit 441c33da0dbba26c54d6a46805f8902605472007
Author: yangjie01
AuthorDate: Thu Jan 25 22:36:32 2024 -0800

[SPARK-46855][INFRA][3.4] Add `sketch` to the dependencies of the `catalyst` in `module.py`

### What changes were proposed in this pull request?

This PR adds `sketch` to the dependencies of the `catalyst` module in `module.py`, because `sketch` is a direct dependency of the `catalyst` module.

### Why are the changes needed?

Ensure that when modifying the `sketch` module, both `catalyst` and cascading modules will trigger tests.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44894 from LuciferYang/SPARK-46855-34.
Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- dev/sparktestsupport/modules.py | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py index ac24ea19d0e7..100dd236c81d 100644 --- a/dev/sparktestsupport/modules.py +++ b/dev/sparktestsupport/modules.py @@ -168,6 +168,15 @@ launcher = Module( ], ) +sketch = Module( +name="sketch", +dependencies=[tags], +source_file_regexes=[ +"common/sketch/", +], +sbt_test_goals=["sketch/test"], +) + core = Module( name="core", dependencies=[kvstore, network_common, network_shuffle, unsafe, launcher], @@ -181,7 +190,7 @@ core = Module( catalyst = Module( name="catalyst", -dependencies=[tags, core], +dependencies=[tags, sketch, core], source_file_regexes=[ "sql/catalyst/", ], @@ -295,15 +304,6 @@ protobuf = Module( ], ) -sketch = Module( -name="sketch", -dependencies=[tags], -source_file_regexes=[ -"common/sketch/", -], -sbt_test_goals=["sketch/test"], -) - graphx = Module( name="graphx", dependencies=[tags, core], - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
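The point of this reordering is that the test infrastructure computes which modules a change affects by walking the `dependencies` lists in `modules.py`: after the change, editing `sketch` also triggers `catalyst` and everything downstream. A small Python sketch of that transitive lookup, using a simplified edge set (the real graph in `modules.py` is much larger):

```python
# Simplified module dependency edges mirroring the updated modules.py:
# catalyst now depends on sketch, so a change to sketch affects catalyst.
deps = {
    "tags": [],
    "sketch": ["tags"],
    "core": ["tags"],  # simplified; the real core module has more dependencies
    "catalyst": ["tags", "sketch", "core"],
    "sql": ["catalyst"],
}

def affected_by(changed, deps):
    """Return every module whose dependency closure includes `changed`."""
    def depends_on(mod, target, seen=()):
        children = deps.get(mod, [])
        return target in children or any(
            depends_on(d, target, seen + (mod,)) for d in children if d not in seen
        )
    return {m for m in deps if m == changed or depends_on(m, changed)}

print(sorted(affected_by("sketch", deps)))  # ['catalyst', 'sketch', 'sql']
```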
(spark) branch branch-3.5 updated: [SPARK-46855][INFRA][3.5] Add `sketch` to the dependencies of the `catalyst` in `module.py`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new e5a654e818b4 [SPARK-46855][INFRA][3.5] Add `sketch` to the dependencies of the `catalyst` in `module.py`
e5a654e818b4 is described below

commit e5a654e818b4698260807a081e5cf3d71480ac13
Author: yangjie01
AuthorDate: Thu Jan 25 22:35:38 2024 -0800

[SPARK-46855][INFRA][3.5] Add `sketch` to the dependencies of the `catalyst` in `module.py`

### What changes were proposed in this pull request?

This PR adds `sketch` to the dependencies of the `catalyst` module in `module.py`, because `sketch` is a direct dependency of the `catalyst` module.

### Why are the changes needed?

Ensure that when modifying the `sketch` module, both `catalyst` and cascading modules will trigger tests.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44893 from LuciferYang/SPARK-46855-35.
Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- dev/sparktestsupport/modules.py | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py index 33d253a47ea0..d29fc8726018 100644 --- a/dev/sparktestsupport/modules.py +++ b/dev/sparktestsupport/modules.py @@ -168,6 +168,15 @@ launcher = Module( ], ) +sketch = Module( +name="sketch", +dependencies=[tags], +source_file_regexes=[ +"common/sketch/", +], +sbt_test_goals=["sketch/test"], +) + core = Module( name="core", dependencies=[kvstore, network_common, network_shuffle, unsafe, launcher], @@ -181,7 +190,7 @@ core = Module( catalyst = Module( name="catalyst", -dependencies=[tags, core], +dependencies=[tags, sketch, core], source_file_regexes=[ "sql/catalyst/", ], @@ -295,15 +304,6 @@ connect = Module( ], ) -sketch = Module( -name="sketch", -dependencies=[tags], -source_file_regexes=[ -"common/sketch/", -], -sbt_test_goals=["sketch/test"], -) - graphx = Module( name="graphx", dependencies=[tags, core], - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46872][CORE] Recover `log-view.js` to be non-module
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 1eedb7507ae2 [SPARK-46872][CORE] Recover `log-view.js` to be non-module
1eedb7507ae2 is described below

commit 1eedb7507ae23d069e65a40c202173a709c5e94d
Author: Dongjoon Hyun
AuthorDate: Thu Jan 25 22:31:56 2024 -0800

[SPARK-46872][CORE] Recover `log-view.js` to be non-module

### What changes were proposed in this pull request?

This PR aims to recover `log-view.js` to be non-module to fix a loading issue.

### Why are the changes needed?

- #43903

![Screenshot 2024-01-25 at 9 08 48 PM](https://github.com/apache/spark/assets/9700541/830fadc8-ab1c-4cf4-9e56-493f9553b3ae)

### Does this PR introduce _any_ user-facing change?

No. This is a recovery to the status before SPARK-46003, which is not released yet.

### How was this patch tested?

Manually.
- Checkout the SPARK-46003 commit and build.
- Start Master and Worker.
- Open an `Incognito` or `Private` mode browser and go to the Worker Log.
- Check the `initLogPage` error via the developer tools

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44896 from dongjoon-hyun/SPARK-46872.
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- core/src/main/resources/org/apache/spark/ui/static/log-view.js | 4 +--- core/src/main/scala/org/apache/spark/ui/UIUtils.scala | 2 +- 2 files changed, 2 insertions(+), 4 deletions(-) diff --git a/core/src/main/resources/org/apache/spark/ui/static/log-view.js b/core/src/main/resources/org/apache/spark/ui/static/log-view.js index eaf7130e974b..0b917ee5c8d8 100644 --- a/core/src/main/resources/org/apache/spark/ui/static/log-view.js +++ b/core/src/main/resources/org/apache/spark/ui/static/log-view.js @@ -17,8 +17,6 @@ /* global $ */ -import {getBaseURI} from "./utils.js"; - var baseParams; var curLogLength; @@ -60,7 +58,7 @@ function getRESTEndPoint() { // If the worker is served from the master through a proxy (see doc on spark.ui.reverseProxy), // we need to retain the leading ../proxy// part of the URL when making REST requests. // Similar logic is contained in executorspage.js function createRESTEndPoint. - var words = getBaseURI().split('/'); + var words = (document.baseURI || document.URL).split('/'); var ind = words.indexOf("proxy"); if (ind > 0) { return words.slice(0, ind + 2).join('/') + "/log"; diff --git a/core/src/main/scala/org/apache/spark/ui/UIUtils.scala b/core/src/main/scala/org/apache/spark/ui/UIUtils.scala index d124717ea85a..14255d276d66 100644 --- a/core/src/main/scala/org/apache/spark/ui/UIUtils.scala +++ b/core/src/main/scala/org/apache/spark/ui/UIUtils.scala @@ -220,7 +220,7 @@ private[spark] object UIUtils extends Logging { - + setUIRoot('{UIUtils.uiRoot(request)}') } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46870][CORE] Support Spark Master Log UI
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new d6fc06bd4515 [SPARK-46870][CORE] Support Spark Master Log UI
d6fc06bd4515 is described below

commit d6fc06bd451586edc5e55068aabecb3dc7ec5849
Author: Dongjoon Hyun
AuthorDate: Thu Jan 25 21:15:30 2024 -0800

[SPARK-46870][CORE] Support Spark Master Log UI

### What changes were proposed in this pull request?

This PR aims to support `Spark Master` Log UI.

### Why are the changes needed?

This is a new feature to allow the users to access the master log like the following. The value of `Status`, e.g., `ALIVE`, has a new link for the log UI.

**BEFORE**
![Screenshot 2024-01-25 at 7 30 07 PM](https://github.com/apache/spark/assets/9700541/2c263944-ebfa-49bb-955f-d9a022e23cba)

**AFTER**
![Screenshot 2024-01-25 at 7 28 59 PM](https://github.com/apache/spark/assets/9700541/8d096261-3a31-4746-b52b-e01cfcdf3237)
![Screenshot 2024-01-25 at 7 29 21 PM](https://github.com/apache/spark/assets/9700541/fc4d3c10-8695-4529-a92b-6ab477c961da)

### Does this PR introduce _any_ user-facing change?

No. This is a new link and UI.

### How was this patch tested?

Manually.
```
$ sbin/start-master.sh
```

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44890 from dongjoon-hyun/SPARK-46870.
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../apache/spark/deploy/master/ui/LogPage.scala| 125 + .../apache/spark/deploy/master/ui/MasterPage.scala | 4 +- .../spark/deploy/master/ui/MasterWebUI.scala | 1 + 3 files changed, 129 insertions(+), 1 deletion(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/master/ui/LogPage.scala b/core/src/main/scala/org/apache/spark/deploy/master/ui/LogPage.scala new file mode 100644 index ..9da05025e1a3 --- /dev/null +++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/LogPage.scala @@ -0,0 +1,125 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.deploy.master.ui + +import java.io.File +import javax.servlet.http.HttpServletRequest + +import scala.xml.{Node, Unparsed} + +import org.apache.spark.internal.Logging +import org.apache.spark.ui.{UIUtils, WebUIPage} +import org.apache.spark.util.Utils +import org.apache.spark.util.logging.RollingFileAppender + +private[ui] class LogPage(parent: MasterWebUI) extends WebUIPage("logPage") with Logging { + private val defaultBytes = 100 * 1024 + + def render(request: HttpServletRequest): Seq[Node] = { +val logDir = sys.env.getOrElse("SPARK_LOG_DIR", "logs/") +val logType = request.getParameter("logType") +val offset = Option(request.getParameter("offset")).map(_.toLong) +val byteLength = Option(request.getParameter("byteLength")).map(_.toInt) + .getOrElse(defaultBytes) +val (logText, startByte, endByte, logLength) = getLog(logDir, logType, offset, byteLength) +val curLogLength = endByte - startByte +val range = + +Showing {curLogLength} Bytes: {startByte.toString} - {endByte.toString} of {logLength} + + +val moreButton = + +Load More + + +val newButton = + +Load New + + +val alert = + +End of Log + + +val logParams = "?self&logType=%s".format(logType) +val jsOnload = "window.onload = " + + s"initLogPage('$logParams', $curLogLength, $startByte, $endByte, $logLength, $byteLength);" + +val content = + ++ + +Back to Master +{range} + + {moreButton} + {logText} + {alert} + {newButton} + +{Unparsed(jsOnload)} + + +UIUtils.basicSparkPage(request, content, logType + " log pag
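The new `LogPage` renders a byte window of the master log controlled by the `offset` and `byteLength` request parameters, defaulting to the last 100 KiB. The body of `getLog` is cut off above, so the following Python sketch of the window arithmetic is an assumption inferred from the parameters shown (`offset`, `byteLength`, `defaultBytes`), not a transcription of the Scala code:

```python
# Assumed sketch of the [start, end) byte-window computation for a log page:
# no offset means "show the tail", and the window is clamped to the file size.
def log_window(log_length, offset=None, byte_length=100 * 1024):
    start = log_length - byte_length if offset is None else offset
    start = max(0, min(start, log_length))        # clamp into the file
    end = min(start + byte_length, log_length)
    return start, end

print(log_window(1_000_000))  # tail window: (897600, 1000000)
```

This matches the page's "Showing {curLogLength} Bytes: {startByte} - {endByte} of {logLength}" banner, where `curLogLength = endByte - startByte`.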
(spark) branch master updated: [SPARK-46868][CORE] Support Spark Worker Log UI
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 48cd8604f953 [SPARK-46868][CORE] Support Spark Worker Log UI
48cd8604f953 is described below

commit 48cd8604f953dc82cadb6c076914d4d5c69b8126
Author: Dongjoon Hyun
AuthorDate: Thu Jan 25 19:31:26 2024 -0800

[SPARK-46868][CORE] Support Spark Worker Log UI

### What changes were proposed in this pull request?

This PR aims to support `Spark Worker Log UI` when `SPARK_LOG_DIR` is under the work directory.

### Why are the changes needed?

This is a new feature to allow the users to access the worker log like the following.

**BEFORE**
![Screenshot 2024-01-25 at 3 04 20 PM](https://github.com/apache/spark/assets/9700541/73ef33d5-9b56-4cca-83c2-9fd2e8ab5201)

**AFTER**
- Worker Page (Worker ID provides a new hyperlink for Log UI)
![Screenshot 2024-01-25 at 2 58 44 PM](https://github.com/apache/spark/assets/9700541/1de66eee-7b73-4be3-a12c-e008442b7b6c)
- Log UI
![Screenshot 2024-01-25 at 6 00 25 PM](https://github.com/apache/spark/assets/9700541/e20fde05-ce5e-42cb-9112-4a8d2ec69418)

### Does this PR introduce _any_ user-facing change?

To provide a better UX.

### How was this patch tested?

Manually.
```
$ sbin/start-master.sh
$ SPARK_LOG_DIR=$PWD/work/logs sbin/start-worker.sh spark://$(hostname):7077
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44888 from dongjoon-hyun/SPARK-46868.
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../apache/spark/deploy/worker/ui/LogPage.scala| 29 -- .../apache/spark/deploy/worker/ui/WorkerPage.scala | 6 - 2 files changed, 26 insertions(+), 9 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala b/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala index dd714cdc4437..991c791cc79e 100644 --- a/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala +++ b/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala @@ -30,23 +30,26 @@ import org.apache.spark.util.logging.RollingFileAppender private[ui] class LogPage(parent: WorkerWebUI) extends WebUIPage("logPage") with Logging { private val worker = parent.worker private val workDir = new File(parent.workDir.toURI.normalize().getPath) - private val supportedLogTypes = Set("stderr", "stdout") + private val supportedLogTypes = Set("stderr", "stdout", "out") private val defaultBytes = 100 * 1024 def renderLog(request: HttpServletRequest): String = { val appId = Option(request.getParameter("appId")) val executorId = Option(request.getParameter("executorId")) val driverId = Option(request.getParameter("driverId")) +val self = Option(request.getParameter("self")) val logType = request.getParameter("logType") val offset = Option(request.getParameter("offset")).map(_.toLong) val byteLength = Option(request.getParameter("byteLength")).map(_.toInt) .getOrElse(defaultBytes) -val logDir = (appId, executorId, driverId) match { - case (Some(a), Some(e), None) => +val logDir = (appId, executorId, driverId, self) match { + case (Some(a), Some(e), None, None) => s"${workDir.getPath}/$a/$e/" - case (None, None, Some(d)) => + case (None, None, Some(d), None) => s"${workDir.getPath}/$d/" + case (None, None, None, Some(_)) => +s"${sys.env.getOrElse("SPARK_LOG_DIR", workDir.getPath)}/" case _ => throw new Exception("Request must specify either application or driver identifiers") } @@ -60,16 
+63,19 @@ private[ui] class LogPage(parent: WorkerWebUI) extends WebUIPage("logPage") with val appId = Option(request.getParameter("appId")) val executorId = Option(request.getParameter("executorId")) val driverId = Option(request.getParameter("driverId")) +val self = Option(request.getParameter("self")) val logType = request.getParameter("logType") val offset = Option(request.getParameter("offset")).map(_.toLong) val byteLength = Option(request.getParameter("byteLength")).map(_.toInt) .getOrElse(defaultBytes) -val (logDir, params, pageName) = (appId, executorId, driverId) match { - case (Some(a), Some(e), None) => +val (logDir, params, pageName) = (appId, executorId, driverId, self) match { +
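The core of the SPARK-46868 diff above is the widened pattern match: a log request is routed by which of `appId`/`executorId`, `driverId`, or the new `self` parameter is present, with `self` resolving to `SPARK_LOG_DIR` (falling back to the work directory). A Python sketch of that dispatch, with parameter names of our choosing:

```python
def resolve_log_dir(work_dir, app_id=None, executor_id=None,
                    driver_id=None, self_log=False, spark_log_dir=None):
    """Pick the directory to serve logs from, mirroring the match in
    LogPage.renderLog. Illustration only; not Spark's actual code."""
    if app_id and executor_id and not driver_id and not self_log:
        return f"{work_dir}/{app_id}/{executor_id}/"   # executor logs
    if driver_id and not (app_id or executor_id or self_log):
        return f"{work_dir}/{driver_id}/"              # driver logs
    if self_log and not (app_id or executor_id or driver_id):
        # SPARK-46868: the worker's own logs, from SPARK_LOG_DIR if set.
        return (spark_log_dir or work_dir) + "/"
    raise ValueError(
        "Request must specify either application or driver identifiers")
```

Any ambiguous combination (or none at all) is rejected, matching the `case _ => throw ...` branch.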
(spark) branch master updated: [SPARK-46869][K8S] Add `logrotate` to Spark docker files
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 25df5dad6761 [SPARK-46869][K8S] Add `logrotate` to Spark docker files 25df5dad6761 is described below commit 25df5dad67610357287bef075ee755c59acdb904 Author: Dongjoon Hyun AuthorDate: Thu Jan 25 16:30:55 2024 -0800 [SPARK-46869][K8S] Add `logrotate` to Spark docker files ### What changes were proposed in this pull request? This PR aims to add `logrotate` to Spark docker files. ### Why are the changes needed? To help a user to easily rotate the logs by configuration. Note that this is not for rigorous users who cannot allow log data loss. `logrotate` is easy but is known to allow log loss during rotation. ### Does this PR introduce _any_ user-facing change? The image size change is negligible. ``` $ docker images spark REPOSITORY  TAG  IMAGE ID  CREATED  SIZE  spark  latest-logrotate  d843879458af  18 hours ago  657MB  spark  latest  0e281bd1fbe6  18 hours ago  657MB ``` ### How was this patch tested? Manually. ``` $ docker run -it --rm spark:latest-logrotate /usr/sbin/logrotate | tail -n7 logrotate 3.19.0 - Copyright (C) 1995-2001 Red Hat, Inc. This may be freely redistributed under the terms of the GNU General Public License Usage: logrotate [-dfv?] [-d|--debug] [-f|--force] [-m|--mail=command] [-s|--state=statefile] [--skip-state-lock] [-v|--verbose] [-l|--log=logfile] [--version] [-?|--help] [--usage] [OPTION...] ``` ### Was this patch authored or co-authored using generative AI tooling? Pass the CIs. Closes #44889 from dongjoon-hyun/SPARK-46869. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../kubernetes/docker/src/main/dockerfiles/spark/Dockerfile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile index b80e72c768c6..25d7e076169b 100644 --- a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile +++ b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile @@ -30,7 +30,7 @@ ARG spark_uid=185 RUN set -ex && \ apt-get update && \ ln -s /lib /lib64 && \ -apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps net-tools && \ +apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps net-tools logrotate && \ mkdir -p /opt/spark && \ mkdir -p /opt/spark/examples && \ mkdir -p /opt/spark/work-dir && \ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
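The commit above only installs the `logrotate` binary; any rotation policy is left to the user's configuration. As a hedged illustration of what such a configuration could look like, here is a small Python helper that renders a plausible logrotate stanza for Spark daemon logs. Every value in it (the log path glob, the size threshold, the retention count) is an assumption of ours, not something mandated by the commit:

```python
def spark_logrotate_conf(log_glob="/opt/spark/logs/*.out",
                         keep=5, max_size="100M"):
    """Render a minimal logrotate config for Spark daemon logs.

    All paths and policy values here are example assumptions; the
    SPARK-46869 commit only adds the logrotate package to the image.
    copytruncate avoids restarting the daemon, at the cost of possibly
    losing lines written during the copy (the log-loss caveat noted
    in the commit message).
    """
    return (
        f"{log_glob} {{\n"
        f"    size {max_size}\n"       # rotate once a log exceeds this size
        f"    rotate {keep}\n"         # keep this many rotated copies
        "    compress\n"
        "    missingok\n"
        "    notifempty\n"
        "    copytruncate\n"
        "}\n"
    )
```

Writing this text to a file and pointing `logrotate` at it (e.g. via a cron entry in the container) would complete the setup.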
(spark) branch branch-3.4 updated: [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 3130ac9276bd [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler 3130ac9276bd is described below commit 3130ac9276bd43dd21aa1aa5e5ef920b00bc3aff Author: fred-db AuthorDate: Thu Jan 25 08:34:37 2024 -0800 [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler * The DAGScheduler could currently run into a deadlock with another thread if both access the partitions of the same RDD at the same time. * To make progress in getCacheLocs, we require both exclusive access to the RDD partitions and the location cache. We first lock on the location cache, and then on the RDD. * When accessing partitions of an RDD, the RDD first acquires exclusive access on the partitions, and then might acquire exclusive access on the location cache. * If thread 1 is able to acquire access on the RDD, while thread 2 holds the access to the location cache, we can run into a deadlock situation. * To fix this, acquire locks in the same order. Change the DAGScheduler to first acquire the lock on the RDD, and then the lock on the location cache. * This is a deadlock you can run into, which can prevent any progress on the cluster. * No * Unit test that reproduces the issue. No Closes #44882 from fred-db/fix-deadlock. 
Authored-by: fred-db Signed-off-by: Dongjoon Hyun (cherry picked from commit 617014cc92d933c70c9865a578fceb265883badd) Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/rdd/RDD.scala | 11 --- .../org/apache/spark/scheduler/DAGScheduler.scala | 31 ++ .../apache/spark/scheduler/DAGSchedulerSuite.scala | 38 +- 3 files changed, 62 insertions(+), 18 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/rdd/RDD.scala b/core/src/main/scala/org/apache/spark/rdd/RDD.scala index 407820b663a3..fc5a2089f43b 100644 --- a/core/src/main/scala/org/apache/spark/rdd/RDD.scala +++ b/core/src/main/scala/org/apache/spark/rdd/RDD.scala @@ -223,14 +223,17 @@ abstract class RDD[T: ClassTag]( * not use `this` because RDDs are user-visible, so users might have added their own locking on * RDDs; sharing that could lead to a deadlock. * - * One thread might hold the lock on many of these, for a chain of RDD dependencies; but - * because DAGs are acyclic, and we only ever hold locks for one path in that DAG, there is no - * chance of deadlock. + * One thread might hold the lock on many of these, for a chain of RDD dependencies. Deadlocks + * are possible if we try to lock another resource while holding the stateLock, + * and the lock acquisition sequence of these locks is not guaranteed to be the same. + * This can lead lead to a deadlock as one thread might first acquire the stateLock, + * and then the resource, + * while another thread might first acquire the resource, and then the stateLock. * * Executors may reference the shared fields (though they should never mutate them, * that only happens on the driver). 
*/ - private val stateLock = new Serializable {} + private[spark] val stateLock = new Serializable {} // Our dependencies and partitions will be gotten by calling subclass's methods below, and will // be overwritten when we're checkpointed diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala index 2a966fab6f02..26be8c72bbcb 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala @@ -173,6 +173,9 @@ private[spark] class DAGScheduler( * locations where that RDD partition is cached. * * All accesses to this map should be guarded by synchronizing on it (see SPARK-4454). + * If you need to access any RDD while synchronizing on the cache locations, + * first synchronize on the RDD, and then synchronize on this map to avoid deadlocks. The RDD + * could try to access the cache locations after synchronizing on the RDD. */ private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]] @@ -408,22 +411,24 @@ private[spark] class DAGScheduler( } private[scheduler] - def getCacheLocs(rdd: RDD[_]): IndexedSeq[Seq[TaskLocation]] = cacheLocs.synchronized { -// Note: this doesn't use `getOrElse()` because this method is called O(num tasks) times -if (!cacheLocs.contains(rdd.id)) { - // Note: if the storage level is NONE, we don't need to get locations from block manager. - val locs: IndexedSeq[Seq[TaskLocation]] = if (rdd.getStorageLevel == StorageLevel.NONE) { -IndexedSeq.fill(rdd.pa
(spark) branch branch-3.5 updated: [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 125b2f87d453 [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler 125b2f87d453 is described below commit 125b2f87d453a16325f24e7382707f2b365bba14 Author: fred-db AuthorDate: Thu Jan 25 08:34:37 2024 -0800 [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler * The DAGScheduler could currently run into a deadlock with another thread if both access the partitions of the same RDD at the same time. * To make progress in getCacheLocs, we require both exclusive access to the RDD partitions and the location cache. We first lock on the location cache, and then on the RDD. * When accessing partitions of an RDD, the RDD first acquires exclusive access on the partitions, and then might acquire exclusive access on the location cache. * If thread 1 is able to acquire access on the RDD, while thread 2 holds the access to the location cache, we can run into a deadlock situation. * To fix this, acquire locks in the same order. Change the DAGScheduler to first acquire the lock on the RDD, and then the lock on the location cache. * This is a deadlock you can run into, which can prevent any progress on the cluster. * No * Unit test that reproduces the issue. No Closes #44882 from fred-db/fix-deadlock. 
Authored-by: fred-db Signed-off-by: Dongjoon Hyun (cherry picked from commit 617014cc92d933c70c9865a578fceb265883badd) Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/rdd/RDD.scala | 11 --- .../org/apache/spark/scheduler/DAGScheduler.scala | 31 ++ .../apache/spark/scheduler/DAGSchedulerSuite.scala | 38 +- 3 files changed, 62 insertions(+), 18 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/rdd/RDD.scala b/core/src/main/scala/org/apache/spark/rdd/RDD.scala index a21d2ae77396..f695b1020275 100644 --- a/core/src/main/scala/org/apache/spark/rdd/RDD.scala +++ b/core/src/main/scala/org/apache/spark/rdd/RDD.scala @@ -223,14 +223,17 @@ abstract class RDD[T: ClassTag]( * not use `this` because RDDs are user-visible, so users might have added their own locking on * RDDs; sharing that could lead to a deadlock. * - * One thread might hold the lock on many of these, for a chain of RDD dependencies; but - * because DAGs are acyclic, and we only ever hold locks for one path in that DAG, there is no - * chance of deadlock. + * One thread might hold the lock on many of these, for a chain of RDD dependencies. Deadlocks + * are possible if we try to lock another resource while holding the stateLock, + * and the lock acquisition sequence of these locks is not guaranteed to be the same. + * This can lead lead to a deadlock as one thread might first acquire the stateLock, + * and then the resource, + * while another thread might first acquire the resource, and then the stateLock. * * Executors may reference the shared fields (though they should never mutate them, * that only happens on the driver). 
*/ - private val stateLock = new Serializable {} + private[spark] val stateLock = new Serializable {} // Our dependencies and partitions will be gotten by calling subclass's methods below, and will // be overwritten when we're checkpointed diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala index d8adaae19b90..89d16e579348 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala @@ -174,6 +174,9 @@ private[spark] class DAGScheduler( * locations where that RDD partition is cached. * * All accesses to this map should be guarded by synchronizing on it (see SPARK-4454). + * If you need to access any RDD while synchronizing on the cache locations, + * first synchronize on the RDD, and then synchronize on this map to avoid deadlocks. The RDD + * could try to access the cache locations after synchronizing on the RDD. */ private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]] @@ -420,22 +423,24 @@ private[spark] class DAGScheduler( } private[scheduler] - def getCacheLocs(rdd: RDD[_]): IndexedSeq[Seq[TaskLocation]] = cacheLocs.synchronized { -// Note: this doesn't use `getOrElse()` because this method is called O(num tasks) times -if (!cacheLocs.contains(rdd.id)) { - // Note: if the storage level is NONE, we don't need to get locations from block manager. - val locs: IndexedSeq[Seq[TaskLocation]] = if (rdd.getStorageLevel == StorageLevel.NONE) { -IndexedSeq.fill(rdd.pa
(spark) branch master updated: [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 617014cc92d9 [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler 617014cc92d9 is described below commit 617014cc92d933c70c9865a578fceb265883badd Author: fred-db AuthorDate: Thu Jan 25 08:34:37 2024 -0800 [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler ### What changes were proposed in this pull request? * The DAGScheduler could currently run into a deadlock with another thread if both access the partitions of the same RDD at the same time. * To make progress in getCacheLocs, we require both exclusive access to the RDD partitions and the location cache. We first lock on the location cache, and then on the RDD. * When accessing partitions of an RDD, the RDD first acquires exclusive access on the partitions, and then might acquire exclusive access on the location cache. * If thread 1 is able to acquire access on the RDD, while thread 2 holds the access to the location cache, we can run into a deadlock situation. * To fix this, acquire locks in the same order. Change the DAGScheduler to first acquire the lock on the RDD, and then the lock on the location cache. ### Why are the changes needed? * This is a deadlock you can run into, which can prevent any progress on the cluster. ### Does this PR introduce _any_ user-facing change? * No ### How was this patch tested? * Unit test that reproduces the issue. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44882 from fred-db/fix-deadlock. 
Authored-by: fred-db Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/rdd/RDD.scala | 11 --- .../org/apache/spark/scheduler/DAGScheduler.scala | 31 ++ .../apache/spark/scheduler/DAGSchedulerSuite.scala | 38 +- 3 files changed, 62 insertions(+), 18 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/rdd/RDD.scala b/core/src/main/scala/org/apache/spark/rdd/RDD.scala index d73fb1b9bc3b..a48eaa253ad1 100644 --- a/core/src/main/scala/org/apache/spark/rdd/RDD.scala +++ b/core/src/main/scala/org/apache/spark/rdd/RDD.scala @@ -224,14 +224,17 @@ abstract class RDD[T: ClassTag]( * not use `this` because RDDs are user-visible, so users might have added their own locking on * RDDs; sharing that could lead to a deadlock. * - * One thread might hold the lock on many of these, for a chain of RDD dependencies; but - * because DAGs are acyclic, and we only ever hold locks for one path in that DAG, there is no - * chance of deadlock. + * One thread might hold the lock on many of these, for a chain of RDD dependencies. Deadlocks + * are possible if we try to lock another resource while holding the stateLock, + * and the lock acquisition sequence of these locks is not guaranteed to be the same. + * This can lead lead to a deadlock as one thread might first acquire the stateLock, + * and then the resource, + * while another thread might first acquire the resource, and then the stateLock. * * Executors may reference the shared fields (though they should never mutate them, * that only happens on the driver). 
*/ - private val stateLock = new Serializable {} + private[spark] val stateLock = new Serializable {} // Our dependencies and partitions will be gotten by calling subclass's methods below, and will // be overwritten when we're checkpointed diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala index e728d921d290..e74a3efac250 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala @@ -181,6 +181,9 @@ private[spark] class DAGScheduler( * locations where that RDD partition is cached. * * All accesses to this map should be guarded by synchronizing on it (see SPARK-4454). + * If you need to access any RDD while synchronizing on the cache locations, + * first synchronize on the RDD, and then synchronize on this map to avoid deadlocks. The RDD + * could try to access the cache locations after synchronizing on the RDD. */ private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]] @@ -435,22 +438,24 @@ private[spark] class DAGScheduler( } private[scheduler] - def getCacheLocs(rdd: RDD[_]): IndexedSeq[Seq[TaskLocation]] = cacheLocs.synchronized { -// Note: this doesn't use `getOrElse()` because this method is called O(num tasks) times -if (!cacheLocs.contains(rdd.id)) { - // Note: if the storage level is NONE, we don't n
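The SPARK-46861 fix above is a classic lock-ordering repair: a deadlock needs two threads acquiring the same pair of locks in opposite orders, so the patch makes every path take the RDD's `stateLock` before the `cacheLocs` lock. A self-contained Python analogy (the names are stand-ins for the two Spark locks; this is an illustration of the principle, not Spark's code):

```python
import threading

rdd_lock = threading.Lock()         # stand-in for the RDD's stateLock
cache_locs_lock = threading.Lock()  # stand-in for the cacheLocs map's monitor

def get_cache_locs(rdd_id, cache):
    # The fix: always take the RDD lock *before* the cache-locations
    # lock, so no thread can ever hold them in the opposite order.
    with rdd_lock:
        with cache_locs_lock:
            return cache.setdefault(rdd_id, [])

def touch_partitions(rdd_id, cache):
    # Paths that start from the RDD use the same order, so the circular
    # wait a deadlock requires cannot form.
    with rdd_lock:
        partitions = list(range(3))   # pretend partition computation
        with cache_locs_lock:
            cache[rdd_id] = partitions
        return partitions
```

Before the fix, the equivalent of `get_cache_locs` took `cache_locs_lock` first, which could interleave fatally with a thread already inside `touch_partitions`.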
(spark) branch master updated: [SPARK-46855][INFRA] Add `sketch` to the dependencies of the `catalyst` in `module.py`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1bee07e39f1b [SPARK-46855][INFRA] Add `sketch` to the dependencies of the `catalyst` in `module.py` 1bee07e39f1b is described below commit 1bee07e39f1b5aef6ce81e028207691f1dd1fc7c Author: yangjie01 AuthorDate: Thu Jan 25 08:26:13 2024 -0800 [SPARK-46855][INFRA] Add `sketch` to the dependencies of the `catalyst` in `module.py` ### What changes were proposed in this pull request? This pr add `sketch` to the dependencies of the `catalyst` module in `module.py` due to `sketch` is direct dependency of `catalyst` module. ### Why are the changes needed? Ensure that when modifying the `sketch` module, both `catalyst` and cascading modules will trigger tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #44878 from LuciferYang/SPARK-46855. 
Lead-authored-by: yangjie01 Co-authored-by: YangJie Signed-off-by: Dongjoon Hyun --- dev/sparktestsupport/modules.py | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py index be3e798b0779..b9541c4be9b3 100644 --- a/dev/sparktestsupport/modules.py +++ b/dev/sparktestsupport/modules.py @@ -179,6 +179,15 @@ launcher = Module( ], ) +sketch = Module( +name="sketch", +dependencies=[tags], +source_file_regexes=[ +"common/sketch/", +], +sbt_test_goals=["sketch/test"], +) + core = Module( name="core", dependencies=[kvstore, network_common, network_shuffle, unsafe, launcher, utils], @@ -200,7 +209,7 @@ api = Module( catalyst = Module( name="catalyst", -dependencies=[tags, core, api], +dependencies=[tags, sketch, core, api], source_file_regexes=[ "sql/catalyst/", ], @@ -315,15 +324,6 @@ connect = Module( ], ) -sketch = Module( -name="sketch", -dependencies=[tags], -source_file_regexes=[ -"common/sketch/", -], -sbt_test_goals=["sketch/test"], -) - graphx = Module( name="graphx", dependencies=[tags, core], - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
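The point of the `modules.py` change above is test selection: when a file in a module changes, that module and everything that transitively depends on it must be tested, so `sketch` must be declared before (and as a dependency of) `catalyst`. A small Python sketch of that propagation; the module names mirror the diff, but the traversal itself is our illustration, not Spark's exact code:

```python
# module -> direct dependencies, echoing dev/sparktestsupport/modules.py
DEPS = {
    "tags": [],
    "sketch": ["tags"],
    "core": ["tags"],
    "api": ["core"],
    "catalyst": ["tags", "sketch", "core", "api"],  # SPARK-46855 adds sketch
    "sql": ["catalyst"],
}

def modules_to_test(changed):
    """Return every module that (transitively) depends on `changed`,
    including `changed` itself — i.e. the set whose tests must run."""
    affected = {changed}
    grew = True
    while grew:                      # fixed-point over the dependency edges
        grew = False
        for mod, deps in DEPS.items():
            if mod not in affected and affected & set(deps):
                affected.add(mod)
                grew = True
    return sorted(affected)
```

With the edge in place, touching `common/sketch/` now pulls in `catalyst` and its cascading modules; without it, `sketch` changes would only trigger `sketch` tests.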
(spark-website) branch asf-site updated: docs: udpate third party projects (#497)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/spark-website.git The following commit(s) were added to refs/heads/asf-site by this push: new a6ce63fb9c docs: udpate third party projects (#497) a6ce63fb9c is described below commit a6ce63fb9c82dc8f25f42f377b487c0de2aff826 Author: Matthew Powers AuthorDate: Thu Jan 25 11:18:05 2024 -0500 docs: udpate third party projects (#497) --- site/third-party-projects.html | 79 ++ third-party-projects.md| 77 2 files changed, 81 insertions(+), 75 deletions(-) diff --git a/site/third-party-projects.html b/site/third-party-projects.html index ba0911b733..a0f7a953f8 100644 --- a/site/third-party-projects.html +++ b/site/third-party-projects.html @@ -141,40 +141,57 @@ This page tracks external software projects that supplement Apache Spark and add to its ecosystem. -To add a project, open a pull request against the https://github.com/apache/spark-website";>spark-website -repository. Add an entry to -https://github.com/apache/spark-website/blob/asf-site/third-party-projects.md";>this markdown file, -then run jekyll build to generate the HTML too. Include -both in your pull request. See the README in this repo for more information. +Popular libraries with PySpark integrations -Note that all project and product names should follow trademark guidelines. 
+ + https://github.com/great-expectations/great_expectations";>great-expectations - Always know what to expect from your data + https://github.com/apache/airflow";>Apache Airflow - A platform to programmatically author, schedule, and monitor workflows + https://github.com/dmlc/xgboost";>xgboost - Scalable, portable and distributed gradient boosting + https://github.com/shap/shap";>shap - A game theoretic approach to explain the output of any machine learning model + https://github.com/awslabs/python-deequ";>python-deequ - Measures data quality in large datasets + https://github.com/datahub-project/datahub";>datahub - Metadata platform for the modern data stack + https://github.com/dbt-labs/dbt-spark";>dbt-spark - Enables dbt to work with Apache Spark + -spark-packages.org +Connectors -https://spark-packages.org/";>spark-packages.org is an external, -community-managed list of third-party libraries, add-ons, and applications that work with -Apache Spark. You can add a package as long as you have a GitHub repository. 
+ + https://github.com/spark-redshift-community/spark-redshift";>spark-redshift - Performant Redshift data source for Apache Spark + https://github.com/microsoft/sql-spark-connector";>spark-sql-connector - Apache Spark Connector for SQL Server and Azure SQL + https://github.com/Azure/azure-cosmosdb-spark";>azure-cosmos-spark - Apache Spark Connector for Azure Cosmos DB + https://github.com/Azure/azure-event-hubs-spark";>azure-event-hubs-spark - Enables continuous data processing with Apache Spark and Azure Event Hubs + https://github.com/Azure/azure-kusto-spark";>azure-kusto-spark - Apache Spark connector for Azure Kusto + https://github.com/mongodb/mongo-spark";>mongo-spark - The MongoDB Spark connector + https://github.com/couchbase/couchbase-spark-connector";>couchbase-spark-connector - The Official Couchbase Spark connector + https://github.com/datastax/spark-cassandra-connector";>spark-cassandra-connector - DataStax connector for Apache Spark to Apache Cassandra + https://github.com/elastic/elasticsearch-hadoop";>elasticsearch-hadoop - Elasticsearch real-time search and analytics natively integrated with Spark + https://github.com/neo4j-contrib/neo4j-spark-connector";>neo4j-spark-connector - Neo4j Connector for Apache Spark + https://github.com/StarRocks/starrocks-connector-for-apache-spark";>starrocks-connector-for-apache-spark - StarRocks Apache Spark connector + https://github.com/pingcap/tispark";>tispark - TiSpark is built for running Apache Spark on top of TiDB/TiKV + + +Open table formats + + + https://delta.io";>Delta Lake - Storage layer that provides ACID transactions and scalable metadata handling for Apache Spark workloads + https://github.com/apache/hudi";>Hudi: Upserts, Deletes And Incremental Processing on Big Data + https://github.com/apache/iceberg";>Iceberg - Open table format for analytic datasets + Infrastructure projects - https://github.com/spark-jobserver/spark-jobserver";>REST Job Server for Apache Spark - -REST interface for managing 
and submitting Spark jobs on the same cluster. - http://mlbase.org/";>MLbase - Machine Learning research project on top of Spark +
(spark) branch master updated: [SPARK-46828][SQL] Remove the invalid assertion of remote mode for spark sql shell
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 51cdf34226ed [SPARK-46828][SQL] Remove the invalid assertion of remote mode for spark sql shell 51cdf34226ed is described below commit 51cdf34226ed8d137ac1c8374cc2473dc4818bbf Author: Kent Yao AuthorDate: Wed Jan 24 22:17:16 2024 -0800 [SPARK-46828][SQL] Remove the invalid assertion of remote mode for spark sql shell ### What changes were proposed in this pull request? It is safe to clean up the read side code in SparkSQLCLIDriver as `org.apache.hadoop.hive.ql.session.SessionState.setIsHiveServerQuery` is never invoked. ### Why are the changes needed? code refactoring for the purpose of having more upgradable hive deps. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - build and run `bin/spark-sql` ``` Spark Web UI available at http://***:4040 Spark master: local[*], Application Id: local-1706087266338 spark-sql (default)> show tables; Time taken: 0.327 seconds spark-sql (default)> show databases; default Time taken: 0.161 seconds, Fetched 1 row(s) - CliSuite ### Was this patch authored or co-authored using generative AI tooling? no Closes #44868 from yaooqinn/SPARK-46828. 
Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- .../src/main/resources/error/error-classes.json| 5 --- .../spark/sql/errors/QueryExecutionErrors.scala| 6 .../sql/hive/thriftserver/SparkSQLCLIDriver.scala | 39 -- 3 files changed, 7 insertions(+), 43 deletions(-) diff --git a/common/utils/src/main/resources/error/error-classes.json b/common/utils/src/main/resources/error/error-classes.json index 6088300f8e64..1f3122a502c5 100644 --- a/common/utils/src/main/resources/error/error-classes.json +++ b/common/utils/src/main/resources/error/error-classes.json @@ -6201,11 +6201,6 @@ "Cannot create array with elements of data due to exceeding the limit elements for ArrayData. " ] }, - "_LEGACY_ERROR_TEMP_2178" : { -"message" : [ - "Remote operations not supported." -] - }, "_LEGACY_ERROR_TEMP_2179" : { "message" : [ "HiveServer2 Kerberos principal or keytab is not correctly configured." diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala index a3e905090bf3..69794517f917 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala @@ -1525,12 +1525,6 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase with ExecutionE cause = e) } - def remoteOperationsUnsupportedError(): SparkRuntimeException = { -new SparkRuntimeException( - errorClass = "_LEGACY_ERROR_TEMP_2178", - messageParameters = Map.empty) - } - def invalidKerberosConfigForHiveServer2Error(): Throwable = { new SparkException( errorClass = "_LEGACY_ERROR_TEMP_2179", diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala index e0a1a31a36f3..0d3538e30941 100644 --- 
a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala @@ -36,7 +36,6 @@ import org.apache.hadoop.hive.ql.Driver import org.apache.hadoop.hive.ql.processors._ import org.apache.hadoop.hive.ql.session.SessionState import org.apache.hadoop.security.{Credentials, UserGroupInformation} -import org.slf4j.LoggerFactory import sun.misc.{Signal, SignalHandler} import org.apache.spark.{ErrorMessageFormat, SparkConf, SparkThrowable, SparkThrowableHelper} @@ -45,7 +44,6 @@ import org.apache.spark.internal.Logging import org.apache.spark.sql.AnalysisException import org.apache.spark.sql.catalyst.analysis.FunctionRegistry import org.apache.spark.sql.catalyst.util.SQLKeywordUtils -import org.apache.spark.sql.errors.QueryExecutionErrors import org.apache.spark.sql.hive.client.HiveClientImpl import org.apache.spark.sql.hive.security.HiveDelegationTokenProvider import org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.closeHiveSessionStateIfStarted @@ -149,10 +147,6 @@ private[hive] object SparkSQLCLIDriver extends Logging { SparkSQLEnv.stop(exi
(spark) branch master updated: [SPARK-46846][CORE] Make `WorkerResourceInfo` extend `Serializable` explicitly
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b86053c430e5 [SPARK-46846][CORE] Make `WorkerResourceInfo` extend `Serializable` explicitly b86053c430e5 is described below commit b86053c430e5e1411467bdc1f0ddb337ca01649f Author: Dongjoon Hyun AuthorDate: Wed Jan 24 15:49:23 2024 -0800 [SPARK-46846][CORE] Make `WorkerResourceInfo` extend `Serializable` explicitly ### What changes were proposed in this pull request? This PR aims to make `WorkerResourceInfo` extend `Serializable` interface explicitly. - https://docs.oracle.com/en/java/javase/17/docs/specs/serialization/serial-arch.html > A Serializable class must do the following: > - Implement the `java.io.Serializable` interface ### Why are the changes needed? `WorkerInfo` extends `Serializable` and has `WorkerResourceInfo` as data. https://github.com/apache/spark/blob/1f23edfa84aa3318791d5fbbbae22d479a49134a/core/src/main/scala/org/apache/spark/deploy/master/WorkerInfo.scala#L49-L58 `WorkerInfo` itself has no data field, but inherits `ResourceAllocator` which has data. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44873 from dongjoon-hyun/SPARK-46846. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/deploy/master/WorkerInfo.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/master/WorkerInfo.scala b/core/src/main/scala/org/apache/spark/deploy/master/WorkerInfo.scala index bdc9e9c6106c..a20adcbddc24 100644 --- a/core/src/main/scala/org/apache/spark/deploy/master/WorkerInfo.scala +++ b/core/src/main/scala/org/apache/spark/deploy/master/WorkerInfo.scala @@ -25,7 +25,7 @@ import org.apache.spark.rpc.RpcEndpointRef import org.apache.spark.util.Utils private[spark] case class WorkerResourceInfo(name: String, addresses: Seq[String]) - extends ResourceAllocator { + extends Serializable with ResourceAllocator { override protected def resourceName = this.name override protected def resourceAddresses = this.addresses - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
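The rule this fix enforces is easy to demonstrate outside the JVM: a persistence engine serializes the whole worker object graph, so state inherited from a mixin must survive the round trip too. A minimal Python sketch, using `pickle` as a stand-in for Java serialization and illustrative class names mirroring the Scala ones:

```python
import pickle

class ResourceAllocator:
    """Stand-in for the Scala trait: it declares no constructor fields
    but still carries mutable allocation state that must be serialized."""
    def __init__(self):
        self.addresses_allocated = {}

class WorkerResourceInfo(ResourceAllocator):
    def __init__(self, name, addresses):
        super().__init__()
        self.name = name
        self.addresses = addresses

info = WorkerResourceInfo("gpu", ["0", "1"])
info.addresses_allocated["0"] = 1

# Round-trip through the serializer, as the master's persistence engine does.
restored = pickle.loads(pickle.dumps(info))
```

Unlike `pickle`, Java serialization refuses to touch a class that does not implement `java.io.Serializable`, which is why the explicit `extends Serializable` matters even though the enclosing `WorkerInfo` already declares it.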
(spark) branch master updated: [SPARK-46827][CORE] Make `RocksDBPersistenceEngine` to support a symbolic link
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1f23edfa84aa [SPARK-46827][CORE] Make `RocksDBPersistenceEngine` to support a symbolic link 1f23edfa84aa is described below commit 1f23edfa84aa3318791d5fbbbae22d479a49134a Author: Dongjoon Hyun AuthorDate: Wed Jan 24 07:35:14 2024 -0800 [SPARK-46827][CORE] Make `RocksDBPersistenceEngine` to support a symbolic link ### What changes were proposed in this pull request? This PR aims to make `RocksDBPersistenceEngine` to support a symbolic link location. ### Why are the changes needed? To be consistent with `FileSystemPersistenceEngine` which supports symbolic link locations. https://github.com/apache/spark/blob/7004dd9edcad32d34d0448df9498d32c444ab082/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala#L45-L50 ### Does this PR introduce _any_ user-facing change? No. This is a new feature at 4.0.0. ### How was this patch tested? Pass the CIs with a newly added test case. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44867 from dongjoon-hyun/SPARK-46827. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../spark/deploy/master/RocksDBPersistenceEngine.scala| 9 +++-- .../spark/deploy/master/PersistenceEngineSuite.scala | 15 +++ 2 files changed, 22 insertions(+), 2 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/master/RocksDBPersistenceEngine.scala b/core/src/main/scala/org/apache/spark/deploy/master/RocksDBPersistenceEngine.scala index 5c43dab4d066..8364dbd693b1 100644 --- a/core/src/main/scala/org/apache/spark/deploy/master/RocksDBPersistenceEngine.scala +++ b/core/src/main/scala/org/apache/spark/deploy/master/RocksDBPersistenceEngine.scala @@ -19,7 +19,7 @@ package org.apache.spark.deploy.master import java.nio.ByteBuffer import java.nio.charset.StandardCharsets.UTF_8 -import java.nio.file.{Files, Paths} +import java.nio.file.{FileAlreadyExistsException, Files, Paths} import scala.collection.mutable.ArrayBuffer import scala.reflect.ClassTag @@ -43,7 +43,12 @@ private[master] class RocksDBPersistenceEngine( RocksDB.loadLibrary() - private val path = Files.createDirectories(Paths.get(dir)) + private val path = try { +Files.createDirectories(Paths.get(dir)) + } catch { +case _: FileAlreadyExistsException if Files.isSymbolicLink(Paths.get(dir)) => + Files.createDirectories(Paths.get(dir).toRealPath()) + } /** * Use full filter. 
diff --git a/core/src/test/scala/org/apache/spark/deploy/master/PersistenceEngineSuite.scala b/core/src/test/scala/org/apache/spark/deploy/master/PersistenceEngineSuite.scala index b977a1142444..01b7e46eb2a8 100644 --- a/core/src/test/scala/org/apache/spark/deploy/master/PersistenceEngineSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/master/PersistenceEngineSuite.scala @@ -88,6 +88,21 @@ class PersistenceEngineSuite extends SparkFunSuite { } } + test("SPARK-46827: RocksDBPersistenceEngine with a symbolic link") { +withTempDir { dir => + val target = Paths.get(dir.getAbsolutePath(), "target") + val link = Paths.get(dir.getAbsolutePath(), "symbolic_link"); + + Files.createDirectories(target) + Files.createSymbolicLink(link, target); + + val conf = new SparkConf() + testPersistenceEngine(conf, serializer => +new RocksDBPersistenceEngine(link.toAbsolutePath.toString, serializer) + ) +} + } + test("SPARK-46205: Support KryoSerializer in FileSystemPersistenceEngine") { withTempDir { dir => val conf = new SparkConf() - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
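The symlink-tolerant directory creation above can be sketched in Python: `Files.createDirectories` succeeds on an existing real directory but throws `FileAlreadyExistsException` when the path is a symbolic link, so the fix resolves the link to its target first. The function name below is illustrative, not Spark API:

```python
import os
import tempfile

def ensure_dir(path: str) -> str:
    # Mirror the fix: if the path is a symbolic link, create (or accept)
    # the directory it points to instead of failing on the link itself.
    if os.path.islink(path):
        path = os.path.realpath(path)
    os.makedirs(path, exist_ok=True)
    return path

# Same setup as the new PersistenceEngineSuite test: a real "target"
# directory and a "symbolic_link" pointing at it.
base = tempfile.mkdtemp()
target = os.path.join(base, "target")
link = os.path.join(base, "symbolic_link")
os.makedirs(target)
os.symlink(target, link)

resolved = ensure_dir(link)  # resolves to the real "target" directory
```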
(spark) branch master updated: [SPARK-46823][CONNECT][PYTHON] `LocalDataToArrowConversion` should check the nullability
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1642e928478c [SPARK-46823][CONNECT][PYTHON] `LocalDataToArrowConversion` should check the nullability 1642e928478c is described below commit 1642e928478c8c20bae5203ecf2e4d659aca7692 Author: Ruifeng Zheng AuthorDate: Wed Jan 24 00:43:41 2024 -0800 [SPARK-46823][CONNECT][PYTHON] `LocalDataToArrowConversion` should check the nullability ### What changes were proposed in this pull request? `LocalDataToArrowConversion` should check the nullability ### Why are the changes needed? this check was missing ### Does this PR introduce _any_ user-facing change? yes ``` data = [("asd", None)] schema = StructType( [ StructField("name", StringType(), nullable=True), StructField("age", IntegerType(), nullable=False), ] ) ``` before: ``` In [3]: df = spark.createDataFrame([("asd", None)], schema) In [4]: df Out[4]: 24/01/24 12:08:28 ERROR ErrorUtils: Spark Connect RPC error during: analyze. UserId: ruifeng.zheng. SessionId: cd692bb1-d503-4043-a9db-d29cb5c16517. 
java.lang.IllegalStateException: Value at index is null at org.apache.arrow.vector.IntVector.get(IntVector.java:107) at org.apache.spark.sql.vectorized.ArrowColumnVector$IntAccessor.getInt(ArrowColumnVector.java:338) at org.apache.spark.sql.vectorized.ArrowColumnVector.getInt(ArrowColumnVector.java:88) at org.apache.spark.sql.vectorized.ColumnarBatchRow.getInt(ColumnarBatchRow.java:109) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at scala.collection.Iterator$$anon$9.next(Iterator.scala:584) at scala.collection.Iterator$$anon$9.next(Iterator.scala:584) at scala.collection.Iterator$$anon$9.next(Iterator.scala:584) at scala.collection.immutable.List.prependedAll(List.scala:153) at scala.collection.immutable.List$.from(List.scala:684) at scala.collection.immutable.List$.from(List.scala:681) at scala.collection.SeqFactory$Delegate.from(Factory.scala:306) at scala.collection.immutable.Seq$.from(Seq.scala:42) at scala.collection.IterableOnceOps.toSeq(IterableOnce.scala:1326) at scala.collection.IterableOnceOps.toSeq$(IterableOnce.scala:1326) at scala.collection.AbstractIterator.toSeq(Iterator.scala:1300) at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformLocalRelation(SparkConnectPlanner.scala:1239) at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:139) at org.apache.spark.sql.connect.service.SparkConnectAnalyzeHandler.process(SparkConnectAnalyzeHandler.scala:59) at org.apache.spark.sql.connect.service.SparkConnectAnalyzeHandler.$anonfun$handle$1(SparkConnectAnalyzeHandler.scala:43) at org.apache.spark.sql.connect.service.SparkConnectAnalyzeHandler.$anonfun$handle$1$adapted(SparkConnectAnalyzeHandler.scala:42) at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$2(SessionHolder.scala:289) at 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:918) at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$1(SessionHolder.scala:289) at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94) at org.apache.spark.sql.artifact.ArtifactManager.$anonfun$withResources$1(ArtifactManager.scala:80) at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:182) at org.apache.spark.sql.artifact.ArtifactManager.withResources(ArtifactManager.scala:79) at org.apache.spark.sql.connect.service.SessionHolder.withSession(SessionHolder.scala:288) at org.apache.spark.sql.connect.service.SparkConnectAnalyzeHandler.handle(SparkConnectAnalyzeHandler.scala:42) at org.apache.spark.sql.connect.service.SparkConnectService.analyzePlan(SparkConnectService.scala:95) at org.apache.spark.connect.proto.SparkConnectServiceGrpc$MethodHandlers.invoke(SparkConnectServiceGrpc.java:907) at org.sparkproject.connect.grpc.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182) at org.sparkproject.conn
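The missing check amounts to validating local rows against the declared schema on the client before Arrow conversion, so the error surfaces as a clear message instead of the server-side `IllegalStateException` above. A sketch with illustrative names (not the actual `LocalDataToArrowConversion` code); a field is modeled as a `(name, nullable)` pair:

```python
def check_nullability(rows, fields):
    """Raise early if a None appears in a field declared non-nullable."""
    for row in rows:
        for value, (name, nullable) in zip(row, fields):
            if value is None and not nullable:
                raise ValueError(f"field '{name}' is non-nullable but got None")

fields = [("name", True), ("age", False)]

# ("asd", None) violates the non-nullable "age" field and is rejected up front.
try:
    check_nullability([("asd", None)], fields)
    rejected = False
except ValueError:
    rejected = True
```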
(spark) branch master updated: [SPARK-46822][SQL] Respect spark.sql.legacy.charVarcharAsString when casting jdbc type to catalyst type in jdbc
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 0a470430c81c [SPARK-46822][SQL] Respect spark.sql.legacy.charVarcharAsString when casting jdbc type to catalyst type in jdbc 0a470430c81c is described below commit 0a470430c81ca2d46020f863c45e96227fbdd07c Author: Kent Yao AuthorDate: Tue Jan 23 22:57:02 2024 -0800 [SPARK-46822][SQL] Respect spark.sql.legacy.charVarcharAsString when casting jdbc type to catalyst type in jdbc ### What changes were proposed in this pull request? This PR makes `spark.sql.legacy.charVarcharAsString` be activated in `JdbcUtils.getCatalystType`. ### Why are the changes needed? For cases like CTAS, which respects schema from the query field can restore their behavior to create tables with strings instead of char/varchar. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #44860 from yaooqinn/SPARK-46822. 
Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- .../sql/execution/datasources/jdbc/JdbcUtils.scala | 2 ++ .../v2/jdbc/JDBCTableCatalogSuite.scala| 22 ++ 2 files changed, 24 insertions(+) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala index 9fb10f42164f..89ac615a3097 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala @@ -185,6 +185,7 @@ object JdbcUtils extends Logging with SQLConfHelper { case java.sql.Types.BIT => BooleanType // @see JdbcDialect for quirks case java.sql.Types.BLOB => BinaryType case java.sql.Types.BOOLEAN => BooleanType +case java.sql.Types.CHAR if conf.charVarcharAsString => StringType case java.sql.Types.CHAR => CharType(precision) case java.sql.Types.CLOB => StringType case java.sql.Types.DATE => DateType @@ -214,6 +215,7 @@ object JdbcUtils extends Logging with SQLConfHelper { case java.sql.Types.TIMESTAMP => TimestampType case java.sql.Types.TINYINT => IntegerType case java.sql.Types.VARBINARY => BinaryType +case java.sql.Types.VARCHAR if conf.charVarcharAsString => StringType case java.sql.Types.VARCHAR => VarcharType(precision) case _ => // For unmatched types: diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala index 5408d434fced..0088fab7d209 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala @@ -608,6 +608,28 @@ class JDBCTableCatalogSuite extends QueryTest with SharedSparkSession { } } + test("SPARK-46822: 
Respect charVarcharAsString when casting jdbc type to catalyst type in jdbc") { +try { + withConnection( +_.prepareStatement("""CREATE TABLE "test"."char_tbl" (ID CHAR(5), deptno VARCHAR(10))""") +.executeUpdate()) + withSQLConf(SQLConf.LEGACY_CHAR_VARCHAR_AS_STRING.key -> "true") { +val expected = new StructType() + .add("ID", StringType, true, defaultMetadata) + .add("DEPTNO", StringType, true, defaultMetadata) +assert(sql(s"SELECT * FROM h2.test.char_tbl").schema === expected) + } + val expected = new StructType() +.add("ID", CharType(5), true, defaultMetadata) +.add("DEPTNO", VarcharType(10), true, defaultMetadata) + val replaced = CharVarcharUtils.replaceCharVarcharWithStringInSchema(expected) + assert(sql(s"SELECT * FROM h2.test.char_tbl").schema === replaced) +} finally { + withConnection( +_.prepareStatement("""DROP TABLE IF EXISTS "test"."char_tbl"""").executeUpdate()) +} + } + test("SPARK-45449: Cache Invalidation Issue with JDBC Table") { withTable("h2.test.cache_t") { withConnection { conn => - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
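The two new `case` arms reduce to a simple rule: when the legacy flag is set, CHAR/VARCHAR columns surface as plain strings. An illustrative Python sketch of that mapping (the type names are simplified stand-ins for `java.sql.Types` codes and Catalyst types):

```python
def to_catalyst_type(jdbc_type: str, precision: int,
                     char_varchar_as_string: bool) -> str:
    # With the legacy flag on, both CHAR and VARCHAR collapse to "string",
    # restoring the pre-char/varchar behavior that CTAS callers relied on.
    if jdbc_type == "CHAR":
        return "string" if char_varchar_as_string else f"char({precision})"
    if jdbc_type == "VARCHAR":
        return "string" if char_varchar_as_string else f"varchar({precision})"
    raise NotImplementedError(jdbc_type)

# The H2 table from the new test: ID CHAR(5), DEPTNO VARCHAR(10).
legacy = [to_catalyst_type(t, p, True) for t, p in [("CHAR", 5), ("VARCHAR", 10)]]
strict = [to_catalyst_type(t, p, False) for t, p in [("CHAR", 5), ("VARCHAR", 10)]]
```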
(spark) branch branch-3.4 updated: [SPARK-46817][CORE] Fix `spark-daemon.sh` usage by adding `decommission` command
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new e56bd97c04c1 [SPARK-46817][CORE] Fix `spark-daemon.sh` usage by adding `decommission` command e56bd97c04c1 is described below commit e56bd97c04c184104046e51e6759e616c86683fa Author: Dongjoon Hyun AuthorDate: Tue Jan 23 16:38:45 2024 -0800 [SPARK-46817][CORE] Fix `spark-daemon.sh` usage by adding `decommission` command ### What changes were proposed in this pull request? This PR aims to fix `spark-daemon.sh` usage by adding `decommission` command. ### Why are the changes needed? This was missed when SPARK-20628 added `decommission` command at Apache Spark 3.1.0. The command has been used like the following. https://github.com/apache/spark/blob/0356ac00947282b1a0885ad7eaae1e25e43671fe/sbin/decommission-worker.sh#L41 ### Does this PR introduce _any_ user-facing change? No, this is only a change on usage message. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44856 from dongjoon-hyun/SPARK-46817. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit 00a92d328576c39b04cfd0fdd8a30c5a9bc37e36) Signed-off-by: Dongjoon Hyun --- sbin/spark-daemon.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sbin/spark-daemon.sh b/sbin/spark-daemon.sh index 3cfd5acfe2b5..28d205f03e0f 100755 --- a/sbin/spark-daemon.sh +++ b/sbin/spark-daemon.sh @@ -31,7 +31,7 @@ # SPARK_NO_DAEMONIZE If set, will run the proposed command in the foreground. It will not output a PID file. 
## -usage="Usage: spark-daemon.sh [--config <conf-dir>] (start|stop|submit|status) <spark-command> <spark-instance-number> <args...>" +usage="Usage: spark-daemon.sh [--config <conf-dir>] (start|stop|submit|decommission|status) <spark-command> <spark-instance-number> <args...>" # if no args specified, show usage if [ $# -le 1 ]; then
(spark) branch branch-3.5 updated: [SPARK-46817][CORE] Fix `spark-daemon.sh` usage by adding `decommission` command
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new be7f1e9979c3 [SPARK-46817][CORE] Fix `spark-daemon.sh` usage by adding `decommission` command be7f1e9979c3 is described below commit be7f1e9979c38b1358b0af2b358bacb0bd523c80 Author: Dongjoon Hyun AuthorDate: Tue Jan 23 16:38:45 2024 -0800 [SPARK-46817][CORE] Fix `spark-daemon.sh` usage by adding `decommission` command ### What changes were proposed in this pull request? This PR aims to fix `spark-daemon.sh` usage by adding `decommission` command. ### Why are the changes needed? This was missed when SPARK-20628 added `decommission` command at Apache Spark 3.1.0. The command has been used like the following. https://github.com/apache/spark/blob/0356ac00947282b1a0885ad7eaae1e25e43671fe/sbin/decommission-worker.sh#L41 ### Does this PR introduce _any_ user-facing change? No, this is only a change on usage message. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44856 from dongjoon-hyun/SPARK-46817. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit 00a92d328576c39b04cfd0fdd8a30c5a9bc37e36) Signed-off-by: Dongjoon Hyun --- sbin/spark-daemon.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sbin/spark-daemon.sh b/sbin/spark-daemon.sh index 3cfd5acfe2b5..28d205f03e0f 100755 --- a/sbin/spark-daemon.sh +++ b/sbin/spark-daemon.sh @@ -31,7 +31,7 @@ # SPARK_NO_DAEMONIZE If set, will run the proposed command in the foreground. It will not output a PID file. 
## -usage="Usage: spark-daemon.sh [--config <conf-dir>] (start|stop|submit|status) <spark-command> <spark-instance-number> <args...>" +usage="Usage: spark-daemon.sh [--config <conf-dir>] (start|stop|submit|decommission|status) <spark-command> <spark-instance-number> <args...>" # if no args specified, show usage if [ $# -le 1 ]; then
(spark) branch master updated: [SPARK-46817][CORE] Fix `spark-daemon.sh` usage by adding `decommission` command
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 00a92d328576 [SPARK-46817][CORE] Fix `spark-daemon.sh` usage by adding `decommission` command 00a92d328576 is described below commit 00a92d328576c39b04cfd0fdd8a30c5a9bc37e36 Author: Dongjoon Hyun AuthorDate: Tue Jan 23 16:38:45 2024 -0800 [SPARK-46817][CORE] Fix `spark-daemon.sh` usage by adding `decommission` command ### What changes were proposed in this pull request? This PR aims to fix `spark-daemon.sh` usage by adding `decommission` command. ### Why are the changes needed? This was missed when SPARK-20628 added `decommission` command at Apache Spark 3.1.0. The command has been used like the following. https://github.com/apache/spark/blob/0356ac00947282b1a0885ad7eaae1e25e43671fe/sbin/decommission-worker.sh#L41 ### Does this PR introduce _any_ user-facing change? No, this is only a change on usage message. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44856 from dongjoon-hyun/SPARK-46817. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- sbin/spark-daemon.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sbin/spark-daemon.sh b/sbin/spark-daemon.sh index 3cfd5acfe2b5..28d205f03e0f 100755 --- a/sbin/spark-daemon.sh +++ b/sbin/spark-daemon.sh @@ -31,7 +31,7 @@ # SPARK_NO_DAEMONIZE If set, will run the proposed command in the foreground. It will not output a PID file. ## -usage="Usage: spark-daemon.sh [--config <conf-dir>] (start|stop|submit|status) <spark-command> <spark-instance-number> <args...>" +usage="Usage: spark-daemon.sh [--config <conf-dir>] (start|stop|submit|decommission|status) <spark-command> <spark-instance-number> <args...>" # if no args specified, show usage if [ $# -le 1 ]; then
(spark) branch master updated: [SPARK-40876][SQL] Widening type promotion from integers to decimal in Parquet vectorized reader
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 0356ac009472 [SPARK-40876][SQL] Widening type promotion from integers to decimal in Parquet vectorized reader 0356ac009472 is described below commit 0356ac00947282b1a0885ad7eaae1e25e43671fe Author: Johan Lasperas AuthorDate: Tue Jan 23 12:37:18 2024 -0800 [SPARK-40876][SQL] Widening type promotion from integers to decimal in Parquet vectorized reader ### What changes were proposed in this pull request? This is a follow-up from https://github.com/apache/spark/pull/44368 and https://github.com/apache/spark/pull/44513, implementing an additional type promotion from integers to decimals in the parquet vectorized reader, bringing it at parity with the non-vectorized reader in that regard. ### Why are the changes needed? This allows reading parquet files that have different schemas and mix decimals and integers - e.g reading files containing either `Decimal(15, 2)` and `INT32` as `Decimal(15, 2)` - as long as the requested decimal type is large enough to accommodate the integer values without precision loss. ### Does this PR introduce _any_ user-facing change? Yes, the following now succeeds when using the vectorized Parquet reader: ``` Seq(20).toDF($"a".cast(IntegerType)).write.parquet(path) spark.read.schema("a decimal(12, 0)").parquet(path).collect() ``` It failed before with the vectorized reader and succeeded with the non-vectorized reader. ### How was this patch tested? - Tests added to `ParquetWideningTypeSuite` - Updated relevant `ParquetQuerySuite` test. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44803 from johanl-db/SPARK-40876-widening-promotion-int-to-decimal. 
Authored-by: Johan Lasperas Signed-off-by: Dongjoon Hyun --- .../parquet/ParquetVectorUpdaterFactory.java | 39 ++- .../parquet/VectorizedColumnReader.java| 7 +- .../datasources/parquet/ParquetQuerySuite.scala| 8 +- .../parquet/ParquetTypeWideningSuite.scala | 123 ++--- 4 files changed, 150 insertions(+), 27 deletions(-) diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java index 0d8713b58cec..f369688597b9 100644 --- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java +++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java @@ -1407,7 +1407,11 @@ public class ParquetVectorUpdaterFactory { super(sparkType); LogicalTypeAnnotation typeAnnotation = descriptor.getPrimitiveType().getLogicalTypeAnnotation(); - this.parquetScale = ((DecimalLogicalTypeAnnotation) typeAnnotation).getScale(); + if (typeAnnotation instanceof DecimalLogicalTypeAnnotation) { +this.parquetScale = ((DecimalLogicalTypeAnnotation) typeAnnotation).getScale(); + } else { +this.parquetScale = 0; + } } @Override @@ -1436,14 +1440,18 @@ public class ParquetVectorUpdaterFactory { } } -private static class LongToDecimalUpdater extends DecimalUpdater { + private static class LongToDecimalUpdater extends DecimalUpdater { private final int parquetScale; - LongToDecimalUpdater(ColumnDescriptor descriptor, DecimalType sparkType) { +LongToDecimalUpdater(ColumnDescriptor descriptor, DecimalType sparkType) { super(sparkType); LogicalTypeAnnotation typeAnnotation = descriptor.getPrimitiveType().getLogicalTypeAnnotation(); - this.parquetScale = ((DecimalLogicalTypeAnnotation) typeAnnotation).getScale(); + if (typeAnnotation instanceof DecimalLogicalTypeAnnotation) { +this.parquetScale = ((DecimalLogicalTypeAnnotation) 
typeAnnotation).getScale(); + } else { +this.parquetScale = 0; + } } @Override @@ -1641,6 +1649,12 @@ private static class FixedLenByteArrayToDecimalUpdater extends DecimalUpdater { return typeAnnotation instanceof DateLogicalTypeAnnotation; } + private static boolean isSignedIntAnnotation(LogicalTypeAnnotation typeAnnotation) { +if (!(typeAnnotation instanceof IntLogicalTypeAnnotation)) return false; +IntLogicalTypeAnnotation intAnnotation = (IntLogicalTypeAnnotation) typeAnnotation; +return intAnnotation.isSigned(); + } + private static boolean isDecimalTypeMatched(ColumnDescriptor descriptor, DataType dt) { DecimalType requestedType = (DecimalType) dt; LogicalTypeAnnot
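The widening itself is mechanically simple: an integer value v read as `Decimal(p, s)` becomes the unscaled value v·10^s, and the promotion is lossless only if that unscaled value fits in p digits. A Python sketch of the invariant (not the actual updater code; the function name is illustrative):

```python
from decimal import Decimal

def widen_int_to_decimal(value: int, precision: int, scale: int) -> Decimal:
    # Rescale the integer to the target scale and check that it still fits
    # within `precision` digits, i.e. the promotion loses no precision.
    unscaled = value * 10 ** scale
    if len(str(abs(unscaled))) > precision:
        raise OverflowError(f"{value} does not fit Decimal({precision}, {scale})")
    return Decimal(unscaled).scaleb(-scale)

# e.g. an INT32 column holding 20, read with schema "a decimal(12, 0)"
widened = widen_int_to_decimal(20, 12, 0)
```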
(spark) branch branch-3.4 updated: [SPARK-46794][SQL] Remove subqueries from LogicalRDD constraints
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 894faabbffb4 [SPARK-46794][SQL] Remove subqueries from LogicalRDD constraints 894faabbffb4 is described below commit 894faabbffb4a7075ade6b5e830d76aa4ae7542f Author: Tom van Bussel AuthorDate: Tue Jan 23 08:45:32 2024 -0800 [SPARK-46794][SQL] Remove subqueries from LogicalRDD constraints ### What changes were proposed in this pull request? This PR modifies `LogicalRDD` to filter out all subqueries from its `constraints`. ### Why are the changes needed? Fixes a correctness bug. Spark can produce incorrect results when using a checkpointed `DataFrame` with a filter containing a scalar subquery. This subquery is included in the constraints of the resulting `LogicalRDD`, and may then be propagated as a filter when joining with the checkpointed `DataFrame`. This causes the subquery to be evaluated twice: once during checkpointing and once while evaluating the query. These two subquery evaluations may return different results, e.g. when t [...] ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a test to `DataFrameSuite`. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44833 from tomvanbussel/SPARK-46794. 
Authored-by: Tom van Bussel Signed-off-by: Dongjoon Hyun (cherry picked from commit d26e871136e0c6e1f84a25978319733a516b7b2e) Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/sql/execution/ExistingRDD.scala | 7 +++ .../test/scala/org/apache/spark/sql/DataFrameSuite.scala | 16 +++- 2 files changed, 22 insertions(+), 1 deletion(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala index 3dcf0efaadd8..3b49abcb1a86 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala @@ -150,6 +150,13 @@ case class LogicalRDD( } override lazy val constraints: ExpressionSet = originConstraints.getOrElse(ExpressionSet()) +// Subqueries can have non-deterministic results even when they only contain deterministic +// expressions (e.g. consider a LIMIT 1 subquery without an ORDER BY). Propagating predicates +// containing a subquery causes the subquery to be executed twice (as the result of the subquery +// in the checkpoint computation cannot be reused), which could result in incorrect results. +// Therefore we assume that all subqueries are non-deterministic, and we do not expose any +// constraints that contain a subquery. 
+.filterNot(SubqueryExpression.hasSubquery) } object LogicalRDD extends Logging { diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala index 9ddb4abe98b2..a9f69ab28a17 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala @@ -35,7 +35,7 @@ import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd} import org.apache.spark.sql.catalyst.{InternalRow, TableIdentifier} import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder} -import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference, EqualTo, ExpressionSet, GreaterThan, Literal, PythonUDF, Uuid} +import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference, EqualTo, ExpressionSet, GreaterThan, Literal, PythonUDF, ScalarSubquery, Uuid} import org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, LeafNode, LocalRelation, LogicalPlan, OneRowRelation, Statistics} import org.apache.spark.sql.catalyst.util.DateTimeUtils @@ -2219,6 +2219,20 @@ class DataFrameSuite extends QueryTest assert(newConstraints === newExpectedConstraints) } + test("SPARK-46794: exclude subqueries from LogicalRDD constraints") { +withTempDir { checkpointDir => + val subquery = +new Column(ScalarSubquery(spark.range(10).selectExpr("max(id)").logicalPlan)) + val df = spark.range(1000).filter($"id" === subquery) + assert(df.logicalPlan.constraints.exists(_.exists(_.isInstanceOf[ScalarSubquery]))) + + spark.sparkContext.setCheckpointDir(checkpointDir.getAbsolutePath) + val checkpointedDf = df.checkpoint() + assert(!checkpointedDf.logicalPlan.constraints +.exists(_.exists(_.isInstanceOf[ScalarSubquery]))) +} + } + 
test("SPARK-10656: completely support special chars") { val df = Seq(1 -> "a").toDF("i_
(spark) branch branch-3.5 updated: [SPARK-46794][SQL] Remove subqueries from LogicalRDD constraints
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 05f7aa596c7b [SPARK-46794][SQL] Remove subqueries from LogicalRDD constraints 05f7aa596c7b is described below commit 05f7aa596c7b1c05704abfad94b1b1d3085c530e Author: Tom van Bussel AuthorDate: Tue Jan 23 08:45:32 2024 -0800 [SPARK-46794][SQL] Remove subqueries from LogicalRDD constraints ### What changes were proposed in this pull request? This PR modifies `LogicalRDD` to filter out all subqueries from its `constraints`. ### Why are the changes needed? Fixes a correctness bug. Spark can produce incorrect results when using a checkpointed `DataFrame` with a filter containing a scalar subquery. This subquery is included in the constraints of the resulting `LogicalRDD`, and may then be propagated as a filter when joining with the checkpointed `DataFrame`. This causes the subquery to be evaluated twice: once during checkpointing and once while evaluating the query. These two subquery evaluations may return different results, e.g. when t [...] ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a test to `DataFrameSuite`. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44833 from tomvanbussel/SPARK-46794. 
Authored-by: Tom van Bussel Signed-off-by: Dongjoon Hyun (cherry picked from commit d26e871136e0c6e1f84a25978319733a516b7b2e) Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/sql/execution/ExistingRDD.scala | 7 +++ .../test/scala/org/apache/spark/sql/DataFrameSuite.scala | 16 +++- 2 files changed, 22 insertions(+), 1 deletion(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala index 3dcf0efaadd8..3b49abcb1a86 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala @@ -150,6 +150,13 @@ case class LogicalRDD( } override lazy val constraints: ExpressionSet = originConstraints.getOrElse(ExpressionSet()) +// Subqueries can have non-deterministic results even when they only contain deterministic +// expressions (e.g. consider a LIMIT 1 subquery without an ORDER BY). Propagating predicates +// containing a subquery causes the subquery to be executed twice (as the result of the subquery +// in the checkpoint computation cannot be reused), which could result in incorrect results. +// Therefore we assume that all subqueries are non-deterministic, and we do not expose any +// constraints that contain a subquery. 
+.filterNot(SubqueryExpression.hasSubquery) } object LogicalRDD extends Logging { diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala index 2eba9f181098..002719f06896 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala @@ -35,7 +35,7 @@ import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd} import org.apache.spark.sql.catalyst.{InternalRow, TableIdentifier} import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder -import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference, EqualTo, ExpressionSet, GreaterThan, Literal, PythonUDF, Uuid} +import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference, EqualTo, ExpressionSet, GreaterThan, Literal, PythonUDF, ScalarSubquery, Uuid} import org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation import org.apache.spark.sql.catalyst.parser.ParseException import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, LeafNode, LocalRelation, LogicalPlan, OneRowRelation, Statistics} @@ -2258,6 +2258,20 @@ class DataFrameSuite extends QueryTest assert(newConstraints === newExpectedConstraints) } + test("SPARK-46794: exclude subqueries from LogicalRDD constraints") { +withTempDir { checkpointDir => + val subquery = +new Column(ScalarSubquery(spark.range(10).selectExpr("max(id)").logicalPlan)) + val df = spark.range(1000).filter($"id" === subquery) + assert(df.logicalPlan.constraints.exists(_.exists(_.isInstanceOf[ScalarSubquery]))) + + spark.sparkContext.setCheckpointDir(checkpointDir.getAbsolutePath) + val checkpointedDf = df.checkpoint() + assert(!checkpointedDf.logicalPlan.constraints +.exists(_.exists(_.isInstanceOf[ScalarSubquery]))) +} + } + test("SPARK-10656: 
completely support special chars") { val df = Seq(1 -> "a").toDF("i_
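The essence of the SPARK-46794 fix can be sketched outside Spark. The classes below are simplified stand-ins for Catalyst's expression tree, not Spark's real `SubqueryExpression`/`ExpressionSet` API: any predicate whose tree contains a subquery is treated as non-deterministic and dropped from the constraints a checkpointed plan exposes, so it can never be re-executed as a propagated filter.

```python
# Simplified stand-ins for Catalyst expression nodes; NOT Spark's real classes.
class Expression:
    def __init__(self, children=()):
        self.children = list(children)

class Attribute(Expression):
    pass

class EqualTo(Expression):
    pass

class ScalarSubquery(Expression):
    pass

def has_subquery(expr):
    # Analogue of SubqueryExpression.hasSubquery: does any node in the
    # expression tree contain a subquery?
    if isinstance(expr, ScalarSubquery):
        return True
    return any(has_subquery(c) for c in expr.children)

def checkpointed_constraints(constraints):
    # Analogue of the fixed LogicalRDD.constraints: drop every predicate that
    # contains a subquery, since re-evaluating the subquery after the
    # checkpoint may produce a different result than the checkpoint saw.
    return [c for c in constraints if not has_subquery(c)]

plain = EqualTo([Attribute(), Attribute()])
with_sub = EqualTo([Attribute(), ScalarSubquery()])
kept = checkpointed_constraints([plain, with_sub])  # only `plain` survives
```

This mirrors the added test: before the fix the checkpointed plan's constraints contained a `ScalarSubquery`; after the fix they do not.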
(spark) branch master updated (8ab69992584a -> d26e871136e0)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 8ab69992584a [SPARK-46772][SQL][TESTS] Benchmarking Avro with Compression Codecs add d26e871136e0 [SPARK-46794][SQL] Remove subqueries from LogicalRDD constraints No new revisions were added by this update. Summary of changes: .../org/apache/spark/sql/execution/ExistingRDD.scala | 7 +++ .../test/scala/org/apache/spark/sql/DataFrameSuite.scala | 16 +++- 2 files changed, 22 insertions(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46772][SQL][TESTS] Benchmarking Avro with Compression Codecs
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8ab69992584a [SPARK-46772][SQL][TESTS] Benchmarking Avro with Compression Codecs 8ab69992584a is described below commit 8ab69992584aa68e882c4a4aa4863049e6a58e7e Author: Kent Yao AuthorDate: Tue Jan 23 08:06:21 2024 -0800 [SPARK-46772][SQL][TESTS] Benchmarking Avro with Compression Codecs ### What changes were proposed in this pull request? This PR improves AvroWriteBenchmark by adding benchmarks with codec and their extra functionalities. - Avro compression with different codec - Avro deflate/xz/zstandard with different levels - buffer pool if zstandard ### Why are the changes needed? performance observation. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? connector/avro/src/test/scala/org/apache/spark/sql/execution/benchmark/AvroWriteBenchmark.scala ### Was this patch authored or co-authored using generative AI tooling? no Closes #44849 from yaooqinn/SPARK-46772. 
Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- .../AvroWriteBenchmark-jdk21-results.txt | 58 ++ .../avro/benchmarks/AvroWriteBenchmark-results.txt | 58 ++ .../execution/benchmark/AvroWriteBenchmark.scala | 52 --- 3 files changed, 143 insertions(+), 25 deletions(-) diff --git a/connector/avro/benchmarks/AvroWriteBenchmark-jdk21-results.txt b/connector/avro/benchmarks/AvroWriteBenchmark-jdk21-results.txt index f3e1dfa39829..86c6b6647f2f 100644 --- a/connector/avro/benchmarks/AvroWriteBenchmark-jdk21-results.txt +++ b/connector/avro/benchmarks/AvroWriteBenchmark-jdk21-results.txt @@ -1,16 +1,56 @@ -OpenJDK 64-Bit Server VM 21.0.1+12-LTS on Linux 5.15.0-1053-azure +OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 5.15.0-1053-azure AMD EPYC 7763 64-Core Processor Avro writer benchmark:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -Output Single Int Column 1389 1404 21 11.3 88.3 1.0X -Output Single Double Column1522 1523 1 10.3 96.8 0.9X -Output Int and String Column 3398 3400 3 4.6 216.0 0.4X -Output Partitions 2855 2874 27 5.5 181.5 0.5X -Output Buckets 3857 3903 66 4.1 245.2 0.4X +Output Single Int Column 1433 1505 101 11.0 91.1 1.0X +Output Single Double Column1467 1487 28 10.7 93.3 1.0X +Output Int and String Column 3187 3203 23 4.9 202.6 0.4X +Output Partitions 2759 2796 52 5.7 175.4 0.5X +Output Buckets 3760 3767 9 4.2 239.1 0.4X -OpenJDK 64-Bit Server VM 21.0.1+12-LTS on Linux 5.15.0-1053-azure +OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 5.15.0-1053-azure AMD EPYC 7763 64-Core Processor -Write wide rows into 20 files:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative +Avro compression with different codec:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -Write wide rows 22729 22774 63 0.0 45458.0 1.0X +BZIP2: 116001 116248 349 0.0 1160008.1 1.0X +DEFLATE: 6867 6870 4 0.0 68672.5 16.9X +UNCOMPRESSED: 5339 5354 21 0.0 53388.4 21.7X +SNAPPY:5077 5096 28 0.0 50769.3 22.8X +XZ: 61387 61501 161 0.0 
613871.9 1.9X +ZSTANDARD: 5333 5349 23
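The shape of the new codec benchmark can be approximated with Python's standard-library codecs standing in for Avro's (zlib for deflate, bz2 for bzip2, lzma for xz); the payload and the `best_time_ms` helper are illustrative, not the benchmark's actual harness.

```python
import bz2
import lzma
import time
import zlib

PAYLOAD = b"spark avro write benchmark " * 2000

# Stdlib stand-ins for Avro's codecs; UNCOMPRESSED is the identity baseline.
CODECS = {
    "UNCOMPRESSED": lambda data: data,
    "DEFLATE": lambda data: zlib.compress(data, 6),
    "BZIP2": bz2.compress,
    "XZ": lzma.compress,
}

def best_time_ms(fn, data, runs=3):
    # Best-of-N wall time in milliseconds, mirroring the "Best Time(ms)" column.
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        fn(data)
        best = min(best, time.perf_counter() - start)
    return best * 1000.0

results = {name: best_time_ms(fn, PAYLOAD) for name, fn in CODECS.items()}
```

As in the checked-in results, the heavier codecs (bzip2, xz) trade much longer write times for better compression than deflate or no compression.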
(spark) branch master updated: [SPARK-46718][BUILD] Upgrade Arrow to 15.0.0
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8a7609f1cb2d [SPARK-46718][BUILD] Upgrade Arrow to 15.0.0 8a7609f1cb2d is described below commit 8a7609f1cb2dd92ee30ec8172a1c1501d5810dae Author: yangjie01 AuthorDate: Mon Jan 22 21:25:21 2024 -0800 [SPARK-46718][BUILD] Upgrade Arrow to 15.0.0 ### What changes were proposed in this pull request? This pr aims to upgrade Arrow from 14.0.2 to 15.0.0, this version fixes the compatibility issue with Netty 4.1.104.Final(GH-39265). Additionally, since the `arrow-vector` module uses `eclipse-collections` to replace `netty-common` as a compile-level dependency, Apache Spark has added a dependency on `eclipse-collections` after upgrading to use Arrow 15.0.0. ### Why are the changes needed? The new version brings the following major changes: Bug Fixes GH-34610 - [Java] Fix valueCount and field name when loading/transferring NullVector GH-38242 - [Java] Fix incorrect internal struct accounting for DenseUnionVector#getBufferSizeFor GH-38254 - [Java] Add reusable buffer getters to char/binary vectors GH-38366 - [Java] Fix Murmur hash on buffers less than 4 bytes GH-38387 - [Java] Fix JDK8 compilation issue with TestAllTypes GH-38614 - [Java] Add VarBinary and VarCharWriter helper methods to more writers GH-38725 - [Java] decompression in Lz4CompressionCodec.java does not set writer index New Features and Improvements GH-38511 - [Java] Add getTransferPair(Field, BufferAllocator, CallBack) for StructVector and MapVector GH-14936 - [Java] Remove netty dependency from arrow-vector GH-38990 - [Java] Upgrade to flatc version 23.5.26 GH-39265 - [Java] Make it run well with the netty newest version 4.1.104 The full release notes as follows: - https://arrow.apache.org/release/15.0.0.html ### Does this PR introduce _any_ user-facing change? 
No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #44797 from LuciferYang/SPARK-46718. Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- dev/deps/spark-deps-hadoop-3-hive-2.3 | 12 +++- pom.xml | 2 +- 2 files changed, 8 insertions(+), 6 deletions(-) diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index 6220626069af..4ee0f5a41191 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -16,10 +16,10 @@ antlr4-runtime/4.13.1//antlr4-runtime-4.13.1.jar aopalliance-repackaged/2.6.1//aopalliance-repackaged-2.6.1.jar arpack/3.0.3//arpack-3.0.3.jar arpack_combined_all/0.1//arpack_combined_all-0.1.jar -arrow-format/14.0.2//arrow-format-14.0.2.jar -arrow-memory-core/14.0.2//arrow-memory-core-14.0.2.jar -arrow-memory-netty/14.0.2//arrow-memory-netty-14.0.2.jar -arrow-vector/14.0.2//arrow-vector-14.0.2.jar +arrow-format/15.0.0//arrow-format-15.0.0.jar +arrow-memory-core/15.0.0//arrow-memory-core-15.0.0.jar +arrow-memory-netty/15.0.0//arrow-memory-netty-15.0.0.jar +arrow-vector/15.0.0//arrow-vector-15.0.0.jar audience-annotations/0.12.0//audience-annotations-0.12.0.jar avro-ipc/1.11.3//avro-ipc-1.11.3.jar avro-mapred/1.11.3//avro-mapred-1.11.3.jar @@ -63,7 +63,9 @@ derby/10.16.1.1//derby-10.16.1.1.jar derbyshared/10.16.1.1//derbyshared-10.16.1.1.jar derbytools/10.16.1.1//derbytools-10.16.1.1.jar dropwizard-metrics-hadoop-metrics2-reporter/0.1.2//dropwizard-metrics-hadoop-metrics2-reporter-0.1.2.jar -flatbuffers-java/1.12.0//flatbuffers-java-1.12.0.jar +eclipse-collections-api/11.1.0//eclipse-collections-api-11.1.0.jar +eclipse-collections/11.1.0//eclipse-collections-11.1.0.jar +flatbuffers-java/23.5.26//flatbuffers-java-23.5.26.jar gcs-connector/hadoop3-2.2.18/shaded/gcs-connector-hadoop3-2.2.18-shaded.jar gmetric4j/1.0.10//gmetric4j-1.0.10.jar gson/2.2.4//gson-2.2.4.jar diff --git a/pom.xml b/pom.xml 
index e290273543c6..5f33dd7d8ebc 100644 --- a/pom.xml +++ b/pom.xml @@ -230,7 +230,7 @@ If you are changing Arrow version specification, please check ./python/pyspark/sql/pandas/utils.py, and ./python/setup.py too. --> -14.0.2 +15.0.0 2.5.11
(spark) branch master updated: [SPARK-46805][BUILD] Upgrade `scalafmt` to 3.7.17
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 34beb02827ff [SPARK-46805][BUILD] Upgrade `scalafmt` to 3.7.17 34beb02827ff is described below commit 34beb02827ffe14e3ed0407bed3f434098340ce4 Author: panbingkun AuthorDate: Mon Jan 22 21:23:10 2024 -0800 [SPARK-46805][BUILD] Upgrade `scalafmt` to 3.7.17 ### What changes were proposed in this pull request? The pr aims to upgrade `scalafmt` from `3.7.13` to `3.7.17`. ### Why are the changes needed? - Regular upgrade, the last upgrade occurred 5 months ago. - The full release notes: https://github.com/scalameta/scalafmt/releases/tag/v3.7.17 https://github.com/scalameta/scalafmt/releases/tag/v3.7.16 https://github.com/scalameta/scalafmt/releases/tag/v3.7.15 https://github.com/scalameta/scalafmt/releases/tag/v3.7.14 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. - Manually test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44845 from panbingkun/SPARK-46805. Authored-by: panbingkun Signed-off-by: Dongjoon Hyun --- dev/.scalafmt.conf | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dev/.scalafmt.conf b/dev/.scalafmt.conf index 721dec289900..b3a43a03651a 100644 --- a/dev/.scalafmt.conf +++ b/dev/.scalafmt.conf @@ -32,4 +32,4 @@ fileOverride { runner.dialect = scala213 } } -version = 3.7.13 +version = 3.7.17
(spark) branch master updated (00a9b94d1827 -> 31ffb2d99900)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 00a9b94d1827 [SPARK-46804][DOCS][TESTS] Recover the generated documents add 31ffb2d99900 [SPARK-46800][CORE] Support `spark.deploy.spreadOutDrivers` No new revisions were added by this update. Summary of changes: .../org/apache/spark/deploy/master/Master.scala| 66 ++ .../org/apache/spark/internal/config/Deploy.scala | 5 ++ .../apache/spark/deploy/master/MasterSuite.scala | 38 + docs/spark-standalone.md | 10 4 files changed, 96 insertions(+), 23 deletions(-)
(spark) branch master updated: [SPARK-46804][DOCS][TESTS] Recover the generated documents
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 00a9b94d1827 [SPARK-46804][DOCS][TESTS] Recover the generated documents 00a9b94d1827 is described below commit 00a9b94d18279cc75259c46b67cbb3da0078327b Author: Dongjoon Hyun AuthorDate: Mon Jan 22 17:57:05 2024 -0800 [SPARK-46804][DOCS][TESTS] Recover the generated documents ### What changes were proposed in this pull request? This PR regenerated the documents with the following. ``` SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "core/testOnly *SparkThrowableSuite -- -t \"Error classes match with document\"" ``` ### Why are the changes needed? The following PR broke CIs by manually fixing the generated docs. - #44825 Currently, CI is broken like the following. - https://github.com/apache/spark/actions/runs/7619269448/job/20752056653 - https://github.com/apache/spark/actions/runs/7619199659/job/20751858197 ``` [info] - Error classes match with document *** FAILED *** (24 milliseconds) [info] "...lstates.html#class-0[A]-feature-not-support..." did not equal "...lstates.html#class-0[a]-feature-not-support..." The error class document is not up to date. Please regenerate it. (SparkThrowableSuite.scala:322) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually check. ``` $ build/sbt "core/testOnly *.SparkThrowableSuite" ... 
[info] SparkThrowableSuite: [info] - No duplicate error classes (31 milliseconds) [info] - Error classes are correctly formatted (47 milliseconds) [info] - SQLSTATE is mandatory (2 milliseconds) [info] - SQLSTATE invariants (26 milliseconds) [info] - Message invariants (8 milliseconds) [info] - Message format invariants (7 milliseconds) [info] - Error classes match with document (65 milliseconds) [info] - Round trip (28 milliseconds) [info] - Error class names should contain only capital letters, numbers and underscores (7 milliseconds) [info] - Check if error class is missing (15 milliseconds) [info] - Check if message parameters match message format (4 milliseconds) [info] - Error message is formatted (1 millisecond) [info] - Error message does not do substitution on values (1 millisecond) [info] - Try catching legacy SparkError (0 milliseconds) [info] - Try catching SparkError with error class (1 millisecond) [info] - Try catching internal SparkError (0 milliseconds) [info] - Get message in the specified format (6 milliseconds) [info] - overwrite error classes (61 milliseconds) [info] - prohibit dots in error class names (23 milliseconds) [info] Run completed in 1 second, 357 milliseconds. [info] Total number of tests run: 19 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 19, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44843 from dongjoon-hyun/SPARK-46804. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- ...r-conditions-cannot-update-field-error-class.md | 4 +- ...ions-insufficient-table-property-error-class.md | 4 +- ...-internal-error-metadata-catalog-error-class.md | 4 +- ...-error-conditions-invalid-cursor-error-class.md | 4 +- ...-error-conditions-invalid-handle-error-class.md | 4 +- ...or-conditions-missing-attributes-error-class.md | 6 +- ...ns-not-supported-in-jdbc-catalog-error-class.md | 4 +- ...-conditions-unsupported-add-file-error-class.md | 4 +- ...itions-unsupported-default-value-error-class.md | 4 +- ...ditions-unsupported-deserializer-error-class.md | 4 +- ...r-conditions-unsupported-feature-error-class.md | 4 +- ...conditions-unsupported-save-mode-error-class.md | 4 +- ...ted-subquery-expression-category-error-class.md | 4 +- docs/sql-error-conditions.md | 102 ++--- 14 files changed, 92 insertions(+), 64 deletions(-) diff --git a/docs/sql-error-conditions-cannot-update-field-error-class.md b/docs/sql-error-conditions-cannot-update-field-error-class.md index 3d7152e499c9..42f952a403be 100644 --- a/docs/sql-error-conditions-cannot-update-field-error-class.md +++ b/docs/sql-error-conditions-cannot-update-field-error-class.md @@ -19,7 +19,7 @@ license: | limitations under the License. --- -[SQLSTATE: 0A000](sql-error-conditions-sqlstates.html#class-0a-feature-not-supported) +[SQLSTATE: 0A000](sql-error-conditions-sqlstates.html#class-0A-feature-not-supported) Cannot update `` field `` type: @@ -44
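The `SPARK_GENERATE_GOLDEN_FILES=1` workflow above follows the usual golden-file test pattern, sketched here with a hypothetical `check_golden` helper (not `SparkThrowableSuite`'s actual code): with the environment variable set, the expected document is rewritten in place; without it, the generated text must match the checked-in copy, which is how a manually edited doc fails the test.

```python
import os
import tempfile
from pathlib import Path

def check_golden(golden_path, generated):
    # With SPARK_GENERATE_GOLDEN_FILES=1 the expected output is regenerated;
    # otherwise the generated text must equal the checked-in golden file.
    if os.environ.get("SPARK_GENERATE_GOLDEN_FILES") == "1":
        golden_path.write_text(generated)
        return True
    return golden_path.read_text() == generated

with tempfile.TemporaryDirectory() as d:
    golden = Path(d) / "sql-error-conditions.md"
    os.environ["SPARK_GENERATE_GOLDEN_FILES"] = "1"
    assert check_golden(golden, "class-0A-feature-not-supported\n")  # regenerate
    del os.environ["SPARK_GENERATE_GOLDEN_FILES"]
    assert check_golden(golden, "class-0A-feature-not-supported\n")  # up to date
    assert not check_golden(golden, "class-0a-feature-not-supported\n")  # stale
```

The final assertion is exactly the failure mode in the CI logs: a lowercase `0a` anchor against a regenerated uppercase `0A`.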
(spark) branch master updated: [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 52b62921cadb [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script 52b62921cadb is described below commit 52b62921cadb05da5b1183f979edf7d608256f2e Author: Hyukjin Kwon AuthorDate: Mon Jan 22 17:06:59 2024 -0800 [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script ### What changes were proposed in this pull request? This PR proposes to avoid treating the exit code 5 as a test failure in Python testing script. ### Why are the changes needed? ``` ... Running PySpark tests Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log Will test against the following Python executables: ['python3.12'] Will test the following Python modules: ['pyspark-core', 'pyspark-streaming', 'pyspark-errors'] python3.12 python_implementation is CPython python3.12 version is: Python 3.12.1 Starting test(python3.12): pyspark.streaming.tests.test_context (temp output: /__w/spark/spark/python/target/8674ed86-36bd-47d1-863b-abb0405557f6/python3.12__pyspark.streaming.tests.test_context__umu69c3v.log) Finished test(python3.12): pyspark.streaming.tests.test_context (12s) Starting test(python3.12): pyspark.streaming.tests.test_dstream (temp output: /__w/spark/spark/python/target/847eb56b-3c5f-49ab-8a83-3326bb96bc5d/python3.12__pyspark.streaming.tests.test_dstream__rorhk0lc.log) Finished test(python3.12): pyspark.streaming.tests.test_dstream (102s) Starting test(python3.12): pyspark.streaming.tests.test_kinesis (temp output: /__w/spark/spark/python/target/78f23c83-c24d-4fa1-abbd-edb90f48dff1/python3.12__pyspark.streaming.tests.test_kinesis__q5l1pv0h.log) test_kinesis_stream (pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream) ... 
skipped "Skipping all Kinesis Python tests as environmental variable 'ENABLE_KINESIS_TESTS' was not set." test_kinesis_stream_api (pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream_api) ... skipped "Skipping all Kinesis Python tests as environmental variable 'ENABLE_KINESIS_TESTS' was not set." -- Ran 0 tests in 0.000s NO TESTS RAN (skipped=2) Had test failures in pyspark.streaming.tests.test_kinesis with python3.12; see logs. Error: running /__w/spark/spark/python/run-tests --modules=pyspark-core,pyspark-streaming,pyspark-errors --parallelism=1 --python-executables=python3.12 ; received return code 255 Error: Process completed with exit code 19. ``` Scheduled job fails because of exit 5, see https://github.com/pytest-dev/pytest/issues/2393. This isn't a test failure. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually tested. ### Was this patch authored or co-authored using generative AI tooling? No, Closes #44841 from HyukjinKwon/SPARK-46801. Authored-by: Hyukjin Kwon Signed-off-by: Dongjoon Hyun --- python/run-tests.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/python/run-tests.py b/python/run-tests.py index 97fbf9be320b..4cd3569efce3 100755 --- a/python/run-tests.py +++ b/python/run-tests.py @@ -147,8 +147,8 @@ def run_individual_python_test(target_dir, test_name, pyspark_python, keep_test_ # this code is invoked from a thread other than the main thread. os._exit(1) duration = time.time() - start_time -# Exit on the first failure. -if retcode != 0: +# Exit on the first failure but exclude the code 5 for no test ran, see SPARK-46801. +if retcode != 0 and retcode != 5: try: with FAILURE_REPORTING_LOCK: with open(LOG_FILE, 'ab') as log_file:
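The one-line change to `run-tests.py` amounts to a predicate over subprocess return codes. Exit code 5 is pytest's "no tests were collected" status (the linked pytest issue), which the script previously conflated with a failure:

```python
# pytest exit code 5 means "no tests were collected"; after SPARK-46801 the
# test runner no longer aborts on it.
NO_TESTS_COLLECTED = 5

def is_test_failure(retcode):
    # Mirrors the fixed condition in run_individual_python_test.
    return retcode != 0 and retcode != NO_TESTS_COLLECTED
```

With this predicate, a module whose tests are all skipped (e.g. the Kinesis suite without `ENABLE_KINESIS_TESTS`) no longer fails the scheduled job.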
(spark) branch branch-3.4 updated: [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 2621882da3ef [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script 2621882da3ef is described below commit 2621882da3effe2c9e0b3aedbcb26942e165a09f Author: Hyukjin Kwon AuthorDate: Mon Jan 22 17:06:59 2024 -0800 [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script ### What changes were proposed in this pull request? This PR proposes to avoid treating the exit code 5 as a test failure in Python testing script. ### Why are the changes needed? ``` ... Running PySpark tests Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log Will test against the following Python executables: ['python3.12'] Will test the following Python modules: ['pyspark-core', 'pyspark-streaming', 'pyspark-errors'] python3.12 python_implementation is CPython python3.12 version is: Python 3.12.1 Starting test(python3.12): pyspark.streaming.tests.test_context (temp output: /__w/spark/spark/python/target/8674ed86-36bd-47d1-863b-abb0405557f6/python3.12__pyspark.streaming.tests.test_context__umu69c3v.log) Finished test(python3.12): pyspark.streaming.tests.test_context (12s) Starting test(python3.12): pyspark.streaming.tests.test_dstream (temp output: /__w/spark/spark/python/target/847eb56b-3c5f-49ab-8a83-3326bb96bc5d/python3.12__pyspark.streaming.tests.test_dstream__rorhk0lc.log) Finished test(python3.12): pyspark.streaming.tests.test_dstream (102s) Starting test(python3.12): pyspark.streaming.tests.test_kinesis (temp output: /__w/spark/spark/python/target/78f23c83-c24d-4fa1-abbd-edb90f48dff1/python3.12__pyspark.streaming.tests.test_kinesis__q5l1pv0h.log) test_kinesis_stream 
(pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream) ... skipped "Skipping all Kinesis Python tests as environmental variable 'ENABLE_KINESIS_TESTS' was not set." test_kinesis_stream_api (pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream_api) ... skipped "Skipping all Kinesis Python tests as environmental variable 'ENABLE_KINESIS_TESTS' was not set." -- Ran 0 tests in 0.000s NO TESTS RAN (skipped=2) Had test failures in pyspark.streaming.tests.test_kinesis with python3.12; see logs. Error: running /__w/spark/spark/python/run-tests --modules=pyspark-core,pyspark-streaming,pyspark-errors --parallelism=1 --python-executables=python3.12 ; received return code 255 Error: Process completed with exit code 19. ``` Scheduled job fails because of exit 5, see https://github.com/pytest-dev/pytest/issues/2393. This isn't a test failure. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually tested. ### Was this patch authored or co-authored using generative AI tooling? No, Closes #44841 from HyukjinKwon/SPARK-46801. Authored-by: Hyukjin Kwon Signed-off-by: Dongjoon Hyun (cherry picked from commit 52b62921cadb05da5b1183f979edf7d608256f2e) Signed-off-by: Dongjoon Hyun --- python/run-tests.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/python/run-tests.py b/python/run-tests.py index 19e39c822cbb..b9031765d943 100755 --- a/python/run-tests.py +++ b/python/run-tests.py @@ -147,8 +147,8 @@ def run_individual_python_test(target_dir, test_name, pyspark_python, keep_test_ # this code is invoked from a thread other than the main thread. os._exit(1) duration = time.time() - start_time -# Exit on the first failure. -if retcode != 0: +# Exit on the first failure but exclude the code 5 for no test ran, see SPARK-46801. 
+if retcode != 0 and retcode != 5: try: with FAILURE_REPORTING_LOCK: with open(LOG_FILE, 'ab') as log_file:
(spark) branch branch-3.5 updated: [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new a6869b25fb9a [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script a6869b25fb9a is described below commit a6869b25fb9a7ac0e7e5015d342435e5c1b5f044 Author: Hyukjin Kwon AuthorDate: Mon Jan 22 17:06:59 2024 -0800 [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script ### What changes were proposed in this pull request? This PR proposes to avoid treating the exit code 5 as a test failure in Python testing script. ### Why are the changes needed? ``` ... Running PySpark tests Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log Will test against the following Python executables: ['python3.12'] Will test the following Python modules: ['pyspark-core', 'pyspark-streaming', 'pyspark-errors'] python3.12 python_implementation is CPython python3.12 version is: Python 3.12.1 Starting test(python3.12): pyspark.streaming.tests.test_context (temp output: /__w/spark/spark/python/target/8674ed86-36bd-47d1-863b-abb0405557f6/python3.12__pyspark.streaming.tests.test_context__umu69c3v.log) Finished test(python3.12): pyspark.streaming.tests.test_context (12s) Starting test(python3.12): pyspark.streaming.tests.test_dstream (temp output: /__w/spark/spark/python/target/847eb56b-3c5f-49ab-8a83-3326bb96bc5d/python3.12__pyspark.streaming.tests.test_dstream__rorhk0lc.log) Finished test(python3.12): pyspark.streaming.tests.test_dstream (102s) Starting test(python3.12): pyspark.streaming.tests.test_kinesis (temp output: /__w/spark/spark/python/target/78f23c83-c24d-4fa1-abbd-edb90f48dff1/python3.12__pyspark.streaming.tests.test_kinesis__q5l1pv0h.log) test_kinesis_stream 
(pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream) ... skipped "Skipping all Kinesis Python tests as environmental variable 'ENABLE_KINESIS_TESTS' was not set." test_kinesis_stream_api (pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream_api) ... skipped "Skipping all Kinesis Python tests as environmental variable 'ENABLE_KINESIS_TESTS' was not set." -- Ran 0 tests in 0.000s NO TESTS RAN (skipped=2) Had test failures in pyspark.streaming.tests.test_kinesis with python3.12; see logs. Error: running /__w/spark/spark/python/run-tests --modules=pyspark-core,pyspark-streaming,pyspark-errors --parallelism=1 --python-executables=python3.12 ; received return code 255 Error: Process completed with exit code 19. ``` Scheduled job fails because of exit 5, see https://github.com/pytest-dev/pytest/issues/2393. This isn't a test failure. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually tested. ### Was this patch authored or co-authored using generative AI tooling? No, Closes #44841 from HyukjinKwon/SPARK-46801. Authored-by: Hyukjin Kwon Signed-off-by: Dongjoon Hyun (cherry picked from commit 52b62921cadb05da5b1183f979edf7d608256f2e) Signed-off-by: Dongjoon Hyun --- python/run-tests.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/python/run-tests.py b/python/run-tests.py index 19e39c822cbb..b9031765d943 100755 --- a/python/run-tests.py +++ b/python/run-tests.py @@ -147,8 +147,8 @@ def run_individual_python_test(target_dir, test_name, pyspark_python, keep_test_ # this code is invoked from a thread other than the main thread. os._exit(1) duration = time.time() - start_time -# Exit on the first failure. -if retcode != 0: +# Exit on the first failure but exclude the code 5 for no test ran, see SPARK-46801. 
+if retcode != 0 and retcode != 5: try: with FAILURE_REPORTING_LOCK: with open(LOG_FILE, 'ab') as log_file:
(spark) branch master updated: [SPARK-46799][CORE][TESTS] Improve `MasterSuite` to use nanoTime-based appIDs and workerIDs
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new f9feddfbc9de [SPARK-46799][CORE][TESTS] Improve `MasterSuite` to use nanoTime-based appIDs and workerIDs
f9feddfbc9de is described below

commit f9feddfbc9de8e87f7a2e9d8abade7e687335b84
Author: Dongjoon Hyun
AuthorDate: Mon Jan 22 16:34:26 2024 -0800

[SPARK-46799][CORE][TESTS] Improve `MasterSuite` to use nanoTime-based appIDs and workerIDs

### What changes were proposed in this pull request?

This PR aims to improve `MasterSuite` to use nanoTime-based appIDs and workerIDs.

### Why are the changes needed?

During testing, I hit a case where two workers had the same ID. This PR prevents duplicated IDs for apps and workers in `MasterSuite`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

```
$ build/sbt "core/testOnly *.MasterSuite"
[info] MasterSuite:
[info] - can use a custom recovery mode factory (443 milliseconds)
[info] - SPARK-46664: master should recover quickly in case of zero workers and apps (38 milliseconds)
[info] - master correctly recover the application (41 milliseconds)
[info] - SPARK-46205: Recovery with Kryo Serializer (27 milliseconds)
[info] - SPARK-46216: Recovery without compression (19 milliseconds)
[info] - SPARK-46216: Recovery with compression (20 milliseconds)
[info] - SPARK-46258: Recovery with RocksDB (306 milliseconds)
[info] - master/worker web ui available (197 milliseconds)
[info] - master/worker web ui available with reverseProxy (30 seconds, 123 milliseconds)
[info] - master/worker web ui available behind front-end reverseProxy (30 seconds, 113 milliseconds)
[info] - basic scheduling - spread out (23 milliseconds)
[info] - basic scheduling - no spread out (14 milliseconds)
[info] - basic scheduling with more memory - spread out (10 milliseconds)
[info] - basic scheduling with more memory - no spread out (10 milliseconds)
[info] - scheduling with max cores - spread out (9 milliseconds)
[info] - scheduling with max cores - no spread out (9 milliseconds)
[info] - scheduling with cores per executor - spread out (9 milliseconds)
[info] - scheduling with cores per executor - no spread out (8 milliseconds)
[info] - scheduling with cores per executor AND max cores - spread out (8 milliseconds)
[info] - scheduling with cores per executor AND max cores - no spread out (7 milliseconds)
[info] - scheduling with executor limit - spread out (8 milliseconds)
[info] - scheduling with executor limit - no spread out (7 milliseconds)
[info] - scheduling with executor limit AND max cores - spread out (8 milliseconds)
[info] - scheduling with executor limit AND max cores - no spread out (9 milliseconds)
[info] - scheduling with executor limit AND cores per executor - spread out (8 milliseconds)
[info] - scheduling with executor limit AND cores per executor - no spread out (13 milliseconds)
[info] - scheduling with executor limit AND cores per executor AND max cores - spread out (8 milliseconds)
[info] - scheduling with executor limit AND cores per executor AND max cores - no spread out (7 milliseconds)
[info] - scheduling for app with multiple resource profiles (44 milliseconds)
[info] - scheduling for app with multiple resource profiles with max cores (37 milliseconds)
[info] - SPARK-45174: scheduling with max drivers (9 milliseconds)
[info] - SPARK-13604: Master should ask Worker kill unknown executors and drivers (15 milliseconds)
[info] - SPARK-20529: Master should reply the address received from worker (20 milliseconds)
[info] - SPARK-27510: Master should avoid dead loop while launching executor failed in Worker (34 milliseconds)
[info] - All workers on a host should be decommissioned (28 milliseconds)
[info] - No workers should be decommissioned with invalid host (25 milliseconds)
[info] - Only worker on host should be decommissioned (19 milliseconds)
[info] - SPARK-19900: there should be a corresponding driver for the app after relaunching driver (2 seconds, 60 milliseconds)
[info] - assign/recycle resources to/from driver (33 milliseconds)
[info] - assign/recycle resources to/from executor (27 milliseconds)
[info] - resource description with multiple resource profiles (1 millisecond)
[info] - SPARK-45753: Support driver id pattern (7 milliseconds)
[info] - SPARK-45753: Prevent invalid driver id patterns (6 milliseconds)
[info] - SPARK-45754: Support app id pattern (7 milliseconds)
[info] - SPARK-45754: Prevent invalid app id patterns (7 milliseconds)
[info] - SPARK-45785:
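The motivation above is that two workers registered within the same millisecond can end up with identical timestamp-derived IDs. A minimal sketch of the nanoTime-based idea follows; this is a hypothetical Java illustration, not the actual `MasterSuite` change (class and method names are invented here):

```java
// Hypothetical sketch: IDs derived from System.currentTimeMillis() collide when
// two workers register within the same clock tick; System.nanoTime() has far
// finer resolution, making ID collisions in a test suite very unlikely.
public class NanoIdSketch {
    // Generate a worker ID with a nanoTime suffix, e.g. "worker-123456789012345".
    static String workerId() {
        return "worker-" + System.nanoTime();
    }

    // Same idea for application IDs.
    static String appId() {
        return "app-" + System.nanoTime();
    }

    public static void main(String[] args) {
        // Two back-to-back millisecond-based IDs would often be equal;
        // nanoTime-based IDs almost never are.
        System.out.println(workerId());
        System.out.println(appId());
    }
}
```

Note that `System.nanoTime()` is only guaranteed to be monotonic, not strictly increasing, so production code would typically combine it with a counter; for avoiding accidental duplicates in a test suite the raw value is generally sufficient.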
(spark) branch master updated: [SPARK-46797][CORE] Rename `spark.deploy.spreadOut` to `spark.deploy.spreadOutApps`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 8d1212837538 [SPARK-46797][CORE] Rename `spark.deploy.spreadOut` to `spark.deploy.spreadOutApps`
8d1212837538 is described below

commit 8d121283753894d4969d8ff9e09bb487f76e82e1
Author: Dongjoon Hyun
AuthorDate: Mon Jan 22 16:26:43 2024 -0800

[SPARK-46797][CORE] Rename `spark.deploy.spreadOut` to `spark.deploy.spreadOutApps`

### What changes were proposed in this pull request?

This PR aims to rename `spark.deploy.spreadOut` to `spark.deploy.spreadOutApps`.

### Why are the changes needed?

Although the Apache Spark documentation clearly says this configuration is about `applications`, the old name can still mislead users into forgetting that `Driver` JVMs are always spread out independently of this configuration.

https://github.com/apache/spark/blob/b80e8cb4552268b771fc099457b9186807081c4a/docs/spark-standalone.md?plain=1#L282-L285

### Does this PR introduce _any_ user-facing change?

No, the behavior is the same. Only warnings will be shown for usages of the old config name.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44838 from dongjoon-hyun/SPARK-46797.

Authored-by: Dongjoon Hyun
Signed-off-by: Dongjoon Hyun
---
 core/src/main/scala/org/apache/spark/internal/config/Deploy.scala | 3 ++-
 docs/spark-standalone.md                                          | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/internal/config/Deploy.scala b/core/src/main/scala/org/apache/spark/internal/config/Deploy.scala
index 6585d62b3b9c..31ac07621176 100644
--- a/core/src/main/scala/org/apache/spark/internal/config/Deploy.scala
+++ b/core/src/main/scala/org/apache/spark/internal/config/Deploy.scala
@@ -97,8 +97,9 @@ private[spark] object Deploy {
     .intConf
     .createWithDefault(10)

-  val SPREAD_OUT_APPS = ConfigBuilder("spark.deploy.spreadOut")
+  val SPREAD_OUT_APPS = ConfigBuilder("spark.deploy.spreadOutApps")
     .version("0.6.1")
+    .withAlternative("spark.deploy.spreadOut")
     .booleanConf
     .createWithDefault(true)

diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md
index b9e3bb5d3f7f..6e454dff1bde 100644
--- a/docs/spark-standalone.md
+++ b/docs/spark-standalone.md
@@ -279,7 +279,7 @@ SPARK_MASTER_OPTS supports the following system properties:
   1.1.0

-  spark.deploy.spreadOut
+  spark.deploy.spreadOutApps
   true
   Whether the standalone cluster manager should spread applications out across nodes or try
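The `withAlternative("spark.deploy.spreadOut")` call above is what keeps the old key working after the rename. A rough illustration of that fallback pattern follows; this is a hypothetical helper in Java, not Spark's actual `ConfigBuilder` implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the withAlternative fallback: reads prefer the new
// key, fall back to the deprecated alias, then to the default value.
public class ConfigAliasSketch {
    static String get(Map<String, String> conf, String key, String alternative, String dflt) {
        if (conf.containsKey(key)) {
            return conf.get(key);
        }
        if (conf.containsKey(alternative)) {
            // A real implementation would typically log a deprecation warning here.
            return conf.get(alternative);
        }
        return dflt;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // A user still sets the old name; the new name resolves to its value.
        conf.put("spark.deploy.spreadOut", "false");
        System.out.println(
            get(conf, "spark.deploy.spreadOutApps", "spark.deploy.spreadOut", "true"));
        // prints: false
    }
}
```

This is why the PR is not user-facing: configurations written against the old key keep resolving to the same values, only with a deprecation warning.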
(spark) branch branch-3.4 updated: [SPARK-46779][SQL] `InMemoryRelation` instances of the same cached plan should be semantically equivalent
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.4 by this push:
     new 97536c6673bb [SPARK-46779][SQL] `InMemoryRelation` instances of the same cached plan should be semantically equivalent
97536c6673bb is described below

commit 97536c6673bb08ba8741a6a6f697b6880ca629ce
Author: Bruce Robbins
AuthorDate: Mon Jan 22 11:09:01 2024 -0800

[SPARK-46779][SQL] `InMemoryRelation` instances of the same cached plan should be semantically equivalent

### What changes were proposed in this pull request?

When canonicalizing `output` in `InMemoryRelation`, use `output` itself as the schema for determining the ordinals, rather than `cachedPlan.output`.

`InMemoryRelation.output` and `InMemoryRelation.cachedPlan.output` don't necessarily use the same exprIds. E.g.:
```
+- InMemoryRelation [c1#340, c2#341], StorageLevel(disk, memory, deserialized, 1 replicas)
   +- LocalTableScan [c1#254, c2#255]
```
Because of this, `InMemoryRelation` will sometimes fail to fully canonicalize, resulting in cases where two semantically equivalent `InMemoryRelation` instances appear to be semantically nonequivalent.

Example:
```
create or replace temp view data(c1, c2) as values (1, 2), (1, 3), (3, 7), (4, 5);
cache table data;
select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from data d2 group by all;
```
If plan change validation checking is on (i.e., `spark.sql.planChangeValidation=true`), the failure is:
```
[PLAN_VALIDATION_FAILED_RULE_EXECUTOR] The input plan of org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$2 is invalid: Aggregate: Aggregate [c1#78, scalar-subquery#77 [c1#78]], [c1#78, scalar-subquery#77 [c1#78] AS scalarsubquery(c1)#90L, count(c2#79) AS count(c2)#83L]
...
is not a valid aggregate expression: [SCALAR_SUBQUERY_IS_IN_GROUP_BY_OR_AGGREGATE_FUNCTION] The correlated scalar subquery '"scalarsubquery(c1)"' is neither present in GROUP BY, nor in an aggregate function.
```
If plan change validation checking is off, the failure is more mysterious:
```
[INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
```
If you remove the cache command, the query succeeds.

The above failures happen because the subquery in the aggregate expressions and the subquery in the grouping expressions seem semantically nonequivalent, since the `InMemoryRelation` in one of the subquery plans failed to completely canonicalize.

In `CacheManager#useCachedData`, two lookups for the same cached plan may create `InMemoryRelation` instances that have different exprIds in `output`. That's because the plan fragments used as lookup keys may have been deduplicated by `DeduplicateRelations`, and thus have different exprIds in their respective output schemas. When `CacheManager#useCachedData` creates an `InMemoryRelation` instance, it borrows the output schema of the plan fragment used as the lookup key.

The failure to fully canonicalize has other effects. For example, this query fails to reuse the exchange:
```
create or replace temp view data(c1, c2) as values (1, 2), (1, 3), (2, 4), (3, 7), (7, 22);
cache table data;
set spark.sql.autoBroadcastJoinThreshold=-1;
set spark.sql.adaptive.enabled=false;
select * from data l join data r on l.c1 = r.c1;
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44806 from bersprockets/plan_validation_issue.

Authored-by: Bruce Robbins
Signed-off-by: Dongjoon Hyun
(cherry picked from commit b80e8cb4552268b771fc099457b9186807081c4a)
Signed-off-by: Dongjoon Hyun
---
 .../spark/sql/execution/columnar/InMemoryRelation.scala   |  2 +-
 .../org/apache/spark/sql/DataFrameAggregateSuite.scala    | 15 +++
 .../sql/execution/columnar/InMemoryRelationSuite.scala    |  7 +++
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
index 4df9915dc96e..119e9e0a188f 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
@@ -391,7 +391,7 @@ case class InMemoryRelation(
   }

   override def doCanonicalize(): logical.LogicalPlan =
-    copy(output = output.map(QueryPlan.normalizeExpres
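The core idea of the fix — normalizing expression IDs against the relation's *own* `output` so that exprId differences stop mattering — can be illustrated with a toy model. This is a hypothetical Java sketch, not Spark's actual `QueryPlan.normalizeExpressions` logic:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Toy illustration: canonicalize attribute references by their ordinal in the
// relation's own output, so two relations over the same cache entry that
// differ only in exprIds (e.g. c1#340 vs c1#78) compare as equal.
public class CanonicalizeSketch {
    static List<String> canonicalize(List<String> output) {
        return IntStream.range(0, output.size())
            .mapToObj(i -> "none#" + i) // exprId replaced by a positional placeholder
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("c1#340", "c2#341"); // one lookup of the cache entry
        List<String> b = Arrays.asList("c1#78", "c2#79");   // another lookup, deduplicated exprIds
        // After canonicalization both become [none#0, none#1] and compare equal.
        System.out.println(canonicalize(a).equals(canonicalize(b)));
        // prints: true
    }
}
```

The bug was effectively that ordinals were resolved against `cachedPlan.output` (a third, unrelated set of exprIds), so this positional rewrite could not match the attributes and the relations stayed nonequivalent.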
(spark) branch branch-3.5 updated: [SPARK-46779][SQL] `InMemoryRelation` instances of the same cached plan should be semantically equivalent
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new 68d9f353300e [SPARK-46779][SQL] `InMemoryRelation` instances of the same cached plan should be semantically equivalent
68d9f353300e is described below

commit 68d9f353300ed7de0b47c26cb30236bada896d25
Author: Bruce Robbins
AuthorDate: Mon Jan 22 11:09:01 2024 -0800

[SPARK-46779][SQL] `InMemoryRelation` instances of the same cached plan should be semantically equivalent

### What changes were proposed in this pull request?

When canonicalizing `output` in `InMemoryRelation`, use `output` itself as the schema for determining the ordinals, rather than `cachedPlan.output`.

`InMemoryRelation.output` and `InMemoryRelation.cachedPlan.output` don't necessarily use the same exprIds. E.g.:
```
+- InMemoryRelation [c1#340, c2#341], StorageLevel(disk, memory, deserialized, 1 replicas)
   +- LocalTableScan [c1#254, c2#255]
```
Because of this, `InMemoryRelation` will sometimes fail to fully canonicalize, resulting in cases where two semantically equivalent `InMemoryRelation` instances appear to be semantically nonequivalent.

Example:
```
create or replace temp view data(c1, c2) as values (1, 2), (1, 3), (3, 7), (4, 5);
cache table data;
select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from data d2 group by all;
```
If plan change validation checking is on (i.e., `spark.sql.planChangeValidation=true`), the failure is:
```
[PLAN_VALIDATION_FAILED_RULE_EXECUTOR] The input plan of org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$2 is invalid: Aggregate: Aggregate [c1#78, scalar-subquery#77 [c1#78]], [c1#78, scalar-subquery#77 [c1#78] AS scalarsubquery(c1)#90L, count(c2#79) AS count(c2)#83L]
...
is not a valid aggregate expression: [SCALAR_SUBQUERY_IS_IN_GROUP_BY_OR_AGGREGATE_FUNCTION] The correlated scalar subquery '"scalarsubquery(c1)"' is neither present in GROUP BY, nor in an aggregate function.
```
If plan change validation checking is off, the failure is more mysterious:
```
[INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
```
If you remove the cache command, the query succeeds.

The above failures happen because the subquery in the aggregate expressions and the subquery in the grouping expressions seem semantically nonequivalent, since the `InMemoryRelation` in one of the subquery plans failed to completely canonicalize.

In `CacheManager#useCachedData`, two lookups for the same cached plan may create `InMemoryRelation` instances that have different exprIds in `output`. That's because the plan fragments used as lookup keys may have been deduplicated by `DeduplicateRelations`, and thus have different exprIds in their respective output schemas. When `CacheManager#useCachedData` creates an `InMemoryRelation` instance, it borrows the output schema of the plan fragment used as the lookup key.

The failure to fully canonicalize has other effects. For example, this query fails to reuse the exchange:
```
create or replace temp view data(c1, c2) as values (1, 2), (1, 3), (2, 4), (3, 7), (7, 22);
cache table data;
set spark.sql.autoBroadcastJoinThreshold=-1;
set spark.sql.adaptive.enabled=false;
select * from data l join data r on l.c1 = r.c1;
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44806 from bersprockets/plan_validation_issue.

Authored-by: Bruce Robbins
Signed-off-by: Dongjoon Hyun
(cherry picked from commit b80e8cb4552268b771fc099457b9186807081c4a)
Signed-off-by: Dongjoon Hyun
---
 .../spark/sql/execution/columnar/InMemoryRelation.scala   |  2 +-
 .../org/apache/spark/sql/DataFrameAggregateSuite.scala    | 15 +++
 .../sql/execution/columnar/InMemoryRelationSuite.scala    |  7 +++
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
index 65f7835b42cf..5bab8e53eb16 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
@@ -405,7 +405,7 @@ case class InMemoryRelation(
   }

   override def doCanonicalize(): logical.LogicalPlan =
-    copy(output = output.map(QueryPlan.normalizeExpres