(spark) branch master updated: [SPARK-47039][TESTS] Add a checkstyle rule to ban `commons-lang` in Java code
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7e911cdd0344 [SPARK-47039][TESTS] Add a checkstyle rule to ban `commons-lang` in Java code 7e911cdd0344 is described below commit 7e911cdd0344f164cc6a2976fa832d50589b3a2c Author: Dongjoon Hyun AuthorDate: Wed Feb 14 09:41:09 2024 -0800 [SPARK-47039][TESTS] Add a checkstyle rule to ban `commons-lang` in Java code ### What changes were proposed in this pull request? This PR aims to add a checkstyle rule to ban `commons-lang` in Java code in favor of `commons-lang3`. ### Why are the changes needed? SPARK-16129 banned `commons-lang` in Scala code since Apache Spark 2.0.0. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45097 from dongjoon-hyun/SPARK-47039. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- dev/checkstyle-suppressions.xml | 2 ++ dev/checkstyle.xml | 1 + 2 files changed, 3 insertions(+) diff --git a/dev/checkstyle-suppressions.xml b/dev/checkstyle-suppressions.xml index 37c03759ad5e..7b20dfb6bce5 100644 --- a/dev/checkstyle-suppressions.xml +++ b/dev/checkstyle-suppressions.xml @@ -62,4 +62,6 @@ files="sql/api/src/main/java/org/apache/spark/sql/streaming/Trigger.java"/> + diff --git a/dev/checkstyle.xml b/dev/checkstyle.xml index 5af15318081a..b9997d2050d1 100644 --- a/dev/checkstyle.xml +++ b/dev/checkstyle.xml @@ -186,6 +186,7 @@ + - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
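The rule this commit added is not visible above — the XML inside the diff was stripped in transit. As an illustrative sketch only (not the exact rule SPARK-47039 added), a Checkstyle ban on `commons-lang` imports in favor of `commons-lang3` is commonly expressed with the `IllegalImport` module:

```xml
<!-- Illustrative sketch, not the rule from this commit: ban the old
     commons-lang package. Checkstyle's IllegalImport matches on package
     boundaries, so org.apache.commons.lang3 imports remain allowed. -->
<module name="IllegalImport">
  <property name="illegalPkgs" value="org.apache.commons.lang"/>
</module>
```

A regex-based rule (e.g. `RegexpSinglelineJava` over `import org\.apache\.commons\.lang\.`) achieves the same effect; which form the commit actually used cannot be recovered from the stripped diff.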
(spark) branch master updated: [SPARK-47040][CONNECT] Allow Spark Connect Server Script to wait
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a6bed5e9bcc5 [SPARK-47040][CONNECT] Allow Spark Connect Server Script to wait a6bed5e9bcc5 is described below commit a6bed5e9bcc54dac51421263d5ef73c0b6e0b12c Author: Martin Grund AuthorDate: Wed Feb 14 03:03:30 2024 -0800 [SPARK-47040][CONNECT] Allow Spark Connect Server Script to wait ### What changes were proposed in this pull request? Add an option to the command line of `./sbin/start-connect-server.sh` that leaves it running in the foreground for easier debugging. ``` ./sbin/start-connect-server.sh --wait ``` ### Why are the changes needed? Usability ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Manual ### Was this patch authored or co-authored using generative AI tooling? No Closes #45090 from grundprinzip/start_server_wait. Authored-by: Martin Grund Signed-off-by: Dongjoon Hyun --- sbin/start-connect-server.sh | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/sbin/start-connect-server.sh b/sbin/start-connect-server.sh index a347f43db8b1..fecda717eb34 100755 --- a/sbin/start-connect-server.sh +++ b/sbin/start-connect-server.sh @@ -38,4 +38,10 @@ fi . "${SPARK_HOME}/bin/load-spark-env.sh" -exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 --name "Spark Connect server" "$@" +if [ "$1" == "--wait" ]; then + shift + exec "${SPARK_HOME}"/bin/spark-submit --class $CLASS 1 --name "Spark Connect Server" "$@" +else + exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 --name "Spark Connect server" "$@" +fi +
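The new dispatch in `start-connect-server.sh` consumes the `--wait` flag with `shift` before forwarding the remaining arguments. That flag-handling pattern can be sketched as a standalone script (the function names here are illustrative stand-ins, not the real launchers):

```shell
#!/bin/sh
# Sketch of the --wait dispatch pattern: consume the flag with `shift`,
# then forward the remaining arguments untouched.
run_foreground() { echo "foreground: $*"; }   # stands in for spark-submit
run_daemon()     { echo "daemon: $*"; }       # stands in for spark-daemon.sh

start() {
  if [ "$1" = "--wait" ]; then
    shift                        # drop --wait so it is not passed through
    run_foreground "$@"
  else
    run_daemon "$@"
  fi
}

start --wait --conf spark.foo=1   # prints: foreground: --conf spark.foo=1
start --conf spark.foo=1          # prints: daemon: --conf spark.foo=1
```

Note that the real script launches via `exec`, replacing the shell process so signals reach the server directly when it runs in the foreground.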
(spark) branch branch-3.5 updated: [SPARK-47023][BUILD] Upgrade `aircompressor` to 1.26
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new a8c62d3f9a8d [SPARK-47023][BUILD] Upgrade `aircompressor` to 1.26 a8c62d3f9a8d is described below commit a8c62d3f9a8de22f92e0e0ca1a5770f373b0b142 Author: Dongjoon Hyun AuthorDate: Mon Feb 12 10:37:49 2024 -0800 [SPARK-47023][BUILD] Upgrade `aircompressor` to 1.26 This PR aims to upgrade `aircompressor` to 1.26. `aircompressor` v1.26 has the following bug fixes. - [Fix out of bounds read/write in Snappy decompressor](https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2) - [Fix ZstdOutputStream corruption on double close](https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2) No. Pass the CIs. No. Closes #45084 from dongjoon-hyun/SPARK-47023. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +- pom.xml | 5 + 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index 9ab51dfa011a..c76702cd0af0 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -4,7 +4,7 @@ JTransforms/3.1//JTransforms-3.1.jar RoaringBitmap/0.9.45//RoaringBitmap-0.9.45.jar ST4/4.0.4//ST4-4.0.4.jar activation/1.1.1//activation-1.1.1.jar -aircompressor/0.25//aircompressor-0.25.jar +aircompressor/0.26//aircompressor-0.26.jar algebra_2.12/2.0.1//algebra_2.12-2.0.1.jar aliyun-java-sdk-core/4.5.10//aliyun-java-sdk-core-4.5.10.jar aliyun-java-sdk-kms/2.11.0//aliyun-java-sdk-kms-2.11.0.jar diff --git a/pom.xml b/pom.xml index 52505e6e1200..5db3c78e00eb 100644 --- a/pom.xml +++ b/pom.xml @@ -2555,6 +2555,11 @@ + +io.airlift +aircompressor +0.26 + org.apache.orc orc-mapreduce
(spark) branch master updated: [SPARK-47030][TESTS] Add `WebBrowserTest`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 33a153a6bbcb [SPARK-47030][TESTS] Add `WebBrowserTest` 33a153a6bbcb is described below commit 33a153a6bbcba0d9b2ab20404c7d3b6db86d7b4a Author: Dongjoon Hyun AuthorDate: Mon Feb 12 17:01:35 2024 -0800 [SPARK-47030][TESTS] Add `WebBrowserTest` ### What changes were proposed in this pull request? This PR aims to add a new test tag, `WebBrowserTest`. ### Why are the changes needed? Currently, several browser-based tests exist in multiple modules like the following. It's difficult to find and run them. ``` common/tags/src/test/java/org/apache/spark/tags/WebBrowserTest.java core/src/test/scala/org/apache/spark/deploy/history/ChromeUIHistoryServerSuite.scala core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala core/src/test/scala/org/apache/spark/ui/ChromeUISeleniumSuite.scala core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala sql/core/src/test/scala/org/apache/spark/sql/execution/ui/UISeleniumSuite.scala sql/core/src/test/scala/org/apache/spark/sql/streaming/ui/UISeleniumSuite.scala sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/UISeleniumSuite.scala streaming/src/test/scala/org/apache/spark/streaming/UISeleniumSuite.scala ``` In addition, the previous `ChromeUITest` tag is designed to disable only the `ChromeUI*` suites and doesn't cover all `WebBrowser`-based test suites. ### Does this PR introduce _any_ user-facing change? No, this is a new test tag. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45089 from dongjoon-hyun/SPARK-47030. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../java/org/apache/spark/tags/WebBrowserTest.java | 36 +- .../history/ChromeUIHistoryServerSuite.scala | 4 ++- .../spark/deploy/history/HistoryServerSuite.scala | 4 ++- .../apache/spark/ui/ChromeUISeleniumSuite.scala| 3 +- .../org/apache/spark/ui/UISeleniumSuite.scala | 2 ++ .../spark/sql/execution/ui/UISeleniumSuite.scala | 2 ++ .../spark/sql/streaming/ui/UISeleniumSuite.scala | 4 ++- .../sql/hive/thriftserver/UISeleniumSuite.scala| 2 ++ .../apache/spark/streaming/UISeleniumSuite.scala | 2 ++ 9 files changed, 26 insertions(+), 33 deletions(-) diff --git a/core/src/test/scala/org/apache/spark/ui/ChromeUISeleniumSuite.scala b/common/tags/src/test/java/org/apache/spark/tags/WebBrowserTest.java similarity index 50% copy from core/src/test/scala/org/apache/spark/ui/ChromeUISeleniumSuite.scala copy to common/tags/src/test/java/org/apache/spark/tags/WebBrowserTest.java index 459af6748e0e..715dcbf3b747 100644 --- a/core/src/test/scala/org/apache/spark/ui/ChromeUISeleniumSuite.scala +++ b/common/tags/src/test/java/org/apache/spark/tags/WebBrowserTest.java @@ -15,35 +15,13 @@ * limitations under the License. */ -package org.apache.spark.ui +package org.apache.spark.tags; -import org.openqa.selenium.{JavascriptExecutor, WebDriver} -import org.openqa.selenium.chrome.{ChromeDriver, ChromeOptions} +import java.lang.annotation.*; -import org.apache.spark.tags.ChromeUITest +import org.scalatest.TagAnnotation; -/** - * Selenium tests for the Spark Web UI with Chrome. 
- */ -@ChromeUITest -class ChromeUISeleniumSuite extends RealBrowserUISeleniumSuite("webdriver.chrome.driver") { - - override var webDriver: WebDriver with JavascriptExecutor = _ - - override def beforeAll(): Unit = { -super.beforeAll() -val chromeOptions = new ChromeOptions -chromeOptions.addArguments("--headless", "--disable-gpu") -webDriver = new ChromeDriver(chromeOptions) - } - - override def afterAll(): Unit = { -try { - if (webDriver != null) { -webDriver.quit() - } -} finally { - super.afterAll() -} - } -} +@TagAnnotation +@Retention(RetentionPolicy.RUNTIME) +@Target({ElementType.METHOD, ElementType.TYPE}) +public @interface WebBrowserTest { } diff --git a/core/src/test/scala/org/apache/spark/deploy/history/ChromeUIHistoryServerSuite.scala b/core/src/test/scala/org/apache/spark/deploy/history/ChromeUIHistoryServerSuite.scala index ec910e9bf343..ec9278f81b6c 100644 --- a/core/src/test/scala/org/apache/spark/deploy/history/ChromeUIHistoryServerSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/history/ChromeUIHistoryServerSuite.scala @@ -21,7 +21,7 @@ import org.openqa.selenium.WebDriver import org.openqa.selenium.chrome.{ChromeDriver, ChromeOptions} import org.apache.spark.internal.config.History.HybridStoreD
(spark) branch master updated: [SPARK-47027][PYTHON][TESTS] Use temporary directories for profiler test outputs
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 24a9d25358f7 [SPARK-47027][PYTHON][TESTS] Use temporary directories for profiler test outputs 24a9d25358f7 is described below commit 24a9d25358f71e5634240aa29c600588b838edb2 Author: Takuya UESHIN AuthorDate: Mon Feb 12 13:35:45 2024 -0800 [SPARK-47027][PYTHON][TESTS] Use temporary directories for profiler test outputs ### What changes were proposed in this pull request? Use temporary directories for profiler test outputs instead of `tempfile.gettempdir()`. ### Why are the changes needed? Directly using `tempfile.gettempdir()` can leave the files there after each test. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45087 from ueshin/issues/SPARK-47027/tempdir. 
Authored-by: Takuya UESHIN Signed-off-by: Dongjoon Hyun --- python/pyspark/sql/tests/test_udf_profiler.py | 28 +-- python/pyspark/tests/test_memory_profiler.py | 6 +++--- python/pyspark/tests/test_profiler.py | 6 +++--- 3 files changed, 20 insertions(+), 20 deletions(-) diff --git a/python/pyspark/sql/tests/test_udf_profiler.py b/python/pyspark/sql/tests/test_udf_profiler.py index 4f767d274414..764a860026f8 100644 --- a/python/pyspark/sql/tests/test_udf_profiler.py +++ b/python/pyspark/sql/tests/test_udf_profiler.py @@ -82,20 +82,20 @@ class UDFProfilerTests(unittest.TestCase): finally: sys.stdout = old_stdout -d = tempfile.gettempdir() -self.sc.dump_profiles(d) - -for i, udf_name in enumerate(["add1", "add2", "add1", "add2"]): -id, profiler, _ = profilers[i] -with self.subTest(id=id, udf_name=udf_name): -stats = profiler.stats() -self.assertTrue(stats is not None) -width, stat_list = stats.get_print_list([]) -func_names = [func_name for fname, n, func_name in stat_list] -self.assertTrue(udf_name in func_names) - -self.assertTrue(udf_name in io.getvalue()) -self.assertTrue("udf_%d.pstats" % id in os.listdir(d)) +with tempfile.TemporaryDirectory() as d: +self.sc.dump_profiles(d) + +for i, udf_name in enumerate(["add1", "add2", "add1", "add2"]): +id, profiler, _ = profilers[i] +with self.subTest(id=id, udf_name=udf_name): +stats = profiler.stats() +self.assertTrue(stats is not None) +width, stat_list = stats.get_print_list([]) +func_names = [func_name for fname, n, func_name in stat_list] +self.assertTrue(udf_name in func_names) + +self.assertTrue(udf_name in io.getvalue()) +self.assertTrue("udf_%d.pstats" % id in os.listdir(d)) def test_custom_udf_profiler(self): class TestCustomProfiler(UDFBasicProfiler): diff --git a/python/pyspark/tests/test_memory_profiler.py b/python/pyspark/tests/test_memory_profiler.py index 536f38679c3e..aa3541620446 100644 --- a/python/pyspark/tests/test_memory_profiler.py +++ b/python/pyspark/tests/test_memory_profiler.py @@ -106,9 +106,9 
@@ class MemoryProfilerTests(PySparkTestCase): self.sc.show_profiles() self.assertTrue("plus_one" in fake_out.getvalue()) -d = tempfile.gettempdir() -self.sc.dump_profiles(d) -self.assertTrue("udf_%d_memory.txt" % id in os.listdir(d)) +with tempfile.TemporaryDirectory() as d: +self.sc.dump_profiles(d) +self.assertTrue("udf_%d_memory.txt" % id in os.listdir(d)) def test_profile_pandas_udf(self): udfs = [self.exec_pandas_udf_ser_to_ser, self.exec_pandas_udf_ser_to_scalar] diff --git a/python/pyspark/tests/test_profiler.py b/python/pyspark/tests/test_profiler.py index b7797ead2adb..a12bc99c54ae 100644 --- a/python/pyspark/tests/test_profiler.py +++ b/python/pyspark/tests/test_profiler.py @@ -54,9 +54,9 @@ class ProfilerTests(PySparkTestCase): self.assertTrue("heavy_foo" in io.getvalue()) sys.stdout = old_stdout -d = tempfile.gettempdir() -self.sc.dump_profiles(d) -self.assertTrue("rdd_%d.pstats" % id in os.listdir(d)) +with tempfile.TemporaryDirectory() as d: +self.sc.dump_profiles(d) +
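The difference the patch exploits: `tempfile.gettempdir()` merely names the shared system temp directory, so anything written there outlives the test, while `tempfile.TemporaryDirectory()` creates a fresh directory and deletes it, contents included, when the `with` block exits. A minimal illustration of the cleanup behavior:

```python
import os
import tempfile

# With gettempdir() the file survives the test and must be removed by hand.
leftover = os.path.join(tempfile.gettempdir(), "profile_demo.pstats")
with open(leftover, "w") as f:
    f.write("stats")
assert os.path.exists(leftover)   # still there -- leaks between test runs
os.remove(leftover)               # manual cleanup required

# TemporaryDirectory() removes everything when the block exits.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "profile_demo.pstats")
    with open(path, "w") as f:
        f.write("stats")
    assert os.path.exists(path)   # visible while the block is open
assert not os.path.exists(d)      # directory and contents are gone
```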
(spark) branch master updated: [SPARK-47023][BUILD] Upgrade `aircompressor` to 1.26
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b425cd866334 [SPARK-47023][BUILD] Upgrade `aircompressor` to 1.26 b425cd866334 is described below commit b425cd86633402b764ea90449853610e98963a54 Author: Dongjoon Hyun AuthorDate: Mon Feb 12 10:37:49 2024 -0800 [SPARK-47023][BUILD] Upgrade `aircompressor` to 1.26 ### What changes were proposed in this pull request? This PR aims to upgrade `aircompressor` to 1.26. ### Why are the changes needed? `aircompressor` v1.26 has the following bug fixes. - [Fix out of bounds read/write in Snappy decompressor](https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2) - [Fix ZstdOutputStream corruption on double close](https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45084 from dongjoon-hyun/SPARK-47023. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +- pom.xml | 5 + 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index dd8d74888c6a..0b619a249e96 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -4,7 +4,7 @@ JTransforms/3.1//JTransforms-3.1.jar RoaringBitmap/1.0.1//RoaringBitmap-1.0.1.jar ST4/4.0.4//ST4-4.0.4.jar activation/1.1.1//activation-1.1.1.jar -aircompressor/0.25//aircompressor-0.25.jar +aircompressor/0.26//aircompressor-0.26.jar algebra_2.13/2.8.0//algebra_2.13-2.8.0.jar aliyun-java-sdk-core/4.5.10//aliyun-java-sdk-core-4.5.10.jar aliyun-java-sdk-kms/2.11.0//aliyun-java-sdk-kms-2.11.0.jar diff --git a/pom.xml b/pom.xml index ed6f48262570..0b6a6955b18b 100644 --- a/pom.xml +++ b/pom.xml @@ -2596,6 +2596,11 @@ + +io.airlift +aircompressor +0.26 + org.apache.orc orc-mapreduce
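One of the two upstream fixes this upgrade picks up addresses a stream corrupting its output when `close()` is called twice. The usual defense is to make `close()` idempotent so the frame footer is written exactly once; a sketch of that pattern (in Python, not airlift's actual code):

```python
class FramedStream:
    """Toy framed output stream illustrating the double-close bug class."""
    def __init__(self, sink):
        self.sink = sink
        self.closed = False

    def write(self, data):
        if self.closed:
            raise ValueError("write after close")
        self.sink.append(data)

    def close(self):
        if self.closed:          # idempotent close: emit the footer only once
            return
        self.sink.append("FOOTER")
        self.closed = True

out = []
s = FramedStream(out)
s.write("data")
s.close()
s.close()                        # second close is a no-op, no duplicate footer
assert out == ["data", "FOOTER"]
```

Without the `if self.closed: return` guard, the second `close()` would append a second footer — the corruption the ZstdOutputStream fix removes.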
svn commit: r67309 - /dev/spark/v3.3.4-rc1-docs/
Author: dongjoon Date: Mon Feb 12 17:26:47 2024 New Revision: 67309 Log: Remove Apache Spark 3.3.4 RC1 docs after releasing Removed: dev/spark/v3.3.4-rc1-docs/
(spark) branch master updated: [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f5b1e37c9e6a [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit` f5b1e37c9e6a is described below commit f5b1e37c9e6a4ec2fd897f97cd4526415e6c0e49 Author: Dongjoon Hyun AuthorDate: Mon Feb 12 09:12:10 2024 -0800 [SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.htmlunit3-driver` instead of `net.sourceforge.htmlunit` ### What changes were proposed in this pull request? This PR aims to use `org.seleniumhq.selenium.htmlunit3-driver` 4.17.0 instead of the old `net.sourceforge.htmlunit` test dependencies and `org.seleniumhq.selenium.htmlunit-driver` test dependency: - Remove `net.sourceforge.htmlunit.htmlunit` `2.70.0`. - Remove `net.sourceforge.htmlunit.htmlunit-core-js` `2.70.0`. - Remove `org.seleniumhq.selenium.htmlunit-driver` `4.12.0`. - Remove `xml-apis:xml-apis` `1.4.01`. ### Why are the changes needed? To help browser-based test suites. ### Does this PR introduce _any_ user-facing change? No. This is a test-only dependency and code change. ### How was this patch tested? Manual tests. 
``` build/sbt -Dguava.version=32.1.2-jre -Dspark.test.webdriver.chrome.driver=/opt/homebrew/bin/chromedriver -Dtest.default.exclude.tags="" -Phive -Phive-thriftserver \ "core/testOnly *HistoryServerSuite" ``` ``` build/sbt -Dguava.version=32.1.2-jre -Dspark.test.webdriver.chrome.driver=/opt/homebrew/bin/chromedriver -Dtest.default.exclude.tags="" -Phive -Phive-thriftserver \ "core/testOnly *UISeleniumSuite" ``` ``` build/sbt -Dguava.version=32.1.2-jre -Dspark.test.webdriver.chrome.driver=/opt/homebrew/bin/chromedriver -Dtest.default.exclude.tags="" -Phive -Phive-thriftserver \ "sql/testOnly *UISeleniumSuite" ``` ``` build/sbt -Dguava.version=32.1.2-jre -Dspark.test.webdriver.chrome.driver=/opt/homebrew/bin/chromedriver -Dtest.default.exclude.tags="" -Phive -Phive-thriftserver \ "streaming/testOnly *UISeleniumSuite" ``` ``` build/sbt -Dguava.version=32.1.2-jre -Dspark.test.webdriver.chrome.driver=/opt/homebrew/bin/chromedriver -Dtest.default.exclude.tags="" -Phive -Phive-thriftserver \ "hive-thriftserver/testOnly *UISeleniumSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45079 from dongjoon-hyun/SPARK-44445. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- core/pom.xml | 19 +--- .../spark/deploy/history/HistoryServerSuite.scala | 4 +++- .../history/RealBrowserUIHistoryServerSuite.scala | 6 - .../org/apache/spark/ui/UISeleniumSuite.scala | 4 ++-- pom.xml| 26 +++--- sql/core/pom.xml | 2 +- .../spark/sql/execution/ui/UISeleniumSuite.scala | 8 --- sql/hive-thriftserver/pom.xml | 2 +- streaming/pom.xml | 2 +- 9 files changed, 22 insertions(+), 51 deletions(-) diff --git a/core/pom.xml b/core/pom.xml index f780551fb555..9b5297cb8543 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -354,18 +354,7 @@ org.seleniumhq.selenium - htmlunit-driver - test - - - - net.sourceforge.htmlunit - htmlunit - test - - - net.sourceforge.htmlunit - htmlunit-core-js + htmlunit3-driver test @@ -384,12 +373,6 @@ httpcore test - - - xml-apis - xml-apis - test - org.mockito mockito-core diff --git a/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala b/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala index 1ca1e8fefd06..b3d7315e169b 100644 --- a/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala @@ -561,7 +561,9 @@ abstract class HistoryServerSuite extends SparkFunSuite with BeforeAndAfter with // app is no longer incomplete listApplications(false) should not contain(appId) -assert(jobcount === getNumJobs("/jobs")) +eventually(stdTimeout, stdInterval) { + assert(jobcount === getNumJobs("/jobs")) +} // no need to retain the test dir now the tests complete ShutdownHookManager.registerShutdownDeleteDir(logDir) diff --git a/core
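The `HistoryServerSuite` hunk above wraps a formerly flaky assertion in ScalaTest's `eventually(stdTimeout, stdInterval)`, which retries the body until it passes or the timeout expires. The idea behind that wrapper, sketched in Python (names and timings here are illustrative):

```python
import time

def eventually(assertion, timeout=5.0, interval=0.1):
    """Retry `assertion` until it stops raising or `timeout` elapses.
    Mirrors the idea of ScalaTest's `eventually` used in the suite above."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return assertion()
        except AssertionError:
            if time.monotonic() >= deadline:
                raise            # give up: surface the last failure
            time.sleep(interval)

# Example: a value that only reaches the expected state after a few polls,
# like a job count that the history server publishes asynchronously.
state = {"jobs": 0}
def poll():
    state["jobs"] += 1           # stands in for getNumJobs("/jobs")
    assert state["jobs"] >= 3
    return state["jobs"]

assert eventually(poll, timeout=2.0, interval=0.01) == 3
```

The design point is that the assertion itself stays unchanged; only the retry policy (`timeout`, `interval`) is layered around it.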
(spark) branch master updated: [SPARK-46991][SQL] Replace `IllegalArgumentException` by `SparkIllegalArgumentException` in `catalyst`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f5b0de07eff4 [SPARK-46991][SQL] Replace `IllegalArgumentException` by `SparkIllegalArgumentException` in `catalyst` f5b0de07eff4 is described below commit f5b0de07eff49bc1d076c4a1dc59c8672beff99e Author: Max Gekk AuthorDate: Sun Feb 11 15:25:28 2024 -0800 [SPARK-46991][SQL] Replace `IllegalArgumentException` by `SparkIllegalArgumentException` in `catalyst` ### What changes were proposed in this pull request? In the PR, I propose to replace all `IllegalArgumentException` by `SparkIllegalArgumentException` in `Catalyst` code base, and introduce new legacy error classes with the `_LEGACY_ERROR_TEMP_` prefix. ### Why are the changes needed? To unify Spark SQL exception, and port Java exceptions on Spark exceptions with error classes. ### Does this PR introduce _any_ user-facing change? Yes, it can if user's code assumes some particular format of `IllegalArgumentException` messages. ### How was this patch tested? By running existing test suites like: ``` $ build/sbt "core/testOnly *SparkThrowableSuite" $ build/sbt "test:testOnly *BufferHolderSparkSubmitSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45033 from MaxGekk/migrate-IllegalArgumentException-catalyst. 
Authored-by: Max Gekk Signed-off-by: Dongjoon Hyun --- .../src/main/resources/error/error-classes.json| 255 + .../scala/org/apache/spark/SparkException.scala| 25 +- .../connect/client/GrpcExceptionConverter.scala| 1 + .../sql/catalyst/expressions/ExpressionInfo.java | 38 +-- .../catalyst/expressions/codegen/BufferHolder.java | 12 +- .../sql/catalyst/expressions/xml/UDFXPathUtil.java | 4 +- .../catalog/SupportsPartitionManagement.java | 5 +- .../sql/connector/util/V2ExpressionSQLBuilder.java | 4 +- .../spark/sql/util/CaseInsensitiveStringMap.java | 3 +- .../sql/catalyst/CatalystTypeConverters.scala | 75 -- .../spark/sql/catalyst/csv/CSVExprUtils.scala | 18 +- .../spark/sql/catalyst/csv/CSVHeaderChecker.scala | 5 +- .../spark/sql/catalyst/expressions/Cast.scala | 6 +- .../sql/catalyst/expressions/TimeWindow.scala | 6 +- .../expressions/codegen/CodeGenerator.scala| 6 +- .../expressions/collectionOperations.scala | 25 +- .../sql/catalyst/expressions/csvExpressions.scala | 7 +- .../catalyst/expressions/datetimeExpressions.scala | 14 +- .../sql/catalyst/expressions/xmlExpressions.scala | 7 +- .../ReplaceNullWithFalseInPredicate.scala | 11 +- .../spark/sql/catalyst/parser/AstBuilder.scala | 8 +- .../spark/sql/catalyst/plans/joinTypes.scala | 16 +- .../sql/catalyst/plans/logical/v2Commands.scala| 6 +- .../spark/sql/catalyst/util/DateTimeUtils.scala| 8 +- .../spark/sql/catalyst/util/IntervalUtils.scala| 39 ++-- .../spark/sql/catalyst/xml/StaxXmlGenerator.scala | 9 +- .../spark/sql/catalyst/xml/StaxXmlParser.scala | 22 +- .../spark/sql/catalyst/xml/XmlInferSchema.scala| 5 +- .../sql/connector/catalog/CatalogV2Util.scala | 32 ++- .../spark/sql/errors/QueryExecutionErrors.scala| 28 +-- .../sql/catalyst/CatalystTypeConvertersSuite.scala | 74 +++--- .../spark/sql/catalyst/csv/CSVExprUtilsSuite.scala | 42 ++-- .../sql/catalyst/expressions/TimeWindowSuite.scala | 18 +- .../codegen/BufferHolderSparkSubmitSuite.scala | 12 +- .../expressions/codegen/BufferHolderSuite.scala| 22 +- 
.../sql/catalyst/util/DateTimeUtilsSuite.scala | 24 +- .../sql/util/CaseInsensitiveStringMapSuite.scala | 11 +- .../execution/datasources/v2/AlterTableExec.scala | 3 +- .../resources/sql-tests/results/ansi/date.sql.out | 5 +- .../sql/expressions/ExpressionInfoSuite.scala | 84 --- 40 files changed, 729 insertions(+), 266 deletions(-) diff --git a/common/utils/src/main/resources/error/error-classes.json b/common/utils/src/main/resources/error/error-classes.json index 4fcf9248d3e2..5884c9267119 100644 --- a/common/utils/src/main/resources/error/error-classes.json +++ b/common/utils/src/main/resources/error/error-classes.json @@ -7512,6 +7512,261 @@ "Failed to create column family with reserved name=" ] }, + "_LEGACY_ERROR_TEMP_3198" : { +"message" : [ + "Cannot grow BufferHolder by size because the size is negative" +] + }, + "_LEGACY_ERROR_TEMP_3199" : { +"message" : [ + "Cannot grow BufferHolder by size because the size after growing exceeds size limitation " +] + }, + "_LEGACY_ERROR_TEMP
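The migration above attaches a named error class to each former `IllegalArgumentException`, so messages come from the central template registry (`error-classes.json`) instead of ad-hoc strings. A hedged sketch of the mechanism — illustrative Python, not Spark's actual API, and the real templates use angle-bracket `<param>` placeholders (dropped from the JSON excerpt above during archiving):

```python
# Toy registry keyed by error class; Spark keeps these in error-classes.json.
ERROR_CLASSES = {
    "_LEGACY_ERROR_TEMP_3198":
        "Cannot grow BufferHolder by size {size} because the size is negative",
}

class SparkIllegalArgumentException(ValueError):
    """Illustrative stand-in: carries an error class plus template parameters."""
    def __init__(self, error_class, **params):
        self.error_class = error_class
        super().__init__(ERROR_CLASSES[error_class].format(**params))

err = None
try:
    raise SparkIllegalArgumentException("_LEGACY_ERROR_TEMP_3198", size=-8)
except SparkIllegalArgumentException as e:
    err = e

assert err.error_class == "_LEGACY_ERROR_TEMP_3198"
assert "by size -8" in str(err)
```

Callers (and tests like `SparkThrowableSuite`) can then match on the stable error class rather than on message text, which is the point of the migration.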
(spark) branch branch-3.5 updated: [SPARK-47022][CONNECT][TESTS][3.5] Fix `connect/client/jvm` to have explicit `commons-(io|lang3)` test dependency
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 4e4d9f07d095 [SPARK-47022][CONNECT][TESTS][3.5] Fix `connect/client/jvm` to have explicit `commons-(io|lang3)` test dependency 4e4d9f07d095 is described below commit 4e4d9f07d0954357e85a6e2b0da47746a4b08501 Author: Dongjoon Hyun AuthorDate: Sun Feb 11 14:38:48 2024 -0800 [SPARK-47022][CONNECT][TESTS][3.5] Fix `connect/client/jvm` to have explicit `commons-(io|lang3)` test dependency ### What changes were proposed in this pull request? This PR aims to add `commons-io` and `commons-lang3` test dependency to `connector/client/jvm` module. ### Why are the changes needed? `connector/client/jvm` module uses `commons-io` and `commons-lang3` during testing like the following. https://github.com/apache/spark/blob/9700da7bfc1abb607f3cb916b96724d0fb8f2eba/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala#L26-L28 Currently, it's broken due to that. - https://github.com/apache/spark/actions?query=branch%3Abranch-3.5 ### Does this PR introduce _any_ user-facing change? No, this is a test-dependency only change. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45081 from dongjoon-hyun/SPARK-47022. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- connector/connect/client/jvm/pom.xml | 10 ++ 1 file changed, 10 insertions(+) diff --git a/connector/connect/client/jvm/pom.xml b/connector/connect/client/jvm/pom.xml index 236e5850b762..0c0d4cdad3a9 100644 --- a/connector/connect/client/jvm/pom.xml +++ b/connector/connect/client/jvm/pom.xml @@ -71,6 +71,16 @@ ${ammonite.version} provided + + commons-io + commons-io + test + + + org.apache.commons + commons-lang3 + test + org.scalacheck scalacheck_${scala.binary.version}
(spark) branch branch-3.4 updated: [SPARK-47021][BUILD][TESTS] Fix `kvstore` module to have explicit `commons-lang3` test dependency
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 29adf32acdac [SPARK-47021][BUILD][TESTS] Fix `kvstore` module to have explicit `commons-lang3` test dependency 29adf32acdac is described below commit 29adf32acdacb56fd399b8945d7e049db5810ca1 Author: Dongjoon Hyun AuthorDate: Sun Feb 11 10:38:00 2024 -0800 [SPARK-47021][BUILD][TESTS] Fix `kvstore` module to have explicit `commons-lang3` test dependency ### What changes were proposed in this pull request? This PR aims to fix `kvstore` module by adding explicit `commons-lang3` test dependency and excluding `htmlunit-driver` from `org.scalatestplus` to use Apache Spark's explicit declaration. https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/pom.xml#L711-L716 ### Why are the changes needed? Since Spark 3.3.0 (SPARK-37282), `kvstore` uses `commons-lang3` test dependency like the following, but we didn't declare it explicitly so far. https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java#L33 https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBIteratorSuite.java#L23 Previously, it was provided by some unused `htmlunit-driver`'s transitive dependency accidentally. This causes a weird situation which `kvstore` module starts to fail to compile when we upgrade `htmlunit-driver`. We need to fix this first. ``` $ mvn dependency:tree -pl common/kvstore ... [INFO] | \- org.seleniumhq.selenium:htmlunit-driver:jar:4.12.0:test ... [INFO] |+- org.apache.commons:commons-lang3:jar:3.14.0:test ``` ### Does this PR introduce _any_ user-facing change? No. This is only a test dependency fix. ### How was this patch tested? 
Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45080 from dongjoon-hyun/SPARK-47021. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit a926c7912a78f1a2fb71c5ffd21b5c2f723a0128) Signed-off-by: Dongjoon Hyun --- common/kvstore/pom.xml | 5 + pom.xml| 6 ++ 2 files changed, 11 insertions(+) diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml index 4184e499221a..c72e08056937 100644 --- a/common/kvstore/pom.xml +++ b/common/kvstore/pom.xml @@ -66,6 +66,11 @@ commons-io test + + org.apache.commons + commons-lang3 + test + org.apache.logging.log4j diff --git a/pom.xml b/pom.xml index acc23ab2d8ed..26f0b71a5114 100644 --- a/pom.xml +++ b/pom.xml @@ -1152,6 +1152,12 @@ selenium-4-7_${scala.binary.version} 3.2.15.0 test + + +org.seleniumhq.selenium +htmlunit-driver + + org.mockito
(spark) branch branch-3.5 updated: [SPARK-47021][BUILD][TESTS] Fix `kvstore` module to have explicit `commons-lang3` test dependency
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 9700da7bfc1a [SPARK-47021][BUILD][TESTS] Fix `kvstore` module to have explicit `commons-lang3` test dependency 9700da7bfc1a is described below commit 9700da7bfc1abb607f3cb916b96724d0fb8f2eba Author: Dongjoon Hyun AuthorDate: Sun Feb 11 10:38:00 2024 -0800 [SPARK-47021][BUILD][TESTS] Fix `kvstore` module to have explicit `commons-lang3` test dependency ### What changes were proposed in this pull request? This PR aims to fix the `kvstore` module by adding an explicit `commons-lang3` test dependency and excluding `htmlunit-driver` from `org.scalatestplus` to use Apache Spark's explicit declaration. https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/pom.xml#L711-L716 ### Why are the changes needed? Since Spark 3.3.0 (SPARK-37282), `kvstore` has used `commons-lang3` as a test dependency, as shown below, but we never declared it explicitly. https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java#L33 https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBIteratorSuite.java#L23 Previously, it was provided accidentally as a transitive dependency of the unused `htmlunit-driver`. This causes a weird situation in which the `kvstore` module starts to fail to compile when we upgrade `htmlunit-driver`. We need to fix this first. ``` $ mvn dependency:tree -pl common/kvstore ... [INFO] | \- org.seleniumhq.selenium:htmlunit-driver:jar:4.12.0:test ... [INFO] |+- org.apache.commons:commons-lang3:jar:3.14.0:test ``` ### Does this PR introduce _any_ user-facing change? No. This is only a test dependency fix. ### How was this patch tested? 
Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45080 from dongjoon-hyun/SPARK-47021. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit a926c7912a78f1a2fb71c5ffd21b5c2f723a0128) Signed-off-by: Dongjoon Hyun --- common/kvstore/pom.xml | 5 + pom.xml| 6 ++ 2 files changed, 11 insertions(+) diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml index 1b1a8d0066f8..7dece9de699c 100644 --- a/common/kvstore/pom.xml +++ b/common/kvstore/pom.xml @@ -66,6 +66,11 @@ commons-io test + + org.apache.commons + commons-lang3 + test + org.apache.logging.log4j diff --git a/pom.xml b/pom.xml index 9e945f8d959a..d0cfdaa1496b 100644 --- a/pom.xml +++ b/pom.xml @@ -1146,6 +1146,12 @@ selenium-4-9_${scala.binary.version} 3.2.16.0 test + + +org.seleniumhq.selenium +htmlunit-driver + + org.mockito - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47021][BUILD][TESTS] Fix `kvstore` module to have explicit `commons-lang3` test dependency
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a926c7912a78 [SPARK-47021][BUILD][TESTS] Fix `kvstore` module to have explicit `commons-lang3` test dependency a926c7912a78 is described below commit a926c7912a78f1a2fb71c5ffd21b5c2f723a0128 Author: Dongjoon Hyun AuthorDate: Sun Feb 11 10:38:00 2024 -0800 [SPARK-47021][BUILD][TESTS] Fix `kvstore` module to have explicit `commons-lang3` test dependency ### What changes were proposed in this pull request? This PR aims to fix the `kvstore` module by adding an explicit `commons-lang3` test dependency and excluding `htmlunit-driver` from `org.scalatestplus` to use Apache Spark's explicit declaration. https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/pom.xml#L711-L716 ### Why are the changes needed? Since Spark 3.3.0 (SPARK-37282), `kvstore` has used `commons-lang3` as a test dependency, as shown below, but we never declared it explicitly. https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java#L33 https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBIteratorSuite.java#L23 Previously, it was provided accidentally as a transitive dependency of the unused `htmlunit-driver`. This causes a weird situation in which the `kvstore` module starts to fail to compile when we upgrade `htmlunit-driver`. We need to fix this first. ``` $ mvn dependency:tree -pl common/kvstore ... [INFO] | \- org.seleniumhq.selenium:htmlunit-driver:jar:4.12.0:test ... [INFO] |+- org.apache.commons:commons-lang3:jar:3.14.0:test ``` ### Does this PR introduce _any_ user-facing change? No. This is only a test dependency fix. ### How was this patch tested? 
Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45080 from dongjoon-hyun/SPARK-47021. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- common/kvstore/pom.xml | 5 + pom.xml| 6 ++ 2 files changed, 11 insertions(+) diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml index a9b5a4634717..3820d1b8e395 100644 --- a/common/kvstore/pom.xml +++ b/common/kvstore/pom.xml @@ -70,6 +70,11 @@ commons-io test + + org.apache.commons + commons-lang3 + test + org.apache.logging.log4j diff --git a/pom.xml b/pom.xml index f0eb164d0c45..79d572f1b8bf 100644 --- a/pom.xml +++ b/pom.xml @@ -1182,6 +1182,12 @@ selenium-4-12_${scala.binary.version} 3.2.17.0 test + + +org.seleniumhq.selenium +htmlunit-driver + + org.mockito - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47020][CORE][TESTS] Fix `RealBrowserUISeleniumSuite`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new fa23d276e7e4 [SPARK-47020][CORE][TESTS] Fix `RealBrowserUISeleniumSuite` fa23d276e7e4 is described below commit fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5 Author: Dongjoon Hyun AuthorDate: Fri Feb 9 19:23:17 2024 -0800 [SPARK-47020][CORE][TESTS] Fix `RealBrowserUISeleniumSuite` ### What changes were proposed in this pull request? This PR aims to fix `RealBrowserUISeleniumSuite` which has been broken after SPARK-45274. - #43053 ### Why are the changes needed? To recover `RealBrowserUISeleniumSuite` according to the latest HTML structure. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual ``` $ build/sbt -Dguava.version=32.1.2-jre \ -Dspark.test.webdriver.chrome.driver=/opt/homebrew/bin/chromedriver \ -Dtest.default.exclude.tags="" -Phive -Phive-thriftserver \ "core/testOnly org.apache.spark.ui.ChromeUISeleniumSuite" ``` **BEFORE** ``` [info] ChromeUISeleniumSuite: [info] - SPARK-31534: text for tooltip should be escaped *** FAILED *** (12 seconds, 752 milliseconds) [info] - SPARK-31882: Link URL for Stage DAGs should not depend on paged table. (2 seconds, 363 milliseconds) [info] - SPARK-31886: Color barrier execution mode RDD correctly *** FAILED *** (12 seconds, 143 milliseconds) [info] - Search text for paged tables should not be saved (3 seconds, 47 milliseconds) [info] Run completed in 32 seconds, 54 milliseconds. 
[info] Total number of tests run: 4 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 2, failed 2, canceled 0, ignored 0, pending 0 [info] *** 2 TESTS FAILED *** [error] Failed tests: [error] org.apache.spark.ui.ChromeUISeleniumSuite [error] (core / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 42 s, completed Feb 9, 2024, 5:32:52 PM ``` **AFTER** ``` [info] ChromeUISeleniumSuite: [info] - SPARK-31534: text for tooltip should be escaped (3 seconds, 135 milliseconds) [info] - SPARK-31882: Link URL for Stage DAGs should not depend on paged table. (2 seconds, 395 milliseconds) [info] - SPARK-31886: Color barrier execution mode RDD correctly (2 seconds, 144 milliseconds) [info] - Search text for paged tables should not be saved (2 seconds, 958 milliseconds) [info] Run completed in 12 seconds, 377 milliseconds. [info] Total number of tests run: 4 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 22 s, completed Feb 9, 2024, 5:34:24 PM ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45078 from dongjoon-hyun/SPARK-47020. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../spark/ui/RealBrowserUISeleniumSuite.scala | 32 ++ 1 file changed, 15 insertions(+), 17 deletions(-) diff --git a/core/src/test/scala/org/apache/spark/ui/RealBrowserUISeleniumSuite.scala b/core/src/test/scala/org/apache/spark/ui/RealBrowserUISeleniumSuite.scala index b0f1fcab63be..709ee98be1e3 100644 --- a/core/src/test/scala/org/apache/spark/ui/RealBrowserUISeleniumSuite.scala +++ b/core/src/test/scala/org/apache/spark/ui/RealBrowserUISeleniumSuite.scala @@ -73,8 +73,8 @@ abstract class RealBrowserUISeleniumSuite(val driverProp: String) // Open DAG Viz. 
webDriver.findElement(By.id("job-dag-viz")).click() -val nodeDesc = webDriver.findElement(By.cssSelector("g[class='node_0 node']")) -nodeDesc.getAttribute("name") should include ("collect at <console>:25") +val nodeDesc = webDriver.findElement(By.cssSelector("g[id='node_0']")) +nodeDesc.getAttribute("innerHTML") should include ("collect at <console>:25") } } } @@ -109,22 +109,20 @@ abstract class RealBrowserUISeleniumSuite(val driverProp: String) goToUi(sc, "/jobs/job/?id=0") webDriver.findElement(By.id("job-dag-viz")).click() -val stage0 = webDriver.findElement(By.cssSelector("g[id='graph_0']")) -val stage1 = webDriver.findElement(By.cssSelector("g[id='graph_1']")) +val stage0 = webDriver.findElement(By.cssSelector("g[id='graph_stage_0']")) + .findElement(By.xpath("..")) +val stage1 = webDriver.findElement
(spark) branch master updated: [MINOR][DOCS] Remove outdated `antlr4` version comment in `pom.xml`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f80a83e98668 [MINOR][DOCS] Remove outdated `antlr4` version comment in `pom.xml` f80a83e98668 is described below commit f80a83e986682e1ac0dcada4f538f4e050728bbe Author: Dongjoon Hyun AuthorDate: Fri Feb 9 03:31:43 2024 -0800 [MINOR][DOCS] Remove outdated `antlr4` version comment in `pom.xml` ### What changes were proposed in this pull request? This PR aims to remove an outdated `antlr4` comment in `pom.xml`. ### Why are the changes needed? This was missed when SPARK-44366 upgraded `antlr4` from 4.9.3 to 4.13.1. - https://github.com/apache/spark/pull/43075 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45076 from dongjoon-hyun/SPARK_ANTLR. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- pom.xml | 1 - 1 file changed, 1 deletion(-) diff --git a/pom.xml b/pom.xml index 35452ba0d734..f0eb164d0c45 100644 --- a/pom.xml +++ b/pom.xml @@ -212,7 +212,6 @@ 3.5.2 3.0.0 0.12.0 - 4.13.1 1.1 4.12.1 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (3f5faaa24e3a -> d179f7564541)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 3f5faaa24e3a [SPARK-46641][SS] Add maxBytesPerTrigger threshold add d179f7564541 [SPARK-46355][SQL][TESTS][FOLLOWUP] Test to check number of open files No new revisions were added by this update. Summary of changes: .../sql/execution/datasources/xml/XmlSuite.scala | 92 +- 1 file changed, 91 insertions(+), 1 deletion(-)
(spark) branch master updated: [SPARK-46641][SS] Add maxBytesPerTrigger threshold
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 3f5faaa24e3a [SPARK-46641][SS] Add maxBytesPerTrigger threshold 3f5faaa24e3a is described below commit 3f5faaa24e3ab4d9cc8f996bd1938573dd057e20 Author: maxim_konstantinov AuthorDate: Thu Feb 8 23:16:17 2024 -0800 [SPARK-46641][SS] Add maxBytesPerTrigger threshold ### What changes were proposed in this pull request? This PR adds [Input Streaming Source's](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources) option `maxBytesPerTrigger` for limiting the total size of input files in a streaming batch. Semantics of `maxBytesPerTrigger` is very close to already existing one `maxFilesPerTrigger` option. How a feature was implemented? Because `maxBytesPerTrigger` is semantically close to `maxFilesPerTrigger` I used all the `maxFilesPerTrigger` usages in the whole repository as a potential places that requires changes, that includes: - Option paramater definition - Option related logic - Option related ScalaDoc and MD files - Option related test I went over the usage of all usages of `maxFilesPerTrigger` in `FileStreamSourceSuite` and implemented `maxBytesPerTrigger` in the same fashion as those two are pretty close in their nature. From the structure and elements of ReadLimit I've concluded that current design implies only one simple rule for ReadLimit, so I openly prohibited the setting of both maxFilesPerTrigger and maxBytesPerTrigger at the same time. ### Why are the changes needed? This feature is useful for our and our sister teams and we expect it will find a broad acceptance among Spark users. We have a use-case in a few of the Spark pipelines we support when we use Available-now trigger for periodic processing using Spark Streaming. 
We use `maxFilesPerTrigger` threshold for now, but this is not ideal as Input file size might change with the time which requires periodic configuration adjustment of `maxFilesPerTrigger`. Computational complexity of the job depe [...] ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? New unit tests were added or existing `maxFilesPerTrigger` test were extended. I searched `maxFilesPerTrigger` related test and added new tests or extended existing ones trying to minimize and simplify the changes. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44636 from MaxNevermind/streaming-add-maxBytesPerTrigger-option. Lead-authored-by: maxim_konstantinov Co-authored-by: Max Konstantinov Signed-off-by: Dongjoon Hyun --- .../spark/sql/streaming/DataStreamReader.scala | 24 +++- docs/structured-streaming-programming-guide.md | 8 +- .../sql/connector/read/streaming/ReadLimit.java| 2 + .../{ReadLimit.java => ReadMaxBytes.java} | 39 +++--- .../execution/streaming/FileStreamOptions.scala| 18 ++- .../sql/execution/streaming/FileStreamSource.scala | 87 +++--- .../spark/sql/streaming/DataStreamReader.scala | 12 ++ .../sql/streaming/FileStreamSourceSuite.scala | 133 +++-- 8 files changed, 247 insertions(+), 76 deletions(-) diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala index bc8e30cd300c..789425c9daea 100644 --- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala +++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala @@ -159,7 +159,9 @@ final class DataStreamReader private[sql] (sparkSession: SparkSession) extends L * schema in advance, use the version that specifies the schema to avoid the extra scan. 
* * You can set the following option(s): `maxFilesPerTrigger` (default: no max limit): - * sets the maximum number of new files to be considered in every trigger. + * sets the maximum number of new files to be considered in every trigger. + * `maxBytesPerTrigger` (default: no max limit): sets the maximum total size of new files to + * be considered in every trigger. * * You can find the JSON-specific options for reading JSON file stream in https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option";> @@ -179,7 +181,9 @@ final class DataStreamReader private[sql] (sparkSession: SparkSession) extends L * specify the schema explicitly using `schema`. * * You can set the following option(s): `maxFilesPerTrigger` (default: no max limit): - * sets the
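The `maxBytesPerTrigger` option introduced above is mutually exclusive with `maxFilesPerTrigger`: the commit notes that the current `ReadLimit` design implies only one simple rule per source, so setting both is openly prohibited. A minimal, self-contained Java sketch of that option-resolution rule — the class and method names here are illustrative, not Spark's actual API:

```java
import java.util.HashMap;
import java.util.Map;

public class ReadLimitSketch {
    // Hypothetical helper mirroring the rule described in the commit above:
    // a file stream source may set maxFilesPerTrigger or maxBytesPerTrigger,
    // but never both at the same time.
    static String resolveReadLimit(Map<String, String> options) {
        String maxFiles = options.get("maxFilesPerTrigger");
        String maxBytes = options.get("maxBytesPerTrigger");
        if (maxFiles != null && maxBytes != null) {
            throw new IllegalArgumentException(
                "Options maxFilesPerTrigger and maxBytesPerTrigger cannot be set at the same time");
        }
        if (maxFiles != null) return "ReadMaxFiles(" + maxFiles + ")";
        if (maxBytes != null) return "ReadMaxBytes(" + maxBytes + ")";
        return "ReadAllAvailable";
    }

    public static void main(String[] args) {
        Map<String, String> opts = new HashMap<>();
        opts.put("maxBytesPerTrigger", "10g");
        System.out.println(resolveReadLimit(opts)); // prints ReadMaxBytes(10g)
    }
}
```

Failing fast when both options are present mirrors the behavior the commit describes, rather than silently preferring one threshold over the other.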
(spark) branch master updated: [SPARK-46831][INFRA][FOLLOWUP] Fix a wrong JIRA ID in MimaExcludes
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8603ed58a34d [SPARK-46831][INFRA][FOLLOWUP] Fix a wrong JIRA ID in MimaExcludes 8603ed58a34d is described below commit 8603ed58a34d42dd14a82d8950ef5943114c3a8d Author: Dongjoon Hyun AuthorDate: Thu Feb 8 12:00:23 2024 -0800 [SPARK-46831][INFRA][FOLLOWUP] Fix a wrong JIRA ID in MimaExcludes ### What changes were proposed in this pull request? This is a follow-up of - #44901 ### Why are the changes needed? To fix the wrong JIRA ID information. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45071 from dongjoon-hyun/SPARK-46831. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- project/MimaExcludes.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala index 5674eba0bea0..3e1391317eab 100644 --- a/project/MimaExcludes.scala +++ b/project/MimaExcludes.scala @@ -76,7 +76,7 @@ object MimaExcludes { // SPARK-46410: Assign error classes/subclasses to JdbcUtils.classifyException ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.jdbc.JdbcDialect.classifyException"), -// [SPARK-464878][CORE][SQL] (false alert). Invalid rule for StringType extension. 
+// TODO(SPARK-46878): Invalid Mima report for StringType extension ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.types.StringType.this"), // SPARK-47011: Remove deprecated BinaryClassificationMetrics.scoreLabelsWeight ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.scoreLabelsWeight") - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47011][MLLIB] Remove deprecated `BinaryClassificationMetrics.scoreLabelsWeight`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a5e741e60ac9 [SPARK-47011][MLLIB] Remove deprecated `BinaryClassificationMetrics.scoreLabelsWeight` a5e741e60ac9 is described below commit a5e741e60ac97a395ce80d9fb39709e143ada721 Author: Dongjoon Hyun AuthorDate: Thu Feb 8 10:59:17 2024 -0800 [SPARK-47011][MLLIB] Remove deprecated `BinaryClassificationMetrics.scoreLabelsWeight` ### What changes were proposed in this pull request? This PR aims to a planned removal of the deprecated `BinaryClassificationMetrics.scoreLabelsWeight`. ### Why are the changes needed? Apache Spark 3.4.0 deprecated this via SPARK-39533 and announced the removal of this. - #36926 https://github.com/apache/spark/blob/b7edc5fac0f4e479cbc869d54a9490c553ba2613/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L49-L50 ### Does this PR introduce _any_ user-facing change? Yes, but this is a planned change. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45070 from dongjoon-hyun/SPARK-47011. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala | 4 +--- project/MimaExcludes.scala| 4 +++- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala index a74800f7b189..869fe7155a26 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala @@ -46,9 +46,7 @@ class BinaryClassificationMetrics @Since("3.0.0") ( @Since("1.3.0") val numBins: Int = 1000) extends Logging { - @deprecated("The variable `scoreLabelsWeight` should be private and " + -"will be removed in 4.0.0.", "3.4.0") - val scoreLabelsWeight: RDD[(Double, (Double, Double))] = scoreAndLabels.map { + private val scoreLabelsWeight: RDD[(Double, (Double, Double))] = scoreAndLabels.map { case (prediction: Double, label: Double, weight: Double) => require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0") (prediction, (label, weight)) diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala index 64c5599919a6..5674eba0bea0 100644 --- a/project/MimaExcludes.scala +++ b/project/MimaExcludes.scala @@ -77,7 +77,9 @@ object MimaExcludes { // SPARK-46410: Assign error classes/subclasses to JdbcUtils.classifyException ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.jdbc.JdbcDialect.classifyException"), // [SPARK-464878][CORE][SQL] (false alert). Invalid rule for StringType extension. 
- ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.types.StringType.this") + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.types.StringType.this"), +// SPARK-47011: Remove deprecated BinaryClassificationMetrics.scoreLabelsWeight + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.scoreLabelsWeight") ) // Default exclude rules - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46997][CORE] Enable `spark.worker.cleanup.enabled` by default
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 338bb31c2fac [SPARK-46997][CORE] Enable `spark.worker.cleanup.enabled` by default 338bb31c2fac is described below commit 338bb31c2fac79fbc3482c23310b77d5306bd6c8 Author: Dongjoon Hyun AuthorDate: Wed Feb 7 22:51:14 2024 -0800 [SPARK-46997][CORE] Enable `spark.worker.cleanup.enabled` by default ### What changes were proposed in this pull request? This PR aims to enable `spark.worker.cleanup.enabled` by default as a part of Apache Spark 4.0.0. ### Why are the changes needed? Apache Spark community has been recommending (from Apache Spark 3.0 to 3.5) to enable `spark.worker.cleanup.enabled` when `spark.shuffle.service.db.enabled` is true. And, `spark.shuffle.service.db.enabled` has been `true` since SPARK-26288. https://github.com/apache/spark/blob/dc73a8d7e96ead55053096971c838908b7c90527/docs/spark-standalone.md?plain=1#L443 https://github.com/apache/spark/blob/dc73a8d7e96ead55053096971c838908b7c90527/docs/spark-standalone.md?plain=1#L473 https://github.com/apache/spark/blob/dc73a8d7e96ead55053096971c838908b7c90527/core/src/main/scala/org/apache/spark/internal/config/package.scala#L718-L724 Although `spark.shuffle.service.enabled` is disabled by default, `spark.worker.cleanup.enabled` is crucial for long-standing Spark Standalone clusters to avoid the disk full situation. https://github.com/apache/spark/blob/dc73a8d7e96ead55053096971c838908b7c90527/core/src/main/scala/org/apache/spark/internal/config/package.scala#L692-L696 ### Does this PR introduce _any_ user-facing change? Yes, but this has been a long-standing recommended configuration in the real production-level Spark Standalone clusters. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? 
No Closes #45055 from dongjoon-hyun/SPARK-46997. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/internal/config/Worker.scala | 2 +- docs/core-migration-guide.md | 2 ++ docs/spark-standalone.md | 2 +- 3 files changed, 4 insertions(+), 2 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/internal/config/Worker.scala b/core/src/main/scala/org/apache/spark/internal/config/Worker.scala index c53e181df002..5a67f3398a7d 100644 --- a/core/src/main/scala/org/apache/spark/internal/config/Worker.scala +++ b/core/src/main/scala/org/apache/spark/internal/config/Worker.scala @@ -62,7 +62,7 @@ private[spark] object Worker { val WORKER_CLEANUP_ENABLED = ConfigBuilder("spark.worker.cleanup.enabled") .version("1.0.0") .booleanConf -.createWithDefault(false) +.createWithDefault(true) val WORKER_CLEANUP_INTERVAL = ConfigBuilder("spark.worker.cleanup.interval") .version("1.0.0") diff --git a/docs/core-migration-guide.md b/docs/core-migration-guide.md index 7a5b17397bec..26e6b0f1f444 100644 --- a/docs/core-migration-guide.md +++ b/docs/core-migration-guide.md @@ -28,6 +28,8 @@ license: | - Since Spark 4.0, Spark will compress event logs. To restore the behavior before Spark 4.0, you can set `spark.eventLog.compress` to `false`. +- Since Spark 4.0, Spark workers will clean up worker and stopped application directories periodically. To restore the behavior before Spark 4.0, you can set `spark.worker.cleanup.enabled` to `false`. + - Since Spark 4.0, `spark.shuffle.service.db.backend` is set to `ROCKSDB` by default which means Spark will use RocksDB store for shuffle service. To restore the behavior before Spark 4.0, you can set `spark.shuffle.service.db.backend` to `LEVELDB`. - In Spark 4.0, support for Apache Mesos as a resource manager was removed. 
diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md index fbc83180d6b6..1eab3158e2e5 100644 --- a/docs/spark-standalone.md +++ b/docs/spark-standalone.md @@ -436,7 +436,7 @@ SPARK_WORKER_OPTS supports the following system properties: spark.worker.cleanup.enabled - false + true Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46998][SQL] Deprecate the SQL config `spark.sql.legacy.allowZeroIndexInFormatString`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a8ad71dbe417 [SPARK-46998][SQL] Deprecate the SQL config `spark.sql.legacy.allowZeroIndexInFormatString` a8ad71dbe417 is described below commit a8ad71dbe417c16af13d46783e13fba0c2280268 Author: Max Gekk AuthorDate: Wed Feb 7 18:20:20 2024 -0800 [SPARK-46998][SQL] Deprecate the SQL config `spark.sql.legacy.allowZeroIndexInFormatString` ### What changes were proposed in this pull request? In the PR, I propose to deprecate the SQL config `spark.sql.legacy.allowZeroIndexInFormatString` and put it to the list `deprecatedSQLConfigs` in `SQLConf`. ### Why are the changes needed? After migration on JDK 17+, the SQL config `spark.sql.legacy.allowZeroIndexInFormatString` doesn't work anymore, and doesn't allow to use the zero index. Even users set the config to `true`, they get the error: ```java Illegal format argument index = 0 java.util.IllegalFormatArgumentIndexException: Illegal format argument index = 0 at java.base/java.util.Formatter$FormatSpecifier.index(Formatter.java:2808) at java.base/java.util.Formatter$FormatSpecifier.(Formatter.java:2879) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By running the modified test suite: ``` $ build/sbt "test:testOnly *QueryCompilationErrorsSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45057 from MaxGekk/deprecate-ALLOW_ZERO_INDEX_IN_FORMAT_STRING. 
Authored-by: Max Gekk Signed-off-by: Dongjoon Hyun --- docs/sql-migration-guide.md| 1 + .../src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 7 +-- .../org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala | 7 +++ 3 files changed, 13 insertions(+), 2 deletions(-) diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md index cb5e697f871c..3d0c7280496a 100644 --- a/docs/sql-migration-guide.md +++ b/docs/sql-migration-guide.md @@ -38,6 +38,7 @@ license: | - `spark.sql.avro.datetimeRebaseModeInRead` instead of `spark.sql.legacy.avro.datetimeRebaseModeInRead` - Since Spark 4.0, the default value of `spark.sql.orc.compression.codec` is changed from `snappy` to `zstd`. To restore the previous behavior, set `spark.sql.orc.compression.codec` to `snappy`. - Since Spark 4.0, `SELECT (*)` is equivalent to `SELECT struct(*)` instead of `SELECT *`. To restore the previous behavior, set `spark.sql.legacy.ignoreParenthesesAroundStar` to `true`. +- Since Spark 4.0, the SQL config `spark.sql.legacy.allowZeroIndexInFormatString` is deprecated. Consider to change `strfmt` of the `format_string` function to use 1-based indexes. The first argument must be referenced by "1$", the second by "2$", etc. 
## Upgrading from Spark SQL 3.4 to 3.5 diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala index 8f86a1c8a1f3..59db3e71a135 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala @@ -2339,7 +2339,8 @@ object SQLConf { .doc("When false, the `strfmt` in `format_string(strfmt, obj, ...)` and " + "`printf(strfmt, obj, ...)` will no longer support to use \"0$\" to specify the first " + "argument, the first argument should always reference by \"1$\" when use argument index " + -"to indicating the position of the argument in the argument list.") +"to indicating the position of the argument in the argument list. " + +"This config will be removed in the future releases.") .version("3.3") .booleanConf .createWithDefault(false) @@ -4718,7 +4719,9 @@ object SQLConf { DeprecatedConfig(ESCAPED_STRING_LITERALS.key, "4.0", "Use raw string literals with the `r` prefix instead. "), DeprecatedConfig("spark.connect.copyFromLocalToFs.allowDestLocal", "4.0", -s"Use '${ARTIFACT_COPY_FROM_LOCAL_TO_FS_ALLOW_DEST_LOCAL.key}' instead.") +s"Use '${ARTIFACT_COPY_FROM_LOCAL_TO_FS_ALLOW_DEST_LOCAL.key}' instead."), + DeprecatedConfig(ALLOW_ZERO_INDEX_IN_FORMAT_STRING.key, "4.0", "Increase indexes by 1 " + +"in `strfmt` of the `format_string` function. Refer to the first argument by \"1$\".")
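The deprecation above rests on a JDK behavior the commit quotes: on JDK 17 and later, `java.util.Formatter` rejects a zero argument index (`"0$"`) outright, so the legacy flag can no longer re-enable it. A small self-contained Java probe (illustrative names, not Spark code):

```java
import java.util.IllegalFormatException;

public class FormatIndexDemo {
    // 1-based positional indexes, the form the migration note recommends:
    // the first argument is "1$", the second "2$", and so on.
    static String oneBased() {
        return String.format("%1$s and %2$s", "first", "second");
    }

    // Probes whether the runtime accepts a "0$" index; on JDK 17+ this is
    // expected to throw (the stack trace in the commit shows
    // java.util.IllegalFormatArgumentIndexException).
    static boolean zeroIndexAccepted() {
        try {
            String.format("%0$s", "first");
            return true;
        } catch (IllegalFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(oneBased());
        System.out.println("zero index accepted: " + zeroIndexAccepted());
    }
}
```

Rewriting `strfmt` arguments from `"0$"` to 1-based indexes, as the migration guide entry suggests, works on every JDK and avoids the exception entirely.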
(spark) branch master updated: [SPARK-47003][K8S] Detect and fail on invalid volume sizes (< 1KiB) in K8s
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new cdd41a9f2c4f [SPARK-47003][K8S] Detect and fail on invalid volume sizes (< 1KiB) in K8s cdd41a9f2c4f is described below commit cdd41a9f2c4f278c5da7e1826c5e0ca0db7ec548 Author: Dongjoon Hyun AuthorDate: Wed Feb 7 14:58:20 2024 -0800 [SPARK-47003][K8S] Detect and fail on invalid volume sizes (< 1KiB) in K8s ### What changes were proposed in this pull request? This PR aims to detect and fail on invalid volume sizes. ### Why are the changes needed? This happens when the user forgets the unit of the volume size. For example, `100` instead of `100Gi`. ### Does this PR introduce _any_ user-facing change? For K8s volumes, the system tries to use the system default minimum volume size. However, it totally depends on the underlying system, and this misconfiguration misleads users in many cases because the job is started and runs in an unhealthy status. - First, the executor pods will be killed by the K8s control plane due to the out-of-disk situation. - Second, Spark tries to create new executors (still with small disks) and retries multiple times. We had better detect the missing-unit situation and make those jobs fail as early as possible. ### How was this patch tested? Pass the CIs with newly added test cases. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45061 from dongjoon-hyun/SPARK-47003. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../spark/deploy/k8s/KubernetesVolumeUtils.scala | 13 +++ .../deploy/k8s/KubernetesVolumeUtilsSuite.scala| 26 ++ 2 files changed, 39 insertions(+) diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesVolumeUtils.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesVolumeUtils.scala index 18fda708d9bb..baa519658c2e 100644 --- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesVolumeUtils.scala +++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesVolumeUtils.scala @@ -16,6 +16,8 @@ */ package org.apache.spark.deploy.k8s +import java.lang.Long.parseLong + import org.apache.spark.SparkConf import org.apache.spark.deploy.k8s.Config._ @@ -76,6 +78,7 @@ private[spark] object KubernetesVolumeUtils { s"$volumeType.$volumeName.$KUBERNETES_VOLUMES_OPTIONS_CLAIM_STORAGE_CLASS_KEY" val sizeLimitKey = s"$volumeType.$volumeName.$KUBERNETES_VOLUMES_OPTIONS_SIZE_LIMIT_KEY" verifyOptionKey(options, claimNameKey, KUBERNETES_VOLUMES_PVC_TYPE) +verifySize(options.get(sizeLimitKey)) KubernetesPVCVolumeConf( options(claimNameKey), options.get(storageClassKey), @@ -84,6 +87,7 @@ private[spark] object KubernetesVolumeUtils { case KUBERNETES_VOLUMES_EMPTYDIR_TYPE => val mediumKey = s"$volumeType.$volumeName.$KUBERNETES_VOLUMES_OPTIONS_MEDIUM_KEY" val sizeLimitKey = s"$volumeType.$volumeName.$KUBERNETES_VOLUMES_OPTIONS_SIZE_LIMIT_KEY" +verifySize(options.get(sizeLimitKey)) KubernetesEmptyDirVolumeConf(options.get(mediumKey), options.get(sizeLimitKey)) case KUBERNETES_VOLUMES_NFS_TYPE => @@ -105,4 +109,13 @@ private[spark] object KubernetesVolumeUtils { throw new NoSuchElementException(key + s" is required for $msg") } } + + private def verifySize(size: Option[String]): Unit = { +size.foreach { v => + if (v.forall(_.isDigit) && parseLong(v) < 1024) { +throw new 
IllegalArgumentException( + s"Volume size `$v` is smaller than 1KiB. Missing units?") + } +} + } } diff --git a/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/KubernetesVolumeUtilsSuite.scala b/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/KubernetesVolumeUtilsSuite.scala index 156740d7c8ae..fdc1aae0d410 100644 --- a/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/KubernetesVolumeUtilsSuite.scala +++ b/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/KubernetesVolumeUtilsSuite.scala @@ -182,4 +182,30 @@ class KubernetesVolumeUtilsSuite extends SparkFunSuite { } assert(e.getMessage.contains("nfs.volumeName.options.server")) } + + test("SPARK-47003: Check emptyDir volume size") { +val sparkConf = new SparkConf(false) +sparkConf.set("test.emptyDir.volumeName
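The heuristic in `verifySize` is small enough to restate outside Spark: a purely numeric size below 1024 almost certainly means the unit suffix was forgotten. A hedged Python sketch of the same check (the function name is mine, not Spark's):

```python
from typing import Optional

def verify_size(size: Optional[str]) -> None:
    # Mirrors KubernetesVolumeUtils.verifySize: if every character is a
    # digit (no unit suffix like Gi/Mi) and the value is below 1024,
    # fail fast instead of provisioning a uselessly tiny volume.
    if size and size.isdigit() and int(size) < 1024:
        raise ValueError(
            f"Volume size `{size}` is smaller than 1KiB. Missing units?")

verify_size("100Gi")  # has a unit suffix, so not all digits: accepted
verify_size("4096")   # unit-less but >= 1 KiB: accepted
verify_size(None)     # no size limit configured: accepted
```

A value like `"100"` is rejected, which is exactly the missed-unit case the commit message describes.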
(spark) branch master updated: [SPARK-47000][CORE] Use `getTotalMemorySize` in `WorkerArguments`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d8e402db2aa7 [SPARK-47000][CORE] Use `getTotalMemorySize` in `WorkerArguments` d8e402db2aa7 is described below commit d8e402db2aa71835d087f84173463c7346c7176b Author: Dongjoon Hyun AuthorDate: Wed Feb 7 14:32:41 2024 -0800 [SPARK-47000][CORE] Use `getTotalMemorySize` in `WorkerArguments` ### What changes were proposed in this pull request? This PR aims to use `getTotalMemorySize` instead of deprecated `getTotalPhysicalMemorySize` (OpenJDK) or `getTotalPhysicalMemory` (IBM Java) in `WorkerArguments`. ### Why are the changes needed? `getTotalPhysicalMemorySize` is deprecated at Java 14 in OpenJDK. - https://docs.oracle.com/en/java/javase/17/docs/api/jdk.management/com/sun/management/OperatingSystemMXBean.html#getTotalPhysicalMemorySize() `getTotalPhysicalMemory` is deprecated since 1.8 in IBM. - https://eclipse.dev/openj9/docs/api/jdk17/jdk.management/com/ibm/lang/management/OperatingSystemMXBean.html#getTotalPhysicalMemory() `getTotalMemorySize` is recommended in both environments for Apache Spark 4.0.0 because the minimum Java version is 17. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs with the existing test cases. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45060 from dongjoon-hyun/SPARK-47000. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/deploy/worker/WorkerArguments.scala| 13 +++-- 1 file changed, 3 insertions(+), 10 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/worker/WorkerArguments.scala b/core/src/main/scala/org/apache/spark/deploy/worker/WorkerArguments.scala index 42f684c0a197..94a27e1a3e6d 100644 --- a/core/src/main/scala/org/apache/spark/deploy/worker/WorkerArguments.scala +++ b/core/src/main/scala/org/apache/spark/deploy/worker/WorkerArguments.scala @@ -148,20 +148,13 @@ private[worker] class WorkerArguments(args: Array[String], conf: SparkConf) { } def inferDefaultMemory(): Int = { -val ibmVendor = System.getProperty("java.vendor").contains("IBM") var totalMb = 0 try { // scalastyle:off classforname val bean = ManagementFactory.getOperatingSystemMXBean() - if (ibmVendor) { -val beanClass = Class.forName("com.ibm.lang.management.OperatingSystemMXBean") -val method = beanClass.getDeclaredMethod("getTotalPhysicalMemory") -totalMb = (method.invoke(bean).asInstanceOf[Long] / 1024 / 1024).toInt - } else { -val beanClass = Class.forName("com.sun.management.OperatingSystemMXBean") -val method = beanClass.getDeclaredMethod("getTotalPhysicalMemorySize") -totalMb = (method.invoke(bean).asInstanceOf[Long] / 1024 / 1024).toInt - } + val beanClass = Class.forName("com.sun.management.OperatingSystemMXBean") + val method = beanClass.getDeclaredMethod("getTotalMemorySize") + totalMb = (method.invoke(bean).asInstanceOf[Long] / 1024 / 1024).toInt // scalastyle:on classforname } catch { case e: Exception => - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
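Whichever bean method is invoked, it returns the machine's total memory in bytes, and the code above integer-divides that down to whole mebibytes. A trivial model of that arithmetic (illustrative only):

```python
def bytes_to_mb(total_bytes: int) -> int:
    # Same conversion as the diff: (bytes / 1024 / 1024).toInt
    return total_bytes // 1024 // 1024

assert bytes_to_mb(8 * 1024**3) == 8192  # an 8 GiB machine
```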
(spark) branch master updated: [SPARK-46977][CORE] A failed request to obtain a token from one NameNode should not skip subsequent token requests
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 006c2dca6d87 [SPARK-46977][CORE] A failed request to obtain a token from one NameNode should not skip subsequent token requests 006c2dca6d87 is described below commit 006c2dca6d87e29a69e30124e8320c275859d148 Author: Cheng Pan AuthorDate: Mon Feb 5 12:18:20 2024 -0800 [SPARK-46977][CORE] A failed request to obtain a token from one NameNode should not skip subsequent token requests ### What changes were proposed in this pull request? This PR enhances the `HadoopFSDelegationTokenProvider` to tolerate failures when fetching tokens from multiple NameNodes. ### Why are the changes needed? Let's say we are going to access 3 HDFS, `nn-1`, `nn-2`, `nn-3` in YARN cluster mode with TGT cache, while the `nn-1` is the `defaultFs` which is used by YARN to store aggregated logs, and there are issues in `nn-2` which can not issue the token. ``` spark-submit \ --master yarn \ --deployMode cluster \ --conf spark.kerberos.access.hadoopFileSystems=hdfs://nn-1,hdfs://nn-2,hdfs://nn-3 \ ... ``` During the submitting phase, Spark is going to call `HadoopFSDelegationTokenProvider` to fetch tokens from all declared NameNodes one by one, in **indeterminate** order (`HadoopFSDelegationTokenProvider.hadoopFSsToAccess` process and return a `Set[FileSystem]`), so the order may not respect the user declared order in `spark.kerberos.access.hadoopFileSystems`. If the order is [`nn-1`, `nn-2`, `nn-3`], then we are going to request a token from `nn-1` successfully, but fail for `nn-2` with the below error, the left `nn-3` is going to be skipped. But such failure WON'T block the whole submitting process, the Spark app is going to be submitted with only `nn-1` token. 
``` 2024-01-03 12:41:36 [WARN] [main] org.apache.spark.deploy.security.HadoopFSDelegationTokenProvider#94 - Failed to get token from service hadoopfs org.apache.hadoop.ipc.RemoteException: at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1507) ~[hadoop-common-2.9.2.2.jar:?] ... at org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2604) ~[hadoop-hdfs-client-2.9.2.2.jar:?] at org.apache.spark.deploy.security.HadoopFSDelegationTokenProvider.$anonfun$fetchDelegationTokens$1(HadoopFSDelegationTokenProvider.scala:122) ~[spark-core_2.12-3.3.1.27.jar:3.3.1.27] at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:335) ~[scala-library-2.12.15.jar:?] at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:) ~[scala-library-2.12.15.jar:?] at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:) ~[scala-library-2.12.15.jar:?] at org.apache.spark.deploy.security.HadoopFSDelegationTokenProvider.fetchDelegationTokens(HadoopFSDelegationTokenProvider.scala:115) ~[spark-core_2.12-3.3.1.27.jar:3.3.1.27] ... at org.apache.spark.deploy.security.HadoopDelegationTokenManager.obtainDelegationTokens(HadoopDelegationTokenManager.scala:146) ~[spark-core_2.12-3.3.1.27.jar:3.3.1.27] at org.apache.spark.deploy.yarn.Client.setupSecurityToken(Client.scala:352) ~[spark-yarn_2.12-3.3.1.27.jar:3.3.1.27] at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:1140) ~[spark-yarn_2.12-3.3.1.27.jar:3.3.1.27] ... 
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) ~[spark-core_2.12-3.3.1.27.jar:3.3.1.27] ``` when the Spark app access `nn-2` and `nn-3`, it will fail with `o.a.h.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]` Things become worse if the FS order is [`nn-3`, `nn-2`, `nn-1`], the Spark app will be submitted to YARN with only `nn-3` token, it even has no chance to allow NodeManager to upload aggregated logs after the application exit because it requires the app to provide a token to access `nn-1`. the log from NodeManager ``` 2024-01-03 08:08:14,028 [3173570620] - WARN [NM ContainerManager dispatcher:Client$Connection1$772] - Exception encountered while connecting to the server Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:179) at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:392) ... at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1768) ...
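The behavior the fix aims for can be sketched outside Spark: fetch a token per filesystem, treating a failure against one NameNode as a recorded warning rather than aborting the loop, so tokens for the remaining filesystems are still obtained. (`fetch_one` is a hypothetical callable standing in for the per-filesystem token request; this is a sketch of the control flow, not Spark's actual code.)

```python
def fetch_delegation_tokens(filesystems, fetch_one):
    tokens, failures = {}, []
    for fs in filesystems:
        try:
            tokens[fs] = fetch_one(fs)   # may raise for a bad NameNode
        except Exception as e:
            failures.append((fs, e))     # warn and keep going
    return tokens, failures

def fake_fetch(fs):
    # Simulate the scenario above: nn-2 cannot issue a token.
    if fs == "hdfs://nn-2":
        raise RuntimeError("cannot issue token")
    return f"token-for-{fs}"

tokens, failures = fetch_delegation_tokens(
    ["hdfs://nn-1", "hdfs://nn-2", "hdfs://nn-3"], fake_fetch)
# tokens now covers nn-1 and nn-3 even though nn-2 failed
```

With the pre-fix behavior, an exception from `nn-2` would have ended the loop, leaving whichever filesystems happened to sort after it — possibly including the `defaultFs` needed for log aggregation — without tokens.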
(spark) branch master updated: [SPARK-46972][SQL] Fix asymmetrical replacement for char/varchar in V2SessionCatalog.createTable
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f8ff18223b36 [SPARK-46972][SQL] Fix asymmetrical replacement for char/varchar in V2SessionCatalog.createTable f8ff18223b36 is described below commit f8ff18223b365afa59ee077dc7535f1190073069 Author: Kent Yao AuthorDate: Mon Feb 5 12:01:08 2024 -0800 [SPARK-46972][SQL] Fix asymmetrical replacement for char/varchar in V2SessionCatalog.createTable ### What changes were proposed in this pull request? This PR removes the asymmetrical replacement for char/varchar in V2SessionCatalog.createTable ### Why are the changes needed? Replacement for char/varchar shall happen on both sides of the equation `DataType.equalsIgnoreNullability(tableSchema, schema)` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #45019 from yaooqinn/SPARK-46972. 
Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- .../spark/sql/execution/datasources/v2/V2SessionCatalog.scala| 9 +++-- .../org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala| 9 + 2 files changed, 12 insertions(+), 6 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala index e7445e970fa5..0cb3f8dca38f 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala @@ -27,7 +27,6 @@ import org.apache.spark.SparkUnsupportedOperationException import org.apache.spark.sql.catalyst.{FunctionIdentifier, SQLConfHelper, TableIdentifier} import org.apache.spark.sql.catalyst.analysis.{NoSuchDatabaseException, NoSuchTableException, TableAlreadyExistsException} import org.apache.spark.sql.catalyst.catalog.{CatalogDatabase, CatalogStorageFormat, CatalogTable, CatalogTableType, CatalogUtils, ClusterBySpec, SessionCatalog} -import org.apache.spark.sql.catalyst.util.CharVarcharUtils import org.apache.spark.sql.catalyst.util.TypeUtils._ import org.apache.spark.sql.connector.catalog.{CatalogManager, CatalogV2Util, Column, FunctionCatalog, Identifier, NamespaceChange, SupportsNamespaces, Table, TableCatalog, TableCatalogCapability, TableChange, V1Table} import org.apache.spark.sql.connector.catalog.NamespaceChange.RemoveProperty @@ -206,11 +205,9 @@ class V2SessionCatalog(catalog: SessionCatalog) } val table = tableProvider.getTable(schema, partitions, dsOptions) // Check if the schema of the created table matches the given schema. 
- val tableSchema = CharVarcharUtils.replaceCharVarcharWithStringInSchema( -table.columns().asSchema) - if (!DataType.equalsIgnoreNullability(tableSchema, schema)) { -throw QueryCompilationErrors.dataSourceTableSchemaMismatchError( - tableSchema, schema) + val tableSchema = table.columns().asSchema + if (!DataType.equalsIgnoreNullability(table.columns().asSchema, schema)) { +throw QueryCompilationErrors.dataSourceTableSchemaMismatchError(tableSchema, schema) } (schema, partitioning) } diff --git a/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala index f92a9a827b1c..2701738351b1 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala @@ -1735,6 +1735,15 @@ class DataSourceV2SQLSuiteV1Filter } } + test("SPARK-46972: asymmetrical replacement for char/varchar in V2SessionCatalog.createTable") { +// unset this config to use the default v2 session catalog. +spark.conf.unset(V2_SESSION_CATALOG_IMPLEMENTATION.key) +withTable("t") { + sql(s"CREATE TABLE t(c char(1), v varchar(2)) USING $v2Source") + assert(!spark.table("t").isEmpty) +} + } + test("ShowCurrentNamespace: basic tests") { def testShowCurrentNamespace(expectedCatalogName: String, expectedNamespace: String): Unit = { val schema = new StructType() - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46978][PYTHON][DOCS] Refine docstring of `sum_distinct/array_agg/count_if`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2e917fb91924 [SPARK-46978][PYTHON][DOCS] Refine docstring of `sum_distinct/array_agg/count_if` 2e917fb91924 is described below commit 2e917fb919244b421c5a2770403c0fd91336f65d Author: yangjie01 AuthorDate: Mon Feb 5 11:58:25 2024 -0800 [SPARK-46978][PYTHON][DOCS] Refine docstring of `sum_distinct/array_agg/count_if` ### What changes were proposed in this pull request? This pr refine docstring of `sum_distinct/array_agg/count_if` and add some new examples. ### Why are the changes needed? To improve PySpark documentation ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass Github Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #45031 from LuciferYang/agg-functions. Lead-authored-by: yangjie01 Co-authored-by: YangJie Signed-off-by: Dongjoon Hyun --- python/pyspark/sql/functions/builtin.py | 134 +--- 1 file changed, 123 insertions(+), 11 deletions(-) diff --git a/python/pyspark/sql/functions/builtin.py b/python/pyspark/sql/functions/builtin.py index 0932ac1c2843..cb872fdb8180 100644 --- a/python/pyspark/sql/functions/builtin.py +++ b/python/pyspark/sql/functions/builtin.py @@ -1472,13 +1472,51 @@ def sum_distinct(col: "ColumnOrName") -> Column: Examples ->>> df = spark.createDataFrame([(None,), (1,), (1,), (2,)], schema=["numbers"]) ->>> df.select(sum_distinct(col("numbers"))).show() +Example 1: Using sum_distinct function on a column with all distinct values + +>>> from pyspark.sql import functions as sf +>>> df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["numbers"]) +>>> df.select(sf.sum_distinct('numbers')).show() ++-+ +|sum(DISTINCT numbers)| ++-+ +| 10| ++-+ + +Example 2: Using sum_distinct function on a column with no distinct values + 
+>>> from pyspark.sql import functions as sf +>>> df = spark.createDataFrame([(1,), (1,), (1,), (1,)], ["numbers"]) +>>> df.select(sf.sum_distinct('numbers')).show() ++-+ +|sum(DISTINCT numbers)| ++-+ +|1| ++-+ + +Example 3: Using sum_distinct function on a column with null and duplicate values + +>>> from pyspark.sql import functions as sf +>>> df = spark.createDataFrame([(None,), (1,), (1,), (2,)], ["numbers"]) +>>> df.select(sf.sum_distinct('numbers')).show() +-+ |sum(DISTINCT numbers)| +-+ |3| +-+ + +Example 4: Using sum_distinct function on a column with all None values + +>>> from pyspark.sql import functions as sf +>>> from pyspark.sql.types import StructType, StructField, IntegerType +>>> schema = StructType([StructField("numbers", IntegerType(), True)]) +>>> df = spark.createDataFrame([(None,), (None,), (None,), (None,)], schema=schema) +>>> df.select(sf.sum_distinct('numbers')).show() ++-+ +|sum(DISTINCT numbers)| ++-+ +| NULL| ++-+ """ return _invoke_function_over_columns("sum_distinct", col) @@ -4122,9 +4160,49 @@ def array_agg(col: "ColumnOrName") -> Column: Examples +Example 1: Using array_agg function on an int column + +>>> from pyspark.sql import functions as sf >>> df = spark.createDataFrame([[1],[1],[2]], ["c"]) ->>> df.agg(array_agg('c').alias('r')).collect() -[Row(r=[1, 1, 2])] +>>> df.agg(sf.sort_array(sf.array_agg('c'))).show() ++-+ +|sort_array(collect_list(c), true)| ++-+ +|[1, 1, 2]| ++-+ + +Example 2: Using array_agg function on a string column + +>>> from pyspark.sql import functions as sf +>>> df = spark.createDataFrame([["apple"],["apple"],["banana"]], ["c"]) +>>> df.agg(sf.sort_array(sf.array_agg('c'))).show(truncate=Fa
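The semantics illustrated in those docstring examples can be modeled in a few lines of plain Python: `sum(DISTINCT col)` ignores NULLs, counts each distinct value once, and yields NULL when no non-NULL values remain. (A sketch of the SQL semantics, not PySpark's implementation.)

```python
def sum_distinct(values):
    distinct = {v for v in values if v is not None}  # drop NULLs, dedupe
    return sum(distinct) if distinct else None       # all-NULL -> NULL

assert sum_distinct([1, 2, 3, 4]) == 10     # Example 1: all distinct
assert sum_distinct([1, 1, 1, 1]) == 1      # Example 2: one distinct value
assert sum_distinct([None, 1, 1, 2]) == 3   # Example 3: NULLs and duplicates
assert sum_distinct([None] * 4) is None     # Example 4: all NULL
```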
(spark) branch master updated: [MINOR][TEST] Add output/exception to error message when schema not matched in `TPCDSQueryTestSuite`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e9697130a1ac [MINOR][TEST] Add output/exception to error message when schema not matched in `TPCDSQueryTestSuite` e9697130a1ac is described below commit e9697130a1acd2293d52ce72b9bddf6a203e3e8c Author: Liang-Chi Hsieh AuthorDate: Mon Feb 5 09:43:44 2024 -0800 [MINOR][TEST] Add output/exception to error message when schema not matched in `TPCDSQueryTestSuite` ### What changes were proposed in this pull request? This patch adds the output/exception string to the error message when the output schema does not match the expected schema in `TPCDSQueryTestSuite`. ### Why are the changes needed? We have used `TPCDSQueryTestSuite` for testing TPCDS query results. The test suite checks the output schema and then the output result. If any exception happens during query execution, it will handle the exception and return an empty schema and the exception class + message as output. So, when any exception happens, the test suite just fails on the schema check and never uses/shows the exception, e.g., ``` java.lang.Exception: Expected "struct<[count(1):bigint]>", but got "struct<[]>" Schema did not match ``` We cannot see from the log what exception happened there. It is somewhat inconvenient for debugging. ### Does this PR introduce _any_ user-facing change? No, test only. ### How was this patch tested? Existing tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #45025 from viirya/minor_ouput_exception. 
Authored-by: Liang-Chi Hsieh Signed-off-by: Dongjoon Hyun --- .../test/scala/org/apache/spark/sql/TPCDSQueryTestSuite.scala| 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQueryTestSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQueryTestSuite.scala index ef7bdc2b079e..bde615552987 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQueryTestSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQueryTestSuite.scala @@ -139,7 +139,14 @@ class TPCDSQueryTestSuite extends QueryTest with TPCDSBase with SQLQueryTestHelp (segments(1).trim, segments(2).replaceAll("\\s+$", "")) } -assertResult(expectedSchema, s"Schema did not match\n$queryString") { +val notMatchedSchemaOutput = if (schema == emptySchema) { + // There might be exception. See `handleExceptions`. + s"Schema did not match\n$queryString\nOutput/Exception: $outputString" +} else { + s"Schema did not match\n$queryString" +} + +assertResult(expectedSchema, notMatchedSchemaOutput) { schema } if (shouldSortResults) { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46970][CORE] Rewrite `OpenHashSet#hasher` with `pattern matching`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7ca355cbc225 [SPARK-46970][CORE] Rewrite `OpenHashSet#hasher` with `pattern matching` 7ca355cbc225 is described below commit 7ca355cbc225653b090020271117a763ec59536d Author: yangjie01 AuthorDate: Sat Feb 3 21:07:16 2024 -0800 [SPARK-46970][CORE] Rewrite `OpenHashSet#hasher` with `pattern matching` ### What changes were proposed in this pull request? The proposed changes in this pr involve refactoring the method of creating a `Hasher[T]` instance in the code. The original code used a series of if-else statements to check the class type of `T` and create the corresponding `Hasher[T]` instance. The proposed change simplifies this process by using Scala's pattern matching feature. The new code is more concise and easier to read. ### Why are the changes needed? The changes are needed for several reasons. Firstly, the use of pattern matching makes the code more idiomatic to Scala, which is beneficial for readability and maintainability. Secondly, the original code contains a comment about a bug in the Scala 2.9.x compiler that prevented the use of pattern matching in this context. However, Apache Spark 4.0 has switched to using Scala 2.13, and the new code has passed all tests, it appears that the bug no longer exists in the new version of Sc [...] ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #44998 from LuciferYang/openhashset-hasher. 
Lead-authored-by: yangjie01 Co-authored-by: YangJie Signed-off-by: Dongjoon Hyun --- .../apache/spark/util/collection/OpenHashSet.scala | 28 +- 1 file changed, 6 insertions(+), 22 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala b/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala index 6815e47a198d..faee9ce56a0a 100644 --- a/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala +++ b/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala @@ -62,28 +62,12 @@ class OpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag]( // specialization to work (specialized class extends the non-specialized one and needs access // to the "private" variables). - protected val hasher: Hasher[T] = { -// It would've been more natural to write the following using pattern matching. But Scala 2.9.x -// compiler has a bug when specialization is used together with this pattern matching, and -// throws: -// scala.tools.nsc.symtab.Types$TypeError: type mismatch; -// found : scala.reflect.AnyValManifest[Long] -// required: scala.reflect.ClassTag[Int] -// at scala.tools.nsc.typechecker.Contexts$Context.error(Contexts.scala:298) -// at scala.tools.nsc.typechecker.Infer$Inferencer.error(Infer.scala:207) -// ... 
-val mt = classTag[T] -if (mt == ClassTag.Long) { - (new LongHasher).asInstanceOf[Hasher[T]] -} else if (mt == ClassTag.Int) { - (new IntHasher).asInstanceOf[Hasher[T]] -} else if (mt == ClassTag.Double) { - (new DoubleHasher).asInstanceOf[Hasher[T]] -} else if (mt == ClassTag.Float) { - (new FloatHasher).asInstanceOf[Hasher[T]] -} else { - new Hasher[T] -} + protected val hasher: Hasher[T] = classTag[T] match { +case ClassTag.Long => new LongHasher().asInstanceOf[Hasher[T]] +case ClassTag.Int => new IntHasher().asInstanceOf[Hasher[T]] +case ClassTag.Double => new DoubleHasher().asInstanceOf[Hasher[T]] +case ClassTag.Float => new FloatHasher().asInstanceOf[Hasher[T]] +case _ => new Hasher[T] } protected var _capacity = nextPowerOf2(initialCapacity) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
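The same selection idea translates naturally to other languages. As a hedged Python analogue (the hasher class names are kept only as labels), the old if/else chain collapses into a single lookup with a generic fallback, just as the Scala pattern match does:

```python
def pick_hasher(element_type: type) -> str:
    # Map a "class tag" to its specialized hasher; anything unlisted
    # falls back to the generic Hasher, like the `case _` branch above.
    specialized = {int: "LongHasher", float: "DoubleHasher"}
    return specialized.get(element_type, "Hasher")

assert pick_hasher(int) == "LongHasher"
assert pick_hasher(float) == "DoubleHasher"
assert pick_hasher(str) == "Hasher"
```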
(spark) branch master updated: [SPARK-46967][CORE][UI] Hide `Thread Dump` and `Heap Histogram` of `Dead` executors in `Executors` UI
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 062522e96a50 [SPARK-46967][CORE][UI] Hide `Thread Dump` and `Heap Histogram` of `Dead` executors in `Executors` UI 062522e96a50 is described below commit 062522e96a50b8b46b313aae62668717ba88639f Author: Dongjoon Hyun AuthorDate: Sat Feb 3 19:17:33 2024 -0800 [SPARK-46967][CORE][UI] Hide `Thread Dump` and `Heap Histogram` of `Dead` executors in `Executors` UI ### What changes were proposed in this pull request? This PR aims to hide the `Thread Dump` and `Heap Histogram` links of `Dead` executors in the Spark Driver `Executors` UI. **BEFORE** ![Screenshot 2024-02-02 at 11 40 46 PM](https://github.com/apache/spark/assets/9700541/9fb45667-b25c-44cc-9c7c-c2ff981c5a2f) **AFTER** ![Screenshot 2024-02-02 at 11 40 03 PM](https://github.com/apache/spark/assets/9700541/9963452a-773c-4f8b-b025-9362853d3cae) ### Why are the changes needed? Since both `Thread Dump` and `Heap Histogram` require a live JVM, those links are broken and lead to the following pages. **Broken Thread Dump Link** ![Screenshot 2024-02-02 at 11 36 55 PM](https://github.com/apache/spark/assets/9700541/2cfff1b1-dc00-4fef-ab68-5e3fad5df7a0) **Broken Heap Histogram Link** ![Screenshot 2024-02-02 at 11 37 12 PM](https://github.com/apache/spark/assets/9700541/8450cb3e-3756-4755-896f-7ced682f09b0) We had better hide them. ### Does this PR introduce _any_ user-facing change? Yes, but this PR only hides the broken links. ### How was this patch tested? Manual. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45009 from dongjoon-hyun/SPARK-46967. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../main/resources/org/apache/spark/ui/static/executorspage.js | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/core/src/main/resources/org/apache/spark/ui/static/executorspage.js b/core/src/main/resources/org/apache/spark/ui/static/executorspage.js index 41164c7997bb..1b02fc0493e7 100644 --- a/core/src/main/resources/org/apache/spark/ui/static/executorspage.js +++ b/core/src/main/resources/org/apache/spark/ui/static/executorspage.js @@ -587,14 +587,16 @@ $(document).ready(function () { {name: 'executorLogsCol', data: 'executorLogs', render: formatLogsCells}, { name: 'threadDumpCol', - data: 'id', render: function (data, type) { -return type === 'display' ? ("Thread Dump" ) : data; + data: function (row) { return row.isActive ? row.id : '' }, + render: function (data, type) { +return data != '' && type === 'display' ? ("Thread Dump" ) : data; } }, { name: 'heapHistogramCol', - data: 'id', render: function (data, type) { -return type === 'display' ? ("Heap Histogram") : data; + data: function (row) { return row.isActive ? row.id : '' }, + render: function (data, type) { +return data != '' && type === 'display' ? ("Heap Histogram") : data; } }, { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [MINOR][DOCS] Remove Java 8/11 at `IgnoreUnrecognizedVMOptions` description
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 0154c059cddb [MINOR][DOCS] Remove Java 8/11 at `IgnoreUnrecognizedVMOptions` description 0154c059cddb is described below commit 0154c059cddba7cafe74243b3f9eedd9db367b72 Author: Dongjoon Hyun AuthorDate: Sat Feb 3 18:47:30 2024 -0800 [MINOR][DOCS] Remove Java 8/11 at `IgnoreUnrecognizedVMOptions` description ### What changes were proposed in this pull request? This PR aims to remove old Java 8 and Java 11 from `IgnoreUnrecognizedVMOptions` JVM option description. ### Why are the changes needed? From Apache Spark 4.0.0, we use `IgnoreUnrecognizedVMOptions` JVM option to be robust, not for Java 8 and Java 11 support. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45012 from dongjoon-hyun/IgnoreUnrecognizedVMOptions. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../src/main/java/org/apache/spark/launcher/JavaModuleOptions.java | 2 +- .../yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala | 3 +-- 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/launcher/src/main/java/org/apache/spark/launcher/JavaModuleOptions.java b/launcher/src/main/java/org/apache/spark/launcher/JavaModuleOptions.java index a7a6891746c2..8893f4bcb85a 100644 --- a/launcher/src/main/java/org/apache/spark/launcher/JavaModuleOptions.java +++ b/launcher/src/main/java/org/apache/spark/launcher/JavaModuleOptions.java @@ -20,7 +20,7 @@ package org.apache.spark.launcher; /** * This helper class is used to place the all `--add-opens` options * required by Spark when using Java 17. `DEFAULT_MODULE_OPTIONS` has added - * `-XX:+IgnoreUnrecognizedVMOptions` to be compatible with Java 8 and Java 11. 
+ * `-XX:+IgnoreUnrecognizedVMOptions` to be robust. * * @since 3.3.0 */ diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala index 22037ad5..6e3e0a1e644e 100644 --- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala +++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala @@ -1031,8 +1031,7 @@ private[spark] class Client( javaOpts += s"-Djava.net.preferIPv6Addresses=${Utils.preferIPv6}" // SPARK-37106: To start AM with Java 17, `JavaModuleOptions.defaultModuleOptions` -// is added by default. It will not affect Java 8 and Java 11 due to existence of -// `-XX:+IgnoreUnrecognizedVMOptions`. +// is added by default. javaOpts += JavaModuleOptions.defaultModuleOptions() // Set the environment variable through a command prefix - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-45276][INFRA][FOLLOWUP] Fix Java version comment from 11 to 17
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new f4525ff978a7 [SPARK-45276][INFRA][FOLLOWUP] Fix Java version comment from 11 to 17
f4525ff978a7 is described below

commit f4525ff978a7626d93311cb45425cbd591c0454e
Author: Dongjoon Hyun
AuthorDate: Sat Feb 3 18:33:59 2024 -0800

    [SPARK-45276][INFRA][FOLLOWUP] Fix Java version comment from 11 to 17

    ### What changes were proposed in this pull request?
    This is a follow-up of
    - #43076

    ### Why are the changes needed?
    To make the comment match the code.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Manual review.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #45013 from dongjoon-hyun/SPARK-45276.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 connector/docker/spark-test/base/Dockerfile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/connector/docker/spark-test/base/Dockerfile b/connector/docker/spark-test/base/Dockerfile
index 0e8593f8af5b..c397abc211e2 100644
--- a/connector/docker/spark-test/base/Dockerfile
+++ b/connector/docker/spark-test/base/Dockerfile
@@ -18,7 +18,7 @@ FROM ubuntu:20.04

 # Upgrade package index
-# install a few other useful packages plus Open Java 11
+# install a few other useful packages plus Open Java 17
 # Remove unneeded /var/lib/apt/lists/* after install to reduce the
 # docker image size (by ~30MB)
 RUN apt-get update && \
(spark) branch master updated: [SPARK-46968][SQL] Replace `UnsupportedOperationException` by `SparkUnsupportedOperationException` in `sql`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 6e60b232c769 [SPARK-46968][SQL] Replace `UnsupportedOperationException` by `SparkUnsupportedOperationException` in `sql` 6e60b232c769 is described below commit 6e60b232c7693738b1d005858e5dac24e7bafcaf Author: Max Gekk AuthorDate: Sat Feb 3 00:22:06 2024 -0800 [SPARK-46968][SQL] Replace `UnsupportedOperationException` by `SparkUnsupportedOperationException` in `sql` ### What changes were proposed in this pull request? In the PR, I propose to replace all `UnsupportedOperationException` by `SparkUnsupportedOperationException` in `sql` code base, and introduce new legacy error classes with the `_LEGACY_ERROR_TEMP_` prefix. ### Why are the changes needed? To unify Spark SQL exception, and port Java exceptions on Spark exceptions with error classes. ### Does this PR introduce _any_ user-facing change? Yes, it can if user's code assumes some particular format of `UnsupportedOperationException` messages. ### How was this patch tested? By running the affected test suites: ``` $ build/sbt "core/testOnly *SparkThrowableSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44937 from MaxGekk/migrate-UnsupportedOperationException-api. 
Authored-by: Max Gekk Signed-off-by: Dongjoon Hyun --- common/utils/src/main/resources/error/error-classes.json | 10 ++ .../org/apache/spark/sql/catalyst/trees/QueryContexts.scala | 12 ++-- .../scala/org/apache/spark/sql/catalyst/util/UDTUtils.scala | 3 ++- .../org/apache/spark/sql/execution/UnsafeRowSerializer.scala | 2 +- .../sql/execution/streaming/CompactibleFileStreamLog.scala | 4 ++-- .../spark/sql/execution/streaming/ValueStateImpl.scala | 2 -- .../streaming/state/HDFSBackedStateStoreProvider.scala | 5 ++--- .../apache/spark/sql/execution/streaming/state/RocksDB.scala | 7 --- 8 files changed, 27 insertions(+), 18 deletions(-) diff --git a/common/utils/src/main/resources/error/error-classes.json b/common/utils/src/main/resources/error/error-classes.json index 8399311cbfc4..ef9e81c98e05 100644 --- a/common/utils/src/main/resources/error/error-classes.json +++ b/common/utils/src/main/resources/error/error-classes.json @@ -7489,6 +7489,16 @@ "Datatype not supported " ] }, + "_LEGACY_ERROR_TEMP_3193" : { +"message" : [ + "Creating multiple column families with HDFSBackedStateStoreProvider is not supported" +] + }, + "_LEGACY_ERROR_TEMP_3197" : { +"message" : [ + "Failed to create column family with reserved name=" +] + }, "_LEGACY_ERROR_USER_RAISED_EXCEPTION" : { "message" : [ "" diff --git a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala index 57271e535afb..c716002ef35c 100644 --- a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala +++ b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala @@ -17,7 +17,7 @@ package org.apache.spark.sql.catalyst.trees -import org.apache.spark.{QueryContext, QueryContextType} +import org.apache.spark.{QueryContext, QueryContextType, SparkUnsupportedOperationException} /** The class represents error context of a SQL query. 
*/ case class SQLQueryContext( @@ -131,16 +131,16 @@ case class SQLQueryContext( originStartIndex.get <= originStopIndex.get } - override def callSite: String = throw new UnsupportedOperationException + override def callSite: String = throw SparkUnsupportedOperationException() } case class DataFrameQueryContext(stackTrace: Seq[StackTraceElement]) extends QueryContext { override val contextType = QueryContextType.DataFrame - override def objectType: String = throw new UnsupportedOperationException - override def objectName: String = throw new UnsupportedOperationException - override def startIndex: Int = throw new UnsupportedOperationException - override def stopIndex: Int = throw new UnsupportedOperationException + override def objectType: String = throw SparkUnsupportedOperationException() + override def objectName: String = throw SparkUnsupportedOperationException() + override def startIndex: Int = throw SparkUnsupportedOperationException() + override def stopIndex: Int = throw SparkUnsupportedOperationException() override val fragment: String = { stackTrace.headOption.map { firstElem => diff --git a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/UDTUtils.scala b/s
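For readers unfamiliar with Spark's error framework: each `SparkUnsupportedOperationException` carries an error class whose message template lives in `error-classes.json`, which is what keeps wording uniform across the code base. A minimal Python sketch of that template-lookup idea (illustrative only — the real Scala class also has a no-argument form and richer parameter handling):

```python
# Subset of error-classes.json templates, keyed by error class (illustrative).
ERROR_CLASSES = {
    "_LEGACY_ERROR_TEMP_3193":
        "Creating multiple column families with HDFSBackedStateStoreProvider "
        "is not supported",
    "_LEGACY_ERROR_TEMP_3197":
        "Failed to create column family with reserved name=<colFamilyName>",
}

class SparkUnsupportedOperationException(Exception):
    """Sketch of an error-class-based exception: the message is built from a
    shared template, substituting named parameters of the form <name>."""

    def __init__(self, error_class, message_parameters=None):
        self.error_class = error_class
        message = ERROR_CLASSES[error_class]
        for name, value in (message_parameters or {}).items():
            message = message.replace(f"<{name}>", value)
        super().__init__(f"[{error_class}] {message}")

err = SparkUnsupportedOperationException(
    "_LEGACY_ERROR_TEMP_3197", {"colFamilyName": "default"})
```

Because the template is shared, any caller that raises `_LEGACY_ERROR_TEMP_3197` produces an identical message shape, which is the point of porting plain Java exceptions onto error classes.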
(spark) branch master updated: [SPARK-46965][CORE] Check `logType` in `Utils.getLog`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 84387394c387 [SPARK-46965][CORE] Check `logType` in `Utils.getLog`
84387394c387 is described below

commit 84387394c387c7a6c171714f5d45d517b6bec7af
Author: Dongjoon Hyun
AuthorDate: Fri Feb 2 17:22:32 2024 -0800

    [SPARK-46965][CORE] Check `logType` in `Utils.getLog`

    ### What changes were proposed in this pull request?
    This PR aims to check `logType` in `Utils.getLog`.

    ### Why are the changes needed?
    To prevent a path-traversal security vulnerability.

    ### Does this PR introduce _any_ user-facing change?
    No. This is a new module which is not released yet.

    ### How was this patch tested?
    Manually.

    **BEFORE**
    ```
    $ sbin/start-master.sh
    $ curl -s 'http://localhost:8080/logPage/self?logType=../../../../../../etc/nfs.conf' | grep NFS
    # nfs.conf: the NFS configuration file
    ```

    **AFTER**
    ```
    $ sbin/start-master.sh
    $ curl -s 'http://localhost:8080/logPage/self?logType=../../../../../../etc/nfs.conf' | grep NFS
    ```

    For the `Spark History Server`, the same check applies on port 18080:
    ```
    $ curl -s 'http://localhost:18080/logPage/self?logType=../../../../../../../etc/nfs.conf' | grep NFS
    ```

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #45006 from dongjoon-hyun/SPARK-46965.
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/deploy/Utils.scala | 4 1 file changed, 4 insertions(+) diff --git a/core/src/main/scala/org/apache/spark/deploy/Utils.scala b/core/src/main/scala/org/apache/spark/deploy/Utils.scala index 9bbcc9f314b2..32328ae1e07a 100644 --- a/core/src/main/scala/org/apache/spark/deploy/Utils.scala +++ b/core/src/main/scala/org/apache/spark/deploy/Utils.scala @@ -32,6 +32,7 @@ import org.apache.spark.util.logging.RollingFileAppender */ private[deploy] object Utils extends Logging { val DEFAULT_BYTES = 100 * 1024 + val SUPPORTED_LOG_TYPES = Set("stderr", "stdout", "out") def addRenderLogHandler(page: WebUI, conf: SparkConf): Unit = { page.attachHandler(createServletHandler("/log", @@ -58,6 +59,9 @@ private[deploy] object Utils extends Logging { logType: String, offsetOption: Option[Long], byteLength: Int): (String, Long, Long, Long) = { +if (!SUPPORTED_LOG_TYPES.contains(logType)) { + return ("Error: Log type must be one of " + SUPPORTED_LOG_TYPES.mkString(", "), 0, 0, 0) +} try { // Find a log file name val fileName = if (logType.equals("out")) { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
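The patch boils down to an allow-list check performed before any file is resolved. A small Python sketch of the same idea (the set contents and error text mirror the Scala patch; everything else is illustrative):

```python
SUPPORTED_LOG_TYPES = {"stderr", "stdout", "out"}

def get_log(log_dir, log_type):
    # Reject anything outside the allow-list *before* any filesystem lookup,
    # so traversal values like "../../../../etc/nfs.conf" never reach a path.
    if log_type not in SUPPORTED_LOG_TYPES:
        return "Error: Log type must be one of " + ", ".join(sorted(SUPPORTED_LOG_TYPES))
    return f"{log_dir}/{log_type}"
```

An allow-list is preferable to trying to sanitize `..` sequences out of the input: the set of valid values is tiny and known, so everything else can simply be refused.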
(spark) branch master updated: [MINOR][SQL] Clean up outdated comments from `hash` function in `Metadata`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 704e9a0785f4 [MINOR][SQL] Clean up outdated comments from `hash` function in `Metadata` 704e9a0785f4 is described below commit 704e9a0785f4fc4dd86b950a649114e807a826a1 Author: yangjie01 AuthorDate: Thu Feb 1 09:31:24 2024 -0800 [MINOR][SQL] Clean up outdated comments from `hash` function in `Metadata` ### What changes were proposed in this pull request? This pr just clean up outdated comments from `hash` function in `Metadata` ### Why are the changes needed? Clean up outdated comments ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #44978 from LuciferYang/minior-remove-comments. Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- sql/api/src/main/scala/org/apache/spark/sql/types/Metadata.scala | 2 -- 1 file changed, 2 deletions(-) diff --git a/sql/api/src/main/scala/org/apache/spark/sql/types/Metadata.scala b/sql/api/src/main/scala/org/apache/spark/sql/types/Metadata.scala index 17be8cfa12b5..2ffd0f13ca10 100644 --- a/sql/api/src/main/scala/org/apache/spark/sql/types/Metadata.scala +++ b/sql/api/src/main/scala/org/apache/spark/sql/types/Metadata.scala @@ -208,8 +208,6 @@ object Metadata { /** Computes the hash code for the types we support. */ private def hash(obj: Any): Int = { obj match { - // `map.mapValues` return `Map` in Scala 2.12 and return `MapView` in Scala 2.13, call - // `toMap` for Scala version compatibility. case map: Map[_, _] => map.transform((_, v) => hash(v)).## case arr: Array[_] => - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
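The outdated comment was about Scala 2.12/2.13 differences; the surviving expression `map.transform((_, v) => hash(v)).##` hashes the transformed `Map` directly, which is order-independent. A rough Python analogue of that recursive hashing (illustrative, not the actual `Metadata.hash`):

```python
def hash_value(obj):
    # Dicts hash order-independently (like a Scala Map's ##); sequences
    # hash element-wise, so element order matters.
    if isinstance(obj, dict):
        return hash(frozenset((k, hash_value(v)) for k, v in obj.items()))
    if isinstance(obj, (list, tuple)):
        return hash(tuple(hash_value(v) for v in obj))
    return hash(obj)
```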
(spark) branch master updated: [SPARK-46940][CORE] Remove unused `updateSparkConfigFromProperties` and `isAbsoluteURI` in `o.a.s.u.Utils`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d5ca61692c34 [SPARK-46940][CORE] Remove unused `updateSparkConfigFromProperties` and `isAbsoluteURI` in `o.a.s.u.Utils` d5ca61692c34 is described below commit d5ca61692c34449bc602db6cf0919010ec5a50a3 Author: panbingkun AuthorDate: Thu Feb 1 09:30:07 2024 -0800 [SPARK-46940][CORE] Remove unused `updateSparkConfigFromProperties` and `isAbsoluteURI` in `o.a.s.u.Utils` ### What changes were proposed in this pull request? The pr aims to remove unused `updateSparkConfigFromProperties` and `isAbsoluteURI` in `o.a.s.u.Utils`. ### Why are the changes needed? Keep the code cleanly. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44979 from panbingkun/SPARK-46940. Authored-by: panbingkun Signed-off-by: Dongjoon Hyun --- .../main/scala/org/apache/spark/util/Utils.scala | 25 -- 1 file changed, 25 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/util/Utils.scala b/core/src/main/scala/org/apache/spark/util/Utils.scala index a55539c0a235..b49f97aed05e 100644 --- a/core/src/main/scala/org/apache/spark/util/Utils.scala +++ b/core/src/main/scala/org/apache/spark/util/Utils.scala @@ -1884,17 +1884,6 @@ private[spark] object Utils } } - /** Check whether a path is an absolute URI. */ - def isAbsoluteURI(path: String): Boolean = { -try { - val uri = new URI(path: String) - uri.isAbsolute -} catch { - case _: URISyntaxException => -false -} - } - /** Return all non-local paths from a comma-separated list of paths. 
*/ def nonLocalPaths(paths: String, testWindows: Boolean = false): Array[String] = { val windows = isWindows || testWindows @@ -1931,20 +1920,6 @@ private[spark] object Utils path } - /** - * Updates Spark config with properties from a set of Properties. - * Provided properties have the highest priority. - */ - def updateSparkConfigFromProperties( - conf: SparkConf, - properties: Map[String, String]) : Unit = { -properties.filter { case (k, v) => - k.startsWith("spark.") -}.foreach { case (k, v) => - conf.set(k, v) -} - } - /** * Implements the same logic as JDK `java.lang.String#trim` by removing leading and trailing * non-printable characters less or equal to '\u0020' (SPACE) but preserves natural line - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
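For reference, the removed `updateSparkConfigFromProperties` simply copied `spark.`-prefixed entries into the config, with the provided properties taking precedence. In Python terms (an illustrative translation, not a Spark API):

```python
def update_conf_from_properties(conf, properties):
    # Only spark.* keys are copied; provided properties override existing
    # entries, i.e. they have the highest priority.
    for key, value in properties.items():
        if key.startswith("spark."):
            conf[key] = value
    return conf
```

Since nothing in the code base called this helper (or `isAbsoluteURI`) anymore, deleting it is purely a cleanup with no behavior change.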
(spark) branch master updated: [SPARK-46882][SS][TEST] Replace unnecessary AtomicInteger with int
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new c63dea8f4235 [SPARK-46882][SS][TEST] Replace unnecessary AtomicInteger with int
c63dea8f4235 is described below

commit c63dea8f42357ecfd4fe41f04732e2cb0d0d53ae
Author: beliefer
AuthorDate: Wed Jan 31 17:47:50 2024 -0800

    [SPARK-46882][SS][TEST] Replace unnecessary AtomicInteger with int

    ### What changes were proposed in this pull request?
    This PR proposes to replace an unnecessary `AtomicInteger` with a plain int.

    ### Why are the changes needed?
    The variable `value` of `GetMaxCounter` is always accessed while holding the object's own lock (`synchronized`), so the `AtomicInteger` is unnecessary and can be replaced with an int.

    ### Does this PR introduce _any_ user-facing change?
    'No'.

    ### How was this patch tested?
    GA.

    ### Was this patch authored or co-authored using generative AI tooling?
    'No'.

    Closes #44907 from beliefer/SPARK-46882.
Authored-by: beliefer Signed-off-by: Dongjoon Hyun --- .../apache/spark/streaming/util/WriteAheadLogSuite.scala| 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/streaming/src/test/scala/org/apache/spark/streaming/util/WriteAheadLogSuite.scala b/streaming/src/test/scala/org/apache/spark/streaming/util/WriteAheadLogSuite.scala index 3a9fffec13cf..cf9d5b7387f7 100644 --- a/streaming/src/test/scala/org/apache/spark/streaming/util/WriteAheadLogSuite.scala +++ b/streaming/src/test/scala/org/apache/spark/streaming/util/WriteAheadLogSuite.scala @@ -20,7 +20,6 @@ import java.io._ import java.nio.ByteBuffer import java.util.{Iterator => JIterator} import java.util.concurrent.{CountDownLatch, RejectedExecutionException, ThreadPoolExecutor, TimeUnit} -import java.util.concurrent.atomic.AtomicInteger import scala.collection.mutable.ArrayBuffer import scala.concurrent._ @@ -238,14 +237,14 @@ class FileBasedWriteAheadLogSuite val executionContext = ExecutionContext.fromExecutorService(fpool) class GetMaxCounter { - private val value = new AtomicInteger() - @volatile private var max: Int = 0 + private var value = 0 + private var max: Int = 0 def increment(): Unit = synchronized { -val atInstant = value.incrementAndGet() -if (atInstant > max) max = atInstant +value = value + 1 +if (value > max) max = value } - def decrement(): Unit = synchronized { value.decrementAndGet() } - def get(): Int = synchronized { value.get() } + def decrement(): Unit = synchronized { value = value - 1 } + def get(): Int = synchronized { value } def getMax(): Int = synchronized { max } } try { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
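The key observation is that once every access happens inside the same `synchronized` block, a plain integer is enough — the atomic adds nothing. A Python port of the reworked `GetMaxCounter` (illustrative; the original is Scala test code in `WriteAheadLogSuite`):

```python
import threading

class GetMaxCounter:
    """A plain int guarded by one lock; tracks the high-water mark."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0
        self._max = 0

    def increment(self):
        with self._lock:
            self._value += 1
            self._max = max(self._max, self._value)

    def decrement(self):
        with self._lock:
            self._value -= 1

    def get(self):
        with self._lock:
            return self._value

    def get_max(self):
        with self._lock:
            return self._max
```

Mixing an atomic with a `@volatile` max (the old version) gave no extra safety here, because the increment-and-update-max step still had to be one critical section anyway.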
(spark) branch master updated: [SPARK-46931][PS] Implement `{Frame, Series}.to_hdf`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 88f121c47778 [SPARK-46931][PS] Implement `{Frame, Series}.to_hdf` 88f121c47778 is described below commit 88f121c47778f0755862046d09484a83932cb30b Author: Ruifeng Zheng AuthorDate: Wed Jan 31 08:41:21 2024 -0800 [SPARK-46931][PS] Implement `{Frame, Series}.to_hdf` ### What changes were proposed in this pull request? Implement `{Frame, Series}.to_hdf` ### Why are the changes needed? pandas parity ### Does this PR introduce _any_ user-facing change? yes ``` In [3]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c']) In [4]: df.to_hdf('/tmp/data.h5', key='df', mode='w') In [5]: psdf = ps.from_pandas(df) In [6]: psdf.to_hdf('/tmp/data2.h5', key='df', mode='w') /Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:1015: PandasAPIOnSparkAdviceWarning: `to_hdf` loads all data into the driver's memory. It should only be used if the resulting DataFrame is expected to be small. warnings.warn(message, PandasAPIOnSparkAdviceWarning) In [7]: !ls /tmp/*h5 /tmp/data.h5/tmp/data2.h5 In [8]: !ls -lh /tmp/*h5 -rw-r--r-- 1 ruifeng.zheng wheel 6.9K Jan 31 12:21 /tmp/data.h5 -rw-r--r-- 1 ruifeng.zheng wheel 6.9K Jan 31 12:21 /tmp/data2.h5 ``` ### How was this patch tested? manually test, `hdf` requires additional library `pytables` which in turn needs [many prerequisites](https://www.pytables.org/usersguide/installation.html#prerequisites) since `pytables` is just a optional dep of `Pandas`, so I think we can avoid adding it to CI first. ### Was this patch authored or co-authored using generative AI tooling? no Closes #44966 from zhengruifeng/ps_to_hdf. 
Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- .../docs/source/reference/pyspark.pandas/frame.rst | 1 + .../source/reference/pyspark.pandas/series.rst | 1 + python/pyspark/pandas/generic.py | 120 + python/pyspark/pandas/missing/frame.py | 1 - python/pyspark/pandas/missing/series.py| 1 - 5 files changed, 122 insertions(+), 2 deletions(-) diff --git a/python/docs/source/reference/pyspark.pandas/frame.rst b/python/docs/source/reference/pyspark.pandas/frame.rst index 12cf6e7db12f..77b60468b8fb 100644 --- a/python/docs/source/reference/pyspark.pandas/frame.rst +++ b/python/docs/source/reference/pyspark.pandas/frame.rst @@ -286,6 +286,7 @@ Serialization / IO / Conversion DataFrame.to_json DataFrame.to_dict DataFrame.to_excel + DataFrame.to_hdf DataFrame.to_clipboard DataFrame.to_markdown DataFrame.to_records diff --git a/python/docs/source/reference/pyspark.pandas/series.rst b/python/docs/source/reference/pyspark.pandas/series.rst index 88d1861c6ccf..5606fa93a5f3 100644 --- a/python/docs/source/reference/pyspark.pandas/series.rst +++ b/python/docs/source/reference/pyspark.pandas/series.rst @@ -486,6 +486,7 @@ Serialization / IO / Conversion Series.to_json Series.to_csv Series.to_excel + Series.to_hdf Series.to_frame Pandas-on-Spark specific diff --git a/python/pyspark/pandas/generic.py b/python/pyspark/pandas/generic.py index 77cefb53fe5d..ed2aeb8ea6af 100644 --- a/python/pyspark/pandas/generic.py +++ b/python/pyspark/pandas/generic.py @@ -1103,6 +1103,126 @@ class Frame(object, metaclass=ABCMeta): psdf._to_internal_pandas(), self.to_excel, f, args ) +def to_hdf( +self, +path_or_buf: Union[str, pd.HDFStore], +key: str, +mode: str = "a", +complevel: Optional[int] = None, +complib: Optional[str] = None, +append: bool = False, +format: Optional[str] = None, +index: bool = True, +min_itemsize: Optional[Union[int, Dict[str, int]]] = None, +nan_rep: Optional[Any] = None, +dropna: Optional[bool] = None, +data_columns: Optional[Union[bool, List[str]]] = None, 
+errors: str = "strict", +encoding: str = "UTF-8", +) -> None: +""" +Write the contained data to an HDF5 file using HDFStore. + +.. note:: This method should only be used if the resulting DataFrame is expected + to be small, as all the data is loaded into the driver's memory. + +.. versionadded:: 4.0.0 + +Parameters +-- +path_or_buf : str or pandas.HDFStore +File path or HDFStore object. +key : str +
(spark) branch master updated: [SPARK-46930][SQL] Add support for a custom prefix for Union type fields in Avro
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d49265a170fb [SPARK-46930][SQL] Add support for a custom prefix for Union type fields in Avro d49265a170fb is described below commit d49265a170fb7bb06471d97f4483139529939ecd Author: Ivan Sadikov AuthorDate: Wed Jan 31 08:39:46 2024 -0800 [SPARK-46930][SQL] Add support for a custom prefix for Union type fields in Avro ### What changes were proposed in this pull request? This PR enhances stable ids functionality in Avro by allowing users to configure a custom prefix for Union type member fields when `enableStableIdentifiersForUnionType` is enabled. Without the patch, the fields are generated with `member_` prefix, e.g. `member_int`, `member_string`. This could become difficult to change for complex schemas. The solution is to add a new option `stableIdentifierPrefixForUnionType` which defaults to `member_` and allows users to configure whatever prefix they require, e.g. `member`, `tmp_`, or even an empty string. ### Why are the changes needed? Allows to customise the prefix of stable ids in Avro without the need to rename all of the columns which could be cumbersome for complex schemas. ### Does this PR introduce _any_ user-facing change? Yes. The PR adds a new option in Avro: `stableIdentifierPrefixForUnionType`. ### How was this patch tested? Existing tests + a new unit test to verify different prefixes. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44964 from sadikovi/SPARK-46930. 
Authored-by: Ivan Sadikov Signed-off-by: Dongjoon Hyun --- .../apache/spark/sql/avro/AvroDataToCatalyst.scala | 12 +++-- .../apache/spark/sql/avro/AvroDeserializer.scala | 12 +++-- .../org/apache/spark/sql/avro/AvroFileFormat.scala | 3 +- .../org/apache/spark/sql/avro/AvroOptions.scala| 6 +++ .../org/apache/spark/sql/avro/AvroUtils.scala | 5 +- .../apache/spark/sql/avro/SchemaConverters.scala | 58 +++--- .../sql/v2/avro/AvroPartitionReaderFactory.scala | 3 +- .../sql/avro/AvroCatalystDataConversionSuite.scala | 7 +-- .../apache/spark/sql/avro/AvroFunctionsSuite.scala | 3 +- .../apache/spark/sql/avro/AvroRowReaderSuite.scala | 3 +- .../org/apache/spark/sql/avro/AvroSerdeSuite.scala | 3 +- .../org/apache/spark/sql/avro/AvroSuite.scala | 54 +++- docs/sql-data-sources-avro.md | 10 +++- 13 files changed, 133 insertions(+), 46 deletions(-) diff --git a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala index 9f31a2db55a5..7d80998d96eb 100644 --- a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala +++ b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala @@ -40,7 +40,9 @@ private[sql] case class AvroDataToCatalyst( override lazy val dataType: DataType = { val dt = SchemaConverters.toSqlType( - expectedSchema, avroOptions.useStableIdForUnionType).dataType + expectedSchema, + avroOptions.useStableIdForUnionType, + avroOptions.stableIdPrefixForUnionType).dataType parseMode match { // With PermissiveMode, the output Catalyst row might contain columns of null values for // corrupt records, even if some of the columns are not nullable in the user-provided schema. 
@@ -62,8 +64,12 @@ private[sql] case class AvroDataToCatalyst( @transient private lazy val reader = new GenericDatumReader[Any](actualSchema, expectedSchema) @transient private lazy val deserializer = -new AvroDeserializer(expectedSchema, dataType, - avroOptions.datetimeRebaseModeInRead, avroOptions.useStableIdForUnionType) +new AvroDeserializer( + expectedSchema, + dataType, + avroOptions.datetimeRebaseModeInRead, + avroOptions.useStableIdForUnionType, + avroOptions.stableIdPrefixForUnionType) @transient private var decoder: BinaryDecoder = _ diff --git a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala index 9e10fac8bb55..139c45adb442 100644 --- a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala +++ b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala @@ -50,20 +50,23 @@ private[sql] class AvroDeserializer( positionalFieldMatch: Boolean, datetimeRebaseSpec: RebaseSpec, filters: StructFilters, -useStableIdForUnionType: Boolean) { +useStableIdForUnionType: Boolean
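Conceptually, stable identifiers turn each Avro union member into a struct field named `<prefix><member type>`, and the new option only makes the prefix configurable. A simplified sketch (illustrative — the real `SchemaConverters` also de-duplicates colliding names and handles nested and logical types):

```python
def stable_union_field_names(member_type_names, prefix="member_"):
    # One struct field per union member; the default prefix reproduces the
    # historical member_int / member_string names.
    return [f"{prefix}{name}" for name in member_type_names]

stable_union_field_names(["int", "string"])            # -> ['member_int', 'member_string']
stable_union_field_names(["int", "string"], "tmp_")    # -> ['tmp_int', 'tmp_string']
```

Setting the prefix to an empty string yields bare type names, which is the kind of bulk rename that would otherwise require touching every column of a complex schema.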
(spark) branch master updated: [SPARK-46921][BUILD] Move `ProblemFilters` that do not belong to `defaultExcludes` to `v40excludes`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8e29c0d8c5cd [SPARK-46921][BUILD] Move `ProblemFilters` that do not belong to `defaultExcludes` to `v40excludes` 8e29c0d8c5cd is described below commit 8e29c0d8c5cdc87d0a7358e090af864c4f03b1a8 Author: yangjie01 AuthorDate: Tue Jan 30 08:44:34 2024 -0800 [SPARK-46921][BUILD] Move `ProblemFilters` that do not belong to `defaultExcludes` to `v40excludes` ### What changes were proposed in this pull request? This pr just move `ProblemFilters` that do not belong to `defaultExcludes` to `v40excludes`. ### Why are the changes needed? We should not arbitrarily add entries to `defaultExcludes`, as it represents never participating in the mima check ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Mima check passed ### Was this patch authored or co-authored using generative AI tooling? No Closes #44952 from LuciferYang/SPARK-46921. 
Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- project/MimaExcludes.scala | 38 +++--- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala index 43723742be97..64c5599919a6 100644 --- a/project/MimaExcludes.scala +++ b/project/MimaExcludes.scala @@ -59,7 +59,25 @@ object MimaExcludes { // [SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.SparkEnv.this"), // [SPARK-46480][CORE][SQL] Fix NPE when table cache task attempt - ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.TaskContext.isFailed") + ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.TaskContext.isFailed"), + +// SPARK-43299: Convert StreamingQueryException in Scala Client + ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.streaming.StreamingQueryException"), + +// SPARK-45856: Move ArtifactManager from Spark Connect into SparkSession (sql/core) + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.apply"), + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.userId"), + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.sessionId"), + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.copy"), + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.copy$default$3"), + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.this"), + ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.storage.CacheId$"), + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.apply"), + +// SPARK-46410: Assign error classes/subclasses to JdbcUtils.classifyException + 
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.jdbc.JdbcDialect.classifyException"), +// [SPARK-464878][CORE][SQL] (false alert). Invalid rule for StringType extension. + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.types.StringType.this") ) // Default exclude rules @@ -92,24 +110,6 @@ object MimaExcludes { ProblemFilters.exclude[Problem]("org.sparkproject.spark_protobuf.protobuf.*"), ProblemFilters.exclude[Problem]("org.apache.spark.sql.protobuf.utils.SchemaConverters.*"), -// SPARK-43299: Convert StreamingQueryException in Scala Client - ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.streaming.StreamingQueryException"), - -// SPARK-45856: Move ArtifactManager from Spark Connect into SparkSession (sql/core) - ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.apply"), - ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.userId"), - ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.sessionId"), - ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.copy"), - ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.copy$default$3"), - ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.this"), - ProblemFilters.exclude[MissingTypesPro
(spark) branch branch-3.4 updated: [SPARK-46893][UI] Remove inline scripts from UI descriptions
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new edaa0fd8d096 [SPARK-46893][UI] Remove inline scripts from UI descriptions edaa0fd8d096 is described below commit edaa0fd8d096a3e57918e4b6e437337fcfdc8276 Author: Willi Raschkowski AuthorDate: Mon Jan 29 22:43:21 2024 -0800 [SPARK-46893][UI] Remove inline scripts from UI descriptions ### What changes were proposed in this pull request? This PR prevents malicious users from injecting inline scripts via job and stage descriptions. Spark's Web UI [already checks the security of job and stage descriptions](https://github.com/apache/spark/blob/a368280708dd3c6eb90bd3b09a36a68bdd096222/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L528-L545) before rendering them as HTML (or treating them as plain text). The UI already disallows `
(spark) branch branch-3.5 updated: [SPARK-46893][UI] Remove inline scripts from UI descriptions
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 343ae8226161 [SPARK-46893][UI] Remove inline scripts from UI descriptions 343ae8226161 is described below commit 343ae822616185022570f1c14b151e54ff54e265 Author: Willi Raschkowski AuthorDate: Mon Jan 29 22:43:21 2024 -0800 [SPARK-46893][UI] Remove inline scripts from UI descriptions ### What changes were proposed in this pull request? This PR prevents malicious users from injecting inline scripts via job and stage descriptions. Spark's Web UI [already checks the security of job and stage descriptions](https://github.com/apache/spark/blob/a368280708dd3c6eb90bd3b09a36a68bdd096222/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L528-L545) before rendering them as HTML (or treating them as plain text). The UI already disallows `
(spark) branch master updated (41a1426e9ee3 -> abd9d27e87b9)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 41a1426e9ee3 [SPARK-46914][UI] Shorten app name in the summary table on the History Page add abd9d27e87b9 [SPARK-46893][UI] Remove inline scripts from UI descriptions No new revisions were added by this update. Summary of changes: core/src/main/scala/org/apache/spark/ui/UIUtils.scala | 12 +--- core/src/test/scala/org/apache/spark/ui/UIUtilsSuite.scala | 14 ++ 2 files changed, 23 insertions(+), 3 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
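The XSS hardening described in the emails above boils down to never letting user-supplied job or stage descriptions execute as live HTML. As a rough, hedged illustration (not Spark's actual `UIUtils` implementation, which validates the submitted markup rather than escaping everything), treating the description as plain text defuses an injected script:

```python
import html

def render_description(desc: str) -> str:
    # Simplified illustration only: escaping the description means an
    # injected <script> tag or inline onclick= handler is displayed as
    # text in the UI instead of being executed by the browser.
    return html.escape(desc)

print(render_description('<script>alert(1)</script>'))
# &lt;script&gt;alert(1)&lt;/script&gt;
```

Spark's real check is more permissive (it allows a safe subset of HTML in descriptions), but the escaping fallback above is the conservative baseline.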
(spark) branch master updated: [SPARK-46914][UI] Shorten app name in the summary table on the History Page
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 41a1426e9ee3 [SPARK-46914][UI] Shorten app name in the summary table on the History Page 41a1426e9ee3 is described below commit 41a1426e9ee318a9421fad11776eb6894bb1f04b Author: Kent Yao AuthorDate: Mon Jan 29 22:07:19 2024 -0800 [SPARK-46914][UI] Shorten app name in the summary table on the History Page ### What changes were proposed in this pull request? This Pull Request shortens long app names to prevent overflow in the app table. ### Why are the changes needed? better UX ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new js tests and built and tested locally: ![image](https://github.com/apache/spark/assets/8326978/f78bd580-74b1-4fe5-9d8b-f2d49ce85ed9) ![image](https://github.com/apache/spark/assets/8326978/10bca509-00e5-4d8f-bf11-324c1080190b) ### Was this patch authored or co-authored using generative AI tooling? no Closes #44944 from yaooqinn/SPARK-46914. 
Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- .../resources/org/apache/spark/ui/static/historypage.js | 12 .../main/resources/org/apache/spark/ui/static/utils.js| 15 ++- ui-test/tests/utils.test.js | 7 +++ 3 files changed, 29 insertions(+), 5 deletions(-) diff --git a/core/src/main/resources/org/apache/spark/ui/static/historypage.js b/core/src/main/resources/org/apache/spark/ui/static/historypage.js index 85cd5a554750..8961140a4019 100644 --- a/core/src/main/resources/org/apache/spark/ui/static/historypage.js +++ b/core/src/main/resources/org/apache/spark/ui/static/historypage.js @@ -17,7 +17,7 @@ /* global $, Mustache, jQuery, uiRoot */ -import {formatDuration, formatTimeMillis} from "./utils.js"; +import {formatDuration, formatTimeMillis, stringAbbreviate} from "./utils.js"; export {setAppLimit}; @@ -186,9 +186,13 @@ $(document).ready(function() { name: 'appId', type: "appid-numeric", data: 'id', -render: (id, type, row) => `${id}` +render: (id, type, row) => `${id}` + }, + { +name: 'appName', +data: 'name', +render: (name) => stringAbbreviate(name, 60) }, - {name: 'appName', data: 'name' }, { name: attemptIdColumnName, data: 'attemptId', @@ -200,7 +204,7 @@ $(document).ready(function() { name: durationColumnName, type: "title-numeric", data: 'duration', -render: (id, type, row) => `${row.duration}` +render: (id, type, row) => `${row.duration}` }, {name: 'user', data: 'sparkUser' }, {name: 'lastUpdated', data: 'lastUpdated' }, diff --git a/core/src/main/resources/org/apache/spark/ui/static/utils.js b/core/src/main/resources/org/apache/spark/ui/static/utils.js index 960640791fe5..2d4123bc75ab 100644 --- a/core/src/main/resources/org/apache/spark/ui/static/utils.js +++ b/core/src/main/resources/org/apache/spark/ui/static/utils.js @@ -20,7 +20,7 @@ export { errorMessageCell, errorSummary, formatBytes, formatDate, formatDuration, formatLogsCells, formatTimeMillis, getBaseURI, getStandAloneAppId, getTimeZone, - setDataTableDefaults + setDataTableDefaults, 
stringAbbreviate }; /* global $, uiRoot */ @@ -272,3 +272,16 @@ function errorMessageCell(errorMessage) { const details = detailsUINode(isMultiline, errorMessage); return summary + details; } + +function stringAbbreviate(content, limit) { + if (content && content.length > limit) { +const summary = content.substring(0, limit) + '...'; +// TODO: Reused stacktrace-details* style for convenience, but it's not really a stacktrace +// Consider creating a new style for this case if stacktrace-details is not appropriate in +// the future. +const details = detailsUINode(true, content); +return summary + details; + } else { +return content; + } +} diff --git a/ui-test/tests/utils.test.js b/ui-test/tests/utils.test.js index ad3e87b76641..a6815577bd82 100644 --- a/ui-test/tests/utils.test.js +++ b/ui-test/tests/utils.test.js @@ -67,3 +67,10 @@ test('errorSummary', function () { const e2 = "java.lang.RuntimeException: random text"; expect(utils.errorSummary(e2).toString()).toBe('java.lang.RuntimeException,true'); }); + +test('stringAbbreviat
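The truncation rule added in `stringAbbreviate` can be restated in Python for clarity (a sketch of the rule only; the real JavaScript helper additionally wraps the full text in an expandable details node):

```python
def string_abbreviate(content, limit):
    # Texts longer than `limit` keep the first `limit` characters plus an
    # ellipsis; shorter (or empty/None) texts pass through unchanged.
    if content and len(content) > limit:
        return content[:limit] + '...'
    return content

string_abbreviate('abcdefgh', 5)  # -> 'abcde...'
string_abbreviate('abc', 5)      # -> 'abc'
```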
(spark) branch master updated: [SPARK-46916][PS][TESTS] Clean up `pyspark.pandas.tests.indexes.*`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e3143c4c2806 [SPARK-46916][PS][TESTS] Clean up `pyspark.pandas.tests.indexes.*` e3143c4c2806 is described below commit e3143c4c28068b80865c4ed9780a5a4beec0a7e8 Author: Ruifeng Zheng AuthorDate: Mon Jan 29 22:05:12 2024 -0800 [SPARK-46916][PS][TESTS] Clean up `pyspark.pandas.tests.indexes.*` ### What changes were proposed in this pull request? Clean up `pyspark.pandas.tests.indexes.*`: 1, delete unused imports, variables; 2, avoid double definition of the testing datasets; ### Why are the changes needed? code clean up ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #44945 from zhengruifeng/ps_test_index_cleanup. Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- .../pandas/tests/connect/indexes/test_parity_align.py| 11 ++- .../pandas/tests/connect/indexes/test_parity_indexing.py | 11 ++- .../pandas/tests/connect/indexes/test_parity_reindex.py | 11 ++- .../pandas/tests/connect/indexes/test_parity_rename.py | 11 ++- .../pandas/tests/connect/indexes/test_parity_reset_index.py | 9 - python/pyspark/pandas/tests/indexes/test_align.py| 8 ++-- python/pyspark/pandas/tests/indexes/test_asof.py | 4 ++-- python/pyspark/pandas/tests/indexes/test_astype.py | 4 ++-- python/pyspark/pandas/tests/indexes/test_datetime.py | 6 +- python/pyspark/pandas/tests/indexes/test_delete.py | 4 ++-- python/pyspark/pandas/tests/indexes/test_diff.py | 4 ++-- python/pyspark/pandas/tests/indexes/test_drop.py | 4 ++-- python/pyspark/pandas/tests/indexes/test_indexing.py | 12 ++-- .../pandas/tests/indexes/test_indexing_loc_multi_idx.py | 1 - python/pyspark/pandas/tests/indexes/test_insert.py | 11 ++- 
python/pyspark/pandas/tests/indexes/test_map.py | 4 ++-- python/pyspark/pandas/tests/indexes/test_reindex.py | 8 ++-- python/pyspark/pandas/tests/indexes/test_rename.py | 8 ++-- python/pyspark/pandas/tests/indexes/test_reset_index.py | 8 ++-- python/pyspark/pandas/tests/indexes/test_sort.py | 4 ++-- python/pyspark/pandas/tests/indexes/test_symmetric_diff.py | 4 ++-- python/pyspark/pandas/tests/indexes/test_take.py | 4 ++-- python/pyspark/pandas/tests/indexes/test_timedelta.py| 6 +- 23 files changed, 92 insertions(+), 65 deletions(-) diff --git a/python/pyspark/pandas/tests/connect/indexes/test_parity_align.py b/python/pyspark/pandas/tests/connect/indexes/test_parity_align.py index 0bf84e6421f2..2bb56242ba34 100644 --- a/python/pyspark/pandas/tests/connect/indexes/test_parity_align.py +++ b/python/pyspark/pandas/tests/connect/indexes/test_parity_align.py @@ -16,16 +16,17 @@ # import unittest -from pyspark import pandas as ps from pyspark.pandas.tests.indexes.test_align import FrameAlignMixin from pyspark.testing.connectutils import ReusedConnectTestCase from pyspark.testing.pandasutils import PandasOnSparkTestUtils -class FrameParityAlignTests(FrameAlignMixin, PandasOnSparkTestUtils, ReusedConnectTestCase): -@property -def psdf(self): -return ps.from_pandas(self.pdf) +class FrameParityAlignTests( +FrameAlignMixin, +PandasOnSparkTestUtils, +ReusedConnectTestCase, +): +pass if __name__ == "__main__": diff --git a/python/pyspark/pandas/tests/connect/indexes/test_parity_indexing.py b/python/pyspark/pandas/tests/connect/indexes/test_parity_indexing.py index a76489314d25..5e52dd91474a 100644 --- a/python/pyspark/pandas/tests/connect/indexes/test_parity_indexing.py +++ b/python/pyspark/pandas/tests/connect/indexes/test_parity_indexing.py @@ -16,16 +16,17 @@ # import unittest -from pyspark import pandas as ps from pyspark.pandas.tests.indexes.test_indexing import FrameIndexingMixin from pyspark.testing.connectutils import ReusedConnectTestCase from 
pyspark.testing.pandasutils import PandasOnSparkTestUtils -class FrameParityIndexingTests(FrameIndexingMixin, PandasOnSparkTestUtils, ReusedConnectTestCase): -@property -def psdf(self): -return ps.from_pandas(self.pdf) +class FrameParityIndexingTests( +FrameIndexingMixin, +PandasOnSparkTestUtils, +ReusedConnectTestCase, +): +pass if __name__ == "__main__": diff --git a/python/pyspark/pandas/tests/connect/indexes/test_parity_reindex.py b/python/pyspark/pandas/tests/connect/indexes/test_parity_reindex.p
(spark) branch master updated: [SPARK-46907][CORE] Show driver log location in Spark History Server
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 29355c07580e [SPARK-46907][CORE] Show driver log location in Spark History Server 29355c07580e is described below commit 29355c07580e68d48546e9e210c876b69a8c10a2 Author: Dongjoon Hyun AuthorDate: Mon Jan 29 14:04:07 2024 -0800 [SPARK-46907][CORE] Show driver log location in Spark History Server ### What changes were proposed in this pull request? This PR aims to show `Driver Log Location` in Spark History Server UI if `spark.driver.log.dfsDir` is configured. ### Why are the changes needed? **BEFORE (or `spark.driver.log.dfsDir` is absent)** ![Screenshot 2024-01-29 at 10 11 06 AM](https://github.com/apache/spark/assets/9700541/6d709b4b-d002-422b-a1df-bb5e1b50b539) **AFTER** ![Screenshot 2024-01-29 at 10 10 25 AM](https://github.com/apache/spark/assets/9700541/83b35a7d-5fc9-443a-a6e5-7b6bd98dbdc6) ### Does this PR introduce _any_ user-facing change? No. This is additional UI information, shown only for users who set the `spark.driver.log.dfsDir` configuration. ### How was this patch tested? Manual. ``` $ mkdir /tmp/history $ mkdir /tmp/driver-logs $ SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=/tmp/history -Dspark.driver.log.dfsDir=/tmp/driver-logs" sbin/start-history-server.sh ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44936 from dongjoon-hyun/SPARK-46907. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../scala/org/apache/spark/deploy/history/FsHistoryProvider.scala | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala index 8f64de0847ec..7c888a07263a 100644 --- a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala +++ b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala @@ -381,7 +381,12 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) } else { Map() } -Map("Event log directory" -> logDir) ++ safeMode +val driverLog = if (conf.contains(DRIVER_LOG_DFS_DIR)) { + Map("Driver log directory" -> conf.get(DRIVER_LOG_DFS_DIR).get) +} else { + Map() +} +Map("Event log directory" -> logDir) ++ safeMode ++ driverLog } override def start(): Unit = { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
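The patch above uses a common pattern: each optional summary row is built as a separate, possibly empty map, and the maps are concatenated, so "Driver log directory" only appears when `spark.driver.log.dfsDir` is configured. A Python rendering of that idea (illustrative, not the Scala code):

```python
def history_summary(conf: dict, log_dir: str) -> dict:
    # Optional rows are separate (possibly empty) dicts merged into the
    # result, mirroring `Map(...) ++ safeMode ++ driverLog` in the patch.
    driver_log = {}
    if "spark.driver.log.dfsDir" in conf:
        driver_log["Driver log directory"] = conf["spark.driver.log.dfsDir"]
    return {"Event log directory": log_dir, **driver_log}
```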
(spark) branch master updated (e211dbdee42c -> c468c3d5c685)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from e211dbdee42c [SPARK-46831][SQL] Collations - Extending StringType and PhysicalStringType with collationId field add c468c3d5c685 [SPARK-46904][UI] Fix display issue of History UI summary No new revisions were added by this update. Summary of changes: .../apache/spark/deploy/history/HistoryPage.scala | 121 +++-- 1 file changed, 64 insertions(+), 57 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46903][CORE] Support Spark History Server Log UI
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8dd395b2eabd [SPARK-46903][CORE] Support Spark History Server Log UI 8dd395b2eabd is described below commit 8dd395b2eabd2815982022b38a5287dae7af8b82 Author: Dongjoon Hyun AuthorDate: Mon Jan 29 01:32:45 2024 -0800 [SPARK-46903][CORE] Support Spark History Server Log UI ### What changes were proposed in this pull request? This PR aims to make `Spark History Server` provide its server log view link and page. ### Why are the changes needed? To improve UX. - A `Show server log` link is added at the bottom of the page. ![Screenshot 2024-01-29 at 12 54 41 AM](https://github.com/apache/spark/assets/9700541/7e5cea9f-8ac8-4a60-a249-d1bb31f6e269) - The link opens the following log view page. ![Screenshot 2024-01-29 at 12 55 41 AM](https://github.com/apache/spark/assets/9700541/70cf0c77-fc67-4ad8-97db-b061fdd1ffd0) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44932 from dongjoon-hyun/SPARK-46903. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../apache/spark/deploy/history/HistoryPage.scala | 2 + .../spark/deploy/history/HistoryServer.scala | 1 + .../org/apache/spark/deploy/history/LogPage.scala | 126 + 3 files changed, 129 insertions(+) diff --git a/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala b/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala index 7ba9b2c54937..03d880f47306 100644 --- a/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala +++ b/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala @@ -94,6 +94,8 @@ private[history] class HistoryPage(parent: HistoryServer) extends WebUIPage("") } } + + Show server log UIUtils.basicSparkPage(request, content, "History Server", true) diff --git a/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala b/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala index 321f76923411..8ba610e0a13d 100644 --- a/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala +++ b/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala @@ -148,6 +148,7 @@ class HistoryServer( */ def initialize(): Unit = { attachPage(new HistoryPage(this)) +attachPage(new LogPage(conf)) attachHandler(ApiRootResource.getServletHandler(this)) diff --git a/core/src/main/scala/org/apache/spark/deploy/history/LogPage.scala b/core/src/main/scala/org/apache/spark/deploy/history/LogPage.scala new file mode 100644 index ..72d88e14a122 --- /dev/null +++ b/core/src/main/scala/org/apache/spark/deploy/history/LogPage.scala @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. 
+ * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.deploy.history + +import java.io.File +import javax.servlet.http.HttpServletRequest + +import scala.xml.{Node, Unparsed} + +import org.apache.spark.SparkConf +import org.apache.spark.internal.Logging +import org.apache.spark.ui.{UIUtils, WebUIPage} +import org.apache.spark.util.Utils +import org.apache.spark.util.logging.RollingFileAppender + +private[history] class LogPage(conf: SparkConf) extends WebUIPage("logPage") with Logging { + private val defaultBytes = 100 * 1024 + + def render(request: HttpServletRequest): Seq[Node] = { +val logDir = sys.env.getOrElse("SPARK_LOG_DIR", "logs/") +val logType = request.getParameter("logType") +val offset = Option(request.getParameter("offset")).map(_.toLong) +val byteLength = Option(request.getParameter("byteLength")).map(_.toInt) +
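The `LogPage` diff is cut off here, but pages of this kind typically take `offset` and `byteLength` request parameters, default to showing the tail of the log, and clamp the requested window to the file bounds. A hypothetical sketch of that clamping (the function name and clamping details are assumptions, not the committed `LogPage` code; only the 100 KiB default is visible above):

```python
def log_slice(log_text, offset=None, byte_length=None, default_bytes=100 * 1024):
    # Assumed behavior: with no offset, show the last `byte_length` bytes;
    # clamp start/end so the window never runs past either end of the log.
    total = len(log_text)
    length = byte_length if byte_length is not None else default_bytes
    start = offset if offset is not None else max(total - length, 0)
    start = min(max(start, 0), total)
    end = min(start + length, total)
    return log_text[start:end]
```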
(spark) branch master updated: [SPARK-46902][UI] Fix Spark History Server UI for using un-exported setAppLimit
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1386b52f3eb6 [SPARK-46902][UI] Fix Spark History Server UI for using un-exported setAppLimit 1386b52f3eb6 is described below commit 1386b52f3eb624331345611ef1f6ecc44047f80f Author: Kent Yao AuthorDate: Mon Jan 29 01:26:57 2024 -0800 [SPARK-46902][UI] Fix Spark History Server UI for using un-exported setAppLimit ### What changes were proposed in this pull request? Fix Spark History Server UI for using un-exported `setAppLimit` to render the dataTables of app list close #44930 ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Locally built and tested ![image](https://github.com/apache/spark/assets/8326978/6899b1a2-0232-4f85-9389-e5c18db8d9d3) ### Was this patch authored or co-authored using generative AI tooling? no Closes #44931 from yaooqinn/SPARK-46902. 
Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- .../main/resources/org/apache/spark/ui/static/historypage.js | 2 ++ .../scala/org/apache/spark/deploy/history/HistoryPage.scala | 11 +-- 2 files changed, 11 insertions(+), 2 deletions(-) diff --git a/core/src/main/resources/org/apache/spark/ui/static/historypage.js b/core/src/main/resources/org/apache/spark/ui/static/historypage.js index 08438e6eda61..85cd5a554750 100644 --- a/core/src/main/resources/org/apache/spark/ui/static/historypage.js +++ b/core/src/main/resources/org/apache/spark/ui/static/historypage.js @@ -19,6 +19,8 @@ import {formatDuration, formatTimeMillis} from "./utils.js"; +export {setAppLimit}; + var appLimit = -1; /* eslint-disable no-unused-vars */ diff --git a/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala b/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala index b8f064c68cdd..7ba9b2c54937 100644 --- a/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala +++ b/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala @@ -19,10 +19,11 @@ package org.apache.spark.deploy.history import javax.servlet.http.HttpServletRequest -import scala.xml.Node +import scala.xml.{Node, Unparsed} import org.apache.spark.status.api.v1.ApplicationInfo import org.apache.spark.ui.{UIUtils, WebUIPage} +import org.apache.spark.ui.UIUtils.formatImportJavaScript private[history] class HistoryPage(parent: HistoryServer) extends WebUIPage("") { @@ -63,12 +64,18 @@ private[history] class HistoryPage(parent: HistoryServer) extends WebUIPage("") { if (displayApplications) { + val js = +s""" + |${formatImportJavaScript(request, "/static/historypage.js", "setAppLimit")} + | + |setAppLimit(${parent.maxApplications}); + |""".stripMargin ++ ++ ++ -setAppLimit({parent.maxApplications}) +{Unparsed(js)} } else if (requestedIncomplete) { No incomplete applications found! 
} else if (eventLogsUnderProcessCount > 0) { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46900][BUILD] Upgrade slf4j to 2.0.11
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a368280708dd [SPARK-46900][BUILD] Upgrade slf4j to 2.0.11 a368280708dd is described below commit a368280708dd3c6eb90bd3b09a36a68bdd096222 Author: yangjie01 AuthorDate: Sun Jan 28 23:42:37 2024 -0800 [SPARK-46900][BUILD] Upgrade slf4j to 2.0.11 ### What changes were proposed in this pull request? This PR aims to upgrade slf4j from 2.0.10 to 2.0.11. ### Why are the changes needed? This release reinstates the `renderLevel()` method in SimpleLogger, which was removed by mistake. The full release notes are as follows: - https://www.slf4j.org/news.html#2.0.11 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #44928 from LuciferYang/SPARK-46900. 
Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- dev/deps/spark-deps-hadoop-3-hive-2.3 | 6 +++--- pom.xml | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index 09291de50350..06fb4d879db2 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -123,7 +123,7 @@ javassist/3.29.2-GA//javassist-3.29.2-GA.jar javax.jdo/3.2.0-m3//javax.jdo-3.2.0-m3.jar javolution/5.5.1//javolution-5.5.1.jar jaxb-runtime/2.3.2//jaxb-runtime-2.3.2.jar -jcl-over-slf4j/2.0.10//jcl-over-slf4j-2.0.10.jar +jcl-over-slf4j/2.0.11//jcl-over-slf4j-2.0.11.jar jdo-api/3.0.1//jdo-api-3.0.1.jar jdom2/2.0.6//jdom2-2.0.6.jar jersey-client/2.41//jersey-client-2.41.jar @@ -148,7 +148,7 @@ json4s-jackson_2.13/3.7.0-M11//json4s-jackson_2.13-3.7.0-M11.jar json4s-scalap_2.13/3.7.0-M11//json4s-scalap_2.13-3.7.0-M11.jar jsr305/3.0.0//jsr305-3.0.0.jar jta/1.1//jta-1.1.jar -jul-to-slf4j/2.0.10//jul-to-slf4j-2.0.10.jar +jul-to-slf4j/2.0.11//jul-to-slf4j-2.0.11.jar kryo-shaded/4.0.2//kryo-shaded-4.0.2.jar kubernetes-client-api/6.10.0//kubernetes-client-api-6.10.0.jar kubernetes-client/6.10.0//kubernetes-client-6.10.0.jar @@ -247,7 +247,7 @@ scala-parallel-collections_2.13/1.0.4//scala-parallel-collections_2.13-1.0.4.jar scala-parser-combinators_2.13/2.3.0//scala-parser-combinators_2.13-2.3.0.jar scala-reflect/2.13.12//scala-reflect-2.13.12.jar scala-xml_2.13/2.2.0//scala-xml_2.13-2.2.0.jar -slf4j-api/2.0.10//slf4j-api-2.0.10.jar +slf4j-api/2.0.11//slf4j-api-2.0.11.jar snakeyaml-engine/2.7//snakeyaml-engine-2.7.jar snakeyaml/2.2//snakeyaml-2.2.jar snappy-java/1.1.10.5//snappy-java-1.1.10.5.jar diff --git a/pom.xml b/pom.xml index a5f2b6f74b7a..b78f49499feb 100644 --- a/pom.xml +++ b/pom.xml @@ -119,7 +119,7 @@ 3.1.0 spark 9.6 -2.0.10 +2.0.11 2.22.1 3.3.6 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: 
commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46901][PYTHON] Upgrade `pyarrow` to 15.0.0
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 487cbc086a30 [SPARK-46901][PYTHON] Upgrade `pyarrow` to 15.0.0 487cbc086a30 is described below commit 487cbc086a30ec4d58695336acbe8037a3d5ebe7 Author: Ruifeng Zheng AuthorDate: Sun Jan 28 23:41:49 2024 -0800 [SPARK-46901][PYTHON] Upgrade `pyarrow` to 15.0.0 ### What changes were proposed in this pull request? Upgrade `pyarrow` to 15.0.0 ### Why are the changes needed? to support latest pyarrow ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #44924 from zhengruifeng/py_arrow_15. Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- dev/infra/Dockerfile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dev/infra/Dockerfile b/dev/infra/Dockerfile index 976f94251d7a..fc515d4478ad 100644 --- a/dev/infra/Dockerfile +++ b/dev/infra/Dockerfile @@ -94,7 +94,7 @@ RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3 RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.1.4' scipy coverage matplotlib lxml -ARG BASIC_PIP_PKGS="numpy pyarrow>=14.0.0 six==1.16.0 pandas<=2.1.4 scipy plotly>=4.8 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2" +ARG BASIC_PIP_PKGS="numpy pyarrow>=15.0.0 six==1.16.0 pandas<=2.1.4 scipy plotly>=4.8 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2" # Python deps for Spark Connect ARG CONNECT_PIP_PKGS="grpcio==1.59.3 grpcio-status==1.59.3 protobuf==4.25.1 googleapis-common-protos==1.56.4" - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46899][CORE] Remove `POST` APIs from `MasterWebUI` when `spark.ui.killEnabled` is `false`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 95a4abd5b5bc [SPARK-46899][CORE] Remove `POST` APIs from `MasterWebUI` when `spark.ui.killEnabled` is `false` 95a4abd5b5bc is described below commit 95a4abd5b5bcc36335be9af84b7bbddd7d0034ba Author: Dongjoon Hyun AuthorDate: Sun Jan 28 22:38:32 2024 -0800 [SPARK-46899][CORE] Remove `POST` APIs from `MasterWebUI` when `spark.ui.killEnabled` is `false` ### What changes were proposed in this pull request? This PR aims to remove `POST` APIs from `MasterWebUI` when `spark.ui.killEnabled` is false. ### Why are the changes needed? If `spark.ui.killEnabled` is false, we don't need to attach the `POST`-related redirect or servlet handlers in the first place, because they would be ignored in `MasterPage` anyway. https://github.com/apache/spark/blob/8cd0d1854da04334aff3188e4eca08a48f734579/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala#L64-L65 ### Does this PR introduce _any_ user-facing change? Previously, the user request was silently ignored after redirecting. Now, the server responds with the correct HTTP error code, 405 `Method Not Allowed`. ### How was this patch tested? Pass the CIs with the newly added test suite, `ReadOnlyMasterWebUISuite`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44926 from dongjoon-hyun/SPARK-46899. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../spark/deploy/master/ui/MasterWebUI.scala | 46 ++--- .../spark/deploy/master/ui/MasterWebUISuite.scala | 9 ++- .../master/ui/ReadOnlyMasterWebUISuite.scala | 75 ++ 3 files changed, 105 insertions(+), 25 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala index 3025c0bf468b..14ea6dbb3d20 100644 --- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala +++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala @@ -54,31 +54,33 @@ class MasterWebUI( attachPage(new LogPage(this)) attachPage(masterPage) addStaticHandler(MasterWebUI.STATIC_RESOURCE_DIR) -attachHandler(createRedirectHandler( - "/app/kill", "/", masterPage.handleAppKillRequest, httpMethods = Set("POST"))) -attachHandler(createRedirectHandler( - "/driver/kill", "/", masterPage.handleDriverKillRequest, httpMethods = Set("POST"))) -attachHandler(createServletHandler("/workers/kill", new HttpServlet { - override def doPost(req: HttpServletRequest, resp: HttpServletResponse): Unit = { -val hostnames: Seq[String] = Option(req.getParameterValues("host")) - .getOrElse(Array[String]()).toImmutableArraySeq -if (decommissionDisabled || !isDecommissioningRequestAllowed(req)) { - resp.sendError(HttpServletResponse.SC_METHOD_NOT_ALLOWED) -} else { - val removedWorkers = masterEndpointRef.askSync[Integer]( -DecommissionWorkersOnHosts(hostnames)) - logInfo(s"Decommissioning of hosts $hostnames decommissioned $removedWorkers workers") - if (removedWorkers > 0) { -resp.setStatus(HttpServletResponse.SC_OK) - } else if (removedWorkers == 0) { -resp.sendError(HttpServletResponse.SC_NOT_FOUND) +if (killEnabled) { + attachHandler(createRedirectHandler( +"/app/kill", "/", masterPage.handleAppKillRequest, httpMethods = Set("POST"))) + attachHandler(createRedirectHandler( +"/driver/kill", "/", 
masterPage.handleDriverKillRequest, httpMethods = Set("POST"))) + attachHandler(createServletHandler("/workers/kill", new HttpServlet { +override def doPost(req: HttpServletRequest, resp: HttpServletResponse): Unit = { + val hostnames: Seq[String] = Option(req.getParameterValues("host")) +.getOrElse(Array[String]()).toImmutableArraySeq + if (decommissionDisabled || !isDecommissioningRequestAllowed(req)) { +resp.sendError(HttpServletResponse.SC_METHOD_NOT_ALLOWED) } else { -// We shouldn't even see this case. -resp.setStatus(HttpServletResponse.SC_INTERNAL_SERVER_ERROR) +val removedWorkers = masterEndpointRef.askSync[Integer]( + DecommissionWorkersOnHosts(hostnames)) +logInfo(s"Decommissioning of hosts $hostnames decommissioned $removedWorkers workers") +
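The diff above is truncated, but the core idea of the change survives: when kill is disabled, the mutating `POST` endpoints are simply never registered, so a disabled server can fail the request fast rather than silently redirecting it. A simplified sketch of that registration guard (Python, not the Scala code):

```python
def attach_handlers(kill_enabled: bool):
    # Illustration of SPARK-46899: the POST routes only exist when kill
    # is enabled; with kill disabled, the routing layer itself can answer
    # 405 Method Not Allowed instead of ignoring the request.
    handlers = ["GET /"]
    if kill_enabled:
        handlers += ["POST /app/kill", "POST /driver/kill", "POST /workers/kill"]
    return handlers
```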
(spark) branch master updated: [SPARK-46897][PYTHON][DOCS] Refine docstring of `bit_and/bit_or/bit_xor`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 5056a17919ac [SPARK-46897][PYTHON][DOCS] Refine docstring of `bit_and/bit_or/bit_xor` 5056a17919ac is described below commit 5056a17919ac88d35475dd13ae4167e783f9504a Author: yangjie01 AuthorDate: Sun Jan 28 21:33:39 2024 -0800 [SPARK-46897][PYTHON][DOCS] Refine docstring of `bit_and/bit_or/bit_xor` ### What changes were proposed in this pull request? This pr refine docstring of `bit_and/bit_or/bit_xor` and add some new examples. ### Why are the changes needed? To improve PySpark documentation ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass Github Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #44923 from LuciferYang/SPARK-46897. Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- python/pyspark/sql/functions/builtin.py | 138 ++-- 1 file changed, 132 insertions(+), 6 deletions(-) diff --git a/python/pyspark/sql/functions/builtin.py b/python/pyspark/sql/functions/builtin.py index d3a94fe4b9e9..0932ac1c2843 100644 --- a/python/pyspark/sql/functions/builtin.py +++ b/python/pyspark/sql/functions/builtin.py @@ -3790,9 +3790,51 @@ def bit_and(col: "ColumnOrName") -> Column: Examples +Example 1: Bitwise AND with all non-null values + +>>> from pyspark.sql import functions as sf >>> df = spark.createDataFrame([[1],[1],[2]], ["c"]) ->>> df.select(bit_and("c")).first() -Row(bit_and(c)=0) +>>> df.select(sf.bit_and("c")).show() ++--+ +|bit_and(c)| ++--+ +| 0| ++--+ + +Example 2: Bitwise AND with null values + +>>> from pyspark.sql import functions as sf +>>> df = spark.createDataFrame([[1],[None],[2]], ["c"]) +>>> df.select(sf.bit_and("c")).show() ++--+ +|bit_and(c)| ++--+ +| 0| ++--+ + +Example 3: Bitwise AND with all null values + +>>> from 
pyspark.sql import functions as sf +>>> from pyspark.sql.types import IntegerType, StructType, StructField +>>> schema = StructType([StructField("c", IntegerType(), True)]) +>>> df = spark.createDataFrame([[None],[None],[None]], schema=schema) +>>> df.select(sf.bit_and("c")).show() ++--+ +|bit_and(c)| ++--+ +| NULL| ++--+ + +Example 4: Bitwise AND with single input value + +>>> from pyspark.sql import functions as sf +>>> df = spark.createDataFrame([[5]], ["c"]) +>>> df.select(sf.bit_and("c")).show() ++--+ +|bit_and(c)| ++--+ +| 5| ++--+ """ return _invoke_function_over_columns("bit_and", col) @@ -3816,9 +3858,51 @@ def bit_or(col: "ColumnOrName") -> Column: Examples +Example 1: Bitwise OR with all non-null values + +>>> from pyspark.sql import functions as sf >>> df = spark.createDataFrame([[1],[1],[2]], ["c"]) ->>> df.select(bit_or("c")).first() -Row(bit_or(c)=3) +>>> df.select(sf.bit_or("c")).show() ++-+ +|bit_or(c)| ++-+ +|3| ++-+ + +Example 2: Bitwise OR with some null values + +>>> from pyspark.sql import functions as sf +>>> df = spark.createDataFrame([[1],[None],[2]], ["c"]) +>>> df.select(sf.bit_or("c")).show() ++-+ +|bit_or(c)| ++-+ +|3| ++-+ + +Example 3: Bitwise OR with all null values + +>>> from pyspark.sql import functions as sf +>>> from pyspark.sql.types import IntegerType, StructType, StructField +>>> schema = StructType([StructField("c", IntegerType(), True)]) +>>> df = spark.createDataFrame([[None],[None],[None]], schema=schema) +>>> df.select(sf.bit_or("c")).show() ++-+ +|bit_or(c)| ++-+ +| NULL| ++-+ + +Example 4: Bitwise OR with single input value + +>>> from pyspark.sql import functions as sf +>>> df = spark.createDataFrame([[5]], ["c"]) +>>> df.select(sf.bit_or("c")).show() ++-+ +|bit_or(c)| ++-+
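The refined docstrings above illustrate how `bit_and`/`bit_or` handle NULLs: NULL inputs are skipped, and the result is NULL only when every input is NULL. As a plain-Python sketch of those aggregate semantics (this is not PySpark's implementation — the real functions delegate to Spark SQL on the JVM — just a model of the behavior the new examples document):

```python
from functools import reduce
import operator

def bit_and(values):
    """Bitwise-AND aggregate with SQL-style NULL handling: None inputs
    are skipped; the result is None only when every input is None."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None
    return reduce(operator.and_, non_null)

def bit_or(values):
    """Bitwise-OR aggregate with the same NULL handling as bit_and."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None
    return reduce(operator.or_, non_null)

# Mirrors Examples 1-4 from the refined docstrings:
print(bit_and([1, 1, 2]))          # 0  (1 & 1 & 2)
print(bit_and([1, None, 2]))       # 0  (None is skipped)
print(bit_and([None, None, None])) # None (all inputs NULL)
print(bit_or([1, None, 2]))        # 3  (1 | 2)
```

The single-value case behaves the same way: `bit_and([5])` returns `5`, matching Example 4 in the docstring.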
(spark) branch master updated (f078998df2f3 -> bb2195554e6d)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from f078998df2f3 [MINOR][DOCS] Miscellaneous documentation improvements add bb2195554e6d [SPARK-46874][PYTHON] Remove `pyspark.pandas` dependency from `assertDataFrameEqual` No new revisions were added by this update. Summary of changes: python/pyspark/testing/utils.py | 15 --- 1 file changed, 12 insertions(+), 3 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46892][BUILD] Upgrade dropwizard metrics 4.2.25
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d74aecd11dcd [SPARK-46892][BUILD] Upgrade dropwizard metrics 4.2.25 d74aecd11dcd is described below commit d74aecd11dcd1c8414b662457e49b6001395bb8d Author: panbingkun AuthorDate: Sun Jan 28 12:12:02 2024 -0800 [SPARK-46892][BUILD] Upgrade dropwizard metrics 4.2.25 ### What changes were proposed in this pull request? This PR aims to upgrade dropwizard metrics from `4.2.21` to `4.2.25`. ### Why are the changes needed? The last update occurred 3 months ago. - The new version brings some bug fixes: Fix IndexOutOfBoundsException in Jetty 9, 10, 11, 12 InstrumentedHandler https://github.com/dropwizard/metrics/pull/3912 - The full version release notes: https://github.com/dropwizard/metrics/releases/tag/v4.2.25 https://github.com/dropwizard/metrics/releases/tag/v4.2.24 https://github.com/dropwizard/metrics/releases/tag/v4.2.23 https://github.com/dropwizard/metrics/releases/tag/v4.2.22 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44918 from panbingkun/SPARK-46892.
Authored-by: panbingkun Signed-off-by: Dongjoon Hyun --- dev/deps/spark-deps-hadoop-3-hive-2.3 | 10 +- pom.xml | 2 +- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index 71f9ac8665b0..09291de50350 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -185,11 +185,11 @@ log4j-core/2.22.1//log4j-core-2.22.1.jar log4j-slf4j2-impl/2.22.1//log4j-slf4j2-impl-2.22.1.jar logging-interceptor/3.12.12//logging-interceptor-3.12.12.jar lz4-java/1.8.0//lz4-java-1.8.0.jar -metrics-core/4.2.21//metrics-core-4.2.21.jar -metrics-graphite/4.2.21//metrics-graphite-4.2.21.jar -metrics-jmx/4.2.21//metrics-jmx-4.2.21.jar -metrics-json/4.2.21//metrics-json-4.2.21.jar -metrics-jvm/4.2.21//metrics-jvm-4.2.21.jar +metrics-core/4.2.25//metrics-core-4.2.25.jar +metrics-graphite/4.2.25//metrics-graphite-4.2.25.jar +metrics-jmx/4.2.25//metrics-jmx-4.2.25.jar +metrics-json/4.2.25//metrics-json-4.2.25.jar +metrics-jvm/4.2.25//metrics-jvm-4.2.25.jar minlog/1.3.0//minlog-1.3.0.jar netty-all/4.1.106.Final//netty-all-4.1.106.Final.jar netty-buffer/4.1.106.Final//netty-buffer-4.1.106.Final.jar diff --git a/pom.xml b/pom.xml index d4e8a7db71de..a5f2b6f74b7a 100644 --- a/pom.xml +++ b/pom.xml @@ -156,7 +156,7 @@ If you change codahale.metrics.version, you also need to change the link to metrics.dropwizard.io in docs/monitoring.md. --> -4.2.21 +4.2.25 1.11.3 1.12.0 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.4 updated: [SPARK-46888][CORE] Fix `Master` to reject `/workers/kill/` requests if decommission is disabled
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 51b021fdf915 [SPARK-46888][CORE] Fix `Master` to reject `/workers/kill/` requests if decommission is disabled 51b021fdf915 is described below commit 51b021fdf915d4aab62056ee60e4098047bd9841 Author: Dongjoon Hyun AuthorDate: Sat Jan 27 20:24:15 2024 -0800 [SPARK-46888][CORE] Fix `Master` to reject `/workers/kill/` requests if decommission is disabled This PR aims to fix `Master` to reject `/workers/kill/` request if `spark.decommission.enabled` is `false` in order to fix the dangling worker issue. Currently, `spark.decommission.enabled` is `false` by default. So, when a user asks to decommission, only Master marked it `DECOMMISSIONED` while the worker is alive. ``` $ curl -XPOST http://localhost:8080/workers/kill/\?host\=127.0.0.1 ``` **Master UI** ![Screenshot 2024-01-27 at 6 19 18 PM](https://github.com/apache/spark/assets/9700541/443bfc32-b924-438a-8bf6-c64b9afbc4be) **Worker Log** ``` 24/01/27 18:18:06 WARN Worker: Receive decommission request, but decommission feature is disabled. ``` To be consistent with the existing `Worker` behavior which ignores the request. https://github.com/apache/spark/blob/1787a5261e87e0214a3f803f6534c5e52a0138e6/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L859-L868 No, this is a bug fix. Pass the CI with the newly added test case. No. Closes #44915 from dongjoon-hyun/SPARK-46888. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit 20b593811dc02c96c71978851e051d32bf8c3496) Signed-off-by: Dongjoon Hyun --- .../apache/spark/deploy/master/ui/MasterWebUI.scala | 4 +++- .../apache/spark/deploy/master/MasterSuite.scala| 21 + .../spark/deploy/master/ui/MasterWebUISuite.scala | 3 ++- 3 files changed, 26 insertions(+), 2 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala index af94bd6d9e0f..53e5c5ac2a8f 100644 --- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala +++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala @@ -23,6 +23,7 @@ import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse} import org.apache.spark.deploy.DeployMessages.{DecommissionWorkersOnHosts, MasterStateResponse, RequestMasterState} import org.apache.spark.deploy.master.Master import org.apache.spark.internal.Logging +import org.apache.spark.internal.config.DECOMMISSION_ENABLED import org.apache.spark.internal.config.UI.MASTER_UI_DECOMMISSION_ALLOW_MODE import org.apache.spark.internal.config.UI.UI_KILL_ENABLED import org.apache.spark.ui.{SparkUI, WebUI} @@ -40,6 +41,7 @@ class MasterWebUI( val masterEndpointRef = master.self val killEnabled = master.conf.get(UI_KILL_ENABLED) + val decommissionDisabled = !master.conf.get(DECOMMISSION_ENABLED) val decommissionAllowMode = master.conf.get(MASTER_UI_DECOMMISSION_ALLOW_MODE) initialize() @@ -58,7 +60,7 @@ class MasterWebUI( override def doPost(req: HttpServletRequest, resp: HttpServletResponse): Unit = { val hostnames: Seq[String] = Option(req.getParameterValues("host")) .getOrElse(Array[String]()).toSeq -if (!isDecommissioningRequestAllowed(req)) { +if (decommissionDisabled || !isDecommissioningRequestAllowed(req)) { resp.sendError(HttpServletResponse.SC_METHOD_NOT_ALLOWED) } else { val removedWorkers = 
masterEndpointRef.askSync[Integer]( diff --git a/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala b/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala index 1cec863b1e7f..37874de98766 100644 --- a/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala @@ -17,6 +17,7 @@ package org.apache.spark.deploy.master +import java.net.{HttpURLConnection, URL} import java.util.Date import java.util.concurrent.{ConcurrentLinkedQueue, CountDownLatch, TimeUnit} import java.util.concurrent.atomic.AtomicInteger @@ -325,6 +326,26 @@ class MasterSuite extends SparkFunSuite } } + test("SPARK-46888: master should reject worker kill request if decommision is disabled") { +implicit val formats = org.json4s.DefaultFormats +val conf = new SparkConf() + .set(DECOMMISSION_ENABLED, false) + .set(MASTER_UI_DECOMMISSION_ALLOW_MODE, "ALLOW") +val localCluster = LocalSparkCluster(1, 1, 512, conf) +localCluster.s
(spark) branch branch-3.5 updated: [SPARK-46888][CORE] Fix `Master` to reject `/workers/kill/` requests if decommission is disabled
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new accfb39e4ddf [SPARK-46888][CORE] Fix `Master` to reject `/workers/kill/` requests if decommission is disabled accfb39e4ddf is described below commit accfb39e4ddf7f7b54396bd0e35256a04461c693 Author: Dongjoon Hyun AuthorDate: Sat Jan 27 20:24:15 2024 -0800 [SPARK-46888][CORE] Fix `Master` to reject `/workers/kill/` requests if decommission is disabled This PR aims to fix `Master` to reject `/workers/kill/` request if `spark.decommission.enabled` is `false` in order to fix the dangling worker issue. Currently, `spark.decommission.enabled` is `false` by default. So, when a user asks to decommission, only Master marked it `DECOMMISSIONED` while the worker is alive. ``` $ curl -XPOST http://localhost:8080/workers/kill/\?host\=127.0.0.1 ``` **Master UI** ![Screenshot 2024-01-27 at 6 19 18 PM](https://github.com/apache/spark/assets/9700541/443bfc32-b924-438a-8bf6-c64b9afbc4be) **Worker Log** ``` 24/01/27 18:18:06 WARN Worker: Receive decommission request, but decommission feature is disabled. ``` To be consistent with the existing `Worker` behavior which ignores the request. https://github.com/apache/spark/blob/1787a5261e87e0214a3f803f6534c5e52a0138e6/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L859-L868 No, this is a bug fix. Pass the CI with the newly added test case. No. Closes #44915 from dongjoon-hyun/SPARK-46888. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit 20b593811dc02c96c71978851e051d32bf8c3496) Signed-off-by: Dongjoon Hyun --- .../apache/spark/deploy/master/ui/MasterWebUI.scala | 4 +++- .../apache/spark/deploy/master/MasterSuite.scala| 21 + .../spark/deploy/master/ui/MasterWebUISuite.scala | 3 ++- 3 files changed, 26 insertions(+), 2 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala index af94bd6d9e0f..53e5c5ac2a8f 100644 --- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala +++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala @@ -23,6 +23,7 @@ import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse} import org.apache.spark.deploy.DeployMessages.{DecommissionWorkersOnHosts, MasterStateResponse, RequestMasterState} import org.apache.spark.deploy.master.Master import org.apache.spark.internal.Logging +import org.apache.spark.internal.config.DECOMMISSION_ENABLED import org.apache.spark.internal.config.UI.MASTER_UI_DECOMMISSION_ALLOW_MODE import org.apache.spark.internal.config.UI.UI_KILL_ENABLED import org.apache.spark.ui.{SparkUI, WebUI} @@ -40,6 +41,7 @@ class MasterWebUI( val masterEndpointRef = master.self val killEnabled = master.conf.get(UI_KILL_ENABLED) + val decommissionDisabled = !master.conf.get(DECOMMISSION_ENABLED) val decommissionAllowMode = master.conf.get(MASTER_UI_DECOMMISSION_ALLOW_MODE) initialize() @@ -58,7 +60,7 @@ class MasterWebUI( override def doPost(req: HttpServletRequest, resp: HttpServletResponse): Unit = { val hostnames: Seq[String] = Option(req.getParameterValues("host")) .getOrElse(Array[String]()).toSeq -if (!isDecommissioningRequestAllowed(req)) { +if (decommissionDisabled || !isDecommissioningRequestAllowed(req)) { resp.sendError(HttpServletResponse.SC_METHOD_NOT_ALLOWED) } else { val removedWorkers = 
masterEndpointRef.askSync[Integer]( diff --git a/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala b/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala index 1cec863b1e7f..37874de98766 100644 --- a/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala @@ -17,6 +17,7 @@ package org.apache.spark.deploy.master +import java.net.{HttpURLConnection, URL} import java.util.Date import java.util.concurrent.{ConcurrentLinkedQueue, CountDownLatch, TimeUnit} import java.util.concurrent.atomic.AtomicInteger @@ -325,6 +326,26 @@ class MasterSuite extends SparkFunSuite } } + test("SPARK-46888: master should reject worker kill request if decommision is disabled") { +implicit val formats = org.json4s.DefaultFormats +val conf = new SparkConf() + .set(DECOMMISSION_ENABLED, false) + .set(MASTER_UI_DECOMMISSION_ALLOW_MODE, "ALLOW") +val localCluster = LocalSparkCluster(1, 1, 512, conf) +localCluster.s
(spark) branch master updated: [SPARK-46888][CORE] Fix `Master` to reject `/workers/kill/` requests if decommission is disabled
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 20b593811dc0 [SPARK-46888][CORE] Fix `Master` to reject `/workers/kill/` requests if decommission is disabled 20b593811dc0 is described below commit 20b593811dc02c96c71978851e051d32bf8c3496 Author: Dongjoon Hyun AuthorDate: Sat Jan 27 20:24:15 2024 -0800 [SPARK-46888][CORE] Fix `Master` to reject `/workers/kill/` requests if decommission is disabled ### What changes were proposed in this pull request? This PR aims to fix `Master` to reject `/workers/kill/` request if `spark.decommission.enabled` is `false` in order to fix the dangling worker issue. ### Why are the changes needed? Currently, `spark.decommission.enabled` is `false` by default. So, when a user asks to decommission, only Master marked it `DECOMMISSIONED` while the worker is alive. ``` $ curl -XPOST http://localhost:8080/workers/kill/\?host\=127.0.0.1 ``` **Master UI** ![Screenshot 2024-01-27 at 6 19 18 PM](https://github.com/apache/spark/assets/9700541/443bfc32-b924-438a-8bf6-c64b9afbc4be) **Worker Log** ``` 24/01/27 18:18:06 WARN Worker: Receive decommission request, but decommission feature is disabled. ``` To be consistent with the existing `Worker` behavior which ignores the request. https://github.com/apache/spark/blob/1787a5261e87e0214a3f803f6534c5e52a0138e6/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L859-L868 ### Does this PR introduce _any_ user-facing change? No, this is a bug fix. ### How was this patch tested? Pass the CI with the newly added test case. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44915 from dongjoon-hyun/SPARK-46888. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../apache/spark/deploy/master/ui/MasterWebUI.scala | 4 +++- .../apache/spark/deploy/master/MasterSuite.scala| 21 + .../spark/deploy/master/ui/MasterWebUISuite.scala | 3 ++- 3 files changed, 26 insertions(+), 2 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala index d71ef8b9e36e..3025c0bf468b 100644 --- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala +++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala @@ -23,6 +23,7 @@ import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse} import org.apache.spark.deploy.DeployMessages.{DecommissionWorkersOnHosts, MasterStateResponse, RequestMasterState} import org.apache.spark.deploy.master.Master import org.apache.spark.internal.Logging +import org.apache.spark.internal.config.DECOMMISSION_ENABLED import org.apache.spark.internal.config.UI.MASTER_UI_DECOMMISSION_ALLOW_MODE import org.apache.spark.internal.config.UI.UI_KILL_ENABLED import org.apache.spark.ui.{SparkUI, WebUI} @@ -41,6 +42,7 @@ class MasterWebUI( val masterEndpointRef = master.self val killEnabled = master.conf.get(UI_KILL_ENABLED) + val decommissionDisabled = !master.conf.get(DECOMMISSION_ENABLED) val decommissionAllowMode = master.conf.get(MASTER_UI_DECOMMISSION_ALLOW_MODE) initialize() @@ -60,7 +62,7 @@ class MasterWebUI( override def doPost(req: HttpServletRequest, resp: HttpServletResponse): Unit = { val hostnames: Seq[String] = Option(req.getParameterValues("host")) .getOrElse(Array[String]()).toImmutableArraySeq -if (!isDecommissioningRequestAllowed(req)) { +if (decommissionDisabled || !isDecommissioningRequestAllowed(req)) { resp.sendError(HttpServletResponse.SC_METHOD_NOT_ALLOWED) } else { val removedWorkers = masterEndpointRef.askSync[Integer]( diff --git 
a/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala b/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala index 6966a7f660b2..0db58ae0c834 100644 --- a/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala @@ -17,6 +17,7 @@ package org.apache.spark.deploy.master +import java.net.{HttpURLConnection, URL} import java.util.Date import java.util.concurrent.{ConcurrentLinkedQueue, CountDownLatch, TimeUnit} import java.util.concurrent.atomic.AtomicInteger @@ -444,6 +445,26 @@ class MasterSuite extends SparkFunSuite } } + test("SPARK-46888: master should reject worker kill request if decommision is disabled") { +implicit val formats = org.json4s.DefaultFormats +val conf = new Spa
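The one-line change to `MasterWebUI.doPost` above adds a guard so that a `POST /workers/kill/` request is rejected with HTTP 405 whenever decommissioning is disabled, instead of only when the allow-mode check fails. A minimal Python sketch of that decision logic (function and parameter names here are illustrative, not Spark's Scala API):

```python
METHOD_NOT_ALLOWED = 405  # HttpServletResponse.SC_METHOD_NOT_ALLOWED
OK = 200

def handle_worker_kill(decommission_enabled, request_allowed):
    """Sketch of the patched guard: reject when decommissioning is
    disabled OR when the decommission allow-mode check fails; only
    then ask the Master to decommission the named worker hosts."""
    if (not decommission_enabled) or (not request_allowed):
        return METHOD_NOT_ALLOWED
    return OK  # proceed with DecommissionWorkersOnHosts

# Before the fix the first condition was missing, so a Master running
# with spark.decommission.enabled=false would mark the worker
# DECOMMISSIONED while the Worker itself ignored the request -- the
# dangling-worker bug this commit fixes.
print(handle_worker_kill(False, True))  # 405
print(handle_worker_kill(True, True))   # 200
```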
(spark) branch master updated: [SPARK-46883][CORE][FOLLOWUP] Fix `clusterutilization` API to handle 0 worker case
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a96f399d5094 [SPARK-46883][CORE][FOLLOWUP] Fix `clusterutilization` API to handle 0 worker case a96f399d5094 is described below commit a96f399d5094a3473ffc0e55390105d013a3d22f Author: Dongjoon Hyun AuthorDate: Sat Jan 27 19:02:37 2024 -0800 [SPARK-46883][CORE][FOLLOWUP] Fix `clusterutilization` API to handle 0 worker case ### What changes were proposed in this pull request? This PR is a follow-up of #44908 to fix `clusterutilization` API to handle 0 worker case. ### Why are the changes needed? To fix `ArithmeticException` ``` $ curl http://localhost:8080/json/clusterutilization Error 500 java.lang.ArithmeticException: / by zero HTTP ERROR 500 java.lang.ArithmeticException: / by zero URI:/json/clusterutilization ``` ### Does this PR introduce _any_ user-facing change? No, this feature and bug is not released yet. ### How was this patch tested? Pass the CIs with the newly added test case. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44914 from dongjoon-hyun/SPARK-46883-2. 
Lead-authored-by: Dongjoon Hyun Co-authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/deploy/JsonProtocol.scala | 4 ++-- .../apache/spark/deploy/JsonProtocolSuite.scala| 24 +- 2 files changed, 25 insertions(+), 3 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/JsonProtocol.scala b/core/src/main/scala/org/apache/spark/deploy/JsonProtocol.scala index 9c73e84f4166..04302c77a398 100644 --- a/core/src/main/scala/org/apache/spark/deploy/JsonProtocol.scala +++ b/core/src/main/scala/org/apache/spark/deploy/JsonProtocol.scala @@ -312,9 +312,9 @@ private[deploy] object JsonProtocol { ("waitingDrivers" -> obj.activeDrivers.count(_.state == DriverState.SUBMITTED)) ~ ("cores" -> cores) ~ ("coresused" -> coresUsed) ~ -("coresutilization" -> 100 * coresUsed / cores) ~ +("coresutilization" -> (if (cores == 0) 100 else 100 * coresUsed / cores)) ~ ("memory" -> memory) ~ ("memoryused" -> memoryUsed) ~ -("memoryutilization" -> 100 * memoryUsed / memory) +("memoryutilization" -> (if (memory == 0) 100 else 100 * memoryUsed / memory)) } } diff --git a/core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala b/core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala index 6fca31234ee2..518a8c8b3d05 100644 --- a/core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala @@ -25,7 +25,7 @@ import org.json4s.jackson.JsonMethods import org.apache.spark.{JsonTestUtils, SparkFunSuite} import org.apache.spark.deploy.DeployMessages.{MasterStateResponse, WorkerStateResponse} -import org.apache.spark.deploy.master.{ApplicationInfo, RecoveryState} +import org.apache.spark.deploy.master.{ApplicationInfo, RecoveryState, WorkerInfo} import org.apache.spark.deploy.worker.ExecutorRunner class JsonProtocolSuite extends SparkFunSuite with JsonTestUtils { @@ -119,6 +119,21 @@ class JsonProtocolSuite extends SparkFunSuite with JsonTestUtils { 
assertValidDataInJson(output, JsonMethods.parse(JsonConstants.clusterUtilizationJsonStr)) } + test("SPARK-46883: writeClusterUtilization without workers") { +val workers = Array.empty[WorkerInfo] +val activeApps = Array(createAppInfo()) +val completedApps = Array.empty[ApplicationInfo] +val activeDrivers = Array(createDriverInfo()) +val completedDrivers = Array(createDriverInfo()) +val stateResponse = new MasterStateResponse( + "host", 8080, None, workers, activeApps, completedApps, + activeDrivers, completedDrivers, RecoveryState.ALIVE) +val output = JsonProtocol.writeClusterUtilization(stateResponse) +assertValidJson(output) +assertValidDataInJson(output, + JsonMethods.parse(JsonConstants.clusterUtilizationWithoutWorkersJsonStr)) + } + def assertValidJson(json: JValue): Unit = { try { JsonMethods.parse(JsonMethods.compact(json)) @@ -227,4 +242,11 @@ object JsonConstants { |"cores":8,"coresused":0,"coresutilization":0, |"memory":2468,"memoryused":0,"memoryutilization":0} """.stripMargin + + val clusterUtilizationWithoutWorkersJsonStr = +""" + |{"waitingDrivers":1, + |"cores":0,"
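The follow-up above guards the two percentage computations in `JsonProtocol.writeClusterUtilization` so that a cluster with zero workers (hence `cores == 0` and `memory == 0`) reports 100% utilization instead of throwing `ArithmeticException`. A Python stand-in for the patched Scala arithmetic (integer division matches Scala's `/` on `Int`):

```python
def cluster_utilization(cores, cores_used, memory, memory_used, waiting_drivers=0):
    """Mirror of the patched writeClusterUtilization fields: when there
    are no workers, utilization is defined as 100 rather than dividing
    by zero."""
    return {
        "waitingDrivers": waiting_drivers,
        "cores": cores,
        "coresused": cores_used,
        "coresutilization": 100 if cores == 0 else 100 * cores_used // cores,
        "memory": memory,
        "memoryused": memory_used,
        "memoryutilization": 100 if memory == 0 else 100 * memory_used // memory,
    }

# Zero-worker case that previously raised "/ by zero":
print(cluster_utilization(0, 0, 0, 0, waiting_drivers=1))
# Values matching the suite's clusterUtilizationJsonStr constant:
print(cluster_utilization(8, 0, 2468, 0))
```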
(spark) branch master updated: [SPARK-46887][DOCS] Document a few missed `spark.ui.*` configs to `Configuration` page
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1787a5261e87 [SPARK-46887][DOCS] Document a few missed `spark.ui.*` configs to `Configuration` page 1787a5261e87 is described below commit 1787a5261e87e0214a3f803f6534c5e52a0138e6 Author: Dongjoon Hyun AuthorDate: Sat Jan 27 16:48:26 2024 -0800 [SPARK-46887][DOCS] Document a few missed `spark.ui.*` configs to `Configuration` page ### What changes were proposed in this pull request? This PR aims to document a few missed `spark.ui.*` configurations for Apache Spark 4. This PR focuses only public configurations and excludes `internal` configuration like `spark.ui.jettyStopTimeout`. ### Why are the changes needed? To improve documentations. After this PR, I verified the following configurations are documented at least once in `Configuration` or `Security` page. ``` $ git grep 'ConfigBuilder("spark.ui.' 
core/src/main/scala/org/apache/spark/internal/config/Status.scala: val LIVE_ENTITY_UPDATE_PERIOD = ConfigBuilder("spark.ui.liveUpdate.period") core/src/main/scala/org/apache/spark/internal/config/Status.scala: val LIVE_ENTITY_UPDATE_MIN_FLUSH_PERIOD = ConfigBuilder("spark.ui.liveUpdate.minFlushPeriod") core/src/main/scala/org/apache/spark/internal/config/Status.scala: val MAX_RETAINED_JOBS = ConfigBuilder("spark.ui.retainedJobs") core/src/main/scala/org/apache/spark/internal/config/Status.scala: val MAX_RETAINED_STAGES = ConfigBuilder("spark.ui.retainedStages") core/src/main/scala/org/apache/spark/internal/config/Status.scala: val MAX_RETAINED_TASKS_PER_STAGE = ConfigBuilder("spark.ui.retainedTasks") core/src/main/scala/org/apache/spark/internal/config/Status.scala: val MAX_RETAINED_DEAD_EXECUTORS = ConfigBuilder("spark.ui.retainedDeadExecutors") core/src/main/scala/org/apache/spark/internal/config/Status.scala: val MAX_RETAINED_ROOT_NODES = ConfigBuilder("spark.ui.dagGraph.retainedRootRDDs") core/src/main/scala/org/apache/spark/internal/config/Status.scala: val LIVE_UI_LOCAL_STORE_DIR = ConfigBuilder("spark.ui.store.path") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_SHOW_CONSOLE_PROGRESS = ConfigBuilder("spark.ui.showConsoleProgress") core/src/main/scala/org/apache/spark/internal/config/UI.scala: ConfigBuilder("spark.ui.consoleProgress.update.interval") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_ENABLED = ConfigBuilder("spark.ui.enabled") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_PORT = ConfigBuilder("spark.ui.port") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_FILTERS = ConfigBuilder("spark.ui.filters") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_ALLOW_FRAMING_FROM = ConfigBuilder("spark.ui.allowFramingFrom") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_REVERSE_PROXY = ConfigBuilder("spark.ui.reverseProxy") 
core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_REVERSE_PROXY_URL = ConfigBuilder("spark.ui.reverseProxyUrl") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_KILL_ENABLED = ConfigBuilder("spark.ui.killEnabled") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_THREAD_DUMPS_ENABLED = ConfigBuilder("spark.ui.threadDumpsEnabled") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_FLAMEGRAPH_ENABLED = ConfigBuilder("spark.ui.threadDump.flamegraphEnabled") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_HEAP_HISTOGRAM_ENABLED = ConfigBuilder("spark.ui.heapHistogramEnabled") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_PROMETHEUS_ENABLED = ConfigBuilder("spark.ui.prometheus.enabled") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_X_XSS_PROTECTION = ConfigBuilder("spark.ui.xXssProtection") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_X_CONTENT_TYPE_OPTIONS = ConfigBuilder("spark.ui.xContentTypeOptions.enabled") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_STRICT_TRANSPORT_SECURITY = ConfigBuilder("spark.ui.strictTransportSecurity") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val UI_REQUEST_HEADER_SIZE = ConfigBuilder("spark.ui.requestHeaderSize") core/src/main/scala/org/apache/spark/internal/config/UI.scala: val
(spark) branch master updated: [SPARK-46886][CORE] Enable `spark.ui.prometheus.enabled` by default
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c8d116bcfde9 [SPARK-46886][CORE] Enable `spark.ui.prometheus.enabled` by default c8d116bcfde9 is described below commit c8d116bcfde938a9e4b33ace1e8257c2798000f4 Author: Dongjoon Hyun AuthorDate: Sat Jan 27 15:15:26 2024 -0800 [SPARK-46886][CORE] Enable `spark.ui.prometheus.enabled` by default ### What changes were proposed in this pull request? `spark.ui.prometheus.enabled` has been used since Apache Spark 3.0.0. - https://github.com/apache/spark/pull/25770 This PR aims to enable `spark.ui.prometheus.enabled` by default like the Driver `JSON` API in Apache Spark 4.0.0.

| | JSON End Point | Prometheus End Point |
| --- | --- | --- |
| Driver | /api/v1/applications/{id}/executors/ | /metrics/executors/prometheus/ |

### Why are the changes needed? **BEFORE** ``` $ bin/spark-shell $ curl -s http://localhost:4040/metrics/executors/prometheus | wc -l 0 ``` **AFTER** ``` $ bin/spark-shell $ curl -s http://localhost:4040/metrics/executors/prometheus | wc -l 20 ``` ### Does this PR introduce _any_ user-facing change? No, this is only a new endpoint. ### How was this patch tested? Pass the CIs and do manual test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44912 from dongjoon-hyun/SPARK-46886.
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/internal/config/UI.scala | 2 +- .../main/scala/org/apache/spark/status/api/v1/PrometheusResource.scala | 2 +- docs/monitoring.md | 1 - 3 files changed, 2 insertions(+), 3 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/internal/config/UI.scala b/core/src/main/scala/org/apache/spark/internal/config/UI.scala index 320808d5018c..086c83552732 100644 --- a/core/src/main/scala/org/apache/spark/internal/config/UI.scala +++ b/core/src/main/scala/org/apache/spark/internal/config/UI.scala @@ -114,7 +114,7 @@ private[spark] object UI { "For master/worker/driver metrics, you need to configure `conf/metrics.properties`.") .version("3.0.0") .booleanConf -.createWithDefault(false) +.createWithDefault(true) val UI_X_XSS_PROTECTION = ConfigBuilder("spark.ui.xXssProtection") .doc("Value for HTTP X-XSS-Protection response header") diff --git a/core/src/main/scala/org/apache/spark/status/api/v1/PrometheusResource.scala b/core/src/main/scala/org/apache/spark/status/api/v1/PrometheusResource.scala index 8cfed4a4bd39..c4e3bdc64ee3 100644 --- a/core/src/main/scala/org/apache/spark/status/api/v1/PrometheusResource.scala +++ b/core/src/main/scala/org/apache/spark/status/api/v1/PrometheusResource.scala @@ -31,7 +31,7 @@ import org.apache.spark.ui.SparkUI * :: Experimental :: * This aims to expose Executor metrics like REST API which is documented in * - *https://spark.apache.org/docs/3.0.0/monitoring.html#executor-metrics + *https://spark.apache.org/docs/latest/monitoring.html#executor-metrics * * Note that this is based on ExecutorSummary which is different from ExecutorSource. 
*/
diff --git a/docs/monitoring.md b/docs/monitoring.md
index 056543deb094..8d3dbe375b82 100644
--- a/docs/monitoring.md
+++ b/docs/monitoring.md
@@ -821,7 +821,6 @@ A list of the available metrics, with a short description:
Executor-level metrics are sent from each executor to the driver as part of the Heartbeat to describe the performance metrics of Executor itself like JVM heap memory, GC information. Executor metric values and their measured memory peak values per executor are exposed via the REST API in JSON format and in Prometheus format. The JSON end point is exposed at: `/applications/[app-id]/executors`, and the Prometheus endpoint at: `/metrics/executors/prometheus`.
-The Prometheus endpoint is conditional to a configuration parameter: `spark.ui.prometheus.enabled=true` (the default is `false`).
In addition, aggregated per-stage peak values of the executor memory metrics are written to the event log if `spark.eventLog.logStageExecutorMetrics` is true. Executor memory metrics are also exposed via the Spark metrics system based on the [Dropwizard metrics library](https://metrics.dropwizard.io/4.2.0).

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
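The AFTER example above shows the `/metrics/executors/prometheus` endpoint returning 20 lines of Prometheus text-format output. As an illustration of what a consumer of that output has to do, here is a minimal, hypothetical Python parser for the Prometheus exposition format; the sample payload and metric names below are invented for the example, not captured from a real driver.

```python
# Minimal sketch: parse Prometheus text-format lines, such as those served by
# /metrics/executors/prometheus, into (metric_name, labels, value) tuples.
def parse_prometheus(text):
    metrics = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank and HELP/TYPE comment lines
            continue
        name_part, value = line.rsplit(" ", 1)
        if "{" in name_part:
            name, labels = name_part.split("{", 1)
            labels = labels.rstrip("}")
        else:
            name, labels = name_part, ""
        metrics.append((name, labels, float(value)))
    return metrics

# Hypothetical sample payload in the exposition format.
sample = """\
# HELP metrics_executor_rddBlocks
metrics_executor_rddBlocks{application_id="app-1", executor_id="driver"} 0
metrics_executor_memoryUsed_bytes{application_id="app-1", executor_id="driver"} 1024
"""

for name, labels, value in parse_prometheus(sample):
    print(name, value)
```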
(spark) branch master updated: [SPARK-46883][CORE] Support `/json/clusterutilization` API
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new f75c7a7b5240 [SPARK-46883][CORE] Support `/json/clusterutilization` API
f75c7a7b5240 is described below

commit f75c7a7b52402e4c8faa39b2f88623e9f0bca916
Author: Dongjoon Hyun
AuthorDate: Sat Jan 27 09:21:17 2024 -0800

[SPARK-46883][CORE] Support `/json/clusterutilization` API

### What changes were proposed in this pull request?

This PR aims to support a new `/json/clusterutilization` API in the `Master` JSON endpoint.

### Why are the changes needed?

The user can get CPU/Memory/Waiting apps in a single API call.

```
# Start Spark Cluster and Spark Shell
$ sbin/start-master.sh
$ sbin/start-worker.sh spark://$(hostname):7077
$ bin/spark-shell --master spark://$(hostname):7077

# Check `Cluster Utilization API`
$ curl http://localhost:8080/json/clusterutilization
{
  "waitingDrivers" : 0,
  "cores" : 10,
  "coresused" : 10,
  "coresutilization" : 100,
  "memory" : 31744,
  "memoryused" : 1024,
  "memoryutilization" : 3
}
```

### Does this PR introduce _any_ user-facing change?

No. This is a newly added API.

### How was this patch tested?

Pass the CIs with the newly added test case.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44908 from dongjoon-hyun/SPARK-46883.
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/deploy/JsonProtocol.scala | 18 ++ .../apache/spark/deploy/master/ui/MasterPage.scala | 2 ++ .../org/apache/spark/deploy/JsonProtocolSuite.scala | 21 + 3 files changed, 41 insertions(+) diff --git a/core/src/main/scala/org/apache/spark/deploy/JsonProtocol.scala b/core/src/main/scala/org/apache/spark/deploy/JsonProtocol.scala index 8c356081b277..9c73e84f4166 100644 --- a/core/src/main/scala/org/apache/spark/deploy/JsonProtocol.scala +++ b/core/src/main/scala/org/apache/spark/deploy/JsonProtocol.scala @@ -299,4 +299,22 @@ private[deploy] object JsonProtocol { ("executors" -> obj.executors.map(writeExecutorRunner)) ~ ("finishedexecutors" -> obj.finishedExecutors.map(writeExecutorRunner)) } + + /** + * Export the cluster utilization based on the [[MasterStateResponse]] to a Json object. + */ + def writeClusterUtilization(obj: MasterStateResponse): JObject = { +val aliveWorkers = obj.workers.filter(_.isAlive()) +val cores = aliveWorkers.map(_.cores).sum +val coresUsed = aliveWorkers.map(_.coresUsed).sum +val memory = aliveWorkers.map(_.memory).sum +val memoryUsed = aliveWorkers.map(_.memoryUsed).sum +("waitingDrivers" -> obj.activeDrivers.count(_.state == DriverState.SUBMITTED)) ~ +("cores" -> cores) ~ +("coresused" -> coresUsed) ~ +("coresutilization" -> 100 * coresUsed / cores) ~ +("memory" -> memory) ~ +("memoryused" -> memoryUsed) ~ +("memoryutilization" -> 100 * memoryUsed / memory) + } } diff --git a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala index 36a79e060f01..cbeda23013ac 100644 --- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala +++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala @@ -41,6 +41,8 @@ private[ui] class MasterPage(parent: MasterWebUI) extends WebUIPage("") { override def renderJson(request: HttpServletRequest): JValue = { 
jsonFieldPattern.findFirstMatchIn(request.getRequestURI()) match { case None => JsonProtocol.writeMasterState(getMasterState) + case Some(m) if m.group(1) == "clusterutilization" => +JsonProtocol.writeClusterUtilization(getMasterState) case Some(m) => JsonProtocol.writeMasterState(getMasterState, Some(m.group(1))) } } diff --git a/core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala b/core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala index 4a6ace6facde..6fca31234ee2 100644 --- a/core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala @@ -105,6 +105,20 @@ class JsonProtocolSuite extends SparkFunSuite with JsonTestUtils { assertValidDataInJson(output, JsonMethods.parse(JsonConstants.workerStateJsonStr)) } + test("SPARK-46883: writeClusterUtilization") { +val workers = Array(createWorkerInfo(), create
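The `writeClusterUtilization` helper above sums cores and memory over alive workers and derives integer utilization percentages. The same arithmetic can be re-expressed in a few lines of Python; the worker dicts here are hypothetical stand-ins for `WorkerInfo`, and, as in the Scala code, integer division is used and at least one alive worker with non-zero cores and memory is assumed.

```python
# Sketch of the aggregation behind `/json/clusterutilization`.
def cluster_utilization(workers, waiting_drivers=0):
    alive = [w for w in workers if w["alive"]]
    cores = sum(w["cores"] for w in alive)
    cores_used = sum(w["cores_used"] for w in alive)
    memory = sum(w["memory"] for w in alive)
    memory_used = sum(w["memory_used"] for w in alive)
    return {
        "waitingDrivers": waiting_drivers,
        "cores": cores,
        "coresused": cores_used,
        "coresutilization": 100 * cores_used // cores,    # integer percent
        "memory": memory,
        "memoryused": memory_used,
        "memoryutilization": 100 * memory_used // memory,  # integer percent
    }

# Mirrors the example response in the commit message: one alive worker,
# 10/10 cores used and 1024 of 31744 MiB used.
workers = [{"alive": True, "cores": 10, "cores_used": 10,
            "memory": 31744, "memory_used": 1024}]
print(cluster_utilization(workers))
```

Note that, as written (in both the sketch and the Scala), a cluster with no alive workers would divide by zero; the Master presumably always reports this endpoint with at least the registered workers' totals.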
(spark) branch master updated (e014248434ac -> ecdacf8e14c8)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

from e014248434ac [SPARK-46880][PYTHON][CONNECT][TESTS] Improve and test warning for Arrow-optimized Python UDF
 add ecdacf8e14c8 [SPARK-46881][CORE] Support `spark.deploy.workerSelectionPolicy`

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/deploy/master/Master.scala    | 13 +-
 .../org/apache/spark/internal/config/Deploy.scala  | 18 +
 .../apache/spark/deploy/master/MasterSuite.scala   | 47 ++
 3 files changed, 76 insertions(+), 2 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46880][PYTHON][CONNECT][TESTS] Improve and test warning for Arrow-optimized Python UDF
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new e014248434ac [SPARK-46880][PYTHON][CONNECT][TESTS] Improve and test warning for Arrow-optimized Python UDF
e014248434ac is described below

commit e014248434ac241b9681aceff79f900f0c41dd28
Author: Xinrong Meng
AuthorDate: Fri Jan 26 15:43:32 2024 -0800

[SPARK-46880][PYTHON][CONNECT][TESTS] Improve and test warning for Arrow-optimized Python UDF

### What changes were proposed in this pull request?

Improve and test the warning for Arrow-optimized Python UDF.

### Why are the changes needed?

To improve usability and test coverage.

### Does this PR introduce _any_ user-facing change?

Only a user warning changed.

FROM
```
>>> udf(lambda: print("do"), useArrow=True)
UserWarning: Arrow optimization for Python UDFs cannot be enabled.
  warnings.warn(
<function <lambda> at ..>
```

TO
```
>>> udf(lambda: print("do"), useArrow=True)
UserWarning: Arrow optimization for Python UDFs cannot be enabled for functions without arguments.
  warnings.warn(
<function <lambda> at ..>
```

### How was this patch tested?

Unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44905 from xinrong-meng/arr_udf_warn.
Authored-by: Xinrong Meng Signed-off-by: Dongjoon Hyun --- python/pyspark/sql/connect/udf.py | 3 ++- python/pyspark/sql/tests/test_arrow_python_udf.py | 9 + python/pyspark/sql/udf.py | 3 ++- 3 files changed, 13 insertions(+), 2 deletions(-) diff --git a/python/pyspark/sql/connect/udf.py b/python/pyspark/sql/connect/udf.py index 5386398bdca8..1c42f4d74b7a 100644 --- a/python/pyspark/sql/connect/udf.py +++ b/python/pyspark/sql/connect/udf.py @@ -85,7 +85,8 @@ def _create_py_udf( eval_type = PythonEvalType.SQL_ARROW_BATCHED_UDF else: warnings.warn( -"Arrow optimization for Python UDFs cannot be enabled.", +"Arrow optimization for Python UDFs cannot be enabled for functions" +" without arguments.", UserWarning, ) diff --git a/python/pyspark/sql/tests/test_arrow_python_udf.py b/python/pyspark/sql/tests/test_arrow_python_udf.py index c59326edc31a..114fdf602223 100644 --- a/python/pyspark/sql/tests/test_arrow_python_udf.py +++ b/python/pyspark/sql/tests/test_arrow_python_udf.py @@ -188,6 +188,15 @@ class PythonUDFArrowTestsMixin(BaseUDFTestsMixin): }, ) +def test_warn_no_args(self): +with self.assertWarns(UserWarning) as w: +udf(lambda: print("do"), useArrow=True) +self.assertEqual( +str(w.warning), +"Arrow optimization for Python UDFs cannot be enabled for functions" +" without arguments.", +) + class PythonUDFArrowTests(PythonUDFArrowTestsMixin, ReusedSQLTestCase): @classmethod diff --git a/python/pyspark/sql/udf.py b/python/pyspark/sql/udf.py index ca38556431ad..0324bc678667 100644 --- a/python/pyspark/sql/udf.py +++ b/python/pyspark/sql/udf.py @@ -142,7 +142,8 @@ def _create_py_udf( eval_type = PythonEvalType.SQL_ARROW_BATCHED_UDF else: warnings.warn( -"Arrow optimization for Python UDFs cannot be enabled.", +"Arrow optimization for Python UDFs cannot be enabled for functions" +" without arguments.", UserWarning, ) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
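The diff above changes only the warning text; the surrounding logic in `_create_py_udf` falls back to a non-Arrow batched UDF when the function takes no arguments. A self-contained sketch of that branching follows; the eval-type names are simplified strings here, not the real `PythonEvalType` constants, and `choose_eval_type` is an illustrative helper, not a pyspark function.

```python
# Sketch: warn and fall back when Arrow optimization is requested for a
# zero-argument Python UDF, mirroring the branch changed in this commit.
import inspect
import warnings

def choose_eval_type(func, use_arrow):
    num_args = len(inspect.signature(func).parameters)
    if use_arrow and num_args == 0:
        warnings.warn(
            "Arrow optimization for Python UDFs cannot be enabled for functions"
            " without arguments.",
            UserWarning,
        )
        return "SQL_BATCHED_UDF"  # fall back to the non-Arrow path
    return "SQL_ARROW_BATCHED_UDF" if use_arrow else "SQL_BATCHED_UDF"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kind = choose_eval_type(lambda: None, use_arrow=True)
print(kind, len(caught))  # prints: SQL_BATCHED_UDF 1
```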
(spark) branch master updated: [SPARK-46849][SQL] Run optimizer on CREATE TABLE column defaults
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 7ab7509fa124 [SPARK-46849][SQL] Run optimizer on CREATE TABLE column defaults
7ab7509fa124 is described below

commit 7ab7509fa12418ff5f93782670b7e939c055703a
Author: Daniel Tenedorio
AuthorDate: Fri Jan 26 12:28:10 2024 -0800

[SPARK-46849][SQL] Run optimizer on CREATE TABLE column defaults

### What changes were proposed in this pull request?

This PR updates Catalyst to run the optimizer over `CREATE TABLE` column default expressions.

### Why are the changes needed?

This helps speed up future commands that assign default values within the table.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The functionality is covered by existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44876 from dtenedor/analyze-column-defaults.
Authored-by: Daniel Tenedorio Signed-off-by: Dongjoon Hyun --- .../spark/sql/catalyst/parser/AstBuilder.scala | 19 +++- .../sql/catalyst/plans/logical/v2Commands.scala| 18 ++- .../catalyst/util/ResolveDefaultColumnsUtil.scala | 15 + .../sql/connector/catalog/CatalogV2Util.scala | 26 ++ .../spark/sql/catalyst/parser/DDLParserSuite.scala | 2 +- .../catalyst/analysis/ResolveSessionCatalog.scala | 2 +- .../datasources/v2/DataSourceV2Strategy.scala | 16 +++-- 7 files changed, 83 insertions(+), 15 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala index 54c4343e7ff9..d147d22e4b13 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala @@ -20,6 +20,7 @@ package org.apache.spark.sql.catalyst.parser import java.util.Locale import java.util.concurrent.TimeUnit +import scala.collection.mutable import scala.collection.mutable.{ArrayBuffer, Set} import scala.jdk.CollectionConverters._ import scala.util.{Left, Right} @@ -3997,6 +3998,22 @@ class AstBuilder extends DataTypeAstBuilder with SQLConfHelper with Logging { val tableSpec = UnresolvedTableSpec(properties, provider, options, location, comment, serdeInfo, external) +// Parse column defaults from the table into separate expressions in the CREATE TABLE operator. 
+val specifiedDefaults: mutable.Map[Int, Expression] = mutable.Map.empty +Option(ctx.createOrReplaceTableColTypeList()).foreach { + _.createOrReplaceTableColType().asScala.zipWithIndex.foreach { case (typeContext, index) => +typeContext.colDefinitionOption().asScala.foreach { option => + Option(option.defaultExpression()).foreach { defaultExprContext => +specifiedDefaults.update(index, expression(defaultExprContext.expression())) + } +} + } +} +val defaultValueExpressions: Seq[Option[Expression]] = + (0 until columns.size).map { index: Int => +specifiedDefaults.get(index) + } + Option(ctx.query).map(plan) match { case Some(_) if columns.nonEmpty => operationNotAllowed( @@ -4018,7 +4035,7 @@ class AstBuilder extends DataTypeAstBuilder with SQLConfHelper with Logging { // with data type. val schema = StructType(columns ++ partCols) CreateTable(withIdentClause(identifierContext, UnresolvedIdentifier(_)), - schema, partitioning, tableSpec, ignoreIfExists = ifNotExists) + schema, partitioning, tableSpec, ignoreIfExists = ifNotExists, defaultValueExpressions) } } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala index b17926818900..30be30cb2e04 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala @@ -456,13 +456,16 @@ trait V2CreateTablePlan extends LogicalPlan { /** * Create a new table with a v2 catalog. + * The [[defaultValueExpressions]] hold optional default value expressions to use when creating the + * table, mapping 1:1 with the fields in [[tableSchema]]. */ case class CreateTable( name: LogicalPlan, tableSchema: StructType, partitioning: Seq[Transform], tableSpec: TableSpecBase, -ignoreIfExists: Boolean) +ignoreIfExists: Boolean, +defaultV
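The parser change above collects explicitly specified defaults into a sparse index-to-expression map (`specifiedDefaults`) and then expands it into one `Option[Expression]` per column. The same expansion in a few lines of Python, with `None` playing the role of Scala's `Option.empty`:

```python
# Sketch of the index expansion in AstBuilder: defaults given for some columns
# become a dense per-column list aligned 1:1 with the table schema.
def expand_defaults(num_columns, specified_defaults):
    # specified_defaults: dict mapping column index -> default expression
    return [specified_defaults.get(i) for i in range(num_columns)]

# e.g. CREATE TABLE t(a INT, b INT DEFAULT 42, c STRING)
print(expand_defaults(3, {1: "42"}))  # [None, '42', None]
```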
(spark) branch master updated: [SPARK-46871][PS][TESTS] Clean up the imports in `pyspark.pandas.tests.computation.*`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new f9f413e5ff6a [SPARK-46871][PS][TESTS] Clean up the imports in `pyspark.pandas.tests.computation.*`
f9f413e5ff6a is described below

commit f9f413e5ff6abe00a664e2dc75fb0ade2ff2986a
Author: Ruifeng Zheng
AuthorDate: Thu Jan 25 22:40:35 2024 -0800

[SPARK-46871][PS][TESTS] Clean up the imports in `pyspark.pandas.tests.computation.*`

### What changes were proposed in this pull request?

Clean up the imports in `pyspark.pandas.tests.computation.*`.

### Why are the changes needed?

1. Remove unused imports.
2. Define the test dataset on the vanilla side, so that it doesn't need to be defined again in the parity tests.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

CI

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44895 from zhengruifeng/ps_test_comput_cleanup.
Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- python/pyspark/pandas/tests/computation/test_any_all.py | 8 ++-- python/pyspark/pandas/tests/computation/test_apply_func.py | 12 ++-- python/pyspark/pandas/tests/computation/test_binary_ops.py | 12 ++-- python/pyspark/pandas/tests/computation/test_combine.py | 8 ++-- python/pyspark/pandas/tests/computation/test_compute.py | 8 ++-- python/pyspark/pandas/tests/computation/test_corr.py | 6 +- python/pyspark/pandas/tests/computation/test_corrwith.py | 8 ++-- python/pyspark/pandas/tests/computation/test_cov.py | 8 ++-- python/pyspark/pandas/tests/computation/test_cumulative.py | 8 ++-- python/pyspark/pandas/tests/computation/test_describe.py | 8 ++-- python/pyspark/pandas/tests/computation/test_eval.py | 8 ++-- python/pyspark/pandas/tests/computation/test_melt.py | 8 ++-- python/pyspark/pandas/tests/computation/test_missing_data.py | 8 ++-- python/pyspark/pandas/tests/computation/test_pivot.py| 4 ++-- python/pyspark/pandas/tests/computation/test_pivot_table.py | 4 ++-- .../pyspark/pandas/tests/computation/test_pivot_table_adv.py | 4 ++-- .../pandas/tests/computation/test_pivot_table_multi_idx.py | 4 ++-- .../tests/computation/test_pivot_table_multi_idx_adv.py | 4 ++-- python/pyspark/pandas/tests/computation/test_stats.py| 6 +- .../pandas/tests/connect/computation/test_parity_any_all.py | 11 ++- .../tests/connect/computation/test_parity_apply_func.py | 9 - .../tests/connect/computation/test_parity_binary_ops.py | 11 ++- .../pandas/tests/connect/computation/test_parity_combine.py | 6 +- .../pandas/tests/connect/computation/test_parity_compute.py | 6 +- .../pandas/tests/connect/computation/test_parity_corr.py | 7 +-- .../pandas/tests/connect/computation/test_parity_corrwith.py | 11 ++- .../pandas/tests/connect/computation/test_parity_cov.py | 11 ++- .../tests/connect/computation/test_parity_cumulative.py | 9 - .../pandas/tests/connect/computation/test_parity_describe.py | 5 + 
.../pandas/tests/connect/computation/test_parity_eval.py | 11 ++- .../pandas/tests/connect/computation/test_parity_melt.py | 11 ++- .../tests/connect/computation/test_parity_missing_data.py| 9 - 32 files changed, 164 insertions(+), 89 deletions(-) diff --git a/python/pyspark/pandas/tests/computation/test_any_all.py b/python/pyspark/pandas/tests/computation/test_any_all.py index 5e946be7b08b..784e355f3b58 100644 --- a/python/pyspark/pandas/tests/computation/test_any_all.py +++ b/python/pyspark/pandas/tests/computation/test_any_all.py @@ -20,7 +20,7 @@ import numpy as np import pandas as pd from pyspark import pandas as ps -from pyspark.testing.pandasutils import ComparisonTestBase +from pyspark.testing.pandasutils import PandasOnSparkTestCase from pyspark.testing.sqlutils import SQLTestUtils @@ -149,7 +149,11 @@ class FrameAnyAllMixin: psdf.any(axis=1) -class FrameAnyAllTests(FrameAnyAllMixin, ComparisonTestBase, SQLTestUtils): +class FrameAnyAllTests( +FrameAnyAllMixin, +PandasOnSparkTestCase, +SQLTestUtils, +): pass diff --git a/python/pyspark/pandas/tests/computation/test_apply_func.py b/python/pyspark/pandas/tests/computation/test_apply_func.py index de82c061b58c..ad43a2f2b270 100644 --- a/python/pyspark/pandas/tests/computation/test_apply_func.py +++ b/python/pyspark/pandas/tests/computation/test_apply_func.py @@ -25,7 +25,7 @@ import pandas as pd from pyspark import pandas as ps from pyspark.loose_version import LooseVe
(spark) branch branch-3.4 updated: [SPARK-46855][INFRA][3.4] Add `sketch` to the dependencies of the `catalyst` in `module.py`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.4 by this push:
     new 441c33da0dbb [SPARK-46855][INFRA][3.4] Add `sketch` to the dependencies of the `catalyst` in `module.py`
441c33da0dbb is described below

commit 441c33da0dbba26c54d6a46805f8902605472007
Author: yangjie01
AuthorDate: Thu Jan 25 22:36:32 2024 -0800

[SPARK-46855][INFRA][3.4] Add `sketch` to the dependencies of the `catalyst` in `module.py`

### What changes were proposed in this pull request?

This PR adds `sketch` to the dependencies of the `catalyst` module in `module.py`, because `sketch` is a direct dependency of the `catalyst` module.

### Why are the changes needed?

Ensure that when modifying the `sketch` module, both `catalyst` and cascading modules will trigger tests.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44894 from LuciferYang/SPARK-46855-34.
Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- dev/sparktestsupport/modules.py | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py index ac24ea19d0e7..100dd236c81d 100644 --- a/dev/sparktestsupport/modules.py +++ b/dev/sparktestsupport/modules.py @@ -168,6 +168,15 @@ launcher = Module( ], ) +sketch = Module( +name="sketch", +dependencies=[tags], +source_file_regexes=[ +"common/sketch/", +], +sbt_test_goals=["sketch/test"], +) + core = Module( name="core", dependencies=[kvstore, network_common, network_shuffle, unsafe, launcher], @@ -181,7 +190,7 @@ core = Module( catalyst = Module( name="catalyst", -dependencies=[tags, core], +dependencies=[tags, sketch, core], source_file_regexes=[ "sql/catalyst/", ], @@ -295,15 +304,6 @@ protobuf = Module( ], ) -sketch = Module( -name="sketch", -dependencies=[tags], -source_file_regexes=[ -"common/sketch/", -], -sbt_test_goals=["sketch/test"], -) - graphx = Module( name="graphx", dependencies=[tags, core], - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
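The point of this reordering is that the test infrastructure computes which modules a change affects by walking the `dependencies` lists in `modules.py`: after the change, editing `sketch` also triggers `catalyst` and everything downstream. A small Python sketch of that transitive lookup, using a simplified edge set (the real graph in `modules.py` is much larger):

```python
# Simplified module dependency edges mirroring the updated modules.py:
# catalyst now depends on sketch, so a change to sketch affects catalyst.
deps = {
    "tags": [],
    "sketch": ["tags"],
    "core": ["tags"],  # simplified; the real core module has more dependencies
    "catalyst": ["tags", "sketch", "core"],
    "sql": ["catalyst"],
}

def affected_by(changed, deps):
    """Return every module whose dependency closure includes `changed`."""
    def depends_on(mod, target, seen=()):
        children = deps.get(mod, [])
        return target in children or any(
            depends_on(d, target, seen + (mod,)) for d in children if d not in seen
        )
    return {m for m in deps if m == changed or depends_on(m, changed)}

print(sorted(affected_by("sketch", deps)))  # ['catalyst', 'sketch', 'sql']
```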
(spark) branch branch-3.5 updated: [SPARK-46855][INFRA][3.5] Add `sketch` to the dependencies of the `catalyst` in `module.py`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new e5a654e818b4 [SPARK-46855][INFRA][3.5] Add `sketch` to the dependencies of the `catalyst` in `module.py`
e5a654e818b4 is described below

commit e5a654e818b4698260807a081e5cf3d71480ac13
Author: yangjie01
AuthorDate: Thu Jan 25 22:35:38 2024 -0800

[SPARK-46855][INFRA][3.5] Add `sketch` to the dependencies of the `catalyst` in `module.py`

### What changes were proposed in this pull request?

This PR adds `sketch` to the dependencies of the `catalyst` module in `module.py`, because `sketch` is a direct dependency of the `catalyst` module.

### Why are the changes needed?

Ensure that when modifying the `sketch` module, both `catalyst` and cascading modules will trigger tests.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44893 from LuciferYang/SPARK-46855-35.
Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- dev/sparktestsupport/modules.py | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py index 33d253a47ea0..d29fc8726018 100644 --- a/dev/sparktestsupport/modules.py +++ b/dev/sparktestsupport/modules.py @@ -168,6 +168,15 @@ launcher = Module( ], ) +sketch = Module( +name="sketch", +dependencies=[tags], +source_file_regexes=[ +"common/sketch/", +], +sbt_test_goals=["sketch/test"], +) + core = Module( name="core", dependencies=[kvstore, network_common, network_shuffle, unsafe, launcher], @@ -181,7 +190,7 @@ core = Module( catalyst = Module( name="catalyst", -dependencies=[tags, core], +dependencies=[tags, sketch, core], source_file_regexes=[ "sql/catalyst/", ], @@ -295,15 +304,6 @@ connect = Module( ], ) -sketch = Module( -name="sketch", -dependencies=[tags], -source_file_regexes=[ -"common/sketch/", -], -sbt_test_goals=["sketch/test"], -) - graphx = Module( name="graphx", dependencies=[tags, core], - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46872][CORE] Recover `log-view.js` to be non-module
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 1eedb7507ae2 [SPARK-46872][CORE] Recover `log-view.js` to be non-module
1eedb7507ae2 is described below

commit 1eedb7507ae23d069e65a40c202173a709c5e94d
Author: Dongjoon Hyun
AuthorDate: Thu Jan 25 22:31:56 2024 -0800

[SPARK-46872][CORE] Recover `log-view.js` to be non-module

### What changes were proposed in this pull request?

This PR aims to recover `log-view.js` to be non-module to fix a loading issue.

### Why are the changes needed?

- #43903

![Screenshot 2024-01-25 at 9 08 48 PM](https://github.com/apache/spark/assets/9700541/830fadc8-ab1c-4cf4-9e56-493f9553b3ae)

### Does this PR introduce _any_ user-facing change?

No. This is a recovery to the status before SPARK-46003, which is not released yet.

### How was this patch tested?

Manually.
- Checkout the SPARK-46003 commit and build.
- Start Master and Worker.
- Open an `Incognito` or `Private` mode browser and go to the Worker Log.
- Check the `initLogPage` error via the developer tools

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44896 from dongjoon-hyun/SPARK-46872.
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- core/src/main/resources/org/apache/spark/ui/static/log-view.js | 4 +--- core/src/main/scala/org/apache/spark/ui/UIUtils.scala | 2 +- 2 files changed, 2 insertions(+), 4 deletions(-) diff --git a/core/src/main/resources/org/apache/spark/ui/static/log-view.js b/core/src/main/resources/org/apache/spark/ui/static/log-view.js index eaf7130e974b..0b917ee5c8d8 100644 --- a/core/src/main/resources/org/apache/spark/ui/static/log-view.js +++ b/core/src/main/resources/org/apache/spark/ui/static/log-view.js @@ -17,8 +17,6 @@ /* global $ */ -import {getBaseURI} from "./utils.js"; - var baseParams; var curLogLength; @@ -60,7 +58,7 @@ function getRESTEndPoint() { // If the worker is served from the master through a proxy (see doc on spark.ui.reverseProxy), // we need to retain the leading ../proxy// part of the URL when making REST requests. // Similar logic is contained in executorspage.js function createRESTEndPoint. - var words = getBaseURI().split('/'); + var words = (document.baseURI || document.URL).split('/'); var ind = words.indexOf("proxy"); if (ind > 0) { return words.slice(0, ind + 2).join('/') + "/log"; diff --git a/core/src/main/scala/org/apache/spark/ui/UIUtils.scala b/core/src/main/scala/org/apache/spark/ui/UIUtils.scala index d124717ea85a..14255d276d66 100644 --- a/core/src/main/scala/org/apache/spark/ui/UIUtils.scala +++ b/core/src/main/scala/org/apache/spark/ui/UIUtils.scala @@ -220,7 +220,7 @@ private[spark] object UIUtils extends Logging { - + setUIRoot('{UIUtils.uiRoot(request)}') } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46870][CORE] Support Spark Master Log UI
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new d6fc06bd4515 [SPARK-46870][CORE] Support Spark Master Log UI
d6fc06bd4515 is described below

commit d6fc06bd451586edc5e55068aabecb3dc7ec5849
Author: Dongjoon Hyun
AuthorDate: Thu Jan 25 21:15:30 2024 -0800

[SPARK-46870][CORE] Support Spark Master Log UI

### What changes were proposed in this pull request?

This PR aims to support `Spark Master` Log UI.

### Why are the changes needed?

This is a new feature to allow the users to access the master log like the following. The value of `Status`, e.g., `ALIVE`, has a new link for the log UI.

**BEFORE**
![Screenshot 2024-01-25 at 7 30 07 PM](https://github.com/apache/spark/assets/9700541/2c263944-ebfa-49bb-955f-d9a022e23cba)

**AFTER**
![Screenshot 2024-01-25 at 7 28 59 PM](https://github.com/apache/spark/assets/9700541/8d096261-3a31-4746-b52b-e01cfcdf3237)
![Screenshot 2024-01-25 at 7 29 21 PM](https://github.com/apache/spark/assets/9700541/fc4d3c10-8695-4529-a92b-6ab477c961da)

### Does this PR introduce _any_ user-facing change?

No. This is a new link and UI.

### How was this patch tested?

Manually.
```
$ sbin/start-master.sh
```

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44890 from dongjoon-hyun/SPARK-46870.
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../apache/spark/deploy/master/ui/LogPage.scala| 125 + .../apache/spark/deploy/master/ui/MasterPage.scala | 4 +- .../spark/deploy/master/ui/MasterWebUI.scala | 1 + 3 files changed, 129 insertions(+), 1 deletion(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/master/ui/LogPage.scala b/core/src/main/scala/org/apache/spark/deploy/master/ui/LogPage.scala new file mode 100644 index ..9da05025e1a3 --- /dev/null +++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/LogPage.scala @@ -0,0 +1,125 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.deploy.master.ui + +import java.io.File +import javax.servlet.http.HttpServletRequest + +import scala.xml.{Node, Unparsed} + +import org.apache.spark.internal.Logging +import org.apache.spark.ui.{UIUtils, WebUIPage} +import org.apache.spark.util.Utils +import org.apache.spark.util.logging.RollingFileAppender + +private[ui] class LogPage(parent: MasterWebUI) extends WebUIPage("logPage") with Logging { + private val defaultBytes = 100 * 1024 + + def render(request: HttpServletRequest): Seq[Node] = { +val logDir = sys.env.getOrElse("SPARK_LOG_DIR", "logs/") +val logType = request.getParameter("logType") +val offset = Option(request.getParameter("offset")).map(_.toLong) +val byteLength = Option(request.getParameter("byteLength")).map(_.toInt) + .getOrElse(defaultBytes) +val (logText, startByte, endByte, logLength) = getLog(logDir, logType, offset, byteLength) +val curLogLength = endByte - startByte +val range = + +Showing {curLogLength} Bytes: {startByte.toString} - {endByte.toString} of {logLength} + + +val moreButton = + +Load More + + +val newButton = + +Load New + + +val alert = + +End of Log + + +val logParams = "?self&logType=%s".format(logType) +val jsOnload = "window.onload = " + + s"initLogPage('$logParams', $curLogLength, $startByte, $endByte, $logLength, $byteLength);" + +val content = + ++ + +Back to Master +{range} + + {moreButton} + {logText} + {alert} + {newButton} + +{Unparsed(jsOnload)} + + +UIUtils.basicSparkPage(request, content, logType + " log pag
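The new `LogPage` renders a byte window of the master log controlled by the `offset` and `byteLength` request parameters, defaulting to the last 100 KiB. The body of `getLog` is cut off above, so the following Python sketch of the window arithmetic is an assumption inferred from the parameters shown (`offset`, `byteLength`, `defaultBytes`), not a transcription of the Scala code:

```python
# Assumed sketch of the [start, end) byte-window computation for a log page:
# no offset means "show the tail", and the window is clamped to the file size.
def log_window(log_length, offset=None, byte_length=100 * 1024):
    start = log_length - byte_length if offset is None else offset
    start = max(0, min(start, log_length))        # clamp into the file
    end = min(start + byte_length, log_length)
    return start, end

print(log_window(1_000_000))  # tail window: (897600, 1000000)
```

This matches the page's "Showing {curLogLength} Bytes: {startByte} - {endByte} of {logLength}" banner, where `curLogLength = endByte - startByte`.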
(spark) branch master updated: [SPARK-46868][CORE] Support Spark Worker Log UI
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 48cd8604f953 [SPARK-46868][CORE] Support Spark Worker Log UI
48cd8604f953 is described below

commit 48cd8604f953dc82cadb6c076914d4d5c69b8126
Author: Dongjoon Hyun
AuthorDate: Thu Jan 25 19:31:26 2024 -0800

[SPARK-46868][CORE] Support Spark Worker Log UI

### What changes were proposed in this pull request?

This PR aims to support `Spark Worker Log UI` when `SPARK_LOG_DIR` is under the work directory.

### Why are the changes needed?

This is a new feature to allow the users to access the worker log like the following.

**BEFORE**
![Screenshot 2024-01-25 at 3 04 20 PM](https://github.com/apache/spark/assets/9700541/73ef33d5-9b56-4cca-83c2-9fd2e8ab5201)

**AFTER**
- Worker Page (Worker ID provides a new hyperlink for Log UI)
![Screenshot 2024-01-25 at 2 58 44 PM](https://github.com/apache/spark/assets/9700541/1de66eee-7b73-4be3-a12c-e008442b7b6c)
- Log UI
![Screenshot 2024-01-25 at 6 00 25 PM](https://github.com/apache/spark/assets/9700541/e20fde05-ce5e-42cb-9112-4a8d2ec69418)

### Does this PR introduce _any_ user-facing change?

To provide a better UX.

### How was this patch tested?

Manually.
```
$ sbin/start-master.sh
$ SPARK_LOG_DIR=$PWD/work/logs sbin/start-worker.sh spark://$(hostname):7077
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44888 from dongjoon-hyun/SPARK-46868.
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../apache/spark/deploy/worker/ui/LogPage.scala| 29 -- .../apache/spark/deploy/worker/ui/WorkerPage.scala | 6 - 2 files changed, 26 insertions(+), 9 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala b/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala index dd714cdc4437..991c791cc79e 100644 --- a/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala +++ b/core/src/main/scala/org/apache/spark/deploy/worker/ui/LogPage.scala @@ -30,23 +30,26 @@ import org.apache.spark.util.logging.RollingFileAppender private[ui] class LogPage(parent: WorkerWebUI) extends WebUIPage("logPage") with Logging { private val worker = parent.worker private val workDir = new File(parent.workDir.toURI.normalize().getPath) - private val supportedLogTypes = Set("stderr", "stdout") + private val supportedLogTypes = Set("stderr", "stdout", "out") private val defaultBytes = 100 * 1024 def renderLog(request: HttpServletRequest): String = { val appId = Option(request.getParameter("appId")) val executorId = Option(request.getParameter("executorId")) val driverId = Option(request.getParameter("driverId")) +val self = Option(request.getParameter("self")) val logType = request.getParameter("logType") val offset = Option(request.getParameter("offset")).map(_.toLong) val byteLength = Option(request.getParameter("byteLength")).map(_.toInt) .getOrElse(defaultBytes) -val logDir = (appId, executorId, driverId) match { - case (Some(a), Some(e), None) => +val logDir = (appId, executorId, driverId, self) match { + case (Some(a), Some(e), None, None) => s"${workDir.getPath}/$a/$e/" - case (None, None, Some(d)) => + case (None, None, Some(d), None) => s"${workDir.getPath}/$d/" + case (None, None, None, Some(_)) => +s"${sys.env.getOrElse("SPARK_LOG_DIR", workDir.getPath)}/" case _ => throw new Exception("Request must specify either application or driver identifiers") } @@ -60,16 
+63,19 @@ private[ui] class LogPage(parent: WorkerWebUI) extends WebUIPage("logPage") with val appId = Option(request.getParameter("appId")) val executorId = Option(request.getParameter("executorId")) val driverId = Option(request.getParameter("driverId")) +val self = Option(request.getParameter("self")) val logType = request.getParameter("logType") val offset = Option(request.getParameter("offset")).map(_.toLong) val byteLength = Option(request.getParameter("byteLength")).map(_.toInt) .getOrElse(defaultBytes) -val (logDir, params, pageName) = (appId, executorId, driverId) match { - case (Some(a), Some(e), None) => +val (logDir, params, pageName) = (appId, executorId, driverId, self) match { +
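The core of the SPARK-46868 diff above is the widened pattern match: a log request is routed by which of `appId`/`executorId`, `driverId`, or the new `self` parameter is present, with `self` resolving to `SPARK_LOG_DIR` (falling back to the work directory). A Python sketch of that dispatch, with parameter names of our choosing:

```python
def resolve_log_dir(work_dir, app_id=None, executor_id=None,
                    driver_id=None, self_log=False, spark_log_dir=None):
    """Pick the directory to serve logs from, mirroring the match in
    LogPage.renderLog. Illustration only; not Spark's actual code."""
    if app_id and executor_id and not driver_id and not self_log:
        return f"{work_dir}/{app_id}/{executor_id}/"   # executor logs
    if driver_id and not (app_id or executor_id or self_log):
        return f"{work_dir}/{driver_id}/"              # driver logs
    if self_log and not (app_id or executor_id or driver_id):
        # SPARK-46868: the worker's own logs, from SPARK_LOG_DIR if set.
        return (spark_log_dir or work_dir) + "/"
    raise ValueError(
        "Request must specify either application or driver identifiers")
```

Any ambiguous combination (or none at all) is rejected, matching the `case _ => throw ...` branch.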
(spark) branch master updated: [SPARK-46869][K8S] Add `logrotate` to Spark docker files
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 25df5dad6761 [SPARK-46869][K8S] Add `logrotate` to Spark docker files 25df5dad6761 is described below commit 25df5dad67610357287bef075ee755c59acdb904 Author: Dongjoon Hyun AuthorDate: Thu Jan 25 16:30:55 2024 -0800 [SPARK-46869][K8S] Add `logrotate` to Spark docker files ### What changes were proposed in this pull request? This PR aims to add `logrotate` to Spark docker files. ### Why are the changes needed? To help a user to easily rotate the logs by configuration. Note that this is not for rigorous users who cannot allow log data loss. `logrotate` is easy but is known to allow log loss during rotation. ### Does this PR introduce _any_ user-facing change? The image size change is negligible. ``` $ docker images spark REPOSITORY  TAG  IMAGE ID  CREATED  SIZE  spark  latest-logrotate  d843879458af  18 hours ago  657MB  spark  latest  0e281bd1fbe6  18 hours ago  657MB ``` ### How was this patch tested? Manually. ``` $ docker run -it --rm spark:latest-logrotate /usr/sbin/logrotate | tail -n7 logrotate 3.19.0 - Copyright (C) 1995-2001 Red Hat, Inc. This may be freely redistributed under the terms of the GNU General Public License Usage: logrotate [-dfv?] [-d|--debug] [-f|--force] [-m|--mail=command] [-s|--state=statefile] [--skip-state-lock] [-v|--verbose] [-l|--log=logfile] [--version] [-?|--help] [--usage] [OPTION...] ``` ### Was this patch authored or co-authored using generative AI tooling? Pass the CIs. Closes #44889 from dongjoon-hyun/SPARK-46869. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../kubernetes/docker/src/main/dockerfiles/spark/Dockerfile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile index b80e72c768c6..25d7e076169b 100644 --- a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile +++ b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile @@ -30,7 +30,7 @@ ARG spark_uid=185 RUN set -ex && \ apt-get update && \ ln -s /lib /lib64 && \ -apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps net-tools && \ +apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps net-tools logrotate && \ mkdir -p /opt/spark && \ mkdir -p /opt/spark/examples && \ mkdir -p /opt/spark/work-dir && \ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
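The commit above only installs the `logrotate` binary; any rotation policy is left to the user's configuration. As a hedged illustration of what such a configuration could look like, here is a small Python helper that renders a plausible logrotate stanza for Spark daemon logs. Every value in it (the log path glob, the size threshold, the retention count) is an assumption of ours, not something mandated by the commit:

```python
def spark_logrotate_conf(log_glob="/opt/spark/logs/*.out",
                         keep=5, max_size="100M"):
    """Render a minimal logrotate config for Spark daemon logs.

    All paths and policy values here are example assumptions; the
    SPARK-46869 commit only adds the logrotate package to the image.
    copytruncate avoids restarting the daemon, at the cost of possibly
    losing lines written during the copy (the log-loss caveat noted
    in the commit message).
    """
    return (
        f"{log_glob} {{\n"
        f"    size {max_size}\n"       # rotate once a log exceeds this size
        f"    rotate {keep}\n"         # keep this many rotated copies
        "    compress\n"
        "    missingok\n"
        "    notifempty\n"
        "    copytruncate\n"
        "}\n"
    )
```

Writing this text to a file and pointing `logrotate` at it (e.g. via a cron entry in the container) would complete the setup.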
(spark) branch branch-3.4 updated: [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 3130ac9276bd [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler 3130ac9276bd is described below commit 3130ac9276bd43dd21aa1aa5e5ef920b00bc3aff Author: fred-db AuthorDate: Thu Jan 25 08:34:37 2024 -0800 [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler * The DAGScheduler could currently run into a deadlock with another thread if both access the partitions of the same RDD at the same time. * To make progress in getCacheLocs, we require both exclusive access to the RDD partitions and the location cache. We first lock on the location cache, and then on the RDD. * When accessing partitions of an RDD, the RDD first acquires exclusive access on the partitions, and then might acquire exclusive access on the location cache. * If thread 1 is able to acquire access on the RDD, while thread 2 holds the access to the location cache, we can run into a deadlock situation. * To fix this, acquire locks in the same order. Change the DAGScheduler to first acquire the lock on the RDD, and then the lock on the location cache. * This is a deadlock you can run into, which can prevent any progress on the cluster. * No * Unit test that reproduces the issue. No Closes #44882 from fred-db/fix-deadlock. 
Authored-by: fred-db Signed-off-by: Dongjoon Hyun (cherry picked from commit 617014cc92d933c70c9865a578fceb265883badd) Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/rdd/RDD.scala | 11 --- .../org/apache/spark/scheduler/DAGScheduler.scala | 31 ++ .../apache/spark/scheduler/DAGSchedulerSuite.scala | 38 +- 3 files changed, 62 insertions(+), 18 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/rdd/RDD.scala b/core/src/main/scala/org/apache/spark/rdd/RDD.scala index 407820b663a3..fc5a2089f43b 100644 --- a/core/src/main/scala/org/apache/spark/rdd/RDD.scala +++ b/core/src/main/scala/org/apache/spark/rdd/RDD.scala @@ -223,14 +223,17 @@ abstract class RDD[T: ClassTag]( * not use `this` because RDDs are user-visible, so users might have added their own locking on * RDDs; sharing that could lead to a deadlock. * - * One thread might hold the lock on many of these, for a chain of RDD dependencies; but - * because DAGs are acyclic, and we only ever hold locks for one path in that DAG, there is no - * chance of deadlock. + * One thread might hold the lock on many of these, for a chain of RDD dependencies. Deadlocks + * are possible if we try to lock another resource while holding the stateLock, + * and the lock acquisition sequence of these locks is not guaranteed to be the same. + * This can lead lead to a deadlock as one thread might first acquire the stateLock, + * and then the resource, + * while another thread might first acquire the resource, and then the stateLock. * * Executors may reference the shared fields (though they should never mutate them, * that only happens on the driver). 
*/ - private val stateLock = new Serializable {} + private[spark] val stateLock = new Serializable {} // Our dependencies and partitions will be gotten by calling subclass's methods below, and will // be overwritten when we're checkpointed diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala index 2a966fab6f02..26be8c72bbcb 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala @@ -173,6 +173,9 @@ private[spark] class DAGScheduler( * locations where that RDD partition is cached. * * All accesses to this map should be guarded by synchronizing on it (see SPARK-4454). + * If you need to access any RDD while synchronizing on the cache locations, + * first synchronize on the RDD, and then synchronize on this map to avoid deadlocks. The RDD + * could try to access the cache locations after synchronizing on the RDD. */ private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]] @@ -408,22 +411,24 @@ private[spark] class DAGScheduler( } private[scheduler] - def getCacheLocs(rdd: RDD[_]): IndexedSeq[Seq[TaskLocation]] = cacheLocs.synchronized { -// Note: this doesn't use `getOrElse()` because this method is called O(num tasks) times -if (!cacheLocs.contains(rdd.id)) { - // Note: if the storage level is NONE, we don't need to get locations from block manager. - val locs: IndexedSeq[Seq[TaskLocation]] = if (rdd.getStorageLevel == StorageLevel.NONE) { -IndexedSeq.fill(rdd.pa
(spark) branch branch-3.5 updated: [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 125b2f87d453 [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler 125b2f87d453 is described below commit 125b2f87d453a16325f24e7382707f2b365bba14 Author: fred-db AuthorDate: Thu Jan 25 08:34:37 2024 -0800 [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler * The DAGScheduler could currently run into a deadlock with another thread if both access the partitions of the same RDD at the same time. * To make progress in getCacheLocs, we require both exclusive access to the RDD partitions and the location cache. We first lock on the location cache, and then on the RDD. * When accessing partitions of an RDD, the RDD first acquires exclusive access on the partitions, and then might acquire exclusive access on the location cache. * If thread 1 is able to acquire access on the RDD, while thread 2 holds the access to the location cache, we can run into a deadlock situation. * To fix this, acquire locks in the same order. Change the DAGScheduler to first acquire the lock on the RDD, and then the lock on the location cache. * This is a deadlock you can run into, which can prevent any progress on the cluster. * No * Unit test that reproduces the issue. No Closes #44882 from fred-db/fix-deadlock. 
Authored-by: fred-db Signed-off-by: Dongjoon Hyun (cherry picked from commit 617014cc92d933c70c9865a578fceb265883badd) Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/rdd/RDD.scala | 11 --- .../org/apache/spark/scheduler/DAGScheduler.scala | 31 ++ .../apache/spark/scheduler/DAGSchedulerSuite.scala | 38 +- 3 files changed, 62 insertions(+), 18 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/rdd/RDD.scala b/core/src/main/scala/org/apache/spark/rdd/RDD.scala index a21d2ae77396..f695b1020275 100644 --- a/core/src/main/scala/org/apache/spark/rdd/RDD.scala +++ b/core/src/main/scala/org/apache/spark/rdd/RDD.scala @@ -223,14 +223,17 @@ abstract class RDD[T: ClassTag]( * not use `this` because RDDs are user-visible, so users might have added their own locking on * RDDs; sharing that could lead to a deadlock. * - * One thread might hold the lock on many of these, for a chain of RDD dependencies; but - * because DAGs are acyclic, and we only ever hold locks for one path in that DAG, there is no - * chance of deadlock. + * One thread might hold the lock on many of these, for a chain of RDD dependencies. Deadlocks + * are possible if we try to lock another resource while holding the stateLock, + * and the lock acquisition sequence of these locks is not guaranteed to be the same. + * This can lead lead to a deadlock as one thread might first acquire the stateLock, + * and then the resource, + * while another thread might first acquire the resource, and then the stateLock. * * Executors may reference the shared fields (though they should never mutate them, * that only happens on the driver). 
*/ - private val stateLock = new Serializable {} + private[spark] val stateLock = new Serializable {} // Our dependencies and partitions will be gotten by calling subclass's methods below, and will // be overwritten when we're checkpointed diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala index d8adaae19b90..89d16e579348 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala @@ -174,6 +174,9 @@ private[spark] class DAGScheduler( * locations where that RDD partition is cached. * * All accesses to this map should be guarded by synchronizing on it (see SPARK-4454). + * If you need to access any RDD while synchronizing on the cache locations, + * first synchronize on the RDD, and then synchronize on this map to avoid deadlocks. The RDD + * could try to access the cache locations after synchronizing on the RDD. */ private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]] @@ -420,22 +423,24 @@ private[spark] class DAGScheduler( } private[scheduler] - def getCacheLocs(rdd: RDD[_]): IndexedSeq[Seq[TaskLocation]] = cacheLocs.synchronized { -// Note: this doesn't use `getOrElse()` because this method is called O(num tasks) times -if (!cacheLocs.contains(rdd.id)) { - // Note: if the storage level is NONE, we don't need to get locations from block manager. - val locs: IndexedSeq[Seq[TaskLocation]] = if (rdd.getStorageLevel == StorageLevel.NONE) { -IndexedSeq.fill(rdd.pa
(spark) branch master updated: [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 617014cc92d9 [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler 617014cc92d9 is described below commit 617014cc92d933c70c9865a578fceb265883badd Author: fred-db AuthorDate: Thu Jan 25 08:34:37 2024 -0800 [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler ### What changes were proposed in this pull request? * The DAGScheduler could currently run into a deadlock with another thread if both access the partitions of the same RDD at the same time. * To make progress in getCacheLocs, we require both exclusive access to the RDD partitions and the location cache. We first lock on the location cache, and then on the RDD. * When accessing partitions of an RDD, the RDD first acquires exclusive access on the partitions, and then might acquire exclusive access on the location cache. * If thread 1 is able to acquire access on the RDD, while thread 2 holds the access to the location cache, we can run into a deadlock situation. * To fix this, acquire locks in the same order. Change the DAGScheduler to first acquire the lock on the RDD, and then the lock on the location cache. ### Why are the changes needed? * This is a deadlock you can run into, which can prevent any progress on the cluster. ### Does this PR introduce _any_ user-facing change? * No ### How was this patch tested? * Unit test that reproduces the issue. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44882 from fred-db/fix-deadlock. 
Authored-by: fred-db Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/rdd/RDD.scala | 11 --- .../org/apache/spark/scheduler/DAGScheduler.scala | 31 ++ .../apache/spark/scheduler/DAGSchedulerSuite.scala | 38 +- 3 files changed, 62 insertions(+), 18 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/rdd/RDD.scala b/core/src/main/scala/org/apache/spark/rdd/RDD.scala index d73fb1b9bc3b..a48eaa253ad1 100644 --- a/core/src/main/scala/org/apache/spark/rdd/RDD.scala +++ b/core/src/main/scala/org/apache/spark/rdd/RDD.scala @@ -224,14 +224,17 @@ abstract class RDD[T: ClassTag]( * not use `this` because RDDs are user-visible, so users might have added their own locking on * RDDs; sharing that could lead to a deadlock. * - * One thread might hold the lock on many of these, for a chain of RDD dependencies; but - * because DAGs are acyclic, and we only ever hold locks for one path in that DAG, there is no - * chance of deadlock. + * One thread might hold the lock on many of these, for a chain of RDD dependencies. Deadlocks + * are possible if we try to lock another resource while holding the stateLock, + * and the lock acquisition sequence of these locks is not guaranteed to be the same. + * This can lead lead to a deadlock as one thread might first acquire the stateLock, + * and then the resource, + * while another thread might first acquire the resource, and then the stateLock. * * Executors may reference the shared fields (though they should never mutate them, * that only happens on the driver). 
*/ - private val stateLock = new Serializable {} + private[spark] val stateLock = new Serializable {} // Our dependencies and partitions will be gotten by calling subclass's methods below, and will // be overwritten when we're checkpointed diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala index e728d921d290..e74a3efac250 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala @@ -181,6 +181,9 @@ private[spark] class DAGScheduler( * locations where that RDD partition is cached. * * All accesses to this map should be guarded by synchronizing on it (see SPARK-4454). + * If you need to access any RDD while synchronizing on the cache locations, + * first synchronize on the RDD, and then synchronize on this map to avoid deadlocks. The RDD + * could try to access the cache locations after synchronizing on the RDD. */ private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]] @@ -435,22 +438,24 @@ private[spark] class DAGScheduler( } private[scheduler] - def getCacheLocs(rdd: RDD[_]): IndexedSeq[Seq[TaskLocation]] = cacheLocs.synchronized { -// Note: this doesn't use `getOrElse()` because this method is called O(num tasks) times -if (!cacheLocs.contains(rdd.id)) { - // Note: if the storage level is NONE, we don't n
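The SPARK-46861 fix above is a classic lock-ordering repair: a deadlock needs two threads acquiring the same pair of locks in opposite orders, so the patch makes every path take the RDD's `stateLock` before the `cacheLocs` lock. A self-contained Python analogy (the names are stand-ins for the two Spark locks; this is an illustration of the principle, not Spark's code):

```python
import threading

rdd_lock = threading.Lock()         # stand-in for the RDD's stateLock
cache_locs_lock = threading.Lock()  # stand-in for the cacheLocs map's monitor

def get_cache_locs(rdd_id, cache):
    # The fix: always take the RDD lock *before* the cache-locations
    # lock, so no thread can ever hold them in the opposite order.
    with rdd_lock:
        with cache_locs_lock:
            return cache.setdefault(rdd_id, [])

def touch_partitions(rdd_id, cache):
    # Paths that start from the RDD use the same order, so the circular
    # wait a deadlock requires cannot form.
    with rdd_lock:
        partitions = list(range(3))   # pretend partition computation
        with cache_locs_lock:
            cache[rdd_id] = partitions
        return partitions
```

Before the fix, the equivalent of `get_cache_locs` took `cache_locs_lock` first, which could interleave fatally with a thread already inside `touch_partitions`.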
(spark) branch master updated: [SPARK-46855][INFRA] Add `sketch` to the dependencies of the `catalyst` in `module.py`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1bee07e39f1b [SPARK-46855][INFRA] Add `sketch` to the dependencies of the `catalyst` in `module.py` 1bee07e39f1b is described below commit 1bee07e39f1b5aef6ce81e028207691f1dd1fc7c Author: yangjie01 AuthorDate: Thu Jan 25 08:26:13 2024 -0800 [SPARK-46855][INFRA] Add `sketch` to the dependencies of the `catalyst` in `module.py` ### What changes were proposed in this pull request? This pr add `sketch` to the dependencies of the `catalyst` module in `module.py` due to `sketch` is direct dependency of `catalyst` module. ### Why are the changes needed? Ensure that when modifying the `sketch` module, both `catalyst` and cascading modules will trigger tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #44878 from LuciferYang/SPARK-46855. 
Lead-authored-by: yangjie01 Co-authored-by: YangJie Signed-off-by: Dongjoon Hyun --- dev/sparktestsupport/modules.py | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py index be3e798b0779..b9541c4be9b3 100644 --- a/dev/sparktestsupport/modules.py +++ b/dev/sparktestsupport/modules.py @@ -179,6 +179,15 @@ launcher = Module( ], ) +sketch = Module( +name="sketch", +dependencies=[tags], +source_file_regexes=[ +"common/sketch/", +], +sbt_test_goals=["sketch/test"], +) + core = Module( name="core", dependencies=[kvstore, network_common, network_shuffle, unsafe, launcher, utils], @@ -200,7 +209,7 @@ api = Module( catalyst = Module( name="catalyst", -dependencies=[tags, core, api], +dependencies=[tags, sketch, core, api], source_file_regexes=[ "sql/catalyst/", ], @@ -315,15 +324,6 @@ connect = Module( ], ) -sketch = Module( -name="sketch", -dependencies=[tags], -source_file_regexes=[ -"common/sketch/", -], -sbt_test_goals=["sketch/test"], -) - graphx = Module( name="graphx", dependencies=[tags, core], - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
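The point of the `modules.py` change above is test selection: when a file in a module changes, that module and everything that transitively depends on it must be tested, so `sketch` must be declared before (and as a dependency of) `catalyst`. A small Python sketch of that propagation; the module names mirror the diff, but the traversal itself is our illustration, not Spark's exact code:

```python
# module -> direct dependencies, echoing dev/sparktestsupport/modules.py
DEPS = {
    "tags": [],
    "sketch": ["tags"],
    "core": ["tags"],
    "api": ["core"],
    "catalyst": ["tags", "sketch", "core", "api"],  # SPARK-46855 adds sketch
    "sql": ["catalyst"],
}

def modules_to_test(changed):
    """Return every module that (transitively) depends on `changed`,
    including `changed` itself — i.e. the set whose tests must run."""
    affected = {changed}
    grew = True
    while grew:                      # fixed-point over the dependency edges
        grew = False
        for mod, deps in DEPS.items():
            if mod not in affected and affected & set(deps):
                affected.add(mod)
                grew = True
    return sorted(affected)
```

With the edge in place, touching `common/sketch/` now pulls in `catalyst` and its cascading modules; without it, `sketch` changes would only trigger `sketch` tests.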
(spark-website) branch asf-site updated: docs: udpate third party projects (#497)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/spark-website.git The following commit(s) were added to refs/heads/asf-site by this push: new a6ce63fb9c docs: udpate third party projects (#497) a6ce63fb9c is described below commit a6ce63fb9c82dc8f25f42f377b487c0de2aff826 Author: Matthew Powers AuthorDate: Thu Jan 25 11:18:05 2024 -0500 docs: udpate third party projects (#497) --- site/third-party-projects.html | 79 ++ third-party-projects.md| 77 2 files changed, 81 insertions(+), 75 deletions(-) diff --git a/site/third-party-projects.html b/site/third-party-projects.html index ba0911b733..a0f7a953f8 100644 --- a/site/third-party-projects.html +++ b/site/third-party-projects.html @@ -141,40 +141,57 @@ This page tracks external software projects that supplement Apache Spark and add to its ecosystem. -To add a project, open a pull request against the https://github.com/apache/spark-website";>spark-website -repository. Add an entry to -https://github.com/apache/spark-website/blob/asf-site/third-party-projects.md";>this markdown file, -then run jekyll build to generate the HTML too. Include -both in your pull request. See the README in this repo for more information. +Popular libraries with PySpark integrations -Note that all project and product names should follow trademark guidelines. 
+ + https://github.com/great-expectations/great_expectations";>great-expectations - Always know what to expect from your data + https://github.com/apache/airflow";>Apache Airflow - A platform to programmatically author, schedule, and monitor workflows + https://github.com/dmlc/xgboost";>xgboost - Scalable, portable and distributed gradient boosting + https://github.com/shap/shap";>shap - A game theoretic approach to explain the output of any machine learning model + https://github.com/awslabs/python-deequ";>python-deequ - Measures data quality in large datasets + https://github.com/datahub-project/datahub";>datahub - Metadata platform for the modern data stack + https://github.com/dbt-labs/dbt-spark";>dbt-spark - Enables dbt to work with Apache Spark + -spark-packages.org +Connectors -https://spark-packages.org/";>spark-packages.org is an external, -community-managed list of third-party libraries, add-ons, and applications that work with -Apache Spark. You can add a package as long as you have a GitHub repository. 
+ + https://github.com/spark-redshift-community/spark-redshift";>spark-redshift - Performant Redshift data source for Apache Spark + https://github.com/microsoft/sql-spark-connector";>spark-sql-connector - Apache Spark Connector for SQL Server and Azure SQL + https://github.com/Azure/azure-cosmosdb-spark";>azure-cosmos-spark - Apache Spark Connector for Azure Cosmos DB + https://github.com/Azure/azure-event-hubs-spark";>azure-event-hubs-spark - Enables continuous data processing with Apache Spark and Azure Event Hubs + https://github.com/Azure/azure-kusto-spark";>azure-kusto-spark - Apache Spark connector for Azure Kusto + https://github.com/mongodb/mongo-spark";>mongo-spark - The MongoDB Spark connector + https://github.com/couchbase/couchbase-spark-connector";>couchbase-spark-connector - The Official Couchbase Spark connector + https://github.com/datastax/spark-cassandra-connector";>spark-cassandra-connector - DataStax connector for Apache Spark to Apache Cassandra + https://github.com/elastic/elasticsearch-hadoop";>elasticsearch-hadoop - Elasticsearch real-time search and analytics natively integrated with Spark + https://github.com/neo4j-contrib/neo4j-spark-connector";>neo4j-spark-connector - Neo4j Connector for Apache Spark + https://github.com/StarRocks/starrocks-connector-for-apache-spark";>starrocks-connector-for-apache-spark - StarRocks Apache Spark connector + https://github.com/pingcap/tispark";>tispark - TiSpark is built for running Apache Spark on top of TiDB/TiKV + + +Open table formats + + + https://delta.io";>Delta Lake - Storage layer that provides ACID transactions and scalable metadata handling for Apache Spark workloads + https://github.com/apache/hudi";>Hudi: Upserts, Deletes And Incremental Processing on Big Data + https://github.com/apache/iceberg";>Iceberg - Open table format for analytic datasets + Infrastructure projects - https://github.com/spark-jobserver/spark-jobserver";>REST Job Server for Apache Spark - -REST interface for managing 
and submitting Spark jobs on the same cluster. - http://mlbase.org/";>MLbase - Machine Learning research project on top of Spark +
(spark) branch master updated: [SPARK-46828][SQL] Remove the invalid assertion of remote mode for spark sql shell
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 51cdf34226ed [SPARK-46828][SQL] Remove the invalid assertion of remote mode for spark sql shell 51cdf34226ed is described below commit 51cdf34226ed8d137ac1c8374cc2473dc4818bbf Author: Kent Yao AuthorDate: Wed Jan 24 22:17:16 2024 -0800 [SPARK-46828][SQL] Remove the invalid assertion of remote mode for spark sql shell ### What changes were proposed in this pull request? It is safe to clean up the read side code in SparkSQLCLIDriver as `org.apache.hadoop.hive.ql.session.SessionState.setIsHiveServerQuery` is never invoked. ### Why are the changes needed? code refactoring for the purpose of having more upgradable hive deps. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - build and run `bin/spark-sql` ``` Spark Web UI available at http://***:4040 Spark master: local[*], Application Id: local-1706087266338 spark-sql (default)> show tables; Time taken: 0.327 seconds spark-sql (default)> show databases; default Time taken: 0.161 seconds, Fetched 1 row(s) - CliSuite ### Was this patch authored or co-authored using generative AI tooling? no Closes #44868 from yaooqinn/SPARK-46828. 
Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- .../src/main/resources/error/error-classes.json| 5 --- .../spark/sql/errors/QueryExecutionErrors.scala| 6 .../sql/hive/thriftserver/SparkSQLCLIDriver.scala | 39 -- 3 files changed, 7 insertions(+), 43 deletions(-) diff --git a/common/utils/src/main/resources/error/error-classes.json b/common/utils/src/main/resources/error/error-classes.json index 6088300f8e64..1f3122a502c5 100644 --- a/common/utils/src/main/resources/error/error-classes.json +++ b/common/utils/src/main/resources/error/error-classes.json @@ -6201,11 +6201,6 @@ "Cannot create array with elements of data due to exceeding the limit elements for ArrayData. " ] }, - "_LEGACY_ERROR_TEMP_2178" : { -"message" : [ - "Remote operations not supported." -] - }, "_LEGACY_ERROR_TEMP_2179" : { "message" : [ "HiveServer2 Kerberos principal or keytab is not correctly configured." diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala index a3e905090bf3..69794517f917 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala @@ -1525,12 +1525,6 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase with ExecutionE cause = e) } - def remoteOperationsUnsupportedError(): SparkRuntimeException = { -new SparkRuntimeException( - errorClass = "_LEGACY_ERROR_TEMP_2178", - messageParameters = Map.empty) - } - def invalidKerberosConfigForHiveServer2Error(): Throwable = { new SparkException( errorClass = "_LEGACY_ERROR_TEMP_2179", diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala index e0a1a31a36f3..0d3538e30941 100644 --- 
a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala @@ -36,7 +36,6 @@ import org.apache.hadoop.hive.ql.Driver import org.apache.hadoop.hive.ql.processors._ import org.apache.hadoop.hive.ql.session.SessionState import org.apache.hadoop.security.{Credentials, UserGroupInformation} -import org.slf4j.LoggerFactory import sun.misc.{Signal, SignalHandler} import org.apache.spark.{ErrorMessageFormat, SparkConf, SparkThrowable, SparkThrowableHelper} @@ -45,7 +44,6 @@ import org.apache.spark.internal.Logging import org.apache.spark.sql.AnalysisException import org.apache.spark.sql.catalyst.analysis.FunctionRegistry import org.apache.spark.sql.catalyst.util.SQLKeywordUtils -import org.apache.spark.sql.errors.QueryExecutionErrors import org.apache.spark.sql.hive.client.HiveClientImpl import org.apache.spark.sql.hive.security.HiveDelegationTokenProvider import org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.closeHiveSessionStateIfStarted @@ -149,10 +147,6 @@ private[hive] object SparkSQLCLIDriver extends Logging { SparkSQLEnv.stop(exi
(spark) branch master updated: [SPARK-46846][CORE] Make `WorkerResourceInfo` extend `Serializable` explicitly
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b86053c430e5 [SPARK-46846][CORE] Make `WorkerResourceInfo` extend `Serializable` explicitly b86053c430e5 is described below commit b86053c430e5e1411467bdc1f0ddb337ca01649f Author: Dongjoon Hyun AuthorDate: Wed Jan 24 15:49:23 2024 -0800 [SPARK-46846][CORE] Make `WorkerResourceInfo` extend `Serializable` explicitly ### What changes were proposed in this pull request? This PR aims to make `WorkerResourceInfo` extend `Serializable` interface explicitly. - https://docs.oracle.com/en/java/javase/17/docs/specs/serialization/serial-arch.html > A Serializable class must do the following: > - Implement the `java.io.Serializable` interface ### Why are the changes needed? `WorkerInfo` extends `Serializable` and has `WorkerResourceInfo` as data. https://github.com/apache/spark/blob/1f23edfa84aa3318791d5fbbbae22d479a49134a/core/src/main/scala/org/apache/spark/deploy/master/WorkerInfo.scala#L49-L58 `WorkerInfo` itself has no data field, but inherits `ResourceAllocator` which has data. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44873 from dongjoon-hyun/SPARK-46846. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/deploy/master/WorkerInfo.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/master/WorkerInfo.scala b/core/src/main/scala/org/apache/spark/deploy/master/WorkerInfo.scala index bdc9e9c6106c..a20adcbddc24 100644 --- a/core/src/main/scala/org/apache/spark/deploy/master/WorkerInfo.scala +++ b/core/src/main/scala/org/apache/spark/deploy/master/WorkerInfo.scala @@ -25,7 +25,7 @@ import org.apache.spark.rpc.RpcEndpointRef import org.apache.spark.util.Utils private[spark] case class WorkerResourceInfo(name: String, addresses: Seq[String]) - extends ResourceAllocator { + extends Serializable with ResourceAllocator { override protected def resourceName = this.name override protected def resourceAddresses = this.addresses - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
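The rule this fix enforces is easy to demonstrate outside the JVM: a persistence engine serializes the whole worker object graph, so state inherited from a mixin must survive the round trip too. A minimal Python sketch, using `pickle` as a stand-in for Java serialization and illustrative class names mirroring the Scala ones:

```python
import pickle

class ResourceAllocator:
    """Stand-in for the Scala trait: it declares no constructor fields
    but still carries mutable allocation state that must be serialized."""
    def __init__(self):
        self.addresses_allocated = {}

class WorkerResourceInfo(ResourceAllocator):
    def __init__(self, name, addresses):
        super().__init__()
        self.name = name
        self.addresses = addresses

info = WorkerResourceInfo("gpu", ["0", "1"])
info.addresses_allocated["0"] = 1

# Round-trip through the serializer, as the master's persistence engine does.
restored = pickle.loads(pickle.dumps(info))
```

Unlike `pickle`, Java serialization refuses to touch a class that does not implement `java.io.Serializable`, which is why the explicit `extends Serializable` matters even though the enclosing `WorkerInfo` already declares it.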
(spark) branch master updated: [SPARK-46827][CORE] Make `RocksDBPersistenceEngine` to support a symbolic link
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1f23edfa84aa [SPARK-46827][CORE] Make `RocksDBPersistenceEngine` to support a symbolic link 1f23edfa84aa is described below commit 1f23edfa84aa3318791d5fbbbae22d479a49134a Author: Dongjoon Hyun AuthorDate: Wed Jan 24 07:35:14 2024 -0800 [SPARK-46827][CORE] Make `RocksDBPersistenceEngine` to support a symbolic link ### What changes were proposed in this pull request? This PR aims to make `RocksDBPersistenceEngine` to support a symbolic link location. ### Why are the changes needed? To be consistent with `FileSystemPersistenceEngine` which supports symbolic link locations. https://github.com/apache/spark/blob/7004dd9edcad32d34d0448df9498d32c444ab082/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala#L45-L50 ### Does this PR introduce _any_ user-facing change? No. This is a new feature at 4.0.0. ### How was this patch tested? Pass the CIs with a newly added test case. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44867 from dongjoon-hyun/SPARK-46827. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../spark/deploy/master/RocksDBPersistenceEngine.scala| 9 +++-- .../spark/deploy/master/PersistenceEngineSuite.scala | 15 +++ 2 files changed, 22 insertions(+), 2 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/master/RocksDBPersistenceEngine.scala b/core/src/main/scala/org/apache/spark/deploy/master/RocksDBPersistenceEngine.scala index 5c43dab4d066..8364dbd693b1 100644 --- a/core/src/main/scala/org/apache/spark/deploy/master/RocksDBPersistenceEngine.scala +++ b/core/src/main/scala/org/apache/spark/deploy/master/RocksDBPersistenceEngine.scala @@ -19,7 +19,7 @@ package org.apache.spark.deploy.master import java.nio.ByteBuffer import java.nio.charset.StandardCharsets.UTF_8 -import java.nio.file.{Files, Paths} +import java.nio.file.{FileAlreadyExistsException, Files, Paths} import scala.collection.mutable.ArrayBuffer import scala.reflect.ClassTag @@ -43,7 +43,12 @@ private[master] class RocksDBPersistenceEngine( RocksDB.loadLibrary() - private val path = Files.createDirectories(Paths.get(dir)) + private val path = try { +Files.createDirectories(Paths.get(dir)) + } catch { +case _: FileAlreadyExistsException if Files.isSymbolicLink(Paths.get(dir)) => + Files.createDirectories(Paths.get(dir).toRealPath()) + } /** * Use full filter. 
diff --git a/core/src/test/scala/org/apache/spark/deploy/master/PersistenceEngineSuite.scala b/core/src/test/scala/org/apache/spark/deploy/master/PersistenceEngineSuite.scala index b977a1142444..01b7e46eb2a8 100644 --- a/core/src/test/scala/org/apache/spark/deploy/master/PersistenceEngineSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/master/PersistenceEngineSuite.scala @@ -88,6 +88,21 @@ class PersistenceEngineSuite extends SparkFunSuite { } } + test("SPARK-46827: RocksDBPersistenceEngine with a symbolic link") { +withTempDir { dir => + val target = Paths.get(dir.getAbsolutePath(), "target") + val link = Paths.get(dir.getAbsolutePath(), "symbolic_link"); + + Files.createDirectories(target) + Files.createSymbolicLink(link, target); + + val conf = new SparkConf() + testPersistenceEngine(conf, serializer => +new RocksDBPersistenceEngine(link.toAbsolutePath.toString, serializer) + ) +} + } + test("SPARK-46205: Support KryoSerializer in FileSystemPersistenceEngine") { withTempDir { dir => val conf = new SparkConf() - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
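The symlink-tolerant directory creation above can be sketched in Python: `Files.createDirectories` succeeds on an existing real directory but throws `FileAlreadyExistsException` when the path is a symbolic link, so the fix resolves the link to its target first. The function name below is illustrative, not Spark API:

```python
import os
import tempfile

def ensure_dir(path: str) -> str:
    # Mirror the fix: if the path is a symbolic link, create (or accept)
    # the directory it points to instead of failing on the link itself.
    if os.path.islink(path):
        path = os.path.realpath(path)
    os.makedirs(path, exist_ok=True)
    return path

# Same setup as the new PersistenceEngineSuite test: a real "target"
# directory and a "symbolic_link" pointing at it.
base = tempfile.mkdtemp()
target = os.path.join(base, "target")
link = os.path.join(base, "symbolic_link")
os.makedirs(target)
os.symlink(target, link)

resolved = ensure_dir(link)  # resolves to the real "target" directory
```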
(spark) branch master updated: [SPARK-46823][CONNECT][PYTHON] `LocalDataToArrowConversion` should check the nullability
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1642e928478c [SPARK-46823][CONNECT][PYTHON] `LocalDataToArrowConversion` should check the nullability 1642e928478c is described below commit 1642e928478c8c20bae5203ecf2e4d659aca7692 Author: Ruifeng Zheng AuthorDate: Wed Jan 24 00:43:41 2024 -0800 [SPARK-46823][CONNECT][PYTHON] `LocalDataToArrowConversion` should check the nullability ### What changes were proposed in this pull request? `LocalDataToArrowConversion` should check the nullability ### Why are the changes needed? this check was missing ### Does this PR introduce _any_ user-facing change? yes ``` data = [("asd", None)] schema = StructType( [ StructField("name", StringType(), nullable=True), StructField("age", IntegerType(), nullable=False), ] ) ``` before: ``` In [3]: df = spark.createDataFrame([("asd", None)], schema) In [4]: df Out[4]: 24/01/24 12:08:28 ERROR ErrorUtils: Spark Connect RPC error during: analyze. UserId: ruifeng.zheng. SessionId: cd692bb1-d503-4043-a9db-d29cb5c16517. 
java.lang.IllegalStateException: Value at index is null at org.apache.arrow.vector.IntVector.get(IntVector.java:107) at org.apache.spark.sql.vectorized.ArrowColumnVector$IntAccessor.getInt(ArrowColumnVector.java:338) at org.apache.spark.sql.vectorized.ArrowColumnVector.getInt(ArrowColumnVector.java:88) at org.apache.spark.sql.vectorized.ColumnarBatchRow.getInt(ColumnarBatchRow.java:109) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at scala.collection.Iterator$$anon$9.next(Iterator.scala:584) at scala.collection.Iterator$$anon$9.next(Iterator.scala:584) at scala.collection.Iterator$$anon$9.next(Iterator.scala:584) at scala.collection.immutable.List.prependedAll(List.scala:153) at scala.collection.immutable.List$.from(List.scala:684) at scala.collection.immutable.List$.from(List.scala:681) at scala.collection.SeqFactory$Delegate.from(Factory.scala:306) at scala.collection.immutable.Seq$.from(Seq.scala:42) at scala.collection.IterableOnceOps.toSeq(IterableOnce.scala:1326) at scala.collection.IterableOnceOps.toSeq$(IterableOnce.scala:1326) at scala.collection.AbstractIterator.toSeq(Iterator.scala:1300) at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformLocalRelation(SparkConnectPlanner.scala:1239) at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:139) at org.apache.spark.sql.connect.service.SparkConnectAnalyzeHandler.process(SparkConnectAnalyzeHandler.scala:59) at org.apache.spark.sql.connect.service.SparkConnectAnalyzeHandler.$anonfun$handle$1(SparkConnectAnalyzeHandler.scala:43) at org.apache.spark.sql.connect.service.SparkConnectAnalyzeHandler.$anonfun$handle$1$adapted(SparkConnectAnalyzeHandler.scala:42) at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$2(SessionHolder.scala:289) at 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:918) at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$1(SessionHolder.scala:289) at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94) at org.apache.spark.sql.artifact.ArtifactManager.$anonfun$withResources$1(ArtifactManager.scala:80) at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:182) at org.apache.spark.sql.artifact.ArtifactManager.withResources(ArtifactManager.scala:79) at org.apache.spark.sql.connect.service.SessionHolder.withSession(SessionHolder.scala:288) at org.apache.spark.sql.connect.service.SparkConnectAnalyzeHandler.handle(SparkConnectAnalyzeHandler.scala:42) at org.apache.spark.sql.connect.service.SparkConnectService.analyzePlan(SparkConnectService.scala:95) at org.apache.spark.connect.proto.SparkConnectServiceGrpc$MethodHandlers.invoke(SparkConnectServiceGrpc.java:907) at org.sparkproject.connect.grpc.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182) at org.sparkproject.conn
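The missing check amounts to validating local rows against the declared schema on the client before Arrow conversion, so the error surfaces as a clear message instead of the server-side `IllegalStateException` above. A sketch with illustrative names (not the actual `LocalDataToArrowConversion` code); a field is modeled as a `(name, nullable)` pair:

```python
def check_nullability(rows, fields):
    """Raise early if a None appears in a field declared non-nullable."""
    for row in rows:
        for value, (name, nullable) in zip(row, fields):
            if value is None and not nullable:
                raise ValueError(f"field '{name}' is non-nullable but got None")

fields = [("name", True), ("age", False)]

# ("asd", None) violates the non-nullable "age" field and is rejected up front.
try:
    check_nullability([("asd", None)], fields)
    rejected = False
except ValueError:
    rejected = True
```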
(spark) branch master updated: [SPARK-46822][SQL] Respect spark.sql.legacy.charVarcharAsString when casting jdbc type to catalyst type in jdbc
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 0a470430c81c [SPARK-46822][SQL] Respect spark.sql.legacy.charVarcharAsString when casting jdbc type to catalyst type in jdbc 0a470430c81c is described below commit 0a470430c81ca2d46020f863c45e96227fbdd07c Author: Kent Yao AuthorDate: Tue Jan 23 22:57:02 2024 -0800 [SPARK-46822][SQL] Respect spark.sql.legacy.charVarcharAsString when casting jdbc type to catalyst type in jdbc ### What changes were proposed in this pull request? This PR makes `spark.sql.legacy.charVarcharAsString` be activated in `JdbcUtils.getCatalystType`. ### Why are the changes needed? For cases like CTAS, which respects schema from the query field can restore their behavior to create tables with strings instead of char/varchar. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #44860 from yaooqinn/SPARK-46822. 
Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- .../sql/execution/datasources/jdbc/JdbcUtils.scala | 2 ++ .../v2/jdbc/JDBCTableCatalogSuite.scala| 22 ++ 2 files changed, 24 insertions(+) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala index 9fb10f42164f..89ac615a3097 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala @@ -185,6 +185,7 @@ object JdbcUtils extends Logging with SQLConfHelper { case java.sql.Types.BIT => BooleanType // @see JdbcDialect for quirks case java.sql.Types.BLOB => BinaryType case java.sql.Types.BOOLEAN => BooleanType +case java.sql.Types.CHAR if conf.charVarcharAsString => StringType case java.sql.Types.CHAR => CharType(precision) case java.sql.Types.CLOB => StringType case java.sql.Types.DATE => DateType @@ -214,6 +215,7 @@ object JdbcUtils extends Logging with SQLConfHelper { case java.sql.Types.TIMESTAMP => TimestampType case java.sql.Types.TINYINT => IntegerType case java.sql.Types.VARBINARY => BinaryType +case java.sql.Types.VARCHAR if conf.charVarcharAsString => StringType case java.sql.Types.VARCHAR => VarcharType(precision) case _ => // For unmatched types: diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala index 5408d434fced..0088fab7d209 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala @@ -608,6 +608,28 @@ class JDBCTableCatalogSuite extends QueryTest with SharedSparkSession { } } + test("SPARK-46822: 
Respect charVarcharAsString when casting jdbc type to catalyst type in jdbc") { +try { + withConnection( +_.prepareStatement("""CREATE TABLE "test"."char_tbl" (ID CHAR(5), deptno VARCHAR(10))""") +.executeUpdate()) + withSQLConf(SQLConf.LEGACY_CHAR_VARCHAR_AS_STRING.key -> "true") { +val expected = new StructType() + .add("ID", StringType, true, defaultMetadata) + .add("DEPTNO", StringType, true, defaultMetadata) +assert(sql(s"SELECT * FROM h2.test.char_tbl").schema === expected) + } + val expected = new StructType() +.add("ID", CharType(5), true, defaultMetadata) +.add("DEPTNO", VarcharType(10), true, defaultMetadata) + val replaced = CharVarcharUtils.replaceCharVarcharWithStringInSchema(expected) + assert(sql(s"SELECT * FROM h2.test.char_tbl").schema === replaced) +} finally { + withConnection( +_.prepareStatement("""DROP TABLE IF EXISTS "test"."char_tbl"""").executeUpdate()) +} + } + test("SPARK-45449: Cache Invalidation Issue with JDBC Table") { withTable("h2.test.cache_t") { withConnection { conn => - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
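The two new `case` arms reduce to a simple rule: when the legacy flag is set, CHAR/VARCHAR columns surface as plain strings. An illustrative Python sketch of that mapping (the type names are simplified stand-ins for `java.sql.Types` codes and Catalyst types):

```python
def to_catalyst_type(jdbc_type: str, precision: int,
                     char_varchar_as_string: bool) -> str:
    # With the legacy flag on, both CHAR and VARCHAR collapse to "string",
    # restoring the pre-char/varchar behavior that CTAS callers relied on.
    if jdbc_type == "CHAR":
        return "string" if char_varchar_as_string else f"char({precision})"
    if jdbc_type == "VARCHAR":
        return "string" if char_varchar_as_string else f"varchar({precision})"
    raise NotImplementedError(jdbc_type)

# The H2 table from the new test: ID CHAR(5), DEPTNO VARCHAR(10).
legacy = [to_catalyst_type(t, p, True) for t, p in [("CHAR", 5), ("VARCHAR", 10)]]
strict = [to_catalyst_type(t, p, False) for t, p in [("CHAR", 5), ("VARCHAR", 10)]]
```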
(spark) branch branch-3.4 updated: [SPARK-46817][CORE] Fix `spark-daemon.sh` usage by adding `decommission` command
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new e56bd97c04c1 [SPARK-46817][CORE] Fix `spark-daemon.sh` usage by adding `decommission` command e56bd97c04c1 is described below commit e56bd97c04c184104046e51e6759e616c86683fa Author: Dongjoon Hyun AuthorDate: Tue Jan 23 16:38:45 2024 -0800 [SPARK-46817][CORE] Fix `spark-daemon.sh` usage by adding `decommission` command ### What changes were proposed in this pull request? This PR aims to fix `spark-daemon.sh` usage by adding `decommission` command. ### Why are the changes needed? This was missed when SPARK-20628 added `decommission` command at Apache Spark 3.1.0. The command has been used like the following. https://github.com/apache/spark/blob/0356ac00947282b1a0885ad7eaae1e25e43671fe/sbin/decommission-worker.sh#L41 ### Does this PR introduce _any_ user-facing change? No, this is only a change on usage message. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44856 from dongjoon-hyun/SPARK-46817. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit 00a92d328576c39b04cfd0fdd8a30c5a9bc37e36) Signed-off-by: Dongjoon Hyun --- sbin/spark-daemon.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sbin/spark-daemon.sh b/sbin/spark-daemon.sh index 3cfd5acfe2b5..28d205f03e0f 100755 --- a/sbin/spark-daemon.sh +++ b/sbin/spark-daemon.sh @@ -31,7 +31,7 @@ # SPARK_NO_DAEMONIZE If set, will run the proposed command in the foreground. It will not output a PID file. 
## -usage="Usage: spark-daemon.sh [--config <conf-dir>] (start|stop|submit|status) <spark-command> <spark-instance-number> <args...>" +usage="Usage: spark-daemon.sh [--config <conf-dir>] (start|stop|submit|decommission|status) <spark-command> <spark-instance-number> <args...>" # if no args specified, show usage if [ $# -le 1 ]; then
(spark) branch branch-3.5 updated: [SPARK-46817][CORE] Fix `spark-daemon.sh` usage by adding `decommission` command
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new be7f1e9979c3 [SPARK-46817][CORE] Fix `spark-daemon.sh` usage by adding `decommission` command be7f1e9979c3 is described below commit be7f1e9979c38b1358b0af2b358bacb0bd523c80 Author: Dongjoon Hyun AuthorDate: Tue Jan 23 16:38:45 2024 -0800 [SPARK-46817][CORE] Fix `spark-daemon.sh` usage by adding `decommission` command ### What changes were proposed in this pull request? This PR aims to fix `spark-daemon.sh` usage by adding `decommission` command. ### Why are the changes needed? This was missed when SPARK-20628 added `decommission` command at Apache Spark 3.1.0. The command has been used like the following. https://github.com/apache/spark/blob/0356ac00947282b1a0885ad7eaae1e25e43671fe/sbin/decommission-worker.sh#L41 ### Does this PR introduce _any_ user-facing change? No, this is only a change on usage message. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44856 from dongjoon-hyun/SPARK-46817. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit 00a92d328576c39b04cfd0fdd8a30c5a9bc37e36) Signed-off-by: Dongjoon Hyun --- sbin/spark-daemon.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sbin/spark-daemon.sh b/sbin/spark-daemon.sh index 3cfd5acfe2b5..28d205f03e0f 100755 --- a/sbin/spark-daemon.sh +++ b/sbin/spark-daemon.sh @@ -31,7 +31,7 @@ # SPARK_NO_DAEMONIZE If set, will run the proposed command in the foreground. It will not output a PID file. 
## -usage="Usage: spark-daemon.sh [--config <conf-dir>] (start|stop|submit|status) <spark-command> <spark-instance-number> <args...>" +usage="Usage: spark-daemon.sh [--config <conf-dir>] (start|stop|submit|decommission|status) <spark-command> <spark-instance-number> <args...>" # if no args specified, show usage if [ $# -le 1 ]; then
(spark) branch master updated: [SPARK-46817][CORE] Fix `spark-daemon.sh` usage by adding `decommission` command
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 00a92d328576 [SPARK-46817][CORE] Fix `spark-daemon.sh` usage by adding `decommission` command 00a92d328576 is described below commit 00a92d328576c39b04cfd0fdd8a30c5a9bc37e36 Author: Dongjoon Hyun AuthorDate: Tue Jan 23 16:38:45 2024 -0800 [SPARK-46817][CORE] Fix `spark-daemon.sh` usage by adding `decommission` command ### What changes were proposed in this pull request? This PR aims to fix `spark-daemon.sh` usage by adding `decommission` command. ### Why are the changes needed? This was missed when SPARK-20628 added `decommission` command at Apache Spark 3.1.0. The command has been used like the following. https://github.com/apache/spark/blob/0356ac00947282b1a0885ad7eaae1e25e43671fe/sbin/decommission-worker.sh#L41 ### Does this PR introduce _any_ user-facing change? No, this is only a change on usage message. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44856 from dongjoon-hyun/SPARK-46817. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- sbin/spark-daemon.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sbin/spark-daemon.sh b/sbin/spark-daemon.sh index 3cfd5acfe2b5..28d205f03e0f 100755 --- a/sbin/spark-daemon.sh +++ b/sbin/spark-daemon.sh @@ -31,7 +31,7 @@ # SPARK_NO_DAEMONIZE If set, will run the proposed command in the foreground. It will not output a PID file. ## -usage="Usage: spark-daemon.sh [--config <conf-dir>] (start|stop|submit|status) <spark-command> <spark-instance-number> <args...>" +usage="Usage: spark-daemon.sh [--config <conf-dir>] (start|stop|submit|decommission|status) <spark-command> <spark-instance-number> <args...>" # if no args specified, show usage if [ $# -le 1 ]; then
(spark) branch master updated: [SPARK-40876][SQL] Widening type promotion from integers to decimal in Parquet vectorized reader
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 0356ac009472 [SPARK-40876][SQL] Widening type promotion from integers to decimal in Parquet vectorized reader 0356ac009472 is described below commit 0356ac00947282b1a0885ad7eaae1e25e43671fe Author: Johan Lasperas AuthorDate: Tue Jan 23 12:37:18 2024 -0800 [SPARK-40876][SQL] Widening type promotion from integers to decimal in Parquet vectorized reader ### What changes were proposed in this pull request? This is a follow-up from https://github.com/apache/spark/pull/44368 and https://github.com/apache/spark/pull/44513, implementing an additional type promotion from integers to decimals in the parquet vectorized reader, bringing it at parity with the non-vectorized reader in that regard. ### Why are the changes needed? This allows reading parquet files that have different schemas and mix decimals and integers - e.g reading files containing either `Decimal(15, 2)` and `INT32` as `Decimal(15, 2)` - as long as the requested decimal type is large enough to accommodate the integer values without precision loss. ### Does this PR introduce _any_ user-facing change? Yes, the following now succeeds when using the vectorized Parquet reader: ``` Seq(20).toDF($"a".cast(IntegerType)).write.parquet(path) spark.read.schema("a decimal(12, 0)").parquet(path).collect() ``` It failed before with the vectorized reader and succeeded with the non-vectorized reader. ### How was this patch tested? - Tests added to `ParquetWideningTypeSuite` - Updated relevant `ParquetQuerySuite` test. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44803 from johanl-db/SPARK-40876-widening-promotion-int-to-decimal. 
Authored-by: Johan Lasperas Signed-off-by: Dongjoon Hyun --- .../parquet/ParquetVectorUpdaterFactory.java | 39 ++- .../parquet/VectorizedColumnReader.java| 7 +- .../datasources/parquet/ParquetQuerySuite.scala| 8 +- .../parquet/ParquetTypeWideningSuite.scala | 123 ++--- 4 files changed, 150 insertions(+), 27 deletions(-) diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java index 0d8713b58cec..f369688597b9 100644 --- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java +++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java @@ -1407,7 +1407,11 @@ public class ParquetVectorUpdaterFactory { super(sparkType); LogicalTypeAnnotation typeAnnotation = descriptor.getPrimitiveType().getLogicalTypeAnnotation(); - this.parquetScale = ((DecimalLogicalTypeAnnotation) typeAnnotation).getScale(); + if (typeAnnotation instanceof DecimalLogicalTypeAnnotation) { +this.parquetScale = ((DecimalLogicalTypeAnnotation) typeAnnotation).getScale(); + } else { +this.parquetScale = 0; + } } @Override @@ -1436,14 +1440,18 @@ public class ParquetVectorUpdaterFactory { } } -private static class LongToDecimalUpdater extends DecimalUpdater { + private static class LongToDecimalUpdater extends DecimalUpdater { private final int parquetScale; - LongToDecimalUpdater(ColumnDescriptor descriptor, DecimalType sparkType) { +LongToDecimalUpdater(ColumnDescriptor descriptor, DecimalType sparkType) { super(sparkType); LogicalTypeAnnotation typeAnnotation = descriptor.getPrimitiveType().getLogicalTypeAnnotation(); - this.parquetScale = ((DecimalLogicalTypeAnnotation) typeAnnotation).getScale(); + if (typeAnnotation instanceof DecimalLogicalTypeAnnotation) { +this.parquetScale = ((DecimalLogicalTypeAnnotation) 
typeAnnotation).getScale(); + } else { +this.parquetScale = 0; + } } @Override @@ -1641,6 +1649,12 @@ private static class FixedLenByteArrayToDecimalUpdater extends DecimalUpdater { return typeAnnotation instanceof DateLogicalTypeAnnotation; } + private static boolean isSignedIntAnnotation(LogicalTypeAnnotation typeAnnotation) { +if (!(typeAnnotation instanceof IntLogicalTypeAnnotation)) return false; +IntLogicalTypeAnnotation intAnnotation = (IntLogicalTypeAnnotation) typeAnnotation; +return intAnnotation.isSigned(); + } + private static boolean isDecimalTypeMatched(ColumnDescriptor descriptor, DataType dt) { DecimalType requestedType = (DecimalType) dt; LogicalTypeAnnot
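The widening itself is mechanically simple: an integer value v read as `Decimal(p, s)` becomes the unscaled value v·10^s, and the promotion is lossless only if that unscaled value fits in p digits. A Python sketch of the invariant (not the actual updater code; the function name is illustrative):

```python
from decimal import Decimal

def widen_int_to_decimal(value: int, precision: int, scale: int) -> Decimal:
    # Rescale the integer to the target scale and check that it still fits
    # within `precision` digits, i.e. the promotion loses no precision.
    unscaled = value * 10 ** scale
    if len(str(abs(unscaled))) > precision:
        raise OverflowError(f"{value} does not fit Decimal({precision}, {scale})")
    return Decimal(unscaled).scaleb(-scale)

# e.g. an INT32 column holding 20, read with schema "a decimal(12, 0)"
widened = widen_int_to_decimal(20, 12, 0)
```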
(spark) branch branch-3.4 updated: [SPARK-46794][SQL] Remove subqueries from LogicalRDD constraints
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 894faabbffb4 [SPARK-46794][SQL] Remove subqueries from LogicalRDD constraints 894faabbffb4 is described below commit 894faabbffb4a7075ade6b5e830d76aa4ae7542f Author: Tom van Bussel AuthorDate: Tue Jan 23 08:45:32 2024 -0800 [SPARK-46794][SQL] Remove subqueries from LogicalRDD constraints ### What changes were proposed in this pull request? This PR modifies `LogicalRDD` to filter out all subqueries from its `constraints`. ### Why are the changes needed? Fixes a correctness bug. Spark can produce incorrect results when using a checkpointed `DataFrame` with a filter containing a scalar subquery. This subquery is included in the constraints of the resulting `LogicalRDD`, and may then be propagated as a filter when joining with the checkpointed `DataFrame`. This causes the subquery to be evaluated twice: once during checkpointing and once while evaluating the query. These two subquery evaluations may return different results, e.g. when t [...] ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a test to `DataFrameSuite`. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44833 from tomvanbussel/SPARK-46794. 
Authored-by: Tom van Bussel Signed-off-by: Dongjoon Hyun (cherry picked from commit d26e871136e0c6e1f84a25978319733a516b7b2e) Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/sql/execution/ExistingRDD.scala | 7 +++ .../test/scala/org/apache/spark/sql/DataFrameSuite.scala | 16 +++- 2 files changed, 22 insertions(+), 1 deletion(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala index 3dcf0efaadd8..3b49abcb1a86 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala @@ -150,6 +150,13 @@ case class LogicalRDD( } override lazy val constraints: ExpressionSet = originConstraints.getOrElse(ExpressionSet()) +// Subqueries can have non-deterministic results even when they only contain deterministic +// expressions (e.g. consider a LIMIT 1 subquery without an ORDER BY). Propagating predicates +// containing a subquery causes the subquery to be executed twice (as the result of the subquery +// in the checkpoint computation cannot be reused), which could result in incorrect results. +// Therefore we assume that all subqueries are non-deterministic, and we do not expose any +// constraints that contain a subquery. 
+.filterNot(SubqueryExpression.hasSubquery) } object LogicalRDD extends Logging { diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala index 9ddb4abe98b2..a9f69ab28a17 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala @@ -35,7 +35,7 @@ import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd} import org.apache.spark.sql.catalyst.{InternalRow, TableIdentifier} import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder} -import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference, EqualTo, ExpressionSet, GreaterThan, Literal, PythonUDF, Uuid} +import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference, EqualTo, ExpressionSet, GreaterThan, Literal, PythonUDF, ScalarSubquery, Uuid} import org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, LeafNode, LocalRelation, LogicalPlan, OneRowRelation, Statistics} import org.apache.spark.sql.catalyst.util.DateTimeUtils @@ -2219,6 +2219,20 @@ class DataFrameSuite extends QueryTest assert(newConstraints === newExpectedConstraints) } + test("SPARK-46794: exclude subqueries from LogicalRDD constraints") { +withTempDir { checkpointDir => + val subquery = +new Column(ScalarSubquery(spark.range(10).selectExpr("max(id)").logicalPlan)) + val df = spark.range(1000).filter($"id" === subquery) + assert(df.logicalPlan.constraints.exists(_.exists(_.isInstanceOf[ScalarSubquery]))) + + spark.sparkContext.setCheckpointDir(checkpointDir.getAbsolutePath) + val checkpointedDf = df.checkpoint() + assert(!checkpointedDf.logicalPlan.constraints +.exists(_.exists(_.isInstanceOf[ScalarSubquery]))) +} + } + 
test("SPARK-10656: completely support special chars") { val df = Seq(1 -> "a").toDF("i_
(spark) branch branch-3.5 updated: [SPARK-46794][SQL] Remove subqueries from LogicalRDD constraints
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 05f7aa596c7b [SPARK-46794][SQL] Remove subqueries from LogicalRDD constraints 05f7aa596c7b is described below commit 05f7aa596c7b1c05704abfad94b1b1d3085c530e Author: Tom van Bussel AuthorDate: Tue Jan 23 08:45:32 2024 -0800 [SPARK-46794][SQL] Remove subqueries from LogicalRDD constraints ### What changes were proposed in this pull request? This PR modifies `LogicalRDD` to filter out all subqueries from its `constraints`. ### Why are the changes needed? Fixes a correctness bug. Spark can produce incorrect results when using a checkpointed `DataFrame` with a filter containing a scalar subquery. This subquery is included in the constraints of the resulting `LogicalRDD`, and may then be propagated as a filter when joining with the checkpointed `DataFrame`. This causes the subquery to be evaluated twice: once during checkpointing and once while evaluating the query. These two subquery evaluations may return different results, e.g. when t [...] ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a test to `DataFrameSuite`. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44833 from tomvanbussel/SPARK-46794. 
Authored-by: Tom van Bussel Signed-off-by: Dongjoon Hyun (cherry picked from commit d26e871136e0c6e1f84a25978319733a516b7b2e) Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/sql/execution/ExistingRDD.scala | 7 +++ .../test/scala/org/apache/spark/sql/DataFrameSuite.scala | 16 +++- 2 files changed, 22 insertions(+), 1 deletion(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala index 3dcf0efaadd8..3b49abcb1a86 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala @@ -150,6 +150,13 @@ case class LogicalRDD( } override lazy val constraints: ExpressionSet = originConstraints.getOrElse(ExpressionSet()) +// Subqueries can have non-deterministic results even when they only contain deterministic +// expressions (e.g. consider a LIMIT 1 subquery without an ORDER BY). Propagating predicates +// containing a subquery causes the subquery to be executed twice (as the result of the subquery +// in the checkpoint computation cannot be reused), which could result in incorrect results. +// Therefore we assume that all subqueries are non-deterministic, and we do not expose any +// constraints that contain a subquery. 
+.filterNot(SubqueryExpression.hasSubquery) } object LogicalRDD extends Logging { diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala index 2eba9f181098..002719f06896 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala @@ -35,7 +35,7 @@ import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd} import org.apache.spark.sql.catalyst.{InternalRow, TableIdentifier} import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder -import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference, EqualTo, ExpressionSet, GreaterThan, Literal, PythonUDF, Uuid} +import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference, EqualTo, ExpressionSet, GreaterThan, Literal, PythonUDF, ScalarSubquery, Uuid} import org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation import org.apache.spark.sql.catalyst.parser.ParseException import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, LeafNode, LocalRelation, LogicalPlan, OneRowRelation, Statistics} @@ -2258,6 +2258,20 @@ class DataFrameSuite extends QueryTest assert(newConstraints === newExpectedConstraints) } + test("SPARK-46794: exclude subqueries from LogicalRDD constraints") { +withTempDir { checkpointDir => + val subquery = +new Column(ScalarSubquery(spark.range(10).selectExpr("max(id)").logicalPlan)) + val df = spark.range(1000).filter($"id" === subquery) + assert(df.logicalPlan.constraints.exists(_.exists(_.isInstanceOf[ScalarSubquery]))) + + spark.sparkContext.setCheckpointDir(checkpointDir.getAbsolutePath) + val checkpointedDf = df.checkpoint() + assert(!checkpointedDf.logicalPlan.constraints +.exists(_.exists(_.isInstanceOf[ScalarSubquery]))) +} + } + test("SPARK-10656: 
completely support special chars") { val df = Seq(1 -> "a").toDF("i_
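The essence of the SPARK-46794 fix can be sketched outside Spark. The classes below are simplified stand-ins for Catalyst's expression tree, not Spark's real `SubqueryExpression`/`ExpressionSet` API: any predicate whose tree contains a subquery is treated as non-deterministic and dropped from the constraints a checkpointed plan exposes, so it can never be re-executed as a propagated filter.

```python
# Simplified stand-ins for Catalyst expression nodes; NOT Spark's real classes.
class Expression:
    def __init__(self, children=()):
        self.children = list(children)

class Attribute(Expression):
    pass

class EqualTo(Expression):
    pass

class ScalarSubquery(Expression):
    pass

def has_subquery(expr):
    # Analogue of SubqueryExpression.hasSubquery: does any node in the
    # expression tree contain a subquery?
    if isinstance(expr, ScalarSubquery):
        return True
    return any(has_subquery(c) for c in expr.children)

def checkpointed_constraints(constraints):
    # Analogue of the fixed LogicalRDD.constraints: drop every predicate that
    # contains a subquery, since re-evaluating the subquery after the
    # checkpoint may produce a different result than the checkpoint saw.
    return [c for c in constraints if not has_subquery(c)]

plain = EqualTo([Attribute(), Attribute()])
with_sub = EqualTo([Attribute(), ScalarSubquery()])
kept = checkpointed_constraints([plain, with_sub])  # only `plain` survives
```

This mirrors the added test: before the fix the checkpointed plan's constraints contained a `ScalarSubquery`; after the fix they do not.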
(spark) branch master updated (8ab69992584a -> d26e871136e0)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 8ab69992584a [SPARK-46772][SQL][TESTS] Benchmarking Avro with Compression Codecs add d26e871136e0 [SPARK-46794][SQL] Remove subqueries from LogicalRDD constraints No new revisions were added by this update. Summary of changes: .../org/apache/spark/sql/execution/ExistingRDD.scala | 7 +++ .../test/scala/org/apache/spark/sql/DataFrameSuite.scala | 16 +++- 2 files changed, 22 insertions(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46772][SQL][TESTS] Benchmarking Avro with Compression Codecs
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8ab69992584a [SPARK-46772][SQL][TESTS] Benchmarking Avro with Compression Codecs 8ab69992584a is described below commit 8ab69992584aa68e882c4a4aa4863049e6a58e7e Author: Kent Yao AuthorDate: Tue Jan 23 08:06:21 2024 -0800 [SPARK-46772][SQL][TESTS] Benchmarking Avro with Compression Codecs ### What changes were proposed in this pull request? This PR improves AvroWriteBenchmark by adding benchmarks with codec and their extra functionalities. - Avro compression with different codec - Avro deflate/xz/zstandard with different levels - buffer pool if zstandard ### Why are the changes needed? performance observation. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? connector/avro/src/test/scala/org/apache/spark/sql/execution/benchmark/AvroWriteBenchmark.scala ### Was this patch authored or co-authored using generative AI tooling? no Closes #44849 from yaooqinn/SPARK-46772. 
Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- .../AvroWriteBenchmark-jdk21-results.txt | 58 ++ .../avro/benchmarks/AvroWriteBenchmark-results.txt | 58 ++ .../execution/benchmark/AvroWriteBenchmark.scala | 52 --- 3 files changed, 143 insertions(+), 25 deletions(-) diff --git a/connector/avro/benchmarks/AvroWriteBenchmark-jdk21-results.txt b/connector/avro/benchmarks/AvroWriteBenchmark-jdk21-results.txt index f3e1dfa39829..86c6b6647f2f 100644 --- a/connector/avro/benchmarks/AvroWriteBenchmark-jdk21-results.txt +++ b/connector/avro/benchmarks/AvroWriteBenchmark-jdk21-results.txt @@ -1,16 +1,56 @@ -OpenJDK 64-Bit Server VM 21.0.1+12-LTS on Linux 5.15.0-1053-azure +OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 5.15.0-1053-azure AMD EPYC 7763 64-Core Processor Avro writer benchmark:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -Output Single Int Column 1389 1404 21 11.3 88.3 1.0X -Output Single Double Column1522 1523 1 10.3 96.8 0.9X -Output Int and String Column 3398 3400 3 4.6 216.0 0.4X -Output Partitions 2855 2874 27 5.5 181.5 0.5X -Output Buckets 3857 3903 66 4.1 245.2 0.4X +Output Single Int Column 1433 1505 101 11.0 91.1 1.0X +Output Single Double Column1467 1487 28 10.7 93.3 1.0X +Output Int and String Column 3187 3203 23 4.9 202.6 0.4X +Output Partitions 2759 2796 52 5.7 175.4 0.5X +Output Buckets 3760 3767 9 4.2 239.1 0.4X -OpenJDK 64-Bit Server VM 21.0.1+12-LTS on Linux 5.15.0-1053-azure +OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 5.15.0-1053-azure AMD EPYC 7763 64-Core Processor -Write wide rows into 20 files:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative +Avro compression with different codec:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -Write wide rows 22729 22774 63 0.0 45458.0 1.0X +BZIP2: 116001 116248 349 0.0 1160008.1 1.0X +DEFLATE: 6867 6870 4 0.0 68672.5 16.9X +UNCOMPRESSED: 5339 5354 21 0.0 53388.4 21.7X +SNAPPY:5077 5096 28 0.0 50769.3 22.8X +XZ: 61387 61501 161 0.0 
613871.9 1.9X +ZSTANDARD: 5333 5349 23
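The shape of the new codec benchmark can be approximated with Python's standard-library codecs standing in for Avro's (zlib for deflate, bz2 for bzip2, lzma for xz); the payload and the `best_time_ms` helper are illustrative, not the benchmark's actual harness.

```python
import bz2
import lzma
import time
import zlib

PAYLOAD = b"spark avro write benchmark " * 2000

# Stdlib stand-ins for Avro's codecs; UNCOMPRESSED is the identity baseline.
CODECS = {
    "UNCOMPRESSED": lambda data: data,
    "DEFLATE": lambda data: zlib.compress(data, 6),
    "BZIP2": bz2.compress,
    "XZ": lzma.compress,
}

def best_time_ms(fn, data, runs=3):
    # Best-of-N wall time in milliseconds, mirroring the "Best Time(ms)" column.
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        fn(data)
        best = min(best, time.perf_counter() - start)
    return best * 1000.0

results = {name: best_time_ms(fn, PAYLOAD) for name, fn in CODECS.items()}
```

As in the checked-in results, the heavier codecs (bzip2, xz) trade much longer write times for better compression than deflate or no compression.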
(spark) branch master updated: [SPARK-46718][BUILD] Upgrade Arrow to 15.0.0
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8a7609f1cb2d [SPARK-46718][BUILD] Upgrade Arrow to 15.0.0 8a7609f1cb2d is described below commit 8a7609f1cb2dd92ee30ec8172a1c1501d5810dae Author: yangjie01 AuthorDate: Mon Jan 22 21:25:21 2024 -0800 [SPARK-46718][BUILD] Upgrade Arrow to 15.0.0 ### What changes were proposed in this pull request? This pr aims to upgrade Arrow from 14.0.2 to 15.0.0, this version fixes the compatibility issue with Netty 4.1.104.Final(GH-39265). Additionally, since the `arrow-vector` module uses `eclipse-collections` to replace `netty-common` as a compile-level dependency, Apache Spark has added a dependency on `eclipse-collections` after upgrading to use Arrow 15.0.0. ### Why are the changes needed? The new version brings the following major changes: Bug Fixes GH-34610 - [Java] Fix valueCount and field name when loading/transferring NullVector GH-38242 - [Java] Fix incorrect internal struct accounting for DenseUnionVector#getBufferSizeFor GH-38254 - [Java] Add reusable buffer getters to char/binary vectors GH-38366 - [Java] Fix Murmur hash on buffers less than 4 bytes GH-38387 - [Java] Fix JDK8 compilation issue with TestAllTypes GH-38614 - [Java] Add VarBinary and VarCharWriter helper methods to more writers GH-38725 - [Java] decompression in Lz4CompressionCodec.java does not set writer index New Features and Improvements GH-38511 - [Java] Add getTransferPair(Field, BufferAllocator, CallBack) for StructVector and MapVector GH-14936 - [Java] Remove netty dependency from arrow-vector GH-38990 - [Java] Upgrade to flatc version 23.5.26 GH-39265 - [Java] Make it run well with the netty newest version 4.1.104 The full release notes as follows: - https://arrow.apache.org/release/15.0.0.html ### Does this PR introduce _any_ user-facing change? 
No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #44797 from LuciferYang/SPARK-46718. Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- dev/deps/spark-deps-hadoop-3-hive-2.3 | 12 +++- pom.xml | 2 +- 2 files changed, 8 insertions(+), 6 deletions(-) diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index 6220626069af..4ee0f5a41191 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -16,10 +16,10 @@ antlr4-runtime/4.13.1//antlr4-runtime-4.13.1.jar aopalliance-repackaged/2.6.1//aopalliance-repackaged-2.6.1.jar arpack/3.0.3//arpack-3.0.3.jar arpack_combined_all/0.1//arpack_combined_all-0.1.jar -arrow-format/14.0.2//arrow-format-14.0.2.jar -arrow-memory-core/14.0.2//arrow-memory-core-14.0.2.jar -arrow-memory-netty/14.0.2//arrow-memory-netty-14.0.2.jar -arrow-vector/14.0.2//arrow-vector-14.0.2.jar +arrow-format/15.0.0//arrow-format-15.0.0.jar +arrow-memory-core/15.0.0//arrow-memory-core-15.0.0.jar +arrow-memory-netty/15.0.0//arrow-memory-netty-15.0.0.jar +arrow-vector/15.0.0//arrow-vector-15.0.0.jar audience-annotations/0.12.0//audience-annotations-0.12.0.jar avro-ipc/1.11.3//avro-ipc-1.11.3.jar avro-mapred/1.11.3//avro-mapred-1.11.3.jar @@ -63,7 +63,9 @@ derby/10.16.1.1//derby-10.16.1.1.jar derbyshared/10.16.1.1//derbyshared-10.16.1.1.jar derbytools/10.16.1.1//derbytools-10.16.1.1.jar dropwizard-metrics-hadoop-metrics2-reporter/0.1.2//dropwizard-metrics-hadoop-metrics2-reporter-0.1.2.jar -flatbuffers-java/1.12.0//flatbuffers-java-1.12.0.jar +eclipse-collections-api/11.1.0//eclipse-collections-api-11.1.0.jar +eclipse-collections/11.1.0//eclipse-collections-11.1.0.jar +flatbuffers-java/23.5.26//flatbuffers-java-23.5.26.jar gcs-connector/hadoop3-2.2.18/shaded/gcs-connector-hadoop3-2.2.18-shaded.jar gmetric4j/1.0.10//gmetric4j-1.0.10.jar gson/2.2.4//gson-2.2.4.jar diff --git a/pom.xml b/pom.xml 
index e290273543c6..5f33dd7d8ebc 100644 --- a/pom.xml +++ b/pom.xml @@ -230,7 +230,7 @@ If you are changing Arrow version specification, please check ./python/pyspark/sql/pandas/utils.py, and ./python/setup.py too. --> -14.0.2 +15.0.0 2.5.11
(spark) branch master updated: [SPARK-46805][BUILD] Upgrade `scalafmt` to 3.7.17
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 34beb02827ff [SPARK-46805][BUILD] Upgrade `scalafmt` to 3.7.17 34beb02827ff is described below commit 34beb02827ffe14e3ed0407bed3f434098340ce4 Author: panbingkun AuthorDate: Mon Jan 22 21:23:10 2024 -0800 [SPARK-46805][BUILD] Upgrade `scalafmt` to 3.7.17 ### What changes were proposed in this pull request? The pr aims to upgrade `scalafmt` from `3.7.13` to `3.7.17`. ### Why are the changes needed? - Regular upgrade, the last upgrade occurred 5 months ago. - The full release notes: https://github.com/scalameta/scalafmt/releases/tag/v3.7.17 https://github.com/scalameta/scalafmt/releases/tag/v3.7.16 https://github.com/scalameta/scalafmt/releases/tag/v3.7.15 https://github.com/scalameta/scalafmt/releases/tag/v3.7.14 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. - Manually test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44845 from panbingkun/SPARK-46805. Authored-by: panbingkun Signed-off-by: Dongjoon Hyun --- dev/.scalafmt.conf | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dev/.scalafmt.conf b/dev/.scalafmt.conf index 721dec289900..b3a43a03651a 100644 --- a/dev/.scalafmt.conf +++ b/dev/.scalafmt.conf @@ -32,4 +32,4 @@ fileOverride { runner.dialect = scala213 } } -version = 3.7.13 +version = 3.7.17
(spark) branch master updated (00a9b94d1827 -> 31ffb2d99900)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 00a9b94d1827 [SPARK-46804][DOCS][TESTS] Recover the generated documents add 31ffb2d99900 [SPARK-46800][CORE] Support `spark.deploy.spreadOutDrivers` No new revisions were added by this update. Summary of changes: .../org/apache/spark/deploy/master/Master.scala| 66 ++ .../org/apache/spark/internal/config/Deploy.scala | 5 ++ .../apache/spark/deploy/master/MasterSuite.scala | 38 + docs/spark-standalone.md | 10 4 files changed, 96 insertions(+), 23 deletions(-)
(spark) branch master updated: [SPARK-46804][DOCS][TESTS] Recover the generated documents
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 00a9b94d1827 [SPARK-46804][DOCS][TESTS] Recover the generated documents 00a9b94d1827 is described below commit 00a9b94d18279cc75259c46b67cbb3da0078327b Author: Dongjoon Hyun AuthorDate: Mon Jan 22 17:57:05 2024 -0800 [SPARK-46804][DOCS][TESTS] Recover the generated documents ### What changes were proposed in this pull request? This PR regenerated the documents with the following. ``` SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "core/testOnly *SparkThrowableSuite -- -t \"Error classes match with document\"" ``` ### Why are the changes needed? The following PR broke CIs by manually fixing the generated docs. - #44825 Currently, CI is broken like the following. - https://github.com/apache/spark/actions/runs/7619269448/job/20752056653 - https://github.com/apache/spark/actions/runs/7619199659/job/20751858197 ``` [info] - Error classes match with document *** FAILED *** (24 milliseconds) [info] "...lstates.html#class-0[A]-feature-not-support..." did not equal "...lstates.html#class-0[a]-feature-not-support..." The error class document is not up to date. Please regenerate it. (SparkThrowableSuite.scala:322) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually check. ``` $ build/sbt "core/testOnly *.SparkThrowableSuite" ... 
[info] SparkThrowableSuite: [info] - No duplicate error classes (31 milliseconds) [info] - Error classes are correctly formatted (47 milliseconds) [info] - SQLSTATE is mandatory (2 milliseconds) [info] - SQLSTATE invariants (26 milliseconds) [info] - Message invariants (8 milliseconds) [info] - Message format invariants (7 milliseconds) [info] - Error classes match with document (65 milliseconds) [info] - Round trip (28 milliseconds) [info] - Error class names should contain only capital letters, numbers and underscores (7 milliseconds) [info] - Check if error class is missing (15 milliseconds) [info] - Check if message parameters match message format (4 milliseconds) [info] - Error message is formatted (1 millisecond) [info] - Error message does not do substitution on values (1 millisecond) [info] - Try catching legacy SparkError (0 milliseconds) [info] - Try catching SparkError with error class (1 millisecond) [info] - Try catching internal SparkError (0 milliseconds) [info] - Get message in the specified format (6 milliseconds) [info] - overwrite error classes (61 milliseconds) [info] - prohibit dots in error class names (23 milliseconds) [info] Run completed in 1 second, 357 milliseconds. [info] Total number of tests run: 19 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 19, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44843 from dongjoon-hyun/SPARK-46804. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- ...r-conditions-cannot-update-field-error-class.md | 4 +- ...ions-insufficient-table-property-error-class.md | 4 +- ...-internal-error-metadata-catalog-error-class.md | 4 +- ...-error-conditions-invalid-cursor-error-class.md | 4 +- ...-error-conditions-invalid-handle-error-class.md | 4 +- ...or-conditions-missing-attributes-error-class.md | 6 +- ...ns-not-supported-in-jdbc-catalog-error-class.md | 4 +- ...-conditions-unsupported-add-file-error-class.md | 4 +- ...itions-unsupported-default-value-error-class.md | 4 +- ...ditions-unsupported-deserializer-error-class.md | 4 +- ...r-conditions-unsupported-feature-error-class.md | 4 +- ...conditions-unsupported-save-mode-error-class.md | 4 +- ...ted-subquery-expression-category-error-class.md | 4 +- docs/sql-error-conditions.md | 102 ++--- 14 files changed, 92 insertions(+), 64 deletions(-) diff --git a/docs/sql-error-conditions-cannot-update-field-error-class.md b/docs/sql-error-conditions-cannot-update-field-error-class.md index 3d7152e499c9..42f952a403be 100644 --- a/docs/sql-error-conditions-cannot-update-field-error-class.md +++ b/docs/sql-error-conditions-cannot-update-field-error-class.md @@ -19,7 +19,7 @@ license: | limitations under the License. --- -[SQLSTATE: 0A000](sql-error-conditions-sqlstates.html#class-0a-feature-not-supported) +[SQLSTATE: 0A000](sql-error-conditions-sqlstates.html#class-0A-feature-not-supported) Cannot update `` field `` type: @@ -44
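The `SPARK_GENERATE_GOLDEN_FILES=1` workflow above follows the usual golden-file test pattern, sketched here with a hypothetical `check_golden` helper (not `SparkThrowableSuite`'s actual code): with the environment variable set, the expected document is rewritten in place; without it, the generated text must match the checked-in copy, which is how a manually edited doc fails the test.

```python
import os
import tempfile
from pathlib import Path

def check_golden(golden_path, generated):
    # With SPARK_GENERATE_GOLDEN_FILES=1 the expected output is regenerated;
    # otherwise the generated text must equal the checked-in golden file.
    if os.environ.get("SPARK_GENERATE_GOLDEN_FILES") == "1":
        golden_path.write_text(generated)
        return True
    return golden_path.read_text() == generated

with tempfile.TemporaryDirectory() as d:
    golden = Path(d) / "sql-error-conditions.md"
    os.environ["SPARK_GENERATE_GOLDEN_FILES"] = "1"
    assert check_golden(golden, "class-0A-feature-not-supported\n")  # regenerate
    del os.environ["SPARK_GENERATE_GOLDEN_FILES"]
    assert check_golden(golden, "class-0A-feature-not-supported\n")  # up to date
    assert not check_golden(golden, "class-0a-feature-not-supported\n")  # stale
```

The final assertion is exactly the failure mode in the CI logs: a lowercase `0a` anchor against a regenerated uppercase `0A`.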
(spark) branch master updated: [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 52b62921cadb [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script 52b62921cadb is described below commit 52b62921cadb05da5b1183f979edf7d608256f2e Author: Hyukjin Kwon AuthorDate: Mon Jan 22 17:06:59 2024 -0800 [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script ### What changes were proposed in this pull request? This PR proposes to avoid treating the exit code 5 as a test failure in Python testing script. ### Why are the changes needed? ``` ... Running PySpark tests Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log Will test against the following Python executables: ['python3.12'] Will test the following Python modules: ['pyspark-core', 'pyspark-streaming', 'pyspark-errors'] python3.12 python_implementation is CPython python3.12 version is: Python 3.12.1 Starting test(python3.12): pyspark.streaming.tests.test_context (temp output: /__w/spark/spark/python/target/8674ed86-36bd-47d1-863b-abb0405557f6/python3.12__pyspark.streaming.tests.test_context__umu69c3v.log) Finished test(python3.12): pyspark.streaming.tests.test_context (12s) Starting test(python3.12): pyspark.streaming.tests.test_dstream (temp output: /__w/spark/spark/python/target/847eb56b-3c5f-49ab-8a83-3326bb96bc5d/python3.12__pyspark.streaming.tests.test_dstream__rorhk0lc.log) Finished test(python3.12): pyspark.streaming.tests.test_dstream (102s) Starting test(python3.12): pyspark.streaming.tests.test_kinesis (temp output: /__w/spark/spark/python/target/78f23c83-c24d-4fa1-abbd-edb90f48dff1/python3.12__pyspark.streaming.tests.test_kinesis__q5l1pv0h.log) test_kinesis_stream (pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream) ... 
skipped "Skipping all Kinesis Python tests as environmental variable 'ENABLE_KINESIS_TESTS' was not set." test_kinesis_stream_api (pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream_api) ... skipped "Skipping all Kinesis Python tests as environmental variable 'ENABLE_KINESIS_TESTS' was not set." -- Ran 0 tests in 0.000s NO TESTS RAN (skipped=2) Had test failures in pyspark.streaming.tests.test_kinesis with python3.12; see logs. Error: running /__w/spark/spark/python/run-tests --modules=pyspark-core,pyspark-streaming,pyspark-errors --parallelism=1 --python-executables=python3.12 ; received return code 255 Error: Process completed with exit code 19. ``` Scheduled job fails because of exit 5, see https://github.com/pytest-dev/pytest/issues/2393. This isn't a test failure. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually tested. ### Was this patch authored or co-authored using generative AI tooling? No, Closes #44841 from HyukjinKwon/SPARK-46801. Authored-by: Hyukjin Kwon Signed-off-by: Dongjoon Hyun --- python/run-tests.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/python/run-tests.py b/python/run-tests.py index 97fbf9be320b..4cd3569efce3 100755 --- a/python/run-tests.py +++ b/python/run-tests.py @@ -147,8 +147,8 @@ def run_individual_python_test(target_dir, test_name, pyspark_python, keep_test_ # this code is invoked from a thread other than the main thread. os._exit(1) duration = time.time() - start_time -# Exit on the first failure. -if retcode != 0: +# Exit on the first failure but exclude the code 5 for no test ran, see SPARK-46801. +if retcode != 0 and retcode != 5: try: with FAILURE_REPORTING_LOCK: with open(LOG_FILE, 'ab') as log_file:
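The one-line change to `run-tests.py` amounts to a predicate over subprocess return codes. Exit code 5 is pytest's "no tests were collected" status (the linked pytest issue), which the script previously conflated with a failure:

```python
# pytest exit code 5 means "no tests were collected"; after SPARK-46801 the
# test runner no longer aborts on it.
NO_TESTS_COLLECTED = 5

def is_test_failure(retcode):
    # Mirrors the fixed condition in run_individual_python_test.
    return retcode != 0 and retcode != NO_TESTS_COLLECTED
```

With this predicate, a module whose tests are all skipped (e.g. the Kinesis suite without `ENABLE_KINESIS_TESTS`) no longer fails the scheduled job.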
(spark) branch branch-3.4 updated: [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 2621882da3ef [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script 2621882da3ef is described below commit 2621882da3effe2c9e0b3aedbcb26942e165a09f Author: Hyukjin Kwon AuthorDate: Mon Jan 22 17:06:59 2024 -0800 [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script ### What changes were proposed in this pull request? This PR proposes to avoid treating the exit code 5 as a test failure in Python testing script. ### Why are the changes needed? ``` ... Running PySpark tests Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log Will test against the following Python executables: ['python3.12'] Will test the following Python modules: ['pyspark-core', 'pyspark-streaming', 'pyspark-errors'] python3.12 python_implementation is CPython python3.12 version is: Python 3.12.1 Starting test(python3.12): pyspark.streaming.tests.test_context (temp output: /__w/spark/spark/python/target/8674ed86-36bd-47d1-863b-abb0405557f6/python3.12__pyspark.streaming.tests.test_context__umu69c3v.log) Finished test(python3.12): pyspark.streaming.tests.test_context (12s) Starting test(python3.12): pyspark.streaming.tests.test_dstream (temp output: /__w/spark/spark/python/target/847eb56b-3c5f-49ab-8a83-3326bb96bc5d/python3.12__pyspark.streaming.tests.test_dstream__rorhk0lc.log) Finished test(python3.12): pyspark.streaming.tests.test_dstream (102s) Starting test(python3.12): pyspark.streaming.tests.test_kinesis (temp output: /__w/spark/spark/python/target/78f23c83-c24d-4fa1-abbd-edb90f48dff1/python3.12__pyspark.streaming.tests.test_kinesis__q5l1pv0h.log) test_kinesis_stream 
(pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream) ... skipped "Skipping all Kinesis Python tests as environmental variable 'ENABLE_KINESIS_TESTS' was not set." test_kinesis_stream_api (pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream_api) ... skipped "Skipping all Kinesis Python tests as environmental variable 'ENABLE_KINESIS_TESTS' was not set." -- Ran 0 tests in 0.000s NO TESTS RAN (skipped=2) Had test failures in pyspark.streaming.tests.test_kinesis with python3.12; see logs. Error: running /__w/spark/spark/python/run-tests --modules=pyspark-core,pyspark-streaming,pyspark-errors --parallelism=1 --python-executables=python3.12 ; received return code 255 Error: Process completed with exit code 19. ``` Scheduled job fails because of exit 5, see https://github.com/pytest-dev/pytest/issues/2393. This isn't a test failure. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually tested. ### Was this patch authored or co-authored using generative AI tooling? No, Closes #44841 from HyukjinKwon/SPARK-46801. Authored-by: Hyukjin Kwon Signed-off-by: Dongjoon Hyun (cherry picked from commit 52b62921cadb05da5b1183f979edf7d608256f2e) Signed-off-by: Dongjoon Hyun --- python/run-tests.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/python/run-tests.py b/python/run-tests.py index 19e39c822cbb..b9031765d943 100755 --- a/python/run-tests.py +++ b/python/run-tests.py @@ -147,8 +147,8 @@ def run_individual_python_test(target_dir, test_name, pyspark_python, keep_test_ # this code is invoked from a thread other than the main thread. os._exit(1) duration = time.time() - start_time -# Exit on the first failure. -if retcode != 0: +# Exit on the first failure but exclude the code 5 for no test ran, see SPARK-46801. 
+if retcode != 0 and retcode != 5: try: with FAILURE_REPORTING_LOCK: with open(LOG_FILE, 'ab') as log_file:
(spark) branch branch-3.5 updated: [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new a6869b25fb9a [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script a6869b25fb9a is described below commit a6869b25fb9a7ac0e7e5015d342435e5c1b5f044 Author: Hyukjin Kwon AuthorDate: Mon Jan 22 17:06:59 2024 -0800 [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script ### What changes were proposed in this pull request? This PR proposes to avoid treating the exit code 5 as a test failure in Python testing script. ### Why are the changes needed? ``` ... Running PySpark tests Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log Will test against the following Python executables: ['python3.12'] Will test the following Python modules: ['pyspark-core', 'pyspark-streaming', 'pyspark-errors'] python3.12 python_implementation is CPython python3.12 version is: Python 3.12.1 Starting test(python3.12): pyspark.streaming.tests.test_context (temp output: /__w/spark/spark/python/target/8674ed86-36bd-47d1-863b-abb0405557f6/python3.12__pyspark.streaming.tests.test_context__umu69c3v.log) Finished test(python3.12): pyspark.streaming.tests.test_context (12s) Starting test(python3.12): pyspark.streaming.tests.test_dstream (temp output: /__w/spark/spark/python/target/847eb56b-3c5f-49ab-8a83-3326bb96bc5d/python3.12__pyspark.streaming.tests.test_dstream__rorhk0lc.log) Finished test(python3.12): pyspark.streaming.tests.test_dstream (102s) Starting test(python3.12): pyspark.streaming.tests.test_kinesis (temp output: /__w/spark/spark/python/target/78f23c83-c24d-4fa1-abbd-edb90f48dff1/python3.12__pyspark.streaming.tests.test_kinesis__q5l1pv0h.log) test_kinesis_stream 
(pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream) ... skipped "Skipping all Kinesis Python tests as environmental variable 'ENABLE_KINESIS_TESTS' was not set." test_kinesis_stream_api (pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream_api) ... skipped "Skipping all Kinesis Python tests as environmental variable 'ENABLE_KINESIS_TESTS' was not set." -- Ran 0 tests in 0.000s NO TESTS RAN (skipped=2) Had test failures in pyspark.streaming.tests.test_kinesis with python3.12; see logs. Error: running /__w/spark/spark/python/run-tests --modules=pyspark-core,pyspark-streaming,pyspark-errors --parallelism=1 --python-executables=python3.12 ; received return code 255 Error: Process completed with exit code 19. ``` Scheduled job fails because of exit 5, see https://github.com/pytest-dev/pytest/issues/2393. This isn't a test failure. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually tested. ### Was this patch authored or co-authored using generative AI tooling? No, Closes #44841 from HyukjinKwon/SPARK-46801. Authored-by: Hyukjin Kwon Signed-off-by: Dongjoon Hyun (cherry picked from commit 52b62921cadb05da5b1183f979edf7d608256f2e) Signed-off-by: Dongjoon Hyun --- python/run-tests.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/python/run-tests.py b/python/run-tests.py index 19e39c822cbb..b9031765d943 100755 --- a/python/run-tests.py +++ b/python/run-tests.py @@ -147,8 +147,8 @@ def run_individual_python_test(target_dir, test_name, pyspark_python, keep_test_ # this code is invoked from a thread other than the main thread. os._exit(1) duration = time.time() - start_time -# Exit on the first failure. -if retcode != 0: +# Exit on the first failure but exclude the code 5 for no test ran, see SPARK-46801. 
+if retcode != 0 and retcode != 5: try: with FAILURE_REPORTING_LOCK: with open(LOG_FILE, 'ab') as log_file:
(spark) branch master updated: [SPARK-46799][CORE][TESTS] Improve `MasterSuite` to use nanoTime-based appIDs and workerIDs
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new f9feddfbc9de [SPARK-46799][CORE][TESTS] Improve `MasterSuite` to use nanoTime-based appIDs and workerIDs
f9feddfbc9de is described below

commit f9feddfbc9de8e87f7a2e9d8abade7e687335b84
Author: Dongjoon Hyun
AuthorDate: Mon Jan 22 16:34:26 2024 -0800

[SPARK-46799][CORE][TESTS] Improve `MasterSuite` to use nanoTime-based appIDs and workerIDs

### What changes were proposed in this pull request?

This PR aims to improve `MasterSuite` to use nanoTime-based appIDs and workerIDs.

### Why are the changes needed?

During testing, I hit a case where two workers had the same ID. This PR prevents duplicated IDs for apps and workers in `MasterSuite`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

```
$ build/sbt "core/testOnly *.MasterSuite"
[info] MasterSuite:
[info] - can use a custom recovery mode factory (443 milliseconds)
[info] - SPARK-46664: master should recover quickly in case of zero workers and apps (38 milliseconds)
[info] - master correctly recover the application (41 milliseconds)
[info] - SPARK-46205: Recovery with Kryo Serializer (27 milliseconds)
[info] - SPARK-46216: Recovery without compression (19 milliseconds)
[info] - SPARK-46216: Recovery with compression (20 milliseconds)
[info] - SPARK-46258: Recovery with RocksDB (306 milliseconds)
[info] - master/worker web ui available (197 milliseconds)
[info] - master/worker web ui available with reverseProxy (30 seconds, 123 milliseconds)
[info] - master/worker web ui available behind front-end reverseProxy (30 seconds, 113 milliseconds)
[info] - basic scheduling - spread out (23 milliseconds)
[info] - basic scheduling - no spread out (14 milliseconds)
[info] - basic scheduling with more memory - spread out (10 milliseconds)
[info] - basic scheduling with more memory - no spread out (10 milliseconds)
[info] - scheduling with max cores - spread out (9 milliseconds)
[info] - scheduling with max cores - no spread out (9 milliseconds)
[info] - scheduling with cores per executor - spread out (9 milliseconds)
[info] - scheduling with cores per executor - no spread out (8 milliseconds)
[info] - scheduling with cores per executor AND max cores - spread out (8 milliseconds)
[info] - scheduling with cores per executor AND max cores - no spread out (7 milliseconds)
[info] - scheduling with executor limit - spread out (8 milliseconds)
[info] - scheduling with executor limit - no spread out (7 milliseconds)
[info] - scheduling with executor limit AND max cores - spread out (8 milliseconds)
[info] - scheduling with executor limit AND max cores - no spread out (9 milliseconds)
[info] - scheduling with executor limit AND cores per executor - spread out (8 milliseconds)
[info] - scheduling with executor limit AND cores per executor - no spread out (13 milliseconds)
[info] - scheduling with executor limit AND cores per executor AND max cores - spread out (8 milliseconds)
[info] - scheduling with executor limit AND cores per executor AND max cores - no spread out (7 milliseconds)
[info] - scheduling for app with multiple resource profiles (44 milliseconds)
[info] - scheduling for app with multiple resource profiles with max cores (37 milliseconds)
[info] - SPARK-45174: scheduling with max drivers (9 milliseconds)
[info] - SPARK-13604: Master should ask Worker kill unknown executors and drivers (15 milliseconds)
[info] - SPARK-20529: Master should reply the address received from worker (20 milliseconds)
[info] - SPARK-27510: Master should avoid dead loop while launching executor failed in Worker (34 milliseconds)
[info] - All workers on a host should be decommissioned (28 milliseconds)
[info] - No workers should be decommissioned with invalid host (25 milliseconds)
[info] - Only worker on host should be decommissioned (19 milliseconds)
[info] - SPARK-19900: there should be a corresponding driver for the app after relaunching driver (2 seconds, 60 milliseconds)
[info] - assign/recycle resources to/from driver (33 milliseconds)
[info] - assign/recycle resources to/from executor (27 milliseconds)
[info] - resource description with multiple resource profiles (1 millisecond)
[info] - SPARK-45753: Support driver id pattern (7 milliseconds)
[info] - SPARK-45753: Prevent invalid driver id patterns (6 milliseconds)
[info] - SPARK-45754: Support app id pattern (7 milliseconds)
[info] - SPARK-45754: Prevent invalid app id patterns (7 milliseconds)
[info] - SPARK-45785:
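The motivation above is that two workers registered within the same millisecond can end up with identical timestamp-derived IDs. A minimal sketch of the nanoTime-based idea follows; this is a hypothetical Java illustration, not the actual `MasterSuite` change (class and method names are invented here):

```java
// Hypothetical sketch: IDs derived from System.currentTimeMillis() collide when
// two workers register within the same clock tick; System.nanoTime() has far
// finer resolution, making ID collisions in a test suite very unlikely.
public class NanoIdSketch {
    // Generate a worker ID with a nanoTime suffix, e.g. "worker-123456789012345".
    static String workerId() {
        return "worker-" + System.nanoTime();
    }

    // Same idea for application IDs.
    static String appId() {
        return "app-" + System.nanoTime();
    }

    public static void main(String[] args) {
        // Two back-to-back millisecond-based IDs would often be equal;
        // nanoTime-based IDs almost never are.
        System.out.println(workerId());
        System.out.println(appId());
    }
}
```

Note that `System.nanoTime()` is only guaranteed to be monotonic, not strictly increasing, so production code would typically combine it with a counter; for avoiding accidental duplicates in a test suite the raw value is generally sufficient.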
(spark) branch master updated: [SPARK-46797][CORE] Rename `spark.deploy.spreadOut` to `spark.deploy.spreadOutApps`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 8d1212837538 [SPARK-46797][CORE] Rename `spark.deploy.spreadOut` to `spark.deploy.spreadOutApps`
8d1212837538 is described below

commit 8d121283753894d4969d8ff9e09bb487f76e82e1
Author: Dongjoon Hyun
AuthorDate: Mon Jan 22 16:26:43 2024 -0800

[SPARK-46797][CORE] Rename `spark.deploy.spreadOut` to `spark.deploy.spreadOutApps`

### What changes were proposed in this pull request?

This PR aims to rename `spark.deploy.spreadOut` to `spark.deploy.spreadOutApps`.

### Why are the changes needed?

Although the Apache Spark documentation clearly says this configuration is about `applications`, the old name can still mislead users into forgetting that `Driver` JVMs are always spread out independently of this configuration.

https://github.com/apache/spark/blob/b80e8cb4552268b771fc099457b9186807081c4a/docs/spark-standalone.md?plain=1#L282-L285

### Does this PR introduce _any_ user-facing change?

No, the behavior is the same. Only warnings will be shown for usages of the old config name.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44838 from dongjoon-hyun/SPARK-46797.

Authored-by: Dongjoon Hyun
Signed-off-by: Dongjoon Hyun
---
 core/src/main/scala/org/apache/spark/internal/config/Deploy.scala | 3 ++-
 docs/spark-standalone.md                                          | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/internal/config/Deploy.scala b/core/src/main/scala/org/apache/spark/internal/config/Deploy.scala
index 6585d62b3b9c..31ac07621176 100644
--- a/core/src/main/scala/org/apache/spark/internal/config/Deploy.scala
+++ b/core/src/main/scala/org/apache/spark/internal/config/Deploy.scala
@@ -97,8 +97,9 @@ private[spark] object Deploy {
     .intConf
     .createWithDefault(10)

-  val SPREAD_OUT_APPS = ConfigBuilder("spark.deploy.spreadOut")
+  val SPREAD_OUT_APPS = ConfigBuilder("spark.deploy.spreadOutApps")
     .version("0.6.1")
+    .withAlternative("spark.deploy.spreadOut")
     .booleanConf
     .createWithDefault(true)

diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md
index b9e3bb5d3f7f..6e454dff1bde 100644
--- a/docs/spark-standalone.md
+++ b/docs/spark-standalone.md
@@ -279,7 +279,7 @@ SPARK_MASTER_OPTS supports the following system properties:
   1.1.0

-  spark.deploy.spreadOut
+  spark.deploy.spreadOutApps
   true
   Whether the standalone cluster manager should spread applications out across nodes or try
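The `withAlternative("spark.deploy.spreadOut")` call above is what keeps the old key working after the rename. A rough illustration of that fallback pattern follows; this is a hypothetical helper in Java, not Spark's actual `ConfigBuilder` implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the withAlternative fallback: reads prefer the new
// key, fall back to the deprecated alias, then to the default value.
public class ConfigAliasSketch {
    static String get(Map<String, String> conf, String key, String alternative, String dflt) {
        if (conf.containsKey(key)) {
            return conf.get(key);
        }
        if (conf.containsKey(alternative)) {
            // A real implementation would typically log a deprecation warning here.
            return conf.get(alternative);
        }
        return dflt;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // A user still sets the old name; the new name resolves to its value.
        conf.put("spark.deploy.spreadOut", "false");
        System.out.println(
            get(conf, "spark.deploy.spreadOutApps", "spark.deploy.spreadOut", "true"));
        // prints: false
    }
}
```

This is why the PR is not user-facing: configurations written against the old key keep resolving to the same values, only with a deprecation warning.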
(spark) branch branch-3.4 updated: [SPARK-46779][SQL] `InMemoryRelation` instances of the same cached plan should be semantically equivalent
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.4 by this push:
     new 97536c6673bb [SPARK-46779][SQL] `InMemoryRelation` instances of the same cached plan should be semantically equivalent
97536c6673bb is described below

commit 97536c6673bb08ba8741a6a6f697b6880ca629ce
Author: Bruce Robbins
AuthorDate: Mon Jan 22 11:09:01 2024 -0800

[SPARK-46779][SQL] `InMemoryRelation` instances of the same cached plan should be semantically equivalent

### What changes were proposed in this pull request?

When canonicalizing `output` in `InMemoryRelation`, use `output` itself as the schema for determining the ordinals, rather than `cachedPlan.output`.

`InMemoryRelation.output` and `InMemoryRelation.cachedPlan.output` don't necessarily use the same exprIds. E.g.:
```
+- InMemoryRelation [c1#340, c2#341], StorageLevel(disk, memory, deserialized, 1 replicas)
   +- LocalTableScan [c1#254, c2#255]
```
Because of this, `InMemoryRelation` will sometimes fail to fully canonicalize, resulting in cases where two semantically equivalent `InMemoryRelation` instances appear to be semantically nonequivalent.

Example:
```
create or replace temp view data(c1, c2) as values (1, 2), (1, 3), (3, 7), (4, 5);
cache table data;
select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from data d2 group by all;
```
If plan change validation checking is on (i.e., `spark.sql.planChangeValidation=true`), the failure is:
```
[PLAN_VALIDATION_FAILED_RULE_EXECUTOR] The input plan of org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$2 is invalid: Aggregate: Aggregate [c1#78, scalar-subquery#77 [c1#78]], [c1#78, scalar-subquery#77 [c1#78] AS scalarsubquery(c1)#90L, count(c2#79) AS count(c2)#83L]
...
is not a valid aggregate expression: [SCALAR_SUBQUERY_IS_IN_GROUP_BY_OR_AGGREGATE_FUNCTION] The correlated scalar subquery '"scalarsubquery(c1)"' is neither present in GROUP BY, nor in an aggregate function.
```
If plan change validation checking is off, the failure is more mysterious:
```
[INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
```
If you remove the cache command, the query succeeds.

The above failures happen because the subquery in the aggregate expressions and the subquery in the grouping expressions seem semantically nonequivalent, since the `InMemoryRelation` in one of the subquery plans failed to completely canonicalize.

In `CacheManager#useCachedData`, two lookups for the same cached plan may create `InMemoryRelation` instances that have different exprIds in `output`. That's because the plan fragments used as lookup keys may have been deduplicated by `DeduplicateRelations`, and thus have different exprIds in their respective output schemas. When `CacheManager#useCachedData` creates an `InMemoryRelation` instance, it borrows the output schema of the plan fragment used as the lookup key.

The failure to fully canonicalize has other effects. For example, this query fails to reuse the exchange:
```
create or replace temp view data(c1, c2) as values (1, 2), (1, 3), (2, 4), (3, 7), (7, 22);
cache table data;
set spark.sql.autoBroadcastJoinThreshold=-1;
set spark.sql.adaptive.enabled=false;
select * from data l join data r on l.c1 = r.c1;
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44806 from bersprockets/plan_validation_issue.

Authored-by: Bruce Robbins
Signed-off-by: Dongjoon Hyun
(cherry picked from commit b80e8cb4552268b771fc099457b9186807081c4a)
Signed-off-by: Dongjoon Hyun
---
 .../spark/sql/execution/columnar/InMemoryRelation.scala   |  2 +-
 .../org/apache/spark/sql/DataFrameAggregateSuite.scala    | 15 +++
 .../sql/execution/columnar/InMemoryRelationSuite.scala    |  7 +++
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
index 4df9915dc96e..119e9e0a188f 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
@@ -391,7 +391,7 @@ case class InMemoryRelation(
   }

   override def doCanonicalize(): logical.LogicalPlan =
-    copy(output = output.map(QueryPlan.normalizeExpres
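The core idea of the fix — normalizing expression IDs against the relation's *own* `output` so that exprId differences stop mattering — can be illustrated with a toy model. This is a hypothetical Java sketch, not Spark's actual `QueryPlan.normalizeExpressions` logic:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Toy illustration: canonicalize attribute references by their ordinal in the
// relation's own output, so two relations over the same cache entry that
// differ only in exprIds (e.g. c1#340 vs c1#78) compare as equal.
public class CanonicalizeSketch {
    static List<String> canonicalize(List<String> output) {
        return IntStream.range(0, output.size())
            .mapToObj(i -> "none#" + i) // exprId replaced by a positional placeholder
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("c1#340", "c2#341"); // one lookup of the cache entry
        List<String> b = Arrays.asList("c1#78", "c2#79");   // another lookup, deduplicated exprIds
        // After canonicalization both become [none#0, none#1] and compare equal.
        System.out.println(canonicalize(a).equals(canonicalize(b)));
        // prints: true
    }
}
```

The bug was effectively that ordinals were resolved against `cachedPlan.output` (a third, unrelated set of exprIds), so this positional rewrite could not match the attributes and the relations stayed nonequivalent.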
(spark) branch branch-3.5 updated: [SPARK-46779][SQL] `InMemoryRelation` instances of the same cached plan should be semantically equivalent
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new 68d9f353300e [SPARK-46779][SQL] `InMemoryRelation` instances of the same cached plan should be semantically equivalent
68d9f353300e is described below

commit 68d9f353300ed7de0b47c26cb30236bada896d25
Author: Bruce Robbins
AuthorDate: Mon Jan 22 11:09:01 2024 -0800

[SPARK-46779][SQL] `InMemoryRelation` instances of the same cached plan should be semantically equivalent

### What changes were proposed in this pull request?

When canonicalizing `output` in `InMemoryRelation`, use `output` itself as the schema for determining the ordinals, rather than `cachedPlan.output`.

`InMemoryRelation.output` and `InMemoryRelation.cachedPlan.output` don't necessarily use the same exprIds. E.g.:
```
+- InMemoryRelation [c1#340, c2#341], StorageLevel(disk, memory, deserialized, 1 replicas)
   +- LocalTableScan [c1#254, c2#255]
```
Because of this, `InMemoryRelation` will sometimes fail to fully canonicalize, resulting in cases where two semantically equivalent `InMemoryRelation` instances appear to be semantically nonequivalent.

Example:
```
create or replace temp view data(c1, c2) as values (1, 2), (1, 3), (3, 7), (4, 5);
cache table data;
select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from data d2 group by all;
```
If plan change validation checking is on (i.e., `spark.sql.planChangeValidation=true`), the failure is:
```
[PLAN_VALIDATION_FAILED_RULE_EXECUTOR] The input plan of org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$2 is invalid: Aggregate: Aggregate [c1#78, scalar-subquery#77 [c1#78]], [c1#78, scalar-subquery#77 [c1#78] AS scalarsubquery(c1)#90L, count(c2#79) AS count(c2)#83L]
...
is not a valid aggregate expression: [SCALAR_SUBQUERY_IS_IN_GROUP_BY_OR_AGGREGATE_FUNCTION] The correlated scalar subquery '"scalarsubquery(c1)"' is neither present in GROUP BY, nor in an aggregate function.
```
If plan change validation checking is off, the failure is more mysterious:
```
[INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
```
If you remove the cache command, the query succeeds.

The above failures happen because the subquery in the aggregate expressions and the subquery in the grouping expressions seem semantically nonequivalent, since the `InMemoryRelation` in one of the subquery plans failed to completely canonicalize.

In `CacheManager#useCachedData`, two lookups for the same cached plan may create `InMemoryRelation` instances that have different exprIds in `output`. That's because the plan fragments used as lookup keys may have been deduplicated by `DeduplicateRelations`, and thus have different exprIds in their respective output schemas. When `CacheManager#useCachedData` creates an `InMemoryRelation` instance, it borrows the output schema of the plan fragment used as the lookup key.

The failure to fully canonicalize has other effects. For example, this query fails to reuse the exchange:
```
create or replace temp view data(c1, c2) as values (1, 2), (1, 3), (2, 4), (3, 7), (7, 22);
cache table data;
set spark.sql.autoBroadcastJoinThreshold=-1;
set spark.sql.adaptive.enabled=false;
select * from data l join data r on l.c1 = r.c1;
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44806 from bersprockets/plan_validation_issue.

Authored-by: Bruce Robbins
Signed-off-by: Dongjoon Hyun
(cherry picked from commit b80e8cb4552268b771fc099457b9186807081c4a)
Signed-off-by: Dongjoon Hyun
---
 .../spark/sql/execution/columnar/InMemoryRelation.scala   |  2 +-
 .../org/apache/spark/sql/DataFrameAggregateSuite.scala    | 15 +++
 .../sql/execution/columnar/InMemoryRelationSuite.scala    |  7 +++
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
index 65f7835b42cf..5bab8e53eb16 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
@@ -405,7 +405,7 @@ case class InMemoryRelation(
   }

   override def doCanonicalize(): logical.LogicalPlan =
-    copy(output = output.map(QueryPlan.normalizeExpres