[spark] branch master updated: [SPARK-45442][PYTHON][DOCS] Refine docstring of DataFrame.show
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 5a00631e5805 [SPARK-45442][PYTHON][DOCS] Refine docstring of DataFrame.show 5a00631e5805 is described below commit 5a00631e5805f3c1bc9d8e4827e2cf30ee312274 Author: allisonwang-db AuthorDate: Thu Oct 12 13:05:40 2023 +0800 [SPARK-45442][PYTHON][DOCS] Refine docstring of DataFrame.show ### What changes were proposed in this pull request? This PR refines the docstring of `DataFrame.show` by adding more examples. ### Why are the changes needed? To improve PySpark documentations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? doctest ### Was this patch authored or co-authored using generative AI tooling? No Closes #43252 from allisonwang-db/spark-45442-refine-show. Authored-by: allisonwang-db Signed-off-by: Ruifeng Zheng --- python/pyspark/sql/dataframe.py | 49 - 1 file changed, 39 insertions(+), 10 deletions(-) diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py index c44838c0ee11..637787ceb660 100644 --- a/python/pyspark/sql/dataframe.py +++ b/python/pyspark/sql/dataframe.py @@ -887,7 +887,8 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin): return self._jdf.isEmpty() def show(self, n: int = 20, truncate: Union[bool, int] = True, vertical: bool = False) -> None: -"""Prints the first ``n`` rows to the console. +""" +Prints the first ``n`` rows of the DataFrame to the console. .. versionadded:: 1.3.0 @@ -896,20 +897,32 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin): Parameters -- -n : int, optional +n : int, optional, default 20 Number of rows to show. -truncate : bool or int, optional -If set to ``True``, truncate strings longer than 20 chars by default. 
+truncate : bool or int, optional, default True +If set to ``True``, truncate strings longer than 20 chars. If set to a number greater than one, truncates long strings to length ``truncate`` and align cells right. vertical : bool, optional -If set to ``True``, print output rows vertically (one line -per column value). +If set to ``True``, print output rows vertically (one line per column value). Examples >>> df = spark.createDataFrame([ -... (14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"]) +... (14, "Tom"), (23, "Alice"), (16, "Bob"), (19, "This is a super long name")], +... ["age", "name"]) + +Show :class:`DataFrame` + +>>> df.show() ++---++ +|age|name| ++---++ +| 14| Tom| +| 23| Alice| +| 16| Bob| +| 19|This is a super l...| ++---++ Show only top 2 rows. @@ -922,6 +935,18 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin): +---+-+ only showing top 2 rows +Show full column content without truncation. + +>>> df.show(truncate=False) ++---+-+ +|age|name | ++---+-+ +|14 |Tom | +|23 |Alice| +|16 |Bob | +|19 |This is a super long name| ++---+-+ + Show :class:`DataFrame` where the maximum number of characters is 3. >>> df.show(truncate=3) @@ -931,20 +956,24 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin): | 14| Tom| | 23| Ali| | 16| Bob| +| 19| Thi| +---++ Show :class:`DataFrame` vertically. >>> df.show(vertical=True) --RECORD 0- +-RECORD 0 age | 14 name | Tom --RECORD 1- +-RECORD 1 age | 23 name | Alice --RECORD 2- +-RECORD 2 age | 16 name | Bob +-RECORD 3 +age | 19 +name | This is a super l... """ if not isinstance(n, int) or isinstance(n, bool): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
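The truncation rule the refined docstring describes (cut strings longer than ``truncate`` characters and right-align the shortened cells) can be sketched in plain Python. This is an illustrative analogue, not Spark's actual implementation; the ``format_cell`` helper and the "keep ``truncate - 3`` characters then append ``...``" detail are assumptions inferred from the doctest output above (`"This is a super l..."`).

```python
# Illustrative analogue (not Spark's actual code) of how DataFrame.show
# truncates cell values: strings longer than `truncate` are cut and
# suffixed with "...". The exact cut point (truncate - 3) is inferred
# from the doctest output in the docstring above.

def format_cell(value, truncate=20):
    """Mimic the documented truncate behaviour of DataFrame.show."""
    s = str(value)
    if truncate and len(s) > truncate:
        s = s[: truncate - 3] + "..."
    return s

print(format_cell("This is a super long name"))  # -> This is a super l...
print(format_cell("Bob"))                        # -> Bob
```

Running this against the names in the doctest reproduces the truncated column values shown in the refined docstring.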
[spark] branch master updated: [SPARK-42881][SQL][FOLLOWUP] Update the results of JsonBenchmark-jdk21 after get_json_object supports codgen
This is an automated email from the ASF dual-hosted git repository. yangjie01 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 95653904a116 [SPARK-42881][SQL][FOLLOWUP] Update the results of JsonBenchmark-jdk21 after get_json_object supports codgen 95653904a116 is described below commit 95653904a116a8220972108a94d70a15827f3c66 Author: panbingkun AuthorDate: Thu Oct 12 11:08:43 2023 +0800 [SPARK-42881][SQL][FOLLOWUP] Update the results of JsonBenchmark-jdk21 after get_json_object supports codgen ### What changes were proposed in this pull request? The pr aims to followup https://github.com/apache/spark/pull/40506, update JsonBenchmark-jdk21-results.txt for it. ### Why are the changes needed? Update JsonBenchmark-jdk21-results.txt. https://github.com/panbingkun/spark/actions/runs/6489918873 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Only update the results of the benchmark, ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43346 from panbingkun/get_json_object_followup. Authored-by: panbingkun Signed-off-by: yangjie01 --- .../benchmarks/JsonBenchmark-jdk21-results.txt | 153 +++-- 1 file changed, 77 insertions(+), 76 deletions(-) diff --git a/sql/core/benchmarks/JsonBenchmark-jdk21-results.txt b/sql/core/benchmarks/JsonBenchmark-jdk21-results.txt index 3b48a59e660a..f0e19c0ecf9a 100644 --- a/sql/core/benchmarks/JsonBenchmark-jdk21-results.txt +++ b/sql/core/benchmarks/JsonBenchmark-jdk21-results.txt @@ -3,127 +3,128 @@ Benchmark for performance of JSON parsing Preparing data for benchmarking ... 
-OpenJDK 64-Bit Server VM 21+35 on Linux 5.15.0-1046-azure -Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz +OpenJDK 64-Bit Server VM 21+35-LTS on Linux 5.15.0-1047-azure +Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz JSON schema inferring:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -No encoding2855 2912 65 1.8 571.0 1.0X -UTF-8 is set 4699 4723 31 1.1 939.9 0.6X +No encoding2944 3061 191 1.7 588.8 1.0X +UTF-8 is set 4437 4465 26 1.1 887.5 0.7X Preparing data for benchmarking ... -OpenJDK 64-Bit Server VM 21+35 on Linux 5.15.0-1046-azure -Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz +OpenJDK 64-Bit Server VM 21+35-LTS on Linux 5.15.0-1047-azure +Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz count a short column: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -No encoding2946 2952 10 1.7 589.1 1.0X -UTF-8 is set 4557 4580 32 1.1 911.4 0.6X +No encoding2545 2567 31 2.0 509.0 1.0X +UTF-8 is set 4020 4028 9 1.2 804.1 0.6X Preparing data for benchmarking ... -OpenJDK 64-Bit Server VM 21+35 on Linux 5.15.0-1046-azure -Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz +OpenJDK 64-Bit Server VM 21+35-LTS on Linux 5.15.0-1047-azure +Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz count a wide column: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -No encoding6977 7229 433 0.16977.2 1.0X -UTF-8 is set 6373 6394 25 0.26372.9 1.1X +No encoding6786 6939 264 0.16785.7 1.0X +UTF-8 is set 5668 5680 11 0.25668.1 1.2X Preparing data for benchmarking ... -OpenJDK 64-Bit Server VM 21+35 on Linux 5.15.0-1046-azure
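The benchmark columns in the updated results are internally consistent: "Relative" is the baseline's best time divided by the row's best time, and "Rate(M/s)" is the reciprocal of "Per Row(ns)" in the appropriate units. A quick arithmetic check against the first "JSON schema inferring" table above (this derivation of the column relationships is an inference from the numbers, not documented in the results file):

```python
# Sanity-check the benchmark columns using the new "JSON schema inferring"
# numbers: No encoding best time 2944 ms (baseline, 588.8 ns/row),
# UTF-8 is set best time 4437 ms.

no_encoding_ms = 2944
utf8_ms = 4437

# Relative = baseline best time / this row's best time
print(f"{no_encoding_ms / utf8_ms:.1f}X")  # -> 0.7X (matches the table)

# Rate(M/s) = 1000 / Per Row(ns): 588.8 ns/row is 1.7 M rows/s
print(f"{1000 / 588.8:.1f}")               # -> 1.7 (matches the table)
```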
[spark] branch master updated: [SPARK-45402][SQL][PYTHON] Add UDTF API for 'eval' and 'terminate' methods to consume previous 'analyze' result
This is an automated email from the ASF dual-hosted git repository. ueshin pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 69cf80d25f0e [SPARK-45402][SQL][PYTHON] Add UDTF API for 'eval' and 'terminate' methods to consume previous 'analyze' result 69cf80d25f0e is described below commit 69cf80d25f0e4ed46ec38a63e063471988c31732 Author: Daniel Tenedorio AuthorDate: Wed Oct 11 18:52:06 2023 -0700 [SPARK-45402][SQL][PYTHON] Add UDTF API for 'eval' and 'terminate' methods to consume previous 'analyze' result ### What changes were proposed in this pull request? This PR adds a Python UDTF API for the `eval` and `terminate` methods to consume the previous `analyze` result. This also works for subclasses of the `AnalyzeResult` class, allowing the UDTF to return custom state from `analyze` to be consumed later. For example, we can now define a UDTF that perform complex initialization in the `analyze` method and then returns the result of that in the `terminate` method: ``` def MyUDTF(self): dataclass class AnalyzeResultWithBuffer(AnalyzeResult): buffer: str udtf class TestUDTF: def __init__(self, analyze_result): self._total = 0 self._buffer = do_complex_initialization(analyze_result.buffer) staticmethod def analyze(argument, _): return AnalyzeResultWithBuffer( schema=StructType() .add("total", IntegerType()) .add("buffer", StringType()), with_single_partition=True, buffer=argument.value, ) def eval(self, argument, row: Row): self._total += 1 def terminate(self): yield self._total, self._buffer self.spark.udtf.register("my_ddtf", MyUDTF) ``` Then the results might look like: ``` sql( """ WITH t AS ( SELECT id FROM range(1, 21) ) SELECT total, buffer FROM test_udtf("abc", TABLE(t)) """ ).collect() > 20, "complex_initialization_result" ``` ### Why are the changes needed? 
In this way, the UDTF can perform potentially expensive initialization logic in the `analyze` method just once and reuse the result of such initialization rather than repeating the initialization in `eval`. ### Does this PR introduce _any_ user-facing change? Yes, see above. ### How was this patch tested? This PR adds new unit test coverage. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43204 from dtenedor/prepare-string. Authored-by: Daniel Tenedorio Signed-off-by: Takuya UESHIN --- python/docs/source/user_guide/sql/python_udtf.rst | 124 - python/pyspark/sql/tests/test_udtf.py | 53 + python/pyspark/sql/udtf.py | 5 +- python/pyspark/sql/worker/analyze_udtf.py | 2 + python/pyspark/worker.py | 34 +- .../spark/sql/catalyst/analysis/Analyzer.scala | 5 +- .../spark/sql/catalyst/expressions/PythonUDF.scala | 20 +++- .../execution/python/BatchEvalPythonUDTFExec.scala | 8 ++ .../python/UserDefinedPythonFunction.scala | 7 +- .../sql-tests/analyzer-results/udtf/udtf.sql.out | 26 +++-- .../test/resources/sql-tests/inputs/udtf/udtf.sql | 9 +- .../resources/sql-tests/results/udtf/udtf.sql.out | 28 +++-- .../apache/spark/sql/IntegratedUDFTestUtils.scala | 64 ++- .../sql/execution/python/PythonUDTFSuite.scala | 42 +-- 14 files changed, 374 insertions(+), 53 deletions(-) diff --git a/python/docs/source/user_guide/sql/python_udtf.rst b/python/docs/source/user_guide/sql/python_udtf.rst index 74d8eb889861..fb42644dc702 100644 --- a/python/docs/source/user_guide/sql/python_udtf.rst +++ b/python/docs/source/user_guide/sql/python_udtf.rst @@ -50,10 +50,108 @@ To implement a Python UDTF, you first need to define a class implementing the me Notes - -- This method does not accept any extra arguments. Only the default - constructor is supported. - You cannot create or reference the Spark session within the UDTF. Any attempt to do so will result in a serialization error.
+- If the below `analyze` method is implemented, it is also possible to define this + method as: `__init__(self, analyze_result: AnalyzeResult)`. In this case, the result + of the `analyze` method is passed into all future instantiations of this UDTF class. +
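The shape of the new API (a UDTF whose `__init__` receives the earlier `analyze` result, including custom fields on an `AnalyzeResult` subclass) can be sketched without a Spark session. This is a plain-Python stand-in: `AnalyzeResult` here is a simplified dataclass, not `pyspark.sql.udtf.AnalyzeResult`, and the driver loop at the bottom is a hypothetical substitute for Spark's executor.

```python
from dataclasses import dataclass

# Plain-Python sketch of the pattern this commit enables: state computed
# once in the static analyze() call is handed to every later instantiation
# of the UDTF class via __init__(self, analyze_result).

@dataclass
class AnalyzeResult:            # stand-in for pyspark.sql.udtf.AnalyzeResult
    schema: tuple               # stand-in for a StructType

@dataclass
class AnalyzeResultWithBuffer(AnalyzeResult):
    buffer: str = ""

class TestUDTF:
    def __init__(self, analyze_result: AnalyzeResultWithBuffer):
        self._total = 0
        self._buffer = analyze_result.buffer  # reuse analyze-time state

    @staticmethod
    def analyze(argument: str) -> AnalyzeResultWithBuffer:
        return AnalyzeResultWithBuffer(schema=("total", "buffer"),
                                       buffer=argument)

    def eval(self, row):
        self._total += 1

    def terminate(self):
        yield self._total, self._buffer

# Hypothetical driver loop standing in for Spark's executor:
result = TestUDTF.analyze("abc")
udtf = TestUDTF(result)
for row in range(1, 21):        # 20 input rows, as in the PR example
    udtf.eval(row)
print(list(udtf.terminate()))   # -> [(20, 'abc')]
```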
[spark] branch master updated: [SPARK-45113][PYTHON][DOCS][FOLLOWUP] Make doctests deterministic
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 045eb2d6dade [SPARK-45113][PYTHON][DOCS][FOLLOWUP] Make doctests deterministic 045eb2d6dade is described below commit 045eb2d6dadec905f5c8f249fe19be6001107668 Author: Ruifeng Zheng AuthorDate: Thu Oct 12 09:20:15 2023 +0800 [SPARK-45113][PYTHON][DOCS][FOLLOWUP] Make doctests deterministic ### What changes were proposed in this pull request? sort before show ### Why are the changes needed? the order of rows is non-deterministic after groupBy, so the tests fail in some environments ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #43331 from zhengruifeng/py_collect_groupby. Authored-by: Ruifeng Zheng Signed-off-by: Ruifeng Zheng --- python/pyspark/sql/functions.py | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index 04968440e394..25958bdf15da 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -3751,7 +3751,8 @@ def collect_list(col: "ColumnOrName") -> Column: >>> from pyspark.sql import functions as sf >>> df = spark.createDataFrame([(1, "John"), (2, "John"), (3, "Ana")], ("id", "name")) ->>> df.groupBy("name").agg(sf.sort_array(sf.collect_list('id')).alias('sorted_list')).show() +>>> df = df.groupBy("name").agg(sf.sort_array(sf.collect_list('id')).alias('sorted_list')) +>>> df.orderBy(sf.desc("name")).show() ++---+ |name|sorted_list| ++---+ @@ -3842,7 +3843,8 @@ def collect_set(col: "ColumnOrName") -> Column: >>> from pyspark.sql import functions as sf >>> df = spark.createDataFrame([(1, "John"), (2, "John"), (3, "Ana")], ("id", "name")) ->>>
df.groupBy("name").agg(sf.sort_array(sf.collect_set('id')).alias('sorted_set')).show() +>>> df = df.groupBy("name").agg(sf.sort_array(sf.collect_set('id')).alias('sorted_set')) +>>> df.orderBy(sf.desc("name")).show() ++--+ |name|sorted_set| ++--+ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
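The fix above works because both sources of non-determinism are removed: `sort_array` fixes the order *within* each collected list, and `orderBy` fixes the order of the group rows themselves. A minimal plain-Python analogue (no Spark session; `grouped_sorted` is a hypothetical helper, using the same `(id, name)` rows as the doctest):

```python
from collections import defaultdict

# Group ids by name, then sort both the per-group lists (sort_array) and
# the group keys descending (orderBy(desc("name"))), so the result is
# deterministic regardless of the order rows arrive in.

def grouped_sorted(rows):
    groups = defaultdict(list)
    for id_, name in rows:
        groups[name].append(id_)
    return [(name, sorted(ids))
            for name, ids in sorted(groups.items(),
                                    key=lambda kv: kv[0], reverse=True)]

rows = [(1, "John"), (2, "John"), (3, "Ana")]
shuffled = [(3, "Ana"), (2, "John"), (1, "John")]
print(grouped_sorted(rows))   # -> [('John', [1, 2]), ('Ana', [3])]

# Same output no matter how the input rows were ordered:
assert grouped_sorted(rows) == grouped_sorted(shuffled)
```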
[spark] branch master updated: [SPARK-45221][PYTHON][DOCS] Refine docstring of DataFrameReader.parquet
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4027474cc744 [SPARK-45221][PYTHON][DOCS] Refine docstring of DataFrameReader.parquet 4027474cc744 is described below commit 4027474cc74438b29b0eae38f07ab03aeab99f5a Author: Hyukjin Kwon AuthorDate: Thu Oct 12 09:24:27 2023 +0900 [SPARK-45221][PYTHON][DOCS] Refine docstring of DataFrameReader.parquet ### What changes were proposed in this pull request? This PR refines the docstring of DataFrameReader.parquet by adding more examples. ### Why are the changes needed? To improve PySpark documentation ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? doctest ### Was this patch authored or co-authored using generative AI tooling? No Closes #43301 from allisonwang-db/spark-45221-refine-parquet. Lead-authored-by: Hyukjin Kwon Co-authored-by: allisonwang-db Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/readwriter.py | 68 ++-- 1 file changed, 58 insertions(+), 10 deletions(-) diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index cfac8fdbc68b..ea429a75e157 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -495,6 +495,7 @@ class DataFrameReader(OptionUtils): Parameters -- paths : str +One or more file paths to read the Parquet files from. Other Parameters @@ -505,24 +506,71 @@ class DataFrameReader(OptionUtils): .. # noqa +Returns +--- +:class:`DataFrame` +A DataFrame containing the data from the Parquet files. + Examples +Create sample dataframes. + +>>> df = spark.createDataFrame( +... [(10, "Alice"), (15, "Bob"), (20, "Tom")], schema=["age", "name"]) +>>> df2 = spark.createDataFrame([(70, "Alice"), (80, "Bob")], schema=["height", "name"]) + Write a DataFrame into a Parquet file and read it back. 
>>> import tempfile >>> with tempfile.TemporaryDirectory() as d: -... # Write a DataFrame into a Parquet file -... spark.createDataFrame( -... [{"age": 100, "name": "Hyukjin Kwon"}] -... ).write.mode("overwrite").format("parquet").save(d) +... # Write a DataFrame into a Parquet file. +... df.write.mode("overwrite").format("parquet").save(d) ... ... # Read the Parquet file as a DataFrame. -... spark.read.parquet(d).show() -+---++ -|age|name| -+---++ -|100|Hyukjin Kwon| -+---++ +... spark.read.parquet(d).orderBy("name").show() ++---+-+ +|age| name| ++---+-+ +| 10|Alice| +| 15| Bob| +| 20| Tom| ++---+-+ + +Read a Parquet file with a specific column. + +>>> with tempfile.TemporaryDirectory() as d: +... df.write.mode("overwrite").format("parquet").save(d) +... +... # Read the Parquet file with only the 'name' column. +... spark.read.schema("name string").parquet(d).orderBy("name").show() ++-+ +| name| ++-+ +|Alice| +| Bob| +| Tom| ++-+ + +Read multiple Parquet files and merge schema. + +>>> with tempfile.TemporaryDirectory() as d1, tempfile.TemporaryDirectory() as d2: +... df.write.mode("overwrite").format("parquet").save(d1) +... df2.write.mode("overwrite").format("parquet").save(d2) +... +... spark.read.option( +... "mergeSchema", "true" +... ).parquet(d1, d2).select( +... "name", "age", "height" +... ).orderBy("name", "age").show() ++-++--+ +| name| age|height| ++-++--+ +|Alice|NULL|70| +|Alice| 10| NULL| +| Bob|NULL|80| +| Bob| 15| NULL| +| Tom| 20| NULL| ++-++--+ """ mergeSchema = options.get("mergeSchema", None) pathGlobFilter = options.get("pathGlobFilter", None) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
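The `mergeSchema` example in the refined docstring shows the key semantics: the unified schema is the union of the input columns, and rows from files missing a column get NULL. A plain-Python analogue of that merge (no Spark or Parquet involved; `merge_schema` and the dict-based table layout are illustrative stand-ins):

```python
# Analogue of reading two Parquet directories with mergeSchema=true:
# the result schema is the union of all columns, and each row fills
# missing columns with None (Spark's NULL).

def merge_schema(*tables):
    columns = []
    for table in tables:
        for col in table["columns"]:
            if col not in columns:
                columns.append(col)
    rows = []
    for table in tables:
        for row in table["rows"]:
            full = dict(zip(table["columns"], row))
            rows.append(tuple(full.get(c) for c in columns))
    return columns, rows

# Same data as the docstring's df and df2:
d1 = {"columns": ["age", "name"],
      "rows": [(10, "Alice"), (15, "Bob"), (20, "Tom")]}
d2 = {"columns": ["height", "name"],
      "rows": [(70, "Alice"), (80, "Bob")]}

cols, rows = merge_schema(d1, d2)
print(cols)        # -> ['age', 'name', 'height']
print(len(rows))   # -> 5 (three age rows plus two height rows)
```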
[spark] branch dependabot/maven/connector/kafka-0-10-sql/org.apache.zookeeper-zookeeper-3.7.2 deleted (was b4b69957bd66)
This is an automated email from the ASF dual-hosted git repository. github-bot pushed a change to branch dependabot/maven/connector/kafka-0-10-sql/org.apache.zookeeper-zookeeper-3.7.2 in repository https://gitbox.apache.org/repos/asf/spark.git was b4b69957bd66 Bump org.apache.zookeeper:zookeeper in /connector/kafka-0-10-sql The revisions that were on this branch are still contained in other references; therefore, this change does not discard any commits from the repository.
[spark] branch dependabot/maven/connector/kafka-0-10/org.apache.zookeeper-zookeeper-3.7.2 deleted (was 18ddbd8d5893)
This is an automated email from the ASF dual-hosted git repository. github-bot pushed a change to branch dependabot/maven/connector/kafka-0-10/org.apache.zookeeper-zookeeper-3.7.2 in repository https://gitbox.apache.org/repos/asf/spark.git was 18ddbd8d5893 Bump org.apache.zookeeper:zookeeper in /connector/kafka-0-10 The revisions that were on this branch are still contained in other references; therefore, this change does not discard any commits from the repository.
[spark] branch dependabot/maven/org.apache.zookeeper-zookeeper-3.7.2 deleted (was f083a40d537e)
This is an automated email from the ASF dual-hosted git repository. github-bot pushed a change to branch dependabot/maven/org.apache.zookeeper-zookeeper-3.7.2 in repository https://gitbox.apache.org/repos/asf/spark.git was f083a40d537e Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 The revisions that were on this branch are still contained in other references; therefore, this change does not discard any commits from the repository.
[spark-connect-go] branch dependabot/go_modules/golang.org/x/net-0.17.0 deleted (was d3523ef)
This is an automated email from the ASF dual-hosted git repository. github-bot pushed a change to branch dependabot/go_modules/golang.org/x/net-0.17.0 in repository https://gitbox.apache.org/repos/asf/spark-connect-go.git was d3523ef Bump golang.org/x/net from 0.10.0 to 0.17.0 The revisions that were on this branch are still contained in other references; therefore, this change does not discard any commits from the repository.
[spark-connect-go] branch master updated: Bump golang.org/x/net from 0.10.0 to 0.17.0
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark-connect-go.git The following commit(s) were added to refs/heads/master by this push: new 3007733 Bump golang.org/x/net from 0.10.0 to 0.17.0 3007733 is described below commit 3007733a4f6e18a6e67c6302bee5941d7f888414 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> AuthorDate: Thu Oct 12 09:12:52 2023 +0900 Bump golang.org/x/net from 0.10.0 to 0.17.0 Bumps [golang.org/x/net](https://github.com/golang/net) from 0.10.0 to 0.17.0. Commits https://github.com/golang/net/commit/b225e7ca6dde1ef5a5ae5ce922861bda011cfabd;>b225e7c http2: limit maximum handler goroutines to MaxConcurrentStreams https://github.com/golang/net/commit/88194ad8ab44a02ea952c169883c3f57db6cf9f4;>88194ad go.mod: update golang.org/x dependencies https://github.com/golang/net/commit/2b60a61f1e4cf3a5ecded0bd7e77ea168289e6de;>2b60a61 quic: fix several bugs in flow control accounting https://github.com/golang/net/commit/73d82efb96cacc0c378bc150b56675fc191894b9;>73d82ef quic: handle DATA_BLOCKED frames https://github.com/golang/net/commit/5d5a036a503f8accd748f7453c0162115187be13;>5d5a036 quic: handle streams moving from the data queue to the meta queue https://github.com/golang/net/commit/350aad2603e57013fafb1a9e2089a382fe67dc80;>350aad2 quic: correctly extend peer's flow control window after MAX_DATA https://github.com/golang/net/commit/21814e71db756f39b69fb1a3e06350fa555a79b1;>21814e7 quic: validate connection id transport parameters https://github.com/golang/net/commit/a600b3518eed7a9a4e24380b4b249cb986d9b64d;>a600b35 quic: avoid redundant MAX_DATA updates https://github.com/golang/net/commit/ea633599b58dc6a50d33c7f5438edfaa8bc313df;>ea63359 http2: check stream body is present on read timeout https://github.com/golang/net/commit/ddd8598e5694aa5e966e44573a53e895f6fa5eb2;>ddd8598 quic: version negotiation Additional 
commits viewable in https://github.com/golang/net/compare/v0.10.0...v0.17.0;>compare view [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=golang.org/x/net=go_modules=0.10.0=0.17.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `dependabot rebase` will rebase this PR - `dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `dependabot merge` will merge this PR after your CI passes on it - `dependabot squash and merge` will squash and merge this PR after your CI passes on it - `dependabot cancel merge` will cancel a previously requested merge and block automerging - `dependabot reopen` will reopen this PR if it is closed - `dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually - `dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/apache/spark-connect-go/network/alerts). Closes #16 from dependabot[bot]/dependabot/go_modules/golang.org/x/net-0.17.0. Authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Signed-off-by: Hyukjin Kwon --- go.mod | 6 +++--- go.sum | 12 ++-- 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/go.mod b/go.mod index 6b4c0a0..d518fc5 100644 --- a/go.mod +++ b/go.mod @@ -49,10 +49,10 @@ require ( github.com/pmezard/go-difflib v1.0.0 // indirect github.com/zeebo/xxh3 v1.0.2 // indirect golang.org/x/mod v0.10.0 // indirect - golang.org/x/net v0.10.0 // indirect + golang.org/x/net v0.17.0 // indirect
[spark] branch master updated: [SPARK-45415] Allow selective disabling of "fallocate" in RocksDB statestore
This is an automated email from the ASF dual-hosted git repository. kabhwan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a5f019554991 [SPARK-45415] Allow selective disabling of "fallocate" in RocksDB statestore a5f019554991 is described below commit a5f01955499141c53c619ddf81d6846a72ad789a Author: Scott Schenkein AuthorDate: Thu Oct 12 08:44:13 2023 +0900 [SPARK-45415] Allow selective disabling of "fallocate" in RocksDB statestore ### What changes were proposed in this pull request? Our spark environment features a number of parallel structured streaming jobs, many of which have use state store. Most use state store for dropDuplicates and work with a tiny amount of information, but a few have a substantially large state store requiring use of RocksDB. In such a configuration, spark allocates a minimum of `spark.sql.shuffle.partitions * queryCount` partitions, each of which pre-allocate about 74mb (observed on EMR/Hadoop) disk storage for RocksDB. This allocati [...] This PR provides users with the option to simply disable fallocate so RocksDB uses far less space for the smaller state stores, reducing complexity and disk storage at the expense of performance. ### Why are the changes needed? As previously mentioned, these changes allow a spark context to support many parallel structured streaming jobs when using RocksDB state stores without the need to allocate a glut of excess storage. ### Does this PR introduce _any_ user-facing change? Users disable the fallocate rocksdb performance optimization by configuring `spark.sql.streaming.stateStore.rocksdb.allowFAllocate=false` ### How was this patch tested? 
1) A few test cases were added 2) The state store size was validated by running this script with and without fallocate disabled ``` from pyspark.sql.types import StructType, StructField, StringType, TimestampType import datetime if disable_fallocate: spark.conf.set("spark.sql.streaming.stateStore.rocksdb.allowFAllocate", "false") spark.conf.set( "spark.sql.streaming.stateStore.providerClass", "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider", ) schema = StructType( [ StructField("one", TimestampType(), False), StructField("two", StringType(), True), ] ) now = datetime.datetime.now() data = [(now, y) for y in range(300)] init_df = spark.createDataFrame(data, schema) path = "/tmp/stream_try/test" init_df.write.format("parquet").mode("append").save(path) stream_df = spark.readStream.schema(schema).format("parquet").load(path) stream_df = stream_df.dropDuplicates(["one"]) def foreach_batch_function(batch_df, epoch_id): batch_df.write.format("parquet").mode("append").option("path", path + "_out").save() stream_df.writeStream.foreachBatch(foreach_batch_function).option( "checkpointLocation", path + "_checkpoint" ).start() ``` With these results (local run, docker container with small FS) ``` allowFAllocate=True (current default) - root0ef384f699e0:/tmp# du -sh spark-d43a2964-c92a-4d94-9fdd-f3557a651fd9 808Mspark-d43a2964-c92a-4d94-9fdd-f3557a651fd9 | |-->4.1M StateStoreId(opId=0,partId=0,name=default)-d59b907c-8004-47f9-a8a1-dec131f73505 |--> |-->4.1M StateStoreId(opId=0,partId=199,name=default)-b49a93fe-1007-4e92-8f8f-5767aef41e5c allowFAllocate=False (new feature) -- root0ef384f699e0:/tmp# du -sh spark-00cb768d-2659-453c-8670-4aaf70148041 7.9Mspark-00cb768d-2659-453c-8670-4aaf70148041 | |-->40K StateStoreId(opId=0,partId=0,name=default)-45b38d9c-737b-49b1-bb82- |--> |-->40K StateStoreId(opId=0,partId=199,name=default)-28a6cc02-2693-4360-b47a-1f1ab0d54a61 ``` ### Was this patch authored or co-authored using generative AI tooling? 
No Closes #43202 from schenksj/feature/rocksdb_allow_fallocate. Authored-by: Scott Schenkein Signed-off-by: Jungtaek Lim --- docs/structured-streaming-programming-guide.md| 5 + .../spark/sql/execution/streaming/state/RocksDB.scala | 15 +-- .../streaming/state/RocksDBStateStoreSuite.scala | 2 ++ .../sql/execution/streaming/state/RocksDBSuite.scala | 6 ++ 4 files changed, 26 insertions(+), 2 deletions(-) diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md index 774422a9cd9d..9fb823abaa3a 100644 --- a/docs/structured-streaming-programming-guide.md +++ b/docs/structured-streaming-programming-guide.md @@ -2385,6 +2385,11
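The motivation above is easy to see with back-of-envelope arithmetic: pre-allocated space scales with `spark.sql.shuffle.partitions * queryCount`. The ~74 MB per-partition figure is from the PR description (observed on EMR/Hadoop); the partition and query counts below are illustrative, not from the PR.

```python
# Estimate RocksDB pre-allocation with fallocate enabled: each state
# store partition reserves roughly 74 MB (figure from the PR), and Spark
# creates shuffle-partitions * stateful-query-count partitions.

shuffle_partitions = 200   # spark.sql.shuffle.partitions default
stateful_queries = 5       # illustrative: five parallel streaming queries
mb_per_partition = 74      # observed on EMR/Hadoop per the PR

total_mb = shuffle_partitions * stateful_queries * mb_per_partition
print(f"{total_mb / 1024:.1f} GiB pre-allocated")  # -> 72.3 GiB pre-allocated
```

Even a tiny dropDuplicates state store pays this cost per query, which is why selectively setting `spark.sql.streaming.stateStore.rocksdb.allowFAllocate=false` matters for deployments with many small stateful queries.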
[spark-connect-go] branch dependabot/go_modules/golang.org/x/net-0.17.0 created (now d3523ef)
This is an automated email from the ASF dual-hosted git repository. github-bot pushed a change to branch dependabot/go_modules/golang.org/x/net-0.17.0 in repository https://gitbox.apache.org/repos/asf/spark-connect-go.git at d3523ef Bump golang.org/x/net from 0.10.0 to 0.17.0 No new revisions were added by this update.
[spark] branch dependabot/maven/org.apache.zookeeper-zookeeper-3.7.2 created (now f083a40d537e)
This is an automated email from the ASF dual-hosted git repository. github-bot pushed a change to branch dependabot/maven/org.apache.zookeeper-zookeeper-3.7.2 in repository https://gitbox.apache.org/repos/asf/spark.git at f083a40d537e Bump org.apache.zookeeper:zookeeper from 3.5.7 to 3.7.2 No new revisions were added by this update.
[spark] branch dependabot/maven/connector/kafka-0-10/org.apache.zookeeper-zookeeper-3.7.2 created (now 18ddbd8d5893)
This is an automated email from the ASF dual-hosted git repository. github-bot pushed a change to branch dependabot/maven/connector/kafka-0-10/org.apache.zookeeper-zookeeper-3.7.2 in repository https://gitbox.apache.org/repos/asf/spark.git at 18ddbd8d5893 Bump org.apache.zookeeper:zookeeper in /connector/kafka-0-10 No new revisions were added by this update.
[spark] branch dependabot/maven/connector/kafka-0-10-sql/org.apache.zookeeper-zookeeper-3.7.2 created (now b4b69957bd66)
This is an automated email from the ASF dual-hosted git repository. github-bot pushed a change to branch dependabot/maven/connector/kafka-0-10-sql/org.apache.zookeeper-zookeeper-3.7.2 in repository https://gitbox.apache.org/repos/asf/spark.git at b4b69957bd66 Bump org.apache.zookeeper:zookeeper in /connector/kafka-0-10-sql No new revisions were added by this update.
[spark] branch master updated: [SPARK-44855][CONNECT] Small tweaks to attaching ExecuteGrpcResponseSender to ExecuteResponseObserver
This is an automated email from the ASF dual-hosted git repository. hvanhovell pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 292a1131b542 [SPARK-44855][CONNECT] Small tweaks to attaching ExecuteGrpcResponseSender to ExecuteResponseObserver 292a1131b542 is described below commit 292a1131b542ddc7b227a7e51e4f4233f3d2f9d8 Author: Juliusz Sompolski AuthorDate: Wed Oct 11 15:01:20 2023 -0400 [SPARK-44855][CONNECT] Small tweaks to attaching ExecuteGrpcResponseSender to ExecuteResponseObserver ### What changes were proposed in this pull request? Small improvements can be made to the way a new ExecuteGrpcResponseSender is attached to the observer. * Since we now have addGrpcResponseSender in ExecuteHolder, it should be ExecuteHolder's responsibility to interrupt the old sender and ensure that there is only one at a time, rather than ExecuteResponseObserver's responsibility. * executeObserver is used as a lock for synchronization. An explicit lock object could be better. This also fixes a small bug where ExecuteGrpcResponseSender would not be woken up by interrupt if it was sleeping on the grpcCallObserverReadySignal. This would result in the sender potentially sleeping until the deadline (2 minutes) and only then being removed, which could delay timing the execution out by these 2 minutes. It should **not** cause any hang or wait on the client side, because if ExecuteGrpcResponseSender is interrupted, it means that the client has already come back with a ne [...] ### Why are the changes needed? Minor cleanup of previous work. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests in ReattachableExecuteSuite. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43181 from juliuszsompolski/SPARK-44855. 
Authored-by: Juliusz Sompolski Signed-off-by: Herman van Hovell --- .../execution/ExecuteGrpcResponseSender.scala | 26 - .../execution/ExecuteResponseObserver.scala| 44 ++ .../spark/sql/connect/service/ExecuteHolder.scala | 4 ++ 3 files changed, 40 insertions(+), 34 deletions(-) diff --git a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteGrpcResponseSender.scala b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteGrpcResponseSender.scala index 08496c36b28a..ba5ecc7a045a 100644 --- a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteGrpcResponseSender.scala +++ b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteGrpcResponseSender.scala @@ -63,15 +63,15 @@ private[connect] class ExecuteGrpcResponseSender[T <: Message]( /** * Interrupt this sender and make it exit. */ - def interrupt(): Unit = executionObserver.synchronized { + def interrupt(): Unit = { interrupted = true -executionObserver.notifyAll() +wakeUp() } // For testing - private[connect] def setDeadline(deadlineMs: Long) = executionObserver.synchronized { + private[connect] def setDeadline(deadlineMs: Long) = { deadlineTimeMillis = deadlineMs -executionObserver.notifyAll() +wakeUp() } def run(lastConsumedStreamIndex: Long): Unit = { @@ -152,9 +152,6 @@ private[connect] class ExecuteGrpcResponseSender[T <: Message]( s"lastConsumedStreamIndex=$lastConsumedStreamIndex") val startTime = System.nanoTime() -// register to be notified about available responses. 
-executionObserver.attachConsumer(this) - var nextIndex = lastConsumedStreamIndex + 1 var finished = false @@ -191,7 +188,7 @@ private[connect] class ExecuteGrpcResponseSender[T <: Message]( sentResponsesSize > maximumResponseSize || deadlineTimeMillis < System.currentTimeMillis() logTrace(s"Trying to get next response with index=$nextIndex.") - executionObserver.synchronized { + executionObserver.responseLock.synchronized { logTrace(s"Acquired executionObserver lock.") val sleepStart = System.nanoTime() var sleepEnd = 0L @@ -208,7 +205,7 @@ private[connect] class ExecuteGrpcResponseSender[T <: Message]( if (response.isEmpty) { val timeout = Math.max(1, deadlineTimeMillis - System.currentTimeMillis()) logTrace(s"Wait for response to become available with timeout=$timeout ms.") -executionObserver.wait(timeout) +executionObserver.responseLock.wait(timeout) logTrace(s"Reacquired executionObserver lock after waiting.") sleepEnd = System.nanoTime() } @@ -339,4 +336,15 @@ private[connect] class ExecuteGrpcResponseSender[T <: Message](
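The diff above replaces synchronization on `executionObserver` itself with a dedicated `responseLock` object that the sender waits on with a deadline. The general pattern — wait/notify on an explicit lock object rather than on a shared object that other code might also lock — can be sketched in plain Java. This is an illustrative sketch only; the class and method names (`ResponseBuffer`, `publish`, `awaitResponse`) are hypothetical and not from the Spark code:

```java
public class ResponseBuffer {
    // Dedicated lock object, instead of synchronizing on `this` or on another
    // shared object, so unrelated code cannot accidentally contend for the lock.
    private final Object responseLock = new Object();
    private String response;

    public void publish(String value) {
        synchronized (responseLock) {
            response = value;
            responseLock.notifyAll(); // wake any consumer waiting on the lock
        }
    }

    // Waits until a response is available or the deadline passes; returns null
    // on timeout or interrupt, loosely mirroring how the sender gives up at its
    // deadline or when interrupted.
    public String awaitResponse(long timeoutMs) {
        long deadlineMs = System.currentTimeMillis() + timeoutMs;
        synchronized (responseLock) {
            while (response == null) {
                long remaining = deadlineMs - System.currentTimeMillis();
                if (remaining <= 0) {
                    return null; // timed out
                }
                try {
                    responseLock.wait(remaining);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return null; // treat interrupt as a signal to exit
                }
            }
            return response;
        }
    }

    public static void main(String[] args) {
        ResponseBuffer buffer = new ResponseBuffer();
        buffer.publish("response-1");
        System.out.println(buffer.awaitResponse(100)); // response-1
        System.out.println(new ResponseBuffer().awaitResponse(5)); // null
    }
}
```

Note the `while` loop around `wait`: waiting threads must re-check the condition after waking, since `notifyAll` and timeouts can both return control without the condition holding.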
[spark] branch master updated (eae5c0e1efc -> 5ad57a70e51)
This is an automated email from the ASF dual-hosted git repository. hvanhovell pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from eae5c0e1efc [SPARK-45433][SQL] Fix CSV/JSON schema inference when timestamps do not match specified timestampFormat add 5ad57a70e51 [SPARK-45204][CONNECT] Add optional ExecuteHolder to SparkConnectPlanner No new revisions were added by this update. Summary of changes: .../connect/execution/ExecuteThreadRunner.scala| 7 +- .../execution/SparkConnectPlanExecution.scala | 2 +- .../sql/connect/planner/SparkConnectPlanner.scala | 84 ++ .../connect/planner/SparkConnectPlannerSuite.scala | 11 +-- .../plugin/SparkConnectPluginRegistrySuite.scala | 4 +- 5 files changed, 50 insertions(+), 58 deletions(-)
[spark] branch branch-3.5 updated: [SPARK-45433][SQL] Fix CSV/JSON schema inference when timestamps do not match specified timestampFormat
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 7e3ddc1e582 [SPARK-45433][SQL] Fix CSV/JSON schema inference when timestamps do not match specified timestampFormat 7e3ddc1e582 is described below commit 7e3ddc1e582a6e4fa96bab608c4c2bbc2c93b449 Author: Jia Fan AuthorDate: Wed Oct 11 19:33:23 2023 +0300 [SPARK-45433][SQL] Fix CSV/JSON schema inference when timestamps do not match specified timestampFormat ### What changes were proposed in this pull request? This PR fixes an error reported during CSV/JSON schema inference when timestamps do not match the specified timestampFormat.

```scala
// e.g.
val csv = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
  .option("inferSchema", true).csv(Seq("2884-06-24T02:45:51.138").toDS())
csv.show()
// error
// Caused by: java.time.format.DateTimeParseException: Text '2884-06-24T02:45:51.138'
// could not be parsed, unparsed text found at index 19
```

This bug only happened when the partition had one row. The data type should be `StringType`, not `TimestampType`, because the value does not match `timestampFormat`. Using CSV as an example: in `CSVInferSchema::tryParseTimestampNTZ`, inference first uses `timestampNTZFormatter.parseWithoutTimeZoneOptional` and returns `TimestampType`. If the same partition had another row, it would use `tryParseTimestamp` to parse that row with the user-defined `timestampFormat`, find that it can't be converted to a timestamp with `timestampFormat`, and finally return `StringType`. But when there is only one row, we use `timestampNTZFormatter.parseWithoutTimeZoneOptional` to parse normally timestamp not r [...] ### Why are the changes needed? Fix a schema inference bug. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add new test. ### Was this patch authored or co-authored using generative AI tooling? 
No Closes #43243 from Hisoka-X/SPARK-45433-inference-mismatch-timestamp-one-row. Authored-by: Jia Fan Signed-off-by: Max Gekk (cherry picked from commit eae5c0e1efce83c2bb08754784db070be285285a) Signed-off-by: Max Gekk --- .../org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala | 9 ++--- .../org/apache/spark/sql/catalyst/json/JsonInferSchema.scala | 8 +--- .../apache/spark/sql/catalyst/csv/CSVInferSchemaSuite.scala| 10 ++ .../apache/spark/sql/catalyst/json/JsonInferSchemaSuite.scala | 8 4 files changed, 29 insertions(+), 6 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala index 51586a0065e..ec01b56f9eb 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala @@ -27,7 +27,7 @@ import org.apache.spark.sql.catalyst.expressions.ExprUtils import org.apache.spark.sql.catalyst.util.{DateFormatter, TimestampFormatter} import org.apache.spark.sql.catalyst.util.LegacyDateFormats.FAST_DATE_FORMAT import org.apache.spark.sql.errors.QueryExecutionErrors -import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.internal.{LegacyBehaviorPolicy, SQLConf} import org.apache.spark.sql.types._ class CSVInferSchema(val options: CSVOptions) extends Serializable { @@ -202,8 +202,11 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { // We can only parse the value as TimestampNTZType if it does not have zone-offset or // time-zone component and can be parsed with the timestamp formatter. // Otherwise, it is likely to be a timestamp with timezone. 
-if (timestampNTZFormatter.parseWithoutTimeZoneOptional(field, false).isDefined) { - SQLConf.get.timestampType +val timestampType = SQLConf.get.timestampType +if ((SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY || +timestampType == TimestampNTZType) && +timestampNTZFormatter.parseWithoutTimeZoneOptional(field, false).isDefined) { + timestampType } else { tryParseTimestamp(field) } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala index 5385afe8c93..4123c5290b6 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala @@ -32,7 +32,7 @@ import org.apache.spark.sql.catalyst.json.JacksonUtils.nextUntil import org.apache.spark.sql.catalyst.util._ import
[spark] branch master updated: [SPARK-45433][SQL] Fix CSV/JSON schema inference when timestamps do not match specified timestampFormat
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new eae5c0e1efc [SPARK-45433][SQL] Fix CSV/JSON schema inference when timestamps do not match specified timestampFormat eae5c0e1efc is described below commit eae5c0e1efce83c2bb08754784db070be285285a Author: Jia Fan AuthorDate: Wed Oct 11 19:33:23 2023 +0300 [SPARK-45433][SQL] Fix CSV/JSON schema inference when timestamps do not match specified timestampFormat ### What changes were proposed in this pull request? This PR fixes an error reported during CSV/JSON schema inference when timestamps do not match the specified timestampFormat.

```scala
// e.g.
val csv = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
  .option("inferSchema", true).csv(Seq("2884-06-24T02:45:51.138").toDS())
csv.show()
// error
// Caused by: java.time.format.DateTimeParseException: Text '2884-06-24T02:45:51.138'
// could not be parsed, unparsed text found at index 19
```

This bug only happened when the partition had one row. The data type should be `StringType`, not `TimestampType`, because the value does not match `timestampFormat`. Using CSV as an example: in `CSVInferSchema::tryParseTimestampNTZ`, inference first uses `timestampNTZFormatter.parseWithoutTimeZoneOptional` and returns `TimestampType`. If the same partition had another row, it would use `tryParseTimestamp` to parse that row with the user-defined `timestampFormat`, find that it can't be converted to a timestamp with `timestampFormat`, and finally return `StringType`. But when there is only one row, we use `timestampNTZFormatter.parseWithoutTimeZoneOptional` to parse normally timestamp not r [...] ### Why are the changes needed? Fix a schema inference bug. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add new test. ### Was this patch authored or co-authored using generative AI tooling? 
No Closes #43243 from Hisoka-X/SPARK-45433-inference-mismatch-timestamp-one-row. Authored-by: Jia Fan Signed-off-by: Max Gekk --- .../org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala | 9 ++--- .../org/apache/spark/sql/catalyst/json/JsonInferSchema.scala | 8 +--- .../apache/spark/sql/catalyst/csv/CSVInferSchemaSuite.scala| 10 ++ .../apache/spark/sql/catalyst/json/JsonInferSchemaSuite.scala | 8 4 files changed, 29 insertions(+), 6 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala index 51586a0065e..ec01b56f9eb 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala @@ -27,7 +27,7 @@ import org.apache.spark.sql.catalyst.expressions.ExprUtils import org.apache.spark.sql.catalyst.util.{DateFormatter, TimestampFormatter} import org.apache.spark.sql.catalyst.util.LegacyDateFormats.FAST_DATE_FORMAT import org.apache.spark.sql.errors.QueryExecutionErrors -import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.internal.{LegacyBehaviorPolicy, SQLConf} import org.apache.spark.sql.types._ class CSVInferSchema(val options: CSVOptions) extends Serializable { @@ -202,8 +202,11 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { // We can only parse the value as TimestampNTZType if it does not have zone-offset or // time-zone component and can be parsed with the timestamp formatter. // Otherwise, it is likely to be a timestamp with timezone. 
-if (timestampNTZFormatter.parseWithoutTimeZoneOptional(field, false).isDefined) { - SQLConf.get.timestampType +val timestampType = SQLConf.get.timestampType +if ((SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY || +timestampType == TimestampNTZType) && +timestampNTZFormatter.parseWithoutTimeZoneOptional(field, false).isDefined) { + timestampType } else { tryParseTimestamp(field) } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala index 5385afe8c93..4123c5290b6 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala @@ -32,7 +32,7 @@ import org.apache.spark.sql.catalyst.json.JacksonUtils.nextUntil import org.apache.spark.sql.catalyst.util._ import org.apache.spark.sql.catalyst.util.LegacyDateFormats.FAST_DATE_FORMAT import
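The mismatch behind this fix can be reproduced directly with `java.time`, which underlies Spark's timestamp formatters: a value with fractional seconds cannot be fully parsed by a pattern that stops at whole seconds, which is exactly the `unparsed text found at index 19` error quoted above. A small illustrative sketch (the `matches` helper is hypothetical, not part of Spark):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class TimestampMismatch {
    // Returns true only if the value fully matches the pattern; trailing
    // unparsed text (like the ".138" fractional seconds) causes a failure.
    public static boolean matches(String value, String pattern) {
        try {
            LocalDateTime.parse(value, DateTimeFormatter.ofPattern(pattern));
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The ".138" suffix is not covered by the pattern, so parsing fails —
        // the value should therefore be inferred as StringType, not TimestampType.
        System.out.println(matches("2884-06-24T02:45:51.138", "yyyy-MM-dd'T'HH:mm:ss")); // false
        System.out.println(matches("2884-06-24T02:45:51", "yyyy-MM-dd'T'HH:mm:ss"));     // true
    }
}
```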
[spark] branch master updated: [SPARK-45483][CONNECT] Correct the function groups in connect.functions
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8e70c39650f [SPARK-45483][CONNECT] Correct the function groups in connect.functions 8e70c39650f is described below commit 8e70c39650f9abd329062a3651306da2d335aeb9 Author: Ruifeng Zheng AuthorDate: Wed Oct 11 08:37:48 2023 -0700 [SPARK-45483][CONNECT] Correct the function groups in connect.functions ### What changes were proposed in this pull request? Correct the function groups in connect.functions ### Why are the changes needed? to be consistent with https://github.com/apache/spark/commit/17da43803fd4c405fda00ffc2c7f4ff835ab24aa ### Does this PR introduce _any_ user-facing change? yes, will changes the scaladoc (when it is available) ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #43309 from zhengruifeng/connect_function_scaladoc. Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- .../scala/org/apache/spark/sql/functions.scala | 318 +++-- 1 file changed, 166 insertions(+), 152 deletions(-) diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala index 2adba11923d..9c5adca7e28 100644 --- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala +++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala @@ -55,16 +55,28 @@ import org.apache.spark.util.SparkClassUtils * only `Column` but also other types such as a native string. The other variants currently exist * for historical reasons. 
* - * @groupname udf_funcs UDF functions + * @groupname udf_funcs UDF, UDAF and UDT * @groupname agg_funcs Aggregate functions - * @groupname datetime_funcs Date time functions - * @groupname sort_funcs Sorting functions - * @groupname normal_funcs Non-aggregate functions - * @groupname math_funcs Math functions + * @groupname datetime_funcs Date and Timestamp functions + * @groupname sort_funcs Sort functions + * @groupname normal_funcs Normal functions + * @groupname math_funcs Mathematical functions + * @groupname bitwise_funcs Bitwise functions + * @groupname predicate_funcs Predicate functions + * @groupname conditional_funcs Conditional functions + * @groupname hash_funcs Hash functions * @groupname misc_funcs Misc functions * @groupname window_funcs Window functions + * @groupname generator_funcs Generator functions * @groupname string_funcs String functions * @groupname collection_funcs Collection functions + * @groupname array_funcs Array functions + * @groupname map_funcs Map functions + * @groupname struct_funcs Struct functions + * @groupname csv_funcs CSV functions + * @groupname json_funcs JSON functions + * @groupname xml_funcs XML functions + * @groupname url_funcs URL functions * @groupname partition_transforms Partition transform functions * @groupname Ungrouped Support functions for DataFrames * @@ -101,6 +113,7 @@ object functions { * Scala Symbol, it is converted into a [[Column]] also. Otherwise, a new [[Column]] is created * to represent the literal value. * + * @group normal_funcs * @since 3.4.0 */ def lit(literal: Any): Column = { @@ -145,7 +158,7 @@ object functions { /** * Creates a struct with the given field names and values. * - * @group normal_funcs + * @group struct_funcs * @since 3.5.0 */ def named_struct(cols: Column*): Column = Column.fn("named_struct", cols: _*) @@ -1610,7 +1623,7 @@ object functions { /** * Creates a new array column. The input columns must all have the same data type. 
* - * @group normal_funcs + * @group array_funcs * @since 3.4.0 */ @scala.annotation.varargs @@ -1619,7 +1632,7 @@ object functions { /** * Creates a new array column. The input columns must all have the same data type. * - * @group normal_funcs + * @group array_funcs * @since 3.4.0 */ @scala.annotation.varargs @@ -1632,7 +1645,7 @@ object functions { * value1, key2, value2, ...). The key columns must all have the same data type, and can't be * null. The value columns must all have the same data type. * - * @group normal_funcs + * @group map_funcs * @since 3.4.0 */ @scala.annotation.varargs @@ -1642,7 +1655,7 @@ object functions { * Creates a new map column. The array in the first column is used for keys. The array in the * second column is used for values. All elements in the array for key should not be null. * - * @group normal_funcs + * @group map_funcs * @since 3.4.0 */ def map_from_arrays(keys:
[spark] branch master updated: [SPARK-45499][CORE][TESTS] Replace `Reference#isEnqueued` with `Reference#refersTo(null)`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d1aff011fe1 [SPARK-45499][CORE][TESTS] Replace `Reference#isEnqueued` with `Reference#refersTo(null)` d1aff011fe1 is described below commit d1aff011fe1e78788fb5cb00d41a28e5925e4572 Author: yangjie01 AuthorDate: Wed Oct 11 08:11:52 2023 -0700 [SPARK-45499][CORE][TESTS] Replace `Reference#isEnqueued` with `Reference#refersTo(null)` ### What changes were proposed in this pull request? This PR just replaces `Reference#isEnqueued` with `Reference#refersTo` in `CompletionIteratorSuite`; the solution refers to https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/java/lang/ref/Reference.java#L436-L454

```java
/**
 * @deprecated
 * This method was originally specified to test if a reference object has
 * been cleared and enqueued but was never implemented to do this test.
 * This method could be misused due to the inherent race condition
 * or without an associated {@code ReferenceQueue}.
 * An application relying on this method to release critical resources
 * could cause serious performance issue.
 * An application should use {@link ReferenceQueue} to reliably determine
 * what reference objects that have been enqueued or
 * {@link #refersTo(Object) refersTo(null)} to determine if this reference
 * object has been cleared.
 *
 * @return {@code true} if and only if this reference object is
 *         in its associated queue (if any).
 */
@Deprecated(since="16")
public boolean isEnqueued() {
    return (this.queue == ReferenceQueue.ENQUEUED);
}
```

### Why are the changes needed? Clean up deprecated API usage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? 
No Closes #43325 from LuciferYang/SPARK-45499. Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- .../test/scala/org/apache/spark/util/CompletionIteratorSuite.scala | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/core/src/test/scala/org/apache/spark/util/CompletionIteratorSuite.scala b/core/src/test/scala/org/apache/spark/util/CompletionIteratorSuite.scala index 29421f7aa9e..297e4fd53ab 100644 --- a/core/src/test/scala/org/apache/spark/util/CompletionIteratorSuite.scala +++ b/core/src/test/scala/org/apache/spark/util/CompletionIteratorSuite.scala @@ -57,13 +57,13 @@ class CompletionIteratorSuite extends SparkFunSuite { sub = null iter.toArray -for (_ <- 1 to 100 if !ref.isEnqueued) { +for (_ <- 1 to 100 if !ref.refersTo(null)) { System.gc() - if (!ref.isEnqueued) { + if (!ref.refersTo(null)) { Thread.sleep(10) } } -assert(ref.isEnqueued) +assert(ref.refersTo(null)) assert(refQueue.poll() === ref) } }
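`refersTo`, added in JDK 16, checks the referent directly instead of relying on queue state, which is why it is the recommended replacement for the deprecated `isEnqueued`. A minimal illustration (requires JDK 16+; `clear()` is used here so the check is deterministic, since real GC timing — as in the test's `System.gc()` retry loop — is not):

```java
import java.lang.ref.WeakReference;

public class RefersToDemo {
    public static void main(String[] args) {
        Object obj = new Object();
        WeakReference<Object> ref = new WeakReference<>(obj);

        // While the referent is strongly reachable, the reference is not cleared.
        System.out.println(ref.refersTo(null)); // false
        System.out.println(ref.refersTo(obj));  // true

        // clear() deterministically drops the referent; after this,
        // refersTo(null) reports the reference as cleared.
        ref.clear();
        System.out.println(ref.refersTo(null)); // true
    }
}
```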
[spark] branch master updated: [SPARK-42881][SQL] Codegen Support for get_json_object
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c2525308330 [SPARK-42881][SQL] Codegen Support for get_json_object c2525308330 is described below commit c252530833097759b1f943ff89b05f22025f0dd0 Author: panbingkun AuthorDate: Wed Oct 11 17:42:48 2023 +0300 [SPARK-42881][SQL] Codegen Support for get_json_object ### What changes were proposed in this pull request? The PR adds Codegen Support for get_json_object. ### Why are the changes needed? Improve codegen coverage and performance. GitHub benchmark data (https://github.com/panbingkun/spark/actions/runs/4497396473/jobs/7912952710): https://user-images.githubusercontent.com/15246973/227117793-bab38c42-dcc1-46de-a689-25a87b8f3561.png Local benchmark data: https://user-images.githubusercontent.com/15246973/227098745-9b360e60-fe84-4419-8b7d-073a0530816a.png ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add new UT. Pass GA. Closes #40506 from panbingkun/json_code_gen. 
Authored-by: panbingkun Signed-off-by: Max Gekk --- .../sql/catalyst/expressions/jsonExpressions.scala | 121 +--- sql/core/benchmarks/JsonBenchmark-results.txt | 127 +++-- .../org/apache/spark/sql/JsonFunctionsSuite.scala | 28 + .../execution/datasources/json/JsonBenchmark.scala | 15 ++- 4 files changed, 208 insertions(+), 83 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala index e7df542ddab..04bc457b66a 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala @@ -28,7 +28,8 @@ import com.fasterxml.jackson.core.json.JsonReadFeature import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.analysis.TypeCheckResult import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.DataTypeMismatch -import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback +import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, CodeGenerator, CodegenFallback, ExprCode} +import org.apache.spark.sql.catalyst.expressions.codegen.Block.BlockHelper import org.apache.spark.sql.catalyst.json._ import org.apache.spark.sql.catalyst.trees.TreePattern.{JSON_TO_STRUCT, TreePattern} import org.apache.spark.sql.catalyst.util._ @@ -125,13 +126,7 @@ private[this] object SharedFactory { group = "json_funcs", since = "1.5.0") case class GetJsonObject(json: Expression, path: Expression) - extends BinaryExpression with ExpectsInputTypes with CodegenFallback { - - import com.fasterxml.jackson.core.JsonToken._ - - import PathInstruction._ - import SharedFactory._ - import WriteStyle._ + extends BinaryExpression with ExpectsInputTypes { override def left: Expression = json override def right: Expression = path @@ -140,18 +135,114 @@ case class 
GetJsonObject(json: Expression, path: Expression) override def nullable: Boolean = true override def prettyName: String = "get_json_object" - @transient private lazy val parsedPath = parsePath(path.eval().asInstanceOf[UTF8String]) + @transient + private lazy val evaluator = if (path.foldable) { +new GetJsonObjectEvaluator(path.eval().asInstanceOf[UTF8String]) + } else { +new GetJsonObjectEvaluator() + } override def eval(input: InternalRow): Any = { -val jsonStr = json.eval(input).asInstanceOf[UTF8String] +evaluator.setJson(json.eval(input).asInstanceOf[UTF8String]) +if (!path.foldable) { + evaluator.setPath(path.eval(input).asInstanceOf[UTF8String]) +} +evaluator.evaluate() + } + + protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { +val evaluatorClass = classOf[GetJsonObjectEvaluator].getName +val initEvaluator = path.foldable match { + case true if path.eval() != null => +val cachedPath = path.eval().asInstanceOf[UTF8String] +val refCachedPath = ctx.addReferenceObj("cachedPath", cachedPath) +s"new $evaluatorClass($refCachedPath)" + case _ => s"new $evaluatorClass()" +} +val evaluator = ctx.addMutableState(evaluatorClass, "evaluator", + v => s"""$v = $initEvaluator;""", forceInline = true) + +val jsonEval = json.genCode(ctx) +val pathEval = path.genCode(ctx) + +val setJson = + s""" + |if (${jsonEval.isNull}) { + | $evaluator.setJson(null); + |} else { + | $evaluator.setJson(${jsonEval.value}); + |} +
[spark] branch master updated: [SPARK-45467][CORE] Replace `Proxy.getProxyClass()` with `Proxy.newProxyInstance().getClass`
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new acd5dc499d1 [SPARK-45467][CORE] Replace `Proxy.getProxyClass()` with `Proxy.newProxyInstance().getClass` acd5dc499d1 is described below commit acd5dc499d139ce8b2571a69beab0f971947adb4 Author: YangJie AuthorDate: Wed Oct 11 08:49:09 2023 -0500 [SPARK-45467][CORE] Replace `Proxy.getProxyClass()` with `Proxy.newProxyInstance().getClass` ### What changes were proposed in this pull request? This PR replaces `Proxy.getProxyClass()` with `Proxy.newProxyInstance().getClass` to clean up deprecated API usage; refer to https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/java/lang/reflect/Proxy.java#L376-L391

```java
/**
 * @deprecated Proxy classes generated in a named module are encapsulated
 *      and not accessible to code outside its module.
 *      {@link Constructor#newInstance(Object...) Constructor.newInstance}
 *      will throw {@code IllegalAccessException} when it is called on
 *      an inaccessible proxy class.
 *      Use {@link #newProxyInstance(ClassLoader, Class[], InvocationHandler)}
 *      to create a proxy instance instead.
 *
 * @see Package and Module Membership of Proxy Class
 * @revised 9
 */
@Deprecated
@CallerSensitive
public static Class<?> getProxyClass(ClassLoader loader, Class<?>... interfaces)
    throws IllegalArgumentException
```

For the `InvocationHandler`: since the `invoke` method doesn't need to be actually called in the current scenario, but the `InvocationHandler` can't be null, a new `DummyInvocationHandler` has been added as follows:

```scala
private[spark] object DummyInvocationHandler extends InvocationHandler {
  override def invoke(proxy: Any, method: Method, args: Array[AnyRef]): AnyRef = {
    throw new UnsupportedOperationException("Not implemented")
  }
}
```

### Why are the changes needed? 
Clean up deprecated API usage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #43291 from LuciferYang/SPARK-45467. Lead-authored-by: YangJie Co-authored-by: yangjie01 Signed-off-by: Sean Owen --- .../main/scala/org/apache/spark/serializer/JavaSerializer.scala | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala b/core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala index 95d2bdc39e1..856e639fcd9 100644 --- a/core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala +++ b/core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala @@ -18,6 +18,7 @@ package org.apache.spark.serializer import java.io._ +import java.lang.reflect.{InvocationHandler, Method, Proxy} import java.nio.ByteBuffer import scala.reflect.ClassTag @@ -79,7 +80,7 @@ private[spark] class JavaDeserializationStream(in: InputStream, loader: ClassLoa // scalastyle:off classforname val resolved = ifaces.map(iface => Class.forName(iface, false, loader)) // scalastyle:on classforname - java.lang.reflect.Proxy.getProxyClass(loader, resolved: _*) + Proxy.newProxyInstance(loader, resolved, DummyInvocationHandler).getClass } } @@ -88,6 +89,12 @@ private[spark] class JavaDeserializationStream(in: InputStream, loader: ClassLoa def close(): Unit = { objIn.close() } } +private[spark] object DummyInvocationHandler extends InvocationHandler { + override def invoke(proxy: Any, method: Method, args: Array[AnyRef]): AnyRef = { +throw new UnsupportedOperationException("Not implemented") + } +} + private object JavaDeserializationStream { val primitiveMappings = Map[String, Class[_]]( - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
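The replacement pattern can be exercised in isolation with plain JDK reflection: create a proxy instance with a handler that is never meant to be invoked, then take its class. A small sketch (the `proxyClassFor` helper is illustrative and hypothetical, not Spark's actual code):

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

public class ProxyClassDemo {
    // A handler that is never expected to be invoked; it exists only so that
    // newProxyInstance can be called, since the handler argument can't be null.
    static final InvocationHandler DUMMY = (proxy, method, args) -> {
        throw new UnsupportedOperationException("Not implemented");
    };

    // Equivalent of the deprecated Proxy.getProxyClass(loader, interfaces):
    // build a throwaway instance and ask for its class.
    public static Class<?> proxyClassFor(ClassLoader loader, Class<?>... interfaces) {
        return Proxy.newProxyInstance(loader, interfaces, DUMMY).getClass();
    }

    public static void main(String[] args) {
        Class<?> cls = proxyClassFor(null, Runnable.class);
        System.out.println(Proxy.isProxyClass(cls));              // true
        System.out.println(Runnable.class.isAssignableFrom(cls)); // true
    }
}
```

The throwaway instance is cheap: proxy construction does not invoke the handler, so the `UnsupportedOperationException` is never thrown in this flow.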
[spark] branch master updated (11af786b35c -> 97218051308)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 11af786b35c [SPARK-45451][SQL] Make the default storage level of dataset cache configurable add 97218051308 [SPARK-45496][CORE][DSTREAM] Fix the compilation warning related to `other-pure-statement` No new revisions were added by this update. Summary of changes: .../org/apache/spark/scheduler/OutputCommitCoordinatorSuite.scala | 2 +- pom.xml | 3 --- project/SparkBuild.scala | 4 .../org/apache/spark/streaming/util/FileBasedWriteAheadLog.scala | 2 +- 4 files changed, 2 insertions(+), 9 deletions(-)
[spark] branch master updated: [SPARK-45451][SQL] Make the default storage level of dataset cache configurable

This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 11af786b35c [SPARK-45451][SQL] Make the default storage level of dataset cache configurable
11af786b35c is described below

commit 11af786b35cabe6d139dd9763ccf1af9ceb7eb9f
Author: ulysses-you
AuthorDate: Wed Oct 11 20:51:22 2023 +0800

    [SPARK-45451][SQL] Make the default storage level of dataset cache configurable

    ### What changes were proposed in this pull request?
    This PR adds a new config, `spark.sql.defaultCacheStorageLevel`, so that people can use `set spark.sql.defaultCacheStorageLevel=xxx` to change the default storage level of `dataset.cache`.

    ### Why are the changes needed?
    Most people use the default storage level, so this PR makes it easy to change the storage level without touching code.

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    Add test

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #43259 from ulysses-you/cache.

    Authored-by: ulysses-you
    Signed-off-by: Wenchen Fan
---
 .../org/apache/spark/sql/internal/SQLConf.scala    | 13 +
 .../main/scala/org/apache/spark/sql/Dataset.scala  |  5 +--
 .../execution/datasources/v2/CacheTableExec.scala  | 22 +
 .../apache/spark/sql/internal/CatalogImpl.scala    |  4 +--
 .../org/apache/spark/sql/CachedTableSuite.scala    | 37 ++
 5 files changed, 61 insertions(+), 20 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 65d2e6136e9..12ec9e911d3 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -44,6 +44,7 @@ import org.apache.spark.sql.catalyst.util.DateTimeUtils
 import org.apache.spark.sql.connector.catalog.CatalogManager.SESSION_CATALOG_NAME
 import org.apache.spark.sql.errors.{QueryCompilationErrors, QueryExecutionErrors}
 import org.apache.spark.sql.types.{AtomicType, TimestampNTZType, TimestampType}
+import org.apache.spark.storage.{StorageLevel, StorageLevelMapper}
 import org.apache.spark.unsafe.array.ByteArrayMethods
 import org.apache.spark.util.Utils
@@ -1563,6 +1564,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)

+  val DEFAULT_CACHE_STORAGE_LEVEL = buildConf("spark.sql.defaultCacheStorageLevel")
+    .doc("The default storage level of `dataset.cache()`, `catalog.cacheTable()` and " +
+      "sql query `CACHE TABLE t`.")
+    .version("4.0.0")
+    .stringConf
+    .transform(_.toUpperCase(Locale.ROOT))
+    .checkValues(StorageLevelMapper.values.map(_.name()).toSet)
+    .createWithDefault(StorageLevelMapper.MEMORY_AND_DISK.name())
+
   val CROSS_JOINS_ENABLED = buildConf("spark.sql.crossJoin.enabled")
     .internal()
     .doc("When false, we will throw an error if a query contains a cartesian product without " +
@@ -5027,6 +5037,9 @@ class SQLConf extends Serializable with Logging with SqlApiConf {
   def groupByAliases: Boolean = getConf(GROUP_BY_ALIASES)

+  def defaultCacheStorageLevel: StorageLevel =
+    StorageLevel.fromString(getConf(DEFAULT_CACHE_STORAGE_LEVEL))
+
   def crossJoinEnabled: Boolean = getConf(SQLConf.CROSS_JOINS_ENABLED)

   override def sessionLocalTimeZone: String = getConf(SQLConf.SESSION_LOCAL_TIMEZONE)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index 0cc037b157e..5079cfcca9d 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -3798,10 +3798,7 @@ class Dataset[T] private[sql](
    * @group basic
    * @since 1.6.0
    */
-  def persist(): this.type = {
-    sparkSession.sharedState.cacheManager.cacheQuery(this)
-    this
-  }
+  def persist(): this.type = persist(sparkSession.sessionState.conf.defaultCacheStorageLevel)

   /**
    * Persist this Dataset with the default storage level (`MEMORY_AND_DISK`).

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CacheTableExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CacheTableExec.scala
index 8c14b5e3707..1744df83033 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CacheTableExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CacheTableExec.scala
@@ -38,25 +38,19 @@ trait BaseCacheTableExec extends LeafV2CommandExec {
   override def run(): Seq[InternalRow] = {
     val storageLevelKey =
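The new conf upper-cases the user's value and validates it against the known storage-level names before falling back to the default. A standalone Java sketch of that transform/check/default pattern; the name set here is an abbreviated stand-in, not Spark's full `StorageLevelMapper` list:

```java
import java.util.Locale;
import java.util.Set;

public class StorageLevelConf {
    // Abbreviated stand-in for StorageLevelMapper.values().map(_.name())
    static final Set<String> VALID =
        Set.of("NONE", "MEMORY_ONLY", "MEMORY_AND_DISK", "DISK_ONLY");
    static final String DEFAULT = "MEMORY_AND_DISK";

    // Mirrors .transform(_.toUpperCase(Locale.ROOT)).checkValues(...).createWithDefault(...)
    static String resolve(String userValue) {
        if (userValue == null) {
            return DEFAULT;  // unset conf -> default storage level
        }
        String upper = userValue.toUpperCase(Locale.ROOT);
        if (!VALID.contains(upper)) {
            throw new IllegalArgumentException("Invalid storage level: " + userValue);
        }
        return upper;
    }

    public static void main(String[] args) {
        System.out.println(resolve("memory_only"));  // case-insensitive input
        System.out.println(resolve(null));           // falls back to the default
    }
}
```

Validating at conf-set time rather than at cache time is the same design choice the commit makes: a typo in `SET spark.sql.defaultCacheStorageLevel=...` fails fast instead of surfacing later as a caching error.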
[spark] branch master updated (e1a7b84f47b -> ae112e4279f)

This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

from e1a7b84f47b [SPARK-45397][ML][CONNECT] Add array assembler feature transformer
 add ae112e4279f [SPARK-45116][SQL] Add some comment for param of JdbcDialect `createTable`

No new revisions were added by this update.

Summary of changes:
 .../main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)
[spark] branch master updated (8394ebb52b9 -> e1a7b84f47b)

This is an automated email from the ASF dual-hosted git repository.

weichenxu123 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

from 8394ebb52b9 [SPARK-45469][CORE][SQL][CONNECT][PYTHON] Replace `toIterator` with `iterator` for `IterableOnce`
 add e1a7b84f47b [SPARK-45397][ML][CONNECT] Add array assembler feature transformer

No new revisions were added by this update.

Summary of changes:
 .../docs/source/reference/pyspark.ml.connect.rst   |   1 +
 python/pyspark/ml/connect/base.py                  |   6 +-
 python/pyspark/ml/connect/feature.py               | 156 +++++++++++++++-
 python/pyspark/ml/connect/util.py                  |   6 +-
 python/pyspark/ml/param/_shared_params_code_gen.py |   7 +
 python/pyspark/ml/param/shared.py                  |  22 +++
 .../ml/tests/connect/test_legacy_mode_feature.py   |  48 ++++++
 7 files changed, 238 insertions(+), 8 deletions(-)
[spark] branch master updated (0c1e9a5b19c -> 8394ebb52b9)

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

from 0c1e9a5b19c [SPARK-45500][CORE][WEBUI] Show the number of abnormally completed drivers in MasterPage
 add 8394ebb52b9 [SPARK-45469][CORE][SQL][CONNECT][PYTHON] Replace `toIterator` with `iterator` for `IterableOnce`

No new revisions were added by this update.

Summary of changes:
 .../jvm/src/main/scala/org/apache/spark/sql/functions.scala       | 2 +-
 .../src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala  | 2 +-
 core/src/main/scala/org/apache/spark/util/collection/Utils.scala  | 2 +-
 .../apache/spark/sql/catalyst/expressions/objects/objects.scala   | 4 ++--
 .../main/scala/org/apache/spark/sql/execution/GenerateExec.scala  | 8 ++++----
 .../org/apache/spark/sql/execution/arrow/ArrowConverters.scala    | 2 +-
 .../spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala   | 2 +-
 .../spark/sql/execution/python/BatchEvalPythonUDTFExec.scala      | 2 +-
 sql/core/src/main/scala/org/apache/spark/sql/functions.scala      | 2 +-
 9 files changed, 13 insertions(+), 13 deletions(-)
[spark] branch master updated: [SPARK-45500][CORE][WEBUI] Show the number of abnormally completed drivers in MasterPage

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 0c1e9a5b19c [SPARK-45500][CORE][WEBUI] Show the number of abnormally completed drivers in MasterPage
0c1e9a5b19c is described below

commit 0c1e9a5b19c4cfb77c86f3871998cf2673260b06
Author: Dongjoon Hyun
AuthorDate: Wed Oct 11 01:46:29 2023 -0700

    [SPARK-45500][CORE][WEBUI] Show the number of abnormally completed drivers in MasterPage

    ### What changes were proposed in this pull request?
    This PR aims to show the number of abnormally completed drivers in MasterPage.

    ### Why are the changes needed?
    In the `Completed Drivers` table, there are various exit states.
    https://github.com/apache/spark/assets/9700541/ff0b33f5-c546-42e7-870c-8323e2eefded
    We had better show the abnormally completed drivers at the top of the page.

    **BEFORE**
    ```
    Drivers: 0 Running (0 Waiting), 7 Completed
    ```

    **AFTER**
    ```
    Drivers: 0 Running (0 Waiting), 7 Completed (1 Killed, 4 Failed, 0 Error)
    ```
    https://github.com/apache/spark/assets/9700541/94deab1f-b9f7-4e5b-8284-aaac4f7520df

    ### Does this PR introduce _any_ user-facing change?
    Yes, this is a new UI field. However, since this is UI, there will be no technical issues.

    ### How was this patch tested?
    Manually build Spark and check the UI.

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #43328 from dongjoon-hyun/SPARK-45500.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 .../main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
index cc4370ad02e..5c1887be5b8 100644
--- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
@@ -153,7 +153,11 @@ private[ui] class MasterPage(parent: MasterWebUI) extends WebUIPage("") {
       Drivers:
       {state.activeDrivers.length} Running
       ({state.activeDrivers.count(_.state == DriverState.SUBMITTED)} Waiting),
-      {state.completedDrivers.length} Completed
+      {state.completedDrivers.length} Completed
+      ({state.completedDrivers.count(_.state == DriverState.KILLED)} Killed,
+      {state.completedDrivers.count(_.state == DriverState.FAILED)} Failed,
+      {state.completedDrivers.count(_.state == DriverState.ERROR)} Error)
+
       Status: {state.status}
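The new summary line counts `completedDrivers` once per terminal state. A minimal Java sketch of the same counting logic, producing the "AFTER" string from the PR description; the `DriverState` enum and sample data here are stand-ins, not Spark's actual classes:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class DriverSummary {
    // Stand-in for Spark's DriverState enum.
    enum DriverState { FINISHED, KILLED, FAILED, ERROR }

    public static void main(String[] args) {
        List<DriverState> completed = List.of(
            DriverState.FINISHED, DriverState.FINISHED, DriverState.KILLED,
            DriverState.FAILED, DriverState.FAILED, DriverState.FAILED,
            DriverState.FAILED);

        // One grouping pass, equivalent to repeated
        // state.completedDrivers.count(_.state == ...) calls in the commit.
        Map<DriverState, Long> byState = completed.stream()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        String summary = String.format("%d Completed (%d Killed, %d Failed, %d Error)",
            completed.size(),
            byState.getOrDefault(DriverState.KILLED, 0L),
            byState.getOrDefault(DriverState.FAILED, 0L),
            byState.getOrDefault(DriverState.ERROR, 0L));
        System.out.println(summary);  // 7 Completed (1 Killed, 4 Failed, 0 Error)
    }
}
```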
[spark] branch master updated: [SPARK-45480][SQL][UI] Selectable Spark Plan Node on UI

This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 305db078ff2 [SPARK-45480][SQL][UI] Selectable Spark Plan Node on UI
305db078ff2 is described below

commit 305db078ff25c3c29b222f197b3293dd84db3045
Author: Kent Yao
AuthorDate: Wed Oct 11 15:49:14 2023 +0800

    [SPARK-45480][SQL][UI] Selectable Spark Plan Node on UI

    ### What changes were proposed in this pull request?
    This PR introduces selectable animation for Spark SQL Plan Node On UI, which lights up the selected node and its linked nodes and edges.

    ### Why are the changes needed?
    Better UX for SQL plan visualization and debugging. Especially for large queries, users can now concentrate on the current node and its nearest neighbors to get a better understanding of node lineage.

    ### Does this PR introduce _any_ user-facing change?
    Yes, let's see the video.

    ### How was this patch tested?
    https://github.com/apache/spark/assets/8326978/f5ba884c-acce-46b8-8568-3ead55c91d4f

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #43307 from yaooqinn/SPARK-45480.

    Authored-by: Kent Yao
    Signed-off-by: Kent Yao
---
 .../sql/execution/ui/static/spark-sql-viz.css      | 18 +
 .../spark/sql/execution/ui/static/spark-sql-viz.js | 43 ++
 2 files changed, 61 insertions(+)

diff --git a/sql/core/src/main/resources/org/apache/spark/sql/execution/ui/static/spark-sql-viz.css b/sql/core/src/main/resources/org/apache/spark/sql/execution/ui/static/spark-sql-viz.css
index dbdbf9fbf57..d6a498e9387 100644
--- a/sql/core/src/main/resources/org/apache/spark/sql/execution/ui/static/spark-sql-viz.css
+++ b/sql/core/src/main/resources/org/apache/spark/sql/execution/ui/static/spark-sql-viz.css
@@ -57,3 +57,21 @@
 .job-url {
   word-wrap: break-word;
 }
+
+#plan-viz-graph svg g.node rect.selected {
+  fill: #E25A1CFF;
+  stroke: #317EACFF;
+  stroke-width: 2px;
+}
+
+#plan-viz-graph svg g.node rect.linked {
+  fill: #FFC106FF;
+  stroke: #317EACFF;
+  stroke-width: 2px;
+}
+
+#plan-viz-graph svg path.linked {
+  fill: #317EACFF;
+  stroke: #317EACFF;
+  stroke-width: 2px;
+}

diff --git a/sql/core/src/main/resources/org/apache/spark/sql/execution/ui/static/spark-sql-viz.js b/sql/core/src/main/resources/org/apache/spark/sql/execution/ui/static/spark-sql-viz.js
index 96a7a7a3cc0..d4cc45a1639 100644
--- a/sql/core/src/main/resources/org/apache/spark/sql/execution/ui/static/spark-sql-viz.js
+++ b/sql/core/src/main/resources/org/apache/spark/sql/execution/ui/static/spark-sql-viz.js
@@ -47,6 +47,7 @@ function renderPlanViz() {
     .attr("ry", "5");

   setupLayoutForSparkPlanCluster(g, svg);
+  setupSelectionForSparkPlanNode(g);
   setupTooltipForSparkPlanNode(g);
   resizeSvg(svg);
   postprocessForAdditionalMetrics();
@@ -269,3 +270,45 @@ function togglePlanViz() { // eslint-disable-line no-unused-vars
     planVizContainer().style("display", "none");
   }
 }
+
+/*
+ * Light up the selected node and its linked nodes and edges.
+ */
+function setupSelectionForSparkPlanNode(g) {
+  const linkedNodes = new Map();
+  const linkedEdges = new Map();
+
+  g.edges().forEach(function (e) {
+    const edge = g.edge(e);
+    const from = g.node(e.v);
+    const to = g.node(e.w);
+    collectLinks(linkedNodes, from.id, to.id);
+    collectLinks(linkedNodes, to.id, from.id);
+    collectLinks(linkedEdges, from.id, edge.arrowheadId);
+    collectLinks(linkedEdges, to.id, edge.arrowheadId);
+  });
+
+  linkedNodes.forEach((linkedNodes, selectNode) => {
+    d3.select("#" + selectNode).on("click", () => {
+      planVizContainer().selectAll(".selected").classed("selected", false);
+      planVizContainer().selectAll(".linked").classed("linked", false);
+      d3.select("#" + selectNode + " rect").classed("selected", true);
+      linkedNodes.forEach((linkedNode) => {
+        d3.select("#" + linkedNode + " rect").classed("linked", true);
+      });
+      linkedEdges.get(selectNode).forEach((linkedEdge) => {
+        const arrowHead = d3.select("#" + linkedEdge + " path");
+        arrowHead.classed("linked", true);
+        const arrowShaft = $(arrowHead.node()).parents("g.edgePath").children("path");
+        arrowShaft.addClass("linked");
+      });
+    });
+  });
+}
+
+function collectLinks(map, key, value) {
+  if (!map.has(key)) {
+    map.set(key, new Set());
+  }
+  map.get(key).add(value);
+}
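The `collectLinks` helper above builds a "multimap" (a Map from a key to a Set of values), creating the Set lazily on first use. In Java the same pattern is a one-liner with `computeIfAbsent`; this standalone sketch mirrors the JS helper (the node names are illustrative):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MultimapDemo {
    // Equivalent of the JS collectLinks(map, key, value):
    // create the Set on first use, then add the value.
    static void collectLinks(Map<String, Set<String>> map, String key, String value) {
        map.computeIfAbsent(key, k -> new HashSet<>()).add(value);
    }

    public static void main(String[] args) {
        Map<String, Set<String>> linkedNodes = new HashMap<>();
        // Register both directions of an edge, as the commit does for from/to.
        collectLinks(linkedNodes, "node1", "node2");
        collectLinks(linkedNodes, "node2", "node1");
        collectLinks(linkedNodes, "node1", "node3");
        System.out.println(linkedNodes.get("node1").size());   // 2
        System.out.println(linkedNodes.containsKey("node3"));  // false
    }
}
```

Using a Set rather than a List deduplicates repeated edges between the same pair of nodes, which is why clicking a node never highlights a neighbor twice.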
[spark] branch master updated: [SPARK-45497][K8S] Add a symbolic link file `spark-examples.jar` in K8s Docker images

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 64ac59d7129 [SPARK-45497][K8S] Add a symbolic link file `spark-examples.jar` in K8s Docker images
64ac59d7129 is described below

commit 64ac59d71296e631ded97b332660d3db5675623a
Author: Dongjoon Hyun
AuthorDate: Wed Oct 11 00:35:38 2023 -0700

    [SPARK-45497][K8S] Add a symbolic link file `spark-examples.jar` in K8s Docker images

    ### What changes were proposed in this pull request?
    This PR aims to add a symbolic link file, `spark-examples.jar`, in the example jar directory.

    ```
    $ docker run -it --rm spark:latest ls -al /opt/spark/examples/jars | tail -n6
    total 1620
    drwxr-xr-x 1 root root    4096 Oct 11 04:37 .
    drwxr-xr-x 1 root root    4096 Sep  9 02:08 ..
    -rw-r--r-- 1 root root   78803 Sep  9 02:08 scopt_2.12-3.7.1.jar
    -rw-r--r-- 1 root root 1564255 Sep  9 02:08 spark-examples_2.12-3.5.0.jar
    lrwxrwxrwx 1 root root      29 Oct 11 04:37 spark-examples.jar -> spark-examples_2.12-3.5.0.jar
    ```

    ### Why are the changes needed?
    Like the PySpark example (`pi.py`), we can submit the examples without considering the version numbers, which was painful before.

    ```
    bin/spark-submit \
      --master k8s://$K8S_MASTER \
      --deploy-mode cluster \
      ...
      --class org.apache.spark.examples.SparkPi \
      local:///opt/spark/examples/jars/spark-examples.jar 1
    ```

    The following is the driver pod log.

    ```
    + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit ... --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi local:///opt/spark/examples/jars/spark-examples.jar 1
    Files local:///opt/spark/examples/jars/spark-examples.jar from /opt/spark/examples/jars/spark-examples.jar to /opt/spark/work-dir/./spark-examples.jar
    ```

    ### Does this PR introduce _any_ user-facing change?
    No, this is an additional file.

    ### How was this patch tested?
    Manually build the docker image and do `ls`.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #43324 from dongjoon-hyun/SPARK-45497.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 .../kubernetes/docker/src/main/dockerfiles/spark/Dockerfile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
index 02559b8de2d..b80e72c768c 100644
--- a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
+++ b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
@@ -49,6 +49,7 @@ COPY sbin /opt/spark/sbin
 COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/
 COPY kubernetes/dockerfiles/spark/decom.sh /opt/
 COPY examples /opt/spark/examples
+RUN ln -s $(basename $(ls /opt/spark/examples/jars/spark-examples_*.jar)) /opt/spark/examples/jars/spark-examples.jar
 COPY kubernetes/tests /opt/spark/tests
 COPY data /opt/spark/data
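The Dockerfile's `RUN ln -s $(basename ...)` line finds the versioned jar by glob and creates a version-agnostic *relative* symlink next to it. A hedged Java sketch of the same idea using `java.nio.file` (the temp directory and file names are made up for illustration; this is not the image build code):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SymlinkDemo {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("jars");
        // Stand-in for the versioned jar copied into the image.
        Files.createFile(dir.resolve("spark-examples_2.12-3.5.0.jar"));

        // Glob for spark-examples_*.jar, like `ls /opt/spark/examples/jars/...`.
        Path found = null;
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(dir, "spark-examples_*.jar")) {
            for (Path p : stream) {
                found = p;
            }
        }

        // Like `ln -s $(basename ...)`: the link target is just the file name
        // (relative), so the link keeps working wherever the directory is mounted.
        Path link = Files.createSymbolicLink(
            dir.resolve("spark-examples.jar"), found.getFileName());

        System.out.println(Files.isSymbolicLink(link));
        System.out.println(Files.readSymbolicLink(link));
    }
}
```

The relative target is the key detail: an absolute target baked at build time would break if the examples directory were copied or mounted elsewhere.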