[GitHub] [spark-website] LuciferYang commented on pull request #462: Add Jie Yang to committers
LuciferYang commented on PR #462: URL: https://github.com/apache/spark-website/pull/462#issuecomment-1548985598 Thanks all ~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[spark-website] branch asf-site updated: Add Jie Yang to committers
This is an automated email from the ASF dual-hosted git repository. yangjie01 pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/spark-website.git The following commit(s) were added to refs/heads/asf-site by this push: new aa3fc86513 Add Jie Yang to committers aa3fc86513 is described below commit aa3fc86513f430f5aa9630ccf056a222c68886f1 Author: yangjie01 AuthorDate: Tue May 16 12:46:59 2023 +0800 Add Jie Yang to committers Author: yangjie01 Closes #462 from LuciferYang/add-yangjie. --- committers.md| 1 + site/committers.html | 4 2 files changed, 5 insertions(+) diff --git a/committers.md b/committers.md index 827073a0d6..ce4ebefbdf 100644 --- a/committers.md +++ b/committers.md @@ -92,6 +92,7 @@ navigation: |Reynold Xin|Databricks| |Weichen Xu|Databricks| |Takeshi Yamamuro|NTT| +|Jie Yang|Baidu| |Kent Yao|NetEase| |Burak Yavuz|Databricks| |Matei Zaharia|Databricks, Stanford| diff --git a/site/committers.html b/site/committers.html index 47a29b7ee2..c06c053ed1 100644 --- a/site/committers.html +++ b/site/committers.html @@ -460,6 +460,10 @@ Takeshi Yamamuro NTT + + Jie Yang + Baidu + Kent Yao NetEase - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] [spark-website] LuciferYang closed pull request #462: Add Jie Yang to committers
LuciferYang closed pull request #462: Add Jie Yang to committers URL: https://github.com/apache/spark-website/pull/462
[spark] branch master updated: [SPARK-43502][PYTHON][CONNECT] `DataFrame.drop` should accept empty column
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 0df4c01b7c4 [SPARK-43502][PYTHON][CONNECT] DataFrame.drop` should accept empty column 0df4c01b7c4 is described below commit 0df4c01b7c4d4476fe0de9dccb3425cc1295fc85 Author: Ruifeng Zheng AuthorDate: Tue May 16 12:38:08 2023 +0800 [SPARK-43502][PYTHON][CONNECT] DataFrame.drop` should accept empty column ### What changes were proposed in this pull request? Make `DataFrame.drop` accept empty column ### Why are the changes needed? to be consistent with vanilla PySpark ### Does this PR introduce _any_ user-facing change? yes ``` In [1]: df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age")) In [2]: df.drop() ``` before: ``` In [2]: df.drop() --- PySparkValueError Traceback (most recent call last) Cell In[2], line 1 > 1 df.drop() File ~/Dev/spark/python/pyspark/sql/connect/dataframe.py:449, in DataFrame.drop(self, *cols) 444 raise PySparkTypeError( 445 error_class="NOT_COLUMN_OR_STR", 446 message_parameters={"arg_name": "cols", "arg_type": type(cols).__name__}, 447 ) 448 if len(_cols) == 0: --> 449 raise PySparkValueError( 450 error_class="CANNOT_BE_EMPTY", 451 message_parameters={"item": "cols"}, 452 ) 454 return DataFrame.withPlan( 455 plan.Drop( 456 child=self._plan, (...) 459 session=self._session, 460 ) PySparkValueError: [CANNOT_BE_EMPTY] At least one cols must be specified. ``` after ``` In [2]: df.drop() Out[2]: DataFrame[id: bigint, age: bigint] ``` ### How was this patch tested? enabled UT Closes #41180 from zhengruifeng/connect_drop_empty_col. 
Authored-by: Ruifeng Zheng Signed-off-by: Ruifeng Zheng --- python/pyspark/sql/connect/dataframe.py | 5 - python/pyspark/sql/connect/plan.py| 3 ++- python/pyspark/sql/tests/connect/test_parity_dataframe.py | 5 - 3 files changed, 2 insertions(+), 11 deletions(-) diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py index ffd52cf0cec..7a5ba50b3c6 100644 --- a/python/pyspark/sql/connect/dataframe.py +++ b/python/pyspark/sql/connect/dataframe.py @@ -445,11 +445,6 @@ class DataFrame: error_class="NOT_COLUMN_OR_STR", message_parameters={"arg_name": "cols", "arg_type": type(cols).__name__}, ) -if len(_cols) == 0: -raise PySparkValueError( -error_class="CANNOT_BE_EMPTY", -message_parameters={"item": "cols"}, -) return DataFrame.withPlan( plan.Drop( diff --git a/python/pyspark/sql/connect/plan.py b/python/pyspark/sql/connect/plan.py index 03aca4896be..eb4765cbd4b 100644 --- a/python/pyspark/sql/connect/plan.py +++ b/python/pyspark/sql/connect/plan.py @@ -664,7 +664,8 @@ class Drop(LogicalPlan): columns: List[Union[Column, str]], ) -> None: super().__init__(child) -assert len(columns) > 0 and all(isinstance(c, (Column, str)) for c in columns) +if len(columns) > 0: +assert all(isinstance(c, (Column, str)) for c in columns) self._columns = columns def plan(self, session: "SparkConnectClient") -> proto.Relation: diff --git a/python/pyspark/sql/tests/connect/test_parity_dataframe.py b/python/pyspark/sql/tests/connect/test_parity_dataframe.py index 34f63c1410e..a74afc4d504 100644 --- a/python/pyspark/sql/tests/connect/test_parity_dataframe.py +++ b/python/pyspark/sql/tests/connect/test_parity_dataframe.py @@ -84,11 +84,6 @@ class DataFrameParityTests(DataFrameTestsMixin, ReusedConnectTestCase): def test_to_pandas_from_mixed_dataframe(self): self.check_to_pandas_from_mixed_dataframe() -# TODO(SPARK-43502): DataFrame.drop should support empty column -@unittest.skip("Fails in Spark Connect, should enable.") -def test_drop_empty_column(self): 
-super().test_drop_empty_column() - if __name__ == "__main__": import unittest
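The behavior change in this commit can be sketched with a tiny stand-in for the drop validation logic (plain Python, illustrative names only — not the actual `pyspark.sql.connect.dataframe.DataFrame.drop` code): after SPARK-43502, a zero-argument drop is a no-op instead of raising `CANNOT_BE_EMPTY`, matching vanilla PySpark.

```python
# Minimal model of the SPARK-43502 validation change (illustrative only).
def drop_columns(schema, *cols):
    for c in cols:
        if not isinstance(c, str):
            # mirrors the NOT_COLUMN_OR_STR error, which is still raised
            raise TypeError("cols must be Column or str")
    # Before the fix, len(cols) == 0 raised CANNOT_BE_EMPTY here;
    # after the fix, an empty drop simply leaves the schema unchanged.
    return [c for c in schema if c not in cols]

print(drop_columns(["id", "age"]))         # -> ['id', 'age'] (empty drop is a no-op)
print(drop_columns(["id", "age"], "age"))  # -> ['id']
```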
[GitHub] [spark-website] Hisoka-X commented on pull request #462: Add Jie Yang to committers
Hisoka-X commented on PR #462: URL: https://github.com/apache/spark-website/pull/462#issuecomment-1548923570 Congrats!
[spark] branch master updated (23c072d2a0e -> 4bf979c969e)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 23c072d2a0e [SPARK-43517][PYTHON][DOCS] Add a migration guide for namedtuple monkey patch add 4bf979c969e [SPARK-43482][SS] Expand QueryTerminatedEvent to contain error class if it exists in exception No new revisions were added by this update. Summary of changes: python/pyspark/sql/streaming/listener.py| 17 + .../sql/tests/streaming/test_streaming_listener.py | 3 ++- .../spark/sql/execution/streaming/StreamExecution.scala | 11 +-- .../spark/sql/streaming/StreamingQueryListener.scala| 13 - .../sql/streaming/StreamingQueryListenerSuite.scala | 10 -- .../ui/StreamingQueryStatusListenerSuite.scala | 10 +- 6 files changed, 53 insertions(+), 11 deletions(-)
[spark] branch master updated: [SPARK-43517][PYTHON][DOCS] Add a migration guide for namedtuple monkey patch
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 23c072d2a0e [SPARK-43517][PYTHON][DOCS] Add a migration guide for namedtuple monkey patch 23c072d2a0e is described below commit 23c072d2a0ef046f45893d9a13f5788e6ec09ea5 Author: Hyukjin Kwon AuthorDate: Tue May 16 11:16:27 2023 +0900 [SPARK-43517][PYTHON][DOCS] Add a migration guide for namedtuple monkey patch ### What changes were proposed in this pull request? This PR proposes to add a migration guide for https://github.com/apache/spark/pull/38700. ### Why are the changes needed? To guide users about the workaround of bringing the namedtuple patch back. ### Does this PR introduce _any_ user-facing change? Yes, it adds the migration guides for end-users. ### How was this patch tested? CI in this PR will test it out. Closes #41177 from HyukjinKwon/update-migration-namedtuple. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon --- python/docs/source/migration_guide/pyspark_upgrade.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/python/docs/source/migration_guide/pyspark_upgrade.rst b/python/docs/source/migration_guide/pyspark_upgrade.rst index d06475f9b36..7513d64ef6c 100644 --- a/python/docs/source/migration_guide/pyspark_upgrade.rst +++ b/python/docs/source/migration_guide/pyspark_upgrade.rst @@ -34,6 +34,7 @@ Upgrading from PySpark 3.3 to 3.4 * In Spark 3.4, the ``DataFrame.__setitem__`` will make a copy and replace pre-existing arrays, which will NOT be over-written to follow pandas 1.4 behaviors. * In Spark 3.4, the ``SparkSession.sql`` and the Pandas on Spark API ``sql`` have got new parameter ``args`` which provides binding of named parameters to their SQL literals. 
* In Spark 3.4, Pandas API on Spark follows for the pandas 2.0, and some APIs were deprecated or removed in Spark 3.4 according to the changes made in pandas 2.0. Please refer to the [release notes of pandas](https://pandas.pydata.org/docs/dev/whatsnew/) for more details. +* In Spark 3.4, the custom monkey-patch of ``collections.namedtuple`` was removed, and ``cloudpickle`` was used by default. To restore the previous behavior for any relevant pickling issue of ``collections.namedtuple``, set ``PYSPARK_ENABLE_NAMEDTUPLE_PATCH`` environment variable to ``1``. Upgrading from PySpark 3.2 to 3.3 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
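Per the migration note added above, the pre-3.4 namedtuple monkey-patch can be restored through an environment variable. A minimal sketch — the variable name comes from the guide itself, while the timing (set it before PySpark is initialized) is an assumption:

```python
import os

# Restore the pre-3.4 collections.namedtuple monkey-patch behavior.
# Assumed timing: this must be set in the driver environment before
# PySpark is initialized for the setting to take effect.
os.environ["PYSPARK_ENABLE_NAMEDTUPLE_PATCH"] = "1"

print(os.environ["PYSPARK_ENABLE_NAMEDTUPLE_PATCH"])  # -> 1
```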
[spark] branch branch-3.4 updated: [SPARK-43517][PYTHON][DOCS] Add a migration guide for namedtuple monkey patch
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 792680ea73a [SPARK-43517][PYTHON][DOCS] Add a migration guide for namedtuple monkey patch 792680ea73a is described below commit 792680ea73a19e91d2a15d672bd27bc252bc1c91 Author: Hyukjin Kwon AuthorDate: Tue May 16 11:16:27 2023 +0900 [SPARK-43517][PYTHON][DOCS] Add a migration guide for namedtuple monkey patch ### What changes were proposed in this pull request? This PR proposes to add a migration guide for https://github.com/apache/spark/pull/38700. ### Why are the changes needed? To guide users about the workaround of bringing the namedtuple patch back. ### Does this PR introduce _any_ user-facing change? Yes, it adds the migration guides for end-users. ### How was this patch tested? CI in this PR will test it out. Closes #41177 from HyukjinKwon/update-migration-namedtuple. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon (cherry picked from commit 23c072d2a0ef046f45893d9a13f5788e6ec09ea5) Signed-off-by: Hyukjin Kwon --- python/docs/source/migration_guide/pyspark_upgrade.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/python/docs/source/migration_guide/pyspark_upgrade.rst b/python/docs/source/migration_guide/pyspark_upgrade.rst index d06475f9b36..7513d64ef6c 100644 --- a/python/docs/source/migration_guide/pyspark_upgrade.rst +++ b/python/docs/source/migration_guide/pyspark_upgrade.rst @@ -34,6 +34,7 @@ Upgrading from PySpark 3.3 to 3.4 * In Spark 3.4, the ``DataFrame.__setitem__`` will make a copy and replace pre-existing arrays, which will NOT be over-written to follow pandas 1.4 behaviors. * In Spark 3.4, the ``SparkSession.sql`` and the Pandas on Spark API ``sql`` have got new parameter ``args`` which provides binding of named parameters to their SQL literals. 
* In Spark 3.4, Pandas API on Spark follows for the pandas 2.0, and some APIs were deprecated or removed in Spark 3.4 according to the changes made in pandas 2.0. Please refer to the [release notes of pandas](https://pandas.pydata.org/docs/dev/whatsnew/) for more details. +* In Spark 3.4, the custom monkey-patch of ``collections.namedtuple`` was removed, and ``cloudpickle`` was used by default. To restore the previous behavior for any relevant pickling issue of ``collections.namedtuple``, set ``PYSPARK_ENABLE_NAMEDTUPLE_PATCH`` environment variable to ``1``. Upgrading from PySpark 3.2 to 3.3 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (d53ddbe00fe -> 6221995f67b)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from d53ddbe00fe [SPARK-43300][CORE] NonFateSharingCache wrapper for Guava Cache add 6221995f67b [SPARK-43473][PYTHON] Support struct type in createDataFrame from pandas DataFrame No new revisions were added by this update. Summary of changes: python/pyspark/sql/connect/session.py| 4 +- python/pyspark/sql/pandas/conversion.py | 3 +- python/pyspark/sql/pandas/serializers.py | 203 +++ python/pyspark/sql/tests/test_arrow.py | 26 4 files changed, 154 insertions(+), 82 deletions(-)
[spark] branch master updated: [SPARK-43300][CORE] NonFateSharingCache wrapper for Guava Cache
This is an automated email from the ASF dual-hosted git repository. joshrosen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d53ddbe00fe [SPARK-43300][CORE] NonFateSharingCache wrapper for Guava Cache d53ddbe00fe is described below commit d53ddbe00fe73a703f870b0297278f3870148fc4 Author: Ziqi Liu AuthorDate: Mon May 15 18:47:29 2023 -0700 [SPARK-43300][CORE] NonFateSharingCache wrapper for Guava Cache ### What changes were proposed in this pull request? Create `NonFateSharingCache` to wrap around Guava cache with a KeyLock to synchronize all requests with the same key, so they will run individually and fail as if they come one at a time. Wrap cache in `CodeGenerator` with `NonFateSharingCache` to protect it from unexpected cascade failure due to cancellation from irrelevant queries that loading the same key. Feel free to use this in other places where we used Guava cache and don't want fate-sharing behavior. Also, instead of implementing Guava Cache and LoadingCache interface, I define a subset of it so that we can control at compile time what cache operations are allowed and make sure all cache loading action go through our narrow waist code path with key lock. Feel free to add new APIs when needed. ### Why are the changes needed? Guava cache is widely used in spark, however, it suffers from fate-sharing behavior: If there are multiple requests trying to access the same key in the cache at the same time when the key is not in the cache, Guava cache will block all requests and create the object only once. If the creation fails, all requests will fail immediately without retry. So we might see task failure due to irrelevant failure in other queries due to fate sharing. This fate sharing behavior leads to unexpected results in some situation(for example, in code gen). ### Does this PR introduce _any_ user-facing change? 
No ### How was this patch tested? UT Closes #40982 from liuzqt/SPARK-43300. Authored-by: Ziqi Liu Signed-off-by: Josh Rosen --- .../apache/spark/util/NonFateSharingCache.scala| 78 .../spark/util/NonFateSharingCacheSuite.scala | 140 + .../expressions/codegen/CodeGenerator.scala| 10 +- 3 files changed, 225 insertions(+), 3 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/util/NonFateSharingCache.scala b/core/src/main/scala/org/apache/spark/util/NonFateSharingCache.scala new file mode 100644 index 000..d9847313304 --- /dev/null +++ b/core/src/main/scala/org/apache/spark/util/NonFateSharingCache.scala @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util + +import java.util.concurrent.Callable + +import com.google.common.cache.Cache +import com.google.common.cache.LoadingCache + +/** + * SPARK-43300: Guava cache fate-sharing behavior might lead to unexpected cascade failure: + * when multiple threads access the same key in the cache at the same time when the key is not in + * the cache, Guava cache will block all requests and load the data only once. If the loading fails, + * all requests will fail immediately without retry. 
Therefore individual failure will also fail + * other irrelevant queries who are waiting for the same key. Given that spark can cancel tasks at + * arbitrary times for many different reasons, fate sharing means that a task which gets canceled + * while populating a cache entry can cause spurious failures in tasks from unrelated jobs -- even + * though those tasks would have successfully populated the cache if they had been allowed to try. + * + * This util Cache wrapper with KeyLock to synchronize threads looking for the same key + * so that they should run individually and fail as if they had arrived one at a time. + * + * There are so many ways to add cache entries in Guava Cache, instead of implementing Guava Cache + * and LoadingCache interface, we expose a subset of APIs so that we can control at compile time + * what ca
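The non-fate-sharing idea described above (implemented in Scala around Guava cache in the actual patch) can be sketched in a few lines of Python: a per-key lock serializes loaders for the same key, and a loader failure propagates only to its own caller, so a waiting thread simply retries the load itself. This is an illustrative model, not the Spark `NonFateSharingCache` API.

```python
import threading

class NonFateSharingCacheSketch:
    """Illustrative model of SPARK-43300: per-key locking so that a failed
    load for a key does not fail other callers waiting on the same key."""

    def __init__(self):
        self._data = {}
        self._locks = {}
        self._guard = threading.Lock()  # protects the per-key lock table

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key, loader):
        # Unlike fate-sharing behavior, each caller runs the loader itself
        # if the previous lock holder failed to populate the entry.
        with self._lock_for(key):
            if key not in self._data:
                self._data[key] = loader()  # an exception here hits only this caller
            return self._data[key]
```

A canceled or failing first loader leaves the entry absent, so the next caller's loader runs instead of inheriting the failure.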
[spark] branch branch-3.4 updated: [SPARK-43281][SQL] Fix concurrent writer does not update file metrics
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new ee12c81cf33 [SPARK-43281][SQL] Fix concurrent writer does not update file metrics ee12c81cf33 is described below commit ee12c81cf336e5e8408970e8a65b6890dcc0c63e Author: ulysses-you AuthorDate: Tue May 16 09:42:20 2023 +0800 [SPARK-43281][SQL] Fix concurrent writer does not update file metrics ### What changes were proposed in this pull request? `DynamicPartitionDataConcurrentWriter` it uses temp file path to get file status after commit task. However, the temp file has already moved to new path during commit task. This pr calls `closeFile` before commit task. ### Why are the changes needed? fix bug ### Does this PR introduce _any_ user-facing change? yes, after this pr the metrics is correct ### How was this patch tested? add test Closes #40952 from ulysses-you/SPARK-43281. 
Authored-by: ulysses-you Signed-off-by: Wenchen Fan (cherry picked from commit 592e92262246a6345096655270e2ca114934d0eb) Signed-off-by: Wenchen Fan --- .../datasources/BasicWriteStatsTracker.scala | 3 -- .../datasources/FileFormatDataWriter.scala | 1 + .../BasicWriteTaskStatsTrackerSuite.scala | 16 ++- .../sql/test/DataFrameReaderWriterSuite.scala | 53 ++ 4 files changed, 68 insertions(+), 5 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala index 47685899784..8a9fbd15e2e 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala @@ -159,9 +159,6 @@ class BasicWriteTaskStatsTracker( } override def getFinalStats(taskCommitTime: Long): WriteTaskStats = { -submittedFiles.foreach(updateFileStats) -submittedFiles.clear() - // Reports bytesWritten and recordsWritten to the Spark output metrics. 
Option(TaskContext.get()).map(_.taskMetrics().outputMetrics).foreach { outputMetrics => outputMetrics.setBytesWritten(numBytes) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala index 0b1b616bd83..a3d2d2ef0f4 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala @@ -427,6 +427,7 @@ class DynamicPartitionDataConcurrentWriter( if (status.outputWriter != null) { try { status.outputWriter.close() + statsTrackers.foreach(_.closeFile(status.outputWriter.path())) } finally { status.outputWriter = null } diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala index 96c36dd3c37..e8b9dcf172a 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala @@ -85,6 +85,7 @@ class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite { val missing = new Path(tempDirPath, "missing") val tracker = new BasicWriteTaskStatsTracker(conf) tracker.newFile(missing.toString) +tracker.closeFile(missing.toString) assertStats(tracker, 0, 0) } @@ -92,7 +93,7 @@ class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite { val tracker = new BasicWriteTaskStatsTracker(conf) tracker.newFile("") intercept[IllegalArgumentException] { - finalStatus(tracker) + tracker.closeFile("") } } @@ -100,7 +101,7 @@ class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite { val tracker = new BasicWriteTaskStatsTracker(conf) tracker.newFile(null) 
intercept[IllegalArgumentException] { - finalStatus(tracker) + tracker.closeFile(null) } } @@ -109,6 +110,7 @@ class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite { val tracker = new BasicWriteTaskStatsTracker(conf) tracker.newFile(file.toString) touch(file) +tracker.closeFile(file.toString) assertStats(tracker, 1, 0) } @@ -117,6 +119,7 @@ class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite { val tracker = new Basi
[spark] branch master updated: [SPARK-43281][SQL] Fix concurrent writer does not update file metrics
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 592e9226224 [SPARK-43281][SQL] Fix concurrent writer does not update file metrics 592e9226224 is described below commit 592e92262246a6345096655270e2ca114934d0eb Author: ulysses-you AuthorDate: Tue May 16 09:42:20 2023 +0800 [SPARK-43281][SQL] Fix concurrent writer does not update file metrics ### What changes were proposed in this pull request? `DynamicPartitionDataConcurrentWriter` it uses temp file path to get file status after commit task. However, the temp file has already moved to new path during commit task. This pr calls `closeFile` before commit task. ### Why are the changes needed? fix bug ### Does this PR introduce _any_ user-facing change? yes, after this pr the metrics is correct ### How was this patch tested? add test Closes #40952 from ulysses-you/SPARK-43281. 
Authored-by: ulysses-you Signed-off-by: Wenchen Fan --- .../datasources/BasicWriteStatsTracker.scala | 3 -- .../datasources/FileFormatDataWriter.scala | 1 + .../BasicWriteTaskStatsTrackerSuite.scala | 16 ++- .../sql/test/DataFrameReaderWriterSuite.scala | 53 ++ 4 files changed, 68 insertions(+), 5 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala index 47685899784..8a9fbd15e2e 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala @@ -159,9 +159,6 @@ class BasicWriteTaskStatsTracker( } override def getFinalStats(taskCommitTime: Long): WriteTaskStats = { -submittedFiles.foreach(updateFileStats) -submittedFiles.clear() - // Reports bytesWritten and recordsWritten to the Spark output metrics. 
Option(TaskContext.get()).map(_.taskMetrics().outputMetrics).foreach { outputMetrics => outputMetrics.setBytesWritten(numBytes) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala index 0b1b616bd83..a3d2d2ef0f4 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala @@ -427,6 +427,7 @@ class DynamicPartitionDataConcurrentWriter( if (status.outputWriter != null) { try { status.outputWriter.close() + statsTrackers.foreach(_.closeFile(status.outputWriter.path())) } finally { status.outputWriter = null } diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala index 96c36dd3c37..e8b9dcf172a 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala @@ -85,6 +85,7 @@ class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite { val missing = new Path(tempDirPath, "missing") val tracker = new BasicWriteTaskStatsTracker(conf) tracker.newFile(missing.toString) +tracker.closeFile(missing.toString) assertStats(tracker, 0, 0) } @@ -92,7 +93,7 @@ class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite { val tracker = new BasicWriteTaskStatsTracker(conf) tracker.newFile("") intercept[IllegalArgumentException] { - finalStatus(tracker) + tracker.closeFile("") } } @@ -100,7 +101,7 @@ class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite { val tracker = new BasicWriteTaskStatsTracker(conf) tracker.newFile(null) 
intercept[IllegalArgumentException] { - finalStatus(tracker) + tracker.closeFile(null) } } @@ -109,6 +110,7 @@ class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite { val tracker = new BasicWriteTaskStatsTracker(conf) tracker.newFile(file.toString) touch(file) +tracker.closeFile(file.toString) assertStats(tracker, 1, 0) } @@ -117,6 +119,7 @@ class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite { val tracker = new BasicWriteTaskStatsTracker(conf) tracker.newFile(file.toString) write1(file) +tracker.closeFile(file.to
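The SPARK-43281 fix above can be illustrated with a small Python model (hypothetical names, not Spark's stats-tracker API): if file sizes are read only when final stats are computed, the task commit has already renamed the temp path away and the bytes are lost; capturing the size in a `closeFile` step before commit records it correctly.

```python
import os
import tempfile

class StatsTrackerSketch:
    """Illustrative model of SPARK-43281: capture the file size at close
    time, before task commit moves the temp file to its final path."""

    def __init__(self):
        self.num_bytes = 0

    def close_file(self, path):
        if os.path.exists(path):
            self.num_bytes += os.path.getsize(path)

with tempfile.TemporaryDirectory() as d:
    tmp = os.path.join(d, "part-tmp")
    final = os.path.join(d, "part-final")
    with open(tmp, "wb") as f:
        f.write(b"x" * 128)
    tracker = StatsTrackerSketch()
    tracker.close_file(tmp)   # before commit: size is still visible
    os.rename(tmp, final)     # "commit task" moves the temp file away
    print(tracker.num_bytes)  # -> 128; reading tmp after the rename would find nothing
```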
[spark] branch master updated: [SPARK-43413][SQL] Fix IN subquery ListQuery nullability
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2e568218300 [SPARK-43413][SQL] Fix IN subquery ListQuery nullability 2e568218300 is described below commit 2e56821830019765bf8530e0e6a8a5abd6125293 Author: Jack Chen AuthorDate: Tue May 16 09:40:54 2023 +0800 [SPARK-43413][SQL] Fix IN subquery ListQuery nullability ### What changes were proposed in this pull request? Before this PR, IN subquery expressions are incorrectly marked as non-nullable, even when they are actually nullable. They correctly check the nullability of the left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations. Example: `(non_nullable_col IN (select nullable_col)) <=> TRUE`. Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to `non_nullable_col IN (select nullable_col)`, which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE. Fix this by calculating nullability correctly from the ListQuery child output expressions. This bug can potentially lead to wrong results, but in most cases this doesn't directly cause wrong results end-to-end, because IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and this rewrite can also incorrectly discard NULLs, which is another bug. But we can observe it causing wrong behavior in unit tests at least. This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed. ### Why are the changes needed? Fix correctness bug. 
### Does this PR introduce _any_ user-facing change? May change query results to fix correctness bug. ### How was this patch tested? Unit tests Closes #41094 from jchen5/listquery-nullable. Authored-by: Jack Chen Signed-off-by: Wenchen Fan --- .../sql/catalyst/expressions/predicates.scala | 10 +- .../spark/sql/catalyst/expressions/subquery.scala | 10 +- .../org/apache/spark/sql/internal/SQLConf.scala| 10 ++ .../BinaryComparisonSimplificationSuite.scala | 32 + .../subquery/in-subquery/in-nullability.sql.out| 141 + .../inputs/subquery/in-subquery/in-nullability.sql | 14 ++ .../subquery/in-subquery/in-nullability.sql.out| 71 +++ 7 files changed, 286 insertions(+), 2 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala index 38005e78653..ee2ba7c73d1 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala @@ -399,7 +399,15 @@ case class InSubquery(values: Seq[Expression], query: ListQuery) } override def children: Seq[Expression] = values :+ query - override def nullable: Boolean = children.exists(_.nullable) + override def nullable: Boolean = { +if (!SQLConf.get.getConf(SQLConf.LEGACY_IN_SUBQUERY_NULLABILITY)) { + values.exists(_.nullable) || query.childOutputs.exists(_.nullable) +} else { + // Legacy (incorrect) behavior checked only the nullability of the left-hand side + // (see SPARK-43413). 
+ values.exists(_.nullable) +} + } override def toString: String = s"$value IN ($query)" override def sql: String = s"(${value.sql} IN (${query.sql}))" diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala index 1e957466308..b0f10895c17 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala @@ -24,6 +24,7 @@ import org.apache.spark.sql.catalyst.plans.QueryPlan import org.apache.spark.sql.catalyst.plans.logical.{Filter, HintInfo, LogicalPlan} import org.apache.spark.sql.catalyst.trees.TreePattern._ import org.apache.spark.sql.errors.QueryCompilationErrors +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types._ import org.apache.spark.util.collection.BitSet @@ -367,7 +368,14 @@ case class ListQuery( plan.output.head.dataType } override lazy val resolved: Boolean = childrenResolved && plan.resolved && numCols !=
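The transformation described above can be reproduced interactively; the following is a hedged sketch only (the view and column names are made up for illustration, and it assumes a running SparkSession named `spark`, e.g. in spark-shell):

```scala
// Repro sketch for the nullability bug (hypothetical names, assumes `spark`).
// `t1.a` comes from range() and is non-nullable; `t2.b` is a nullable column.
spark.sql("CREATE OR REPLACE TEMP VIEW t1 AS SELECT id AS a FROM range(3)")
spark.sql("CREATE OR REPLACE TEMP VIEW t2 AS SELECT CAST(NULL AS BIGINT) AS b")

// `a IN (SELECT b FROM t2)` is NULL for every row, so `<=> TRUE` must yield
// FALSE. Before the fix, the IN expression was marked non-nullable, letting
// SimplifyBinaryComparison drop the `<=> TRUE` and turn FALSE into NULL.
val df = spark.sql("SELECT a, (a IN (SELECT b FROM t2)) <=> TRUE AS r FROM t1")
df.explain(true) // inspect whether `<=> TRUE` survived the optimizer
df.show()
```

With the fix in place, the optimized plan keeps the null-safe comparison and `r` is FALSE for all rows.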
[GitHub] [spark-website] HyukjinKwon commented on pull request #462: Add Jie Yang to committers
HyukjinKwon commented on PR #462: URL: https://github.com/apache/spark-website/pull/462#issuecomment-1548802646 @LuciferYang feel free to merge it by yourself! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[spark] branch master updated: [MINOR] Remove redundant character escape "\\" and add UT
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b8f22f33308 [MINOR] Remove redundant character escape "\\" and add UT b8f22f33308 is described below commit b8f22f33308ab51b93052457dba17b04c2daeb4a Author: panbingkun AuthorDate: Mon May 15 18:04:31 2023 -0500 [MINOR] Remove redundant character escape "\\" and add UT

### What changes were proposed in this pull request?

This PR aims to remove a redundant character escape "\\" and add a UT for SparkHadoopUtil.substituteHadoopVariables.

### Why are the changes needed?

Makes the code cleaner and removes a warning.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA & add a new UT.

Closes #41170 from panbingkun/SparkHadoopUtil_fix. Authored-by: panbingkun Signed-off-by: Sean Owen --- .../org/apache/spark/deploy/SparkHadoopUtil.scala | 4 +- .../apache/spark/deploy/SparkHadoopUtilSuite.scala | 52 ++ 2 files changed, 54 insertions(+), 2 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala index 4908a081367..9ff2621b791 100644 --- a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala +++ b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala @@ -174,7 +174,7 @@ private[spark] class SparkHadoopUtil extends Logging { * So we need a map to track the bytes read from the child threads and parent thread, * summing them together to get the bytes read of this task.
*/ -new Function0[Long] { +new (() => Long) { private val bytesReadMap = new mutable.HashMap[Long, Long]() override def apply(): Long = { @@ -248,7 +248,7 @@ private[spark] class SparkHadoopUtil extends Logging { if (isGlobPath(pattern)) globPath(fs, pattern) else Seq(pattern) } - private val HADOOP_CONF_PATTERN = "(\\$\\{hadoopconf-[^\\}\\$\\s]+\\})".r.unanchored + private val HADOOP_CONF_PATTERN = "(\\$\\{hadoopconf-[^}$\\s]+})".r.unanchored /** * Substitute variables by looking them up in Hadoop configs. Only variables that match the diff --git a/core/src/test/scala/org/apache/spark/deploy/SparkHadoopUtilSuite.scala b/core/src/test/scala/org/apache/spark/deploy/SparkHadoopUtilSuite.scala index 17f1476cd8d..6250b7d0ed2 100644 --- a/core/src/test/scala/org/apache/spark/deploy/SparkHadoopUtilSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/SparkHadoopUtilSuite.scala @@ -123,6 +123,58 @@ class SparkHadoopUtilSuite extends SparkFunSuite { assertConfigValue(hadoopConf, "fs.s3a.session.token", null) } + test("substituteHadoopVariables") { +val hadoopConf = new Configuration(false) +hadoopConf.set("xxx", "yyy") + +val text1 = "${hadoopconf-xxx}" +val result1 = new SparkHadoopUtil().substituteHadoopVariables(text1, hadoopConf) +assert(result1 == "yyy") + +val text2 = "${hadoopconf-xxx" +val result2 = new SparkHadoopUtil().substituteHadoopVariables(text2, hadoopConf) +assert(result2 == "${hadoopconf-xxx") + +val text3 = "${hadoopconf-xxx}zzz" +val result3 = new SparkHadoopUtil().substituteHadoopVariables(text3, hadoopConf) +assert(result3 == "yyyzzz") + +val text4 = "www${hadoopconf-xxx}zzz" +val result4 = new SparkHadoopUtil().substituteHadoopVariables(text4, hadoopConf) +assert(result4 == "wwwyyyzzz") + +val text5 = "www${hadoopconf-xxx}" +val result5 = new SparkHadoopUtil().substituteHadoopVariables(text5, hadoopConf) +assert(result5 == "wwwyyy") + +val text6 = "www${hadoopconf-xxx" +val result6 = new SparkHadoopUtil().substituteHadoopVariables(text6, 
hadoopConf) +assert(result6 == "www${hadoopconf-xxx") + +val text7 = "www$hadoopconf-xxx}" +val result7 = new SparkHadoopUtil().substituteHadoopVariables(text7, hadoopConf) +assert(result7 == "www$hadoopconf-xxx}") + +val text8 = "www{hadoopconf-xxx}" +val result8 = new SparkHadoopUtil().substituteHadoopVariables(text8, hadoopConf) +assert(result8 == "www{hadoopconf-xxx}") + } + + test("Redundant character escape '\\}' in RegExp ") { +val HADOOP_CONF_PATTERN_1 = "(\\$\\{hadoopconf-[^}$\\s]+})".r.unanchored +val HADOOP_CONF_PATTERN_2 = "(\\$\\{hadoopconf-[^}$\\s]+\\})".r.unanchored + +val text = "www${hadoopconf-xxx}zzz" +val target1 = text match { + case HADOOP_CONF_PATTERN_1(matched) => text.replace(matched, "yyy") +} +val target2 = text match { + case HADOOP_CONF_PATTERN_2(matched) => text.replace(matched, "yyy") +} +assert(target1 == "wwwyyyzzz") +assert(target2 == "wwwyyyzzz") + } + /** * Assert that a hadoop configuration option has the expected v
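The escapes being dropped are redundant because `}` and `$` are literal characters inside a Java/Scala regex character class, and an unambiguous trailing `}` needs no backslash either; this is exactly what the new unit test asserts. A minimal standalone sketch of the same check:

```scala
// Both patterns compile to equivalent matchers: the escaped and unescaped
// forms of `}` and `$` behave identically in these positions.
val withEscapes    = "(\\$\\{hadoopconf-[^\\}\\$\\s]+\\})".r.unanchored
val withoutEscapes = "(\\$\\{hadoopconf-[^}$\\s]+})".r.unanchored

val text = "www${hadoopconf-xxx}zzz"
val r1 = text match { case withEscapes(m)    => text.replace(m, "yyy") }
val r2 = text match { case withoutEscapes(m) => text.replace(m, "yyy") }
assert(r1 == "wwwyyyzzz" && r2 == "wwwyyyzzz")
```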
[spark] branch master updated: [SPARK-43442][PS][CONNECT][TESTS] Split test module `pyspark_pandas_connect`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c8e85eab3fc [SPARK-43442][PS][CONNECT][TESTS] Split test module `pyspark_pandas_connect` c8e85eab3fc is described below commit c8e85eab3fca0e4e5f4bdf9d1d6d1702ecf3fd07 Author: Ruifeng Zheng AuthorDate: Mon May 15 10:56:27 2023 -0700 [SPARK-43442][PS][CONNECT][TESTS] Split test module `pyspark_pandas_connect` ### What changes were proposed in this pull request? Split test module `pyspark_pandas_connect`. Add a new module `pyspark_pandas_slow_connect` which should keep in line with `pyspark_pandas_slow` ### Why are the changes needed? `pyspark_pandas_connect` may take 3~4 hours ### Does this PR introduce _any_ user-facing change? No, test-only ### How was this patch tested? updated CI Closes #41127 from zhengruifeng/test_split_pandas_connect. Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 2 ++ dev/sparktestsupport/modules.py | 17 + dev/sparktestsupport/utils.py| 14 +++--- 3 files changed, 26 insertions(+), 7 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 4aff1bc9753..580415540a6 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -341,6 +341,8 @@ jobs: pyspark-connect, pyspark-errors - >- pyspark-pandas-connect + - >- +pyspark-pandas-slow-connect env: MODULES_TO_TEST: ${{ matrix.modules }} HADOOP_PROFILE: ${{ inputs.hadoop }} diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py index b24cdbddbf6..5e374ebba97 100644 --- a/dev/sparktestsupport/modules.py +++ b/dev/sparktestsupport/modules.py @@ -852,6 +852,23 @@ pyspark_pandas_connect = Module( "pyspark.pandas.tests.connect.test_parity_utils", "pyspark.pandas.tests.connect.test_parity_window", 
"pyspark.pandas.tests.connect.indexes.test_parity_base", +], +excluded_python_implementations=[ +"PyPy" # Skip these tests under PyPy since they require numpy, pandas, and pyarrow and +# they aren't available there +], +) + + +# This module should contain the same test list with 'pyspark_pandas_slow' for maintenance. +pyspark_pandas_slow_connect = Module( +name="pyspark-pandas-slow-connect", +dependencies=[pyspark_connect, pyspark_pandas_slow], +source_file_regexes=[ +"python/pyspark/pandas", +], +python_test_goals=[ +# pandas-on-Spark unittests "pyspark.pandas.tests.connect.indexes.test_parity_datetime", "pyspark.pandas.tests.connect.test_parity_dataframe", "pyspark.pandas.tests.connect.test_parity_dataframe_slow", diff --git a/dev/sparktestsupport/utils.py b/dev/sparktestsupport/utils.py index ebb69be2841..d07fc936f8f 100755 --- a/dev/sparktestsupport/utils.py +++ b/dev/sparktestsupport/utils.py @@ -113,24 +113,24 @@ def determine_modules_to_test(changed_modules, deduplicated=True): ... # doctest: +NORMALIZE_WHITESPACE ['avro', 'connect', 'docker-integration-tests', 'examples', 'hive', 'hive-thriftserver', 'mllib', 'protobuf', 'pyspark-connect', 'pyspark-ml', 'pyspark-mllib', 'pyspark-pandas', - 'pyspark-pandas-connect', 'pyspark-pandas-slow', 'pyspark-sql', 'repl', 'sparkr', 'sql', - 'sql-kafka-0-10'] + 'pyspark-pandas-connect', 'pyspark-pandas-slow', 'pyspark-pandas-slow-connect', 'pyspark-sql', + 'repl', 'sparkr', 'sql', 'sql-kafka-0-10'] >>> sorted([x.name for x in determine_modules_to_test( ... [modules.sparkr, modules.sql], deduplicated=False)]) ... 
# doctest: +NORMALIZE_WHITESPACE ['avro', 'connect', 'docker-integration-tests', 'examples', 'hive', 'hive-thriftserver', 'mllib', 'protobuf', 'pyspark-connect', 'pyspark-ml', 'pyspark-mllib', 'pyspark-pandas', - 'pyspark-pandas-connect', 'pyspark-pandas-slow', 'pyspark-sql', 'repl', 'sparkr', 'sql', - 'sql-kafka-0-10'] + 'pyspark-pandas-connect', 'pyspark-pandas-slow', 'pyspark-pandas-slow-connect', 'pyspark-sql', + 'repl', 'sparkr', 'sql', 'sql-kafka-0-10'] >>> sorted([x.name for x in determine_modules_to_test( ... [modules.sql, modules.core], deduplicated=False)]) ... # doctest: +NORMALIZE_WHITESPACE ['avro', 'catalyst', 'connect', 'core', 'docker-integration-tests', 'examples', 'graphx', 'hive', 'hive-thriftserver', 'mllib', 'mllib-local', 'protobuf', 'pyspark-connect', 'pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-pandas', 'pyspark-pandas-connect', - 'pyspark-pandas-slow', 'pyspark-resource', 'pyspark-sql', 'pyspark-st
[spark] branch master updated: [SPARK-43494][CORE] Directly call `replicate()` for `HdfsDataOutputStreamBuilder` instead of reflection in `SparkHadoopUtil#createFile`
This is an automated email from the ASF dual-hosted git repository. sunchao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 98da24869b6 [SPARK-43494][CORE] Directly call `replicate()` for `HdfsDataOutputStreamBuilder` instead of reflection in `SparkHadoopUtil#createFile` 98da24869b6 is described below commit 98da24869b6510435619d0554da24c31c14da97f Author: yangjie01 AuthorDate: Mon May 15 08:44:39 2023 -0700 [SPARK-43494][CORE] Directly call `replicate()` for `HdfsDataOutputStreamBuilder` instead of reflection in `SparkHadoopUtil#createFile`

### What changes were proposed in this pull request?

As discussed in https://github.com/apache/spark/pull/40945#discussion_r1191804004, `replicate()` is a method specific to `HdfsDataOutputStreamBuilder`, so this PR uses a case match to further remove the use of reflection in `SparkHadoopUtil#createFile`.

### Why are the changes needed?

Code simplification: removes the reflection calls that existed for compatibility with Hadoop 2.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GitHub Actions

Closes #41164 from LuciferYang/SPARK-43272-FOLLOWUP.
Authored-by: yangjie01 Signed-off-by: Chao Sun --- .../org/apache/spark/deploy/SparkHadoopUtil.scala | 29 -- 1 file changed, 11 insertions(+), 18 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala index 842d3556112..4908a081367 100644 --- a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala +++ b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala @@ -30,6 +30,7 @@ import scala.language.existentials import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs._ +import org.apache.hadoop.hdfs.DistributedFileSystem.HdfsDataOutputStreamBuilder import org.apache.hadoop.mapred.JobConf import org.apache.hadoop.security.{Credentials, UserGroupInformation} import org.apache.hadoop.security.token.{Token, TokenIdentifier} @@ -566,24 +567,16 @@ private[spark] object SparkHadoopUtil extends Logging { if (allowEC) { fs.create(path) } else { - try { -// the builder api does not resolve relative paths, nor does it create parent dirs, while -// the old api does. -if (!fs.mkdirs(path.getParent())) { - throw new IOException(s"Failed to create parents of $path") -} -val qualifiedPath = fs.makeQualified(path) -val builder = fs.createFile(qualifiedPath) -val builderCls = builder.getClass() -// this may throw a NoSuchMethodException if the path is not on hdfs -val replicateMethod = builderCls.getMethod("replicate") -val buildMethod = builderCls.getMethod("build") -val b2 = replicateMethod.invoke(builder) -buildMethod.invoke(b2).asInstanceOf[FSDataOutputStream] - } catch { -case _: NoSuchMethodException => - // No replicate() method, so just create a file with old apis. - fs.create(path) + // the builder api does not resolve relative paths, nor does it create parent dirs, while + // the old api does. 
+ if (!fs.mkdirs(path.getParent())) { +throw new IOException(s"Failed to create parents of $path") + } + val qualifiedPath = fs.makeQualified(path) + val builder = fs.createFile(qualifiedPath) + builder match { +case hb: HdfsDataOutputStreamBuilder => hb.replicate().build() +case _ => fs.create(path) } } }
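The design choice here is the common Scala move from reflective method probing to a static type test: instead of calling `getClass.getMethod("replicate")` and catching `NoSuchMethodException`, the code matches on the subtype that statically has the richer API. In miniature (the types below are hypothetical stand-ins, for illustration only):

```scala
// Hypothetical stand-ins for the output-stream builder and its HDFS subtype.
trait StreamBuilder { def build(): String }
final class HdfsBuilder extends StreamBuilder {
  def replicate(): this.type = this          // extra capability of the subtype
  override def build(): String = "hdfs-replicated"
}
final class PlainBuilder extends StreamBuilder {
  override def build(): String = "plain"
}

// The type test replaces getMethod("replicate") + NoSuchMethodException
// handling: the compiler now checks that replicate() exists.
def create(builder: StreamBuilder): String = builder match {
  case hb: HdfsBuilder => hb.replicate().build()
  case other           => other.build()
}

assert(create(new HdfsBuilder) == "hdfs-replicated")
assert(create(new PlainBuilder) == "plain")
```

The reflective fallback was only needed while Spark still supported Hadoop 2; with Hadoop 3 the subtype is always on the classpath, so the match can be direct.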
[spark] branch master updated: [SPARK-43223][CONNECT] Typed agg, reduce functions, RelationalGroupedDataset#as
This is an automated email from the ASF dual-hosted git repository. hvanhovell pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ba8cae2031f [SPARK-43223][CONNECT] Typed agg, reduce functions, RelationalGroupedDataset#as ba8cae2031f is described below commit ba8cae2031f81dc326d386cbe7d19c1f0a8f239e Author: Zhen Li AuthorDate: Mon May 15 11:05:33 2023 -0400 [SPARK-43223][CONNECT] Typed agg, reduce functions, RelationalGroupedDataset#as

### What changes were proposed in this pull request?

Added agg and reduce support in `KeyValueGroupedDataset`. Added `Dataset#reduce`. Added `RelationalGroupedDataset#as`.

Summary:
* `KVGDS#agg`: `KVGDS#agg` and `RelationalGroupedDS#agg` share the exact same proto. The only difference is that the KVGDS always passes a UDF as the first grouping expression. That is also how we tell them apart in this PR.
* `KVGDS#reduce`: Reduce is a special aggregation. The client uses an UnresolvedFunc "reduce" to mark that the agg operator is a `ReduceAggregator` and calls `KVGDS#agg` directly. The server can pick this func up directly and reuse the agg code path by sending in a `ReduceAggregator`.
* `Dataset#reduce`: This comes for free after `KVGDS#reduce`.
* `RelationalGroupedDS#as`: The only difference between a `KVGDS` created using `ds#groupByKey` and one created via `ds#agg#as` is the grouping expressions. The former requires one grouping func as the grouping expression; the latter uses a dummy func (to pass encoders/types to the server) plus grouping expressions. Thus the server can count how many grouping expressions it received and decide whether the `KVGDS` should be created as `ds#groupByKey` or `ds#agg#as`.

Followups:
* [SPARK-43415] Support mapValues in the Agg functions.
* [SPARK-43416] The tupled ProductEncoder does not pick up the field names from the server.

### Why are the changes needed?
These APIs were missing from the Scala client.

### Does this PR introduce _any_ user-facing change?

Added `KeyValueGroupedDataset#agg, reduce`, `Dataset#reduce`, and `RelationalGroupedDataset#as` methods for the Scala client.

### How was this patch tested?

E2E tests

Closes #40796 from zhenlineo/typed-agg. Authored-by: Zhen Li Signed-off-by: Herman van Hovell ---
.../main/scala/org/apache/spark/sql/Dataset.scala | 66 +++--
.../apache/spark/sql/KeyValueGroupedDataset.scala | 255 --
.../spark/sql/RelationalGroupedDataset.scala | 14 +-
.../sql/KeyValueGroupedDatasetE2ETestSuite.scala | 290 ++---
.../sql/UserDefinedFunctionE2ETestSuite.scala | 18 ++
.../CheckConnectJvmClientCompatibility.scala | 8 -
.../spark/sql/connect/client/util/QueryTest.scala | 36 ++-
.../apache/spark/sql/connect/common/UdfUtils.scala | 4 +
.../sql/connect/planner/SparkConnectPlanner.scala | 209 +++
.../spark/sql/catalyst/plans/logical/object.scala | 16 ++
.../main/scala/org/apache/spark/sql/Column.scala | 13 +-
.../apache/spark/sql/KeyValueGroupedDataset.scala | 15 +-
.../spark/sql/RelationalGroupedDataset.scala | 53 ++--
.../spark/sql/expressions/ReduceAggregator.scala | 6 +
.../apache/spark/sql/internal/TypedAggUtils.scala | 62 +
15 files changed, 883 insertions(+), 182 deletions(-)
diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala index 555f6c312c5..7a680bde7d3 100644 --- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala +++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala @@ -1242,10 +1242,7 @@ class Dataset[T] private[sql] ( */ @scala.annotation.varargs def groupBy(cols: Column*): RelationalGroupedDataset = { -new RelationalGroupedDataset( - toDF(), - cols.map(_.expr), - proto.Aggregate.GroupType.GROUP_TYPE_GROUPBY) +new RelationalGroupedDataset(toDF(), cols, proto.Aggregate.GroupType.GROUP_TYPE_GROUPBY) } /** @@ -1273,10
+1270,45 @@ class Dataset[T] private[sql] ( val colNames: Seq[String] = col1 +: cols new RelationalGroupedDataset( toDF(), - colNames.map(colName => Column(colName).expr), + colNames.map(colName => Column(colName)), proto.Aggregate.GroupType.GROUP_TYPE_GROUPBY) } + /** + * (Scala-specific) Reduces the elements of this Dataset using the specified binary function. + * The given `func` must be commutative and associative or the result may be non-deterministic. + * + * @group action + * @since 3.5.0 + */ + def reduce(func: (T, T) => T): T = { +val udf = ScalarUserDefinedFunction( + function = func, + inputEncoders = encoder :: encoder :: Nil, + outputEncoder = e
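Usage of the added APIs mirrors the classic Dataset API; the following is a hedged sketch against a Connect-backed session (it assumes a SparkSession `spark` with its implicits in scope, e.g. in the Connect REPL):

```scala
import spark.implicits._

val ds = Seq(1L, 2L, 3L, 4L).toDS()

// Dataset#reduce: the function must be commutative and associative.
val total = ds.reduce(_ + _) // expected: 10

// KeyValueGroupedDataset aggregation via groupByKey, then a typed count.
val parityCounts = ds.groupByKey(_ % 2).count().collect()

// RelationalGroupedDataset#as: reinterpret a grouped DataFrame as a
// KeyValueGroupedDataset keyed by the grouping expression.
val kv = ds.toDF("v").groupBy($"v" % 2).as[Long, Long]
```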
[GitHub] [spark-website] huaxingao commented on pull request #462: Add Jie Yang to committers
huaxingao commented on PR #462: URL: https://github.com/apache/spark-website/pull/462#issuecomment-1548006578 Congratulations!
[spark] branch master updated: [SPARK-43508][DOC] Replace the link related to hadoop version 2 with hadoop version 3
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new cadfef6f807 [SPARK-43508][DOC] Replace the link related to hadoop version 2 with hadoop version 3 cadfef6f807 is described below commit cadfef6f807a75ff403f6dd9234a3996ec7c691c Author: panbingkun AuthorDate: Mon May 15 09:44:03 2023 -0500 [SPARK-43508][DOC] Replace the link related to hadoop version 2 with hadoop version 3

### What changes were proposed in this pull request?

This PR aims to replace the link related to Hadoop version 2 with Hadoop version 3.

### Why are the changes needed?

Because [SPARK-40651](https://issues.apache.org/jira/browse/SPARK-40651) dropped the Hadoop 2 binary distribution from the release process and [SPARK-42447](https://issues.apache.org/jira/browse/SPARK-42447) removed the Hadoop 2 GitHub Actions job.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #41171 from panbingkun/SPARK-43508. Authored-by: panbingkun Signed-off-by: Sean Owen --- docs/streaming-programming-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md index 5ed66eab348..f8f98ca5442 100644 --- a/docs/streaming-programming-guide.md +++ b/docs/streaming-programming-guide.md @@ -748,7 +748,7 @@ of the store is consistent with that expected by Spark Streaming. It may be that writing directly into a destination directory is the appropriate strategy for streaming data via the chosen object store. -For more details on this topic, consult the [Hadoop Filesystem Specification](https://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-common/filesystem/introduction.html).
+For more details on this topic, consult the [Hadoop Filesystem Specification](https://hadoop.apache.org/docs/stable3/hadoop-project-dist/hadoop-common/filesystem/introduction.html). Streams based on Custom Receivers {:.no_toc}
[spark] branch master updated: [SPARK-43495][BUILD] Upgrade RoaringBitmap to 0.9.44
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new bee8187d731 [SPARK-43495][BUILD] Upgrade RoaringBitmap to 0.9.44 bee8187d731 is described below commit bee8187d7319ededf82701b4fd2a2928cd56c7f8 Author: yangjie01 AuthorDate: Mon May 15 08:46:20 2023 -0500 [SPARK-43495][BUILD] Upgrade RoaringBitmap to 0.9.44

### What changes were proposed in this pull request?

This PR aims to upgrade `RoaringBitmap` from 0.9.39 to 0.9.44.

### Why are the changes needed?

The new version brings 2 bug fixes:
- https://github.com/RoaringBitmap/RoaringBitmap/issues/619 | https://github.com/RoaringBitmap/RoaringBitmap/pull/620
- https://github.com/RoaringBitmap/RoaringBitmap/issues/623 | https://github.com/RoaringBitmap/RoaringBitmap/pull/624

The full release notes are as follows:
- https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.40
- https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.41
- https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.44

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GitHub Actions

Closes #41165 from LuciferYang/SPARK-43495.
Authored-by: yangjie01 Signed-off-by: Sean Owen --- core/benchmarks/MapStatusesConvertBenchmark-jdk11-results.txt | 10 +- core/benchmarks/MapStatusesConvertBenchmark-jdk17-results.txt | 10 +- core/benchmarks/MapStatusesConvertBenchmark-results.txt | 10 +- dev/deps/spark-deps-hadoop-3-hive-2.3 | 4 ++-- pom.xml | 2 +- 5 files changed, 18 insertions(+), 18 deletions(-) diff --git a/core/benchmarks/MapStatusesConvertBenchmark-jdk11-results.txt b/core/benchmarks/MapStatusesConvertBenchmark-jdk11-results.txt index ef9dd139ff2..f42b95e8d4c 100644 --- a/core/benchmarks/MapStatusesConvertBenchmark-jdk11-results.txt +++ b/core/benchmarks/MapStatusesConvertBenchmark-jdk11-results.txt @@ -2,12 +2,12 @@ MapStatuses Convert Benchmark -OpenJDK 64-Bit Server VM 11.0.18+10 on Linux 5.15.0-1031-azure -Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz +OpenJDK 64-Bit Server VM 11.0.19+7 on Linux 5.15.0-1037-azure +Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz MapStatuses Convert: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -Num Maps: 5 Fetch partitions:500 1288 1317 38 0.0 1288194389.0 1.0X -Num Maps: 5 Fetch partitions:1000 2608 2671 65 0.0 2607771122.0 0.5X -Num Maps: 5 Fetch partitions:1500 3985 4026 64 0.0 3984885770.0 0.3X +Num Maps: 5 Fetch partitions:500 1346 1367 28 0.0 1345826909.0 1.0X +Num Maps: 5 Fetch partitions:1000 2807 2818 11 0.0 2806866333.0 0.5X +Num Maps: 5 Fetch partitions:1500 4287 4308 19 0.0 4286688536.0 0.3X diff --git a/core/benchmarks/MapStatusesConvertBenchmark-jdk17-results.txt b/core/benchmarks/MapStatusesConvertBenchmark-jdk17-results.txt index 12af87d9689..b0b61cc11ef 100644 --- a/core/benchmarks/MapStatusesConvertBenchmark-jdk17-results.txt +++ b/core/benchmarks/MapStatusesConvertBenchmark-jdk17-results.txt @@ -2,12 +2,12 @@ MapStatuses Convert Benchmark -OpenJDK 64-Bit Server VM 17.0.6+10 on Linux 5.15.0-1031-azure -Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz +OpenJDK 64-Bit Server VM 17.0.7+7 on Linux 5.15.0-1037-azure +Intel(R) 
Xeon(R) Platinum 8272CL CPU @ 2.60GHz MapStatuses Convert: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -Num Maps: 5 Fetch partitions:500 1052 1061 12 0.0 1051946292.0 1.0X -Num Maps: 5 Fetch partitions:1000 1888 2007 109 0.0 1888235523.0 0.6X -Num Maps: 5 Fetch partitions:1500 3070 3149 81 0.0 3070386448.0 0.3X +Num Maps: 5 Fetch partitions:500 1041 1050 16 0.0 1040911290.
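For context, the benchmark above (MapStatusesConvertBenchmark) exercises RoaringBitmap through Spark's MapStatus handling, where compressed bitmaps track empty shuffle blocks. The library's core API is small; a basic sketch (assumes the `org.roaringbitmap:RoaringBitmap` artifact on the classpath):

```scala
import org.roaringbitmap.RoaringBitmap

// A compressed bitmap over 32-bit ints.
val bitmap = RoaringBitmap.bitmapOf(1, 2, 100, 1000)
bitmap.add(7)

assert(bitmap.contains(1000))
assert(!bitmap.contains(3))
assert(bitmap.getCardinality == 5)

// The serialized form stays small even for sparse, wide value ranges,
// which is why Spark uses it for per-partition block tracking.
println(bitmap.serializedSizeInBytes())
```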
[spark] branch branch-3.4 updated: [SPARK-43483][SQL][DOCS] Adds SQL references for OFFSET clause
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new cacaed8e36f [SPARK-43483][SQL][DOCS] Adds SQL references for OFFSET clause cacaed8e36f is described below commit cacaed8e36f512354dd889c30bb63d911b573342 Author: Jiaan Geng AuthorDate: Mon May 15 20:54:55 2023 +0800 [SPARK-43483][SQL][DOCS] Adds SQL references for OFFSET clause

### What changes were proposed in this pull request?

Spark 3.4.0 released the new `OFFSET` clause syntax, but the SQL reference was missing a description of it.

### Why are the changes needed?

Adds a SQL reference for the `OFFSET` clause.

### Does this PR introduce _any_ user-facing change?

Yes. Users can now find the SQL reference for the `OFFSET` clause.

### How was this patch tested?

Manual verification.

![image](https://github.com/apache/spark/assets/8486025/55398194-5193-45eb-ac04-10f5f0793f7f)
![image](https://github.com/apache/spark/assets/8486025/fef0abc1-7dfa-44e2-b2e0-a56fa82a0817)
![image](https://github.com/apache/spark/assets/8486025/5ab9dc39-6812-45b4-a758-85668ab040f1)
![image](https://github.com/apache/spark/assets/8486025/b726abd4-daae-4de4-a78e-45120573e699)

Closes #41151 from beliefer/SPARK-43483.
Authored-by: Jiaan Geng Signed-off-by: Wenchen Fan --- docs/sql-ref-syntax-qry-select-case.md | 1 + docs/sql-ref-syntax-qry-select-clusterby.md| 1 + docs/sql-ref-syntax-qry-select-distribute-by.md| 1 + docs/sql-ref-syntax-qry-select-groupby.md | 1 + docs/sql-ref-syntax-qry-select-having.md | 1 + docs/sql-ref-syntax-qry-select-lateral-view.md | 1 + docs/sql-ref-syntax-qry-select-limit.md| 1 + ...imit.md => sql-ref-syntax-qry-select-offset.md} | 51 +- docs/sql-ref-syntax-qry-select-orderby.md | 1 + docs/sql-ref-syntax-qry-select-pivot.md| 1 + docs/sql-ref-syntax-qry-select-sortby.md | 1 + docs/sql-ref-syntax-qry-select-transform.md| 1 + docs/sql-ref-syntax-qry-select-unpivot.md | 1 + docs/sql-ref-syntax-qry-select-where.md| 1 + docs/sql-ref-syntax-qry-select.md | 1 + docs/sql-ref-syntax.md | 1 + 16 files changed, 36 insertions(+), 30 deletions(-) diff --git a/docs/sql-ref-syntax-qry-select-case.md b/docs/sql-ref-syntax-qry-select-case.md index 5d0d055919e..d9725f001ae 100644 --- a/docs/sql-ref-syntax-qry-select-case.md +++ b/docs/sql-ref-syntax-qry-select-case.md @@ -105,6 +105,7 @@ SELECT * FROM person * [SORT BY Clause](sql-ref-syntax-qry-select-sortby.html) * [DISTRIBUTE BY Clause](sql-ref-syntax-qry-select-distribute-by.html) * [LIMIT Clause](sql-ref-syntax-qry-select-limit.html) +* [OFFSET Clause](sql-ref-syntax-qry-select-offset.html) * [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html) * [UNPIVOT Clause](sql-ref-syntax-qry-select-unpivot.html) * [LATERAL VIEW Clause](sql-ref-syntax-qry-select-lateral-view.html) diff --git a/docs/sql-ref-syntax-qry-select-clusterby.md b/docs/sql-ref-syntax-qry-select-clusterby.md index 8f495143b2d..79d72ca438b 100644 --- a/docs/sql-ref-syntax-qry-select-clusterby.md +++ b/docs/sql-ref-syntax-qry-select-clusterby.md @@ -99,6 +99,7 @@ SELECT age, name FROM person CLUSTER BY age; * [SORT BY Clause](sql-ref-syntax-qry-select-sortby.html) * [DISTRIBUTE BY Clause](sql-ref-syntax-qry-select-distribute-by.html) * [LIMIT 
Clause](sql-ref-syntax-qry-select-limit.html) +* [OFFSET Clause](sql-ref-syntax-qry-select-offset.html) * [CASE Clause](sql-ref-syntax-qry-select-case.html) * [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html) * [UNPIVOT Clause](sql-ref-syntax-qry-select-unpivot.html) diff --git a/docs/sql-ref-syntax-qry-select-distribute-by.md b/docs/sql-ref-syntax-qry-select-distribute-by.md index f686f032ab3..91c75f61b97 100644 --- a/docs/sql-ref-syntax-qry-select-distribute-by.md +++ b/docs/sql-ref-syntax-qry-select-distribute-by.md @@ -94,6 +94,7 @@ SELECT age, name FROM person DISTRIBUTE BY age; * [SORT BY Clause](sql-ref-syntax-qry-select-sortby.html) * [CLUSTER BY Clause](sql-ref-syntax-qry-select-clusterby.html) * [LIMIT Clause](sql-ref-syntax-qry-select-limit.html) +* [OFFSET Clause](sql-ref-syntax-qry-select-offset.html) * [CASE Clause](sql-ref-syntax-qry-select-case.html) * [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html) * [UNPIVOT Clause](sql-ref-syntax-qry-select-unpivot.html) diff --git a/docs/sql-ref-syntax-qry-select-groupby.md b/docs/sql-ref-syntax-qry-select-groupby.md index f3157f514b4..72ccfcce099 100644 --- a/docs/sql-ref-syntax-qry-select-groupby.md +++ b/docs/sql-ref-syntax-qry-select-groupby.md @@ -314,6 +314,7 @@ SELECT FIRST(age IGNORE NULLS), LAST(id), SUM(id) FROM per
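The clause being documented composes with ORDER BY and LIMIT in the usual way; a small sketch of its behavior (assumes Spark 3.4+ and a SparkSession `spark`, with hypothetical view and column names):

```scala
// OFFSET skips leading rows; with ORDER BY and LIMIT it pages deterministically.
spark.sql(
  "CREATE OR REPLACE TEMP VIEW person AS " +
    "SELECT id, CONCAT('name', id) AS name FROM range(10)")

// Skip the first 2 rows of the ordered result, then take 3.
spark.sql("SELECT name FROM person ORDER BY id LIMIT 3 OFFSET 2").show()
// expected rows: name2, name3, name4
```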
[spark] branch master updated (e06275c6f14 -> e1114e86194)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from e06275c6f14 [SPARK-43485][SQL] Fix the error message for the `unit` argument of the datetime add/diff functions add e1114e86194 [SPARK-43483][SQL][DOCS] Adds SQL references for OFFSET clause No new revisions were added by this update. Summary of changes: docs/sql-ref-syntax-qry-select-case.md | 1 + docs/sql-ref-syntax-qry-select-clusterby.md| 1 + docs/sql-ref-syntax-qry-select-distribute-by.md| 1 + docs/sql-ref-syntax-qry-select-groupby.md | 1 + docs/sql-ref-syntax-qry-select-having.md | 1 + docs/sql-ref-syntax-qry-select-lateral-view.md | 1 + docs/sql-ref-syntax-qry-select-limit.md| 1 + ...imit.md => sql-ref-syntax-qry-select-offset.md} | 51 +- docs/sql-ref-syntax-qry-select-orderby.md | 1 + docs/sql-ref-syntax-qry-select-pivot.md| 1 + docs/sql-ref-syntax-qry-select-sortby.md | 1 + docs/sql-ref-syntax-qry-select-transform.md| 1 + docs/sql-ref-syntax-qry-select-unpivot.md | 1 + docs/sql-ref-syntax-qry-select-where.md| 1 + docs/sql-ref-syntax-qry-select.md | 1 + docs/sql-ref-syntax.md | 1 + 16 files changed, 36 insertions(+), 30 deletions(-) copy docs/{sql-ref-syntax-qry-select-limit.md => sql-ref-syntax-qry-select-offset.md} (68%)
[spark] branch master updated: [SPARK-43485][SQL] Fix the error message for the `unit` argument of the datetime add/diff functions
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new e06275c6f14 [SPARK-43485][SQL] Fix the error message for the `unit` argument of the datetime add/diff functions

e06275c6f14 is described below

commit e06275c6f14e88ba583ffb3aac1159718a8cae83
Author: Max Gekk
AuthorDate: Mon May 15 13:55:18 2023 +0300

    [SPARK-43485][SQL] Fix the error message for the `unit` argument of the datetime add/diff functions

    ### What changes were proposed in this pull request?
    In this PR, I propose to extend the grammar rules for `DATEADD`/`TIMESTAMPADD` and `DATEDIFF`/`TIMESTAMPDIFF`, and catch the wrong type of the first argument `unit` when a user passes a string instead of an identifier like `YEAR`, ..., `MICROSECOND`. In that case, Spark raises an error of the new error class `INVALID_PARAMETER_VALUE.DATETIME_UNIT`.

    ### Why are the changes needed?
    To make the error message clear for the case when a string literal instead of an identifier is passed to the datetime `ADD`/`DIFF` functions:
    ```sql
    spark-sql (default)> select dateadd('MONTH', 1, date'2023-05-11');
    [WRONG_NUM_ARGS.WITHOUT_SUGGESTION] The `dateadd` requires 2 parameters but the actual number is 3. Please, refer to 'https://spark.apache.org/docs/latest/sql-ref-functions.html' for a fix.; line 1 pos 7
    ```

    ### Does this PR introduce _any_ user-facing change?
    Yes, it changes the error class. After the changes:
    ```sql
    spark-sql (default)> select dateadd('MONTH', 1, date'2023-05-11');
    [INVALID_PARAMETER_VALUE.DATETIME_UNIT] The value of parameter(s) `unit` in `dateadd` is invalid: expects one of the units without quotes YEAR, QUARTER, MONTH, WEEK, DAY, DAYOFYEAR, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND, but got the string literal 'MONTH'.(line 1, pos 7)

    == SQL ==
    select dateadd('MONTH', 1, date'2023-05-11')
    ---^^^
    ```

    ### How was this patch tested?
    By running the existing test suites:
    ```
    $ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
    ```

    Closes #41143 from MaxGekk/dateadd-unit-error.

    Authored-by: Max Gekk
    Signed-off-by: Max Gekk
---
 core/src/main/resources/error/error-classes.json   |   5 +
 .../spark/sql/catalyst/parser/SqlBaseParser.g4     |   4 +-
 .../spark/sql/catalyst/parser/AstBuilder.scala     |  18 +-
 .../spark/sql/errors/QueryParsingErrors.scala      |  14 ++
 .../sql-tests/analyzer-results/ansi/date.sql.out   |  88 ++
 .../analyzer-results/ansi/timestamp.sql.out        |  88 ++
 .../sql-tests/analyzer-results/date.sql.out        |  88 ++
 .../analyzer-results/datetime-legacy.sql.out       | 176 +++
 .../sql-tests/analyzer-results/timestamp.sql.out   |  88 ++
 .../timestampNTZ/timestamp-ansi.sql.out            |  88 ++
 .../timestampNTZ/timestamp.sql.out                 |  88 ++
 .../src/test/resources/sql-tests/inputs/date.sql   |   6 +
 .../test/resources/sql-tests/inputs/timestamp.sql  |   6 +
 .../resources/sql-tests/results/ansi/date.sql.out  |  96 +++
 .../sql-tests/results/ansi/timestamp.sql.out       |  96 +++
 .../test/resources/sql-tests/results/date.sql.out  |  96 +++
 .../sql-tests/results/datetime-legacy.sql.out      | 192 +
 .../resources/sql-tests/results/timestamp.sql.out  |  96 +++
 .../results/timestampNTZ/timestamp-ansi.sql.out    |  96 +++
 .../results/timestampNTZ/timestamp.sql.out         |  96 +++
 20 files changed, 1521 insertions(+), 4 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json b/core/src/main/resources/error/error-classes.json
index dde165e5fa9..fa838a6da76 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -1051,6 +1051,11 @@
           "expects a binary value with 16, 24 or 32 bytes, but got bytes."
         ]
       },
+      "DATETIME_UNIT" : {
+        "message" : [
+          "expects one of the units without quotes YEAR, QUARTER, MONTH, WEEK, DAY, DAYOFYEAR, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND, but got the string literal ."
+        ]
+      },
       "PATTERN" : {
         "message" : [
           "."
diff --git a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4 b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
index 2bc79430343..591b0839ac7 100644
--- a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
+++ b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
@@ -892,8 +892,8 @@ datetimeUnit

 primaryExpression
     : name=(CURRENT_DATE
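The commit above rejects a quoted string where a bare unit identifier is expected. Spark's real check lives in the ANTLR grammar and `AstBuilder.scala` (Scala); the following is only a hypothetical Python sketch of the validation that the new `INVALID_PARAMETER_VALUE.DATETIME_UNIT` error class expresses. The names `check_datetime_unit` and `DateTimeUnitError` are invented for illustration and do not exist in Spark:

```python
# Hypothetical sketch (not Spark's implementation) of validating the `unit`
# argument of dateadd/datediff: accept a bare identifier such as MONTH,
# reject a quoted string literal such as 'MONTH'.
VALID_UNITS = (
    "YEAR", "QUARTER", "MONTH", "WEEK", "DAY", "DAYOFYEAR",
    "HOUR", "MINUTE", "SECOND", "MILLISECOND", "MICROSECOND",
)

class DateTimeUnitError(ValueError):
    """Stand-in for the INVALID_PARAMETER_VALUE.DATETIME_UNIT error class."""

def check_datetime_unit(func_name: str, unit_token: str) -> str:
    # A token still wrapped in single quotes was written as a string literal,
    # which is the case the new error message targets.
    if unit_token.startswith("'") and unit_token.endswith("'"):
        raise DateTimeUnitError(
            f"[INVALID_PARAMETER_VALUE.DATETIME_UNIT] The value of parameter(s) "
            f"`unit` in `{func_name}` is invalid: expects one of the units "
            f"without quotes {', '.join(VALID_UNITS)}, "
            f"but got the string literal {unit_token}."
        )
    return unit_token.upper()

print(check_datetime_unit("dateadd", "MONTH"))  # -> MONTH
```

The point of the fix is precisely this distinction: before it, the string literal was parsed as an ordinary argument and the user got a misleading "wrong number of arguments" error instead.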
[GitHub] [spark-website] jerqi commented on pull request #462: Add Jie Yang to committers
jerqi commented on PR #462:
URL: https://github.com/apache/spark-website/pull/462#issuecomment-1547493795

    Congrats!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[spark] branch master updated: [SPARK-43500][PYTHON][TESTS] Test `DataFrame.drop` with empty column list and names containing dot
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new ffd64a32eb2 [SPARK-43500][PYTHON][TESTS] Test `DataFrame.drop` with empty column list and names containing dot

ffd64a32eb2 is described below

commit ffd64a32eb2609ecfb68d252723671ec5cca3ffb
Author: Ruifeng Zheng
AuthorDate: Mon May 15 16:47:15 2023 +0800

    [SPARK-43500][PYTHON][TESTS] Test `DataFrame.drop` with empty column list and names containing dot

    ### What changes were proposed in this pull request?
    Add tests for:
    1. `DataFrame.drop` with an empty column list;
    2. `DataFrame.drop` with column names containing a dot.

    ### Why are the changes needed?
    For better test coverage: the two UTs were once broken in [SPARK-39895](https://issues.apache.org/jira/browse/SPARK-39895) and then fixed in [SPARK-42444](https://issues.apache.org/jira/browse/SPARK-42444).

    ### Does this PR introduce _any_ user-facing change?
    No, test-only.

    ### How was this patch tested?
    Added UTs.

    Closes #41167 from zhengruifeng/py_test_drop_more.

    Authored-by: Ruifeng Zheng
    Signed-off-by: Ruifeng Zheng
---
 .../sql/tests/connect/test_parity_dataframe.py |  5 +
 python/pyspark/sql/tests/test_dataframe.py     | 24 ++
 2 files changed, 29 insertions(+)

diff --git a/python/pyspark/sql/tests/connect/test_parity_dataframe.py b/python/pyspark/sql/tests/connect/test_parity_dataframe.py
index a74afc4d504..34f63c1410e 100644
--- a/python/pyspark/sql/tests/connect/test_parity_dataframe.py
+++ b/python/pyspark/sql/tests/connect/test_parity_dataframe.py
@@ -84,6 +84,11 @@ class DataFrameParityTests(DataFrameTestsMixin, ReusedConnectTestCase):
     def test_to_pandas_from_mixed_dataframe(self):
         self.check_to_pandas_from_mixed_dataframe()

+    # TODO(SPARK-43502): DataFrame.drop should support empty column
+    @unittest.skip("Fails in Spark Connect, should enable.")
+    def test_drop_empty_column(self):
+        super().test_drop_empty_column()
+

 if __name__ == "__main__":
     import unittest
diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py
index 715cd1d142c..527a51cc239 100644
--- a/python/pyspark/sql/tests/test_dataframe.py
+++ b/python/pyspark/sql/tests/test_dataframe.py
@@ -156,6 +156,30 @@ class DataFrameTestsMixin:
         self.assertEqual(df3.drop("name", df3.age, "unknown").columns, ["height"])
         self.assertEqual(df3.drop("name", "age", df3.height).columns, [])

+    def test_drop_empty_column(self):
+        df = self.spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
+
+        self.assertEqual(df.drop().columns, ["age", "name"])
+        self.assertEqual(df.drop(*[]).columns, ["age", "name"])
+
+    def test_drop_column_name_with_dot(self):
+        df = (
+            self.spark.range(1, 3)
+            .withColumn("first.name", lit("Peter"))
+            .withColumn("city.name", lit("raleigh"))
+            .withColumn("state", lit("nc"))
+        )
+
+        self.assertEqual(df.drop("first.name").columns, ["id", "city.name", "state"])
+        self.assertEqual(df.drop("city.name").columns, ["id", "first.name", "state"])
+        self.assertEqual(df.drop("first.name", "city.name").columns, ["id", "state"])
+        self.assertEqual(
+            df.drop("first.name", "city.name", "unknown.unknown").columns, ["id", "state"]
+        )
+        self.assertEqual(
+            df.drop("unknown.unknown").columns, ["id", "first.name", "city.name", "state"]
+        )
+
     def test_dropna(self):
         schema = StructType(
             [
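The subtlety these tests pin down is that `DataFrame.drop` treats a dotted name as a literal column name (not a struct-field path), silently ignores unknown names, and is a no-op when called with no names. A toy model of that matching behavior in plain Python (not pyspark; `drop_columns` is an invented helper for illustration only):

```python
# Toy model of DataFrame.drop's name matching: exact string comparison, so a
# dot in a column name is just a character, unknown names are silently
# ignored, and dropping nothing returns the columns unchanged.
def drop_columns(columns, *names):
    to_drop = set(names)
    return [c for c in columns if c not in to_drop]

cols = ["id", "first.name", "city.name", "state"]
print(drop_columns(cols))                     # no names: all columns kept
print(drop_columns(cols, "first.name"))       # dotted name matched literally
print(drop_columns(cols, "unknown.unknown"))  # unknown name silently ignored
```

This is the behavior the broken-then-fixed UTs guard: at one point the dotted name was parsed as a nested-field reference, so `drop("first.name")` failed to remove the column.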
[GitHub] [spark-website] mridulm commented on pull request #462: Add Jie Yang to committers
mridulm commented on PR #462:
URL: https://github.com/apache/spark-website/pull/462#issuecomment-1547339193

    Congratulations!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org