(spark) branch master updated: [SPARK-46389][CORE] Manually close the `RocksDB/LevelDB` instance when `checkVersion` throw Exception
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 5e1b904ca54f [SPARK-46389][CORE] Manually close the `RocksDB/LevelDB` instance when `checkVersion` throw Exception
5e1b904ca54f is described below

commit 5e1b904ca54f8eddc5315933e43edc8bdd0d2982
Author: yangjie01
AuthorDate: Sun Dec 17 13:22:13 2023 -0800

    [SPARK-46389][CORE] Manually close the `RocksDB/LevelDB` instance when `checkVersion` throw Exception

    ### What changes were proposed in this pull request?
    While initializing the `DB` in `RocksDBProvider/LevelDBProvider`, there is a `checkVersion` step that may throw an exception. Once the exception is thrown, the upper-level caller never receives the already opened `RocksDB/LevelDB` instance, so it cannot clean up the resource, which risks leaking the handle.

    So this PR manually closes the `RocksDB/LevelDB` instance when `checkVersion` throws an exception.

    ### Why are the changes needed?
    The `RocksDB/LevelDB` instance should be closed when `checkVersion` throws an exception.

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    Pass GitHub Actions

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #44327 from LuciferYang/SPARK-46389.

    Authored-by: yangjie01
    Signed-off-by: Dongjoon Hyun
---
 .../main/java/org/apache/spark/network/util/LevelDBProvider.java | 7 ++++++-
 .../main/java/org/apache/spark/network/util/RocksDBProvider.java | 4 ++++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/common/network-common/src/main/java/org/apache/spark/network/util/LevelDBProvider.java b/common/network-common/src/main/java/org/apache/spark/network/util/LevelDBProvider.java
index b27e3beb77ef..aa8be0c663bc 100644
--- a/common/network-common/src/main/java/org/apache/spark/network/util/LevelDBProvider.java
+++ b/common/network-common/src/main/java/org/apache/spark/network/util/LevelDBProvider.java
@@ -80,7 +80,12 @@ public class LevelDBProvider {
         }
       }
       // if there is a version mismatch, we throw an exception, which means the service is unusable
-      checkVersion(tmpDb, version, mapper);
+      try {
+        checkVersion(tmpDb, version, mapper);
+      } catch (IOException ioe) {
+        tmpDb.close();
+        throw ioe;
+      }
     }
     return tmpDb;
   }
diff --git a/common/network-common/src/main/java/org/apache/spark/network/util/RocksDBProvider.java b/common/network-common/src/main/java/org/apache/spark/network/util/RocksDBProvider.java
index f1f702c44245..f3b7b48355a0 100644
--- a/common/network-common/src/main/java/org/apache/spark/network/util/RocksDBProvider.java
+++ b/common/network-common/src/main/java/org/apache/spark/network/util/RocksDBProvider.java
@@ -100,7 +100,11 @@ public class RocksDBProvider {
           // is unusable
           checkVersion(tmpDb, version, mapper);
         } catch (RocksDBException e) {
+          tmpDb.close();
           throw new IOException(e.getMessage(), e);
+        } catch (IOException ioe) {
+          tmpDb.close();
+          throw ioe;
         }
       }
       return tmpDb;

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
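The close-on-failure pattern described in the SPARK-46389 commit above can be sketched in isolation. This is a minimal illustration, not the Spark code: `CloseOnFailedCheck`, `FakeDb`, `openChecked`, and the integer version are hypothetical stand-ins for the real provider classes and the native handle.

```java
import java.io.Closeable;
import java.io.IOException;

/** Illustrative only: every name in this class is a hypothetical stand-in. */
final class CloseOnFailedCheck {

  /** Stand-in for a native DB handle that must be closed explicitly. */
  static final class FakeDb implements Closeable {
    private boolean closed = false;
    final int storedVersion;

    FakeDb(int storedVersion) {
      this.storedVersion = storedVersion;
    }

    boolean isClosed() {
      return closed;
    }

    @Override
    public void close() {
      closed = true;
    }
  }

  /** Mimics a version check that rejects an incompatible store. */
  static void checkVersion(FakeDb db, int expected) throws IOException {
    if (db.storedVersion != expected) {
      throw new IOException(
          "version mismatch: stored " + db.storedVersion + ", expected " + expected);
    }
  }

  /**
   * Returns the handle only if validation passes. On failure the caller never
   * sees the handle, so it is closed here before the exception is rethrown.
   */
  static FakeDb openChecked(FakeDb db, int expected) throws IOException {
    try {
      checkVersion(db, expected);
    } catch (IOException ioe) {
      db.close();
      throw ioe;
    }
    return db;
  }
}
```

A try-with-resources block would not fit here, because on the success path the handle must stay open and be returned to the caller; only the failure path closes it.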
(spark) branch master updated: [SPARK-46376][SQL][TESTS] Simplify the code to generate the Spark tarball `filename` in the `HiveExternalCatalogVersionsSuite`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 0745bb507f36 [SPARK-46376][SQL][TESTS] Simplify the code to generate the Spark tarball `filename` in the `HiveExternalCatalogVersionsSuite`
0745bb507f36 is described below

commit 0745bb507f36b8201d49d886fc5da436274e8b85
Author: yangjie01
AuthorDate: Sun Dec 17 13:20:51 2023 -0800

    [SPARK-46376][SQL][TESTS] Simplify the code to generate the Spark tarball `filename` in the `HiveExternalCatalogVersionsSuite`

    ### What changes were proposed in this pull request?
    This PR simplifies the code used to generate the Spark tarball `filename` in `HiveExternalCatalogVersionsSuite` because the minimum tested version is Spark 3.4.

    ### Why are the changes needed?
    Simplify the code to generate the Spark tarball `filename`

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    Pass GitHub Actions

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #44307 from LuciferYang/SPARK-46376.

    Authored-by: yangjie01
    Signed-off-by: Dongjoon Hyun
---
 .../spark/sql/hive/HiveExternalCatalogVersionsSuite.scala | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
index 52f20595a10a..ee2e64bc1905 100644
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
@@ -40,8 +40,8 @@ import org.apache.spark.sql.catalyst.catalog.CatalogTableType
 import org.apache.spark.sql.internal.StaticSQLConf.WAREHOUSE_PATH
 import org.apache.spark.sql.test.SQLTestUtils
 import org.apache.spark.tags.{ExtendedHiveTest, SlowHiveTest}
-import org.apache.spark.util.{Utils, VersionUtils}
 import org.apache.spark.util.ArrayImplicits._
+import org.apache.spark.util.Utils
 
 /**
  * Test HiveExternalCatalog backward compatibility.
@@ -95,13 +95,7 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils {
       mirrors.distinct :+ "https://archive.apache.org/dist" :+ PROCESS_TABLES.releaseMirror
     logInfo(s"Trying to download Spark $version from $sites")
     for (site <- sites) {
-      val filename = VersionUtils.majorMinorPatchVersion(version) match {
-        case Some((major, _, _)) if major > 3 => s"spark-$version-bin-hadoop3.tgz"
-        case Some((3, minor, _)) if minor >= 3 => s"spark-$version-bin-hadoop3.tgz"
-        case Some((3, minor, _)) if minor < 3 => s"spark-$version-bin-hadoop3.2.tgz"
-        case Some((_, _, _)) => s"spark-$version-bin-hadoop2.7.tgz"
-        case None => s"spark-$version-bin-hadoop2.7.tgz"
-      }
+      val filename = s"spark-$version-bin-hadoop3.tgz"
       val url = s"$site/spark/spark-$version/$filename"
       logInfo(s"Downloading Spark $version from $url")
       try {
(spark) branch master updated: [SPARK-45502][BUILD] Upgrade Kafka to 3.6.1
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new b702f63bed27 [SPARK-45502][BUILD] Upgrade Kafka to 3.6.1
b702f63bed27 is described below

commit b702f63bed27b73bae748e232236da2f2ed19dfb
Author: dengziming
AuthorDate: Sat Dec 16 14:17:39 2023 -0800

    [SPARK-45502][BUILD] Upgrade Kafka to 3.6.1

    ### What changes were proposed in this pull request?
    Upgrade Apache Kafka from 3.4.1 to 3.6.1

    ### Why are the changes needed?
    - https://downloads.apache.org/kafka/3.6.1/RELEASE_NOTES.html
    - https://downloads.apache.org/kafka/3.6.0/RELEASE_NOTES.html
    - https://downloads.apache.org/kafka/3.5.0/RELEASE_NOTES.html

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    GitHub CI.

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #43348 from dengziming/kafka-3.6.0.

    Authored-by: dengziming
    Signed-off-by: Dongjoon Hyun

    Closes #44312 from dengziming/kafka-3.6.1.

    Authored-by: dengziming
    Signed-off-by: Dongjoon Hyun
---
 .../sql/kafka010/KafkaMicroBatchSourceSuite.scala  |  6 ++++--
 .../apache/spark/sql/kafka010/KafkaTestUtils.scala |  4 ++--
 .../spark/streaming/kafka010/KafkaRDDSuite.scala   | 16
 .../spark/streaming/kafka010/KafkaTestUtils.scala  |  4 ++--
 .../streaming/kafka010/mocks/MockScheduler.scala   | 25
 pom.xml                                            |  2 +-
 6 files changed, 30 insertions(+), 27 deletions(-)

diff --git a/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala b/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala
index 02e4e909734a..5b4567aa2881 100644
--- a/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala
+++ b/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala
@@ -154,7 +154,7 @@ abstract class KafkaSourceTest extends StreamTest with SharedSparkSession with K
       }
 
       val offset = KafkaSourceOffset(testUtils.getLatestOffsets(topics))
-      logInfo(s"Added data, expected offset $offset")
+      logInfo(s"Added data to topic: $topic, expected offset: $offset")
       (kafkaSource, offset)
     }
 
@@ -2691,7 +2691,9 @@ class KafkaSourceStressSuite extends KafkaSourceTest {
     start + Random.nextInt(start + end - 1)
   }
 
-  test("stress test with multiple topics and partitions") {
+  override val brokerProps = Map("auto.create.topics.enable" -> "false")
+
+  test("stress test with multiple topics and partitions") {
     topics.foreach { topic =>
       testUtils.createTopic(topic, partitions = nextInt(1, 6))
       testUtils.sendMessages(topic, (101 to 105).map { _.toString }.toArray)
diff --git a/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala b/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
index 1fa1dda9faf2..64e54ad63bdc 100644
--- a/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
+++ b/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
@@ -28,7 +28,6 @@ import scala.io.Source
 import scala.jdk.CollectionConverters._
 
 import com.google.common.io.Files
-import kafka.api.Request
 import kafka.server.{HostedPartition, KafkaConfig, KafkaServer}
 import kafka.server.checkpoints.OffsetCheckpointFile
 import kafka.zk.KafkaZkClient
@@ -40,6 +39,7 @@ import org.apache.kafka.clients.producer._
 import org.apache.kafka.common.TopicPartition
 import org.apache.kafka.common.config.SaslConfigs
 import org.apache.kafka.common.network.ListenerName
+import org.apache.kafka.common.requests.FetchRequest
 import org.apache.kafka.common.security.auth.SecurityProtocol.{PLAINTEXT, SASL_PLAINTEXT}
 import org.apache.kafka.common.serialization.StringSerializer
 import org.apache.kafka.common.utils.SystemTime
@@ -603,7 +603,7 @@ class KafkaTestUtils(
       .getPartitionInfo(topic, partition) match {
       case Some(partitionState) =>
         zkClient.getLeaderForPartition(new TopicPartition(topic, partition)).isDefined &&
-          Request.isValidBrokerId(partitionState.leader) &&
+          FetchRequest.isValidBrokerId(partitionState.leader) &&
           !partitionState.replicas.isEmpty
 
       case _ =>
diff --git a/connector/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaRDDSuite.scala b/connector/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/
(spark) branch master updated: [MINOR][DOCS] Use user-facing feature names in Python metrics documentation
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 20be0b0bffcb [MINOR][DOCS] Use user-facing feature names in Python metrics documentation
20be0b0bffcb is described below

commit 20be0b0bffcb3875c50af955ca89b9bd7d59fc8c
Author: Hyukjin Kwon
AuthorDate: Sat Dec 16 14:14:05 2023 -0800

    [MINOR][DOCS] Use user-facing feature names in Python metrics documentation

    ### What changes were proposed in this pull request?
    This PR proposes to use proper user-facing feature names instead of the names of logical plans in Python metrics documentation.

    ### Why are the changes needed?
    - For better maintenance of the code (e.g., renaming the plan)
    - Users are unlikely to know much about the plan.

    ### Does this PR introduce _any_ user-facing change?
    Yes, it changes the user-facing documentation for Python metrics.

    ### How was this patch tested?
    CI in this PR should verify it.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #44380 from HyukjinKwon/minor-doc-update.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Dongjoon Hyun
---
 docs/web-ui.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/web-ui.md b/docs/web-ui.md
index cdf62e0d8ec0..c500860a201b 100644
--- a/docs/web-ui.md
+++ b/docs/web-ui.md
@@ -406,8 +406,8 @@ Here is the list of SQL metrics:
 time to build hash map the time spent on building hash map ShuffledHashJoin
 task commit time the time spent on committing the output of a task after the writes succeed any write operation on a file-based table
 job commit time the time spent on committing the output of a job after the writes succeed any write operation on a file-based table
- data sent to Python workers the number of bytes of serialized data sent to the Python workers ArrowEvalPython, AggregateInPandas, BatchEvalPython, FlatMapGroupsInPandas, FlatMapsCoGroupsInPandas, FlatMapsCoGroupsInPandasWithState, MapInPandas, PythonMapInArrow, WindowsInPandas
- data returned from Python workers the number of bytes of serialized data received back from the Python workers ArrowEvalPython, AggregateInPandas, BatchEvalPython, FlatMapGroupsInPandas, FlatMapsCoGroupsInPandas, FlatMapsCoGroupsInPandasWithState, MapInPandas, PythonMapInArrow, WindowsInPandas
+ data sent to Python workers the number of bytes of serialized data sent to the Python workers Python UDFs, Pandas UDFs, Pandas Functions API and Python Data Source
+ data returned from Python workers the number of bytes of serialized data received back from the Python workers Python UDFs, Pandas UDFS, Pandas Functions API and Python Data Source

 ## Structured Streaming Tab
(spark) tag v3.3.4 created (now 18db204995b)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to tag v3.3.4
in repository https://gitbox.apache.org/repos/asf/spark.git

      at 18db204995b (commit)
No new revisions were added by this update.
svn commit: r66087 - /release/spark/spark-3.3.3/
Author: dongjoon
Date: Sat Dec 16 00:33:47 2023
New Revision: 66087

Log:
Remove Apache Spark 3.3.3 after releasing 3.3.4

Removed:
    release/spark/spark-3.3.3/
svn commit: r66086 - /dev/spark/v3.3.4-rc1-bin/ /release/spark/spark-3.3.4/
Author: dongjoon
Date: Sat Dec 16 00:25:31 2023
New Revision: 66086

Log:
Release Apache Spark 3.3.4

Added:
    release/spark/spark-3.3.4/
      - copied from r66085, dev/spark/v3.3.4-rc1-bin/
Removed:
    dev/spark/v3.3.4-rc1-bin/
(spark) branch master updated: [SPARK-46353][CORE] Refactor to improve `RegisterWorker` unit test coverage
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 2215cef40043 [SPARK-46353][CORE] Refactor to improve `RegisterWorker` unit test coverage
2215cef40043 is described below

commit 2215cef40043a3205446f8daecafed8f2360a742
Author: Dongjoon Hyun
AuthorDate: Tue Dec 12 09:57:43 2023 -0800

    [SPARK-46353][CORE] Refactor to improve `RegisterWorker` unit test coverage

    ### What changes were proposed in this pull request?

    This PR aims to improve the unit test coverage for `RegisterWorker` message handling.
    - Add a `handleRegisterWorker` helper method that is easy to test.
    - Add new unit tests for three conditional branches.

    ### Why are the changes needed?

    The helper is easy to test and improve. We can add more tests in this way in the future.

    ### Does this PR introduce _any_ user-facing change?

    No. This is a refactoring of the main code, plus additions to the test methods only.

    ### How was this patch tested?

    Pass the CIs.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #44284 from dongjoon-hyun/SPARK-46353.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 .../org/apache/spark/deploy/master/Master.scala    | 75 +-
 .../apache/spark/deploy/master/MasterSuite.scala   | 59 -
 2 files changed, 102 insertions(+), 32 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/master/Master.scala b/core/src/main/scala/org/apache/spark/deploy/master/Master.scala
index a550f44fc0a4..c8679c185ad7 100644
--- a/core/src/main/scala/org/apache/spark/deploy/master/Master.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/master/Master.scala
@@ -37,7 +37,7 @@ import org.apache.spark.internal.config.Deploy._
 import org.apache.spark.internal.config.UI._
 import org.apache.spark.internal.config.Worker._
 import org.apache.spark.metrics.{MetricsSystem, MetricsSystemInstances}
-import org.apache.spark.resource.{ResourceProfile, ResourceRequirement, ResourceUtils}
+import org.apache.spark.resource.{ResourceInformation, ResourceProfile, ResourceRequirement, ResourceUtils}
 import org.apache.spark.rpc._
 import org.apache.spark.serializer.{JavaSerializer, KryoSerializer, Serializer}
 import org.apache.spark.util.{SparkUncaughtExceptionHandler, ThreadUtils, Utils}
@@ -75,7 +75,8 @@ private[deploy] class Master(
   private val waitingApps = new ArrayBuffer[ApplicationInfo]
   val apps = new HashSet[ApplicationInfo]
 
-  private val idToWorker = new HashMap[String, WorkerInfo]
+  // Visible for testing
+  private[master] val idToWorker = new HashMap[String, WorkerInfo]
   private val addressToWorker = new HashMap[RpcAddress, WorkerInfo]
 
   private val endpointToApp = new HashMap[RpcEndpointRef, ApplicationInfo]
@@ -106,7 +107,7 @@ private[deploy] class Master(
 
   private[master] var state = RecoveryState.STANDBY
 
-  private var persistenceEngine: PersistenceEngine = _
+  private[master] var persistenceEngine: PersistenceEngine = _
 
   private var leaderElectionAgent: LeaderElectionAgent = _
 
@@ -281,33 +282,8 @@ private[deploy] class Master(
     case RegisterWorker(
       id, workerHost, workerPort, workerRef, cores,
       memory, workerWebUiUrl, masterAddress, resources) =>
-      logInfo("Registering worker %s:%d with %d cores, %s RAM".format(
-        workerHost, workerPort, cores, Utils.megabytesToString(memory)))
-      if (state == RecoveryState.STANDBY) {
-        workerRef.send(MasterInStandby)
-      } else if (idToWorker.contains(id)) {
-        if (idToWorker(id).state == WorkerState.UNKNOWN) {
-          logInfo("Worker has been re-registered: " + id)
-          idToWorker(id).state = WorkerState.ALIVE
-        }
-        workerRef.send(RegisteredWorker(self, masterWebUiUrl, masterAddress, true))
-      } else {
-        val workerResources =
-          resources.map(r => r._1 -> WorkerResourceInfo(r._1, r._2.addresses.toImmutableArraySeq))
-        val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
-          workerRef, workerWebUiUrl, workerResources)
-        if (registerWorker(worker)) {
-          persistenceEngine.addWorker(worker)
-          workerRef.send(RegisteredWorker(self, masterWebUiUrl, masterAddress, false))
-          schedule()
-        } else {
-          val workerAddress = worker.endpoint.address
-          logWarning("Worker registration failed. Attempted to re-register worker at same " +
-            "address: " + workerAddress)
-          workerRef.send(RegisterWorkerFailed("Attempted to re-register worker at same address: "
-            + workerAddress))
-        }
-      }
+      handle
(spark) branch branch-3.4 updated: [SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` drivers in `MasterPage`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.4 by this push:
     new b813e2e100f [SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` drivers in `MasterPage`
b813e2e100f is described below

commit b813e2e100faf7bab88c23ba9bba6e3197b169aa
Author: Dongjoon Hyun
AuthorDate: Mon Dec 11 15:05:21 2023 -0800

    [SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` drivers in `MasterPage`

    ### What changes were proposed in this pull request?

    This PR aims to remove the `kill` hyperlink from `RELAUNCHING` drivers in `MasterPage`.

    ### Why are the changes needed?

    Since Apache Spark 1.4.0 (SPARK-5495), `RELAUNCHING` drivers have `kill` hyperlinks in the `Completed Drivers` table.

    ![Screenshot 2023-12-11 at 1 02 29 PM](https://github.com/apache/spark/assets/9700541/38f4bf08-efb9-47e5-8a7a-f7d127429012)

    However, this is a bug because the driver was already terminated by definition. A newly relaunched driver has an independent ID that has no relationship with the previously terminated ID.

    https://github.com/apache/spark/blob/7db85642600b1e3b39ca11e41d4e3e0bf1c8962b/core/src/main/scala/org/apache/spark/deploy/master/DriverState.scala#L27

    If we click the `kill` link, `Master` always complains as follows.

    ```
    23/12/11 21:25:50 INFO Master: Asked to kill driver 202312112113-0
    23/12/11 21:25:50 WARN Master: Driver 202312112113-0 has already finished or does not exist
    ```

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Manual review.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #44301 from dongjoon-hyun/SPARK-46369.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit e434c9f0d5792b7af43c87dd6145fd8a6a04d8e2)
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit ac031d68a01f14cc73f05e83a790a6787aa6453d)
    Signed-off-by: Dongjoon Hyun
---
 core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
index a71eb33a2fe..e7e90aa0a37 100644
--- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
@@ -322,8 +322,7 @@ private[ui] class MasterPage(parent: MasterWebUI) extends WebUIPage("") {
   private def driverRow(driver: DriverInfo, showDuration: Boolean): Seq[Node] = {
     val killLink = if (parent.killEnabled &&
       (driver.state == DriverState.RUNNING ||
-        driver.state == DriverState.SUBMITTED ||
-        driver.state == DriverState.RELAUNCHING)) {
+        driver.state == DriverState.SUBMITTED)) {
       val confirm =
         s"if (window.confirm('Are you sure you want to kill driver ${driver.id} ?')) " +
           "{ this.parentNode.submit(); return true; } else { return false; }"
(spark) branch branch-3.5 updated: [SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` drivers in `MasterPage`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new ac031d68a01 [SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` drivers in `MasterPage`
ac031d68a01 is described below

commit ac031d68a01f14cc73f05e83a790a6787aa6453d
Author: Dongjoon Hyun
AuthorDate: Mon Dec 11 15:05:21 2023 -0800

    [SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` drivers in `MasterPage`

    ### What changes were proposed in this pull request?

    This PR aims to remove the `kill` hyperlink from `RELAUNCHING` drivers in `MasterPage`.

    ### Why are the changes needed?

    Since Apache Spark 1.4.0 (SPARK-5495), `RELAUNCHING` drivers have `kill` hyperlinks in the `Completed Drivers` table.

    ![Screenshot 2023-12-11 at 1 02 29 PM](https://github.com/apache/spark/assets/9700541/38f4bf08-efb9-47e5-8a7a-f7d127429012)

    However, this is a bug because the driver was already terminated by definition. A newly relaunched driver has an independent ID that has no relationship with the previously terminated ID.

    https://github.com/apache/spark/blob/7db85642600b1e3b39ca11e41d4e3e0bf1c8962b/core/src/main/scala/org/apache/spark/deploy/master/DriverState.scala#L27

    If we click the `kill` link, `Master` always complains as follows.

    ```
    23/12/11 21:25:50 INFO Master: Asked to kill driver 202312112113-0
    23/12/11 21:25:50 WARN Master: Driver 202312112113-0 has already finished or does not exist
    ```

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Manual review.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #44301 from dongjoon-hyun/SPARK-46369.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit e434c9f0d5792b7af43c87dd6145fd8a6a04d8e2)
    Signed-off-by: Dongjoon Hyun
---
 core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
index a71eb33a2fe..e7e90aa0a37 100644
--- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
@@ -322,8 +322,7 @@ private[ui] class MasterPage(parent: MasterWebUI) extends WebUIPage("") {
   private def driverRow(driver: DriverInfo, showDuration: Boolean): Seq[Node] = {
     val killLink = if (parent.killEnabled &&
       (driver.state == DriverState.RUNNING ||
-        driver.state == DriverState.SUBMITTED ||
-        driver.state == DriverState.RELAUNCHING)) {
+        driver.state == DriverState.SUBMITTED)) {
       val confirm =
         s"if (window.confirm('Are you sure you want to kill driver ${driver.id} ?')) " +
           "{ this.parentNode.submit(); return true; } else { return false; }"
(spark) branch master updated (3e0808c33f1 -> e434c9f0d57)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 3e0808c33f1 [SPARK-46351][SQL] Require an error class in `AnalysisException`
     add e434c9f0d57 [SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` drivers in `MasterPage`

No new revisions were added by this update.

Summary of changes:
 core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
(spark) branch master updated: [SPARK-46358][CONNECT] Simplify the condition check in the `ResponseValidator#verifyResponse`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 051b1781827 [SPARK-46358][CONNECT] Simplify the condition check in the `ResponseValidator#verifyResponse`
051b1781827 is described below

commit 051b1781827dd3a4e1e95a5354caa747ff41ae1a
Author: yangjie01
AuthorDate: Mon Dec 11 08:53:10 2023 -0800

    [SPARK-46358][CONNECT] Simplify the condition check in the `ResponseValidator#verifyResponse`

    ### What changes were proposed in this pull request?
    This PR refactors the `verifyResponse` function in `ResponseValidator` as follows:
    1. The `response.hasField(field)` check is moved before `value` is read, so `value` is obtained only when `response.hasField(field)` is true, which matches the existing comments more closely.
    2. The `value != ""` condition is removed from the case match, because the `if` branch is entered only when `value.nonEmpty` is true, and `value.nonEmpty` already covers `value != ""`.
    3. The `value != id` check is moved inside `case Some(id)`. After the change, an `IllegalStateException` is still thrown when the id exists and `value != id`, but `serverSideSessionId` is no longer reassigned when the id exists and `value == id`.
    4. The redundant `toString` call is removed when reassigning `serverSideSessionId`, because `value` is already a `String`.
    5. The no-op `case _` match is removed, because it is unreachable after the changes above.

    ### Why are the changes needed?
    Simplify the condition check in the `verifyResponse` function

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    Pass GitHub Actions

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #44291 from LuciferYang/Simplify-ResponseValidator-verifyResponse.

    Lead-authored-by: yangjie01
    Co-authored-by: YangJie
    Signed-off-by: Dongjoon Hyun
---
 .../spark/sql/connect/client/ResponseValidator.scala | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ResponseValidator.scala b/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ResponseValidator.scala
index 67f29c727ef..22c5505e7d4 100644
--- a/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ResponseValidator.scala
+++ b/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ResponseValidator.scala
@@ -35,18 +35,20 @@ class ResponseValidator extends Logging {
     val field = response.getDescriptorForType.findFieldByName("server_side_session_id")
     // If the field does not exist, we ignore it. New / Old message might not contain it and this
     // behavior allows us to be compatible.
-    if (field != null) {
+    if (field != null && response.hasField(field)) {
       val value = response.getField(field).asInstanceOf[String]
       // Ignore, if the value is unset.
-      if (response.hasField(field) && value != null && value.nonEmpty) {
+      if (value != null && value.nonEmpty) {
         serverSideSessionId match {
-          case Some(id) if value != id && value != "" =>
-            throw new IllegalStateException(s"Server side session ID changed from $id to $value")
-          case _ if value != "" =>
+          case Some(id) =>
+            if (value != id) {
+              throw new IllegalStateException(
+                s"Server side session ID changed from $id to $value")
+            }
+          case _ =>
             synchronized {
-              serverSideSessionId = Some(value.toString)
+              serverSideSessionId = Some(value)
             }
-          case _ => // No-op
         }
       }
     } else {
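The control flow that the SPARK-46358 refactoring above arrives at (ignore unset values, remember the first session ID, reject a different one later) can be sketched on its own. This is a simplified Java illustration, not the Spark Connect code: `SessionIdTracker`, `verify`, and `current` are hypothetical names, and the whole method is synchronized here for brevity, whereas the diff synchronizes only the assignment.

```java
import java.util.Optional;

/** Illustrative only: a hypothetical stand-in for the session ID check. */
final class SessionIdTracker {
  private Optional<String> serverSideSessionId = Optional.empty();

  /**
   * Records the first non-empty session ID and rejects any later value that
   * differs from it. A null or empty value is treated as unset and ignored.
   */
  synchronized void verify(String value) {
    if (value == null || value.isEmpty()) {
      return; // Unset: nothing to compare or record.
    }
    if (serverSideSessionId.isPresent()) {
      String id = serverSideSessionId.get();
      if (!value.equals(id)) {
        throw new IllegalStateException(
            "Server side session ID changed from " + id + " to " + value);
      }
      // Same ID as before: no reassignment needed.
    } else {
      serverSideSessionId = Optional.of(value);
    }
  }

  synchronized Optional<String> current() {
    return serverSideSessionId;
  }
}
```

Checking emptiness once at the top is what makes the later `value != ""` guards redundant, which is the simplification the commit message describes in its second point.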
(spark) branch master updated (bab884082c0 -> bb886abcc4a)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from bab884082c0 [MINOR][DOCS] Fix documentation for `spark.sql.legacy.doLooseUpcast` in SQL migration guide
     add bb886abcc4a [SPARK-46356][BUILD] Upgrade `sbt-assembly` to 2.1.5, `sbt-checkstyle-plugin` to 4.0.1

No new revisions were added by this update.

Summary of changes:
 project/plugins.sbt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
(spark) branch master updated: [SPARK-46215][CORE][FOLLOWUP] Handle symbolic links
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new bacdb3b5fec [SPARK-46215][CORE][FOLLOWUP] Handle symbolic links bacdb3b5fec is described below commit bacdb3b5fec9783f4604276480eb2a0f5702 Author: Dongjoon Hyun AuthorDate: Sun Dec 10 18:46:37 2023 -0800 [SPARK-46215][CORE][FOLLOWUP] Handle symbolic links ### What changes were proposed in this pull request? This PR aims to fix a regression on symbolic links. ### Why are the changes needed? To have the same behavior with symbolic links. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs with newly added test case. I also verified this on Mac's `/tmp` directory. ``` $ ls -al /tmp lrwxr-xr-x 1 root wheel 11 Nov 17 02:37 /tmp -> private/tmp ``` **MASTER** ``` 23/12/10 16:04:53 INFO FileSystemRecoveryModeFactory: Persisting recovery state to directory: /tmp 23/12/10 16:04:53 INFO Master: I have been elected leader! New state: ALIVE 23/12/10 16:08:39 INFO Master: Registering worker 127.0.0.1:50535 with 8 cores, 15.0 GiB RAM ``` **PERSISTED DATA** ``` $ ls -al /tmp/worker_* -rw-r--r-- 1 dongjoon wheel 1354 Dec 10 16:08 /tmp/worker_worker-20231210160839-127.0.0.1-50535 ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44285 from dongjoon-hyun/SPARK-46215-2. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../deploy/master/FileSystemPersistenceEngine.scala | 9 +++-- .../spark/deploy/master/PersistenceEngineSuite.scala | 16 2 files changed, 23 insertions(+), 2 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala b/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala index 785367a0dee..fb067f10c5a 100644 --- a/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala +++ b/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala @@ -18,7 +18,7 @@ package org.apache.spark.deploy.master import java.io._ -import java.nio.file.{Files, Paths} +import java.nio.file.{FileAlreadyExistsException, Files, Paths} import scala.reflect.ClassTag @@ -42,7 +42,12 @@ private[master] class FileSystemPersistenceEngine( val codec: Option[CompressionCodec] = None) extends PersistenceEngine with Logging { - Files.createDirectories(Paths.get(dir)) + try { +Files.createDirectories(Paths.get(dir)) + } catch { +case _: FileAlreadyExistsException if Files.isSymbolicLink(Paths.get(dir)) => + Files.createDirectories(Paths.get(dir).toRealPath()) + } override def persist(name: String, obj: Object): Unit = { serializeIntoFile(new File(dir + File.separator + name), obj) diff --git a/core/src/test/scala/org/apache/spark/deploy/master/PersistenceEngineSuite.scala b/core/src/test/scala/org/apache/spark/deploy/master/PersistenceEngineSuite.scala index 84181ea3fca..b977a114244 100644 --- a/core/src/test/scala/org/apache/spark/deploy/master/PersistenceEngineSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/master/PersistenceEngineSuite.scala @@ -19,6 +19,7 @@ package org.apache.spark.deploy.master import java.net.ServerSocket +import java.nio.file.{Files, Paths} import java.util.concurrent.ThreadLocalRandom import org.apache.curator.test.TestingServer @@ -72,6 +73,21 @@ class 
PersistenceEngineSuite extends SparkFunSuite { } } + test("SPARK-46215: FileSystemPersistenceEngine with a symbolic link") { +withTempDir { dir => + val target = Paths.get(dir.getAbsolutePath(), "target") + val link = Paths.get(dir.getAbsolutePath(), "symbolic_link"); + + Files.createDirectories(target) + Files.createSymbolicLink(link, target); + + val conf = new SparkConf() + testPersistenceEngine(conf, serializer => +new FileSystemPersistenceEngine(link.toAbsolutePath.toString, serializer) + ) +} + } + test("SPARK-46205: Support KryoSerializer in FileSystemPersistenceEngine") { withTempDir { dir => val conf = new SparkConf()
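The symbolic-link fallback from this commit can be reproduced with `java.nio.file` alone. The sketch below is a minimal Java rendering of the same idea; the class name `SymlinkDirs` and the method `ensureDirectory` are hypothetical. Whether `Files.createDirectories` throws for a link to an existing directory appears to depend on the JDK version, so the helper is written to work in either case.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; it mirrors the try/catch added in the diff.
final class SymlinkDirs {
    /** Creates dir and its parents; if dir is a symbolic link, creates its real target instead. */
    static Path ensureDirectory(Path dir) throws IOException {
        try {
            return Files.createDirectories(dir);
        } catch (FileAlreadyExistsException e) {
            if (Files.isSymbolicLink(dir)) {
                // Resolve the link and make sure the directory it points to exists.
                return Files.createDirectories(dir.toRealPath());
            }
            throw e; // something that is neither a directory nor a link is in the way
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("symlink-demo");
        Path target = Files.createDirectories(base.resolve("target"));
        Path link = Files.createSymbolicLink(base.resolve("symbolic_link"), target);
        Path result = ensureDirectory(link);
        System.out.println(Files.isSameFile(result, target)); // prints "true"
    }
}
```

Note that `toRealPath()` fails for a dangling link, so this sketch, like the diff, only covers links whose target already exists.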
(spark) branch master updated: [SPARK-45642][CORE][SQL] Fix `FileSystem.isFile & FileSystem.isDirectory is deprecated`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new fd009d652f7 [SPARK-45642][CORE][SQL] Fix `FileSystem.isFile & FileSystem.isDirectory is deprecated` fd009d652f7 is described below commit fd009d652f7922254ccc7cc631b8df3a6b821532 Author: panbingkun AuthorDate: Sun Dec 10 14:11:19 2023 -0800 [SPARK-45642][CORE][SQL] Fix `FileSystem.isFile & FileSystem.isDirectory is deprecated` ### What changes were proposed in this pull request? This PR fixes the `FileSystem.isFile & FileSystem.isDirectory is deprecated` warnings and makes some error messages more accurate. ### Why are the changes needed? - Prepare for a future Hadoop release that actually removes these methods. - Reduce warnings. - Make some error messages more accurate. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. - Manual testing. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43505 from panbingkun/SPARK-45642. 
Authored-by: panbingkun Signed-off-by: Dongjoon Hyun --- .../scala/org/apache/spark/deploy/SparkHadoopUtil.scala| 9 - .../apache/spark/deploy/history/FsHistoryProvider.scala| 2 +- core/src/main/scala/org/apache/spark/util/Utils.scala | 2 +- .../scala/org/apache/spark/deploy/SparkSubmitSuite.scala | 2 +- .../execution/datasources/PartitioningAwareFileIndex.scala | 14 +++--- .../spark/sql/execution/streaming/FileStreamSink.scala | 2 +- .../scala/org/apache/spark/streaming/util/HdfsUtils.scala | 2 +- 7 files changed, 24 insertions(+), 9 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala index 50906f76b6e..628b688dedb 100644 --- a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala +++ b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala @@ -17,7 +17,7 @@ package org.apache.spark.deploy -import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream, File, IOException} +import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream, File, FileNotFoundException, IOException} import java.net.InetAddress import java.security.PrivilegedExceptionAction import java.text.DateFormat @@ -593,4 +593,11 @@ private[spark] object SparkHadoopUtil extends Logging { } } + def isFile(fs: FileSystem, path: Path): Boolean = { +try { + fs.getFileStatus(path).isFile +} catch { + case _: FileNotFoundException => false +} + } } diff --git a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala index 565499bb610..73fb0086b33 100644 --- a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala +++ b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala @@ -860,7 +860,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) try { // 
Fetch the entry first to avoid an RPC when it's already removed. listing.read(classOf[LogInfo], inProgressLog) -if (!fs.isFile(new Path(inProgressLog))) { +if (!SparkHadoopUtil.isFile(fs, new Path(inProgressLog))) { listing.synchronized { listing.delete(classOf[LogInfo], inProgressLog) } diff --git a/core/src/main/scala/org/apache/spark/util/Utils.scala b/core/src/main/scala/org/apache/spark/util/Utils.scala index 04281f5..a074bd53d26 100644 --- a/core/src/main/scala/org/apache/spark/util/Utils.scala +++ b/core/src/main/scala/org/apache/spark/util/Utils.scala @@ -674,7 +674,7 @@ private[spark] object Utils throw new IOException(s"Failed to create directory ${targetDir.getPath}") } val dest = new File(targetDir, filename.getOrElse(path.getName)) -if (fs.isFile(path)) { +if (fs.getFileStatus(path).isFile) { val in = fs.open(path) try { downloadFile(path.toString, in, dest, fileOverwrite) diff --git a/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala b/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala index e235b8aeb77..d16a15df1b5 100644 --- a/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala @@ -1774,7 +1774,7 @@ class TestFileSystem extends org.apache.hadoop.fs.LocalFileSystem { status } - override def isFile(path: Path): Boolean = super.isFile(local(path)) + override def getFileS
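The new `SparkHadoopUtil.isFile` helper treats a missing path as "not a file" instead of letting the lookup fail. A rough analogue of that pattern is sketched below with `java.nio.file` rather than Hadoop's `FileSystem`, so that it runs without extra dependencies; the class `FileChecks` is hypothetical, and `NoSuchFileException` stands in for the `FileNotFoundException` caught in the diff.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical analogue for illustration; the real helper works on Hadoop's FileSystem and Path.
final class FileChecks {
    /** Returns true only for an existing regular file; a missing path yields false. */
    static boolean isFile(Path path) throws IOException {
        try {
            return Files.readAttributes(path, BasicFileAttributes.class).isRegularFile();
        } catch (NoSuchFileException e) {
            return false; // a path that does not exist is simply not a file
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("isfile-demo", ".txt");
        System.out.println(isFile(file));             // prints "true"
        System.out.println(isFile(file.getParent())); // prints "false"
    }
}
```

Call sites that use the status lookup directly, without such a wrapper, surface the not-found exception to the caller, which may be the desired behavior when a missing path is a real error.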
(spark) branch branch-3.5 updated: [SPARK-45969][DOCS] Document configuration change of executor failure tracker
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new cbaefe9cc6a [SPARK-45969][DOCS] Document configuration change of executor failure tracker cbaefe9cc6a is described below commit cbaefe9cc6a22c940728b6717aeaa51c7d550ddc Author: Cheng Pan AuthorDate: Sun Dec 10 14:03:37 2023 -0800 [SPARK-45969][DOCS] Document configuration change of executor failure tracker It's a follow-up of SPARK-41210 (use a new JIRA ticket because it was released in 3.5.0), this PR updates docs/migration guide about configuration change of executor failure tracker Docs update is missing in previous changes, also is requested https://github.com/apache/spark/commit/40872e9a094f8459b0b6f626937ced48a8d98efb#r132516892 by tgravescs Yes, docs changed Review No Closes #43863 from pan3793/SPARK-45969. Authored-by: Cheng Pan Signed-off-by: Dongjoon Hyun (cherry picked from commit 7a43de193aa5a0856e098088728dccea37f169c5) Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/internal/config/package.scala | 4 ++-- docs/configuration.md | 21 + docs/core-migration-guide.md| 6 ++ docs/running-on-yarn.md | 17 - 4 files changed, 29 insertions(+), 19 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/internal/config/package.scala b/core/src/main/scala/org/apache/spark/internal/config/package.scala index 600cbf151e1..c5e23cae1f8 100644 --- a/core/src/main/scala/org/apache/spark/internal/config/package.scala +++ b/core/src/main/scala/org/apache/spark/internal/config/package.scala @@ -924,7 +924,7 @@ package object config { private[spark] val MAX_EXECUTOR_FAILURES = ConfigBuilder("spark.executor.maxNumFailures") - .doc("Spark exits if the number of failed executors exceeds this threshold. " + + .doc("The maximum number of executor failures before failing the application. 
" + "This configuration only takes effect on YARN, or Kubernetes when " + "`spark.kubernetes.allocation.pods.allocator` is set to 'direct'.") .version("3.5.0") @@ -933,7 +933,7 @@ package object config { private[spark] val EXECUTOR_ATTEMPT_FAILURE_VALIDITY_INTERVAL_MS = ConfigBuilder("spark.executor.failuresValidityInterval") - .doc("Interval after which Executor failures will be considered independent and not " + + .doc("Interval after which executor failures will be considered independent and not " + "accumulate towards the attempt count. This configuration only takes effect on YARN, " + "or Kubernetes when `spark.kubernetes.allocation.pods.allocator` is set to 'direct'.") .version("3.5.0") diff --git a/docs/configuration.md b/docs/configuration.md index f79406c5b6d..645c3e8208a 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -514,6 +514,27 @@ of the most common options to set are: 3.2.0 + + spark.executor.maxNumFailures + numExecutors * 2, with minimum of 3 + +The maximum number of executor failures before failing the application. +This configuration only takes effect on YARN, or Kubernetes when +`spark.kubernetes.allocation.pods.allocator` is set to 'direct'. + + 3.5.0 + + + spark.executor.failuresValidityInterval + (none) + +Interval after which executor failures will be considered independent and +not accumulate towards the attempt count. +This configuration only takes effect on YARN, or Kubernetes when +`spark.kubernetes.allocation.pods.allocator` is set to 'direct'. + + 3.5.0 + Apart from these, the following properties are also available, and may be useful in some situations: diff --git a/docs/core-migration-guide.md b/docs/core-migration-guide.md index 3f97a484e1a..36465cc3f4e 100644 --- a/docs/core-migration-guide.md +++ b/docs/core-migration-guide.md @@ -22,6 +22,12 @@ license: | * Table of contents {:toc} +## Upgrading from Core 3.4 to 3.5 + +- Since Spark 3.5, `spark.yarn.executor.failuresValidityInterval` is deprecated. 
Use `spark.executor.failuresValidityInterval` instead. + +- Since Spark 3.5, `spark.yarn.max.executor.failures` is deprecated. Use `spark.executor.maxNumFailures` instead. + ## Upgrading from Core 3.3 to 3.4 - Since Spark 3.4, Spark driver will own `PersistentVolumnClaim`s and try to reuse if they are not assigned to live executors. To restore the behavior before Spark 3.4, you can set `spark.kubernetes.driver.ownPersistentVolumeClaim` to `false` and `spark.kubernetes.driver.reusePersistentVolumeClaim` to `false`.
(spark) branch master updated: [SPARK-45969][DOCS] Document configuration change of executor failure tracker
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7a43de193aa [SPARK-45969][DOCS] Document configuration change of executor failure tracker 7a43de193aa is described below commit 7a43de193aa5a0856e098088728dccea37f169c5 Author: Cheng Pan AuthorDate: Sun Dec 10 14:03:37 2023 -0800 [SPARK-45969][DOCS] Document configuration change of executor failure tracker ### What changes were proposed in this pull request? It's a follow-up of SPARK-41210 (use a new JIRA ticket because it was released in 3.5.0), this PR updates docs/migration guide about configuration change of executor failure tracker ### Why are the changes needed? Docs update is missing in previous changes, also is requested https://github.com/apache/spark/commit/40872e9a094f8459b0b6f626937ced48a8d98efb#r132516892 by tgravescs ### Does this PR introduce _any_ user-facing change? Yes, docs changed ### How was this patch tested? Review ### Was this patch authored or co-authored using generative AI tooling? No Closes #43863 from pan3793/SPARK-45969. 
Authored-by: Cheng Pan Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/internal/config/package.scala | 4 ++-- docs/configuration.md | 21 + docs/core-migration-guide.md| 6 ++ docs/running-on-yarn.md | 17 - 4 files changed, 29 insertions(+), 19 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/internal/config/package.scala b/core/src/main/scala/org/apache/spark/internal/config/package.scala index 2c710e6025d..2823b7cdb60 100644 --- a/core/src/main/scala/org/apache/spark/internal/config/package.scala +++ b/core/src/main/scala/org/apache/spark/internal/config/package.scala @@ -931,7 +931,7 @@ package object config { private[spark] val MAX_EXECUTOR_FAILURES = ConfigBuilder("spark.executor.maxNumFailures") - .doc("Spark exits if the number of failed executors exceeds this threshold. " + + .doc("The maximum number of executor failures before failing the application. " + "This configuration only takes effect on YARN, or Kubernetes when " + "`spark.kubernetes.allocation.pods.allocator` is set to 'direct'.") .version("3.5.0") @@ -940,7 +940,7 @@ package object config { private[spark] val EXECUTOR_ATTEMPT_FAILURE_VALIDITY_INTERVAL_MS = ConfigBuilder("spark.executor.failuresValidityInterval") - .doc("Interval after which Executor failures will be considered independent and not " + + .doc("Interval after which executor failures will be considered independent and not " + "accumulate towards the attempt count. This configuration only takes effect on YARN, " + "or Kubernetes when `spark.kubernetes.allocation.pods.allocator` is set to 'direct'.") .version("3.5.0") diff --git a/docs/configuration.md b/docs/configuration.md index f261e3b2deb..b45d647fde8 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -522,6 +522,27 @@ of the most common options to set are: 3.2.0 + + spark.executor.maxNumFailures + numExecutors * 2, with minimum of 3 + +The maximum number of executor failures before failing the application. 
+This configuration only takes effect on YARN, or Kubernetes when +`spark.kubernetes.allocation.pods.allocator` is set to 'direct'. + + 3.5.0 + + + spark.executor.failuresValidityInterval + (none) + +Interval after which executor failures will be considered independent and +not accumulate towards the attempt count. +This configuration only takes effect on YARN, or Kubernetes when +`spark.kubernetes.allocation.pods.allocator` is set to 'direct'. + + 3.5.0 + Apart from these, the following properties are also available, and may be useful in some situations: diff --git a/docs/core-migration-guide.md b/docs/core-migration-guide.md index 09ba4b474e9..179b0b3fae1 100644 --- a/docs/core-migration-guide.md +++ b/docs/core-migration-guide.md @@ -32,6 +32,12 @@ license: | - In Spark 4.0, support for Apache Mesos as a resource manager was removed. +## Upgrading from Core 3.4 to 3.5 + +- Since Spark 3.5, `spark.yarn.executor.failuresValidityInterval` is deprecated. Use `spark.executor.failuresValidityInterval` instead. + +- Since Spark 3.5, `spark.yarn.max.executor.failures` is deprecated. Use `spark.executor.maxNumFailures` instead. + ## Upgrading from Core 3.3 to 3.4 - Since Spark 3.4, Spark driver will own `PersistentVolumnClai
(spark) branch master updated: [SPARK-46285][SQL] Add `foreachWithSubqueries`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a77713081a8 [SPARK-46285][SQL] Add `foreachWithSubqueries` a77713081a8 is described below commit a77713081a8065a5feaa3b438a79fdeb6a4b6782 Author: Rui Wang AuthorDate: Sun Dec 10 14:00:15 2023 -0800 [SPARK-46285][SQL] Add `foreachWithSubqueries` ### What changes were proposed in this pull request? We can have a `foreachWithSubqueries` which also traverses the subqueries in the query plan. ### Why are the changes needed? Add a new way to access subqueries in the query plan. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? UT ### Was this patch authored or co-authored using generative AI tooling? no Closes #44206 from amaliujia/foreachsubqueries. Authored-by: Rui Wang Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/sql/catalyst/plans/QueryPlan.scala | 11 +++ .../apache/spark/sql/catalyst/plans/LogicalPlanSuite.scala | 13 + 2 files changed, 24 insertions(+) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala index 9656a93575d..ef7cd7401f2 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala @@ -534,6 +534,17 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] transformDownWithPruning(cond, ruleId)(g) } + /** + * A variant of [[foreach]] which considers plan nodes inside subqueries as well. + */ + def foreachWithSubqueries(f: PlanType => Unit): Unit = { +def actualFunc(plan: PlanType): Unit = { + f(plan) + plan.subqueries.foreach(_.foreachWithSubqueries(f)) +} +foreach(actualFunc) + } + /** * A variant of `collect`. 
This method not only apply the given function to all elements in this * plan, also considering all the plans in its (nested) subqueries diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/LogicalPlanSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/LogicalPlanSuite.scala index 3eba9eebc3d..31f7e07143c 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/LogicalPlanSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/LogicalPlanSuite.scala @@ -18,6 +18,7 @@ package org.apache.spark.sql.catalyst.plans import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.dsl.plans._ import org.apache.spark.sql.catalyst.expressions._ @@ -145,4 +146,16 @@ class LogicalPlanSuite extends SparkFunSuite { assert(query.where(Literal.FalseLiteral).maxRows.contains(0)) assert(query.where(Literal.FalseLiteral).maxRowsPerPartition.contains(0)) } + + test("SPARK-46285: foreachWithSubqueries") { +val input = UnresolvedRelation(Seq("subquery_table")) +val input2 = UnresolvedRelation(Seq("t")) +val plan = Filter(Exists(input), input2) +val tableNames = scala.collection.mutable.Set[String]() +plan.foreachWithSubqueries { + case e: UnresolvedRelation => tableNames.add(e.name) + case _ => +} +assert(tableNames.contains("subquery_table")) + } }
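The traversal added in this commit can be illustrated on a toy tree. The Java sketch below assumes a hypothetical `PlanNode` with separate `children` and `subqueries` lists; Spark's `QueryPlan` derives its subqueries from expressions and is considerably richer, so this shows the recursion shape only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical, simplified plan node for illustration; not Spark's QueryPlan.
final class PlanNode {
    final String name;
    final List<PlanNode> children = new ArrayList<>();
    final List<PlanNode> subqueries = new ArrayList<>();

    PlanNode(String name) {
        this.name = name;
    }

    /** Pre-order visit of this node and its children; subqueries are not visited. */
    void foreach(Consumer<PlanNode> f) {
        f.accept(this);
        for (PlanNode child : children) {
            child.foreach(f);
        }
    }

    /** Like foreach, but also descends into the subqueries of every visited node. */
    void foreachWithSubqueries(Consumer<PlanNode> f) {
        foreach(node -> {
            f.accept(node);
            for (PlanNode sub : node.subqueries) {
                sub.foreachWithSubqueries(f);
            }
        });
    }

    public static void main(String[] args) {
        PlanNode filter = new PlanNode("Filter");
        filter.children.add(new PlanNode("t"));
        filter.subqueries.add(new PlanNode("subquery_table"));
        List<String> seen = new ArrayList<>();
        filter.foreachWithSubqueries(n -> seen.add(n.name));
        System.out.println(seen); // prints "[Filter, subquery_table, t]"
    }
}
```

As in the unit test from the diff, a plain `foreach` over the same tree would never reach `subquery_table`.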
(spark) branch master updated: [SPARK-46342][SQL] Replace `IllegalStateException` by `SparkException.internalError` in sql
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d02fbba6491 [SPARK-46342][SQL] Replace `IllegalStateException` by `SparkException.internalError` in sql d02fbba6491 is described below commit d02fbba6491fd17dc6bfc1a416971af7544952f3 Author: Max Gekk AuthorDate: Sun Dec 10 11:24:02 2023 -0800 [SPARK-46342][SQL] Replace `IllegalStateException` by `SparkException.internalError` in sql ### What changes were proposed in this pull request? In the PR, I propose to replace all `IllegalStateException` exceptions in the `sql` project, except for `streaming`, with `SparkException.internalError`. ### Why are the changes needed? This is part of the migration to the new error framework and error classes. ### Does this PR introduce _any_ user-facing change? No, users shouldn't encounter `IllegalStateException` in regular cases. ### How was this patch tested? Using existing GAs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44275 from MaxGekk/replace-ise-by-internal-error. 
Authored-by: Max Gekk Signed-off-by: Dongjoon Hyun --- .../spark/sql/catalyst/util/TimestampFormatter.scala| 5 +++-- .../scala/org/apache/spark/sql/util/ArrowUtils.scala| 3 ++- .../execution/datasources/v2/DataSourceV2Relation.scala | 2 +- .../org/apache/spark/sql/util/ArrowUtilsSuite.scala | 12 +++- .../main/scala/org/apache/spark/sql/SparkSession.scala | 6 +++--- .../apache/spark/sql/api/python/PythonSQLUtils.scala| 3 ++- .../sql/catalyst/analysis/ResolveSessionCatalog.scala | 3 ++- .../spark/sql/execution/OptimizeMetadataOnlyQuery.scala | 3 ++- .../org/apache/spark/sql/execution/SQLExecution.scala | 4 ++-- .../org/apache/spark/sql/execution/SparkSqlParser.scala | 3 ++- .../apache/spark/sql/execution/SparkStrategies.scala| 17 + .../spark/sql/execution/WholeStageCodegenExec.scala | 4 ++-- .../sql/execution/adaptive/AQEShuffleReadExec.scala | 8 .../spark/sql/execution/adaptive/QueryStageExec.scala | 8 .../execution/aggregate/AggregateCodegenSupport.scala | 3 ++- .../sql/execution/aggregate/BaseAggregateExec.scala | 7 --- .../execution/aggregate/ObjectAggregationIterator.scala | 4 ++-- .../aggregate/TungstenAggregationIterator.scala | 4 ++-- .../sql/execution/aggregate/UpdatingSessionsExec.scala | 3 ++- .../execution/analysis/DetectAmbiguousSelfJoin.scala| 4 +++- .../spark/sql/execution/basicPhysicalOperators.scala| 8 .../spark/sql/execution/columnar/InMemoryRelation.scala | 4 ++-- .../columnar/compression/compressionSchemes.scala | 5 +++-- .../spark/sql/execution/datasources/DataSource.scala| 4 ++-- .../sql/execution/datasources/DataSourceUtils.scala | 6 +++--- .../sql/execution/datasources/jdbc/DriverRegistry.scala | 3 ++- .../datasources/parquet/ParquetWriteSupport.scala | 4 ++-- .../execution/datasources/v2/DataSourceV2Strategy.scala | 6 -- .../sql/execution/exchange/ShuffleExchangeExec.scala| 4 ++-- .../apache/spark/sql/execution/metric/SQLMetrics.scala | 4 ++-- .../spark/sql/execution/python/EvaluatePython.scala | 5 +++-- 
.../spark/sql/execution/python/ExtractPythonUDFs.scala | 7 --- .../python/FlatMapGroupsInPandasWithStateExec.scala | 4 ++-- .../spark/sql/execution/window/AggregateProcessor.scala | 3 ++- .../execution/window/WindowEvaluatorFactoryBase.scala | 9 + .../apache/spark/sql/expressions/ReduceAggregator.scala | 3 ++- .../org/apache/spark/sql/SparkSessionBuilderSuite.scala | 2 +- .../org/apache/spark/sql/execution/SparkPlanSuite.scala | 13 - .../spark/sql/execution/WholeStageCodegenSuite.scala| 13 - .../sql/execution/adaptive/AdaptiveQueryExecSuite.scala | 12 +++- .../spark/sql/expressions/ReduceAggregatorSuite.scala | 4 ++-- .../org/apache/spark/sql/hive/HiveInspectors.scala | 3 ++- .../apache/spark/sql/hive/client/HiveClientImpl.scala | 2 +- .../apache/spark/sql/hive/execution/HiveTempPath.scala | 5 +++-- 44 files changed, 136 insertions(+), 103 deletions(-) diff --git a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala index df146e0dbfd..9539ced52dc 100644 --- a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala +++ b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala @@ -29,6 +29,7 @@ import scala.util.control.NonFatal import org.apache.commons.lang3.time.FastDateFormat +import org.apache.spark.SparkException import
(spark) branch master updated: [SPARK-46346][CORE] Fix Master to update a worker from `UNKNOWN` to `ALIVE` on `RegisterWorker` msg
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 109b1e4a79d [SPARK-46346][CORE] Fix Master to update a worker from `UNKNOWN` to `ALIVE` on `RegisterWorker` msg 109b1e4a79d is described below commit 109b1e4a79d9a5ec33944887a34c92d453016902 Author: Dongjoon Hyun AuthorDate: Sun Dec 10 11:19:46 2023 -0800 [SPARK-46346][CORE] Fix Master to update a worker from `UNKNOWN` to `ALIVE` on `RegisterWorker` msg ### What changes were proposed in this pull request? This PR aims to fix `Spark Master`'s recovery process to update a worker status from `UNKNOWN` to `ALIVE` when it receives a `RegisterWorker` message from that worker. ### Why are the changes needed? This only happens during the recovery. - `Master` already has the recovered worker information in memory with `UNKNOWN` status. - `Worker` sends `RegisterWorker` message correctly. - `Master` keeps its worker status in `UNKNOWN` and informs the worker with `RegisteredWorker` message with `duplicated` flag. - `Worker` then logs messages like the following and will not try to reconnect. ``` 23/12/09 23:49:57 INFO Worker: Retrying connection to master (attempt # 3) 23/12/09 23:49:57 INFO Worker: Connecting to master ...:7077... 23/12/09 23:50:04 INFO TransportClientFactory: Successfully created connection to master...:7077 after 7089 ms (0 ms spent in bootstraps) 23/12/09 23:50:04 WARN Worker: Duplicate registration at master spark://... 23/12/09 23:50:04 INFO Worker: Successfully registered with master spark://... ``` The `UNKNOWN`-status workers block the recovery process and cause a long delay. https://github.com/apache/spark/blob/bac3492980a3e793065a9e9d511ddf0fb66357b3/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L604-L606 After the delay, master simply kills them all. 
https://github.com/apache/spark/blob/bac3492980a3e793065a9e9d511ddf0fb66357b3/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L647-L649 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This case is a little hard to make a unit test. Manually test. - Master ``` 23/12/10 04:58:30 WARN OneWayOutboxMessage: Failed to send one-way RPC. java.io.IOException: Connecting to /***:1024 timed out (1 ms) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:291) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:226) at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204) at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202) at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.base/java.lang.Thread.run(Thread.java:840) 23/12/10 04:58:54 INFO Master: Registering worker ***:1024 with 2 cores, 23.0 GiB RAM 23/12/10 04:58:54 INFO Master: Worker has been re-registered: worker-20231210045613-***-1024 ``` - Worker ``` 23/12/10 04:58:45 INFO Worker: Retrying connection to master (attempt # 5) 23/12/10 04:58:45 INFO Worker: Connecting to master master:7077... 
23/12/10 04:58:54 INFO TransportClientFactory: Successfully created connection to master/...:7077 after 63957 ms (0 ms spent in bootstraps) 23/12/10 04:58:54 WARN Worker: Duplicate registration at master spark://master:7077 23/12/10 04:58:54 INFO Worker: Successfully registered with master spark://master:7077 23/12/10 04:58:54 INFO Worker: WorkerWebUI is available at https://...-1***-1024 23/12/10 04:58:54 INFO Worker: Worker cleanup enabled; old application directories will be deleted in: /data/spark ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44280 from dongjoon-hyun/SPARK-46346. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/deploy/master/Master.scala | 4 1 file changed, 4 insertions(+) diff --git a/core/src/main/scala/org/apache/spark/deploy/master/Master.scala b/core/src/main/scala/org/apache/spark/deploy/master/Master.scala index 7346c80aff4..a550f44fc0a
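The diff for this commit is cut off above, so the following Java sketch is based only on the behavior described in the commit message: a worker that the master recovered as `UNKNOWN` becomes `ALIVE` again when it re-registers, and the master still reports the registration as a duplicate. Every name here (`WorkerRegistry`, `State`, `recover`, `register`) is a hypothetical assumption, not Spark's `Master` code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names and structure are assumptions, not Spark's Master.
final class WorkerRegistry {
    enum State { ALIVE, UNKNOWN }

    private final Map<String, State> workers = new HashMap<>();

    /** Simulates loading a persisted worker after a master restart. */
    void recover(String workerId) {
        workers.put(workerId, State.UNKNOWN);
    }

    /** Handles a registration request; returns true when the worker was already known. */
    boolean register(String workerId) {
        State previous = workers.get(workerId);
        if (previous == null) {
            workers.put(workerId, State.ALIVE);
            return false;
        }
        if (previous == State.UNKNOWN) {
            // The worker came back after the restart, so mark it alive again.
            workers.put(workerId, State.ALIVE);
        }
        return true; // duplicate registration
    }

    State state(String workerId) {
        return workers.get(workerId);
    }

    public static void main(String[] args) {
        WorkerRegistry registry = new WorkerRegistry();
        registry.recover("worker-1");
        boolean duplicate = registry.register("worker-1");
        System.out.println(duplicate + " " + registry.state("worker-1")); // prints "true ALIVE"
    }
}
```

Without the `UNKNOWN` branch, the recovered worker would stay `UNKNOWN` until the recovery timeout expired, which is the delay the commit message describes.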
(spark) branch master updated: [SPARK-46348][CORE] Support `spark.deploy.recoveryTimeout`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 82a02321b87 [SPARK-46348][CORE] Support `spark.deploy.recoveryTimeout` 82a02321b87 is described below commit 82a02321b873cc67a521e98e25b9b07fd84b5684 Author: Dongjoon Hyun AuthorDate: Sun Dec 10 00:44:33 2023 -0800 [SPARK-46348][CORE] Support `spark.deploy.recoveryTimeout` ### What changes were proposed in this pull request? This PR aims to support a new configuration, `spark.deploy.recoveryTimeout`. ### Why are the changes needed? To allow the users to control this independently from `spark.worker.timeout`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44282 from dongjoon-hyun/SPARK-46348. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/deploy/master/Master.scala | 4 +++- .../src/main/scala/org/apache/spark/internal/config/Deploy.scala | 9 + docs/spark-standalone.md | 9 + 3 files changed, 21 insertions(+), 1 deletion(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/master/Master.scala b/core/src/main/scala/org/apache/spark/deploy/master/Master.scala index be787dd29f8..7346c80aff4 100644 --- a/core/src/main/scala/org/apache/spark/deploy/master/Master.scala +++ b/core/src/main/scala/org/apache/spark/deploy/master/Master.scala @@ -65,6 +65,8 @@ private[deploy] class Master( private val retainedDrivers = conf.get(RETAINED_DRIVERS) private val maxDrivers = conf.get(MAX_DRIVERS) private val reaperIterations = conf.get(REAPER_ITERATIONS) + private val recoveryTimeoutMs = +conf.get(RECOVERY_TIMEOUT).map(_ * 1000).getOrElse(workerTimeoutMs) private val recoveryMode = conf.get(RECOVERY_MODE) private val maxExecutorRetries = 
conf.get(MAX_EXECUTOR_RETRIES) @@ -246,7 +248,7 @@ private[deploy] class Master( override def run(): Unit = Utils.tryLogNonFatalError { self.send(CompleteRecovery) } -}, workerTimeoutMs, TimeUnit.MILLISECONDS) +}, recoveryTimeoutMs, TimeUnit.MILLISECONDS) } case CompleteRecovery => completeRecovery() diff --git a/core/src/main/scala/org/apache/spark/internal/config/Deploy.scala b/core/src/main/scala/org/apache/spark/internal/config/Deploy.scala index b52ea356789..6585d62b3b9 100644 --- a/core/src/main/scala/org/apache/spark/internal/config/Deploy.scala +++ b/core/src/main/scala/org/apache/spark/internal/config/Deploy.scala @@ -18,6 +18,7 @@ package org.apache.spark.internal.config import java.util.Locale +import java.util.concurrent.TimeUnit private[spark] object Deploy { val RECOVERY_MODE = ConfigBuilder("spark.deploy.recoveryMode") @@ -56,6 +57,14 @@ private[spark] object Deploy { .stringConf .createWithDefault("") + val RECOVERY_TIMEOUT = ConfigBuilder("spark.deploy.recoveryTimeout") +.doc("Configures the timeout for recovery process. The default value is the same " + + "with ${WORKER_TIMEOUT.key}.") +.version("4.0.0") +.timeConf(TimeUnit.SECONDS) +.checkValue(_ > 0, "spark.deploy.recoveryTimeout must be positive.") +.createOptional + val ZOOKEEPER_URL = ConfigBuilder("spark.deploy.zookeeper.url") .doc(s"When `${RECOVERY_MODE.key}` is set to ZOOKEEPER, this " + "configuration is used to set the zookeeper URL to connect to.") diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md index 19935307221..0bc73978570 100644 --- a/docs/spark-standalone.md +++ b/docs/spark-standalone.md @@ -793,6 +793,15 @@ In order to enable this recovery mode, you can set SPARK_DAEMON_JAVA_OPTS in spa A compression codec for persistence engines. none (default), lz4, lzf, snappy, and zstd. Currently, only FILESYSTEM mode supports this configuration. 4.0.0 + +spark.deploy.recoveryTimeout +(none) + + The timeout for recovery process. 
The default value is the same with + spark.worker.timeout. + +4.0.0 + spark.deploy.recoveryMode.factory ""
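The fallback introduced by SPARK-46348 above can be sketched in a few lines. This is an illustration only, not Spark's code: the function name `resolve_recovery_timeout_ms` and its parameters are hypothetical, and the 60-second worker timeout in the usage lines is an assumed value. It mirrors the `conf.get(RECOVERY_TIMEOUT).map(_ * 1000).getOrElse(workerTimeoutMs)` expression from the diff, where the new setting is given in seconds and is optional.

```python
from typing import Optional


def resolve_recovery_timeout_ms(recovery_timeout_s: Optional[int],
                                worker_timeout_ms: int) -> int:
    """Hypothetical helper: pick the recovery timeout in milliseconds.

    An unset value falls back to the worker timeout; a set value is
    converted from seconds to milliseconds.
    """
    if recovery_timeout_s is None:
        return worker_timeout_ms
    if recovery_timeout_s <= 0:
        # Same constraint as the checkValue(_ > 0, ...) call in the diff.
        raise ValueError("spark.deploy.recoveryTimeout must be positive.")
    return recovery_timeout_s * 1000


# Assumed worker timeout of 60 s, for illustration only.
print(resolve_recovery_timeout_ms(None, 60_000))  # unset: falls back to 60000
print(resolve_recovery_timeout_ms(30, 60_000))    # 30 s becomes 30000
```

Keeping the setting optional, rather than giving it its own default, is what lets existing deployments keep their current behavior unless they opt in.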
(spark) branch master updated: [SPARK-46344][CORE] Warn properly when a driver exists successfully but master is disconnected
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 166df8b1df59 [SPARK-46344][CORE] Warn properly when a driver exists successfully but master is disconnected 166df8b1df59 is described below commit 166df8b1df5965fe3f15fecd5574545746b0b18f Author: Dongjoon Hyun AuthorDate: Sat Dec 9 15:33:48 2023 -0800 [SPARK-46344][CORE] Warn properly when a driver exists successfully but master is disconnected ### What changes were proposed in this pull request? This PR aims to warn properly when a driver exits successfully but master is disconnected. ### Why are the changes needed? In this case, `Master` eventually considers such drivers `Error`. ![Screenshot 2023-12-09 at 3 05 27 PM](https://github.com/apache/spark/assets/9700541/1323819b-4a0c-466d-afaa-845f507a905e) **Worker Log** ``` 23/12/09 15:13:21 INFO Worker: Driver driver-20231209151301-0003 exited successfully === Master is disconnected here === 23/12/09 15:13:53 WARN Worker: Driver driver-20231209151332-0004 exited successfully while master is disconnected. === A new master starts and is connected here === 23/12/09 15:17:10 INFO Worker: Driver driver-20231209151707- exited successfully ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44278 from dongjoon-hyun/SPARK-46344.
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala b/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala index 1422a1484f8d..785129e1d818 100755 --- a/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala +++ b/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala @@ -878,7 +878,12 @@ private[deploy] class Worker( case DriverState.FAILED => logWarning(s"Driver $driverId exited with failure") case DriverState.FINISHED => -logInfo(s"Driver $driverId exited successfully") +registrationRetryTimer match { + case Some(_) => +logWarning(s"Driver $driverId exited successfully while master is disconnected.") + case _ => +logInfo(s"Driver $driverId exited successfully") +} case DriverState.KILLED => logInfo(s"Driver $driverId was killed by user") case _ =>
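The branch added in the `Worker.scala` diff above boils down to choosing a log level from one piece of state. The sketch below is a simplified stand-in, not the Worker implementation: `finished_driver_log` is a hypothetical name, and passing any non-`None` object for `registration_retry_timer` plays the role of `Some(_)` in the diff, which the change treats as the sign that the master is disconnected.

```python
import logging
from typing import Optional, Tuple


def finished_driver_log(driver_id: str,
                        registration_retry_timer: Optional[object]) -> Tuple[int, str]:
    """Hypothetical helper: choose level and message for a FINISHED driver."""
    if registration_retry_timer is not None:
        # A pending retry timer is treated as "master is disconnected".
        return (logging.WARNING,
                f"Driver {driver_id} exited successfully while master is disconnected.")
    return logging.INFO, f"Driver {driver_id} exited successfully"


logging.basicConfig(level=logging.INFO, format="%(levelname)s Worker: %(message)s")
level, message = finished_driver_log("driver-20231209151332-0004", object())
logging.log(level, message)
```

Raising the level only in the disconnected case keeps the normal path quiet while flagging the situation in which the master may later record the driver as an error.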
(spark) branch branch-3.4 updated: [SPARK-46339][SS] Directory with batch number name should not be treated as metadata log
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 4e80b3a09407 [SPARK-46339][SS] Directory with batch number name should not be treated as metadata log 4e80b3a09407 is described below commit 4e80b3a09407042f7c596963dcb4fc59e68755ab Author: Liang-Chi Hsieh AuthorDate: Sat Dec 9 15:20:55 2023 -0800 [SPARK-46339][SS] Directory with batch number name should not be treated as metadata log ### What changes were proposed in this pull request? This patch updates the documentation of the `CheckpointFileManager.list` method to reflect the fact that it returns both files and directories, to reduce confusion. For usages like `HDFSMetadataLog`, which assumes that every file status returned by `list` is a file, we add a filter there to avoid a confusing error. ### Why are the changes needed? `HDFSMetadataLog` takes a metadata path as a parameter. When it retrieves the metadata of all batches, it calls `CheckpointFileManager.list` to get all files under the metadata path. However, all current implementations of `CheckpointFileManager.list` return all files and directories under the given path. So if there is a directory whose name is a batch number (a long value), the directory is returned too and causes trouble when `HDFSMetadataLog` goes to read it. The `CheckpointFileManager.list` method doc clearly states that it lists the "files" in a path. That being said, the current implementations don't follow the doc. We tried to make the `list` method implementations return only files, but some usages (state metadata) of the `list` method already break that assumption and use directories returned by `list`. So we simply update the `list` method documentation to explicitly state that it returns both files and directories. We add a filter in `HDFSMetad [...] ### Does this PR introduce _any_ user-facing change?
No ### How was this patch tested? Added test ### Was this patch authored or co-authored using generative AI tooling? No Closes #44272 from viirya/fix_metadatalog. Authored-by: Liang-Chi Hsieh Signed-off-by: Dongjoon Hyun (cherry picked from commit 75805f07f5caeb01104a7352b02790d03a043ded) Signed-off-by: Dongjoon Hyun (cherry picked from commit 28a8b181e96d4ce71e2f9888910214d14a859b7d) Signed-off-by: Dongjoon Hyun --- .../sql/execution/streaming/CheckpointFileManager.scala | 4 ++-- .../spark/sql/execution/streaming/HDFSMetadataLog.scala | 2 ++ .../spark/sql/execution/streaming/HDFSMetadataLogSuite.scala | 12 3 files changed, 16 insertions(+), 2 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CheckpointFileManager.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CheckpointFileManager.scala index 013efd3c7bae..b2a3b8d73d4e 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CheckpointFileManager.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CheckpointFileManager.scala @@ -65,10 +65,10 @@ trait CheckpointFileManager { /** Open a file for reading, or throw exception if it does not exist. */ def open(path: Path): FSDataInputStream - /** List the files in a path that match a filter. */ + /** List the files/directories in a path that match a filter. */ def list(path: Path, filter: PathFilter): Array[FileStatus] - /** List all the files in a path. */ + /** List all the files/directories in a path. 
*/ def list(path: Path): Array[FileStatus] = { list(path, (_: Path) => true) } diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala index 2b0172bb9555..9a811db679d0 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala @@ -325,6 +325,8 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](sparkSession: SparkSession, path: /** List the available batches on file system. */ protected def listBatches: Array[Long] = { val batchIds = fileManager.list(metadataPath, batchFilesFilter) + // Batches must be files + .filter(f => f.isFile) .map(f => pathToBatchId(f.getPath)) ++ // Iterate over keySet is not thread safe. We call `toArray` to make a copy in the lock to // elimiate the race condition. diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLogSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/HDFS
(spark) branch branch-3.5 updated: [SPARK-46339][SS] Directory with batch number name should not be treated as metadata log
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 28a8b181e96d [SPARK-46339][SS] Directory with batch number name should not be treated as metadata log 28a8b181e96d is described below commit 28a8b181e96d4ce71e2f9888910214d14a859b7d Author: Liang-Chi Hsieh AuthorDate: Sat Dec 9 15:20:55 2023 -0800 [SPARK-46339][SS] Directory with batch number name should not be treated as metadata log ### What changes were proposed in this pull request? This patch updates the documentation of the `CheckpointFileManager.list` method to reflect the fact that it returns both files and directories, to reduce confusion. For usages like `HDFSMetadataLog`, which assumes that every file status returned by `list` is a file, we add a filter there to avoid a confusing error. ### Why are the changes needed? `HDFSMetadataLog` takes a metadata path as a parameter. When it retrieves the metadata of all batches, it calls `CheckpointFileManager.list` to get all files under the metadata path. However, all current implementations of `CheckpointFileManager.list` return all files and directories under the given path. So if there is a directory whose name is a batch number (a long value), the directory is returned too and causes trouble when `HDFSMetadataLog` goes to read it. The `CheckpointFileManager.list` method doc clearly states that it lists the "files" in a path. That being said, the current implementations don't follow the doc. We tried to make the `list` method implementations return only files, but some usages (state metadata) of the `list` method already break that assumption and use directories returned by `list`. So we simply update the `list` method documentation to explicitly state that it returns both files and directories. We add a filter in `HDFSMetad [...] ### Does this PR introduce _any_ user-facing change?
No ### How was this patch tested? Added test ### Was this patch authored or co-authored using generative AI tooling? No Closes #44272 from viirya/fix_metadatalog. Authored-by: Liang-Chi Hsieh Signed-off-by: Dongjoon Hyun (cherry picked from commit 75805f07f5caeb01104a7352b02790d03a043ded) Signed-off-by: Dongjoon Hyun --- .../sql/execution/streaming/CheckpointFileManager.scala | 4 ++-- .../spark/sql/execution/streaming/HDFSMetadataLog.scala | 2 ++ .../spark/sql/execution/streaming/HDFSMetadataLogSuite.scala | 12 3 files changed, 16 insertions(+), 2 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CheckpointFileManager.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CheckpointFileManager.scala index ad3212871fc9..677e2fccb6b4 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CheckpointFileManager.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CheckpointFileManager.scala @@ -65,10 +65,10 @@ trait CheckpointFileManager { /** Open a file for reading, or throw exception if it does not exist. */ def open(path: Path): FSDataInputStream - /** List the files in a path that match a filter. */ + /** List the files/directories in a path that match a filter. */ def list(path: Path, filter: PathFilter): Array[FileStatus] - /** List all the files in a path. */ + /** List all the files/directories in a path. 
*/ def list(path: Path): Array[FileStatus] = { list(path, (_: Path) => true) } diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala index 2b0172bb9555..9a811db679d0 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala @@ -325,6 +325,8 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](sparkSession: SparkSession, path: /** List the available batches on file system. */ protected def listBatches: Array[Long] = { val batchIds = fileManager.list(metadataPath, batchFilesFilter) + // Batches must be files + .filter(f => f.isFile) .map(f => pathToBatchId(f.getPath)) ++ // Iterate over keySet is not thread safe. We call `toArray` to make a copy in the lock to // elimiate the race condition. diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLogSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLogSuite.scala index 980d532dd477..08f245135f58 100644 --- a/sql/core/src/test/scala/org/apac
(spark) branch master updated: [SPARK-46339][SS] Directory with batch number name should not be treated as metadata log
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 75805f07f5ca [SPARK-46339][SS] Directory with batch number name should not be treated as metadata log 75805f07f5ca is described below commit 75805f07f5caeb01104a7352b02790d03a043ded Author: Liang-Chi Hsieh AuthorDate: Sat Dec 9 15:20:55 2023 -0800 [SPARK-46339][SS] Directory with batch number name should not be treated as metadata log ### What changes were proposed in this pull request? This patch updates the documentation of the `CheckpointFileManager.list` method to reflect the fact that it returns both files and directories, to reduce confusion. For usages like `HDFSMetadataLog`, which assumes that every file status returned by `list` is a file, we add a filter there to avoid a confusing error. ### Why are the changes needed? `HDFSMetadataLog` takes a metadata path as a parameter. When it retrieves the metadata of all batches, it calls `CheckpointFileManager.list` to get all files under the metadata path. However, all current implementations of `CheckpointFileManager.list` return all files and directories under the given path. So if there is a directory whose name is a batch number (a long value), the directory is returned too and causes trouble when `HDFSMetadataLog` goes to read it. The `CheckpointFileManager.list` method doc clearly states that it lists the "files" in a path. That being said, the current implementations don't follow the doc. We tried to make the `list` method implementations return only files, but some usages (state metadata) of the `list` method already break that assumption and use directories returned by `list`. So we simply update the `list` method documentation to explicitly state that it returns both files and directories. We add a filter in `HDFSMetad [...] ### Does this PR introduce _any_ user-facing change?
No ### How was this patch tested? Added test ### Was this patch authored or co-authored using generative AI tooling? No Closes #44272 from viirya/fix_metadatalog. Authored-by: Liang-Chi Hsieh Signed-off-by: Dongjoon Hyun --- .../sql/execution/streaming/CheckpointFileManager.scala | 4 ++-- .../spark/sql/execution/streaming/HDFSMetadataLog.scala | 2 ++ .../spark/sql/execution/streaming/HDFSMetadataLogSuite.scala | 12 3 files changed, 16 insertions(+), 2 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CheckpointFileManager.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CheckpointFileManager.scala index af2c97b21138..34c5dee0997b 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CheckpointFileManager.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CheckpointFileManager.scala @@ -65,10 +65,10 @@ trait CheckpointFileManager { /** Open a file for reading, or throw exception if it does not exist. */ def open(path: Path): FSDataInputStream - /** List the files in a path that match a filter. */ + /** List the files/directories in a path that match a filter. */ def list(path: Path, filter: PathFilter): Array[FileStatus] - /** List all the files in a path. */ + /** List all the files/directories in a path. */ def list(path: Path): Array[FileStatus] = { list(path, (_: Path) => true) } diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala index 79627030e1eb..b3eedbf93f04 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala @@ -327,6 +327,8 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](sparkSession: SparkSession, path: /** List the available batches on file system. 
*/ protected def listBatches: Array[Long] = { val batchIds = fileManager.list(metadataPath, batchFilesFilter) + // Batches must be files + .filter(f => f.isFile) .map(f => pathToBatchId(f.getPath)) ++ // Iterate over keySet is not thread safe. We call `toArray` to make a copy in the lock to // elimiate the race condition. diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLogSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLogSuite.scala index 980d532dd477..08f245135f58 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLogSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/s
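The effect of the `.filter(f => f.isFile)` line in the SPARK-46339 diff above can be reproduced on a local file system. This is a simplified sketch under stated assumptions, not the `HDFSMetadataLog` code: `list_batch_ids` is a hypothetical name, `str.isdigit` stands in for the batch-file name filter, and the in-memory batch cache that the real method also consults is left out.

```python
import os
import tempfile
from typing import List


def list_batch_ids(metadata_path: str) -> List[int]:
    """Return batch ids for regular files whose names are integers."""
    batch_ids = []
    for entry in os.scandir(metadata_path):
        # Stand-in for the name filter: keep names that parse as a batch id.
        if not entry.name.isdigit():
            continue
        # Batches must be files, so skip a directory such as "2/".
        if not entry.is_file():
            continue
        batch_ids.append(int(entry.name))
    return sorted(batch_ids)


with tempfile.TemporaryDirectory() as root:
    for name in ("0", "1"):
        open(os.path.join(root, name), "w").close()
    os.mkdir(os.path.join(root, "2"))  # directory named like a batch number
    print(list_batch_ids(root))  # [0, 1]
```

Filtering at the call site, rather than inside the listing function, matches the choice described in the PR: other callers may still rely on directories being returned.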
(spark) branch master updated (1af4b658064d -> cae4bdc88161)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 1af4b658064d [SPARK-46338][PS][TESTS] Re-enable the `get_item` test for `BasicIndexingTests` add cae4bdc88161 [SPARK-46322][PYTHON][DOCS] Replace external link with internal link for error documentation No new revisions were added by this update. Summary of changes: python/pyspark/errors_doc_gen.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
(spark) branch master updated: [SPARK-46338][PS][TESTS] Re-enable the `get_item` test for `BasicIndexingTests`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1af4b658064d [SPARK-46338][PS][TESTS] Re-enable the `get_item` test for `BasicIndexingTests` 1af4b658064d is described below commit 1af4b658064d05b15c4a8409a8aa13df63b64ca4 Author: Haejoon Lee AuthorDate: Sat Dec 9 13:46:13 2023 -0800 [SPARK-46338][PS][TESTS] Re-enable the `get_item` test for `BasicIndexingTests` ### What changes were proposed in this pull request? This PR proposes to re-enable the `get_item` test for `BasicIndexingTests`. ### Why are the changes needed? To improve the test coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enable the test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44271 from itholic/enable_test_getitem. Authored-by: Haejoon Lee Signed-off-by: Dongjoon Hyun --- python/pyspark/pandas/tests/test_indexing.py | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/python/pyspark/pandas/tests/test_indexing.py b/python/pyspark/pandas/tests/test_indexing.py index a4ca03005b33..eb86c9ffabc4 100644 --- a/python/pyspark/pandas/tests/test_indexing.py +++ b/python/pyspark/pandas/tests/test_indexing.py @@ -742,13 +742,13 @@ class IndexingTest(ComparisonTestBase): self.assertRaises(AttributeError, lambda: psdf.X) # not str/unicode -# TODO?: pdf = pd.DataFrame(np.random.randn(10, 5)) -# TODO?: psdf = ps.from_pandas(pdf) -# TODO?: self.assert_eq(psdf[0], pdf[0]) -# TODO?: self.assert_eq(psdf[[1, 2]], pdf[[1, 2]]) +pdf = pd.DataFrame(np.random.randn(10, 5)) +psdf = ps.from_pandas(pdf) +self.assert_eq(psdf[0], pdf[0]) +self.assert_eq(psdf[[1, 2]], pdf[[1, 2]]) -# TODO?: self.assertRaises(KeyError, lambda: pdf[8]) -# TODO?: self.assertRaises(KeyError, lambda: pdf[[1, 8]]) +self.assertRaises(KeyError, lambda: 
pdf[8]) +self.assertRaises(KeyError, lambda: pdf[[1, 8]]) # non-string column names pdf = pd.DataFrame(
(spark) branch master updated: [SPARK-46334][INFRA][PS] Upgrade `Pandas` to 2.1.4
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 48e773451eeb [SPARK-46334][INFRA][PS] Upgrade `Pandas` to 2.1.4 48e773451eeb is described below commit 48e773451eeb7feb84b3f11883bd76217f2233c4 Author: Bjørn Jørgensen AuthorDate: Fri Dec 8 18:39:10 2023 -0800 [SPARK-46334][INFRA][PS] Upgrade `Pandas` to 2.1.4 ### What changes were proposed in this pull request? Upgrade pandas from 2.1.3 to 2.1.4 ### Why are the changes needed? [Release notes](https://pandas.pydata.org/docs/whatsnew/v2.1.4.html) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44266 from bjornjorgensen/pandas2.1.4. Authored-by: Bjørn Jørgensen Signed-off-by: Dongjoon Hyun --- dev/infra/Dockerfile | 4 ++-- python/pyspark/pandas/supported_api_gen.py | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/dev/infra/Dockerfile b/dev/infra/Dockerfile index 3e449bcb6c82..cade845d911b 100644 --- a/dev/infra/Dockerfile +++ b/dev/infra/Dockerfile @@ -92,10 +92,10 @@ RUN mkdir -p /usr/local/pypy/pypy3.8 && \ ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3.8 && \ ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3 -RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.1.3' scipy coverage matplotlib +RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.1.4' scipy coverage matplotlib -ARG BASIC_PIP_PKGS="numpy pyarrow>=14.0.0 six==1.16.0 pandas<=2.1.3 scipy unittest-xml-reporting plotly>=4.8 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2" +ARG BASIC_PIP_PKGS="numpy pyarrow>=14.0.0 six==1.16.0 pandas<=2.1.4 scipy unittest-xml-reporting plotly>=4.8 
mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2" # Python deps for Spark Connect ARG CONNECT_PIP_PKGS="grpcio==1.59.3 grpcio-status==1.59.3 protobuf==4.25.1 googleapis-common-protos==1.56.4" diff --git a/python/pyspark/pandas/supported_api_gen.py b/python/pyspark/pandas/supported_api_gen.py index 1f893520d2ce..a598fc816d96 100644 --- a/python/pyspark/pandas/supported_api_gen.py +++ b/python/pyspark/pandas/supported_api_gen.py @@ -37,7 +37,7 @@ from pyspark.pandas.exceptions import PandasNotImplementedError MAX_MISSING_PARAMS_SIZE = 5 COMMON_PARAMETER_SET = {"kwargs", "args", "cls"} MODULE_GROUP_MATCH = [(pd, ps), (pdw, psw), (pdg, psg)] -PANDAS_LATEST_VERSION = "2.1.3" +PANDAS_LATEST_VERSION = "2.1.4" RST_HEADER = """ =
(spark) branch branch-3.4 updated: [SPARK-46275][3.4] Protobuf: Return null in permissive mode when deserialization fails
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 4745138601b7 [SPARK-46275][3.4] Protobuf: Return null in permissive mode when deserialization fails 4745138601b7 is described below commit 4745138601b74e805459bd240f748fcf3e7ddec2 Author: Raghu Angadi AuthorDate: Fri Dec 8 14:40:03 2023 -0800 [SPARK-46275][3.4] Protobuf: Return null in permissive mode when deserialization fails This is a cherry-pick of #44214 into 3.4 branch. From the original PR: ### What changes were proposed in this pull request? This updates the behavior of the `from_protobuf()` built-in function when the underlying record fails to deserialize. * **Current behavior**: * By default, this throws an error and the query fails. [This part is not changed in the PR] * When `mode` is set to 'PERMISSIVE' it returns a non-null struct with each of the inner fields set to null, e.g. `{ "field_a": null, "field_b": null }`. * This is not very convenient for users. They don't know whether this was due to a malformed record or whether the input itself has nulls. It is very hard to check each field for null in a SQL query (imagine a SQL query with a struct that has 10 fields). * **New behavior**: * When `mode` is set to 'PERMISSIVE' it simply returns `null`. ### Why are the changes needed? This makes it easier for users to detect and handle malformed records. ### Does this PR introduce _any_ user-facing change? Yes, but this does not change the contract. In fact, it clarifies it. ### How was this patch tested? - Unit tests are updated. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44265 from rangadi/protobuf-null-3.4.
Authored-by: Raghu Angadi Signed-off-by: Dongjoon Hyun --- .../sql/protobuf/ProtobufDataToCatalyst.scala | 31 -- .../ProtobufCatalystDataConversionSuite.scala | 13 + 2 files changed, 6 insertions(+), 38 deletions(-) diff --git a/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDataToCatalyst.scala b/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDataToCatalyst.scala index da44f94d5eac..78e995190045 100644 --- a/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDataToCatalyst.scala +++ b/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDataToCatalyst.scala @@ -21,12 +21,12 @@ import scala.util.control.NonFatal import com.google.protobuf.DynamicMessage -import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, SpecificInternalRow, UnaryExpression} +import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression} import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, CodeGenerator, ExprCode} import org.apache.spark.sql.catalyst.util.{FailFastMode, ParseMode, PermissiveMode} import org.apache.spark.sql.errors.{QueryCompilationErrors, QueryExecutionErrors} import org.apache.spark.sql.protobuf.utils.{ProtobufOptions, ProtobufUtils, SchemaConverters} -import org.apache.spark.sql.types.{AbstractDataType, BinaryType, DataType, StructType} +import org.apache.spark.sql.types.{AbstractDataType, BinaryType, DataType} private[protobuf] case class ProtobufDataToCatalyst( child: Expression, @@ -38,16 +38,8 @@ private[protobuf] case class ProtobufDataToCatalyst( override def inputTypes: Seq[AbstractDataType] = Seq(BinaryType) - override lazy val dataType: DataType = { -val dt = SchemaConverters.toSqlType(messageDescriptor, protobufOptions).dataType -parseMode match { - // With PermissiveMode, the output Catalyst row might contain columns of null values for - // corrupt records, even if some of the columns are not 
nullable in the user-provided schema. - // Therefore we force the schema to be all nullable here. - case PermissiveMode => dt.asNullable - case _ => dt -} - } + override lazy val dataType: DataType = +SchemaConverters.toSqlType(messageDescriptor, protobufOptions).dataType override def nullable: Boolean = true @@ -75,22 +67,9 @@ private[protobuf] case class ProtobufDataToCatalyst( mode } - @transient private lazy val nullResultRow: Any = dataType match { -case st: StructType => - val resultRow = new SpecificInternalRow(st.map(_.dataType)) - for (i <- 0 until st.length) { -resultRow.setNullAt(i) - } - resultRow - -case _ => - null - } - private def handleException(e: Throwable): Any = { parseMode match { - case PermissiveMode =&
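The difference between the old and new permissive behavior in SPARK-46275 above is easiest to see from the caller's side. The sketch below is an analogy only: it uses JSON from the Python standard library as a stand-in for Protobuf, and `parse_record` and its `mode` strings are hypothetical, not the `from_protobuf()` API. It shows why a plain null result is easier to detect than a struct whose fields are all null.

```python
import json
from typing import Any, Optional


def parse_record(raw: bytes, mode: str = "FAILFAST") -> Optional[Any]:
    """Hypothetical stand-in for a deserializer with Spark-style parse modes."""
    try:
        return json.loads(raw)
    except ValueError:
        if mode.upper() == "PERMISSIVE":
            # New behavior: a malformed record maps to a single null.
            return None
        # Default behavior: the error propagates and the query fails.
        raise


print(parse_record(b'{"field_a": null}', "PERMISSIVE"))  # {'field_a': None}
print(parse_record(b"{not valid", "PERMISSIVE"))          # None
```

With this contract, a single `is None` check separates malformed input from valid records that happen to contain null fields.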
(spark) branch master updated: [SPARK-46325][CONNECT] Remove unnecessary override functions when constructing `WrappedCloseableIterator` in `ResponseValidator#wrapIterator`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 75cb37011fa2 [SPARK-46325][CONNECT] Remove unnecessary override functions when constructing `WrappedCloseableIterator` in `ResponseValidator#wrapIterator` 75cb37011fa2 is described below commit 75cb37011fa2a1cb00e633fac312eb15ab412df9 Author: yangjie01 AuthorDate: Fri Dec 8 14:00:28 2023 -0800 [SPARK-46325][CONNECT] Remove unnecessary override functions when constructing `WrappedCloseableIterator` in `ResponseValidator#wrapIterator` ### What changes were proposed in this pull request? This pr removes the overridden `hasNext` and `close` functions in the construction of `WrappedCloseableIterator` in `ResponseValidator#wrapIterator`, as these functions are identical to those defined in `WrappedCloseableIterator`. - WrappedCloseableIterator https://github.com/apache/spark/blob/9ffdcc398ed5560f34778d005da697f6ad0a15ee/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/CloseableIterator.scala#L30-L42 - ResponseValidator#wrapIterator https://github.com/apache/spark/blob/9ffdcc398ed5560f34778d005da697f6ad0a15ee/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ResponseValidator.scala#L62-L85 ### Why are the changes needed? Remove unnecessary override functions. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #44255 from LuciferYang/SPARK-46325. 
Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- .../apache/spark/sql/connect/client/ResponseValidator.scala | 11 --- 1 file changed, 11 deletions(-) diff --git a/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ResponseValidator.scala b/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ResponseValidator.scala index 2081196d4671..67f29c727ef4 100644 --- a/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ResponseValidator.scala +++ b/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ResponseValidator.scala @@ -65,22 +65,11 @@ class ResponseValidator extends Logging { override def innerIterator: Iterator[T] = inner - override def hasNext: Boolean = { -innerIterator.hasNext - } - override def next(): T = { verifyResponse { innerIterator.next() } } - - override def close(): Unit = { -innerIterator match { - case it: CloseableIterator[T] => it.close() - case _ => // nothing -} - } } }
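The cleanup in SPARK-46325 above relies on a base wrapper that already delegates to the inner iterator, so a subclass only needs to override the step that differs. The sketch below illustrates that pattern in Python; it borrows the name `WrappedCloseableIterator` from the diff for readability, but the classes here are hypothetical and are not the Spark Connect implementations.

```python
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


class WrappedCloseableIterator:
    """Base wrapper: iteration and close() delegate to the inner iterator."""

    def __init__(self, inner: Iterator[T]) -> None:
        self.inner = inner

    def __iter__(self):
        return self

    def __next__(self):
        return next(self.inner)

    def close(self) -> None:
        # Close the inner iterator only if it supports closing.
        close = getattr(self.inner, "close", None)
        if callable(close):
            close()


class ValidatingIterator(WrappedCloseableIterator):
    """Overrides only the step that differs; close() is inherited as-is."""

    def __init__(self, inner: Iterator[T], verify: Callable) -> None:
        super().__init__(inner)
        self.verify = verify

    def __next__(self):
        return self.verify(next(self.inner))


def numbers():
    yield 1
    yield 2


it = ValidatingIterator(numbers(), verify=lambda x: x * 10)
print(next(it))  # 10
it.close()       # inherited from the base class; closes the generator
```

Re-declaring `close()` in the subclass with the same body would add nothing, which is the redundancy the commit removes.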
(spark) branch master updated: [SPARK-46332][SQL] Migrate `CatalogNotFoundException` to the error class `CATALOG_NOT_FOUND`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new d9f0fccd967b [SPARK-46332][SQL] Migrate `CatalogNotFoundException` to the error class `CATALOG_NOT_FOUND`
d9f0fccd967b is described below

commit d9f0fccd967b5c8686353d524d2b31e27b7a473b
Author: Max Gekk
AuthorDate: Fri Dec 8 12:54:20 2023 -0800

[SPARK-46332][SQL] Migrate `CatalogNotFoundException` to the error class `CATALOG_NOT_FOUND`

### What changes were proposed in this pull request?
In the PR, I propose to migrate the `CatalogNotFoundException` exception to the new error class `CATALOG_NOT_FOUND`, improve the format of the exception message, and prohibit creation of the exception without the error class.

### Why are the changes needed?
This is a part of the migration process onto error classes and new error framework. The changes improve user experience w/ Spark SQL, and make `CatalogNotFoundException` consistent to other Spark exceptions.

### Does this PR introduce _any_ user-facing change?
Yes, if user's code depends on the error message format of `CatalogNotFoundException`.

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "core/testOnly *SparkThrowableSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44259 from MaxGekk/catalog-plugin-not-found.
Authored-by: Max Gekk Signed-off-by: Dongjoon Hyun --- R/pkg/tests/fulltests/test_sparkSQL.R | 5 + common/utils/src/main/resources/error/error-classes.json | 6 ++ .../jvm/src/test/scala/org/apache/spark/sql/CatalogSuite.scala | 6 +++--- docs/sql-error-conditions.md | 6 ++ .../spark/sql/connector/catalog/CatalogNotFoundException.scala | 10 +++--- .../org/apache/spark/sql/connector/catalog/Catalogs.scala | 2 +- .../org/apache/spark/sql/errors/QueryExecutionErrors.scala | 7 +-- .../spark/sql/connector/catalog/CatalogLoadingSuite.java | 7 ++- .../spark/sql/catalyst/analysis/TableLookupCacheSuite.scala| 6 +++--- .../spark/sql/connector/catalog/LookupCatalogSuite.scala | 5 +++-- .../org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala | 6 -- .../sql/execution/command/AlignAssignmentsSuiteBase.scala | 5 +++-- .../spark/sql/execution/command/PlanResolutionSuite.scala | 9 - 13 files changed, 48 insertions(+), 32 deletions(-) diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R b/R/pkg/tests/fulltests/test_sparkSQL.R index f2bef7a00446..0d96f708a544 100644 --- a/R/pkg/tests/fulltests/test_sparkSQL.R +++ b/R/pkg/tests/fulltests/test_sparkSQL.R @@ -4103,10 +4103,7 @@ test_that("catalog APIs, listCatalogs, setCurrentCatalog, currentCatalog", { expect_equal(currentCatalog(), "spark_catalog") expect_error(setCurrentCatalog("spark_catalog"), NA) expect_error(setCurrentCatalog("zxwtyswklpf"), - paste0("Error in setCurrentCatalog : ", - "org.apache.spark.sql.connector.catalog.CatalogNotFoundException: ", - "Catalog 'zxwtyswklpf' plugin class not found: ", - "spark.sql.catalog.zxwtyswklpf is not defined")) + "[CATALOG_NOT_FOUND]*`zxwtyswklpf`*") catalogs <- collect(listCatalogs()) }) diff --git a/common/utils/src/main/resources/error/error-classes.json b/common/utils/src/main/resources/error/error-classes.json index 7a672fa5e557..62d10c0d34cb 100644 --- a/common/utils/src/main/resources/error/error-classes.json +++ b/common/utils/src/main/resources/error/error-classes.json @@ 
-407,6 +407,12 @@ ], "sqlState" : "22003" }, + "CATALOG_NOT_FOUND" : { +"message" : [ + "The catalog not found. Consider to set the SQL config to a catalog plugin." +], +"sqlState" : "42P08" + }, "CHECKPOINT_RDD_BLOCK_ID_NOT_FOUND" : { "message" : [ "Checkpoint block not found!", diff --git a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/CatalogSuite.scala b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/CatalogSuite.scala index cefa63ecd353..d646fad00c07 100644 --- a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/CatalogSuite.scala +++ b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/CatalogSuite.scala @@ -66,10 +66,10 @@ class CatalogSuite extends RemoteSparkSession with SQLHelper { val catalogs = spark.catalog.listCatalogs().collect()
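The migration in SPARK-46332 moves the exception onto a named error class whose message comes from a template, and the updated R test expects a message that starts with `[CATALOG_NOT_FOUND]` and quotes the catalog name in backticks. A rough Java sketch of that idea follows. It is not Spark's error framework: the class `ErrorClassSketch`, the in-memory registry, and in particular the `<catalogName>` placeholder and its position in the template are assumptions made for illustration, since the template text in the diff above carries no visible placeholder.

```java
import java.util.Map;

/** Hypothetical error-class formatter; an illustration only, not Spark's implementation. */
final class ErrorClassSketch {
  // Assumed template: the <catalogName> placeholder is NOT taken from the source diff.
  private static final Map<String, String> TEMPLATES = Map.of(
      "CATALOG_NOT_FOUND",
      "The catalog <catalogName> not found. Consider to set the SQL config to a catalog plugin.");

  /** Substitutes backtick-quoted parameters into the template and prefixes the error class. */
  static String format(String errorClass, Map<String, String> params) {
    String message = TEMPLATES.get(errorClass);
    if (message == null) {
      throw new IllegalArgumentException("Unknown error class: " + errorClass);
    }
    for (Map.Entry<String, String> e : params.entrySet()) {
      message = message.replace("<" + e.getKey() + ">", "`" + e.getValue() + "`");
    }
    return "[" + errorClass + "] " + message;
  }
}
```

Keying messages by a stable error class lets tests match on the class name and parameters instead of on a full exception string, which appears to be why the test expectations in this commit shrink to a short pattern.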
(spark) branch master updated: [SPARK-46324][SQL][PYTHON] Fix the output name of pyspark.sql.functions.user and session_user
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 3224cddcf9da [SPARK-46324][SQL][PYTHON] Fix the output name of pyspark.sql.functions.user and session_user
3224cddcf9da is described below

commit 3224cddcf9da913c964b775b5912a67cd1e968b2
Author: Hyukjin Kwon
AuthorDate: Fri Dec 8 12:48:24 2023 -0800

[SPARK-46324][SQL][PYTHON] Fix the output name of pyspark.sql.functions.user and session_user

### What changes were proposed in this pull request?
This PR proposes to fix `user()` and `session_user()` to have the same names in its output name.

### Why are the changes needed?
To show the correct name of the functions being used.

### Does this PR introduce _any_ user-facing change?
Yes.

```scala
spark.range(1).select(user(), session_user()).show()
```

Before:

```
+--------------+--------------+
|current_user()|current_user()|
+--------------+--------------+
|  hyukjin.kwon|  hyukjin.kwon|
+--------------+--------------+
```

After:

```
+------------+--------------+
|      user()|session_user()|
+------------+--------------+
|hyukjin.kwon|  hyukjin.kwon|
+------------+--------------+
```

### How was this patch tested?
Manually tested, and unittests were added.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44253 from HyukjinKwon/user-name.
Authored-by: Hyukjin Kwon Signed-off-by: Dongjoon Hyun --- .../query-tests/explain-results/function_session_user.explain| 2 +- .../resources/query-tests/explain-results/function_user.explain | 2 +- python/pyspark/sql/functions/builtin.py | 4 ++-- python/pyspark/sql/tests/test_functions.py | 9 + .../scala/org/apache/spark/sql/catalyst/expressions/misc.scala | 5 +++-- .../src/test/resources/sql-functions/sql-expression-schema.md| 6 +++--- 6 files changed, 19 insertions(+), 9 deletions(-) diff --git a/connector/connect/common/src/test/resources/query-tests/explain-results/function_session_user.explain b/connector/connect/common/src/test/resources/query-tests/explain-results/function_session_user.explain index 82f5d2adcec0..b6205d9fb56c 100644 --- a/connector/connect/common/src/test/resources/query-tests/explain-results/function_session_user.explain +++ b/connector/connect/common/src/test/resources/query-tests/explain-results/function_session_user.explain @@ -1,2 +1,2 @@ -Project [current_user() AS current_user()#0] +Project [session_user() AS session_user()#0] +- LocalRelation , [id#0L, a#0, b#0, d#0, e#0, f#0, g#0] diff --git a/connector/connect/common/src/test/resources/query-tests/explain-results/function_user.explain b/connector/connect/common/src/test/resources/query-tests/explain-results/function_user.explain index 82f5d2adcec0..52746c58c000 100644 --- a/connector/connect/common/src/test/resources/query-tests/explain-results/function_user.explain +++ b/connector/connect/common/src/test/resources/query-tests/explain-results/function_user.explain @@ -1,2 +1,2 @@ -Project [current_user() AS current_user()#0] +Project [user() AS user()#0] +- LocalRelation , [id#0L, a#0, b#0, d#0, e#0, f#0, g#0] diff --git a/python/pyspark/sql/functions/builtin.py b/python/pyspark/sql/functions/builtin.py index 997b641080cf..e1c01018 100644 --- a/python/pyspark/sql/functions/builtin.py +++ b/python/pyspark/sql/functions/builtin.py @@ -8967,7 +8967,7 @@ def user() -> Column: >>> 
import pyspark.sql.functions as sf >>> spark.range(1).select(sf.user()).show() # doctest: +SKIP +--+ -|current_user()| +|user()| +--+ | ruifeng.zheng| +--+ @@ -8986,7 +8986,7 @@ def session_user() -> Column: >>> import pyspark.sql.functions as sf >>> spark.range(1).select(sf.session_user()).show() # doctest: +SKIP +--+ -|current_user()| +|session_user()| +--+ | ruifeng.zheng| +--+ diff --git a/python/pyspark/sql/tests/test_functions.py b/python/pyspark/sql/tests/test_functions.py index b59417d8a310..5352ee04d7fe 100644 --- a/python/pyspark/sql/tests/test_functions.py +++ b/python/pyspark/sql/tests/test_functions.py @@ -1355,6 +1355,15 @@ class FunctionsTestsMixin: message_parameters={"arg_name": "gapDuration", "arg_type": "int"}, ) +def test_current_user(self): +df = self.spark.range(1).select(F.current_user()) +self.assertIsInstance(df.first()[0], str) +self.assertEqual(df.
svn commit: r65952 - in /dev/spark/v3.3.4-rc1-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/R/articles/ _site/api/R/deps/ _site/api/R/deps/bootstrap-5.3.1/ _site/api/R/deps/bootstrap-5.3.1/fonts/
Author: dongjoon Date: Fri Dec 8 20:21:58 2023 New Revision: 65952 Log: Apache Spark v3.3.4-rc1 docs [This commit notification would consist of 2683 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
svn commit: r65951 - /dev/spark/v3.4.2-rc1-docs/
Author: dongjoon Date: Fri Dec 8 19:49:07 2023 New Revision: 65951 Log: Remove Apache Spark 3.4.2 RC1 docs after 3.4.2 release Removed: dev/spark/v3.4.2-rc1-docs/
svn commit: r65950 - /dev/spark/v3.3.4-rc1-bin/
Author: dongjoon Date: Fri Dec 8 19:47:54 2023 New Revision: 65950 Log: Apache Spark v3.3.4-rc1 Added: dev/spark/v3.3.4-rc1-bin/ dev/spark/v3.3.4-rc1-bin/SparkR_3.3.4.tar.gz (with props) dev/spark/v3.3.4-rc1-bin/SparkR_3.3.4.tar.gz.asc dev/spark/v3.3.4-rc1-bin/SparkR_3.3.4.tar.gz.sha512 dev/spark/v3.3.4-rc1-bin/pyspark-3.3.4.tar.gz (with props) dev/spark/v3.3.4-rc1-bin/pyspark-3.3.4.tar.gz.asc dev/spark/v3.3.4-rc1-bin/pyspark-3.3.4.tar.gz.sha512 dev/spark/v3.3.4-rc1-bin/spark-3.3.4-bin-hadoop2.tgz (with props) dev/spark/v3.3.4-rc1-bin/spark-3.3.4-bin-hadoop2.tgz.asc dev/spark/v3.3.4-rc1-bin/spark-3.3.4-bin-hadoop2.tgz.sha512 dev/spark/v3.3.4-rc1-bin/spark-3.3.4-bin-hadoop3-scala2.13.tgz (with props) dev/spark/v3.3.4-rc1-bin/spark-3.3.4-bin-hadoop3-scala2.13.tgz.asc dev/spark/v3.3.4-rc1-bin/spark-3.3.4-bin-hadoop3-scala2.13.tgz.sha512 dev/spark/v3.3.4-rc1-bin/spark-3.3.4-bin-hadoop3.tgz (with props) dev/spark/v3.3.4-rc1-bin/spark-3.3.4-bin-hadoop3.tgz.asc dev/spark/v3.3.4-rc1-bin/spark-3.3.4-bin-hadoop3.tgz.sha512 dev/spark/v3.3.4-rc1-bin/spark-3.3.4-bin-without-hadoop.tgz (with props) dev/spark/v3.3.4-rc1-bin/spark-3.3.4-bin-without-hadoop.tgz.asc dev/spark/v3.3.4-rc1-bin/spark-3.3.4-bin-without-hadoop.tgz.sha512 dev/spark/v3.3.4-rc1-bin/spark-3.3.4.tgz (with props) dev/spark/v3.3.4-rc1-bin/spark-3.3.4.tgz.asc dev/spark/v3.3.4-rc1-bin/spark-3.3.4.tgz.sha512 Added: dev/spark/v3.3.4-rc1-bin/SparkR_3.3.4.tar.gz == Binary file - no diff available. 
Propchange: dev/spark/v3.3.4-rc1-bin/SparkR_3.3.4.tar.gz -- svn:mime-type = application/octet-stream Added: dev/spark/v3.3.4-rc1-bin/SparkR_3.3.4.tar.gz.asc == --- dev/spark/v3.3.4-rc1-bin/SparkR_3.3.4.tar.gz.asc (added) +++ dev/spark/v3.3.4-rc1-bin/SparkR_3.3.4.tar.gz.asc Fri Dec 8 19:47:54 2023 @@ -0,0 +1,17 @@ +-BEGIN PGP SIGNATURE- + +iQJIBAABCgAyFiEE8oycklwYjDXjRWFN7aAM6DTw/FwFAmVzcocUHGRvbmdqb29u +QGFwYWNoZS5vcmcACgkQ7aAM6DTw/FxycA//Vd2YOoAE4QBm3xcAVuHdGesXVBhG +Ja7TIc2WLkxi4P/wRKi6vYtacgMuayygetDqIDNOv8w/xk20omwLvA7ComVv2kbz +abIyWOZggUOhOumYY5FyIskEZBygCiTgWqyZSSi13tj8DT2bUL931N4J/E+B8V+/ +tY77UvemmjNOxRvzmqDnZz3pp+faJ5J18UDMXG4J0atM7eLjT7vEpL4cAHUDuYGO +YHvwTxrUJxFyGafNkAUrFdAQmHaW+ULKawHPTBfex/sdojUqkQ0oiC7Z6th2ZpZ4 +R+yrs0GxaMaMLseifiK7sy6t/cHmGNghAPo0bl+sVe1rY8NDN075eaHOXQ5I+WqL +dNBarAEHJ7s50ClVNQtUIu4QWE7lydrW2ykRfKHAa0v5Qffe+vJGBmmv/RHRa2Wi +u9mohwbu3i6q1SRJFz2tpSF1ZxeAPshAy+/TMH0BDznf7lNcca/dFZGmgWNL6B4S +5YCbtK5JZ6JUjcVUSFi/1qQy2PfjDPmTYYDBqPNzx8M7kVXx2mbvWwslC0+KdnBK +OpMiSCSR5p3jz9sekKAI96vtzX5Q/zJc/+0J+wpk8Gpe5K2ajOG5oAbH0WQn5ex4 +rZ3eh9NH+VH5szmA9ZoJe7vf9fpLk3I2PPAhAoZPRMnx048+/kyzmsisJ52h14Tj +/b6FUTAqwOGzM8I= +=b7JY +-END PGP SIGNATURE- Added: dev/spark/v3.3.4-rc1-bin/SparkR_3.3.4.tar.gz.sha512 == --- dev/spark/v3.3.4-rc1-bin/SparkR_3.3.4.tar.gz.sha512 (added) +++ dev/spark/v3.3.4-rc1-bin/SparkR_3.3.4.tar.gz.sha512 Fri Dec 8 19:47:54 2023 @@ -0,0 +1 @@ +33b8848b7b409302945e4ef91ff254507e5f3766ad19404473889ce9ce5d88c5fa9d6a806d1120c8f43c16137a76426183af43e974a2f0bb12cfefce566fb048 SparkR_3.3.4.tar.gz Added: dev/spark/v3.3.4-rc1-bin/pyspark-3.3.4.tar.gz == Binary file - no diff available. 
Propchange: dev/spark/v3.3.4-rc1-bin/pyspark-3.3.4.tar.gz -- svn:mime-type = application/octet-stream Added: dev/spark/v3.3.4-rc1-bin/pyspark-3.3.4.tar.gz.asc == --- dev/spark/v3.3.4-rc1-bin/pyspark-3.3.4.tar.gz.asc (added) +++ dev/spark/v3.3.4-rc1-bin/pyspark-3.3.4.tar.gz.asc Fri Dec 8 19:47:54 2023 @@ -0,0 +1,17 @@ +-BEGIN PGP SIGNATURE- + +iQJIBAABCgAyFiEE8oycklwYjDXjRWFN7aAM6DTw/FwFAmVzcokUHGRvbmdqb29u +QGFwYWNoZS5vcmcACgkQ7aAM6DTw/FwUPQ//ST9CejnIpZvrG8L6IEDk/G2QMPZR +LaFUy3DVdZsJ6JBdbo4wQWYl51CO5OvUzYqJC0WYtSdkOS34FagSccwjSg3mzhIa +FObQ86fdYk9wk+2fJl7+jTV6z1SZ99lRjLlN2JyFJLICm+SPVekCDMBmTDS/Ohrg +AVm/yMwjrtSs2QzxlmUIF+8IrriFHQtKUENx1kb0aYWSHgarGqJvzB6UcBOzAter +47vjj0nCfTSCafqyME2TZ6XVopOtrUK+ZENH9MXv1VyKfp4F6tLLMOf7ZyOa44S1 +PfEEOO+SllP9/nW30ydRtcD3ZsMT8YL6+sOuBt+y7nc8c6cJKVUjC62kAIA0yTga +XJU8fu+vT6HGELjME3V0LAiOKZ/IIhICgJENZRHXPVFZPa+gWSBhTGQyZD5e0ujD +iXr5Rrenp17fwAKtt1OeuJDkAht1J4Z/2zt6VzlqbM2GR0EbtDIbNl0TMaFoRkmH +HzwSsY2AsKCks4Ze7DBp9S0Pji93cWybqndm/SDEIRXSKegRus/i+8SFJv03IlZr +Mzkn7IW1RiM17SFd29q5hYyVyZ7Zff40dqYN7rkGWXQBul4U8vRI3WGeOrSB2Oce +jEqgBA1gArm+iojSkAdiAlrDgudNZViTLg0kUtCcdYSb3qFFtRAWEEBXvAAbMeBU +wSHbZ1fFgEELKNk= +=KLJ9 +-END PGP SIGNATURE- Added: dev/spark
(spark) branch master updated: [SPARK-46328][SQL] Allocate capacity of array list of TColumns by columns size in TRowSet generation
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 20c9b3dc4fac [SPARK-46328][SQL] Allocate capacity of array list of TColumns by columns size in TRowSet generation
20c9b3dc4fac is described below

commit 20c9b3dc4fac283f895c8d860b4c6e0144697302
Author: liangbowen
AuthorDate: Fri Dec 8 11:24:35 2023 -0800

[SPARK-46328][SQL] Allocate capacity of array list of TColumns by columns size in TRowSet generation

### What changes were proposed in this pull request?
Allocate enough capacity by columns size for assembling array list of TColumns in TRowSet generation.

### Why are the changes needed?
ArrayLists is created for TColumn value collections in RowSetUtils for TRowSet generation. Currently, they are created with Java's default capacity of 16, rather than by the number of columns, which could cause array copying in assembling each TColumn collection when the column number exceeds the default capacity.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44258 from bowenliang123/rowset-cap.
Authored-by: liangbowen Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/sql/hive/thriftserver/RowSetUtils.scala | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/RowSetUtils.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/RowSetUtils.scala index 94046adca0d8..502e29619027 100644 --- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/RowSetUtils.scala +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/RowSetUtils.scala @@ -57,15 +57,16 @@ object RowSetUtils { val tRows = new java.util.ArrayList[TRow](rowSize) while (i < rowSize) { val row = rows(i) - val tRow = new TRow() var j = 0 val columnSize = row.length + val tColumnValues = new java.util.ArrayList[TColumnValue](columnSize) while (j < columnSize) { val columnValue = toTColumnValue(j, row, schema(j), timeFormatters) -tRow.addToColVals(columnValue) +tColumnValues.add(columnValue) j += 1 } i += 1 + val tRow = new TRow(tColumnValues) tRows.add(tRow) } new TRowSet(startRowOffSet, tRows) @@ -80,11 +81,13 @@ object RowSetUtils { val tRowSet = new TRowSet(startRowOffSet, new java.util.ArrayList[TRow](rowSize)) var i = 0 val columnSize = schema.length +val tColumns = new java.util.ArrayList[TColumn](columnSize) while (i < columnSize) { val tColumn = toTColumn(rows, i, schema(i), timeFormatters) - tRowSet.addToColumns(tColumn) + tColumns.add(tColumn) i += 1 } +tRowSet.setColumns(tColumns) tRowSet } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
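The idea behind SPARK-46328 can be shown in isolation: when the final element count is known up front, passing it to the `java.util.ArrayList(int initialCapacity)` constructor avoids the internal array copies that happen when a default-sized list grows. The sketch below is a simplified Java illustration under stated assumptions; `PresizedListSketch` and `toColumnValues` are hypothetical names, and `String` stands in for the Thrift column-value type. For reference, the Javadoc for the no-argument `ArrayList` constructor documents an initial capacity of ten.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper illustrating pre-sized list allocation; not Spark's RowSetUtils. */
final class PresizedListSketch {
  /** Converts one row into a list whose backing array is sized once, up front. */
  static List<String> toColumnValues(Object[] row) {
    int columnSize = row.length;
    // Capacity equals the known number of columns, so add() never triggers a resize.
    List<String> values = new ArrayList<>(columnSize);
    for (int j = 0; j < columnSize; j++) {
      values.add(String.valueOf(row[j]));
    }
    return values;
  }
}
```

The same reasoning explains the second hunk of the diff, where the list of columns is built with `columnSize` capacity and attached to the row set in a single call after the loop.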
(spark) 01/01: Preparing development version 3.3.5-SNAPSHOT
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git commit 0837e78d9cee986f3f791072df2488947dcf0b9a Author: Dongjoon Hyun AuthorDate: Fri Dec 8 18:28:19 2023 + Preparing development version 3.3.5-SNAPSHOT --- R/pkg/DESCRIPTION | 2 +- assembly/pom.xml | 2 +- common/kvstore/pom.xml | 2 +- common/network-common/pom.xml | 2 +- common/network-shuffle/pom.xml | 2 +- common/network-yarn/pom.xml| 2 +- common/sketch/pom.xml | 2 +- common/tags/pom.xml| 2 +- common/unsafe/pom.xml | 2 +- core/pom.xml | 2 +- docs/_config.yml | 6 +++--- examples/pom.xml | 2 +- external/avro/pom.xml | 2 +- external/docker-integration-tests/pom.xml | 2 +- external/kafka-0-10-assembly/pom.xml | 2 +- external/kafka-0-10-sql/pom.xml| 2 +- external/kafka-0-10-token-provider/pom.xml | 2 +- external/kafka-0-10/pom.xml| 2 +- external/kinesis-asl-assembly/pom.xml | 2 +- external/kinesis-asl/pom.xml | 2 +- external/spark-ganglia-lgpl/pom.xml| 2 +- graphx/pom.xml | 2 +- hadoop-cloud/pom.xml | 2 +- launcher/pom.xml | 2 +- mllib-local/pom.xml| 2 +- mllib/pom.xml | 2 +- pom.xml| 2 +- python/pyspark/version.py | 2 +- repl/pom.xml | 2 +- resource-managers/kubernetes/core/pom.xml | 2 +- resource-managers/kubernetes/integration-tests/pom.xml | 2 +- resource-managers/mesos/pom.xml| 2 +- resource-managers/yarn/pom.xml | 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- 39 files changed, 41 insertions(+), 41 deletions(-) diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION index ac01697363e4..6c1d59584cd0 100644 --- a/R/pkg/DESCRIPTION +++ b/R/pkg/DESCRIPTION @@ -1,6 +1,6 @@ Package: SparkR Type: Package -Version: 3.3.4 +Version: 3.3.5 Title: R Front End for 'Apache Spark' Description: Provides an R Front end for 'Apache Spark' <https://spark.apache.org>. 
Authors@R: diff --git a/assembly/pom.xml b/assembly/pom.xml index e3e15e41efb8..11a746d6701d 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.12 -3.3.4 +3.3.5-SNAPSHOT ../pom.xml diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml index 66fc33f5bb0b..99f36c865fba 100644 --- a/common/kvstore/pom.xml +++ b/common/kvstore/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.12 -3.3.4 +3.3.5-SNAPSHOT ../../pom.xml diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml index fd81fe02bb03..e67d16c8eb9a 100644 --- a/common/network-common/pom.xml +++ b/common/network-common/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.12 -3.3.4 +3.3.5-SNAPSHOT ../../pom.xml diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml index 9de437c7291a..44432ad56ce2 100644 --- a/common/network-shuffle/pom.xml +++ b/common/network-shuffle/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.12 -3.3.4 +3.3.5-SNAPSHOT ../../pom.xml diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml index 919717f81a94..6a9377402e77 100644 --- a/common/network-yarn/pom.xml +++ b/common/network-yarn/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.12 -3.3.4 +3.3.5-SNAPSHOT ../../pom.xml diff --git a/common/sketch/pom.xml b/common/sketch/pom.xml index 429544784335..e6baac59e0f2 100644 --- a/common/sketch/pom.xml +++ b/common/sketch/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.12 -3.3.4 +3.3.5-SNAPSHOT ../../pom.xml diff --git a/common/tags/pom.xml b/
(spark) 01/01: Preparing Spark release v3.3.4-rc1
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to tag v3.3.4-rc1 in repository https://gitbox.apache.org/repos/asf/spark.git commit 18db204995b32e87a650f2f09f9bcf047ddafa90 Author: Dongjoon Hyun AuthorDate: Fri Dec 8 18:28:15 2023 + Preparing Spark release v3.3.4-rc1 --- assembly/pom.xml | 2 +- common/kvstore/pom.xml | 2 +- common/network-common/pom.xml | 2 +- common/network-shuffle/pom.xml | 2 +- common/network-yarn/pom.xml| 2 +- common/sketch/pom.xml | 2 +- common/tags/pom.xml| 2 +- common/unsafe/pom.xml | 2 +- core/pom.xml | 2 +- docs/_config.yml | 2 +- examples/pom.xml | 2 +- external/avro/pom.xml | 2 +- external/docker-integration-tests/pom.xml | 2 +- external/kafka-0-10-assembly/pom.xml | 2 +- external/kafka-0-10-sql/pom.xml| 2 +- external/kafka-0-10-token-provider/pom.xml | 2 +- external/kafka-0-10/pom.xml| 2 +- external/kinesis-asl-assembly/pom.xml | 2 +- external/kinesis-asl/pom.xml | 2 +- external/spark-ganglia-lgpl/pom.xml| 2 +- graphx/pom.xml | 2 +- hadoop-cloud/pom.xml | 2 +- launcher/pom.xml | 2 +- mllib-local/pom.xml| 2 +- mllib/pom.xml | 2 +- pom.xml| 2 +- python/pyspark/version.py | 2 +- repl/pom.xml | 2 +- resource-managers/kubernetes/core/pom.xml | 2 +- resource-managers/kubernetes/integration-tests/pom.xml | 2 +- resource-managers/mesos/pom.xml| 2 +- resource-managers/yarn/pom.xml | 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- 38 files changed, 38 insertions(+), 38 deletions(-) diff --git a/assembly/pom.xml b/assembly/pom.xml index b3d7ee94bb3c..e3e15e41efb8 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.12 -3.3.4-SNAPSHOT +3.3.4 ../pom.xml diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml index 8a8d44dfd941..66fc33f5bb0b 100644 --- a/common/kvstore/pom.xml +++ b/common/kvstore/pom.xml @@ -22,7 +22,7 @@ 
org.apache.spark spark-parent_2.12 -3.3.4-SNAPSHOT +3.3.4 ../../pom.xml diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml index e76b78305aa7..fd81fe02bb03 100644 --- a/common/network-common/pom.xml +++ b/common/network-common/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.12 -3.3.4-SNAPSHOT +3.3.4 ../../pom.xml diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml index 92324bf17469..9de437c7291a 100644 --- a/common/network-shuffle/pom.xml +++ b/common/network-shuffle/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.12 -3.3.4-SNAPSHOT +3.3.4 ../../pom.xml diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml index 6eaf00a3dca6..919717f81a94 100644 --- a/common/network-yarn/pom.xml +++ b/common/network-yarn/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.12 -3.3.4-SNAPSHOT +3.3.4 ../../pom.xml diff --git a/common/sketch/pom.xml b/common/sketch/pom.xml index 3fcff5046da0..429544784335 100644 --- a/common/sketch/pom.xml +++ b/common/sketch/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.12 -3.3.4-SNAPSHOT +3.3.4 ../../pom.xml diff --git a/common/tags/pom.xml b/common/tags/pom.xml index c04f0eb556e9..6a3243424567 100644 --- a/common/tags/pom.xml +++ b/common/tags/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.12 -3.3.4-SNAPSHOT +3.3.4 ../../pom.xml diff --git a/common/unsafe/pom.xml b/common/unsafe/pom.xml index 84a5d404faec..0675a3554c47 100644 --- a/common/unsafe/pom.xml +++ b/common/unsafe/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark
(spark) branch branch-3.3 updated (6a4488f2f486 -> 0837e78d9cee)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 6a4488f2f486 [SPARK-45580][SQL][3.3] Handle case where a nested subquery becomes an existence join
     add 18db204995b3 Preparing Spark release v3.3.4-rc1
     new 0837e78d9cee Preparing development version 3.3.5-SNAPSHOT

The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.

Summary of changes:
 R/pkg/DESCRIPTION | 2 +-
 assembly/pom.xml | 2 +-
 common/kvstore/pom.xml | 2 +-
 common/network-common/pom.xml | 2 +-
 common/network-shuffle/pom.xml | 2 +-
 common/network-yarn/pom.xml | 2 +-
 common/sketch/pom.xml | 2 +-
 common/tags/pom.xml | 2 +-
 common/unsafe/pom.xml | 2 +-
 core/pom.xml | 2 +-
 docs/_config.yml | 6 +++---
 examples/pom.xml | 2 +-
 external/avro/pom.xml | 2 +-
 external/docker-integration-tests/pom.xml | 2 +-
 external/kafka-0-10-assembly/pom.xml | 2 +-
 external/kafka-0-10-sql/pom.xml | 2 +-
 external/kafka-0-10-token-provider/pom.xml | 2 +-
 external/kafka-0-10/pom.xml | 2 +-
 external/kinesis-asl-assembly/pom.xml | 2 +-
 external/kinesis-asl/pom.xml | 2 +-
 external/spark-ganglia-lgpl/pom.xml | 2 +-
 graphx/pom.xml | 2 +-
 hadoop-cloud/pom.xml | 2 +-
 launcher/pom.xml | 2 +-
 mllib-local/pom.xml | 2 +-
 mllib/pom.xml | 2 +-
 pom.xml | 2 +-
 python/pyspark/version.py | 2 +-
 repl/pom.xml | 2 +-
 resource-managers/kubernetes/core/pom.xml | 2 +-
 resource-managers/kubernetes/integration-tests/pom.xml | 2 +-
 resource-managers/mesos/pom.xml | 2 +-
 resource-managers/yarn/pom.xml | 2 +-
 sql/catalyst/pom.xml | 2 +-
 sql/core/pom.xml | 2 +-
 sql/hive-thriftserver/pom.xml | 2 +-
 sql/hive/pom.xml | 2 +-
 streaming/pom.xml | 2 +-
 tools/pom.xml | 2 +-
 39 files changed, 41 insertions(+), 41 deletions(-)
(spark) tag v3.3.4-rc1 created (now 18db204995b3)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to tag v3.3.4-rc1 in repository https://gitbox.apache.org/repos/asf/spark.git at 18db204995b3 (commit) This tag includes the following new commits: new 18db204995b3 Preparing Spark release v3.3.4-rc1 The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.
(spark) branch master updated (c06d41859f08 -> 9ffdcc398ed5)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from c06d41859f08 [SPARK-46320][CORE] Support `spark.master.rest.host` add 9ffdcc398ed5 [SPARK-46321][PS][TESTS] Re-ennable `IndexesTests.test_asof` that was skipped due to Pandas bug No new revisions were added by this update. Summary of changes: python/pyspark/pandas/tests/indexes/test_base.py | 9 - 1 file changed, 4 insertions(+), 5 deletions(-)
(spark) branch master updated (b6b450927ec8 -> c06d41859f08)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from b6b450927ec8 [SPARK-46317][PYTHON][CONNECT] Match minor behaviour matching in SparkSession with full test coverage add c06d41859f08 [SPARK-46320][CORE] Support `spark.master.rest.host` No new revisions were added by this update. Summary of changes: core/src/main/scala/org/apache/spark/deploy/master/Master.scala | 3 ++- .../src/main/scala/org/apache/spark/internal/config/package.scala | 6 ++ docs/spark-standalone.md | 8 3 files changed, 16 insertions(+), 1 deletion(-)
(spark) branch master updated: [SPARK-46316][CORE] Enable `buf-lint-action` on `core` module
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7e4a63a0588f [SPARK-46316][CORE] Enable `buf-lint-action` on `core` module 7e4a63a0588f is described below commit 7e4a63a0588f1b4b16e76d4d7d1add19cb2f0a82 Author: Dongjoon Hyun AuthorDate: Thu Dec 7 19:57:08 2023 -0800 [SPARK-46316][CORE] Enable `buf-lint-action` on `core` module ### What changes were proposed in this pull request? This PR aims to enable `buf-lint-action` on `core` module. ### Why are the changes needed? To enforce the community guideline. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ![Screenshot 2023-12-07 at 7 40 54 PM](https://github.com/apache/spark/assets/9700541/b23f-d8be-410a-bc61-88f8b477a3b0) ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44246 from dongjoon-hyun/SPARK-46316. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 4 core/src/main/protobuf/buf.yaml | 23 +++ 2 files changed, 27 insertions(+) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 0e1a4a810f8a..e54883552920 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -583,6 +583,10 @@ jobs: uses: bufbuild/buf-setup-action@v1 with: github_token: ${{ secrets.GITHUB_TOKEN }} +- name: Protocol Buffers Linter + uses: bufbuild/buf-lint-action@v1 + with: +input: core/src/main/protobuf # Change 'branch-3.5' to 'branch-4.0' in master branch after cutting branch-4.0 branch. 
- name: Breaking change detection against branch-3.5 uses: bufbuild/buf-breaking-action@v1 diff --git a/core/src/main/protobuf/buf.yaml b/core/src/main/protobuf/buf.yaml new file mode 100644 index ..47f69191a5c7 --- /dev/null +++ b/core/src/main/protobuf/buf.yaml @@ -0,0 +1,23 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +version: v1 +breaking: + use: +- FILE +lint: + use: +- BASIC - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: Revert "[SPARK-46316][CORE] Enable `buf-lint-action` on `core` module"
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 80dc64a573e1 Revert "[SPARK-46316][CORE] Enable `buf-lint-action` on `core` module"
80dc64a573e1 is described below

commit 80dc64a573e1c7678f92f8690f09a52329f7d30b
Author: Dongjoon Hyun
AuthorDate: Thu Dec 7 20:03:01 2023 -0800

    Revert "[SPARK-46316][CORE] Enable `buf-lint-action` on `core` module"

    This reverts commit dcbae0643ce145df6cd0a7a68af3fdd1a062587b.
---
 .github/workflows/build_and_test.yml |  4 ----
 core/src/main/protobuf/buf.yaml      | 23 -----------------------
 2 files changed, 27 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index e54883552920..0e1a4a810f8a 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -583,10 +583,6 @@ jobs:
       uses: bufbuild/buf-setup-action@v1
       with:
         github_token: ${{ secrets.GITHUB_TOKEN }}
-    - name: Protocol Buffers Linter
-      uses: bufbuild/buf-lint-action@v1
-      with:
-        input: core/src/main/protobuf
     # Change 'branch-3.5' to 'branch-4.0' in master branch after cutting branch-4.0 branch.
     - name: Breaking change detection against branch-3.5
       uses: bufbuild/buf-breaking-action@v1
diff --git a/core/src/main/protobuf/buf.yaml b/core/src/main/protobuf/buf.yaml
deleted file mode 100644
index 47f69191a5c7..
--- a/core/src/main/protobuf/buf.yaml
+++ /dev/null
@@ -1,23 +0,0 @@
-#
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements. See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License. You may obtain a copy of the License at
-#
-#    http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-version: v1
-breaking:
-  use:
-    - FILE
-lint:
-  use:
-    - BASIC

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (105eee73cfa0 -> dcbae0643ce1)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 105eee73cfa0 [SPARK-46313][CORE] Log `Spark HA` recovery duration
     add dcbae0643ce1 [SPARK-46316][CORE] Enable `buf-lint-action` on `core` module

No new revisions were added by this update.

Summary of changes:
 .github/workflows/build_and_test.yml                          | 4 ++++
 {connector/connect/common => core}/src/main/protobuf/buf.yaml | 4 +---
 2 files changed, 5 insertions(+), 3 deletions(-)
 copy {connector/connect/common => core}/src/main/protobuf/buf.yaml (94%)
(spark) branch master updated (61a3e0587df6 -> 105eee73cfa0)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 61a3e0587df6 [SPARK-46312][CORE] Use `lower_camel_case` in `store_types.proto`
     add 105eee73cfa0 [SPARK-46313][CORE] Log `Spark HA` recovery duration

No new revisions were added by this update.

Summary of changes:
 core/src/main/scala/org/apache/spark/deploy/master/Master.scala | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
(spark) branch master updated: [SPARK-46312][CORE] Use `lower_camel_case` in `store_types.proto`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 61a3e0587df6 [SPARK-46312][CORE] Use `lower_camel_case` in `store_types.proto`
61a3e0587df6 is described below

commit 61a3e0587df6be881cdc115fefb77482fa446b5c
Author: Dongjoon Hyun
AuthorDate: Thu Dec 7 18:16:21 2023 -0800

    [SPARK-46312][CORE] Use `lower_camel_case` in `store_types.proto`

    ### What changes were proposed in this pull request?

    This PR aims to use `lower_camel_case` in `store_types.proto`.

    ### Why are the changes needed?

    According to our guideline, we had better follow
    [FIELD_LOWER_SNAKE_CASE](https://buf.build/docs/lint/rules#field_lower_snake_case)

    https://github.com/apache/spark/blob/9585cf6d56e3af37142609668dda1eeda3ec876f/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto#L23

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Pass the CIs.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #44242 from dongjoon-hyun/SPARK-46312.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 .../main/protobuf/org/apache/spark/status/protobuf/store_types.proto | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto b/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto
index 93365add3a64..386c660b16de 100644
--- a/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto
+++ b/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto
@@ -164,7 +164,7 @@ message ExecutorStageSummaryWrapper {
 message ExecutorResourceRequest {
   optional string resource_name = 1;
   int64 amount = 2;
-  optional string discoveryScript = 3;
+  optional string discovery_script = 3;
   optional string vendor = 4;
 }
 
@@ -277,7 +277,7 @@ message RDDStorageInfoWrapper {
 }
 
 message ResourceProfileWrapper {
-  ResourceProfileInfo rpInfo = 1;
+  ResourceProfileInfo rp_info = 1;
 }
 
 message CachedQuantile {
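Despite the `lower_camel_case` wording in the commit title, the two renames in the diff above (`discoveryScript` to `discovery_script`, `rpInfo` to `rp_info`) go from lowerCamelCase to lower_snake_case, which is the form the linked FIELD_LOWER_SNAKE_CASE rule asks for. The field numbers (3 and 1) are untouched in the diff. A minimal Python sketch of that mechanical conversion follows; `to_lower_snake_case` is a hypothetical helper written for illustration and is not part of Spark or buf.

```python
import re


def to_lower_snake_case(name: str) -> str:
    """Convert a lowerCamelCase identifier to lower_snake_case.

    Hypothetical helper for illustration only. It covers simple names such
    as the two fields renamed in the commit, not runs of capital letters.
    """
    # Put an underscore before each uppercase letter that directly follows
    # a lowercase letter or a digit, then lowercase the whole name.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


if __name__ == "__main__":
    for field in ("discoveryScript", "rpInfo", "resource_name"):
        print(field, "->", to_lower_snake_case(field))
```

Names that are already in lower_snake_case, such as `resource_name`, pass through unchanged.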
(spark) branch master updated: [SPARK-46309][PS][TESTS] Remove unused code in `pyspark.pandas.tests.indexes.* `
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 9585cf6d56e3 [SPARK-46309][PS][TESTS] Remove unused code in `pyspark.pandas.tests.indexes.* ` 9585cf6d56e3 is described below commit 9585cf6d56e3af37142609668dda1eeda3ec876f Author: Ruifeng Zheng AuthorDate: Thu Dec 7 15:04:07 2023 -0800 [SPARK-46309][PS][TESTS] Remove unused code in `pyspark.pandas.tests.indexes.* ` ### What changes were proposed in this pull request? Remove unused code in `pyspark.pandas.tests.indexes.* ` ### Why are the changes needed? clean up the code ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #44239 from zhengruifeng/ps_index_cleanup. Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- .../pyspark/pandas/tests/data_type_ops/test_string_ops.py | 4 python/pyspark/pandas/tests/indexes/test_align.py | 14 -- python/pyspark/pandas/tests/indexes/test_base_slow.py | 7 --- python/pyspark/pandas/tests/indexes/test_reindex.py| 14 -- python/pyspark/pandas/tests/indexes/test_rename.py | 14 -- 5 files changed, 53 deletions(-) diff --git a/python/pyspark/pandas/tests/data_type_ops/test_string_ops.py b/python/pyspark/pandas/tests/data_type_ops/test_string_ops.py index 2870aed8e75e..340153b06335 100644 --- a/python/pyspark/pandas/tests/data_type_ops/test_string_ops.py +++ b/python/pyspark/pandas/tests/data_type_ops/test_string_ops.py @@ -35,10 +35,6 @@ class StringOpsTestsMixin: def bool_pdf(self): return pd.DataFrame({"this": ["x", "y", "z"], "that": ["z", "y", "x"]}) -@property -def bool_psdf(self): -return ps.from_pandas(self.bool_pdf) - @property def bool_non_numeric_pdf(self): return pd.concat([self.bool_pdf, self.non_numeric_pdf], axis=1) diff --git 
a/python/pyspark/pandas/tests/indexes/test_align.py b/python/pyspark/pandas/tests/indexes/test_align.py index 56fde9b4f28b..73e10d441078 100644 --- a/python/pyspark/pandas/tests/indexes/test_align.py +++ b/python/pyspark/pandas/tests/indexes/test_align.py @@ -16,7 +16,6 @@ # import unittest -import numpy as np import pandas as pd from pyspark import pandas as ps @@ -25,19 +24,6 @@ from pyspark.testing.sqlutils import SQLTestUtils class FrameAlignMixin: -@property -def pdf(self): -return pd.DataFrame( -{"a": [1, 2, 3, 4, 5, 6, 7, 8, 9], "b": [4, 5, 6, 3, 2, 1, 0, 0, 0]}, -index=np.random.rand(9), -) - -@property -def df_pair(self): -pdf = self.pdf -psdf = ps.from_pandas(pdf) -return pdf, psdf - def test_align(self): pdf1 = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]}, index=[10, 20, 30]) psdf1 = ps.from_pandas(pdf1) diff --git a/python/pyspark/pandas/tests/indexes/test_base_slow.py b/python/pyspark/pandas/tests/indexes/test_base_slow.py index c890f5004b43..eb417fe47ef8 100644 --- a/python/pyspark/pandas/tests/indexes/test_base_slow.py +++ b/python/pyspark/pandas/tests/indexes/test_base_slow.py @@ -24,13 +24,6 @@ from pyspark.testing.pandasutils import ComparisonTestBase, TestUtils class IndexesSlowTestsMixin: -@property -def pdf(self): -return pd.DataFrame( -{"a": [1, 2, 3, 4, 5, 6, 7, 8, 9], "b": [4, 5, 6, 3, 2, 1, 0, 0, 0]}, -index=[0, 1, 3, 5, 6, 8, 9, 9, 9], -) - def test_append(self): # Index pidx = pd.Index(range(1)) diff --git a/python/pyspark/pandas/tests/indexes/test_reindex.py b/python/pyspark/pandas/tests/indexes/test_reindex.py index 1d544ea221bf..1229a613846b 100644 --- a/python/pyspark/pandas/tests/indexes/test_reindex.py +++ b/python/pyspark/pandas/tests/indexes/test_reindex.py @@ -16,7 +16,6 @@ # import unittest -import numpy as np import pandas as pd from pyspark import pandas as ps @@ -25,19 +24,6 @@ from pyspark.testing.sqlutils import SQLTestUtils class FrameReindexMixin: -@property -def pdf(self): -return pd.DataFrame( -{"a": [1, 2, 3, 4, 
5, 6, 7, 8, 9], "b": [4, 5, 6, 3, 2, 1, 0, 0, 0]}, -index=np.random.rand(9), -) - -@property -def df_pair(self): -pdf = self.pdf -psdf = ps.from_pandas(pdf) -return pdf, psdf - def test_reindex(self): index = pd.Index(["A", "B", "C", "D", &q
(spark) branch master updated (027aeb1764a8 -> 82e67461511e)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 027aeb1764a8 [SPARK-46277][PYTHON] Validate startup urls with the config being set
     add 82e67461511e [SPARK-46311][CORE] Log the final state of drivers during `Master.removeDriver`

No new revisions were added by this update.

Summary of changes:
 core/src/main/scala/org/apache/spark/deploy/master/Master.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
(spark) branch master updated: [SPARK-46293][CONNECT][PYTHON] Use `protobuf` transitive dependency
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new e359318c4493 [SPARK-46293][CONNECT][PYTHON] Use `protobuf` transitive dependency
e359318c4493 is described below

commit e359318c4493e16a7546d70c9340ffc5015aacff
Author: Haejoon Lee
AuthorDate: Thu Dec 7 10:28:27 2023 -0800

    [SPARK-46293][CONNECT][PYTHON] Use `protobuf` transitive dependency

    ### What changes were proposed in this pull request?

    This PR proposes to remove `protobuf` from required package.

    ### Why are the changes needed?

    `protobuf` is automatically installed when installing `grpcio` and `grpcio-status`, so we don't need to specify the specific version explicitly.

    ### Does this PR introduce _any_ user-facing change?

    No API changes.

    ### How was this patch tested?

    The existing CI should pass

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #44221 from itholic/protobuf_docs.

    Authored-by: Haejoon Lee
    Signed-off-by: Dongjoon Hyun
---
 dev/requirements.txt | 1 -
 1 file changed, 1 deletion(-)

diff --git a/dev/requirements.txt b/dev/requirements.txt
index 0f1f1aee5b63..51facfeb5088 100644
--- a/dev/requirements.txt
+++ b/dev/requirements.txt
@@ -53,7 +53,6 @@ py
 # Spark Connect (required)
 grpcio>=1.59.3
 grpcio-status>=1.59.3
-protobuf==4.25.1
 googleapis-common-protos>=1.56.4
 
 # Spark Connect python proto generation plugin (optional)
(spark) branch master updated (0692856bb124 -> 8132e1700c81)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 0692856bb124 [SPARK-46307][PS][TESTS] Enable `fill_value` tests for `GroupByTests.test_shift`
     add 8132e1700c81 [SPARK-46261][CONNECT] `DataFrame.withColumnsRenamed` should keep the dict/map ordering

No new revisions were added by this update.

Summary of changes:
 .../main/scala/org/apache/spark/sql/Dataset.scala  |  8 ++-
 .../main/protobuf/spark/connect/relations.proto    | 14 +++-
 .../queries/withColumnRenamed_java_map.json        | 11 +--
 .../queries/withColumnRenamed_java_map.proto.bin   | Bin 72 -> 72 bytes
 .../queries/withColumnRenamed_scala_map.json       | 11 +--
 .../queries/withColumnRenamed_scala_map.proto.bin  | Bin 72 -> 72 bytes
 .../queries/withColumnRenamed_single.json          |  7 +-
 .../queries/withColumnRenamed_single.proto.bin     | Bin 60 -> 60 bytes
 .../sql/connect/planner/SparkConnectPlanner.scala  | 19 +++--
 python/pyspark/sql/connect/plan.py                 |  8 ++-
 python/pyspark/sql/connect/proto/relations_pb2.py  | 80 +++--
 python/pyspark/sql/connect/proto/relations_pb2.pyi | 34 -
 .../sql/tests/connect/test_parity_dataframe.py     |  5 --
 .../main/scala/org/apache/spark/sql/Dataset.scala  |  2 +-
 14 files changed, 133 insertions(+), 66 deletions(-)
(spark) branch master updated: [SPARK-46301][CORE] Support `spark.worker.(initial|max)RegistrationRetries`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new b92d64d6ef0c [SPARK-46301][CORE] Support `spark.worker.(initial|max)RegistrationRetries`
b92d64d6ef0c is described below

commit b92d64d6ef0c99b6b444f41ebdfe95f3260312aa
Author: Dongjoon Hyun
AuthorDate: Thu Dec 7 01:01:56 2023 -0800

    [SPARK-46301][CORE] Support `spark.worker.(initial|max)RegistrationRetries`

    ### What changes were proposed in this pull request?

    This PR aims to support `spark.worker.(initial|max)RegistrationRetries` to parameterize the hard-coded magic numbers.

    ```scala
    -  private val INITIAL_REGISTRATION_RETRIES = 6
    -  private val TOTAL_REGISTRATION_RETRIES = INITIAL_REGISTRATION_RETRIES + 10
    +  private val INITIAL_REGISTRATION_RETRIES = conf.get(WORKER_INITIAL_REGISTRATION_RETRIES)
    +  private val TOTAL_REGISTRATION_RETRIES = conf.get(WORKER_MAX_REGISTRATION_RETRIES)
    ```

    ### Why are the changes needed?

    To allow users to control these.

    ### Does this PR introduce _any_ user-facing change?

    No. The default values are consistent with the existing behavior.

    ### How was this patch tested?

    Pass the CIs.

    ![Screenshot 2023-12-06 at 8 58 05 PM](https://github.com/apache/spark/assets/9700541/985ff3f7-d8c9-4803-a207-a6c16388e4d0)

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #44229 from dongjoon-hyun/SPARK-46301.
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../scala/org/apache/spark/deploy/worker/Worker.scala | 14 ++ .../org/apache/spark/internal/config/Worker.scala | 17 + docs/spark-standalone.md | 18 ++ 3 files changed, 45 insertions(+), 4 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala b/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala index eae12648b95a..1422a1484f8d 100755 --- a/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala +++ b/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala @@ -96,12 +96,18 @@ private[deploy] class Worker( private val HEARTBEAT_MILLIS = conf.get(WORKER_TIMEOUT) * 1000 / 4 // Model retries to connect to the master, after Hadoop's model. - // The first six attempts to reconnect are in shorter intervals (between 5 and 15 seconds) - // Afterwards, the next 10 attempts are between 30 and 90 seconds. + // The total number of retries are less than or equal to WORKER_MAX_REGISTRATION_RETRIES. + // Within the upper limit, WORKER_MAX_REGISTRATION_RETRIES, + // the first WORKER_INITIAL_REGISTRATION_RETRIES attempts to reconnect are in shorter intervals + // (between 5 and 15 seconds). Afterwards, the next attempts are between 30 and 90 seconds while // A bit of randomness is introduced so that not all of the workers attempt to reconnect at // the same time. 
- private val INITIAL_REGISTRATION_RETRIES = 6 - private val TOTAL_REGISTRATION_RETRIES = INITIAL_REGISTRATION_RETRIES + 10 + private val INITIAL_REGISTRATION_RETRIES = conf.get(WORKER_INITIAL_REGISTRATION_RETRIES) + private val TOTAL_REGISTRATION_RETRIES = conf.get(WORKER_MAX_REGISTRATION_RETRIES) + if (INITIAL_REGISTRATION_RETRIES > TOTAL_REGISTRATION_RETRIES) { +logInfo(s"${WORKER_INITIAL_REGISTRATION_RETRIES.key} ($INITIAL_REGISTRATION_RETRIES) is " + + s"capped by ${WORKER_MAX_REGISTRATION_RETRIES.key} ($TOTAL_REGISTRATION_RETRIES)") + } private val FUZZ_MULTIPLIER_INTERVAL_LOWER_BOUND = 0.500 private val REGISTRATION_RETRY_FUZZ_MULTIPLIER = { val randomNumberGenerator = new Random(UUID.randomUUID.getMostSignificantBits) diff --git a/core/src/main/scala/org/apache/spark/internal/config/Worker.scala b/core/src/main/scala/org/apache/spark/internal/config/Worker.scala index f160470edd8f..c53e181df002 100644 --- a/core/src/main/scala/org/apache/spark/internal/config/Worker.scala +++ b/core/src/main/scala/org/apache/spark/internal/config/Worker.scala @@ -37,6 +37,23 @@ private[spark] object Worker { .longConf .createWithDefault(60) + val WORKER_INITIAL_REGISTRATION_RETRIES = ConfigBuilder("spark.worker.initialRegistrationRetries") +.version("4.0.0") +.internal() +.doc("The number of retries to reconnect in short intervals (between 5 and 15 seconds).") +.intConf +.checkValue(_ > 0, "The number of initial registration retries should be positive") +.createWithDefault(6) + + val WORKER_MAX_REGISTRATION_RETRIES = ConfigBuilder("spark.worker.maxRegistrationRetries") +.version("4.0.0") +.internal() +.doc(&
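The retry model described in the `Worker.scala` comment above can be sketched in a few lines of Python. This is a simplified illustration, not the Worker's actual scheduling code. Assumptions that do not appear in the email: the base intervals of 10 and 60 seconds (chosen so that a fuzz multiplier in [0.5, 1.5) reproduces the quoted 5 to 15 and 30 to 90 second ranges), the way the multiplier is drawn, and the default of 16 total retries (inferred from the old `6 + 10` constant, since the diff is cut off before the new default).

```python
import random

# Assumed base intervals in seconds; with a multiplier in [0.5, 1.5) they
# yield the 5-15 s and 30-90 s ranges mentioned in the source comment.
INITIAL_BASE_SECONDS = 10
PROLONGED_BASE_SECONDS = 60
FUZZ_LOWER_BOUND = 0.5  # mirrors FUZZ_MULTIPLIER_INTERVAL_LOWER_BOUND = 0.500


def retry_intervals(initial_retries=6, max_retries=16, fuzz=None):
    """Return the wait time before each registration retry, in seconds.

    Hypothetical helper: the first `initial_retries` attempts use the short
    interval, later ones the long interval, and the total never exceeds
    `max_retries`, so a larger `initial_retries` is effectively capped.
    """
    if fuzz is None:
        # One multiplier per worker so that workers do not retry in lockstep.
        fuzz = random.random() + FUZZ_LOWER_BOUND
    intervals = []
    for attempt in range(1, max_retries + 1):
        if attempt <= initial_retries:
            base = INITIAL_BASE_SECONDS
        else:
            base = PROLONGED_BASE_SECONDS
        intervals.append(base * fuzz)
    return intervals


if __name__ == "__main__":
    print(retry_intervals(fuzz=1.0))
```

With `initial_retries=8` and `max_retries=5` the function returns only five short intervals, which corresponds to the "is capped by" log message added in the diff.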
(spark) branch branch-3.3 updated: [SPARK-45580][SQL][3.3] Handle case where a nested subquery becomes an existence join
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new 6a4488f2f48 [SPARK-45580][SQL][3.3] Handle case where a nested subquery becomes an existence join
6a4488f2f48 is described below

commit 6a4488f2f4861df41025480cceda643e9e74484e
Author: Bruce Robbins
AuthorDate: Wed Dec 6 19:24:13 2023 -0800

    [SPARK-45580][SQL][3.3] Handle case where a nested subquery becomes an existence join

    ### What changes were proposed in this pull request?

    This is a back-port of https://github.com/apache/spark/pull/44193.

    In `RewritePredicateSubquery`, prune existence flags from the final join when `rewriteExistentialExpr` returns an existence join. This change prunes the flags (attributes with the name "exists") by adding a `Project` node.

    For example:
    ```
    Join LeftSemi, ((a#13 = c1#15) OR exists#19)
    :- Join ExistenceJoin(exists#19), (a#13 = col1#17)
    :  :- LocalRelation [a#13]
    :  +- LocalRelation [col1#17]
    +- LocalRelation [c1#15]
    ```
    becomes
    ```
    Project [a#13]
    +- Join LeftSemi, ((a#13 = c1#15) OR exists#19)
       :- Join ExistenceJoin(exists#19), (a#13 = col1#17)
       :  :- LocalRelation [a#13]
       :  +- LocalRelation [col1#17]
       +- LocalRelation [c1#15]
    ```
    This change always adds the `Project` node, whether `rewriteExistentialExpr` returns an existence join or not. In the case when `rewriteExistentialExpr` does not return an existence join, `RemoveNoopOperators` will remove the unneeded `Project` node.

    ### Why are the changes needed?

    This query returns an extraneous boolean column when run in spark-sql:
    ```
    create or replace temp view t1(a) as values (1), (2), (3), (7);
    create or replace temp view t2(c1) as values (1), (2), (3);
    create or replace temp view t3(col1) as values (3), (9);

    select *
    from t1
    where exists (
      select c1
      from t2
      where a = c1
      or a in (select col1 from t3)
    );

    1 false
    2 false
    3 true
    ```
    (Note: the above query will not have the extraneous boolean column when run from the Dataset API. That is because the Dataset API truncates the rows based on the schema of the analyzed plan. The bug occurs during optimization).

    This query fails when run in either spark-sql or using the Dataset API:
    ```
    select (
      select *
      from t1
      where exists (
        select c1
        from t2
        where a = c1
        or a in (select col1 from t3)
      )
      limit 1
    )
    from range(1);

    java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis
    ```

    ### Does this PR introduce _any_ user-facing change?

    No, except for the removal of the extraneous boolean flag and the fix to the error condition.

    ### How was this patch tested?

    New unit test.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #44223 from bersprockets/schema_change_br33.
Authored-by: Bruce Robbins Signed-off-by: Dongjoon Hyun --- .../spark/sql/catalyst/optimizer/subquery.scala| 9 +++-- .../scala/org/apache/spark/sql/SubquerySuite.scala | 46 ++ 2 files changed, 52 insertions(+), 3 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala index 7ef5ef55fab..ff198c798b9 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala @@ -113,16 +113,19 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper { withSubquery.foldLeft(newFilter) { case (p, Exists(sub, _, _, conditions)) => val (joinCond, outerPlan) = rewriteExistentialExpr(conditions, p) - buildJoin(outerPlan, sub, LeftSemi, joinCond) + val join = buildJoin(outerPlan, sub, LeftSemi, joinCond) + Project(p.output, join) case (p, Not(Exists(sub, _, _, conditions))) => val (joinCond, outerPlan) = rewriteExistentialExpr(conditions, p) - buildJoin(outerPlan, sub, LeftAnti, joinCond) + val join = buildJoin(outerPlan, sub, LeftAnti, joinCond) + Project(p.output, join) case (p, InSubquery(values, ListQuery(sub, _, _, _, conditions))) => // Deduplicate conflicting attributes if any. val newSub = dedupSubqueryOnSelfJoin(p, sub, Some(values)) val inConditions = values.zip(newSub.output).map(EqualTo.tupled)
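To see why the extra `Project` matters, the plan from the commit message can be replayed on plain Python lists. This is a toy model only: `existence_join`, `left_semi_join` and `project` are hypothetical helpers that imitate the semantics of the corresponding plan nodes on the `t1`, `t2` and `t3` data from the email, and they say nothing about how Catalyst implements them.

```python
# Data from the temp views in the commit message.
t1 = [{"a": v} for v in (1, 2, 3, 7)]
t2 = [{"c1": v} for v in (1, 2, 3)]
t3 = [{"col1": v} for v in (3, 9)]


def existence_join(left, right, cond, flag="exists"):
    """Keep every left row and add a boolean flag: does any right row match?"""
    return [{**l, flag: any(cond(l, r) for r in right)} for l in left]


def left_semi_join(left, right, cond):
    """Keep the left rows (with all their columns) that match some right row."""
    return [l for l in left if any(cond(l, r) for r in right)]


def project(rows, columns):
    """Keep only the named columns of each row."""
    return [{c: row[c] for c in columns} for row in rows]


# Join ExistenceJoin(exists), (a = col1)
flagged = existence_join(t1, t3, lambda l, r: l["a"] == r["col1"])
# Join LeftSemi, ((a = c1) OR exists)
joined = left_semi_join(flagged, t2, lambda l, r: l["a"] == r["c1"] or l["exists"])
# Project [a]: the node added by the fix, restoring the outer plan's output.
pruned = project(joined, ["a"])

if __name__ == "__main__":
    print("without Project:", joined)
    print("with Project:   ", pruned)
```

Without the projection each surviving row still carries the `exists` flag, which mirrors the `1 false`, `2 false`, `3 true` output quoted above. Projecting back to the outer plan's columns leaves only `a`.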
(spark) branch branch-3.4 updated: [SPARK-45580][SQL][3.4] Handle case where a nested subquery becomes an existence join
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.4 by this push:
     new 8e40ec6fa52 [SPARK-45580][SQL][3.4] Handle case where a nested subquery becomes an existence join
8e40ec6fa52 is described below

commit 8e40ec6fa525420c1da5ce3b8846ef9f540b9d49
Author: Bruce Robbins
AuthorDate: Wed Dec 6 19:23:19 2023 -0800

    [SPARK-45580][SQL][3.4] Handle case where a nested subquery becomes an existence join

    ### What changes were proposed in this pull request?

    This is a back-port of https://github.com/apache/spark/pull/44193.

    In `RewritePredicateSubquery`, prune existence flags from the final join when `rewriteExistentialExpr` returns an existence join. This change prunes the flags (attributes with the name "exists") by adding a `Project` node.

    For example:
    ```
    Join LeftSemi, ((a#13 = c1#15) OR exists#19)
    :- Join ExistenceJoin(exists#19), (a#13 = col1#17)
    :  :- LocalRelation [a#13]
    :  +- LocalRelation [col1#17]
    +- LocalRelation [c1#15]
    ```
    becomes
    ```
    Project [a#13]
    +- Join LeftSemi, ((a#13 = c1#15) OR exists#19)
       :- Join ExistenceJoin(exists#19), (a#13 = col1#17)
       :  :- LocalRelation [a#13]
       :  +- LocalRelation [col1#17]
       +- LocalRelation [c1#15]
    ```
    This change always adds the `Project` node, whether `rewriteExistentialExpr` returns an existence join or not. In the case when `rewriteExistentialExpr` does not return an existence join, `RemoveNoopOperators` will remove the unneeded `Project` node.

    ### Why are the changes needed?

    This query returns an extraneous boolean column when run in spark-sql:
    ```
    create or replace temp view t1(a) as values (1), (2), (3), (7);
    create or replace temp view t2(c1) as values (1), (2), (3);
    create or replace temp view t3(col1) as values (3), (9);

    select *
    from t1
    where exists (
      select c1
      from t2
      where a = c1
      or a in (select col1 from t3)
    );

    1 false
    2 false
    3 true
    ```
    (Note: the above query will not have the extraneous boolean column when run from the Dataset API. That is because the Dataset API truncates the rows based on the schema of the analyzed plan. The bug occurs during optimization).

    This query fails when run in either spark-sql or using the Dataset API:
    ```
    select (
      select *
      from t1
      where exists (
        select c1
        from t2
        where a = c1
        or a in (select col1 from t3)
      )
      limit 1
    )
    from range(1);

    java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis
    ```

    ### Does this PR introduce _any_ user-facing change?

    No, except for the removal of the extraneous boolean flag and the fix to the error condition.

    ### How was this patch tested?

    New unit test.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #44219 from bersprockets/schema_change_br34.
Authored-by: Bruce Robbins Signed-off-by: Dongjoon Hyun --- .../spark/sql/catalyst/optimizer/subquery.scala| 9 +++-- .../scala/org/apache/spark/sql/SubquerySuite.scala | 46 ++ 2 files changed, 52 insertions(+), 3 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala index 1d2f5602630..861f2f2fabf 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala @@ -118,16 +118,19 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper { withSubquery.foldLeft(newFilter) { case (p, Exists(sub, _, _, conditions, subHint)) => val (joinCond, outerPlan) = rewriteExistentialExpr(conditions, p) - buildJoin(outerPlan, sub, LeftSemi, joinCond, subHint) + val join = buildJoin(outerPlan, sub, LeftSemi, joinCond, subHint) + Project(p.output, join) case (p, Not(Exists(sub, _, _, conditions, subHint))) => val (joinCond, outerPlan) = rewriteExistentialExpr(conditions, p) - buildJoin(outerPlan, sub, LeftAnti, joinCond, subHint) + val join = buildJoin(outerPlan, sub, LeftAnti, joinCond, subHint) + Project(p.output, join) case (p, InSubquery(values, ListQuery(sub, _, _, _, conditions, subHint))) => // Deduplicate conflicting attributes if any. val newSub = dedupSubqueryOnSelfJoin(p, sub, Some(values)) val inCond
(spark) branch master updated: [SPARK-46299][DOCS] Make `spark.deploy.recovery*` docs up-to-date
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 43ca0b929ab [SPARK-46299][DOCS] Make `spark.deploy.recovery*` docs up-to-date 43ca0b929ab is described below commit 43ca0b929ab3c2f10d1879e5df622195564f8885 Author: Dongjoon Hyun AuthorDate: Wed Dec 6 19:19:41 2023 -0800 [SPARK-46299][DOCS] Make `spark.deploy.recovery*` docs up-to-date ### What changes were proposed in this pull request? This PR aims to update `Spark Standalone` cluster recovery configurations. ### Why are the changes needed? We need to document - #44173 - #44129 - #44113 ![Screenshot 2023-12-06 at 7 15 24 PM](https://github.com/apache/spark/assets/9700541/04f0be6f-cdfb-4d87-b1b5-c4bf131f460a) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44227 from dongjoon-hyun/SPARK-46299. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- docs/spark-standalone.md | 26 +++--- 1 file changed, 23 insertions(+), 3 deletions(-) diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md index 7a89c8124bd..25d2fba47ce 100644 --- a/docs/spark-standalone.md +++ b/docs/spark-standalone.md @@ -735,18 +735,38 @@ In order to enable this recovery mode, you can set SPARK_DAEMON_JAVA_OPTS in spa spark.deploy.recoveryMode NONE -The recovery mode setting to recover submitted Spark jobs with cluster mode when it failed and relaunches. - Set to FILESYSTEM to enable single-node recovery mode, ZOOKEEPER to use Zookeeper-based recovery mode, and +The recovery mode setting to recover submitted Spark jobs with cluster mode when it failed and relaunches. 
Set to + FILESYSTEM to enable file-system-based single-node recovery mode, + ROCKSDB to enable RocksDB-based single-node recovery mode, + ZOOKEEPER to use Zookeeper-based recovery mode, and CUSTOM to provide a customer provider class via additional `spark.deploy.recoveryMode.factory` configuration. + NONE is the default value which disables this recovery mode. 0.8.1 spark.deploy.recoveryDirectory "" -The directory in which Spark will store recovery state, accessible from the Master's perspective. +The directory in which Spark will store recovery state, accessible from the Master's perspective. + Note that the directory should be clearly manualy if spark.deploy.recoveryMode, + spark.deploy.recoverySerializer, or spark.deploy.recoveryCompressionCodec is changed. + 0.8.1 + +spark.deploy.recoverySerializer +JAVA +A serializer for writing/reading objects to/from persistence engines; JAVA (default) or KRYO. + Java serializer has been the default mode since Spark 0.8.1. + Kryo serializer is a new fast and compact mode from Spark 4.0.0. +4.0.0 + + +spark.deploy.recoveryCompressionCodec +(none) +A compression codec for persistence engines. none (default), lz4, lzf, snappy, and zstd. Currently, only FILESYSTEM mode supports this configuration. +4.0.0 + spark.deploy.recoveryMode.factory "" - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
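The documentation hunk above notes that recovery mode is enabled through `SPARK_DAEMON_JAVA_OPTS`. As a rough sketch of how the four properties listed in the diff could be combined into that variable, the Python helper below renders a dict as `-Dkey=value` options. `daemon_java_opts` is a hypothetical helper and `/var/spark/recovery` is a made-up path; the property names and the values `FILESYSTEM`, `KRYO` and `zstd` are taken from the diff.

```python
def daemon_java_opts(conf):
    """Render a dict of Spark properties as space-separated -Dkey=value options."""
    return " ".join(f"-D{key}={value}" for key, value in conf.items())


# Example values only. The compression codec is paired with FILESYSTEM mode
# because the docs state that only that mode currently supports it.
recovery_conf = {
    "spark.deploy.recoveryMode": "FILESYSTEM",
    "spark.deploy.recoveryDirectory": "/var/spark/recovery",  # hypothetical path
    "spark.deploy.recoverySerializer": "KRYO",
    "spark.deploy.recoveryCompressionCodec": "zstd",
}

if __name__ == "__main__":
    print(f'SPARK_DAEMON_JAVA_OPTS="{daemon_java_opts(recovery_conf)}"')
```

Per the note in the diff, the recovery directory would need to be cleared by hand whenever the mode, serializer or codec is changed.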
(spark) branch master updated: [SPARK-46292][CORE][UI] Show a summary of workers in MasterPage
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new e2441c41de4 [SPARK-46292][CORE][UI] Show a summary of workers in MasterPage
e2441c41de4 is described below

commit e2441c41de476b09542db60836d7d853d47f6158
Author: Dongjoon Hyun
AuthorDate: Wed Dec 6 17:49:37 2023 -0800

    [SPARK-46292][CORE][UI] Show a summary of workers in MasterPage

    ### What changes were proposed in this pull request?

    This PR aims to show a summary of workers in MasterPage.

    ### Why are the changes needed?

    Although `Alive Workers` is useful information, it is insufficient for analyzing the status of the whole cluster because we don't know how many workers are in other states. The summary is especially useful during the recovery process of a Spark Master HA setup. In short, this helps users identify issues intuitively.

    ```
    - Alive Workers: 1
    + Workers: 1 Alive, 1 Dead, 0 Decommissioned, 0 Unknown
    ```

    Here is a screenshot.

    ![Screenshot 2023-12-06 at 3 13 43 PM](https://github.com/apache/spark/assets/9700541/f078b6ae-ab22-4721-8c67-661121bb9807)

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Manual test.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #44218 from dongjoon-hyun/SPARK-46292.
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala index b2f35984d37..f25e3495d79 100644 --- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala +++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala @@ -144,7 +144,11 @@ private[ui] class MasterPage(parent: MasterWebUI) extends WebUIPage("") { }.getOrElse { Seq.empty } } - Alive Workers: {aliveWorkers.length} + Workers: {aliveWorkers.length} Alive, +{workers.count(_.state == WorkerState.DEAD)} Dead, +{workers.count(_.state == WorkerState.DECOMMISSIONED)} Decommissioned, +{workers.count(_.state == WorkerState.UNKNOWN)} Unknown + Cores in use: {aliveWorkers.map(_.cores).sum} Total, {aliveWorkers.map(_.coresUsed).sum} Used Memory in use: - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
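The new summary line is a per-state count over all workers. A small Python sketch of the same aggregation is shown below. It is an illustration only: `worker_summary` is a hypothetical function, and the state name `ALIVE` is assumed alongside the `DEAD`, `DECOMMISSIONED` and `UNKNOWN` states that appear in the diff.

```python
from collections import Counter

# Order in which the states are displayed; ALIVE is an assumed state name.
STATES = ("ALIVE", "DEAD", "DECOMMISSIONED", "UNKNOWN")


def worker_summary(worker_states):
    """Build the 'Workers: ...' line from an iterable of state names."""
    counts = Counter(worker_states)  # missing states count as 0
    parts = [f"{counts[state]} {state.capitalize()}" for state in STATES]
    return "Workers: " + ", ".join(parts)


if __name__ == "__main__":
    print(worker_summary(["ALIVE", "DEAD"]))
```

For one alive and one dead worker this produces the same line as the example in the commit message.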
(spark) branch branch-3.5 updated: [SPARK-45580][SQL][3.5] Handle case where a nested subquery becomes an existence join
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new dbb61981b80 [SPARK-45580][SQL][3.5] Handle case where a nested subquery becomes an existence join dbb61981b80 is described below commit dbb61981b804dbc03cf140c7c76653348e2ac740 Author: Bruce Robbins AuthorDate: Wed Dec 6 15:24:48 2023 -0800 [SPARK-45580][SQL][3.5] Handle case where a nested subquery becomes an existence join ### What changes were proposed in this pull request? This is a back-port of #44193. In `RewritePredicateSubquery`, prune existence flags from the final join when `rewriteExistentialExpr` returns an existence join. This change prunes the flags (attributes with the name "exists") by adding a `Project` node. For example: ``` Join LeftSemi, ((a#13 = c1#15) OR exists#19) :- Join ExistenceJoin(exists#19), (a#13 = col1#17) : :- LocalRelation [a#13] : +- LocalRelation [col1#17] +- LocalRelation [c1#15] ``` becomes ``` Project [a#13] +- Join LeftSemi, ((a#13 = c1#15) OR exists#19) :- Join ExistenceJoin(exists#19), (a#13 = col1#17) : :- LocalRelation [a#13] : +- LocalRelation [col1#17] +- LocalRelation [c1#15] ``` This change always adds the `Project` node, whether `rewriteExistentialExpr` returns an existence join or not. In the case when `rewriteExistentialExpr` does not return an existence join, `RemoveNoopOperators` will remove the unneeded `Project` node. ### Why are the changes needed? 
This query returns an extraneous boolean column when run in spark-sql: ``` create or replace temp view t1(a) as values (1), (2), (3), (7); create or replace temp view t2(c1) as values (1), (2), (3); create or replace temp view t3(col1) as values (3), (9); select * from t1 where exists ( select c1 from t2 where a = c1 or a in (select col1 from t3) ); 1 false 2 false 3 true ``` (Note: the above query will not have the extraneous boolean column when run from the Dataset API. That is because the Dataset API truncates the rows based on the schema of the analyzed plan. The bug occurs during optimization). This query fails when run in either spark-sql or using the Dataset API: ``` select ( select * from t1 where exists ( select c1 from t2 where a = c1 or a in (select col1 from t3) ) limit 1 ) from range(1); java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis ``` ### Does this PR introduce _any_ user-facing change? No, except for the removal of the extraneous boolean flag and the fix to the error condition. ### How was this patch tested? New unit test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44215 from bersprockets/schema_change_br35. 
Authored-by: Bruce Robbins Signed-off-by: Dongjoon Hyun --- .../spark/sql/catalyst/optimizer/subquery.scala| 9 +++-- .../scala/org/apache/spark/sql/SubquerySuite.scala | 46 ++ 2 files changed, 52 insertions(+), 3 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala index 91cd838ad61..ee200531578 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala @@ -118,16 +118,19 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper { withSubquery.foldLeft(newFilter) { case (p, Exists(sub, _, _, conditions, subHint)) => val (joinCond, outerPlan) = rewriteExistentialExpr(conditions, p) - buildJoin(outerPlan, sub, LeftSemi, joinCond, subHint) + val join = buildJoin(outerPlan, sub, LeftSemi, joinCond, subHint) + Project(p.output, join) case (p, Not(Exists(sub, _, _, conditions, subHint))) => val (joinCond, outerPlan) = rewriteExistentialExpr(conditions, p) - buildJoin(outerPlan, sub, LeftAnti, joinCond, subHint) + val join = buildJoin(outerPlan, sub, LeftAnti, joinCond, subHint) + Project(p.output, join) case (p, InSubquery(values, ListQuery(sub, _, _, _, conditions, subHint))) => // Deduplicate conflicting attributes if any. val newSub = dedupSubqueryOnSelfJoin(p, sub, Some(values)) val inConditions = values.zip(newSub.output)
(spark) branch master updated: [SPARK-46230][PYTHON] Migrate `RetriesExceeded` into PySpark error
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push: new 31a48381f51 [SPARK-46230][PYTHON] Migrate `RetriesExceeded` into PySpark error

31a48381f51 is described below

commit 31a48381f5139a51045a10df344df3ce7ad1adb7
Author: Haejoon Lee
AuthorDate: Wed Dec 6 11:00:43 2023 -0800

[SPARK-46230][PYTHON] Migrate `RetriesExceeded` into PySpark error

### What changes were proposed in this pull request?

This PR proposes to migrate `RetriesExceeded` into a PySpark error.

### Why are the changes needed?

All errors defined in PySpark should inherit from `PySparkException` so that the error messages generated by PySpark stay consistent.

### Does this PR introduce _any_ user-facing change?

No, it's internal refactoring for better error handling.

### How was this patch tested?

The existing CI should pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44147 from itholic/retires_exception.
Authored-by: Haejoon Lee Signed-off-by: Dongjoon Hyun --- python/docs/source/reference/pyspark.errors.rst| 1 + python/pyspark/errors/__init__.py | 2 ++ python/pyspark/errors/error_classes.py | 5 + python/pyspark/errors/exceptions/base.py | 7 +++ python/pyspark/sql/connect/client/retries.py | 11 ++- python/pyspark/sql/tests/connect/client/test_client.py | 2 +- python/pyspark/sql/tests/connect/test_connect_basic.py | 3 ++- 7 files changed, 20 insertions(+), 11 deletions(-) diff --git a/python/docs/source/reference/pyspark.errors.rst b/python/docs/source/reference/pyspark.errors.rst index a4997506b41..270a8a8c716 100644 --- a/python/docs/source/reference/pyspark.errors.rst +++ b/python/docs/source/reference/pyspark.errors.rst @@ -48,6 +48,7 @@ Classes PySparkIndexError PythonException QueryExecutionException +RetriesExceeded SessionNotSameException SparkRuntimeException SparkUpgradeException diff --git a/python/pyspark/errors/__init__.py b/python/pyspark/errors/__init__.py index 07033d21643..a4f64e85f87 100644 --- a/python/pyspark/errors/__init__.py +++ b/python/pyspark/errors/__init__.py @@ -46,6 +46,7 @@ from pyspark.errors.exceptions.base import ( # noqa: F401 PySparkAssertionError, PySparkNotImplementedError, PySparkPicklingError, +RetriesExceeded, PySparkKeyError, ) @@ -78,5 +79,6 @@ __all__ = [ "PySparkAssertionError", "PySparkNotImplementedError", "PySparkPicklingError", +"RetriesExceeded", "PySparkKeyError", ] diff --git a/python/pyspark/errors/error_classes.py b/python/pyspark/errors/error_classes.py index c93ffa94149..965fd04a913 100644 --- a/python/pyspark/errors/error_classes.py +++ b/python/pyspark/errors/error_classes.py @@ -813,6 +813,11 @@ ERROR_CLASSES_JSON = """ "Columns do not match in their data type: ." ] }, + "RETRIES_EXCEEDED" : { +"message" : [ + "The maximum number of retries has been exceeded." +] + }, "SCHEMA_MISMATCH_FOR_PANDAS_UDF" : { "message" : [ "Result vector from pandas_udf was not the required length: expected , got ." 
diff --git a/python/pyspark/errors/exceptions/base.py b/python/pyspark/errors/exceptions/base.py index b7d8ed88ec0..b60800da3ff 100644 --- a/python/pyspark/errors/exceptions/base.py +++ b/python/pyspark/errors/exceptions/base.py @@ -260,6 +260,13 @@ class PySparkPicklingError(PySparkException, PicklingError): """ +class RetriesExceeded(PySparkException): +""" +Represents an exception which is considered retriable, but retry limits +were exceeded +""" + + class PySparkKeyError(PySparkException, KeyError): """ Wrapper class for KeyError to support error classes. diff --git a/python/pyspark/sql/connect/client/retries.py b/python/pyspark/sql/connect/client/retries.py index 88fc3fe1ffd..44e5e1834a2 100644 --- a/python/pyspark/sql/connect/client/retries.py +++ b/python/pyspark/sql/connect/client/retries.py @@ -22,7 +22,7 @@ import typing from typing import Optional, Callable, Generator, List, Type from types import TracebackType from pyspark.sql.connect.client.logging import logger -from pyspark.errors import PySparkRuntimeError +from pyspark.errors import PySparkRuntimeError, RetriesExceeded """ This module contains retry system. The system is designed to be @@ -233,7 +233,7 @@ class Retrying: # Exceeded retries logger.debug(f"Given
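The idea behind `RetriesExceeded` in the commit above (a retriable failure that hits the retry limit is reported as a PySpark error carrying the `RETRIES_EXCEEDED` error class) can be sketched without PySpark installed. The `PySparkException` class below is a simplified, hypothetical stand-in, and `call_with_retries` with its parameters is invented for illustration; the actual logic in `retries.py` is not shown in full in this email.

```python
import time


class PySparkException(Exception):
    """Hypothetical, simplified stand-in for pyspark.errors.PySparkException."""

    def __init__(self, error_class=None, message_parameters=None):
        self.error_class = error_class
        self.message_parameters = message_parameters or {}
        super().__init__(error_class)


class RetriesExceeded(PySparkException):
    """A retriable failure that exhausted its retry budget."""


def call_with_retries(func, max_attempts=3, backoff_seconds=0.0):
    """Call func, retrying on ConnectionError (an assumed 'retriable' error)."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except ConnectionError as exc:
            last_error = exc
            if attempt < max_attempts:
                time.sleep(backoff_seconds)  # wait before the next attempt
    # All attempts failed: surface one consistent error type, keeping the cause.
    raise RetriesExceeded(error_class="RETRIES_EXCEEDED") from last_error
```

Because the stand-in inherits from the common base class, a caller could handle it together with other PySpark errors through a single `except PySparkException` clause, which appears to be the consistency benefit the commit message describes.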
(spark) branch master updated: [SPARK-46270][SQL][CORE][SS] Use java16 instanceof expressions to replace the java8 instanceof statement
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push: new 231d89f89ed [SPARK-46270][SQL][CORE][SS] Use java16 instanceof expressions to replace the java8 instanceof statement

231d89f89ed is described below

commit 231d89f89ede2cac6cad596f2a3b36673ad0b2f3
Author: Jiaan Geng
AuthorDate: Wed Dec 6 10:59:39 2023 -0800

[SPARK-46270][SQL][CORE][SS] Use java16 instanceof expressions to replace the java8 instanceof statement

### What changes were proposed in this pull request?

This PR uses Java 16 pattern matching for `instanceof` to replace the Java 8 style `instanceof` check followed by an explicit cast. For example:

```
if (obj instanceof String) {
    String s = (String) obj;    // grr...
    ...
}
```

We can change it to

```
if (obj instanceof String s) {
    // Let pattern matching do the work!
    ...
}
```

### Why are the changes needed?

Using [JEP 394: Pattern Matching for instanceof](https://openjdk.org/jeps/394) can bring the following benefits:

1. **More concise syntax**: Pattern matching allows the desired "shape" of an object to be expressed concisely (the pattern), and for various statements and expressions to test that "shape" against their input (the matching).
2. **Safer**: The motto is: "A pattern variable is in scope where it has definitely matched". This allows for the safe reuse of pattern variables and is both intuitive and familiar, since Java developers are already used to flow sensitive analyses.
3. **Avoid explicit casts**: The use of pattern matching in instanceof should significantly reduce the overall number of explicit casts in Java programs. Type test patterns are particularly useful when writing equality methods.

### Does this PR introduce _any_ user-facing change?

'No'.

### How was this patch tested?

GA

### Was this patch authored or co-authored using generative AI tooling?

'No'.

Closes #44187 from beliefer/SPARK-46270.
Authored-by: Jiaan Geng Signed-off-by: Dongjoon Hyun --- .../apache/spark/util/kvstore/ArrayWrappers.java | 28 .../spark/util/kvstore/KVStoreSerializer.java | 4 +- .../apache/spark/util/kvstore/LevelDBTypeInfo.java | 12 ++-- .../apache/spark/util/kvstore/RocksDBTypeInfo.java | 12 ++-- .../spark/network/client/StreamInterceptor.java| 4 +- .../network/client/TransportResponseHandler.java | 21 ++ .../protocol/EncryptedMessageWithHeader.java | 14 ++-- .../spark/network/protocol/MessageWithHeader.java | 8 +-- .../spark/network/protocol/SslMessageEncoder.java | 7 +- .../spark/network/sasl/SaslClientBootstrap.java| 4 +- .../network/server/TransportChannelHandler.java| 8 +-- .../network/server/TransportRequestHandler.java| 28 .../network/ssl/ReloadingX509TrustManager.java | 4 +- .../org/apache/spark/network/ssl/SSLFactory.java | 4 +- .../org/apache/spark/network/util/NettyLogger.java | 13 ++-- .../apache/spark/network/TestManagedBuffer.java| 4 +- .../spark/network/crypto/AuthIntegrationSuite.java | 4 +- .../apache/spark/network/shuffle/ErrorHandler.java | 4 +- .../network/shuffle/ExternalBlockHandler.java | 27 +++- .../shuffle/RetryingBlockTransferorSuite.java | 8 +-- .../network/yarn/YarnShuffleServiceMetrics.java| 24 +++ .../apache/spark/util/sketch/BloomFilterImpl.java | 16 ++--- .../spark/util/sketch/CountMinSketchImpl.java | 16 ++--- .../java/org/apache/spark/util/sketch/Utils.java | 16 ++--- .../org/apache/spark/unsafe/types/UTF8String.java | 6 +- .../org/apache/spark/io/ReadAheadInputStream.java | 4 +- .../unsafe/sort/UnsafeExternalSorter.java | 4 +- .../unsafe/sort/UnsafeInMemorySorter.java | 5 +- .../org/apache/spark/launcher/LauncherServer.java | 4 +- .../expressions/SpecializedGettersReader.java | 8 +-- .../sql/catalyst/expressions/UnsafeDataUtils.java | 6 +- .../spark/sql/catalyst/expressions/UnsafeRow.java | 12 ++-- .../spark/sql/connector/read/streaming/Offset.java | 4 +- .../sql/connector/util/V2ExpressionSQLBuilder.java | 17 ++--- 
.../spark/sql/vectorized/ArrowColumnVector.java| 76 +++--- .../spark/sql/vectorized/ColumnarBatchRow.java | 8 +-- .../apache/spark/sql/vectorized/ColumnarRow.java | 4 +- .../datasources/orc/OrcAtomicColumnVector.java | 20 +++--- .../execution/datasources/orc/OrcFooterReader.java | 14 ++-- .../parquet/ParquetVectorUpdaterFactory.java | 9 ++- .../parquet/VectorizedColumnReader.java| 7 +- .../execution/vectorized/ConstantColumnVector.java | 4 +- .../execution/vectorized/MutableColumnarRow.java | 4 +- .../JavaAdvancedDataSourceV2WithV2Filter.jav
(spark) branch master updated: [SPARK-46186][CONNECT][TESTS][FOLLOWUP] Remove flakiness of `ReattachableExecuteSuite`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 982c7268f4c [SPARK-46186][CONNECT][TESTS][FOLLOWUP] Remove flakiness of `ReattachableExecuteSuite` 982c7268f4c is described below commit 982c7268f4c7ea1fa03ea679146f9e83f31bece7 Author: Juliusz Sompolski AuthorDate: Wed Dec 6 10:57:45 2023 -0800 [SPARK-46186][CONNECT][TESTS][FOLLOWUP] Remove flakiness of `ReattachableExecuteSuite` ### What changes were proposed in this pull request? The test added in https://github.com/apache/spark/pull/44095 could be flaky because `MEDIUM_RESULTS_QUERY` could very quickly finish before interrupt was sent. Replace it with a query that sleeps 30 seconds, so that we are sure that interrupt runs before it finishes. ### Why are the changes needed? Remove test flakiness. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Rerun ReattachableExecuteSuite 100+ times to check it isn't flaky. ### Was this patch authored or co-authored using generative AI tooling? Github Copilot was assisting in some boilerplate auto-completion. Generated-by: Github Copilot Closes #44189 from juliuszsompolski/SPARK-46186-followup. 
Authored-by: Juliusz Sompolski Signed-off-by: Dongjoon Hyun --- .../sql/connect/execution/ReattachableExecuteSuite.scala | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/execution/ReattachableExecuteSuite.scala b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/execution/ReattachableExecuteSuite.scala index 02b75f04495..f80229c6198 100644 --- a/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/execution/ReattachableExecuteSuite.scala +++ b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/execution/ReattachableExecuteSuite.scala @@ -298,6 +298,15 @@ class ReattachableExecuteSuite extends SparkConnectServerTest { } test("SPARK-46186 interrupt directly after query start") { +// register a sleep udf in the session +val serverSession = + SparkConnectService.getOrCreateIsolatedSession(defaultUserId, defaultSessionId).session +serverSession.udf.register( + "sleep", + ((ms: Int) => { +Thread.sleep(ms); +ms + })) // This test depends on fast timing. // If something is wrong, it can fail only from time to time. withRawBlockingStub { stub => @@ -309,12 +318,13 @@ class ReattachableExecuteSuite extends SparkConnectServerTest { .setOperationId(operationId) .build() val iter = stub.executePlan( -buildExecutePlanRequest(buildPlan(MEDIUM_RESULTS_QUERY), operationId = operationId)) +buildExecutePlanRequest(buildPlan("select sleep(3) as s"), operationId = operationId)) // wait for execute holder to exist, but the execute thread may not have started yet. 
Eventually.eventually(timeout(eventuallyTimeout)) { assert(SparkConnectService.executionManager.listExecuteHolders.length == 1) } stub.interrupt(interruptRequest) + // make sure the client gets the OPERATION_CANCELED error val e = intercept[StatusRuntimeException] { while (iter.hasNext) iter.next() }
(spark) branch master updated: [SPARK-45580][SQL] Handle case where a nested subquery becomes an existence join
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c96fef2ea55 [SPARK-45580][SQL] Handle case where a nested subquery becomes an existence join c96fef2ea55 is described below commit c96fef2ea55ee85ac66905584e9dee31471de9f1 Author: Bruce Robbins AuthorDate: Wed Dec 6 10:55:15 2023 -0800 [SPARK-45580][SQL] Handle case where a nested subquery becomes an existence join ### What changes were proposed in this pull request? In `RewritePredicateSubquery`, prune existence flags from the final join when `rewriteExistentialExpr` returns an existence join. This change prunes the flags (attributes with the name "exists") by adding a `Project` node. For example: ``` Join LeftSemi, ((a#13 = c1#15) OR exists#19) :- Join ExistenceJoin(exists#19), (a#13 = col1#17) : :- LocalRelation [a#13] : +- LocalRelation [col1#17] +- LocalRelation [c1#15] ``` becomes ``` Project [a#13] +- Join LeftSemi, ((a#13 = c1#15) OR exists#19) :- Join ExistenceJoin(exists#19), (a#13 = col1#17) : :- LocalRelation [a#13] : +- LocalRelation [col1#17] +- LocalRelation [c1#15] ``` This change always adds the `Project` node, whether `rewriteExistentialExpr` returns an existence join or not. In the case when `rewriteExistentialExpr` does not return an existence join, `RemoveNoopOperators` will remove the unneeded `Project` node. ### Why are the changes needed? 
This query returns an extraneous boolean column when run in spark-sql: ``` create or replace temp view t1(a) as values (1), (2), (3), (7); create or replace temp view t2(c1) as values (1), (2), (3); create or replace temp view t3(col1) as values (3), (9); select * from t1 where exists ( select c1 from t2 where a = c1 or a in (select col1 from t3) ); 1 false 2 false 3 true ``` (Note: the above query will not have the extraneous boolean column when run from the Dataset API. That is because the Dataset API truncates the rows based on the schema of the analyzed plan. The bug occurs during optimization). This query fails when run in either spark-sql or using the Dataset API: ``` select ( select * from t1 where exists ( select c1 from t2 where a = c1 or a in (select col1 from t3) ) limit 1 ) from range(1); java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis ``` ### Does this PR introduce _any_ user-facing change? No, except for the removal of the extraneous boolean flag and the fix to the error condition. ### How was this patch tested? New unit test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44193 from bersprockets/schema_change. 
Authored-by: Bruce Robbins Signed-off-by: Dongjoon Hyun --- .../spark/sql/catalyst/optimizer/subquery.scala| 9 +++-- .../scala/org/apache/spark/sql/SubquerySuite.scala | 46 ++ 2 files changed, 52 insertions(+), 3 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala index 1f1a16e9093..6ca2cb79aaf 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala @@ -132,19 +132,22 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper { withSubquery.foldLeft(newFilter) { case (p, Exists(sub, _, _, conditions, subHint)) => val (joinCond, outerPlan) = rewriteExistentialExpr(conditions, p) - buildJoin(outerPlan, rewriteDomainJoinsIfPresent(outerPlan, sub, joinCond), + val join = buildJoin(outerPlan, rewriteDomainJoinsIfPresent(outerPlan, sub, joinCond), LeftSemi, joinCond, subHint) + Project(p.output, join) case (p, Not(Exists(sub, _, _, conditions, subHint))) => val (joinCond, outerPlan) = rewriteExistentialExpr(conditions, p) - buildJoin(outerPlan, rewriteDomainJoinsIfPresent(outerPlan, sub, joinCond), + val join = buildJoin(outerPlan, rewriteDomainJoinsIfPresent(outerPlan, sub, joinCond), LeftAnti, joinCond, subHint) + Project(p.output, join) case (p, InSubquery(values, ListQuery(sub, _, _, _, conditions, subHint))) => // Deduplicate conflicting attributes if any.
(spark) branch master updated: [SPARK-46232][PYTHON][FOLLOWUP] Migrate `ValueError` into `PySparkValueError`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push: new 467df65ceba [SPARK-46232][PYTHON][FOLLOWUP] Migrate `ValueError` into `PySparkValueError`

467df65ceba is described below

commit 467df65ceba5f6a8957ca7d72f5537434bf32e81
Author: Haejoon Lee
AuthorDate: Wed Dec 6 10:51:22 2023 -0800

[SPARK-46232][PYTHON][FOLLOWUP] Migrate `ValueError` into `PySparkValueError`

### What changes were proposed in this pull request?

This PR is a follow-up to https://github.com/apache/spark/pull/44149 that addresses a case the original change missed.

### Why are the changes needed?

To improve error handling.

### Does this PR introduce _any_ user-facing change?

No API changes.

### How was this patch tested?

The existing CI should pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44202 from itholic/SPARK-46232-followup.

Authored-by: Haejoon Lee
Signed-off-by: Dongjoon Hyun

--- python/pyspark/sql/pandas/serializers.py | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/python/pyspark/sql/pandas/serializers.py b/python/pyspark/sql/pandas/serializers.py index 8b2b583ddaa..834f22c86c0 100644 --- a/python/pyspark/sql/pandas/serializers.py +++ b/python/pyspark/sql/pandas/serializers.py @@ -738,8 +738,9 @@ class CogroupPandasUDFSerializer(ArrowStreamPandasUDFSerializer): ) elif dataframes_in_group != 0: -raise ValueError( -"Invalid number of pandas.DataFrames in group {0}".format(dataframes_in_group) +raise PySparkValueError( +error_class="INVALID_NUMBER_OF_DATAFRAMES_IN_GROUP", +message_parameters={"dataframes_in_group": str(dataframes_in_group)}, )
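The migration pattern in the diff above (replace a free-form `ValueError` message with an error class plus named parameters) can be sketched in isolation. Everything below is a hypothetical stand-in so the example runs without PySpark: the `ERROR_CLASSES` table, the template wording (borrowed from the old `ValueError` text), the `str.format` placeholder style, and the `check_group` helper with its `expected=2` default are assumptions, not the library's actual definitions.

```python
# Hypothetical registry; real PySpark keeps its templates in error_classes.py.
ERROR_CLASSES = {
    # Template wording is an assumption, reused from the old ValueError message.
    "INVALID_NUMBER_OF_DATAFRAMES_IN_GROUP": (
        "Invalid number of pandas.DataFrames in group {dataframes_in_group}"
    ),
}


class PySparkValueError(ValueError):
    """Hypothetical, simplified stand-in for pyspark.errors.PySparkValueError."""

    def __init__(self, error_class, message_parameters):
        self.error_class = error_class
        self.message_parameters = message_parameters
        message = ERROR_CLASSES[error_class].format(**message_parameters)
        super().__init__(f"[{error_class}] {message}")


def check_group(dataframes_in_group, expected=2):
    """Accept the expected count (assumed 2 for a cogroup) or an empty group."""
    if dataframes_in_group in (expected, 0):
        return
    raise PySparkValueError(
        error_class="INVALID_NUMBER_OF_DATAFRAMES_IN_GROUP",
        message_parameters={"dataframes_in_group": str(dataframes_in_group)},
    )
```

Since the stand-in still subclasses `ValueError`, existing `except ValueError` handlers would keep working, while new code can branch on `error_class` instead of parsing message text.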
(spark) branch master updated: [SPARK-46283][INFRA] Remove `streaming-kinesis-asl` module from `MODULES_TO_TEST` for branch-3.x daily tests
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 809cec01205 [SPARK-46283][INFRA] Remove `streaming-kinesis-asl` module from `MODULES_TO_TEST` for branch-3.x daily tests 809cec01205 is described below commit 809cec012055d6f15987f338122d2fdb5bdd5c92 Author: yangjie01 AuthorDate: Wed Dec 6 10:50:27 2023 -0800 [SPARK-46283][INFRA] Remove `streaming-kinesis-asl` module from `MODULES_TO_TEST` for branch-3.x daily tests ### What changes were proposed in this pull request? After the merge of https://github.com/apache/spark/pull/43736, the master branch began testing the `streaming-kinesis-asl` module. At the same time, because the daily test will reuse `build_and_test.yml`, the daily test of branch-3.x also began testing `streaming-kinesis-asl`. However, in branch-3.x, the env `ENABLE_KINESIS_TESTS` is hard-coded as 1 in `dev/sparktestsupport/modules.py`: https://github.com/apache/spark/blob/1321b4e64deaa1e58bf297c25b72319083056568/dev/sparktestsupport/modules.py#L332-L346 which leads to the failure of the daily test of branch-3.x: - branch-3.3: https://github.com/apache/spark/actions/runs/7111246311 - branch-3.4: https://github.com/apache/spark/actions/runs/7098435892 - branch-3.5: https://github.com/apache/spark/actions/runs/7099811235 ``` [info] org.apache.spark.streaming.kinesis.WithoutAggregationKinesisStreamSuite *** ABORTED *** (1 second, 14 milliseconds) [info] java.lang.Exception: Kinesis tests enabled using environment variable ENABLE_KINESIS_TESTS [info] but could not find AWS credentials. Please follow instructions in AWS documentation [info] to set the credentials in your system such that the DefaultAWSCredentialsProviderChain [info] can find the credentials. 
[info] at org.apache.spark.streaming.kinesis.KinesisTestUtils$.getAWSCredentials(KinesisTestUtils.scala:258) [info] at org.apache.spark.streaming.kinesis.KinesisTestUtils.kinesisClient$lzycompute(KinesisTestUtils.scala:58) [info] at org.apache.spark.streaming.kinesis.KinesisTestUtils.kinesisClient(KinesisTestUtils.scala:57) [info] at org.apache.spark.streaming.kinesis.KinesisTestUtils.describeStream(KinesisTestUtils.scala:168) [info] at org.apache.spark.streaming.kinesis.KinesisTestUtils.findNonExistentStreamName(KinesisTestUtils.scala:181) [info] at org.apache.spark.streaming.kinesis.KinesisTestUtils.createStream(KinesisTestUtils.scala:84) [info] at org.apache.spark.streaming.kinesis.KinesisStreamTests.$anonfun$beforeAll$1(KinesisStreamSuite.scala:61) [info] at org.apache.spark.streaming.kinesis.KinesisFunSuite.runIfTestsEnabled(KinesisFunSuite.scala:41) [info] at org.apache.spark.streaming.kinesis.KinesisFunSuite.runIfTestsEnabled$(KinesisFunSuite.scala:39) [info] at org.apache.spark.streaming.kinesis.KinesisStreamTests.runIfTestsEnabled(KinesisStreamSuite.scala:42) [info] at org.apache.spark.streaming.kinesis.KinesisStreamTests.beforeAll(KinesisStreamSuite.scala:59) [info] at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) [info] at org.apache.spark.streaming.kinesis.KinesisStreamTests.org$scalatest$BeforeAndAfter$$super$run(KinesisStreamSuite.scala:42) [info] at org.scalatest.BeforeAndAfter.run(BeforeAndAfter.scala:273) [info] at org.scalatest.BeforeAndAfter.run$(BeforeAndAfter.scala:271) [info] at org.apache.spark.streaming.kinesis.KinesisStreamTests.run(KinesisStreamSuite.scala:42) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517) [info] at 
sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414) [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [info] at java.lang.Thread.run(Thread.java:750) [info] Test run org.apache.spark.streaming.kinesis.JavaKinesisInputDStreamBuilderSuite started [info] Test org.apache.spark.streaming.kinesis.JavaKinesisInputDStreamBuilderSuite.testJavaKinesisDStreamBuilderOldApi started [info] Test org.apache.spark.streaming.kinesis.JavaKinesisInputDStreamBuilderSuite.testJavaKinesisDStreamBuilder started [info] Test run org.apache.spark.streaming.kinesis.JavaKinesisInputDStreamBuilderSuite finished: 0 failed, 0
(spark) branch branch-3.3 updated: [SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new 37d10ec3644 [SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled` 37d10ec3644 is described below commit 37d10ec3644d41396cd7378fdc3fe405b680203c Author: Kent Yao AuthorDate: Wed Dec 6 10:46:31 2023 -0800 [SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled` This PR adds spark.io.compression.zstd.bufferPool.enabled to documentation - Missing docs - https://github.com/apache/spark/pull/31502#issuecomment-774792276 potential regression no doc build no Closes #44207 from yaooqinn/SPARK-46286. Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun (cherry picked from commit 6b6980de451e655ef4b9f63d502b73c09a513d4c) Signed-off-by: Dongjoon Hyun --- docs/configuration.md | 8 1 file changed, 8 insertions(+) diff --git a/docs/configuration.md b/docs/configuration.md index b96defb2adb..2a205522989 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -1545,6 +1545,14 @@ Apart from these, the following properties are also available, and may be useful 2.3.0 + + spark.io.compression.zstd.bufferPool.enabled + true + +If true, enable buffer pool of ZSTD JNI library. + + 3.2.0 + spark.kryo.classesToRegister (none)
(spark) branch branch-3.4 updated: [SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 93fef098a0d [SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled` 93fef098a0d is described below commit 93fef098a0d5d6c95205a46ebf9c959e325c9d7e Author: Kent Yao AuthorDate: Wed Dec 6 10:46:31 2023 -0800 [SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled` This PR adds spark.io.compression.zstd.bufferPool.enabled to documentation - Missing docs - https://github.com/apache/spark/pull/31502#issuecomment-774792276 potential regression no doc build no Closes #44207 from yaooqinn/SPARK-46286. Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun (cherry picked from commit 6b6980de451e655ef4b9f63d502b73c09a513d4c) Signed-off-by: Dongjoon Hyun --- docs/configuration.md | 8 1 file changed, 8 insertions(+) diff --git a/docs/configuration.md b/docs/configuration.md index 198a6dd4b2b..6bd49f398d9 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -1727,6 +1727,14 @@ Apart from these, the following properties are also available, and may be useful 2.3.0 + + spark.io.compression.zstd.bufferPool.enabled + true + +If true, enable buffer pool of ZSTD JNI library. + + 3.2.0 + spark.kryo.classesToRegister (none)
(spark) branch branch-3.5 updated: [SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new b5cbe1fcdb4 [SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled` b5cbe1fcdb4 is described below commit b5cbe1fcdb464fc064ffb5fbef3edfa408d6638f Author: Kent Yao AuthorDate: Wed Dec 6 10:46:31 2023 -0800 [SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled` This PR adds spark.io.compression.zstd.bufferPool.enabled to documentation - Missing docs - https://github.com/apache/spark/pull/31502#issuecomment-774792276 potential regression no doc build no Closes #44207 from yaooqinn/SPARK-46286. Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun (cherry picked from commit 6b6980de451e655ef4b9f63d502b73c09a513d4c) Signed-off-by: Dongjoon Hyun --- docs/configuration.md | 8 1 file changed, 8 insertions(+) diff --git a/docs/configuration.md b/docs/configuration.md index 248f9333c9a..f79406c5b6d 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -1752,6 +1752,14 @@ Apart from these, the following properties are also available, and may be useful 2.3.0 + + spark.io.compression.zstd.bufferPool.enabled + true + +If true, enable buffer pool of ZSTD JNI library. + + 3.2.0 + spark.kryo.classesToRegister (none)
(spark) branch master updated: [SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 6b6980de451 [SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled` 6b6980de451 is described below commit 6b6980de451e655ef4b9f63d502b73c09a513d4c Author: Kent Yao AuthorDate: Wed Dec 6 10:46:31 2023 -0800 [SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled` ### What changes were proposed in this pull request? This PR adds spark.io.compression.zstd.bufferPool.enabled to documentation ### Why are the changes needed? - Missing docs - https://github.com/apache/spark/pull/31502#issuecomment-774792276 potential regression ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? doc build ### Was this patch authored or co-authored using generative AI tooling? no Closes #44207 from yaooqinn/SPARK-46286. Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- docs/configuration.md | 8 1 file changed, 8 insertions(+) diff --git a/docs/configuration.md b/docs/configuration.md index 2ad07cf59f7..f261e3b2deb 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -1760,6 +1760,14 @@ Apart from these, the following properties are also available, and may be useful 2.3.0 + + spark.io.compression.zstd.bufferPool.enabled + true + +If true, enable buffer pool of ZSTD JNI library. + + 3.2.0 + spark.io.compression.zstd.workers 0 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
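As a hedged illustration of how the newly documented property might be passed to a job, the sketch below renders settings as `--conf key=value` arguments for `spark-submit`. The helper name `spark_submit_conf_args`, the application file `app.py`, and the choice to disable the pool are assumptions made for the example; only the property name and its default of `true` come from the commit above.

```python
from typing import Dict, List


def spark_submit_conf_args(settings: Dict[str, str]) -> List[str]:
    """Render settings as repeated `--conf key=value` arguments (hypothetical helper)."""
    args: List[str] = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Assumption for illustration: use the zstd codec and turn the ZSTD JNI
# buffer pool off (the documented default is true).
settings = {
    "spark.io.compression.codec": "zstd",
    "spark.io.compression.zstd.bufferPool.enabled": "false",
}

# app.py is a placeholder application name.
print(" ".join(["spark-submit", *spark_submit_conf_args(settings), "app.py"]))
```

The same key/value pairs could equally be placed in `spark-defaults.conf`; the command-line form is shown only because it is self-contained.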
(spark) branch master updated: [SPARK-46287][PYTHON][CONNECT] `DataFrame.isEmpty` should work with all datatypes
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f4e41e0e318 [SPARK-46287][PYTHON][CONNECT] `DataFrame.isEmpty` should work with all datatypes f4e41e0e318 is described below commit f4e41e0e318ea1269de5991f4635637e6e5233f3 Author: Ruifeng Zheng AuthorDate: Wed Dec 6 10:45:12 2023 -0800 [SPARK-46287][PYTHON][CONNECT] `DataFrame.isEmpty` should work with all datatypes ### What changes were proposed in this pull request? `DataFrame.isEmpty` should work with all datatypes. The schema may not be compatible with Arrow, so `collect/take` on the original columns should not be used to check `isEmpty` ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? before: ``` In [1]: spark.sql("SELECT INTERVAL '10-8' YEAR TO MONTH AS interval").isEmpty() 23/12/06 20:39:58 WARN CheckAllocator: More than one DefaultAllocationManager on classpath. Choosing first found --- / 1] KeyError Traceback (most recent call last) Cell In[1], line 1 > 1 spark.sql("SELECT INTERVAL '10-8' YEAR TO MONTH AS interval").isEmpty() File ~/Dev/spark/python/pyspark/sql/connect/dataframe.py:181, in DataFrame.isEmpty(self) 180 def isEmpty(self) -> bool: --> 181 return len(self.take(1)) == 0 ... File ~/.dev/miniconda3/envs/spark_dev_311/lib/python3.11/site-packages/pyarrow/public-api.pxi:208, in pyarrow.lib.pyarrow_wrap_array() File ~/.dev/miniconda3/envs/spark_dev_311/lib/python3.11/site-packages/pyarrow/array.pxi:3659, in pyarrow.lib.get_array_class_from_type() KeyError: 21 ``` after ``` In [1]: spark.sql("SELECT INTERVAL '10-8' YEAR TO MONTH AS interval").isEmpty() 23/12/06 20:40:26 WARN CheckAllocator: More than one DefaultAllocationManager on classpath. Choosing first found Out[1]: False ``` ### How was this patch tested? added ut ### Was this patch authored or co-authored using generative AI tooling? 
no Closes #44209 from zhengruifeng/py_connect_df_isempty. Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- python/pyspark/sql/connect/dataframe.py| 2 +- python/pyspark/sql/tests/connect/test_connect_basic.py | 5 + 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py index 6a1d4571216..66059ad96eb 100644 --- a/python/pyspark/sql/connect/dataframe.py +++ b/python/pyspark/sql/connect/dataframe.py @@ -178,7 +178,7 @@ class DataFrame: write.__doc__ = PySparkDataFrame.write.__doc__ def isEmpty(self) -> bool: -return len(self.take(1)) == 0 +return len(self.select().take(1)) == 0 isEmpty.__doc__ = PySparkDataFrame.isEmpty.__doc__ diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py b/python/pyspark/sql/tests/connect/test_connect_basic.py index fb5eaece7f4..5e0cf535391 100755 --- a/python/pyspark/sql/tests/connect/test_connect_basic.py +++ b/python/pyspark/sql/tests/connect/test_connect_basic.py @@ -2004,6 +2004,11 @@ class SparkConnectBasicTests(SparkConnectSQLTestCase): self.assertFalse(self.connect.sql("SELECT 1 AS X").isEmpty()) self.assertTrue(self.connect.sql("SELECT 1 AS X LIMIT 0").isEmpty()) +def test_is_empty_with_unsupported_types(self): +df = self.spark.sql("SELECT INTERVAL '10-8' YEAR TO MONTH AS interval") +self.assertEqual(df.count(), 1) +self.assertFalse(df.isEmpty()) + def test_session(self): self.assertEqual(self.connect, self.connect.sql("SELECT 1").sparkSession) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
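The effect of replacing `self.take(1)` with `self.select().take(1)` in the diff above can be illustrated with a toy model. This is a minimal sketch in plain Python, not PySpark: `ToyFrame`, `UNSUPPORTED`, and both `is_empty_*` helpers are hypothetical names, and the "conversion failure" is simulated by raising `KeyError` for any column of the unsupported type. The point it tries to show is that projecting zero columns keeps the row count but leaves nothing to convert.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

# Hypothetical marker for a column type the conversion step cannot handle.
UNSUPPORTED = "interval year to month"


@dataclass
class ToyFrame:
    columns: List[Tuple[str, str]]  # (name, type) pairs
    rows: List[Tuple[Any, ...]]

    def select(self, *names: str) -> "ToyFrame":
        """Project the named columns; with no names, every row becomes an empty tuple."""
        idx = [i for i, (name, _) in enumerate(self.columns) if name in names]
        return ToyFrame(
            [self.columns[i] for i in idx],
            [tuple(row[i] for i in idx) for row in self.rows],
        )

    def take(self, n: int) -> List[Tuple[Any, ...]]:
        """Return the first n rows, failing if any projected column is unsupported."""
        for name, typ in self.columns:
            if typ == UNSUPPORTED:
                raise KeyError(f"cannot convert column {name!r} of type {typ!r}")
        return self.rows[:n]


def is_empty_old(df: ToyFrame) -> bool:
    return len(df.take(1)) == 0  # converts every column


def is_empty_new(df: ToyFrame) -> bool:
    return len(df.select().take(1)) == 0  # converts no columns


df = ToyFrame([("interval", UNSUPPORTED)], [("10-8",)])
print(is_empty_new(df))
```

In this model `is_empty_old(df)` raises `KeyError`, mirroring the "before" traceback, while `is_empty_new(df)` only counts rows.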
(spark) branch master updated: [SPARK-46288][PS][TESTS] Remove unused code in `pyspark.pandas.tests.frame.*`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f6861c3918b [SPARK-46288][PS][TESTS] Remove unused code in `pyspark.pandas.tests.frame.*` f6861c3918b is described below commit f6861c3918bdedf5d8d89dbecced3317cc9dc490 Author: Ruifeng Zheng AuthorDate: Wed Dec 6 10:44:04 2023 -0800 [SPARK-46288][PS][TESTS] Remove unused code in `pyspark.pandas.tests.frame.*` ### What changes were proposed in this pull request? Remove unused code in `pyspark.pandas.tests.frame.*` ### Why are the changes needed? code clean up ### Does this PR introduce _any_ user-facing change? no, test-only ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #44212 from zhengruifeng/ps_frame_cleanup. Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- python/pyspark/pandas/tests/frame/test_conversion.py | 6 -- python/pyspark/pandas/tests/frame/test_reindexing.py | 13 - python/pyspark/pandas/tests/frame/test_spark.py | 6 -- python/pyspark/pandas/tests/frame/test_take.py| 14 -- python/pyspark/pandas/tests/frame/test_time_series.py | 13 - python/pyspark/pandas/tests/frame/test_truncate.py| 14 -- 6 files changed, 66 deletions(-) diff --git a/python/pyspark/pandas/tests/frame/test_conversion.py b/python/pyspark/pandas/tests/frame/test_conversion.py index 116a7d31c11..eefb461239e 100644 --- a/python/pyspark/pandas/tests/frame/test_conversion.py +++ b/python/pyspark/pandas/tests/frame/test_conversion.py @@ -34,12 +34,6 @@ class FrameConversionMixin: index=np.random.rand(9), ) -@property -def df_pair(self): -pdf = self.pdf -psdf = ps.from_pandas(pdf) -return pdf, psdf - def test_astype(self): psdf = self.psdf diff --git a/python/pyspark/pandas/tests/frame/test_reindexing.py b/python/pyspark/pandas/tests/frame/test_reindexing.py index 
606efd95188..b3639945391 100644 --- a/python/pyspark/pandas/tests/frame/test_reindexing.py +++ b/python/pyspark/pandas/tests/frame/test_reindexing.py @@ -30,19 +30,6 @@ from pyspark.testing.sqlutils import SQLTestUtils # This file contains test cases for 'Reindexing / Selection / Label manipulation' # https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/frame.html#reindexing-selection-label-manipulation class FrameReindexingMixin: -@property -def pdf(self): -return pd.DataFrame( -{"a": [1, 2, 3, 4, 5, 6, 7, 8, 9], "b": [4, 5, 6, 3, 2, 1, 0, 0, 0]}, -index=np.random.rand(9), -) - -@property -def df_pair(self): -pdf = self.pdf -psdf = ps.from_pandas(pdf) -return pdf, psdf - def test_add_prefix(self): pdf = pd.DataFrame({"A": [1, 2, 3, 4], "B": [3, 4, 5, 6]}, index=np.random.rand(4)) psdf = ps.from_pandas(pdf) diff --git a/python/pyspark/pandas/tests/frame/test_spark.py b/python/pyspark/pandas/tests/frame/test_spark.py index 4413279e32f..36466695c30 100644 --- a/python/pyspark/pandas/tests/frame/test_spark.py +++ b/python/pyspark/pandas/tests/frame/test_spark.py @@ -43,12 +43,6 @@ class FrameSparkMixin: index=np.random.rand(9), ) -@property -def df_pair(self): -pdf = self.pdf -psdf = ps.from_pandas(pdf) -return pdf, psdf - def test_empty_dataframe(self): pdf = pd.DataFrame({"a": pd.Series([], dtype="i1"), "b": pd.Series([], dtype="str")}) diff --git a/python/pyspark/pandas/tests/frame/test_take.py b/python/pyspark/pandas/tests/frame/test_take.py index 28d20e9bd99..3654436848b 100644 --- a/python/pyspark/pandas/tests/frame/test_take.py +++ b/python/pyspark/pandas/tests/frame/test_take.py @@ -16,7 +16,6 @@ # import unittest -import numpy as np import pandas as pd from pyspark import pandas as ps @@ -25,19 +24,6 @@ from pyspark.testing.sqlutils import SQLTestUtils class FrameTakeMixin: -@property -def pdf(self): -return pd.DataFrame( -{"a": [1, 2, 3, 4, 5, 6, 7, 8, 9], "b": [4, 5, 6, 3, 2, 1, 0, 0, 0]}, -index=np.random.rand(9), -) - -@property -def 
df_pair(self): -pdf = self.pdf -psdf = ps.from_pandas(pdf) -return pdf, psdf - def test_take(self): pdf = pd.DataFrame( {"A": range(0, 5), "B": range(10, 0, -2), "C": range(10, 5, -1)} diff --git a/python/pyspark/pandas/tests/frame/test_time_series.py b/python/pyspark/pandas/tests/frame/
(spark) branch master updated: [SPARK-46268][PS][CONNECT][TESTS] Re-organize `StatsTests`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ef05fb632424 [SPARK-46268][PS][CONNECT][TESTS] Re-organize `StatsTests` ef05fb632424 is described below commit ef05fb632424f8f121bcd4518ab3a8815c295c85 Author: Ruifeng Zheng AuthorDate: Tue Dec 5 10:19:41 2023 -0800 [SPARK-46268][PS][CONNECT][TESTS] Re-organize `StatsTests` ### What changes were proposed in this pull request? Re-organize `StatsTests` ### Why are the changes needed? break the big test file by grouping test cases by topics ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #44185 from zhengruifeng/ps_reorg_test_stats. Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- dev/sparktestsupport/modules.py| 8 +- .../pyspark/pandas/tests/computation/test_corr.py | 222 .../pandas/tests/{ => computation}/test_stats.py | 278 + .../test_parity_corr.py} | 7 +- .../connect/{ => computation}/test_parity_stats.py | 4 +- .../test_parity_axis.py} | 6 +- python/pyspark/pandas/tests/frame/test_axis.py | 135 ++ 7 files changed, 373 insertions(+), 287 deletions(-) diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py index 9bbe86baa1dc..900329d07c00 100644 --- a/dev/sparktestsupport/modules.py +++ b/dev/sparktestsupport/modules.py @@ -789,6 +789,7 @@ pyspark_pandas_slow = Module( "pyspark.pandas.tests.computation.test_binary_ops", "pyspark.pandas.tests.computation.test_combine", "pyspark.pandas.tests.computation.test_compute", +"pyspark.pandas.tests.computation.test_corr", "pyspark.pandas.tests.computation.test_corrwith", "pyspark.pandas.tests.computation.test_cov", "pyspark.pandas.tests.computation.test_cumulative", @@ -797,7 +798,9 @@ pyspark_pandas_slow = Module( 
"pyspark.pandas.tests.computation.test_melt", "pyspark.pandas.tests.computation.test_missing_data", "pyspark.pandas.tests.computation.test_pivot", +"pyspark.pandas.tests.computation.test_stats", "pyspark.pandas.tests.frame.test_attrs", +"pyspark.pandas.tests.frame.test_axis", "pyspark.pandas.tests.frame.test_constructor", "pyspark.pandas.tests.frame.test_conversion", "pyspark.pandas.tests.frame.test_reindexing", @@ -841,7 +844,6 @@ pyspark_pandas_slow = Module( "pyspark.pandas.tests.series.test_series", "pyspark.pandas.tests.series.test_sort", "pyspark.pandas.tests.series.test_stat", -"pyspark.pandas.tests.test_stats", ], excluded_python_implementations=[ "PyPy" # Skip these tests under PyPy since they require numpy, pandas, and pyarrow and @@ -1014,6 +1016,7 @@ pyspark_pandas_connect_part0 = Module( "pyspark.pandas.tests.connect.computation.test_parity_combine", "pyspark.pandas.tests.connect.computation.test_parity_compute", "pyspark.pandas.tests.connect.computation.test_parity_cov", +"pyspark.pandas.tests.connect.computation.test_parity_corr", "pyspark.pandas.tests.connect.computation.test_parity_corrwith", "pyspark.pandas.tests.connect.computation.test_parity_cumulative", "pyspark.pandas.tests.connect.computation.test_parity_describe", @@ -1021,6 +1024,7 @@ pyspark_pandas_connect_part0 = Module( "pyspark.pandas.tests.connect.computation.test_parity_melt", "pyspark.pandas.tests.connect.groupby.test_parity_stat", "pyspark.pandas.tests.connect.frame.test_parity_attrs", +"pyspark.pandas.tests.connect.frame.test_parity_axis", "pyspark.pandas.tests.connect.diff_frames_ops.test_parity_dot_frame", "pyspark.pandas.tests.connect.diff_frames_ops.test_parity_dot_series", ], @@ -1075,7 +1079,6 @@ pyspark_pandas_connect_part1 = Module( "pyspark.pandas.tests.connect.series.test_parity_stat", "pyspark.pandas.tests.connect.data_type_ops.test_parity_num_arithmetic", "pyspark.pandas.tests.connect.test_parity_reshape", -"pyspark.pandas.tests.connect.test_parity_stats", &quo
(spark) branch master updated: [SPARK-46271][PS][TESTS] Remove unused properties in `pyspark.pandas.tests.groupby.*`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a56ca8a673d7 [SPARK-46271][PS][TESTS] Remove unused properties in `pyspark.pandas.tests.groupby.*` a56ca8a673d7 is described below commit a56ca8a673d7a0ce85a5f01e51c76d8dae81804a Author: Ruifeng Zheng AuthorDate: Tue Dec 5 10:17:19 2023 -0800 [SPARK-46271][PS][TESTS] Remove unused properties in `pyspark.pandas.tests.groupby.*` ### What changes were proposed in this pull request? remove unused properties in `pyspark.pandas.tests.groupby.*` ### Why are the changes needed? just code clean up ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #44188 from zhengruifeng/ps_test_cleanup. Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- python/pyspark/pandas/tests/groupby/test_aggregate.py| 15 --- python/pyspark/pandas/tests/groupby/test_apply_func.py | 15 --- python/pyspark/pandas/tests/groupby/test_cumulative.py | 15 --- python/pyspark/pandas/tests/groupby/test_describe.py | 15 --- python/pyspark/pandas/tests/groupby/test_groupby.py | 15 --- python/pyspark/pandas/tests/groupby/test_head_tail.py| 15 --- python/pyspark/pandas/tests/groupby/test_index.py| 15 --- python/pyspark/pandas/tests/groupby/test_missing_data.py | 15 --- python/pyspark/pandas/tests/groupby/test_split_apply.py | 15 --- 9 files changed, 135 deletions(-) diff --git a/python/pyspark/pandas/tests/groupby/test_aggregate.py b/python/pyspark/pandas/tests/groupby/test_aggregate.py index 9e7f3f6cd113..f6fb37aa9824 100644 --- a/python/pyspark/pandas/tests/groupby/test_aggregate.py +++ b/python/pyspark/pandas/tests/groupby/test_aggregate.py @@ -24,21 +24,6 @@ from pyspark.testing.sqlutils import SQLTestUtils class GroupbyAggregateMixin: 
-@property -def pdf(self): -return pd.DataFrame( -{ -"A": [1, 2, 1, 2], -"B": [3.1, 4.1, 4.1, 3.1], -"C": ["a", "b", "b", "a"], -"D": [True, False, False, True], -} -) - -@property -def psdf(self): -return ps.from_pandas(self.pdf) - def test_aggregate(self): pdf = pd.DataFrame( {"A": [1, 1, 2, 2], "B": [1, 2, 3, 4], "C": [0.362, 0.227, 1.267, -0.562]} diff --git a/python/pyspark/pandas/tests/groupby/test_apply_func.py b/python/pyspark/pandas/tests/groupby/test_apply_func.py index e7a30ff57b41..d886d8799444 100644 --- a/python/pyspark/pandas/tests/groupby/test_apply_func.py +++ b/python/pyspark/pandas/tests/groupby/test_apply_func.py @@ -26,21 +26,6 @@ from pyspark.testing.sqlutils import SQLTestUtils class GroupbyApplyFuncMixin: -@property -def pdf(self): -return pd.DataFrame( -{ -"A": [1, 2, 1, 2], -"B": [3.1, 4.1, 4.1, 3.1], -"C": ["a", "b", "b", "a"], -"D": [True, False, False, True], -} -) - -@property -def psdf(self): -return ps.from_pandas(self.pdf) - def test_apply(self): pdf = pd.DataFrame( {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]}, diff --git a/python/pyspark/pandas/tests/groupby/test_cumulative.py b/python/pyspark/pandas/tests/groupby/test_cumulative.py index 64d6812c7bcf..2bcc1ade2d06 100644 --- a/python/pyspark/pandas/tests/groupby/test_cumulative.py +++ b/python/pyspark/pandas/tests/groupby/test_cumulative.py @@ -26,21 +26,6 @@ from pyspark.testing.sqlutils import SQLTestUtils class GroupbyCumulativeMixin: -@property -def pdf(self): -return pd.DataFrame( -{ -"A": [1, 2, 1, 2], -"B": [3.1, 4.1, 4.1, 3.1], -"C": ["a", "b", "b", "a"], -"D": [True, False, False, True], -} -) - -@property -def psdf(self): -return ps.from_pandas(self.pdf) - def test_cumcount(self): pdf = pd.DataFrame( { diff --git a/python/pyspark/pandas/tests/groupby/test_describe.py b/python/pyspark/pandas/tests/groupby/test_describe.py index dcd0e33a3811..ae12aae61ffc 100644 --- a/python/pyspark/pandas/tests
(spark) branch master updated: [SPARK-46266][PS][CONNECT][TESTS] Re-organize `NumOpsTests`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e9ba896f8446 [SPARK-46266][PS][CONNECT][TESTS] Re-organize `NumOpsTests` e9ba896f8446 is described below commit e9ba896f8446885d42bcb27953fe2d6050794be1 Author: Ruifeng Zheng AuthorDate: Tue Dec 5 08:53:05 2023 -0800 [SPARK-46266][PS][CONNECT][TESTS] Re-organize `NumOpsTests` ### What changes were proposed in this pull request? Re-organize `NumOpsTests`, factor out the `astype` tests ### Why are the changes needed? group the tests by topics ### Does this PR introduce _any_ user-facing change? no, test-only ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #44182 from zhengruifeng/ps_reorg_as_type. Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- dev/sparktestsupport/modules.py| 2 + .../connect/data_type_ops/test_parity_as_type.py | 43 + .../pandas/tests/data_type_ops/test_as_type.py | 103 + .../pandas/tests/data_type_ops/test_num_ops.py | 57 4 files changed, 148 insertions(+), 57 deletions(-) diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py index f35c42d11e58..9bbe86baa1dc 100644 --- a/dev/sparktestsupport/modules.py +++ b/dev/sparktestsupport/modules.py @@ -698,6 +698,7 @@ pyspark_pandas = Module( "pyspark.pandas.spark.utils", "pyspark.pandas.typedef.typehints", # unittests +"pyspark.pandas.tests.data_type_ops.test_as_type", "pyspark.pandas.tests.data_type_ops.test_base", "pyspark.pandas.tests.data_type_ops.test_binary_ops", "pyspark.pandas.tests.data_type_ops.test_boolean_ops", @@ -952,6 +953,7 @@ pyspark_pandas_connect_part0 = Module( ], python_test_goals=[ # pandas-on-Spark unittests +"pyspark.pandas.tests.connect.data_type_ops.test_parity_as_type", "pyspark.pandas.tests.connect.data_type_ops.test_parity_base", 
"pyspark.pandas.tests.connect.data_type_ops.test_parity_binary_ops", "pyspark.pandas.tests.connect.data_type_ops.test_parity_boolean_ops", diff --git a/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_as_type.py b/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_as_type.py new file mode 100644 index ..a2a9e28a5ab5 --- /dev/null +++ b/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_as_type.py @@ -0,0 +1,43 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# +import unittest + +from pyspark import pandas as ps +from pyspark.pandas.tests.data_type_ops.test_as_type import AsTypeTestsMixin +from pyspark.pandas.tests.connect.data_type_ops.testing_utils import OpsTestBase +from pyspark.testing.pandasutils import PandasOnSparkTestUtils +from pyspark.testing.connectutils import ReusedConnectTestCase + + +class AsTypeParityTests( +AsTypeTestsMixin, PandasOnSparkTestUtils, OpsTestBase, ReusedConnectTestCase +): +@property +def psdf(self): +return ps.from_pandas(self.pdf) + + +if __name__ == "__main__": +from pyspark.pandas.tests.connect.data_type_ops.test_parity_as_type import * # noqa: F401 + +try: +import xmlrunner # type: ignore[import] + +testRunner = xmlrunner.XMLTestRunner(output="target/test-reports", verbosity=2) +except ImportError: +testRunner = None +unittest.main(testRunner=testRunner, verbosity=2) diff --git a/python/pyspark/pandas/tests/data_type_ops/test_as_type.py b/python/pyspark/pandas/tests/data_type_ops/test_as_type.py new file mode 100644 index ..9d5c0d03d548 --- /dev/null +++ b/python/pyspark/pandas/tests/data_type_ops/test_as_type.py @@ -0,0 +1,103 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor
(spark) branch master updated (7c741c8c25fc -> 9808ad3ac322)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 7c741c8c25fc [SPARK-46265][CONNECT] Assertions in AddArtifact RPC make the connect client incompatible with older clusters add 9808ad3ac322 [SPARK-46259][PYTHON][DOCS] Add an appropriate link for error class usage documentation No new revisions were added by this update. Summary of changes: python/docs/source/development/contributing.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
(spark) branch master updated: [SPARK-46219][SQL] Unwrap cast in join predicates
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8235f1d56bf2 [SPARK-46219][SQL] Unwrap cast in join predicates 8235f1d56bf2 is described below commit 8235f1d56bf232bb713fe24ff6f2ffdaf49d2fcc Author: Yuming Wang AuthorDate: Tue Dec 5 08:37:34 2023 -0800 [SPARK-46219][SQL] Unwrap cast in join predicates ### What changes were proposed in this pull request? In a large data platform, it is very common to join columns of different data types. Similar to [`reorderJoinPredicates`](https://github.com/apache/spark/blob/b03afa7bde5a050eb95284b275eae0aac2257f63/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L321-L338), this PR adds a function in `EnsureRequirements` that unwraps casts in join predicates to reduce shuffle when the join keys are integral types. The key idea is that, for integral join keys, casting to either of the two key types does not affect the join result. For example, in `a.intCol = try_cast(b.bigIntCol AS int)`, if the value of `bigIntCol` exceeds the range of int, `try_cast(b.bigIntCol AS int)` is `null`, so the join condition `a.intCol = try_cast(b.bigIntCol AS int)` does not match. The result is consistent with `cast(a.intCol AS bigint) = b.bigIntCol`. ### Why are the changes needed? Reduce shuffle to improve query performance. 
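The equivalence claimed above for integral join keys can be checked on a small example. The sketch below is a plain-Python model, not Spark code: `try_cast_to_int`, `sql_equal`, `join_widened`, and `join_narrowed` are hypothetical helpers, `None` stands in for SQL `NULL`, and Python's unbounded integers stand in for `bigint`, so widening an int key is the identity.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1


def try_cast_to_int(value):
    """Model of try_cast(bigint AS int): None (NULL) when the value is out of int range."""
    if value is None:
        return None
    return value if INT_MIN <= value <= INT_MAX else None


def sql_equal(left, right):
    """Join-style equality: a NULL on either side never matches."""
    return left is not None and right is not None and left == right


def join_widened(int_keys, bigint_keys):
    """Inner join on cast(a AS bigint) = b."""
    return [(a, b) for a in int_keys for b in bigint_keys if sql_equal(a, b)]


def join_narrowed(int_keys, bigint_keys):
    """Inner join on a = try_cast(b AS int)."""
    return [(a, b) for a in int_keys for b in bigint_keys
            if sql_equal(a, try_cast_to_int(b))]


int_keys = [1, 2, INT_MAX, None]
bigint_keys = [1, INT_MAX, INT_MAX + 1, 2**40, None]

# Both formulations produce the same matches on this sample.
print(join_widened(int_keys, bigint_keys) == join_narrowed(int_keys, bigint_keys))
```

Out-of-range `bigint` values become `NULL` under the narrowing cast and can never equal an int key under the widening cast, which is why the two joins agree on this sample.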
Case 1: Shuffle before join ```sql CREATE TABLE t1(id int) USING parquet; CREATE TABLE t2(id int) USING parquet; CREATE TABLE t3(id bigint) USING parquet; SET spark.sql.autoBroadcastJoinThreshold=-1; explain SELECT * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON t1.id = t3.id; explain SELECT * FROM (SELECT *, row_number() OVER (PARTITION BY id ORDER BY id) AS rn FROM t1) t JOIN t2 ON t.id = t2.id WHERE rn = 1; ``` The plan differences after this PR: ```diff == Physical Plan == AdaptiveSparkPlan isFinalPlan=false -+- SortMergeJoin [cast(id#10 as bigint)], [id#12L], Inner - :- Sort [cast(id#10 as bigint) ASC NULLS FIRST], false, 0 - : +- Exchange hashpartitioning(cast(id#10 as bigint), 5), ENSURE_REQUIREMENTS, [plan_id=54] - : +- SortMergeJoin [id#10], [id#11], Inner - ::- Sort [id#10 ASC NULLS FIRST], false, 0 - :: +- Exchange hashpartitioning(id#10, 5), ENSURE_REQUIREMENTS, [plan_id=47] - :: +- Filter isnotnull(id#10) - ::+- FileScan parquet spark_catalog.default.t1[id#10] - :+- Sort [id#11 ASC NULLS FIRST], false, 0 - : +- Exchange hashpartitioning(id#11, 5), ENSURE_REQUIREMENTS, [plan_id=48] - : +- Filter isnotnull(id#11) - : +- FileScan parquet spark_catalog.default.t2[id#11] - +- Sort [id#12L ASC NULLS FIRST], false, 0 - +- Exchange hashpartitioning(id#12L, 5), ENSURE_REQUIREMENTS, [plan_id=55] - +- Filter isnotnull(id#12L) -+- FileScan parquet spark_catalog.default.t3[id#12L] ++- SortMergeJoin [id#20], [try_cast(id#22L as int)], Inner + :- SortMergeJoin [id#20], [id#21], Inner + : :- Sort [id#20 ASC NULLS FIRST], false, 0 + : : +- Exchange hashpartitioning(id#20, 5), ENSURE_REQUIREMENTS, [plan_id=50] + : : +- Filter isnotnull(id#20) + : :+- FileScan parquet spark_catalog.default.t1[id#20] + : +- Sort [id#21 ASC NULLS FIRST], false, 0 + : +- Exchange hashpartitioning(id#21, 5), ENSURE_REQUIREMENTS, [plan_id=51] + :+- Filter isnotnull(id#21) + : +- FileScan parquet spark_catalog.default.t2[id#21] + +- Sort [try_cast(id#22L as int) ASC NULLS FIRST], false, 0 + +- 
Exchange hashpartitioning(try_cast(id#22L as int), 5), ENSURE_REQUIREMENTS, [plan_id=58] + +- Filter isnotnull(id#22L) ++- FileScan parquet spark_catalog.default.t3[id#22L] ``` ```diff == Physical Plan == AdaptiveSparkPlan isFinalPlan=false -+- SortMergeJoin [cast(id#22 as bigint)], [id#23L], Inner - :- Sort [cast(id#22 as bigint) ASC NULLS FIRST], false, 0 - : +- Exchange hashpartitioning(cast(id#22 as bigint), 5), ENSURE_REQUIREMENTS, [plan_id=62] - : +- Filter (rn#20 = 1) - :+- Window [row_number() windowspecdefinition(id#22, id#22 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rn#20], [id#22], [id#22 ASC NULLS FIRST] - : +- WindowGroupLimit [id#22], [id#22 ASC NULLS FIRST], row_number(), 1, Final - : +- Sort [id#22 ASC NULLS FIRST, id#22 ASC NULLS FIRST], false, 0 - : +- Exchange hashpartitioning(id#22, 5
(spark) branch master updated (b4bf9ec065bf -> c8451b58a3c4)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from b4bf9ec065bf [SPARK-46269][PS] Enable more NumPy compatibility function tests add c8451b58a3c4 [SPARK-46069][SQL][FOLLOWUP] Make sure the cast expression is date type when unwrap timestamp type to date type No new revisions were added by this update. Summary of changes: .../sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-)
(spark) branch master updated: [SPARK-46269][PS] Enable more NumPy compatibility function tests
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b4bf9ec065bf [SPARK-46269][PS] Enable more NumPy compatibility function tests b4bf9ec065bf is described below commit b4bf9ec065bf3538e3780a833b6c5dec7b602cf2 Author: Haejoon Lee AuthorDate: Tue Dec 5 08:33:38 2023 -0800 [SPARK-46269][PS] Enable more NumPy compatibility function tests ### What changes were proposed in this pull request? This PR proposes to enable NumPy compatibility function tests for `arccos`, `exp`, `expm1` and `floor_divide`. ### Why are the changes needed? To improve the test coverage for NumPy compatibility function. ### Does this PR introduce _any_ user-facing change? No, this improves the test coverage. ### How was this patch tested? The existing CI should pass. For flaky tests, manually ran 50 times on local envs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44186 from itholic/enable_numpy_tests. Authored-by: Haejoon Lee Signed-off-by: Dongjoon Hyun --- python/pyspark/pandas/tests/test_numpy_compat.py | 6 +- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/python/pyspark/pandas/tests/test_numpy_compat.py b/python/pyspark/pandas/tests/test_numpy_compat.py index 986083d0d3af..931e5475c36f 100644 --- a/python/pyspark/pandas/tests/test_numpy_compat.py +++ b/python/pyspark/pandas/tests/test_numpy_compat.py @@ -26,21 +26,17 @@ from pyspark.testing.sqlutils import SQLTestUtils class NumPyCompatTestsMixin: blacklist = [ -# Koalas does not currently support +# Pandas-on-Spark does not currently support "conj", "conjugate", "isnat", "matmul", "frexp", # Values are close enough but tests failed. 
-"arccos", -"exp", -"expm1", "log", # flaky "log10", # flaky "log1p", # flaky "modf", -"floor_divide", # flaky ] @property - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46258][CORE] Add `RocksDBPersistenceEngine`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new dd11075db618 [SPARK-46258][CORE] Add `RocksDBPersistenceEngine` dd11075db618 is described below commit dd11075db61879e200b4121b83d4239954881ddd Author: Dongjoon Hyun AuthorDate: Tue Dec 5 00:49:21 2023 -0800 [SPARK-46258][CORE] Add `RocksDBPersistenceEngine` ### What changes were proposed in this pull request? This PR aims to add `RocksDBPersistenceEngine`. ### Why are the changes needed? To speed up `Spark Master` HA operations by **6.1x**. ``` OpenJDK 64-Bit Server VM 21.0.1+12-LTS on Linux 5.15.0-1051-azure AMD EPYC 7763 64-Core Processor 1000 Workers:Best Time(ms) Avg Time(ms) Relative FileSystemPersistenceEngine with JavaSerializer 1571 1616 3.6X RocksDBPersistenceEngine with JavaSerializer 257 258 22.0X ``` ### Does this PR introduce _any_ user-facing change? No. This is a new backend. ### How was this patch tested? Pass the CIs with the newly added test cases. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44173 from dongjoon-hyun/SPARK-46258. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../PersistenceEngineBenchmark-jdk21-results.txt | 28 +++--- .../PersistenceEngineBenchmark-results.txt | 28 +++--- .../org/apache/spark/deploy/master/Master.scala| 4 + .../spark/deploy/master/RecoveryModeFactory.scala | 18 .../deploy/master/RocksDBPersistenceEngine.scala | 103 + .../apache/spark/deploy/master/MasterSuite.scala | 20 .../deploy/master/PersistenceEngineBenchmark.scala | 11 +++ .../deploy/master/PersistenceEngineSuite.scala | 9 ++ 8 files changed, 195 insertions(+), 26 deletions(-) diff --git a/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt b/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt index 314fb6958b69..99035eb336a3 100644 --- a/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt +++ b/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt @@ -6,18 +6,20 @@ OpenJDK 64-Bit Server VM 21.0.1+12-LTS on Linux 5.15.0-1051-azure AMD EPYC 7763 64-Core Processor 1000 Workers: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -ZooKeeperPersistenceEngine with JavaSerializer 5402 5546 233 0.0 5402030.8 1.0X -ZooKeeperPersistenceEngine with KryoSerializer 4185 4220 32 0.0 4184623.1 1.3X -FileSystemPersistenceEngine with JavaSerializer1591 1634 37 0.0 1590836.4 3.4X -FileSystemPersistenceEngine with JavaSerializer (lz4) 611 623 14 0.0 611256.6 8.8X -FileSystemPersistenceEngine with JavaSerializer (lzf) 626 640 13 0.0 626072.2 8.6X -FileSystemPersistenceEngine with JavaSerializer (snappy)595 628 29 0.0 594744.4 9.1X -FileSystemPersistenceEngine with JavaSerializer (zstd) 755 774 21 0.0 754604.4 7.2X -FileSystemPersistenceEngine with KryoSerializer 479 489 8 0.0 479404.7 11.3X -FileSystemPersistenceEngine with KryoSerializer (lz4) 392 406 12 0.0 392165.7 13.8X -FileSystemPersistenceEngine with KryoSerializer (lzf) 525 536 14 0.0 524916.7 10.3X -FileSystemPersistenceEngine with KryoSerializer (snappy)519 533 14 0.0 518569.3 10.4X 
-FileSystemPersistenceEngine with KryoSerializer (zstd) 627 663 31 0.0 627233.2 8.6X -BlackHolePersistenceEngine0 0 0 6.0 166.0 32541.8X +ZooKeeperPersistenceEngine with JavaSerializer 5863 6053 265 0.0 5862988.1 1.0X +ZooKeeperPersistenceEngine with KryoSerializer 4553 4612
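The **6.1x** figure quoted in the SPARK-46258 commit message can be reproduced from the two rows of its benchmark excerpt. The following is a minimal sketch, assuming the speedup is defined as the ratio of best times; the variable names are illustrative and not part of the benchmark code.

```python
# Best times in milliseconds for 1000 workers, copied from the commit message.
fs_java_best_ms = 1571       # FileSystemPersistenceEngine with JavaSerializer
rocksdb_java_best_ms = 257   # RocksDBPersistenceEngine with JavaSerializer

# Speedup as a ratio of best times (assumed definition).
speedup = fs_java_best_ms / rocksdb_java_best_ms
print(f"{speedup:.1f}x")  # 6.1x

# The "Relative" column gives the same ratio, up to rounding.
relative_ratio = 22.0 / 3.6
print(f"{relative_ratio:.1f}x")  # 6.1x
```

Both ratios agree after rounding to one decimal place, which is consistent with the headline number in the commit message.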
(spark) branch master updated: [SPARK-46233][PYTHON] Migrate all remaining `AttributeError` into PySpark error framework
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new be49ca6dd71b [SPARK-46233][PYTHON] Migrate all remaining `AttributeError` into PySpark error framework be49ca6dd71b is described below commit be49ca6dd71b87172df9d88f305f06a7b87c9ecf Author: Haejoon Lee AuthorDate: Mon Dec 4 16:18:27 2023 -0800 [SPARK-46233][PYTHON] Migrate all remaining `AttributeError` into PySpark error framework ### What changes were proposed in this pull request? This PR proposes to migrate all remaining `AttributeError` from `pyspark/sql/*` into PySpark error framework, `PySparkAttributeError` with assigning dedicated error classes. ### Why are the changes needed? To improve the error handling in PySpark. ### Does this PR introduce _any_ user-facing change? No API changes, but the user-facing error messages will be improved. ### How was this patch tested? The existing CI should pass. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44150 from itholic/migrate_attribute_error. Authored-by: Haejoon Lee Signed-off-by: Dongjoon Hyun --- python/pyspark/sql/connect/dataframe.py | 10 +++--- python/pyspark/sql/dataframe.py | 11 --- python/pyspark/sql/types.py | 13 ++--- 3 files changed, 25 insertions(+), 9 deletions(-) diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py index a73a24818c0c..6a1d45712163 100644 --- a/python/pyspark/sql/connect/dataframe.py +++ b/python/pyspark/sql/connect/dataframe.py @@ -14,7 +14,11 @@ # See the License for the specific language governing permissions and # limitations under the License. 
# -from pyspark.errors.exceptions.base import SessionNotSameException, PySparkIndexError +from pyspark.errors.exceptions.base import ( +SessionNotSameException, +PySparkIndexError, +PySparkAttributeError, +) from pyspark.sql.connect.utils import check_dependencies check_dependencies(__name__) @@ -1694,8 +1698,8 @@ class DataFrame: ) if name not in self.columns: -raise AttributeError( -"'%s' object has no attribute '%s'" % (self.__class__.__name__, name) +raise PySparkAttributeError( +error_class="ATTRIBUTE_NOT_SUPPORTED", message_parameters={"attr_name": name} ) return _to_col_with_plan_id( diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py index 8b40b222a289..5211d874ba33 100644 --- a/python/pyspark/sql/dataframe.py +++ b/python/pyspark/sql/dataframe.py @@ -43,7 +43,12 @@ from py4j.java_gateway import JavaObject, JVMView from pyspark import copy_func, _NoValue from pyspark._globals import _NoValueType from pyspark.context import SparkContext -from pyspark.errors import PySparkTypeError, PySparkValueError, PySparkIndexError +from pyspark.errors import ( +PySparkTypeError, +PySparkValueError, +PySparkIndexError, +PySparkAttributeError, +) from pyspark.rdd import ( RDD, _load_from_socket, @@ -3613,8 +3618,8 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin): +---+ """ if name not in self.columns: -raise AttributeError( -"'%s' object has no attribute '%s'" % (self.__class__.__name__, name) +raise PySparkAttributeError( +error_class="ATTRIBUTE_NOT_SUPPORTED", message_parameters={"attr_name": name} ) jc = self._jdf.apply(name) return Column(jc) diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py index cbfc4ab5df02..d3eed77b3838 100644 --- a/python/pyspark/sql/types.py +++ b/python/pyspark/sql/types.py @@ -55,6 +55,7 @@ from pyspark.errors import ( PySparkTypeError, PySparkValueError, PySparkIndexError, +PySparkAttributeError, PySparkKeyError, ) @@ -2574,16 +2575,22 @@ class Row(tuple): def __getattr__(self, item: 
str) -> Any: if item.startswith("__"): -raise AttributeError(item) +raise PySparkAttributeError( +error_class="ATTRIBUTE_NOT_SUPPORTED", message_parameters={"attr_name": item} +) try: # it will be slow when it has many fields, # but this will not be used in normal cases idx = self.__fields__.index(item) return self[idx] except IndexError: -raise AttributeError(item) +raise PySparkAttributeError( +error_class="ATTRIBUTE_NOT_SUPPORTED", me
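The SPARK-46233 migration above replaces bare `AttributeError` with an exception that carries an error class and message parameters. The sketch below shows why such a change can stay backward compatible, assuming the framework exception also derives from the built-in `AttributeError`. `SketchAttributeError` and `SketchRow` are hypothetical stand-ins written for this note; they are not PySpark classes and do not reproduce PySpark's message formatting.

```python
class SketchAttributeError(AttributeError):
    """Hypothetical stand-in for an AttributeError that carries an error class."""

    def __init__(self, error_class, message_parameters):
        self.error_class = error_class
        self.message_parameters = message_parameters
        super().__init__(f"[{error_class}] {message_parameters}")


class SketchRow(tuple):
    """Hypothetical, simplified row type with named fields."""

    def __new__(cls, **fields):
        row = super().__new__(cls, fields.values())
        row._names = list(fields)
        return row

    def __getattr__(self, item):
        # Only called when normal attribute lookup fails.
        if item.startswith("__"):
            raise SketchAttributeError("ATTRIBUTE_NOT_SUPPORTED", {"attr_name": item})
        # Read the field names through __dict__ to avoid recursing into __getattr__.
        names = self.__dict__.get("_names", [])
        try:
            return self[names.index(item)]
        except ValueError:
            raise SketchAttributeError(
                "ATTRIBUTE_NOT_SUPPORTED", {"attr_name": item}
            ) from None


row = SketchRow(name="Alice", age=11)
print(row.name)                             # Alice
print(getattr(row, "missing", "fallback"))  # fallback

try:
    row.missing
except AttributeError as e:  # existing handlers for AttributeError still match
    print(e.error_class)     # ATTRIBUTE_NOT_SUPPORTED
```

Because the stand-in subclasses `AttributeError`, `getattr` with a default and `hasattr` keep working, while callers that want the structured fields can read `error_class` and `message_parameters`.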
(spark) branch branch-3.3 updated: [SPARK-46239][CORE] Hide `Jetty` info
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new aaec17fb244c [SPARK-46239][CORE] Hide `Jetty` info aaec17fb244c is described below commit aaec17fb244c175068f4de52e1288acc6125c5e9 Author: Dongjoon Hyun AuthorDate: Mon Dec 4 14:41:27 2023 -0800 [SPARK-46239][CORE] Hide `Jetty` info **What changes were proposed in this pull request?** The PR sets parameters to hide the version of jetty in spark. **Why are the changes needed?** It can avoid obtaining remote WWW service information through HTTP. **Does this PR introduce any user-facing change?** No **How was this patch tested?** Manual review **Was this patch authored or co-authored using generative AI tooling?** No Closes #44158 from chenyu-opensource/branch-SPARK-46239. Lead-authored-by: Dongjoon Hyun Co-authored-by: chenyu <119398199+chenyu-opensou...@users.noreply.github.com> Signed-off-by: Dongjoon Hyun (cherry picked from commit ff4f59341215b7f3a87e6cd8798d49e25562fcd6) Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/ui/JettyUtils.scala | 6 ++ 1 file changed, 6 insertions(+) diff --git a/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala b/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala index 834e4dfc4841..44bbd95fad13 100644 --- a/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala +++ b/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala @@ -311,6 +311,12 @@ private[spark] object JettyUtils extends Logging { logDebug(s"Using requestHeaderSize: $requestHeaderSize") httpConfig.setRequestHeaderSize(requestHeaderSize) + // Hide information. + logDebug("Using setSendServerVersion: false") + httpConfig.setSendServerVersion(false) + logDebug("Using setSendXPoweredBy: false") + httpConfig.setSendXPoweredBy(false) + // If SSL is configured, create the secure connector first. 
val securePort = sslOptions.createJettySslContextFactory().map { factory => val securePort = sslOptions.port.getOrElse(if (port > 0) Utils.userPort(port, 400) else 0)
(spark) branch branch-3.4 updated: [SPARK-46239][CORE] Hide `Jetty` info
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 757c3a9d62b7 [SPARK-46239][CORE] Hide `Jetty` info 757c3a9d62b7 is described below commit 757c3a9d62b7519f5bdc50d09e472b0490b6bae8 Author: Dongjoon Hyun AuthorDate: Mon Dec 4 14:41:27 2023 -0800 [SPARK-46239][CORE] Hide `Jetty` info **What changes were proposed in this pull request?** The PR sets parameters to hide the version of jetty in spark. **Why are the changes needed?** It can avoid obtaining remote WWW service information through HTTP. **Does this PR introduce any user-facing change?** No **How was this patch tested?** Manual review **Was this patch authored or co-authored using generative AI tooling?** No Closes #44158 from chenyu-opensource/branch-SPARK-46239. Lead-authored-by: Dongjoon Hyun Co-authored-by: chenyu <119398199+chenyu-opensou...@users.noreply.github.com> Signed-off-by: Dongjoon Hyun (cherry picked from commit ff4f59341215b7f3a87e6cd8798d49e25562fcd6) Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/ui/JettyUtils.scala | 6 ++ 1 file changed, 6 insertions(+) diff --git a/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala b/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala index d8119fb94984..2407152a5498 100644 --- a/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala +++ b/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala @@ -312,6 +312,12 @@ private[spark] object JettyUtils extends Logging { logDebug(s"Using requestHeaderSize: $requestHeaderSize") httpConfig.setRequestHeaderSize(requestHeaderSize) + // Hide information. + logDebug("Using setSendServerVersion: false") + httpConfig.setSendServerVersion(false) + logDebug("Using setSendXPoweredBy: false") + httpConfig.setSendXPoweredBy(false) + // If SSL is configured, create the secure connector first. 
val securePort = sslOptions.createJettySslContextFactory().map { factory => val securePort = sslOptions.port.getOrElse(if (port > 0) Utils.userPort(port, 400) else 0)
(spark) branch branch-3.5 updated: [SPARK-46239][CORE] Hide `Jetty` info
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 1321b4e64dea [SPARK-46239][CORE] Hide `Jetty` info 1321b4e64dea is described below commit 1321b4e64deaa1e58bf297c25b72319083056568 Author: Dongjoon Hyun AuthorDate: Mon Dec 4 14:41:27 2023 -0800 [SPARK-46239][CORE] Hide `Jetty` info **What changes were proposed in this pull request?** The PR sets parameters to hide the version of jetty in spark. **Why are the changes needed?** It can avoid obtaining remote WWW service information through HTTP. **Does this PR introduce any user-facing change?** No **How was this patch tested?** Manual review **Was this patch authored or co-authored using generative AI tooling?** No Closes #44158 from chenyu-opensource/branch-SPARK-46239. Lead-authored-by: Dongjoon Hyun Co-authored-by: chenyu <119398199+chenyu-opensou...@users.noreply.github.com> Signed-off-by: Dongjoon Hyun (cherry picked from commit ff4f59341215b7f3a87e6cd8798d49e25562fcd6) Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/ui/JettyUtils.scala | 6 ++ 1 file changed, 6 insertions(+) diff --git a/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala b/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala index 9582bdbf5264..21753361e627 100644 --- a/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala +++ b/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala @@ -312,6 +312,12 @@ private[spark] object JettyUtils extends Logging { logDebug(s"Using requestHeaderSize: $requestHeaderSize") httpConfig.setRequestHeaderSize(requestHeaderSize) + // Hide information. + logDebug("Using setSendServerVersion: false") + httpConfig.setSendServerVersion(false) + logDebug("Using setSendXPoweredBy: false") + httpConfig.setSendXPoweredBy(false) + // If SSL is configured, create the secure connector first. 
val securePort = sslOptions.createJettySslContextFactory().map { factory => val securePort = sslOptions.port.getOrElse(if (port > 0) Utils.userPort(port, 400) else 0)
(spark) branch master updated: [SPARK-46239][CORE] Hide `Jetty` info
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ff4f59341215 [SPARK-46239][CORE] Hide `Jetty` info ff4f59341215 is described below commit ff4f59341215b7f3a87e6cd8798d49e25562fcd6 Author: Dongjoon Hyun AuthorDate: Mon Dec 4 14:41:27 2023 -0800 [SPARK-46239][CORE] Hide `Jetty` info **What changes were proposed in this pull request?** The PR sets parameters to hide the version of jetty in spark. **Why are the changes needed?** It can avoid obtaining remote WWW service information through HTTP. **Does this PR introduce any user-facing change?** No **How was this patch tested?** Manual review **Was this patch authored or co-authored using generative AI tooling?** No Closes #44158 from chenyu-opensource/branch-SPARK-46239. Lead-authored-by: Dongjoon Hyun Co-authored-by: chenyu <119398199+chenyu-opensou...@users.noreply.github.com> Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/ui/JettyUtils.scala | 6 ++ 1 file changed, 6 insertions(+) diff --git a/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala b/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala index 22adcbc32ed8..50251975d733 100644 --- a/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala +++ b/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala @@ -314,6 +314,12 @@ private[spark] object JettyUtils extends Logging { logDebug(s"Using requestHeaderSize: $requestHeaderSize") httpConfig.setRequestHeaderSize(requestHeaderSize) + // Hide information. + logDebug("Using setSendServerVersion: false") + httpConfig.setSendServerVersion(false) + logDebug("Using setSendXPoweredBy: false") + httpConfig.setSendXPoweredBy(false) + // If SSL is configured, create the secure connector first. 
val securePort = sslOptions.createJettySslContextFactory().map { factory => val securePort = sslOptions.port.getOrElse(if (port > 0) Utils.userPort(port, 400) else 0)
(spark) branch branch-3.3 updated: [SPARK-46092][SQL][3.3] Don't push down Parquet row group filters that overflow
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new c9412307394f [SPARK-46092][SQL][3.3] Don't push down Parquet row group filters that overflow c9412307394f is described below commit c9412307394fd1a277dd7fd5b173ec34e4b123d6 Author: Johan Lasperas AuthorDate: Mon Dec 4 12:50:57 2023 -0800 [SPARK-46092][SQL][3.3] Don't push down Parquet row group filters that overflow

This is a cherry-pick from https://github.com/apache/spark/pull/44006 to Spark 3.3.

### What changes were proposed in this pull request?
This change adds a check for overflows when creating Parquet row group filters on an INT32 (byte/short/int) parquet type to avoid incorrectly skipping row groups if the predicate value doesn't fit in an INT. This can happen if the read schema is specified as LONG, e.g. via `.schema("col LONG")`. While the Parquet readers don't support reading INT32 into a LONG, the overflow can lead to row groups being incorrectly skipped, bypassing the reader altogether and producing incorrect results instead of failing.

### Why are the changes needed?
Reading a parquet file containing INT32 values with a read schema specified as LONG can produce incorrect results today:
```
Seq(0).toDF("a").write.parquet(path)
spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect()
```
will return an empty result. The correct result is either:
- Failing the query if the parquet reader doesn't support upcasting integers to longs (all parquet readers in Spark today)
- Returning `[0]` if the parquet reader supports that upcast (no readers in Spark as of now, but I'm looking into adding this capability).

### Does this PR introduce _any_ user-facing change?
The following: ``` Seq(0).toDF("a").write.parquet(path) spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect() ``` produces an (incorrect) empty result before this change. After this change, the read will fail, raising an error about the unsupported conversion from INT to LONG in the parquet reader. ### How was this patch tested? - Added tests to `ParquetFilterSuite` to ensure that no row group filter is created when the predicate value overflows or when the value type isn't compatible with the parquet type - Added test to `ParquetQuerySuite` covering the correctness issue described above. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44156 from johanl-db/SPARK-46092-row-group-skipping-overflow-3.3. Authored-by: Johan Lasperas Signed-off-by: Dongjoon Hyun --- .../datasources/parquet/ParquetFilters.scala | 10 ++- .../datasources/parquet/ParquetFilterSuite.scala | 71 ++ .../datasources/parquet/ParquetQuerySuite.scala| 20 ++ 3 files changed, 99 insertions(+), 2 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala index 210f37d473ad..969fbab746ad 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala @@ -17,7 +17,7 @@ package org.apache.spark.sql.execution.datasources.parquet -import java.lang.{Boolean => JBoolean, Double => JDouble, Float => JFloat, Long => JLong} +import java.lang.{Boolean => JBoolean, Byte => JByte, Double => JDouble, Float => JFloat, Long => JLong, Short => JShort} import java.math.{BigDecimal => JBigDecimal} import java.nio.charset.StandardCharsets.UTF_8 import java.sql.{Date, Timestamp} @@ -600,7 +600,13 @@ class ParquetFilters( value == null || 
(nameToParquetField(name).fieldType match { case ParquetBooleanType => value.isInstanceOf[JBoolean] case ParquetIntegerType if value.isInstanceOf[Period] => true - case ParquetByteType | ParquetShortType | ParquetIntegerType => value.isInstanceOf[Number] + case ParquetByteType | ParquetShortType | ParquetIntegerType => value match { +// Byte/Short/Int are all stored as INT32 in Parquet so filters are built using type Int. +// We don't create a filter if the value would overflow. +case _: JByte | _: JShort | _: Integer => true +case v: JLong => v.longValue() >= Int.MinValue && v.longValue() <= Int.MaxValue +case _ => false + } case ParquetLongType =>
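The overflow guard in the SPARK-46092 diff above can be sketched in a few lines. This is an illustration only, assuming the pre-fix filter narrowed the predicate value the way a JVM long-to-int cast does; `can_push_down_int32_filter` and `to_int32_wrapping` are hypothetical helper names, not functions in `ParquetFilters`.

```python
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1
LONG_MAX = 2**63 - 1


def can_push_down_int32_filter(value):
    """Return True only if a filter on an INT32 column may be built for value."""
    if value is None:
        # Mirrors the `value == null ||` short-circuit in the diff.
        return True
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    # Do not create a filter if the value would overflow a 32-bit integer.
    return INT32_MIN <= value <= INT32_MAX


def to_int32_wrapping(value):
    """Simulate a two's-complement narrowing from 64 bits to 32 bits."""
    return (value + 2**31) % 2**32 - 2**31


# Narrowing Long.MaxValue wraps to -1, so `a < Long.MaxValue` would turn
# into `a < -1`, and a row group whose minimum is 0 would be skipped.
print(to_int32_wrapping(LONG_MAX))           # -1
print(can_push_down_int32_filter(LONG_MAX))  # False
print(can_push_down_int32_filter(0))         # True
```

With the guard in place, no row group filter is built for the out-of-range value, so the row group reaches the reader, which can then report the unsupported INT to LONG conversion described in the commit message.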
(spark) branch master updated: [SPARK-46245][CORE][SQL][SS][YARN][K8S][UI] Replcace `s.c.MapOps.view.filterKeys` with `s.c.MapOps.filter`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ac0bd2eb7b40 [SPARK-46245][CORE][SQL][SS][YARN][K8S][UI] Replcace `s.c.MapOps.view.filterKeys` with `s.c.MapOps.filter` ac0bd2eb7b40 is described below commit ac0bd2eb7b4089096f9fb288482b2f1b5049b7e2 Author: yangjie01 AuthorDate: Mon Dec 4 12:49:52 2023 -0800 [SPARK-46245][CORE][SQL][SS][YARN][K8S][UI] Replcace `s.c.MapOps.view.filterKeys` with `s.c.MapOps.filter`

### What changes were proposed in this pull request?
This PR uses `s.c.MapOps.filter` to simplify the code pattern `s.c.MapOps.view.filterKeys`.

### Why are the changes needed?
The `s.c.MapOps.view.filterKeys` coding pattern is verbose; it can be simplified by using `s.c.MapOps.filter`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44160 from LuciferYang/SPARK-46245.
Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- .../spark/sql/kafka010/KafkaContinuousStream.scala | 2 +- .../scala/org/apache/spark/deploy/master/Master.scala | 3 +-- .../spark/deploy/rest/RestSubmissionClient.scala | 4 ++-- .../spark/executor/CoarseGrainedExecutorBackend.scala | 8 .../org/apache/spark/resource/ResourceProfile.scala| 4 ++-- .../org/apache/spark/scheduler/DAGScheduler.scala | 4 +++- .../cluster/CoarseGrainedSchedulerBackend.scala| 2 +- .../scheduler/cluster/StandaloneSchedulerBackend.scala | 4 ++-- .../spark/storage/ShuffleBlockFetcherIterator.scala| 2 +- .../main/scala/org/apache/spark/ui/PagedTable.scala| 7 +++ .../org/apache/spark/HeartbeatReceiverSuite.scala | 2 +- .../scala/org/apache/spark/SparkThrowableSuite.scala | 5 ++--- .../spark/internal/plugin/PluginContainerSuite.scala | 2 +- .../scheduler/cluster/k8s/ExecutorPodsAllocator.scala | 6 -- .../apache/spark/deploy/yarn/ExecutorRunnable.scala| 2 +- .../org/apache/spark/deploy/yarn/YarnAllocator.scala | 2 +- .../apache/spark/sql/catalyst/catalog/interface.scala | 8 +++- .../catalyst/expressions/codegen/CodeGenerator.scala | 2 +- .../catalyst/plans/logical/basicLogicalOperators.scala | 5 +++-- .../scala/org/apache/spark/sql/DataFrameWriter.scala | 2 +- .../apache/spark/sql/execution/command/tables.scala| 7 --- .../spark/sql/execution/datasources/DataSource.scala | 4 ++-- .../spark/sql/execution/datasources/FileFormat.scala | 2 +- .../sql/execution/datasources/jdbc/JDBCOptions.scala | 3 ++- .../sql/execution/datasources/v2/CacheTableExec.scala | 3 ++- .../execution/datasources/v2/DataSourceV2Utils.scala | 2 +- .../execution/datasources/v2/FileDataSourceV2.scala| 2 +- .../execution/datasources/v2/ShowCreateTableExec.scala | 18 ++ .../execution/datasources/v2/V2SessionCatalog.scala| 4 ++-- .../execution/streaming/state/RocksDBFileManager.scala | 6 +++--- .../apache/spark/sql/execution/ui/ExecutionPage.scala | 4 ++-- .../apache/spark/sql/streaming/DataStreamReader.scala | 2 +- 
.../apache/spark/sql/streaming/DataStreamWriter.scala | 2 +- .../apache/spark/sql/hive/HiveExternalCatalog.scala| 11 ++- .../apache/spark/sql/hive/HiveMetastoreCatalog.scala | 7 +++ .../apache/spark/sql/hive/execution/HiveOptions.scala | 6 +++--- .../spark/sql/hive/HiveSchemaInferenceSuite.scala | 2 +- .../org/apache/spark/sql/hive/StatisticsSuite.scala| 10 +- .../apache/spark/sql/hive/execution/HiveDDLSuite.scala | 8 .../hive/execution/command/ShowCreateTableSuite.scala | 2 +- 40 files changed, 93 insertions(+), 88 deletions(-) diff --git a/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousStream.scala b/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousStream.scala index 026c4d560722..a86acd971a1c 100644 --- a/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousStream.scala +++ b/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousStream.scala @@ -102,7 +102,7 @@ class KafkaContinuousStream( } val startOffsets = newPartitionOffsets ++ - oldStartPartitionOffsets.view.filterKeys(!deletedPartitions.contains(_)) + oldStartPartitionOffsets.filter { case (k, _) => !deletedPartitions.contains(k) } knownPartitions = startOffsets.keySet startOffsets.toSeq.map { diff --git a/core/src/main/scala/org/apache/spark/deploy/master/Master.scala b/core/src/main/scala/org/apache/spark/deploy/master/Master.scala index 0fe72e28ea5b..2e1d7b9bc
(spark) branch master updated: [SPARK-32246][BUILD][INFRA] Enable `streaming-kinesis-asl` tests in Github Action CI
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 64ec5f1017e1 [SPARK-32246][BUILD][INFRA] Enable `streaming-kinesis-asl` tests in Github Action CI 64ec5f1017e1 is described below commit 64ec5f1017e1f2ca479060ca76f18b1c4a803b81 Author: Junyu Chen AuthorDate: Mon Dec 4 12:44:11 2023 -0800 [SPARK-32246][BUILD][INFRA] Enable `streaming-kinesis-asl` tests in Github Action CI

### What changes were proposed in this pull request?
This PR attempts to set up Kinesis tests in one of the existing Github Actions. Note that there are currently 57 tests in total in the Kinesis-asl module, and this PR enables 35 of them. The remaining tests require interaction with the Amazon Kinesis service, which would incur billing costs to users. Hence they are not included in the Github Action.

### Why are the changes needed?
This addresses the comments in https://github.com/apache/spark/pull/42581#issuecomment-1685925739, a PR which attempts to upgrade the AWS SDK to v2 for the Spark Kinesis connector. Since Kinesis tests are not being run in the Github Actions, there is no automated mechanism to verify the SDK v2 upgrade changes in this module.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
1. All existing Github Actions passed.
2. All Kinesis tests passed when running locally: `export ENABLE_KINESIS_TESTS=1 && mvn test -Pkinesis-asl -pl connector/kinesis-asl`
```
Tests: succeeded 57, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
[INFO]
[INFO] BUILD SUCCESS
[INFO]
[INFO] Total time: 13:25 min
[INFO] Finished at: 2023-11-12T00:15:49+08:00
[INFO]
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43736 from junyuc25/junyuc25/kinesis-test.
Authored-by: Junyu Chen Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 2 +- dev/sparktestsupport/modules.py | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 4612b504ccdf..d58356ec1c5d 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -150,7 +150,7 @@ jobs: - >- mllib-local, mllib, graphx - >- -streaming, sql-kafka-0-10, streaming-kafka-0-10, +streaming, sql-kafka-0-10, streaming-kafka-0-10, streaming-kinesis-asl, yarn, kubernetes, hadoop-cloud, spark-ganglia-lgpl, connect, protobuf # Here, we split Hive and SQL tests into some of slow ones and the rest of them. diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py index 8995b7de0df9..15b2e8f186e5 100644 --- a/dev/sparktestsupport/modules.py +++ b/dev/sparktestsupport/modules.py @@ -359,7 +359,7 @@ streaming_kinesis_asl = Module( build_profile_flags=[ "-Pkinesis-asl", ], -environ={"ENABLE_KINESIS_TESTS": "1"}, +environ={"ENABLE_KINESIS_TESTS": "0"}, sbt_test_goals=[ "streaming-kinesis-asl/test", ],
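The SPARK-32246 change above sets `ENABLE_KINESIS_TESTS` to `"0"` for the CI module, so the tests that do not contact the Amazon Kinesis service can run while the billed ones stay skipped. The sketch below shows one way such an environment-variable gate can look, assuming service-dependent tests check whether the variable equals `"1"`. `kinesis_tests_enabled` and `SketchKinesisTests` are hypothetical names for illustration, not code from the Spark test suite.

```python
import os
import unittest


def kinesis_tests_enabled(environ=None):
    """Return True only when ENABLE_KINESIS_TESTS is set to "1" (assumed convention)."""
    env = os.environ if environ is None else environ
    return env.get("ENABLE_KINESIS_TESTS") == "1"


class SketchKinesisTests(unittest.TestCase):
    def test_offline_logic(self):
        # Needs no external service, so it always runs.
        self.assertEqual(1 + 1, 2)

    @unittest.skipUnless(kinesis_tests_enabled(), "set ENABLE_KINESIS_TESTS=1 to run")
    def test_requires_service(self):
        # Placeholder for a test that would contact a real, billed service.
        self.assertTrue(kinesis_tests_enabled())


print(kinesis_tests_enabled({"ENABLE_KINESIS_TESTS": "0"}))  # False
print(kinesis_tests_enabled({"ENABLE_KINESIS_TESTS": "1"}))  # True
```

Under this convention a CI run with the value `"0"` still executes the offline tests, and a local run with `export ENABLE_KINESIS_TESTS=1` executes everything, matching the 35 versus 57 split described in the commit message.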
(spark) branch branch-3.4 updated: [SPARK-46092][SQL][3.4] Don't push down Parquet row group filters that overflow
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 05b5c9e2e3df [SPARK-46092][SQL][3.4] Don't push down Parquet row group filters that overflow 05b5c9e2e3df is described below commit 05b5c9e2e3dfb7641e59895afc8ecb0f4f861127 Author: Johan Lasperas AuthorDate: Mon Dec 4 08:59:21 2023 -0800 [SPARK-46092][SQL][3.4] Don't push down Parquet row group filters that overflow

This is a cherry-pick from https://github.com/apache/spark/pull/44006 to Spark 3.4.

### What changes were proposed in this pull request?
This change adds a check for overflows when creating Parquet row group filters on an INT32 (byte/short/int) parquet type to avoid incorrectly skipping row groups if the predicate value doesn't fit in an INT. This can happen if the read schema is specified as LONG, e.g. via `.schema("col LONG")`. While the Parquet readers don't support reading INT32 into a LONG, the overflow can lead to row groups being incorrectly skipped, bypassing the reader altogether and producing incorrect results instead of failing.

### Why are the changes needed?
Reading a parquet file containing INT32 values with a read schema specified as LONG can produce incorrect results today:
```
Seq(0).toDF("a").write.parquet(path)
spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect()
```
will return an empty result. The correct result is either:
- Failing the query if the parquet reader doesn't support upcasting integers to longs (all parquet readers in Spark today)
- Returning `[0]` if the parquet reader supports that upcast (no readers in Spark as of now, but I'm looking into adding this capability).

### Does this PR introduce _any_ user-facing change?
The following: ``` Seq(0).toDF("a").write.parquet(path) spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect() ``` produces an (incorrect) empty result before this change. After this change, the read will fail, raising an error about the unsupported conversion from INT to LONG in the parquet reader. ### How was this patch tested? - Added tests to `ParquetFilterSuite` to ensure that no row group filter is created when the predicate value overflows or when the value type isn't compatible with the parquet type - Added test to `ParquetQuerySuite` covering the correctness issue described above. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44155 from johanl-db/SPARK-46092-row-group-skipping-overflow-3.4. Authored-by: Johan Lasperas Signed-off-by: Dongjoon Hyun --- .../datasources/parquet/ParquetFilters.scala | 10 ++- .../datasources/parquet/ParquetFilterSuite.scala | 71 ++ .../datasources/parquet/ParquetQuerySuite.scala| 20 ++ 3 files changed, 99 insertions(+), 2 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala index 6994e1ba39d9..5943dbdfb786 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala @@ -17,7 +17,7 @@ package org.apache.spark.sql.execution.datasources.parquet -import java.lang.{Boolean => JBoolean, Double => JDouble, Float => JFloat, Long => JLong} +import java.lang.{Boolean => JBoolean, Byte => JByte, Double => JDouble, Float => JFloat, Long => JLong, Short => JShort} import java.math.{BigDecimal => JBigDecimal} import java.nio.charset.StandardCharsets.UTF_8 import java.sql.{Date, Timestamp} @@ -612,7 +612,13 @@ class ParquetFilters( value == null || 
(nameToParquetField(name).fieldType match { case ParquetBooleanType => value.isInstanceOf[JBoolean] case ParquetIntegerType if value.isInstanceOf[Period] => true - case ParquetByteType | ParquetShortType | ParquetIntegerType => value.isInstanceOf[Number] + case ParquetByteType | ParquetShortType | ParquetIntegerType => value match { +// Byte/Short/Int are all stored as INT32 in Parquet so filters are built using type Int. +// We don't create a filter if the value would overflow. +case _: JByte | _: JShort | _: Integer => true +case v: JLong => v.longValue() >= Int.MinValue && v.longValue() <= Int.MaxValue +case _ => false + } case ParquetLongType =>
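The overflow check in the diff above can be sketched outside Spark. The following is a minimal Python illustration, not the Spark implementation: `to_int32` and `can_push_down_int32_filter` are hypothetical names, and the sketch assumes the pre-fix behaviour amounted to narrowing the predicate value to 32 bits. Python has a single integer type, so the Byte/Short/Int/Long distinction made in the Scala code collapses into a plain range test here.

```python
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def to_int32(value):
    """Narrow an integer to 32 bits with wrap-around (assumed pre-fix behaviour)."""
    return ((value + 2**31) % 2**32) - 2**31


def can_push_down_int32_filter(value):
    """Return True if a row group filter on an INT32 column may be built for value."""
    if value is None:
        # Mirrors the `value == null ||` guard in the diff.
        return True
    # bool is a subclass of int in Python; treat it as an incompatible type.
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    # Only build the filter when the value fits in INT32.
    return INT32_MIN <= value <= INT32_MAX


long_max = 2**63 - 1
# Narrowed to 32 bits, the predicate `a < Long.MaxValue` becomes `a < -1`,
# which would let a row group whose minimum is 0 be skipped.
print(to_int32(long_max))
print(can_push_down_int32_filter(long_max))
print(can_push_down_int32_filter(0))
```

With the check in place no filter is created for the overflowing value, so the row group is no longer skipped and the read reaches the reader, which then reports the unsupported INT to LONG conversion.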
(spark) branch branch-3.5 updated: [SPARK-46092][SQL][3.5] Don't push down Parquet row group filters that overflow
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 97472c91ed56 [SPARK-46092][SQL][3.5] Don't push down Parquet row group filters that overflow 97472c91ed56 is described below commit 97472c91ed5660c5af862e8da99d44a1c24f2815 Author: Johan Lasperas AuthorDate: Mon Dec 4 08:58:03 2023 -0800 [SPARK-46092][SQL][3.5] Don't push down Parquet row group filters that overflow This is a cherry-pick from https://github.com/apache/spark/pull/44006 to spark 3.5 ### What changes were proposed in this pull request? This change adds a check for overflows when creating Parquet row group filters on an INT32 (byte/short/int) parquet type to avoid incorrectly skipping row groups if the predicate value doesn't fit in an INT. This can happen if the read schema is specified as LONG, e.g via `.schema("col LONG")` While the Parquet readers don't support reading INT32 into a LONG, the overflow can lead to row groups being incorrectly skipped, bypassing the reader altogether and producing incorrect results instead of failing. ### Why are the changes needed? Reading a parquet file containing INT32 values with a read schema specified as LONG can produce incorrect results today: ``` Seq(0).toDF("a").write.parquet(path) spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect() ``` will return an empty result. The correct result is either: - Failing the query if the parquet reader doesn't support upcasting integers to longs (all parquet readers in Spark today) - Return result `[0]` if the parquet reader supports that upcast (no readers in Spark as of now, but I'm looking into adding this capability). ### Does this PR introduce _any_ user-facing change? 
The following: ``` Seq(0).toDF("a").write.parquet(path) spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect() ``` produces an (incorrect) empty result before this change. After this change, the read will fail, raising an error about the unsupported conversion from INT to LONG in the parquet reader. ### How was this patch tested? - Added tests to `ParquetFilterSuite` to ensure that no row group filter is created when the predicate value overflows or when the value type isn't compatible with the parquet type - Added test to `ParquetQuerySuite` covering the correctness issue described above. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44154 from johanl-db/SPARK-46092-row-group-skipping-overflow-3.5. Authored-by: Johan Lasperas Signed-off-by: Dongjoon Hyun --- .../datasources/parquet/ParquetFilters.scala | 10 ++- .../datasources/parquet/ParquetFilterSuite.scala | 71 ++ .../datasources/parquet/ParquetQuerySuite.scala| 20 ++ 3 files changed, 99 insertions(+), 2 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala index 5899b6621ad8..0983841dc8c2 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala @@ -17,7 +17,7 @@ package org.apache.spark.sql.execution.datasources.parquet -import java.lang.{Boolean => JBoolean, Double => JDouble, Float => JFloat, Long => JLong} +import java.lang.{Boolean => JBoolean, Byte => JByte, Double => JDouble, Float => JFloat, Long => JLong, Short => JShort} import java.math.{BigDecimal => JBigDecimal} import java.nio.charset.StandardCharsets.UTF_8 import java.sql.{Date, Timestamp} @@ -612,7 +612,13 @@ class ParquetFilters( value == null || 
(nameToParquetField(name).fieldType match { case ParquetBooleanType => value.isInstanceOf[JBoolean] case ParquetIntegerType if value.isInstanceOf[Period] => true - case ParquetByteType | ParquetShortType | ParquetIntegerType => value.isInstanceOf[Number] + case ParquetByteType | ParquetShortType | ParquetIntegerType => value match { +// Byte/Short/Int are all stored as INT32 in Parquet so filters are built using type Int. +// We don't create a filter if the value would overflow. +case _: JByte | _: JShort | _: Integer => true +case v: JLong => v.longValue() >= Int.MinValue && v.longValue() <= Int.MaxValue +case _ => false + } case ParquetLongType =>
(spark) branch master updated: [SPARK-46231][PYTHON] Migrate all remaining `NotImplementedError` & `TypeError` into PySpark error framework
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 9666bf37958e [SPARK-46231][PYTHON] Migrate all remaining `NotImplementedError` & `TypeError` into PySpark error framework 9666bf37958e is described below commit 9666bf37958e5381278ca622bf7ec4b4ccb13d79 Author: Haejoon Lee AuthorDate: Mon Dec 4 08:54:50 2023 -0800 [SPARK-46231][PYTHON] Migrate all remaining `NotImplementedError` & `TypeError` into PySpark error framework ### What changes were proposed in this pull request? This PR proposes to migrate all remaining `NotImplementedError` and `TypeError` raises in `pyspark/sql/*` to the PySpark error framework (for example `PySparkNotImplementedError`), assigning each a dedicated error class. ### Why are the changes needed? To improve the error handling in PySpark. ### Does this PR introduce _any_ user-facing change? No API changes, but the user-facing error messages will be improved. ### How was this patch tested? The existing CI should pass. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44148 from itholic/not_impl_and_type. 
Authored-by: Haejoon Lee Signed-off-by: Dongjoon Hyun --- python/pyspark/sql/datasource.py| 21 + .../sql/tests/pandas/test_pandas_udf_grouped_agg.py | 3 +++ python/pyspark/sql/udf.py | 8 +++- 3 files changed, 27 insertions(+), 5 deletions(-) diff --git a/python/pyspark/sql/datasource.py b/python/pyspark/sql/datasource.py index 1c5b6d663285..4713ca5366a7 100644 --- a/python/pyspark/sql/datasource.py +++ b/python/pyspark/sql/datasource.py @@ -19,6 +19,7 @@ from typing import final, Any, Dict, Iterator, List, Sequence, Tuple, Type, Unio from pyspark.sql import Row from pyspark.sql.types import StructType +from pyspark.errors import PySparkNotImplementedError if TYPE_CHECKING: from pyspark.sql._typing import OptionalPrimitiveType @@ -103,7 +104,10 @@ class DataSource(ABC): >>> def schema(self): ... return StructType().add("a", "int").add("b", "string") """ -raise NotImplementedError +raise PySparkNotImplementedError( +error_class="NOT_IMPLEMENTED", +message_parameters={"feature": "schema"}, +) def reader(self, schema: StructType) -> "DataSourceReader": """ @@ -121,7 +125,10 @@ class DataSource(ABC): reader : DataSourceReader A reader instance for this data source. """ -raise NotImplementedError +raise PySparkNotImplementedError( +error_class="NOT_IMPLEMENTED", +message_parameters={"feature": "reader"}, +) def writer(self, schema: StructType, saveMode: str) -> "DataSourceWriter": """ @@ -142,7 +149,10 @@ class DataSource(ABC): writer : DataSourceWriter A writer instance for this data source. """ -raise NotImplementedError +raise PySparkNotImplementedError( +error_class="NOT_IMPLEMENTED", +message_parameters={"feature": "writer"}, +) class InputPartition: @@ -239,7 +249,10 @@ class DataSourceReader(ABC): >>> def partitions(self): ... 
return [RangeInputPartition(1, 3), RangeInputPartition(5, 10)] """ -raise NotImplementedError +raise PySparkNotImplementedError( +error_class="NOT_IMPLEMENTED", +message_parameters={"feature": "partitions"}, +) @abstractmethod def read(self, partition: InputPartition) -> Iterator[Union[Tuple, Row]]: diff --git a/python/pyspark/sql/tests/pandas/test_pandas_udf_grouped_agg.py b/python/pyspark/sql/tests/pandas/test_pandas_udf_grouped_agg.py index b500be7a9695..455bb09a7dc4 100644 --- a/python/pyspark/sql/tests/pandas/test_pandas_udf_grouped_agg.py +++ b/python/pyspark/sql/tests/pandas/test_pandas_udf_grouped_agg.py @@ -720,6 +720,9 @@ class GroupedAggPandasUDFTestsMixin: class GroupedAggPandasUDFTests(GroupedAggPandasUDFTestsMixin, ReusedSQLTestCase): +def test_unsupported_types(self): +super().test_unsupported_types() + pass diff --git a/python/pyspark/sql/udf.py b/python/pyspark/sql/udf.py index 9ffdbb218711..351bcea3f389 100644 --- a/python/pyspark/sql/udf.py +++ b/python/pyspark/sql/udf.py @@ -339,7 +339,13 @@ class UserDefinedFunction: try:
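The pattern in the diff above, raising an exception that carries an `error_class` and `message_parameters` instead of a bare `NotImplementedError`, can be illustrated with a small stand-in. This is a sketch only: `ERROR_CLASSES` and `SketchNotImplementedError` are hypothetical names, the message template is an assumption, and none of this is PySpark's own code.

```python
# Hypothetical registry standing in for an error class table; the template
# text is an assumption made for this sketch.
ERROR_CLASSES = {
    "NOT_IMPLEMENTED": "<feature> is not implemented.",
}


class SketchNotImplementedError(NotImplementedError):
    """Stand-in exception that renders its message from an error class."""

    def __init__(self, error_class, message_parameters):
        self.error_class = error_class
        self.message_parameters = message_parameters
        message = ERROR_CLASSES[error_class]
        # Substitute each <name> placeholder with its parameter value.
        for name, value in message_parameters.items():
            message = message.replace(f"<{name}>", value)
        super().__init__(f"[{error_class}] {message}")


try:
    raise SketchNotImplementedError(
        error_class="NOT_IMPLEMENTED",
        message_parameters={"feature": "schema"},
    )
except NotImplementedError as err:
    # Subclassing the built-in keeps existing handlers working in this sketch.
    print(err)
```

Keeping the class name and parameters as attributes lets tests assert on `error_class` rather than on the rendered message text.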
(spark) branch master updated: [SPARK-46237][SQL][TESTS] Make `HiveDDLSuite` independently testable
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 37d19b9ee0e4 [SPARK-46237][SQL][TESTS] Make `HiveDDLSuite` independently testable 37d19b9ee0e4 is described below commit 37d19b9ee0e4e100e37358e71d771a2e42d01d88 Author: yangjie01 AuthorDate: Mon Dec 4 08:52:23 2023 -0800 [SPARK-46237][SQL][TESTS] Make `HiveDDLSuite` independently testable ### What changes were proposed in this pull request? When I test `HiveDDLSuite` with ``` build/sbt "hive/testOnly org.apache.spark.sql.hive.execution.HiveDDLSuite" -Phive ``` This test throws an error: ``` [info] - SPARK-34261: Avoid side effect if create exists temporary function *** FAILED *** (4 milliseconds) [info] java.util.NoSuchElementException: key not found: default [info] at scala.collection.MapOps.default(Map.scala:274) [info] at scala.collection.MapOps.default$(Map.scala:273) [info] at scala.collection.AbstractMap.default(Map.scala:405) [info] at scala.collection.MapOps.apply(Map.scala:176) [info] at scala.collection.MapOps.apply$(Map.scala:175) [info] at scala.collection.AbstractMap.apply(Map.scala:405) [info] at org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$445(HiveDDLSuite.scala:3275) [info] at org.apache.spark.sql.test.SQLTestUtilsBase.withUserDefinedFunction(SQLTestUtils.scala:256) [info] at org.apache.spark.sql.test.SQLTestUtilsBase.withUserDefinedFunction$(SQLTestUtils.scala:254) [info] at org.apache.spark.sql.execution.command.DDLSuite.withUserDefinedFunction(DDLSuite.scala:326) [info] at org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$444(HiveDDLSuite.scala:3267) ``` I manually printed the content of `spark.sparkContext.addedJars`, which is an empty `Map`. 
However, when I execute ``` build/sbt "hive/testOnly org.apache.spark.sql.hive.execution.SQLQuerySuite org.apache.spark.sql.hive.execution.HiveDDLSuite" -Phive ``` All tests pass, and the content of `spark.sparkContext.addedJars` is ``` Map(default -> Map(spark://localhost:54875/jars/SPARK-21101-1.0.jar -> 1701676986594, spark://localhost:54875/jars/hive-contrib-2.3.9.jar -> 1701676944590, spark://localhost:54875/jars/TestUDTF.jar -> 1701676921340)) ``` The failure is not reproduced in the GitHub Actions run because `SQLQuerySuite` is executed before `HiveDDLSuite` there. This PR therefore switches to `.get("default").foreach(_.remove(k))`, so that the remove operation is only performed when `.get("default")` is not `None`. ### Why are the changes needed? Make `HiveDDLSuite` independently testable. ### Does this PR introduce _any_ user-facing change? No, test only. ### How was this patch tested? - Pass GitHub Actions - Manually ran `HiveDDLSuite` with this PR and all tests passed ### Was this patch authored or co-authored using generative AI tooling? No Closes #44153 from LuciferYang/HiveDDLSuite. 
Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- .../test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala index c3a528da382a..2f5d1fcbb540 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala @@ -3270,7 +3270,7 @@ class HiveDDLSuite val jarName = "TestUDTF.jar" val jar = spark.asInstanceOf[TestHiveSparkSession].getHiveFile(jarName).toURI.toString spark.sparkContext.allAddedJars.keys.find(_.contains(jarName)) -.foreach(spark.sparkContext.addedJars("default").remove) +.foreach(k => spark.sparkContext.addedJars.get("default").foreach(_.remove(k))) assert(!spark.sparkContext.listJars().exists(_.contains(jarName))) val e = intercept[AnalysisException] { sql("CREATE TEMPORARY FUNCTION f1 AS " +
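The one-line change in this commit replaces a map lookup that throws on a missing key with one that does nothing when the key is absent. A rough Python analogue is sketched below, assuming a nested dict in place of `addedJars`; `remove_jar` is a hypothetical helper, not a Spark API.

```python
def remove_jar(added_jars, jar_key, session="default"):
    """Remove jar_key from the session's jar map if that session entry exists."""
    jars = added_jars.get(session)  # None instead of KeyError when absent
    if jars is not None:
        jars.pop(jar_key, None)     # tolerate a jar that was never added


# Suite run on its own: no "default" entry yet, so nothing happens.
empty = {}
remove_jar(empty, "spark://localhost/jars/TestUDTF.jar")
print(empty)

# Suite run after another suite that already added jars.
populated = {"default": {"spark://localhost/jars/TestUDTF.jar": 1}}
remove_jar(populated, "spark://localhost/jars/TestUDTF.jar")
print(populated)
```

Indexing with `added_jars[session]` would raise `KeyError` in the first case, which corresponds to the `NoSuchElementException: key not found: default` reported in the commit message.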
(spark) branch master updated: [SPARK-46232][PYTHON] Migrate all remaining ValueError into PySpark error framework
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b23ae15da019 [SPARK-46232][PYTHON] Migrate all remaining ValueError into PySpark error framework b23ae15da019 is described below commit b23ae15da019082891d71853682329c2d24c2e9e Author: Haejoon Lee AuthorDate: Sun Dec 3 22:49:30 2023 -0800 [SPARK-46232][PYTHON] Migrate all remaining ValueError into PySpark error framework ### What changes were proposed in this pull request? This PR proposes to migrate all remaining `ValueError` from `pyspark/sql/*` into PySpark error framework, `PySparkValueError` with assigning dedicated error classes. ### Why are the changes needed? To improve the error handling in PySpark. ### Does this PR introduce _any_ user-facing change? No API changes, but the user-facing error messages will be improved. ### How was this patch tested? The existing CI should pass. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44149 from itholic/migrate_value_error. Authored-by: Haejoon Lee Signed-off-by: Dongjoon Hyun --- python/pyspark/errors/error_classes.py | 19 +-- python/pyspark/sql/pandas/serializers.py | 5 +++-- python/pyspark/sql/pandas/typehints.py | 12 +--- python/pyspark/sql/pandas/types.py | 7 +-- python/pyspark/sql/sql_formatter.py | 7 --- 5 files changed, 38 insertions(+), 12 deletions(-) diff --git a/python/pyspark/errors/error_classes.py b/python/pyspark/errors/error_classes.py index c7199ac938be..d0c0d1c115b0 100644 --- a/python/pyspark/errors/error_classes.py +++ b/python/pyspark/errors/error_classes.py @@ -287,6 +287,11 @@ ERROR_CLASSES_JSON = """ "NumPy array input should be of dimensions." ] }, + "INVALID_NUMBER_OF_DATAFRAMES_IN_GROUP" : { +"message" : [ + "Invalid number of dataframes in group ." 
+] + }, "INVALID_PANDAS_UDF" : { "message" : [ "Invalid function: " @@ -803,9 +808,9 @@ ERROR_CLASSES_JSON = """ "Expected values for ``, got ." ] }, - "TYPE_HINT_REQUIRED" : { + "TYPE_HINT_SHOULD_BE_SPECIFIED" : { "message" : [ - "A is required ." + "Type hints for should be specified; however, got ." ] }, "UDF_RETURN_TYPE" : { @@ -888,6 +893,11 @@ ERROR_CLASSES_JSON = """ "Unknown response: ." ] }, + "UNKNOWN_VALUE_FOR" : { +"message" : [ + "Unknown value for ``." +] + }, "UNSUPPORTED_DATA_TYPE" : { "message" : [ "Unsupported DataType ``." @@ -983,6 +993,11 @@ ERROR_CLASSES_JSON = """ "Value for `` only supports the 'pearson', got ''." ] }, + "VALUE_NOT_PLAIN_COLUMN_REFERENCE" : { +"message" : [ + "Value in should be a plain column reference such as `df.col` or `col('column')`." +] + }, "VALUE_NOT_POSITIVE" : { "message" : [ "Value for `` must be positive, got ''." diff --git a/python/pyspark/sql/pandas/serializers.py b/python/pyspark/sql/pandas/serializers.py index 8ffb7407714b..6c5bd826a023 100644 --- a/python/pyspark/sql/pandas/serializers.py +++ b/python/pyspark/sql/pandas/serializers.py @@ -707,8 +707,9 @@ class CogroupArrowUDFSerializer(ArrowStreamGroupUDFSerializer): yield batches1, batches2 elif dataframes_in_group != 0: -raise ValueError( -"Invalid number of dataframes in group {0}".format(dataframes_in_group) +raise PySparkValueError( +error_class="INVALID_NUMBER_OF_DATAFRAMES_IN_GROUP", +message_parameters={"dataframes_in_group": str(dataframes_in_group)}, ) diff --git a/python/pyspark/sql/pandas/typehints.py b/python/pyspark/sql/pandas/typehints.py index f0c13e66a63d..37ba02a94d58 100644 --- a/python/pyspark/sql/pandas/typehints.py +++ b/python/pyspark/sql/pandas/typehints.py @@ -18,7 +18,7 @@ from inspect import Signature from typing import Any, Callable, Dict, Optional, Union, TYPE_CHECKING from pyspark.sql.pandas.utils import require_minimum_pandas_version -from pyspark.errors import PySparkNotImplementedError +from pyspark.errors import 
PySparkNotImplementedError, PySparkValueError if TYPE_CHECKING: from pyspark.sql.pandas._typing import ( @@ -51,12 +51,18 @@ def infer_eval_type( annotati
(spark) branch branch-3.4 updated: [SPARK-46182][CORE] Track `lastTaskFinishTime` using the exact task finished event
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new b8750d5c0b41 [SPARK-46182][CORE] Track `lastTaskFinishTime` using the exact task finished event b8750d5c0b41 is described below commit b8750d5c0b416137ce802cf73dd92b0fc7ff5467 Author: Xingbo Jiang AuthorDate: Sun Dec 3 22:08:20 2023 -0800 [SPARK-46182][CORE] Track `lastTaskFinishTime` using the exact task finished event ### What changes were proposed in this pull request? We found a race condition between lastTaskRunningTime and lastShuffleMigrationTime that could cause a decommissioned executor to exit before all the shuffle blocks have been discovered. The issue could lead to an immediate task retry right after the executor exits, and thus to longer query execution time. To fix the issue, we update this timestamp (tracked as `lastTaskFinishTime` after this change) only when a task updates its status to finished through the StatusUpdate event. This is better than the previous approach (which used a thread to check the number of running tasks every second), because we now know clearly whether the shuffle block refresh happened after all tasks finished running, which resolves the race condition mentioned above. ### Why are the changes needed? To fix a race condition that could lead to shuffle data loss, and thus to longer query execution time. ### How was this patch tested? This is a very subtle race condition that is hard to cover with a unit test in the current unit test framework. We are confident the change is low risk, so it was only verified by passing all the existing tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44090 from jiangxb1987/SPARK-46182. 
Authored-by: Xingbo Jiang Signed-off-by: Dongjoon Hyun (cherry picked from commit 6f112f7b1a50a2b8a59952c69f67dd5f80ab6633) Signed-off-by: Dongjoon Hyun --- .../spark/executor/CoarseGrainedExecutorBackend.scala| 16 +++- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala index c695a9ec2851..537522326fc7 100644 --- a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala +++ b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala @@ -21,7 +21,7 @@ import java.net.URL import java.nio.ByteBuffer import java.util.Locale import java.util.concurrent.ConcurrentHashMap -import java.util.concurrent.atomic.AtomicBoolean +import java.util.concurrent.atomic.{AtomicBoolean, AtomicLong} import scala.util.{Failure, Success} import scala.util.control.NonFatal @@ -80,6 +80,10 @@ private[spark] class CoarseGrainedExecutorBackend( private var decommissioned = false + // Track the last time in ns that at least one task is running. If no task is running and all + // shuffle/RDD data migration are done, the decommissioned executor should exit. 
+ private var lastTaskFinishTime = new AtomicLong(System.nanoTime()) + override def onStart(): Unit = { if (env.conf.get(DECOMMISSION_ENABLED)) { val signal = env.conf.get(EXECUTOR_DECOMMISSION_SIGNAL) @@ -269,6 +273,7 @@ private[spark] class CoarseGrainedExecutorBackend( val msg = StatusUpdate(executorId, taskId, state, data, cpus, resources) if (TaskState.isFinished(state)) { taskResources.remove(taskId) + lastTaskFinishTime.set(System.nanoTime()) } driver match { case Some(driverRef) => driverRef.send(msg) @@ -341,7 +346,6 @@ private[spark] class CoarseGrainedExecutorBackend( val shutdownThread = new Thread("wait-for-blocks-to-migrate") { override def run(): Unit = { - var lastTaskRunningTime = System.nanoTime() val sleep_time = 1000 // 1s // This config is internal and only used by unit tests to force an executor // to hang around for longer when decommissioned. @@ -358,7 +362,7 @@ private[spark] class CoarseGrainedExecutorBackend( val (migrationTime, allBlocksMigrated) = env.blockManager.lastMigrationInfo() // We can only trust allBlocksMigrated boolean value if there were no tasks running // since the start of computing it. -if (allBlocksMigrated && (migrationTime > lastTaskRunningTime)) { +if (allBlocksMigrated && (migrationTime > lastTaskFinishTime.get())) { logInfo("No running tasks, all blocks migrated, stopping.") exitExecutor(0, ExecutorLossMessage.decommissionFinished, notifyDriver = true)
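The exit condition in this diff compares the last block-migration scan time against the time of the last finished task. A minimal Python sketch of that idea follows, assuming a lock-protected timestamp in place of `AtomicLong`; `FinishTimeTracker` and `can_exit` are hypothetical names, and the optional `now_ns` argument exists only to make the example deterministic.

```python
import threading
import time


class FinishTimeTracker:
    """Hypothetical stand-in for the executor's lastTaskFinishTime field."""

    def __init__(self, now_ns=None):
        self._lock = threading.Lock()
        self._last_finish_ns = time.monotonic_ns() if now_ns is None else now_ns

    def on_status_update(self, task_finished, now_ns=None):
        # Only a finished task moves the timestamp, as in the StatusUpdate hook.
        if task_finished:
            stamp = time.monotonic_ns() if now_ns is None else now_ns
            with self._lock:
                self._last_finish_ns = stamp

    def last_finish_ns(self):
        with self._lock:
            return self._last_finish_ns


def can_exit(all_blocks_migrated, migration_time_ns, tracker):
    """Trust the migration result only if it was computed after the last task finished."""
    return all_blocks_migrated and migration_time_ns > tracker.last_finish_ns()


tracker = FinishTimeTracker(now_ns=100)
tracker.on_status_update(task_finished=True, now_ns=200)
print(can_exit(True, 150, tracker))  # scan predates the last task finish
print(can_exit(True, 250, tracker))  # scan happened after the last task finish
```

A scan that started before the last task finished may have missed shuffle blocks written by that task, which is why the comparison has to use the exact finish event rather than a once-per-second poll.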
(spark) branch branch-3.5 updated: [SPARK-46182][CORE] Track `lastTaskFinishTime` using the exact task finished event
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 273ef5708fc3 [SPARK-46182][CORE] Track `lastTaskFinishTime` using the exact task finished event 273ef5708fc3 is described below commit 273ef5708fc33872cfe3091627617bbac8fdd56f Author: Xingbo Jiang AuthorDate: Sun Dec 3 22:08:20 2023 -0800 [SPARK-46182][CORE] Track `lastTaskFinishTime` using the exact task finished event ### What changes were proposed in this pull request? We found a race condition between lastTaskRunningTime and lastShuffleMigrationTime that could cause a decommissioned executor to exit before all the shuffle blocks have been discovered. The issue could lead to an immediate task retry right after the executor exits, and thus to longer query execution time. To fix the issue, we update this timestamp (tracked as `lastTaskFinishTime` after this change) only when a task updates its status to finished through the StatusUpdate event. This is better than the previous approach (which used a thread to check the number of running tasks every second), because we now know clearly whether the shuffle block refresh happened after all tasks finished running, which resolves the race condition mentioned above. ### Why are the changes needed? To fix a race condition that could lead to shuffle data loss, and thus to longer query execution time. ### How was this patch tested? This is a very subtle race condition that is hard to cover with a unit test in the current unit test framework. We are confident the change is low risk, so it was only verified by passing all the existing tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44090 from jiangxb1987/SPARK-46182. 
Authored-by: Xingbo Jiang Signed-off-by: Dongjoon Hyun (cherry picked from commit 6f112f7b1a50a2b8a59952c69f67dd5f80ab6633) Signed-off-by: Dongjoon Hyun --- .../spark/executor/CoarseGrainedExecutorBackend.scala| 16 +++- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala index c695a9ec2851..537522326fc7 100644 --- a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala +++ b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala @@ -21,7 +21,7 @@ import java.net.URL import java.nio.ByteBuffer import java.util.Locale import java.util.concurrent.ConcurrentHashMap -import java.util.concurrent.atomic.AtomicBoolean +import java.util.concurrent.atomic.{AtomicBoolean, AtomicLong} import scala.util.{Failure, Success} import scala.util.control.NonFatal @@ -80,6 +80,10 @@ private[spark] class CoarseGrainedExecutorBackend( private var decommissioned = false + // Track the last time in ns that at least one task is running. If no task is running and all + // shuffle/RDD data migration are done, the decommissioned executor should exit. 
+ private var lastTaskFinishTime = new AtomicLong(System.nanoTime()) + override def onStart(): Unit = { if (env.conf.get(DECOMMISSION_ENABLED)) { val signal = env.conf.get(EXECUTOR_DECOMMISSION_SIGNAL) @@ -269,6 +273,7 @@ private[spark] class CoarseGrainedExecutorBackend( val msg = StatusUpdate(executorId, taskId, state, data, cpus, resources) if (TaskState.isFinished(state)) { taskResources.remove(taskId) + lastTaskFinishTime.set(System.nanoTime()) } driver match { case Some(driverRef) => driverRef.send(msg) @@ -341,7 +346,6 @@ private[spark] class CoarseGrainedExecutorBackend( val shutdownThread = new Thread("wait-for-blocks-to-migrate") { override def run(): Unit = { - var lastTaskRunningTime = System.nanoTime() val sleep_time = 1000 // 1s // This config is internal and only used by unit tests to force an executor // to hang around for longer when decommissioned. @@ -358,7 +362,7 @@ private[spark] class CoarseGrainedExecutorBackend( val (migrationTime, allBlocksMigrated) = env.blockManager.lastMigrationInfo() // We can only trust allBlocksMigrated boolean value if there were no tasks running // since the start of computing it. -if (allBlocksMigrated && (migrationTime > lastTaskRunningTime)) { +if (allBlocksMigrated && (migrationTime > lastTaskFinishTime.get())) { logInfo("No running tasks, all blocks migrated, stopping.") exitExecutor(0, ExecutorLossMessage.decommissionFinished, notifyDriver = true)
(spark) branch master updated: [SPARK-46182][CORE] Track `lastTaskFinishTime` using the exact task finished event
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 6f112f7b1a50 [SPARK-46182][CORE] Track `lastTaskFinishTime` using the exact task finished event 6f112f7b1a50 is described below commit 6f112f7b1a50a2b8a59952c69f67dd5f80ab6633 Author: Xingbo Jiang AuthorDate: Sun Dec 3 22:08:20 2023 -0800 [SPARK-46182][CORE] Track `lastTaskFinishTime` using the exact task finished event ### What changes were proposed in this pull request? We found a race condition between lastTaskRunningTime and lastShuffleMigrationTime that could cause a decommissioned executor to exit before all the shuffle blocks have been discovered. The issue could lead to an immediate task retry right after the executor exits, and thus to longer query execution time. To fix the issue, we update this timestamp (tracked as `lastTaskFinishTime` after this change) only when a task updates its status to finished through the StatusUpdate event. This is better than the previous approach (which used a thread to check the number of running tasks every second), because we now know clearly whether the shuffle block refresh happened after all tasks finished running, which resolves the race condition mentioned above. ### Why are the changes needed? To fix a race condition that could lead to shuffle data loss, and thus to longer query execution time. ### How was this patch tested? This is a very subtle race condition that is hard to cover with a unit test in the current unit test framework. We are confident the change is low risk, so it was only verified by passing all the existing tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44090 from jiangxb1987/SPARK-46182. 
Authored-by: Xingbo Jiang Signed-off-by: Dongjoon Hyun --- .../spark/executor/CoarseGrainedExecutorBackend.scala| 16 +++- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala index f1a9aa353e76..4bf4929c1339 100644 --- a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala +++ b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala @@ -21,7 +21,7 @@ import java.net.URL import java.nio.ByteBuffer import java.util.Locale import java.util.concurrent.ConcurrentHashMap -import java.util.concurrent.atomic.AtomicBoolean +import java.util.concurrent.atomic.{AtomicBoolean, AtomicLong} import scala.util.{Failure, Success} import scala.util.control.NonFatal @@ -77,6 +77,10 @@ private[spark] class CoarseGrainedExecutorBackend( private var decommissioned = false + // Track the last time in ns that at least one task is running. If no task is running and all + // shuffle/RDD data migration are done, the decommissioned executor should exit. 
+ private var lastTaskFinishTime = new AtomicLong(System.nanoTime()) + override def onStart(): Unit = { if (env.conf.get(DECOMMISSION_ENABLED)) { val signal = env.conf.get(EXECUTOR_DECOMMISSION_SIGNAL) @@ -273,6 +277,7 @@ private[spark] class CoarseGrainedExecutorBackend( val msg = StatusUpdate(executorId, taskId, state, data, cpus, resources) if (TaskState.isFinished(state)) { taskResources.remove(taskId) + lastTaskFinishTime.set(System.nanoTime()) } driver match { case Some(driverRef) => driverRef.send(msg) @@ -345,7 +350,6 @@ private[spark] class CoarseGrainedExecutorBackend( val shutdownThread = new Thread("wait-for-blocks-to-migrate") { override def run(): Unit = { - var lastTaskRunningTime = System.nanoTime() val sleep_time = 1000 // 1s // This config is internal and only used by unit tests to force an executor // to hang around for longer when decommissioned. @@ -362,7 +366,7 @@ private[spark] class CoarseGrainedExecutorBackend( val (migrationTime, allBlocksMigrated) = env.blockManager.lastMigrationInfo() // We can only trust allBlocksMigrated boolean value if there were no tasks running // since the start of computing it. -if (allBlocksMigrated && (migrationTime > lastTaskRunningTime)) { +if (allBlocksMigrated && (migrationTime > lastTaskFinishTime.get())) { logInfo("No running tasks, all blocks migrated, stopping.") exitExecutor(0, ExecutorLossMessage.decommissionFinished, notifyDriver = true) } else { @@ -374,12 +378,6 @@ private[spark] class CoarseGrainedExecutorBackend( }
(spark) branch master updated (0c029e70706c -> 712352e37ec5)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 0c029e70706c [SPARK-46218][BUILD] Upgrade commons-cli to 1.6.0 add 712352e37ec5 [SPARK-40559][PYTHON][DOCS][FOLLOW-UP] Fix the docstring and document both applyInArrows No new revisions were added by this update. Summary of changes: python/docs/source/reference/pyspark.sql/grouping.rst | 2 ++ python/pyspark/sql/pandas/group_ops.py| 15 +++ 2 files changed, 9 insertions(+), 8 deletions(-)
(spark) branch master updated: [SPARK-46218][BUILD] Upgrade commons-cli to 1.6.0
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 0c029e70706c [SPARK-46218][BUILD] Upgrade commons-cli to 1.6.0
0c029e70706c is described below

commit 0c029e70706c7e1a4c3a7bb763dbbcb4fe1ccd9f
Author: panbingkun
AuthorDate: Sun Dec 3 21:27:21 2023 -0800

    [SPARK-46218][BUILD] Upgrade commons-cli to 1.6.0

    ### What changes were proposed in this pull request?
    This PR aims to upgrade `commons-cli` from `1.5.0` to `1.6.0`.

    ### Why are the changes needed?
    - The last upgrade occurred two years ago, https://github.com/apache/spark/pull/34707
    - The full release notes: https://commons.apache.org/proper/commons-cli/changes-report.html#a1.6.0
    - This version mainly focuses on fixing bugs:
      Fix NPE in CommandLine.resolveOption(String). Fixes [CLI-283](https://issues.apache.org/jira/browse/CLI-283).
      CommandLine.addOption(Option) should not allow a null Option. Fixes [CLI-283](https://issues.apache.org/jira/browse/CLI-283).
      CommandLine.addArgs(String) should not allow a null String. Fixes [CLI-283](https://issues.apache.org/jira/browse/CLI-283).
      NullPointerException thrown by CommandLineParser.parse(). Fixes [CLI-317](https://issues.apache.org/jira/browse/CLI-317).
      StringIndexOutOfBoundsException thrown by CommandLineParser.parse(). Fixes [CLI-313](https://issues.apache.org/jira/browse/CLI-313).

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Pass GA.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #44132 from panbingkun/SPARK-46218.

    Authored-by: panbingkun
    Signed-off-by: Dongjoon Hyun
---
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +-
 pom.xml                               | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3
index f1d675d92b6d..ebfe6acad960 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -35,7 +35,7 @@ breeze_2.13/2.1.0//breeze_2.13-2.1.0.jar
 cats-kernel_2.13/2.8.0//cats-kernel_2.13-2.8.0.jar
 chill-java/0.10.0//chill-java-0.10.0.jar
 chill_2.13/0.10.0//chill_2.13-0.10.0.jar
-commons-cli/1.5.0//commons-cli-1.5.0.jar
+commons-cli/1.6.0//commons-cli-1.6.0.jar
 commons-codec/1.16.0//commons-codec-1.16.0.jar
 commons-collections/3.2.2//commons-collections-3.2.2.jar
 commons-collections4/4.4//commons-collections4-4.4.jar
diff --git a/pom.xml b/pom.xml
index 2a259cfd322b..27ee42f103dd 100644
--- a/pom.xml
+++ b/pom.xml
@@ -220,7 +220,7 @@
 2.70.0
 3.1.0
 1.1.0
-1.5.0
+1.6.0
 1.70
 1.9.0
 4.1.100.Final
(spark) branch master updated: [SPARK-46227][SQL] Move `withSQLConf` from `SQLHelper` to `SQLConfHelper`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new aee6b1582775 [SPARK-46227][SQL] Move `withSQLConf` from `SQLHelper` to `SQLConfHelper`
aee6b1582775 is described below

commit aee6b158277537709a717223b518923431bca0a6
Author: ulysses-you
AuthorDate: Sun Dec 3 21:23:34 2023 -0800

    [SPARK-46227][SQL] Move `withSQLConf` from `SQLHelper` to `SQLConfHelper`

    ### What changes were proposed in this pull request?
    This PR moves the method `withSQLConf` from `SQLHelper` in the catalyst test module to the `SQLConfHelper` trait in the catalyst module. To make it easy to write `val x = withSQLConf {}`, this PR also changes its return type from `Unit` to a generic `T`.

    ### Why are the changes needed?
    A part of https://github.com/apache/spark/pull/44013

    ### Does this PR introduce _any_ user-facing change?
    no

    ### How was this patch tested?
    Pass CI

    ### Was this patch authored or co-authored using generative AI tooling?
    no

    Closes #44142 from ulysses-you/withSQLConf.

    Authored-by: ulysses-you
    Signed-off-by: Dongjoon Hyun
---
 .../apache/spark/sql/catalyst/SQLConfHelper.scala  | 29
 .../spark/sql/catalyst/plans/SQLHelper.scala       | 32 ++
 .../sql/internal/ExecutorSideSQLConfSuite.scala    |  2 +-
 .../org/apache/spark/sql/test/SQLTestUtils.scala   |  2 +-
 .../spark/sql/hive/execution/HiveSerDeSuite.scala  |  2 +-
 5 files changed, 34 insertions(+), 33 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SQLConfHelper.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SQLConfHelper.scala
index cee35cdb8d84..f4605b9218f0 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SQLConfHelper.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SQLConfHelper.scala
@@ -17,6 +17,7 @@
 
 package org.apache.spark.sql.catalyst
 
+import org.apache.spark.sql.AnalysisException
 import org.apache.spark.sql.internal.SQLConf
 
 /**
@@ -29,4 +30,32 @@ trait SQLConfHelper {
    * See [[SQLConf.get]] for more information.
    */
   def conf: SQLConf = SQLConf.get
+
+  /**
+   * Sets all SQL configurations specified in `pairs`, calls `f`, and then restores all SQL
+   * configurations.
+   */
+  protected def withSQLConf[T](pairs: (String, String)*)(f: => T): T = {
+    val conf = SQLConf.get
+    val (keys, values) = pairs.unzip
+    val currentValues = keys.map { key =>
+      if (conf.contains(key)) {
+        Some(conf.getConfString(key))
+      } else {
+        None
+      }
+    }
+    keys.lazyZip(values).foreach { (k, v) =>
+      if (SQLConf.isStaticConfigKey(k)) {
+        throw new AnalysisException(s"Cannot modify the value of a static config: $k")
+      }
+      conf.setConfString(k, v)
+    }
+    try f finally {
+      keys.zip(currentValues).foreach {
+        case (key, Some(value)) => conf.setConfString(key, value)
+        case (key, None) => conf.unsetConf(key)
+      }
+    }
+  }
 }
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/SQLHelper.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/SQLHelper.scala
index eb844e6f057f..92681613bd83 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/SQLHelper.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/SQLHelper.scala
@@ -23,41 +23,13 @@ import scala.util.control.NonFatal
 
 import org.scalatest.Assertions.fail
 
-import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.SQLConfHelper
 import org.apache.spark.sql.catalyst.util.DateTimeTestUtils
 import org.apache.spark.sql.catalyst.util.DateTimeUtils.getZoneId
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.util.Utils
 
-trait SQLHelper {
-
-  /**
-   * Sets all SQL configurations specified in `pairs`, calls `f`, and then restores all SQL
-   * configurations.
-   */
-  protected def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
-    val conf = SQLConf.get
-    val (keys, values) = pairs.unzip
-    val currentValues = keys.map { key =>
-      if (conf.contains(key)) {
-        Some(conf.getConfString(key))
-      } else {
-        None
-      }
-    }
-    keys.lazyZip(values).foreach { (k, v) =>
-      if (SQLConf.isStaticConfigKey(k)) {
-        throw new AnalysisException(s"Cannot modify the value of a static config: $k")
-      }
-      conf.setConfString(k, v)
-    }
-    try f finally {
-      keys.zip(currentValues).foreach {
-        case (key, Some(value)) => conf.setConfString(key, value)
-        case (key, None) => conf.unsetConf(key)
-      }
-    }
-  }
+trait SQLHelper extends SQLConf
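The set/call/restore pattern behind `withSQLConf`, and the reason a generic return type is convenient, can be sketched outside Spark. This is a minimal Python illustration only, not Spark code: `with_conf` is a hypothetical helper, a plain `dict` stands in for `SQLConf`, and the static-config check from the Scala version is omitted.

```python
from typing import Callable, Dict, Optional, TypeVar

T = TypeVar("T")


def with_conf(conf: Dict[str, str], pairs: Dict[str, str], f: Callable[[], T]) -> T:
    """Apply `pairs` to `conf`, call `f`, restore `conf`, and return f's result.

    Hypothetical stand-in for the Scala helper; `conf` is a plain dict here.
    """
    # Remember the previous value of every key (None means "was unset").
    previous: Dict[str, Optional[str]] = {key: conf.get(key) for key in pairs}
    conf.update(pairs)
    try:
        # Returning f() is what makes `x = with_conf(...)` possible.
        return f()
    finally:
        # Restore even when f raises: reset old values, drop keys that were unset.
        for key, old in previous.items():
            if old is None:
                conf.pop(key, None)
            else:
                conf[key] = old


settings = {"a": "1"}
result = with_conf(settings, {"a": "2", "b": "3"}, lambda: settings["a"] + settings["b"])
```

With a `Unit`-style helper the caller would need a mutable variable outside the block to get a value out; returning the body's result avoids that.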
(spark) branch master updated: [SPARK-46217][CORE][TESTS] Include `Driver/App` data in `PersistenceEngineBenchmark`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 5031e52f9e0 [SPARK-46217][CORE][TESTS] Include `Driver/App` data in `PersistenceEngineBenchmark`
5031e52f9e0 is described below

commit 5031e52f9e032e8e450af9fcd294f5b53e2c4cfd
Author: Dongjoon Hyun
AuthorDate: Sun Dec 3 15:50:09 2023 -0800

    [SPARK-46217][CORE][TESTS] Include `Driver/App` data in `PersistenceEngineBenchmark`

    ### What changes were proposed in this pull request?
    This PR aims to include `DriverInfo` and `ApplicationInfo` data in `PersistenceEngineBenchmark`.

    ### Why are the changes needed?
    `PersistenceEngine` recovers three kinds of information, but `PersistenceEngineBenchmark` previously focused on `WorkerInfo` only. This PR adds the other two kinds to make the benchmark more complete.

    https://github.com/apache/spark/blob/3da2e5c632468ec7cf7001255c1a44197b46ce30/core/src/main/scala/org/apache/spark/deploy/master/PersistenceEngine.scala#L56-L78

    ### Does this PR introduce _any_ user-facing change?
    No. This is a test improvement.

    ### How was this patch tested?
    Manual tests.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #44130 from dongjoon-hyun/SPARK-46217.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 .../PersistenceEngineBenchmark-jdk21-results.txt   | 28 +-
 .../PersistenceEngineBenchmark-results.txt         | 28 +-
 .../deploy/master/PersistenceEngineBenchmark.scala | 65 +-
 3 files changed, 80 insertions(+), 41 deletions(-)

diff --git a/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt b/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
index 38e74ed6b53..314fb6958b6 100644
--- a/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
+++ b/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
@@ -4,20 +4,20 @@ PersistenceEngineBenchmark
 OpenJDK 64-Bit Server VM 21.0.1+12-LTS on Linux 5.15.0-1051-azure
 AMD EPYC 7763 64-Core Processor
-2000 Workers:                                            Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
+1000 Workers:                                            Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-ZooKeeperPersistenceEngine with JavaSerializer                    2254          2329        119        0.0    1126867.1      1.0X
-ZooKeeperPersistenceEngine with KryoSerializer                    1911          1912          1        0.0     955667.1      1.2X
-FileSystemPersistenceEngine with JavaSerializer                    438           448         15        0.0     218868.1      5.1X
-FileSystemPersistenceEngine with JavaSerializer (lz4)              187           195          8        0.0      93337.8     12.1X
-FileSystemPersistenceEngine with JavaSerializer (lzf)              193           216         20        0.0      96678.8     11.7X
-FileSystemPersistenceEngine with JavaSerializer (snappy)           175           183         10        0.0      87652.3     12.9X
-FileSystemPersistenceEngine with JavaSerializer (zstd)             243           255         14        0.0     121695.2      9.3X
-FileSystemPersistenceEngine with KryoSerializer                    150           160         15        0.0      75089.7     15.0X
-FileSystemPersistenceEngine with KryoSerializer (lz4)              170           177         10        0.0      84996.7     13.3X
-FileSystemPersistenceEngine with KryoSerializer (lzf)              192           203         12        0.0      96019.1     11.7X
-FileSystemPersistenceEngine with KryoSerializer (snappy)           184           202         16        0.0      92241.3     12.2X
-FileSystemPersistenceEngine with KryoSerializer (zstd)             232           238          5        0.0     116075.2      9.7X
-BlackHolePersistenceEngine                                           0             0          0       27.3         36.6  30761.0X
+ZooKeeperPersistenceEngine with JavaSerializer                    5402          5546        233        0.0    5402030.8      1.0X
+ZooKeeperPersistenceEngine with KryoSerializer                    4185          4220         32        0.0    4184623.1      1.3X
+FileSystemPersistenceEngine with JavaSerializer                   1591          1634         37        0.0    1590836.4      3.4X
+FileSystemPersistenceEngine with JavaSerializer (lz4
(spark) branch master updated: [SPARK-46216][CORE] Improve `FileSystemPersistenceEngine` to support compressions
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 3da2e5c6324 [SPARK-46216][CORE] Improve `FileSystemPersistenceEngine` to support compressions
3da2e5c6324 is described below

commit 3da2e5c632468ec7cf7001255c1a44197b46ce30
Author: Dongjoon Hyun
AuthorDate: Sun Dec 3 00:26:16 2023 -0800

    [SPARK-46216][CORE] Improve `FileSystemPersistenceEngine` to support compressions

    ### What changes were proposed in this pull request?
    This PR aims to improve `FileSystemPersistenceEngine` to support compressions via a new configuration, `spark.deploy.recoveryCompressionCodec`.

    ### Why are the changes needed?
    To allow the users to choose a proper compression codec for their workloads. For `JavaSerializer` case, `LZ4` compression is **2x** faster than the baseline (no compression).

    ```
    OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Linux 5.15.0-1051-azure
    AMD EPYC 7763 64-Core Processor
    2000 Workers:                                            Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
    ZooKeeperPersistenceEngine with JavaSerializer                    2276          2360        115        0.0    1137909.6      1.0X
    ZooKeeperPersistenceEngine with KryoSerializer                    1883          1906         34        0.0     941364.2      1.2X
    FileSystemPersistenceEngine with JavaSerializer                    431           436          7        0.0     215436.9      5.3X
    FileSystemPersistenceEngine with JavaSerializer (lz4)              209           216          9        0.0     104404.1     10.9X
    FileSystemPersistenceEngine with JavaSerializer (lzf)              199           202          2        0.0      99489.5     11.4X
    FileSystemPersistenceEngine with JavaSerializer (snappy)           192           199          9        0.0      95872.9     11.9X
    FileSystemPersistenceEngine with JavaSerializer (zstd)             258           264          6        0.0     129249.4      8.8X
    FileSystemPersistenceEngine with KryoSerializer                    139           151         13        0.0      69374.5     16.4X
    FileSystemPersistenceEngine with KryoSerializer (lz4)              159           165          8        0.0      79588.9     14.3X
    FileSystemPersistenceEngine with KryoSerializer (lzf)              180           195         18        0.0      89844.0     12.7X
    FileSystemPersistenceEngine with KryoSerializer (snappy)           164           183         18        0.0      82016.0     13.9X
    FileSystemPersistenceEngine with KryoSerializer (zstd)             206           218         11        0.0     102838.9     11.1X
    BlackHolePersistenceEngine                                           0             0          0       35.1         28.5  39908.5X
    ```

    ### Does this PR introduce _any_ user-facing change?
    No, this is a new feature.

    ### How was this patch tested?
    Pass the CIs with the newly added test case.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #44129 from dongjoon-hyun/SPARK-46216.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 .../PersistenceEngineBenchmark-jdk21-results.txt   | 22
 .../PersistenceEngineBenchmark-results.txt         | 22
 .../master/FileSystemPersistenceEngine.scala       | 10 --
 .../spark/deploy/master/RecoveryModeFactory.scala  |  6 ++--
 .../org/apache/spark/internal/config/Deploy.scala  |  7
 .../apache/spark/deploy/master/MasterSuite.scala   | 40 ++
 .../deploy/master/PersistenceEngineBenchmark.scala | 19 --
 .../deploy/master/PersistenceEngineSuite.scala     | 13 +++
 8 files changed, 118 insertions(+), 21 deletions(-)

diff --git a/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt b/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
index 65dbfd0990d..38e74ed6b53 100644
--- a/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
+++ b/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
@@ -4,12 +4,20 @@ PersistenceEngineBenchmark
 OpenJDK 64-Bit Server VM 21.0.1+12-LTS on Linux 5.15.0-1051-azure
 AMD EPYC 7763 64-Core Processor
-1000 Workers:                                            Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
(spark) branch master updated: [SPARK-46206][PS] Use a narrower scope exception for SQL processor
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 73a09ed7bd7 [SPARK-46206][PS] Use a narrower scope exception for SQL processor
73a09ed7bd7 is described below

commit 73a09ed7bd7372779e25d65498c4ab6b8496f0a8
Author: Haejoon Lee
AuthorDate: Sat Dec 2 21:42:09 2023 -0800

    [SPARK-46206][PS] Use a narrower scope exception for SQL processor

    ### What changes were proposed in this pull request?
    This PR proposes to refine the exception handling in SQL processor functions by replacing the general `Exception` class with more specific exception types.

    ### Why are the changes needed?
    The current exception handling uses the broad `Exception` type, which can obscure the root cause of issues. By specifying more accurate exceptions, the code becomes clearer:
    - In `_get_local_scope()`, an `IndexError` is more appropriate as it explicitly handles the case where the index is out of range when accessing the call stack using `inspect.stack()`.
    - In `_get_ipython_scope()`, `AttributeError` and `ModuleNotFoundError` could occur if the IPython environment is not available or the expected attributes in the IPython shell object are missing.

    Using these specific exceptions enhances the maintainability and readability of the code, making it easier for developers to understand and handle errors more effectively.

    ### Does this PR introduce _any_ user-facing change?

    ### How was this patch tested?
    The existing test suite `pyspark.pandas.tests.test_sql::SQLTests` should pass.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #44114 from itholic/refine_sql_error.

    Authored-by: Haejoon Lee
    Signed-off-by: Dongjoon Hyun
---
 python/pyspark/pandas/sql_processor.py | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/python/pyspark/pandas/sql_processor.py b/python/pyspark/pandas/sql_processor.py
index 1bd1cb9823c..b047417b763 100644
--- a/python/pyspark/pandas/sql_processor.py
+++ b/python/pyspark/pandas/sql_processor.py
@@ -206,9 +206,7 @@ def _get_local_scope() -> Dict[str, Any]:
     # Get 2 scopes above (_get_local_scope -> sql -> ...) to capture the vars there.
     try:
         return inspect.stack()[_CAPTURE_SCOPES][0].f_locals
-    except Exception:
-        # TODO (rxin, thunterdb): use a narrower scope exception.
-        # See https://github.com/databricks/koalas/pull/448
+    except IndexError:
         return {}
 
 
@@ -222,9 +220,7 @@ def _get_ipython_scope() -> Dict[str, Any]:
         shell = get_ipython()
         return shell.user_ns
-    except Exception:
-        # TODO (rxin, thunterdb): use a narrower scope exception.
-        # See https://github.com/databricks/koalas/pull/448
+    except (AttributeError, ModuleNotFoundError):
         return None
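The effect of narrowing the `except` clauses can be seen in a small standalone sketch. The helpers below are hypothetical (`locals_at_depth`, `caller`, and `ipython_namespace` are not part of pyspark) and only mirror the shape of the two functions touched by the patch, assuming plain CPython with or without IPython installed.

```python
import inspect
from typing import Any, Dict, Optional


def locals_at_depth(depth: int) -> Dict[str, Any]:
    """Return a copy of the locals of the frame `depth` levels up the call stack."""
    try:
        # inspect.stack() returns a list of FrameInfo; index 0 is this frame.
        return dict(inspect.stack()[depth].frame.f_locals)
    except IndexError:
        # Only "the stack is not that deep" maps to an empty scope;
        # any other error propagates instead of being silently swallowed.
        return {}


def caller() -> Dict[str, Any]:
    marker = "visible"
    return locals_at_depth(1)  # depth 1 is the frame of caller() itself


def ipython_namespace() -> Optional[Dict[str, Any]]:
    """Return the IPython user namespace, or None outside IPython."""
    try:
        from IPython import get_ipython  # ModuleNotFoundError if IPython is absent

        # get_ipython() returns None outside a shell, so .user_ns raises AttributeError.
        return get_ipython().user_ns
    except (AttributeError, ModuleNotFoundError):
        return None
```

Here `caller()` returns a dict containing `marker`, while an excessive depth such as `locals_at_depth(10_000)` yields `{}`. An unrelated failure, for example a `TypeError` from passing a non-integer depth, is no longer hidden by a blanket `except Exception`.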
(spark) branch master updated: [SPARK-46212][CORE][SQL][SS][CONNECT][MLLIB][GRAPHX][DSTREAM][PROTOBUF][EXAMPLES] Use other functions to simplify the code pattern of `s.c.MapOps#view.mapValues`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 938b7f58051 [SPARK-46212][CORE][SQL][SS][CONNECT][MLLIB][GRAPHX][DSTREAM][PROTOBUF][EXAMPLES] Use other functions to simplify the code pattern of `s.c.MapOps#view.mapValues`
938b7f58051 is described below

commit 938b7f580519e3da64004185f7083ae63cf99bc0
Author: yangjie01
AuthorDate: Sat Dec 2 21:39:14 2023 -0800

    [SPARK-46212][CORE][SQL][SS][CONNECT][MLLIB][GRAPHX][DSTREAM][PROTOBUF][EXAMPLES] Use other functions to simplify the code pattern of `s.c.MapOps#view.mapValues`

    ### What changes were proposed in this pull request?
    This PR simplifies `s.c.MapOps.view.mapValues` using the following approach:

    - For the `s.c.immutable.MapOps` type, replace it with the `s.c.immutable.MapOps#transform` function.

    ```scala
    def transform[W](f: (K, V) => W): CC[K, W] = map { case (k, v) => (k, f(k, v)) }
    ```

    Like the case in `CountMinSketchSuite`:
    https://github.com/apache/spark/blob/0d40b1aea758b95a4416c8653599af8713a4aa16/common/sketch/src/test/scala/org/apache/spark/util/sketch/CountMinSketchSuite.scala#L59

    - For the `s.c.MapOps` type, since the `transform` function does not exist for this type, replace it directly with the `map` function.

    ```scala
    def map[K2, V2](f: ((K, V)) => (K2, V2)): CC[K2, V2] = mapFactory.from(new View.Map(this, f))
    ```

    Like the case in `KafkaTestUtils`:
    https://github.com/apache/spark/blob/0d40b1aea758b95a4416c8653599af8713a4aa16/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala#L381

    - For the `s.c.mutable.MapOps` type, the `transform` function has also been deprecated. The signatures of `transform` and its replacement function `mapValuesInPlace` are as follows:

    ```scala
    @deprecated("Use mapValuesInPlace instead", "2.13.0")
    @inline final def transform(f: (K, V) => V): this.type = mapValuesInPlace(f)

    def mapValuesInPlace(f: (K, V) => V): this.type = {...}
    ```

    The target type of the value in the function is `V`, which is different from the target type of the value in `s.c.immutable.MapOps#transform`, which is `W`. This does not meet the desired requirement. So this scenario is divided into two sub-scenarios:

    1. If the `mutable.Map` in use eventually needs to be converted to an `immutable.Map`, first convert it to an `immutable.Map` and then use the `transform` function for replacement. Like the case in `SparkConnectPlanner`:
    https://github.com/apache/spark/blob/0d40b1aea758b95a4416c8653599af8713a4aa16/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala#L292

    2. If the `mutable.Map` in use does not need to be converted to an `immutable.Map` in the end, directly use the `map` function from `scala.collection.MapOps` for replacement. Like the case in `SparkSession`:
    https://github.com/apache/spark/blob/0d40b1aea758b95a4416c8653599af8713a4aa16/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala#L313

    In addition, there is a special case in `PythonWorkerFactory`:
    https://github.com/apache/spark/blob/0d40b1aea758b95a4416c8653599af8713a4aa16/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala#L381

    For this case, it only needs to `destroy` each `Process` in `values` without returning any value. Therefore, it has been rewritten using `.values.foreach`.

    ### Why are the changes needed?
    The coding pattern of `s.c.MapOps.view.mapValues` seems verbose; it can be simplified using other functions.

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    Pass GitHub Actions

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #44122 from LuciferYang/SPARK-46212.

    Authored-by: yangjie01
    Signed-off-by: Dongjoon Hyun
---
 .../spark/util/sketch/CountMinSketchSuite.scala    |  2 +-
 .../org/apache/spark/sql/avro/AvroUtils.scala      |  3 +--
 .../scala/org/apache/spark/sql/SparkSession.scala  |  2 +-
 .../spark/sql/ClientDataFrameStatSuite.scala       |  2 +-
 .../org/apache/spark/sql/connect/dsl/package.scala |  2 +-
 .../sql/connect/planner/SparkConnectPlanner.scala  | 15 ++-
 .../sql/kafka010/KafkaMicroBatchSourceSuite.scala  |  2 +-
 .../apache/spark/sql/kafka010/KafkaTestUtils.scala |  3 ++-
 .../streaming/kafka010/ConsumerStrategy.scala      |  9 ---
 .../kafka010/DirectKafkaInputDStream.scala         |  2 +-
 .../kafka010/DirectKafkaStreamSuite