[spark] branch master updated: [SPARK-44649][SQL] Runtime Filter supports passing equivalent creation side expressions
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 103de914a5f [SPARK-44649][SQL] Runtime Filter supports passing equivalent creation side expressions 103de914a5f is described below commit 103de914a5f96fccbe722663ee69c8ee7d9c8135 Author: Jiaan Geng AuthorDate: Wed Oct 18 14:55:51 2023 +0800 [SPARK-44649][SQL] Runtime Filter supports passing equivalent creation side expressions ### What changes were proposed in this pull request? Currently, Spark runtime filter supports multi level shuffle join side as filter creation side. Please see: https://github.com/apache/spark/pull/39170. Although this feature adds the adaptive scene and improves the performance, there are still need to support other case. **Optimization of Expression Transitivity on the Creation Side of Spark Runtime Filter** **Principle** Association expressions are transitive in some Joins, such as: `Tab1.col1A = Tab2.col2B` and `Tab2.col2B = Tab3.col3C` Actually, it can be inferred that `Tab1.col1A = Tab3.col3C`. **Optimization points** Currently, the runtime filter's creation side expression only uses directly associated keys. If the transitivity of association conditions is utilized, runtime filters can be injected into many scenarios, such as: ``` SELECT * FROM ( SELECT * FROM tab1 JOIN tab2 ON tab1.c1 = tab2.c2 WHERE tab2.a2 = 5 ) AS a JOIN tab3 ON tab3.c3 = a.c1 ``` The `tab3.c3` here is only associated with `tab1.c1` and not with `tab2.c2`. Although there is selective filtering on tab2 (`tab2.a2 = 5`), Spark is currently unable to inject a Runtime Filter. As long as transitivity is considered, we can know that `tab3.c3` and `tab2.c2` are related, so we can still inject Runtime Filter and improve performance. For the current implementation, Spark only inject runtime filter into tab1 with bloom filter based on `bf2.a2 = 5`. Because there is no the join between tab3 and tab2, so Spark can't inject runtime filter into tab3 with the same bloom filter. But the above SQL have the join condition `tab3.c3 = a.c1(tab1.c1)` between tab3 and tab2, and also have the join condition `tab1.c1 = tab2.c2`. We can rely on the transitivity of the join condition to get the virtual join condition `tab3.c3 = tab2.c2`, then we can inject the bloom filter based on `bf2.a2 = 5` into tab3. ### Why are the changes needed? Enhance the Spark runtime filter and improve performance. ### Does this PR introduce _any_ user-facing change? 'No'. Just update the inner implementation. ### How was this patch tested? New tests. Micro benchmark for q75 in TPC-DS. **2TB TPC-DS** | TPC-DS Query | Before(Seconds) | After(Seconds) | Speedup(Percent) | | | | | | | q75 | 129.664 | 81.562 | 58.98% | Closes #42317 from beliefer/SPARK-44649. Authored-by: Jiaan Geng Signed-off-by: Wenchen Fan --- .../catalyst/optimizer/InjectRuntimeFilter.scala | 64 +++--- .../spark/sql/InjectRuntimeFilterSuite.scala | 38 +++-- 2 files changed, 75 insertions(+), 27 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala index 8737082e571..30526bd8106 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala @@ -125,14 +125,14 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with PredicateHelper with J */ private def extractSelectiveFilterOverScan( plan: LogicalPlan, - filterCreationSideKey: Expression): Option[LogicalPlan] = { -@tailrec + filterCreationSideKey: Expression): Option[(Expression, LogicalPlan)] = { def extract( p: LogicalPlan, predicateReference: AttributeSet, hasHitFilter: Boolean, hasHitSelectiveFilter: Boolean, -currentPlan: LogicalPlan): Option[LogicalPlan] = p match { +currentPlan: LogicalPlan, +targetKey: Expression): Option[(Expression, LogicalPlan)] = p match { case Project(projectList, child) if hasHitFilter => // We need to make sure all expressions referenced by filter predicates are simple // expressions. @@ -143,41 +143,62 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with PredicateHelper with J referencedExprs.map(_.references).foldLeft(AttributeSet.empty)(_ ++ _), hasHitFilter, hasHitSel
[spark] branch master updated: [SPARK-45558][SS] Introduce a metadata file for streaming stateful operator
This is an automated email from the ASF dual-hosted git repository. kabhwan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 3005dc89084 [SPARK-45558][SS] Introduce a metadata file for streaming stateful operator 3005dc89084 is described below commit 3005dc8908486f63a3e471cd05189881b833daf1 Author: Chaoqin Li AuthorDate: Wed Oct 18 15:49:43 2023 +0900 [SPARK-45558][SS] Introduce a metadata file for streaming stateful operator ### What changes were proposed in this pull request? Introduce a metadata file for streaming stateful operator, write metadata for stateful operator during planning. The information to store in the metadata file: - operator name (no need to be unique among stateful operators in the query) - state store name - numColumnsPrefixKey: > 0 if prefix scan is enabled, 0 otherwise The body of metadata file will be in json format. The metadata file will be versioned just as other streaming metadata file to be future proof. ### Why are the changes needed? The metadata file will improve expose more information about the state store, improves debugability and facilitate the development of state related feature such as reading and writing state and state repartitioning. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add unit test and integration tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #43393 from chaoqin-li1123/state_metadata. Authored-by: Chaoqin Li Signed-off-by: Jungtaek Lim --- .../spark/sql/execution/QueryExecution.scala | 2 +- .../execution/streaming/IncrementalExecution.scala | 22 ++- .../execution/streaming/MicroBatchExecution.scala | 4 +- .../streaming/StreamingSymmetricHashJoinExec.scala | 10 ++ .../streaming/continuous/ContinuousExecution.scala | 3 +- .../streaming/state/OperatorStateMetadata.scala| 136 .../execution/streaming/statefulOperators.scala| 21 ++- .../state/OperatorStateMetadataSuite.scala | 181 + 8 files changed, 374 insertions(+), 5 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala index b3c97a83970..3d35300773b 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala @@ -272,7 +272,7 @@ class QueryExecution( new IncrementalExecution( sparkSession, logical, OutputMode.Append(), "", UUID.randomUUID, UUID.randomUUID, 0, None, OffsetSeqMetadata(0, 0), -WatermarkPropagator.noop()) +WatermarkPropagator.noop(), false) } else { this } diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala index ebdb9caf09e..a67097f6e96 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala @@ -20,6 +20,8 @@ package org.apache.spark.sql.execution.streaming import java.util.UUID import java.util.concurrent.atomic.AtomicInteger +import org.apache.hadoop.fs.Path + import org.apache.spark.internal.Logging import org.apache.spark.sql.{SparkSession, Strategy} import org.apache.spark.sql.catalyst.QueryPlanningTracker @@ -32,6 +34,7 @@ import org.apache.spark.sql.execution.aggregate.{HashAggregateExec, MergingSessi import org.apache.spark.sql.execution.exchange.ShuffleExchangeLike import org.apache.spark.sql.execution.python.FlatMapGroupsInPandasWithStateExec import org.apache.spark.sql.execution.streaming.sources.WriteToMicroBatchDataSourceV1 +import org.apache.spark.sql.execution.streaming.state.OperatorStateMetadataWriter import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.streaming.OutputMode import org.apache.spark.util.Utils @@ -50,7 +53,8 @@ class IncrementalExecution( val currentBatchId: Long, val prevOffsetSeqMetadata: Option[OffsetSeqMetadata], val offsetSeqMetadata: OffsetSeqMetadata, -val watermarkPropagator: WatermarkPropagator) +val watermarkPropagator: WatermarkPropagator, +val isFirstBatch: Boolean) extends QueryExecution(sparkSession, logicalPlan) with Logging { // Modified planner with stateful operations. @@ -71,6 +75,8 @@ class IncrementalExecution( StreamingGlobalLimitStrategy(outputMode) :: Nil } + private lazy val hadoopConf = sparkSession.sessionState.newHadoopConf() + private[sql] val numStateStores
[spark] branch master updated (b1e57a2b359 -> d7d38fbc184)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from b1e57a2b359 [SPARK-45534][CORE] Use java.lang.ref.Cleaner instead of finalize for RemoteBlockPushResolver add d7d38fbc184 [SPARK-45587][INFRA] Skip UNIDOC and MIMA in `build` GitHub Action job No new revisions were added by this update. Summary of changes: .github/workflows/build_and_test.yml | 4 1 file changed, 4 insertions(+) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-45534][CORE] Use java.lang.ref.Cleaner instead of finalize for RemoteBlockPushResolver
This is an automated email from the ASF dual-hosted git repository. mridulm80 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b1e57a2b359 [SPARK-45534][CORE] Use java.lang.ref.Cleaner instead of finalize for RemoteBlockPushResolver b1e57a2b359 is described below commit b1e57a2b359d7d9fbf07adfba10db97f38b99bde Author: zhaomin AuthorDate: Wed Oct 18 01:20:08 2023 -0500 [SPARK-45534][CORE] Use java.lang.ref.Cleaner instead of finalize for RemoteBlockPushResolver ### What changes were proposed in this pull request? use java.lang.ref.Cleaner instead of finalize() for RemoteBlockPushResolver ### Why are the changes needed? The finalize() method has been marked as deprecated since Java 9 and will be removed in the future, java.lang.ref.Cleaner is the more recommended solution. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #43371 from zhaomin1423/45315. Authored-by: zhaomin Signed-off-by: Mridul Muralidharan gmail.com> --- .../network/shuffle/RemoteBlockPushResolver.java | 101 + .../network/shuffle/ShuffleTestAccessor.scala | 2 +- 2 files changed, 64 insertions(+), 39 deletions(-) diff --git a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java index a915d0eccb0..14fefebe089 100644 --- a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java +++ b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java @@ -21,6 +21,7 @@ import java.io.DataOutputStream; import java.io.File; import java.io.FileOutputStream; import java.io.IOException; +import java.lang.ref.Cleaner; import java.nio.ByteBuffer; import java.nio.channels.FileChannel; import java.nio.charset.StandardCharsets; @@ -94,6 +95,7 @@ import org.apache.spark.network.util.TransportConf; */ public class RemoteBlockPushResolver implements MergedShuffleFileManager { + private static final Cleaner CLEANER = Cleaner.create(); private static final Logger logger = LoggerFactory.getLogger(RemoteBlockPushResolver.class); public static final String MERGED_SHUFFLE_FILE_NAME_PREFIX = "shuffleMerged"; @@ -481,7 +483,7 @@ public class RemoteBlockPushResolver implements MergedShuffleFileManager { appShuffleInfo.shuffles.forEach((shuffleId, shuffleInfo) -> shuffleInfo.shuffleMergePartitions .forEach((shuffleMergeId, partitionInfo) -> { synchronized (partitionInfo) { - partitionInfo.closeAllFilesAndDeleteIfNeeded(false); + partitionInfo.cleanable.clean(); } })); if (cleanupLocalDirs) { @@ -537,7 +539,8 @@ public class RemoteBlockPushResolver implements MergedShuffleFileManager { partitions .forEach((partitionId, partitionInfo) -> { synchronized (partitionInfo) { - partitionInfo.closeAllFilesAndDeleteIfNeeded(true); + partitionInfo.cleanable.clean(); + partitionInfo.deleteAllFiles(); } }); } @@ -822,7 +825,7 @@ public class RemoteBlockPushResolver implements MergedShuffleFileManager { msg.appAttemptId, msg.shuffleId, msg.shuffleMergeId, partition.reduceId, ioe.getMessage()); } finally { -partition.closeAllFilesAndDeleteIfNeeded(false); +partition.cleanable.clean(); } } } @@ -1720,6 +1723,7 @@ public class RemoteBlockPushResolver implements MergedShuffleFileManager { // The meta file for a particular merged shuffle contains all the map indices that belong to // every chunk. The entry per chunk is a serialized bitmap. private final MergeShuffleFile metaFile; +private final Cleaner.Cleanable cleanable; // Location offset of the last successfully merged block for this shuffle partition private long dataFilePos; // Track the map index whose block is being merged for this shuffle partition @@ -1756,6 +1760,8 @@ public class RemoteBlockPushResolver implements MergedShuffleFileManager { this.dataFilePos = 0; this.mapTracker = new RoaringBitmap(); this.chunkTracker = new RoaringBitmap(); + this.cleanable = CLEANER.register(this, new ResourceCleaner(dataChannel, indexFile, +metaFile, appAttemptShuffleMergeId, reduceId)); } public long getDataFilePos() { @@ -1864,36 +1870,13 @@ public class RemoteBlockPushResolver implements MergedShuffleFileManager { metaFile.getChannel().truncate(metaFile.getPos());
[spark] branch master updated: [SPARK-45035][SQL] Fix ignoreCorruptFiles/ignoreMissingFiles with multiline CSV/JSON will report error
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 11e7ea4f11d [SPARK-45035][SQL] Fix ignoreCorruptFiles/ignoreMissingFiles with multiline CSV/JSON will report error 11e7ea4f11d is described below commit 11e7ea4f11df71e2942322b01fcaab57dac20c83 Author: Jia Fan AuthorDate: Wed Oct 18 11:06:43 2023 +0500 [SPARK-45035][SQL] Fix ignoreCorruptFiles/ignoreMissingFiles with multiline CSV/JSON will report error ### What changes were proposed in this pull request? Fix ignoreCorruptFiles/ignoreMissingFiles with multiline CSV/JSON will report error, it would be like: ```log org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4940.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4940.0 (TID 4031) (10.68.177.106 executor 0): com.univocity.parsers.common.TextParsingException: java.lang.IllegalStateException - Error reading from input Parser Configuration: CsvParserSettings: Auto configuration enabled=true Auto-closing enabled=true Autodetect column delimiter=false Autodetect quotes=false Column reordering enabled=true Delimiters for detection=null Empty value= Escape unquoted values=false Header extraction enabled=null Headers=null Ignore leading whitespaces=false Ignore leading whitespaces in quotes=false Ignore trailing whitespaces=false Ignore trailing whitespaces in quotes=false Input buffer size=1048576 Input reading on separate thread=false Keep escape sequences=false Keep quotes=false Length of content displayed on error=1000 Line separator detection enabled=true Maximum number of characters per column=-1 Maximum number of columns=20480 Normalize escaped line separators=true Null value= Number of records to read=all Processor=none Restricting data in exceptions=false RowProcessor error handler=null Selected fields=none Skip bits as whitespace=true Skip empty lines=true Unescaped quote handling=STOP_AT_DELIMITERFormat configuration: CsvFormat: Comment character=# Field delimiter=, Line separator (normalized)=\n Line separator sequence=\n Quote character=" Quote escape character=\ Quote escape escape character=null Internal state when error was thrown: line=0, column=0, record=0 at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:402) at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:277) at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:843) at org.apache.spark.sql.catalyst.csv.UnivocityParser$$anon$1.(UnivocityParser.scala:463) at org.apache.spark.sql.catalyst.csv.UnivocityParser$.convertStream(UnivocityParser.scala:46... ``` Because multiline CSV/JSON use `BinaryFileRDD` not `FileScanRDD`. Unlike `FileScanRDD`, when met corrupt files will check `ignoreCorruptFiles` config to avoid report IOException, `BinaryFileRDD` will not report error because it return normal `PortableDataStream`. So we should catch it when infer schema in lambda function. Also do same thing for `ignoreMissingFiles`. ### Why are the changes needed? Fix the bug when use mulitline mode with ignoreCorruptFiles/ignoreMissingFiles config. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? add new test. ### Was this patch authored or co-authored using generative AI tooling? No Closes #42979 from Hisoka-X/SPARK-45035_csv_multi_line. Authored-by: Jia Fan Signed-off-by: Max Gekk --- .../spark/sql/catalyst/json/JsonInferSchema.scala | 18 +-- .../execution/datasources/csv/CSVDataSource.scala | 28 --- .../datasources/CommonFileDataSourceSuite.scala| 28 +++ .../sql/execution/datasources/csv/CSVSuite.scala | 58 +- .../sql/execution/datasources/json/JsonSuite.scala | 46 - 5 files changed, 142 insertions(+), 36 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala index 4123c5290b6..4d04b34876c 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catal
[spark] branch master updated: [SPARK-45576][CORE][FOLLOWUP] Remove unused imports to fix Java linter errors
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new fbf8b7be090 [SPARK-45576][CORE][FOLLOWUP] Remove unused imports to fix Java linter errors fbf8b7be090 is described below commit fbf8b7be090bf20f7a6e81c62ba5fe3f9f9a801b Author: Dongjoon Hyun AuthorDate: Tue Oct 17 22:23:42 2023 -0700 [SPARK-45576][CORE][FOLLOWUP] Remove unused imports to fix Java linter errors ### What changes were proposed in this pull request? This PR aims to remove the unused imports. ### Why are the changes needed? To recover `master` branch by fixing Java linter errors. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Or, manually checked like the following. ``` $ dev/lint-java Using `mvn` from path: /Users/dongjoon/APACHE/spark-merge/build/apache-maven-3.9.5/bin/mvn Using SPARK_LOCAL_IP=localhost Checkstyle checks passed. ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43423 from dongjoon-hyun/SPARK-45576. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java| 2 -- 1 file changed, 2 deletions(-) diff --git a/common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java b/common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java index dbd71f987d4..0526fcb11be 100644 --- a/common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java +++ b/common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java @@ -28,8 +28,6 @@ import java.util.HashMap; import java.util.Map; import org.junit.jupiter.api.Test; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; import static org.junit.jupiter.api.Assertions.*; - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-45500][CORE][WEBUI][FOLLOWUP] Show `RELAUNCHING` drivers too
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 73f9f5296e3 [SPARK-45500][CORE][WEBUI][FOLLOWUP] Show `RELAUNCHING` drivers too 73f9f5296e3 is described below commit 73f9f5296e36541db78ab10c4c01a56fbc17cca8 Author: Dongjoon Hyun AuthorDate: Tue Oct 17 21:56:30 2023 -0700 [SPARK-45500][CORE][WEBUI][FOLLOWUP] Show `RELAUNCHING` drivers too ### What changes were proposed in this pull request? This is a follow-up of #43328 to show `RELAUNCHING` drivers too. ### Why are the changes needed? When we submit with `--supervise` option, the abnormally-completed driver is in `RELAUNCHING` status and newly launched driver is in `SUBMITTED` status. https://github.com/apache/spark/blob/0cb4a84f6ab0c1bd101e6bc72be82987bbc02e9b/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L995 ![Screenshot 2023-10-17 at 8 01 01 PM](https://github.com/apache/spark/assets/9700541/22c614cb-3f5c-44a0-b5f5-8edf4a20c580) ### Does this PR introduce _any_ user-facing change? Yes, but this is a new UI item. ### How was this patch tested? Manual tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43418 from dongjoon-hyun/SPARK-45500-2. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala index 5c1887be5b8..48c0c9601c1 100644 --- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala +++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala @@ -156,7 +156,8 @@ private[ui] class MasterPage(parent: MasterWebUI) extends WebUIPage("") { {state.completedDrivers.length} Completed ({state.completedDrivers.count(_.state == DriverState.KILLED)} Killed, {state.completedDrivers.count(_.state == DriverState.FAILED)} Failed, -{state.completedDrivers.count(_.state == DriverState.ERROR)} Error) +{state.completedDrivers.count(_.state == DriverState.ERROR)} Error, +{state.completedDrivers.count(_.state == DriverState.RELAUNCHING)} Relaunching) Status: {state.status} - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (4901548d4c3 -> f4bd99da12f)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 4901548d4c3 [SPARK-45576][CORE] Remove unnecessary debug logs in ReloadingX509TrustManagerSuite add f4bd99da12f [SPARK-45009][SQL][FOLLOW UP] Turn off decorrelation in join conditions for AQE InSubquery test No new revisions were added by this update. Summary of changes: .../apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-45576][CORE] Remove unnecessary debug logs in ReloadingX509TrustManagerSuite
This is an automated email from the ASF dual-hosted git repository. mridulm80 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4901548d4c3 [SPARK-45576][CORE] Remove unnecessary debug logs in ReloadingX509TrustManagerSuite 4901548d4c3 is described below commit 4901548d4c36ac5988bcc04501057de40712e66d Author: Hasnain Lakhani AuthorDate: Tue Oct 17 23:48:19 2023 -0500 [SPARK-45576][CORE] Remove unnecessary debug logs in ReloadingX509TrustManagerSuite ### What changes were proposed in this pull request? Remove debug logs that were left in by accident. ### Why are the changes needed? These were not intended to be committed ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #43404 from hasnain-db/remove-logs. Authored-by: Hasnain Lakhani Signed-off-by: Mridul Muralidharan gmail.com> --- .../apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java| 6 -- 1 file changed, 6 deletions(-) diff --git a/common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java b/common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java index 7e2cc38e70b..dbd71f987d4 100644 --- a/common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java +++ b/common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java @@ -37,8 +37,6 @@ import static org.apache.spark.network.ssl.SslSampleConfigs.*; public class ReloadingX509TrustManagerSuite { - private final Logger logger = LoggerFactory.getLogger(ReloadingX509TrustManagerSuite.class); - /** * Waits until reload count hits the requested value, sleeping 100ms at a time. * If the maximum number of attempts is hit, throws a RuntimeException @@ -280,8 +278,6 @@ public class ReloadingX509TrustManagerSuite { new ReloadingX509TrustManager("jks", trustStoreSymlink, "password", 1); assertEquals(1, tm.getReloadInterval()); assertEquals(0, tm.reloadCount); -logger.info("TRUST STORE 1 IS" + trustStore1); -logger.info("TRUST STORE 2 IS " + trustStore2); try { tm.init(); assertEquals(1, tm.getAcceptedIssuers().length); @@ -289,10 +285,8 @@ public class ReloadingX509TrustManagerSuite { assertEquals(0, tm.reloadCount); // Repoint to trustStore2, which has another cert - logger.info("REPOINTING SYMLINK!!!"); trustStoreSymlink.delete(); Files.createSymbolicLink(trustStoreSymlink.toPath(), trustStore2.toPath()); - logger.info("REPOINTED!!!"); // Wait up to 5s until we reload waitForReloadCount(tm, 1, 50); - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-45546][BUILD][FOLLOW-UP] Remove snapshot repo mistakenly added
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 74dc5a3d8c0 [SPARK-45546][BUILD][FOLLOW-UP] Remove snapshot repo mistakenly added 74dc5a3d8c0 is described below commit 74dc5a3d8c0ffe425dfadec44e41615a8f3f8367 Author: Hyukjin Kwon AuthorDate: Tue Oct 17 20:11:46 2023 -0700 [SPARK-45546][BUILD][FOLLOW-UP] Remove snapshot repo mistakenly added ### What changes were proposed in this pull request? This PR removes snapshot repo mistakenly added in `pom.xml` ### Why are the changes needed? To clean up. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? CI in this PR ### Was this patch authored or co-authored using generative AI tooling? No Closes #43415 from HyukjinKwon/SPARK-45546-followup. Authored-by: Hyukjin Kwon Signed-off-by: Dongjoon Hyun --- pom.xml | 7 --- 1 file changed, 7 deletions(-) diff --git a/pom.xml b/pom.xml index ade6537c2a1..824ae49f6da 100644 --- a/pom.xml +++ b/pom.xml @@ -3840,11 +3840,4 @@ - - - internal.snapshot - Internal Snapshot Repository - http://localhost:8081/repository/maven-snapshots/ - - - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [MINOR][INFRA] Skip if JIRA ID is an empty string
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2189f9bf289 [MINOR][INFRA] Skip if JIRA ID is an empty string 2189f9bf289 is described below commit 2189f9bf2894dfb9a6ae1e74b0863486aaf49621 Author: Hyukjin Kwon AuthorDate: Tue Oct 17 20:10:10 2023 -0700 [MINOR][INFRA] Skip if JIRA ID is an empty string ### What changes were proposed in this pull request? This PR skips when an empty string is provided as a JIRA ID. ### Why are the changes needed? When you merge a minor PR that does not have a JIRA ID, and you say `y` to update associated JIRA, you face an error as below: ``` ... Would you like to update an associated JIRA? (y/n): y Enter a JIRA id []: ASF JIRA could not find JiraError HTTP 405 url: https://issues.apache.org/jira/rest/api/2/issue/ response headers = {'Date': 'Wed, 18 Oct 2023 02:44:24 GMT', 'Server': 'Apache', 'X-AREQUESTID': '164x103019855x17', 'X-ASESSIONID': '1bxvboa', 'Referrer-Policy': 'strict-origin-when-cross-origin', 'X-XSS-Protection': '1; mode=block', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'Content-Security-Policy': "frame-ancestors 'self'", 'Strict-Transport-Security': 'max-age=31536000', 'X-Seraph-LoginReason': 'OK', 'X-AUSERNAME': 'gurwls223', 'Allow': 'POST,O [...] response text = ... ``` After this PR, it doesn't fail but shows a warning `JIRA ID not found, skipping`. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? I am going to test this change against this PR. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43417 from HyukjinKwon/minor-merge-script. Authored-by: Hyukjin Kwon Signed-off-by: Dongjoon Hyun --- dev/merge_spark_pr.py | 3 +++ 1 file changed, 3 insertions(+) diff --git a/dev/merge_spark_pr.py b/dev/merge_spark_pr.py index 4021999f19b..41ea921bb86 100755 --- a/dev/merge_spark_pr.py +++ b/dev/merge_spark_pr.py @@ -246,6 +246,9 @@ def resolve_jira_issue(merge_branches, comment, default_jira_id=""): jira_id = input("Enter a JIRA id [%s]: " % default_jira_id) if jira_id == "": jira_id = default_jira_id +if jira_id == "": +print("JIRA ID not found, skipping.") +return try: issue = asf_jira.issue(jira_id) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (3ef18e2d00f -> 0cb4a84f6ab)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 3ef18e2d00f [SPARK-45546][BUILD][INFRA] Make `publish-snapshot` support `package` first then `deploy` add 0cb4a84f6ab [MINOR][DOCS] Update the docs for spark.sql.optimizer.canChangeCachedPlanOutputPartitioning configuration No new revisions were added by this update. Summary of changes: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.5 updated: [MINOR][DOCS] Update the docs for spark.sql.optimizer.canChangeCachedPlanOutputPartitioning configuration
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new ed2a4cc6033 [MINOR][DOCS] Update the docs for spark.sql.optimizer.canChangeCachedPlanOutputPartitioning configuration ed2a4cc6033 is described below commit ed2a4cc6033ac35faa7b19eb236a4c953543d519 Author: Hyukjin Kwon AuthorDate: Wed Oct 18 11:43:59 2023 +0900 [MINOR][DOCS] Update the docs for spark.sql.optimizer.canChangeCachedPlanOutputPartitioning configuration ### What changes were proposed in this pull request? This PR fixes the documentation for `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning` configuration by saying this is enabled by default. This is a followup of https://github.com/apache/spark/pull/40390 (but did not use a JIRA due to fixed versions properties in the JIRA). ### Why are the changes needed? To mention that this is enabled, to the end users. ### Does this PR introduce _any_ user-facing change? No, it's an internal conf, not documented. ### How was this patch tested? CI in this PR. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43411 from HyukjinKwon/fix-docs. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon (cherry picked from commit 0cb4a84f6ab0c1bd101e6bc72be82987bbc02e9b) Signed-off-by: Hyukjin Kwon --- sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala index 427d0480190..4ea0cd5bcc1 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala @@ -1529,7 +1529,7 @@ object SQLConf { .doc("Whether to forcibly enable some optimization rules that can change the output " + "partitioning of a cached query when executing it for caching. If it is set to true, " + "queries may need an extra shuffle to read the cached data. This configuration is " + -"disabled by default. Currently, the optimization rules enabled by this configuration " + +"enabled by default. The optimization rules enabled by this configuration " + s"are ${ADAPTIVE_EXECUTION_ENABLED.key} and ${AUTO_BUCKETED_SCAN_ENABLED.key}.") .version("3.2.0") .booleanConf - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-45546][BUILD][INFRA] Make `publish-snapshot` support `package` first then `deploy`
This is an automated email from the ASF dual-hosted git repository. yangjie01 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 3ef18e2d00f [SPARK-45546][BUILD][INFRA] Make `publish-snapshot` support `package` first then `deploy` 3ef18e2d00f is described below commit 3ef18e2d00f386196292f0c768816626bc903d47 Author: yangjie01 AuthorDate: Wed Oct 18 10:15:27 2023 +0800 [SPARK-45546][BUILD][INFRA] Make `publish-snapshot` support `package` first then `deploy` ### What changes were proposed in this pull request? This pr adds an environment variable `PACKAGE_BEFORE_DEPLOY` to the `publish-snapshot` process. When `PACKAGE_BEFORE_DEPLOY` is true, the publish process will be split into two steps: the first step is to package with mvn package, and the second step is to deploy the packaged jar. At the same time, this PR sets `PACKAGE_BEFORE_DEPLOY` to true in the `publish_snapshot.yml` configuration. ### Why are the changes needed? Make the `publish-snapshot` task in GitHub Action to be divided into two steps, which can alleviate the resource pressure brought by direct deploy. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #43378 from LuciferYang/no-doc-deploy. Lead-authored-by: yangjie01 Co-authored-by: YangJie Signed-off-by: yangjie01 --- .github/workflows/publish_snapshot.yml | 4 dev/create-release/release-build.sh| 14 -- pom.xml| 7 +++ 3 files changed, 23 insertions(+), 2 deletions(-) diff --git a/.github/workflows/publish_snapshot.yml b/.github/workflows/publish_snapshot.yml index 7ed836f016b..476d41d0cf1 100644 --- a/.github/workflows/publish_snapshot.yml +++ b/.github/workflows/publish_snapshot.yml @@ -66,4 +66,8 @@ jobs: GPG_KEY: "not_used" GPG_PASSPHRASE: "not_used" GIT_REF: ${{ matrix.branch }} +# SPARK-45546 adds this environment variable to split the publish snapshot process into two steps: +# first package, then deploy. This is intended to reduce the resource pressure of deploy. +# When PACKAGE_BEFORE_DEPLOY is not set to true, it will revert to the one-step deploy method. +PACKAGE_BEFORE_DEPLOY: true run: ./dev/create-release/release-build.sh publish-snapshot diff --git a/dev/create-release/release-build.sh b/dev/create-release/release-build.sh index f3571c4e48c..3776c64e31e 100755 --- a/dev/create-release/release-build.sh +++ b/dev/create-release/release-build.sh @@ -432,14 +432,24 @@ if [[ "$1" == "publish-snapshot" ]]; then echo "" >> $tmp_settings if [[ $PUBLISH_SCALA_2_12 = 1 ]]; then -$MVN --settings $tmp_settings -DskipTests $SCALA_2_12_PROFILES $PUBLISH_PROFILES clean deploy +if [ "$PACKAGE_BEFORE_DEPLOY" = "true" ]; then + $MVN -DskipTests $SCALA_2_12_PROFILES $PUBLISH_PROFILES clean package + $MVN --settings $tmp_settings -DskipTests $SCALA_2_12_PROFILES $PUBLISH_PROFILES deploy +else + $MVN --settings $tmp_settings -DskipTests $SCALA_2_12_PROFILES $PUBLISH_PROFILES clean deploy +fi fi if [[ $PUBLISH_SCALA_2_13 = 1 ]]; then if [[ $SPARK_VERSION < "4.0" ]]; then ./dev/change-scala-version.sh 2.13 fi -$MVN --settings $tmp_settings -DskipTests $SCALA_2_13_PROFILES $PUBLISH_PROFILES clean deploy +if [ "$PACKAGE_BEFORE_DEPLOY" = "true" ]; then + $MVN -DskipTests $SCALA_2_13_PROFILES $PUBLISH_PROFILES clean package + $MVN --settings $tmp_settings -DskipTests $SCALA_2_13_PROFILES $PUBLISH_PROFILES deploy +else + $MVN --settings $tmp_settings -DskipTests $SCALA_2_13_PROFILES $PUBLISH_PROFILES clean deploy +fi fi rm $tmp_settings diff --git a/pom.xml b/pom.xml index 824ae49f6da..ade6537c2a1 100644 --- a/pom.xml +++ b/pom.xml @@ -3840,4 +3840,11 @@ + + + internal.snapshot + Internal Snapshot Repository + http://localhost:8081/repository/maven-snapshots/ + + - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-45578][CORE] Remove `InaccessibleObjectException` usage by using `trySetAccessible`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 3a3b8c14b8c [SPARK-45578][CORE] Remove `InaccessibleObjectException` usage by using `trySetAccessible` 3a3b8c14b8c is described below commit 3a3b8c14b8c0e056554f11a37e31d8add3087e28 Author: Dongjoon Hyun AuthorDate: Tue Oct 17 18:36:05 2023 -0700 [SPARK-45578][CORE] Remove `InaccessibleObjectException` usage by using `trySetAccessible` ### What changes were proposed in this pull request? This PR aims to remove `InaccessibleObjectException` usage by using `trySetAccessible` instead of `setAccessible`. ### Why are the changes needed? `trySetAccessible` is available on Java 9+ - https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/reflect/AccessibleObject.html#trySetAccessible() We can simplify the code for Apache Spark 4.0.0 because we support only Java 17 and 21 . **BEFORE** ``` $ git grep InaccessibleObjectException common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java: if ("InaccessibleObjectException".equals(re.getClass().getSimpleName())) { core/src/main/scala/org/apache/spark/util/SizeEstimator.scala: // Java 9+ can throw InaccessibleObjectException but the class is Java 9+-only core/src/main/scala/org/apache/spark/util/SizeEstimator.scala: if re.getClass.getSimpleName == "InaccessibleObjectException" => ``` **AFTER** ``` $ git grep InaccessibleObjectException ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43406 from dongjoon-hyun/SPARK-45578. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../main/java/org/apache/spark/unsafe/Platform.java| 18 +- .../scala/org/apache/spark/util/SizeEstimator.scala| 10 +++--- 2 files changed, 8 insertions(+), 20 deletions(-) diff --git a/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java b/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java index e02346c4773..dfa5734ccbc 100644 --- a/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java +++ b/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java @@ -72,19 +72,11 @@ public final class Platform { cls.getDeclaredConstructor(Long.TYPE, Integer.TYPE) : cls.getDeclaredConstructor(Long.TYPE, Long.TYPE); Field cleanerField = cls.getDeclaredField("cleaner"); - try { -constructor.setAccessible(true); -cleanerField.setAccessible(true); - } catch (RuntimeException re) { -// This is a Java 9+ exception, so needs to be handled without importing it -if ("InaccessibleObjectException".equals(re.getClass().getSimpleName())) { - // Continue, but the constructor/field are not available - // See comment below for more context - constructor = null; - cleanerField = null; -} else { - throw re; -} + if (!constructor.trySetAccessible()) { +constructor = null; + } + if (!cleanerField.trySetAccessible()) { +cleanerField = null; } // Have to set these values no matter what: DBB_CONSTRUCTOR = constructor; diff --git a/core/src/main/scala/org/apache/spark/util/SizeEstimator.scala b/core/src/main/scala/org/apache/spark/util/SizeEstimator.scala index 39e071616f2..10ff80143b7 100644 --- a/core/src/main/scala/org/apache/spark/util/SizeEstimator.scala +++ b/core/src/main/scala/org/apache/spark/util/SizeEstimator.scala @@ -333,19 +333,15 @@ object SizeEstimator extends Logging { if (fieldClass.isPrimitive) { sizeCount(primitiveSize(fieldClass)) += 1 } else { - // Note: in Java 9+ this would be better with trySetAccessible and canAccess try { -field.setAccessible(true) // Enable future get()'s on this field -pointerFields = field :: pointerFields +if (field.trySetAccessible()) { // Enable future get()'s on this field + pointerFields = field :: pointerFields +} } catch { // If the field isn't accessible, we can still record the pointer size // but can't know more about the field, so ignore it case _: SecurityException => // do nothing -// Java 9+ can throw InaccessibleObjectException but the class is Java 9+-only -case re: RuntimeException -if re.getClass.getSimpleName == "InaccessibleObjectException" => -
[spark] branch branch-3.3 updated: [MINOR][SQL] Remove signature from Hive thriftserver exception
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new 746f936f4b5 [MINOR][SQL] Remove signature from Hive thriftserver exception 746f936f4b5 is described below commit 746f936f4b5d233264ee31e4298074355bc28fda Author: Sean Owen AuthorDate: Tue Oct 17 16:10:56 2023 -0700 [MINOR][SQL] Remove signature from Hive thriftserver exception ### What changes were proposed in this pull request? Don't return expected signature to caller in Hive thriftserver exception ### Why are the changes needed? Please see private discussion ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #43402 from srowen/HiveCookieSigner. Authored-by: Sean Owen Signed-off-by: Dongjoon Hyun (cherry picked from commit cf59b1f51c16301f689b4e0f17ba4dbd140e1b19) Signed-off-by: Dongjoon Hyun --- .../src/main/java/org/apache/hive/service/CookieSigner.java| 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/sql/hive-thriftserver/src/main/java/org/apache/hive/service/CookieSigner.java b/sql/hive-thriftserver/src/main/java/org/apache/hive/service/CookieSigner.java index 782e47a6cd9..4b8d2cb1536 100644 --- a/sql/hive-thriftserver/src/main/java/org/apache/hive/service/CookieSigner.java +++ b/sql/hive-thriftserver/src/main/java/org/apache/hive/service/CookieSigner.java @@ -81,8 +81,7 @@ public class CookieSigner { LOG.debug("Signature generated for " + rawValue + " inside verify is " + currentSignature); } if (!MessageDigest.isEqual(originalSignature.getBytes(), currentSignature.getBytes())) { - throw new IllegalArgumentException("Invalid sign, original = " + originalSignature + -" current = " + currentSignature); + throw new IllegalArgumentException("Invalid sign"); } return rawValue; } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.4 updated: [MINOR][SQL] Remove signature from Hive thriftserver exception
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new e2911e7c208 [MINOR][SQL] Remove signature from Hive thriftserver exception e2911e7c208 is described below commit e2911e7c208f49f4fb7575bdd33c92e0a3b645a2 Author: Sean Owen AuthorDate: Tue Oct 17 16:10:56 2023 -0700 [MINOR][SQL] Remove signature from Hive thriftserver exception ### What changes were proposed in this pull request? Don't return expected signature to caller in Hive thriftserver exception ### Why are the changes needed? Please see private discussion ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #43402 from srowen/HiveCookieSigner. Authored-by: Sean Owen Signed-off-by: Dongjoon Hyun (cherry picked from commit cf59b1f51c16301f689b4e0f17ba4dbd140e1b19) Signed-off-by: Dongjoon Hyun --- .../src/main/java/org/apache/hive/service/CookieSigner.java| 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/sql/hive-thriftserver/src/main/java/org/apache/hive/service/CookieSigner.java b/sql/hive-thriftserver/src/main/java/org/apache/hive/service/CookieSigner.java index 782e47a6cd9..4b8d2cb1536 100644 --- a/sql/hive-thriftserver/src/main/java/org/apache/hive/service/CookieSigner.java +++ b/sql/hive-thriftserver/src/main/java/org/apache/hive/service/CookieSigner.java @@ -81,8 +81,7 @@ public class CookieSigner { LOG.debug("Signature generated for " + rawValue + " inside verify is " + currentSignature); } if (!MessageDigest.isEqual(originalSignature.getBytes(), currentSignature.getBytes())) { - throw new IllegalArgumentException("Invalid sign, original = " + originalSignature + -" current = " + currentSignature); + throw new IllegalArgumentException("Invalid sign"); } return rawValue; } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.5 updated: [MINOR][SQL] Remove signature from Hive thriftserver exception
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 18599ea750f [MINOR][SQL] Remove signature from Hive thriftserver exception 18599ea750f is described below commit 18599ea750f50e07a910487fb3a871ed69fb9cab Author: Sean Owen AuthorDate: Tue Oct 17 16:10:56 2023 -0700 [MINOR][SQL] Remove signature from Hive thriftserver exception ### What changes were proposed in this pull request? Don't return expected signature to caller in Hive thriftserver exception ### Why are the changes needed? Please see private discussion ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #43402 from srowen/HiveCookieSigner. Authored-by: Sean Owen Signed-off-by: Dongjoon Hyun (cherry picked from commit cf59b1f51c16301f689b4e0f17ba4dbd140e1b19) Signed-off-by: Dongjoon Hyun --- .../src/main/java/org/apache/hive/service/CookieSigner.java| 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/sql/hive-thriftserver/src/main/java/org/apache/hive/service/CookieSigner.java b/sql/hive-thriftserver/src/main/java/org/apache/hive/service/CookieSigner.java index 782e47a6cd9..4b8d2cb1536 100644 --- a/sql/hive-thriftserver/src/main/java/org/apache/hive/service/CookieSigner.java +++ b/sql/hive-thriftserver/src/main/java/org/apache/hive/service/CookieSigner.java @@ -81,8 +81,7 @@ public class CookieSigner { LOG.debug("Signature generated for " + rawValue + " inside verify is " + currentSignature); } if (!MessageDigest.isEqual(originalSignature.getBytes(), currentSignature.getBytes())) { - throw new IllegalArgumentException("Invalid sign, original = " + originalSignature + -" current = " + currentSignature); + throw new IllegalArgumentException("Invalid sign"); } return rawValue; } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (4a0ed9cd725 -> cf59b1f51c1)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 4a0ed9cd725 [SPARK-45577][PYTHON] Fix UserDefinedPythonTableFunctionAnalyzeRunner to pass folded values from named arguments add cf59b1f51c1 [MINOR][SQL] Remove signature from Hive thriftserver exception No new revisions were added by this update. Summary of changes: .../src/main/java/org/apache/hive/service/CookieSigner.java| 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (b10fea96b5b -> 4a0ed9cd725)
This is an automated email from the ASF dual-hosted git repository. ueshin pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from b10fea96b5b [SPARK-45566][PS] Support Pandas-like testing utils for Pandas API on Spark add 4a0ed9cd725 [SPARK-45577][PYTHON] Fix UserDefinedPythonTableFunctionAnalyzeRunner to pass folded values from named arguments No new revisions were added by this update. Summary of changes: python/pyspark/sql/tests/test_udtf.py | 17 +++-- .../execution/python/UserDefinedPythonFunction.scala| 12 2 files changed, 23 insertions(+), 6 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-45566][PS] Support Pandas-like testing utils for Pandas API on Spark
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b10fea96b5b [SPARK-45566][PS] Support Pandas-like testing utils for Pandas API on Spark b10fea96b5b is described below commit b10fea96b5b0fd6c3623b0463d17dc583de3e995 Author: Haejoon Lee AuthorDate: Wed Oct 18 06:59:24 2023 +0900 [SPARK-45566][PS] Support Pandas-like testing utils for Pandas API on Spark ### What changes were proposed in this pull request? This PR proposes to support utility functions `assert_frame_equal`, `assert_series_equal`, and `assert_index_equal` in the Pandas API on Spark to aid users in testing. See [pd.assert_frame_equal](https://pandas.pydata.org/docs/reference/api/pandas.testing.assert_frame_equal.html), [pd.assert_series_equal](https://pandas.pydata.org/docs/reference/api/pandas.testing.assert_series_equal.html), [pd.assert_index_equal](https://pandas.pydata.org/docs/reference/api/pandas.testing.assert_index_equal.html) for more detail. ### Why are the changes needed? These utility functions allow users to efficiently test the equality of `DataFrames`, `Series`, and `Indexes` in the Pandas API on Spark. Ensuring accurate testing helps in maintaining code quality and user trust in the platform. e.g. ```python from pyspark.pandas.testing import assert_frame_equal df1 = spark.createDataFrame([('Alice', 1), ('Bob', 2)], ["name", "age"]) df2 = spark.createDataFrame([('Alice', 1), ('Bob', 2)], ["name", "age"]) assert_frame_equal(df1, df2) ``` ### Does this PR introduce _any_ user-facing change? Yes. Users will now have access to `assert_frame_equal`, `assert_series_equal`, `and assert_index_equal` functions for testing purposes. ### How was this patch tested? Added doctests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43398 from itholic/SPARK-45566. Authored-by: Haejoon Lee Signed-off-by: Hyukjin Kwon --- .../docs/source/reference/pyspark.pandas/index.rst | 1 + .../pyspark.pandas/{index.rst => testing.rst} | 31 +- python/pyspark/pandas/testing.py | 328 + 3 files changed, 341 insertions(+), 19 deletions(-) diff --git a/python/docs/source/reference/pyspark.pandas/index.rst b/python/docs/source/reference/pyspark.pandas/index.rst index 31fc95e95f1..0d45ba64b4d 100644 --- a/python/docs/source/reference/pyspark.pandas/index.rst +++ b/python/docs/source/reference/pyspark.pandas/index.rst @@ -38,3 +38,4 @@ This page gives an overview of all public pandas API on Spark. resampling ml extensions + testing diff --git a/python/docs/source/reference/pyspark.pandas/index.rst b/python/docs/source/reference/pyspark.pandas/testing.rst similarity index 69% copy from python/docs/source/reference/pyspark.pandas/index.rst copy to python/docs/source/reference/pyspark.pandas/testing.rst index 31fc95e95f1..67589fb019a 100644 --- a/python/docs/source/reference/pyspark.pandas/index.rst +++ b/python/docs/source/reference/pyspark.pandas/testing.rst @@ -16,25 +16,18 @@ under the License. -=== -Pandas API on Spark -=== +.. _api.testing: -This page gives an overview of all public pandas API on Spark. +=== +Testing +=== +.. currentmodule:: pyspark.pandas -.. note:: - pandas API on Spark follows the API specifications of latest pandas release. +Assertion functions +--- +.. autosummary:: + :toctree: api/ -.. toctree:: - :maxdepth: 2 - - io - general_functions - series - frame - indexing - window - groupby - resampling - ml - extensions + testing.assert_frame_equal + testing.assert_series_equal + testing.assert_index_equal diff --git a/python/pyspark/pandas/testing.py b/python/pyspark/pandas/testing.py new file mode 100644 index 000..49ec6081338 --- /dev/null +++ b/python/pyspark/pandas/testing.py @@ -0,0 +1,328 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitat
[spark] branch master updated: [MINOR][DOCS] Fix one typo
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f1ae56b152b [MINOR][DOCS] Fix one typo f1ae56b152b is described below commit f1ae56b152bdf19246d698b65e553790ad54306b Author: Ruifeng Zheng AuthorDate: Tue Oct 17 13:49:41 2023 -0500 [MINOR][DOCS] Fix one typo ### What changes were proposed in this pull request? Fix one typo ### Why are the changes needed? for doc ### Does this PR introduce _any_ user-facing change? yes ### How was this patch tested? I didn't find other similar typos in this page, so only one fix ### Was this patch authored or co-authored using generative AI tooling? no Closes #43401 from zhengruifeng/minor_typo_connect_overview. Authored-by: Ruifeng Zheng Signed-off-by: Sean Owen --- docs/spark-connect-overview.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/spark-connect-overview.md b/docs/spark-connect-overview.md index 82d84f39ca1..c7bad0994a8 100644 --- a/docs/spark-connect-overview.md +++ b/docs/spark-connect-overview.md @@ -261,7 +261,7 @@ spark-connect-repl --host myhost.com --port 443 --token ABCDEFG The supported list of CLI arguments may be found [here](https://github.com/apache/spark/blob/master/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClientParser.scala#L48). - Configure programmatically with a connection ctring + Configure programmatically with a connection string The connection may also be programmatically created using _SparkSession#builder_ as in this example: {% highlight scala %} - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-45564][SQL] Simplify 'DataFrameStatFunctions.bloomFilter' with 'BloomFilterAggregate' expression
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 922844fff65 [SPARK-45564][SQL] Simplify 'DataFrameStatFunctions.bloomFilter' with 'BloomFilterAggregate' expression 922844fff65 is described below commit 922844fff65ac38fd93bd0c914dcc7e5cf879996 Author: Ruifeng Zheng AuthorDate: Tue Oct 17 10:11:36 2023 -0500 [SPARK-45564][SQL] Simplify 'DataFrameStatFunctions.bloomFilter' with 'BloomFilterAggregate' expression ### What changes were proposed in this pull request? Simplify 'DataFrameStatFunctions.bloomFilter' function with 'BloomFilterAggregate' expression ### Why are the changes needed? existing implementation was based on RDD, and it can be simplified by dataframe operations ### Does this PR introduce _any_ user-facing change? when the input parameters or datatypes are invalid, throw `AnalysisException` instead of `IllegalArgumentException` ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #43391 from zhengruifeng/sql_reimpl_stat_bloomFilter. Authored-by: Ruifeng Zheng Signed-off-by: Sean Owen --- .../apache/spark/sql/DataFrameStatFunctions.scala | 68 +- 1 file changed, 14 insertions(+), 54 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala index 9d4f83c53a3..de3b100cd6a 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala @@ -23,6 +23,8 @@ import scala.jdk.CollectionConverters._ import org.apache.spark.annotation.Stable import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.Literal +import org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregate import org.apache.spark.sql.execution.stat._ import org.apache.spark.sql.functions.col import org.apache.spark.sql.types._ @@ -535,7 +537,7 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) { * @since 2.0.0 */ def bloomFilter(colName: String, expectedNumItems: Long, fpp: Double): BloomFilter = { -buildBloomFilter(Column(colName), expectedNumItems, -1L, fpp) +bloomFilter(Column(colName), expectedNumItems, fpp) } /** @@ -547,7 +549,8 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) { * @since 2.0.0 */ def bloomFilter(col: Column, expectedNumItems: Long, fpp: Double): BloomFilter = { -buildBloomFilter(col, expectedNumItems, -1L, fpp) +val numBits = BloomFilter.optimalNumOfBits(expectedNumItems, fpp) +bloomFilter(col, expectedNumItems, numBits) } /** @@ -559,7 +562,7 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) { * @since 2.0.0 */ def bloomFilter(colName: String, expectedNumItems: Long, numBits: Long): BloomFilter = { -buildBloomFilter(Column(colName), expectedNumItems, numBits, Double.NaN) +bloomFilter(Column(colName), expectedNumItems, numBits) } /** @@ -571,57 +574,14 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) { * @since 2.0.0 */ def bloomFilter(col: Column, expectedNumItems: Long, numBits: Long): BloomFilter = { -buildBloomFilter(col, expectedNumItems, numBits, Double.NaN) - } - - private def buildBloomFilter(col: Column, expectedNumItems: Long, - numBits: Long, - fpp: Double): BloomFilter = { -val singleCol = df.select(col) -val colType = singleCol.schema.head.dataType - -require(colType == StringType || colType.isInstanceOf[IntegralType], - s"Bloom filter only supports string type and integral types, but got $colType.") - -val updater: (BloomFilter, InternalRow) => Unit = colType match { - // For string type, we can get bytes of our `UTF8String` directly, and call the `putBinary` - // instead of `putString` to avoid unnecessary conversion. - case StringType => (filter, row) => filter.putBinary(row.getUTF8String(0).getBytes) - case ByteType => (filter, row) => filter.putLong(row.getByte(0)) - case ShortType => (filter, row) => filter.putLong(row.getShort(0)) - case IntegerType => (filter, row) => filter.putLong(row.getInt(0)) - case LongType => (filter, row) => filter.putLong(row.getLong(0)) - case _ => -throw new IllegalArgumentException( - s"Bloom filter only supports string type and integral types, " + -s"and does not support type $colType." -) -} - - singleCol.queryExecution.toRdd.treeAggregate(null.asInstanceOf[BloomFilter]
[spark] branch branch-3.3 updated: [SPARK-45568][TESTS] Fix flaky WholeStageCodegenSparkSubmitSuite
This is an automated email from the ASF dual-hosted git repository. yao pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new 8cd3c1a9c1c [SPARK-45568][TESTS] Fix flaky WholeStageCodegenSparkSubmitSuite 8cd3c1a9c1c is described below commit 8cd3c1a9c1c336155fe09728171aba84ef55ef2d Author: Kent Yao AuthorDate: Tue Oct 17 22:19:18 2023 +0800 [SPARK-45568][TESTS] Fix flaky WholeStageCodegenSparkSubmitSuite ### What changes were proposed in this pull request? WholeStageCodegenSparkSubmitSuite is [flaky](https://github.com/apache/spark/actions/runs/6479534195/job/17593342589) because SHUFFLE_PARTITIONS(200) creates 200 reducers for one total core and improper stop progress causes executor launcher reties. The heavy load and reties might result in timeout test failures. ### Why are the changes needed? CI robustness ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing WholeStageCodegenSparkSubmitSuite ### Was this patch authored or co-authored using generative AI tooling? no Closes #43394 from yaooqinn/SPARK-45568. Authored-by: Kent Yao Signed-off-by: Kent Yao (cherry picked from commit f00ec39542a5f9ac75d8c24f0f04a7be703c8d7c) Signed-off-by: Kent Yao --- .../WholeStageCodegenSparkSubmitSuite.scala| 57 -- 1 file changed, 30 insertions(+), 27 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala index 73c4e4c3e1e..06ba8fb772a 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala @@ -26,6 +26,7 @@ import org.apache.spark.deploy.SparkSubmitTestUtils import org.apache.spark.internal.Logging import org.apache.spark.sql.{QueryTest, Row, SparkSession} import org.apache.spark.sql.functions.{array, col, count, lit} +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.IntegerType import org.apache.spark.unsafe.Platform import org.apache.spark.util.ResetSystemProperties @@ -68,39 +69,41 @@ class WholeStageCodegenSparkSubmitSuite extends SparkSubmitTestUtils object WholeStageCodegenSparkSubmitSuite extends Assertions with Logging { - var spark: SparkSession = _ - def main(args: Array[String]): Unit = { TestUtils.configTestLog4j2("INFO") -spark = SparkSession.builder().getOrCreate() +val spark = SparkSession.builder() + .config(SQLConf.SHUFFLE_PARTITIONS.key, "2") + .getOrCreate() + +try { + // Make sure the test is run where the driver and the executors uses different object layouts + val driverArrayHeaderSize = Platform.BYTE_ARRAY_OFFSET + val executorArrayHeaderSize = +spark.sparkContext.range(0, 1).map(_ => Platform.BYTE_ARRAY_OFFSET).collect().head + assert(driverArrayHeaderSize > executorArrayHeaderSize) -// Make sure the test is run where the driver and the executors uses different object layouts -val driverArrayHeaderSize = Platform.BYTE_ARRAY_OFFSET -val executorArrayHeaderSize = - spark.sparkContext.range(0, 1).map(_ => Platform.BYTE_ARRAY_OFFSET).collect.head.toInt -assert(driverArrayHeaderSize > executorArrayHeaderSize) + val df = spark.range(71773).select((col("id") % lit(10)).cast(IntegerType) as "v") +.groupBy(array(col("v"))).agg(count(col("*"))) + val plan = df.queryExecution.executedPlan + assert(plan.exists(_.isInstanceOf[WholeStageCodegenExec])) -val df = spark.range(71773).select((col("id") % lit(10)).cast(IntegerType) as "v") - .groupBy(array(col("v"))).agg(count(col("*"))) -val plan = df.queryExecution.executedPlan -assert(plan.exists(_.isInstanceOf[WholeStageCodegenExec])) + val expectedAnswer = +Row(Array(0), 7178) :: + Row(Array(1), 7178) :: + Row(Array(2), 7178) :: + Row(Array(3), 7177) :: + Row(Array(4), 7177) :: + Row(Array(5), 7177) :: + Row(Array(6), 7177) :: + Row(Array(7), 7177) :: + Row(Array(8), 7177) :: + Row(Array(9), 7177) :: Nil -val expectedAnswer = - Row(Array(0), 7178) :: -Row(Array(1), 7178) :: -Row(Array(2), 7178) :: -Row(Array(3), 7177) :: -Row(Array(4), 7177) :: -Row(Array(5), 7177) :: -Row(Array(6), 7177) :: -Row(Array(7), 7177) :: -Row(Array(8), 7177) :: -Row(Array(9), 7177) :: Nil -val result = df.collect -QueryTest.sameRows(result.toSeq, expectedAnswer) match { - case Some(errMs
[spark] branch branch-3.4 updated: [SPARK-45568][TESTS] Fix flaky WholeStageCodegenSparkSubmitSuite
This is an automated email from the ASF dual-hosted git repository. yao pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 03b7f7d71bf [SPARK-45568][TESTS] Fix flaky WholeStageCodegenSparkSubmitSuite 03b7f7d71bf is described below commit 03b7f7d71bf638b470b119b09c882253d32945a5 Author: Kent Yao AuthorDate: Tue Oct 17 22:19:18 2023 +0800 [SPARK-45568][TESTS] Fix flaky WholeStageCodegenSparkSubmitSuite ### What changes were proposed in this pull request? WholeStageCodegenSparkSubmitSuite is [flaky](https://github.com/apache/spark/actions/runs/6479534195/job/17593342589) because SHUFFLE_PARTITIONS(200) creates 200 reducers for one total core and improper stop progress causes executor launcher reties. The heavy load and reties might result in timeout test failures. ### Why are the changes needed? CI robustness ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing WholeStageCodegenSparkSubmitSuite ### Was this patch authored or co-authored using generative AI tooling? no Closes #43394 from yaooqinn/SPARK-45568. Authored-by: Kent Yao Signed-off-by: Kent Yao (cherry picked from commit f00ec39542a5f9ac75d8c24f0f04a7be703c8d7c) Signed-off-by: Kent Yao --- .../WholeStageCodegenSparkSubmitSuite.scala| 57 -- 1 file changed, 30 insertions(+), 27 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala index 73c4e4c3e1e..06ba8fb772a 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala @@ -26,6 +26,7 @@ import org.apache.spark.deploy.SparkSubmitTestUtils import org.apache.spark.internal.Logging import org.apache.spark.sql.{QueryTest, Row, SparkSession} import org.apache.spark.sql.functions.{array, col, count, lit} +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.IntegerType import org.apache.spark.unsafe.Platform import org.apache.spark.util.ResetSystemProperties @@ -68,39 +69,41 @@ class WholeStageCodegenSparkSubmitSuite extends SparkSubmitTestUtils object WholeStageCodegenSparkSubmitSuite extends Assertions with Logging { - var spark: SparkSession = _ - def main(args: Array[String]): Unit = { TestUtils.configTestLog4j2("INFO") -spark = SparkSession.builder().getOrCreate() +val spark = SparkSession.builder() + .config(SQLConf.SHUFFLE_PARTITIONS.key, "2") + .getOrCreate() + +try { + // Make sure the test is run where the driver and the executors uses different object layouts + val driverArrayHeaderSize = Platform.BYTE_ARRAY_OFFSET + val executorArrayHeaderSize = +spark.sparkContext.range(0, 1).map(_ => Platform.BYTE_ARRAY_OFFSET).collect().head + assert(driverArrayHeaderSize > executorArrayHeaderSize) -// Make sure the test is run where the driver and the executors uses different object layouts -val driverArrayHeaderSize = Platform.BYTE_ARRAY_OFFSET -val executorArrayHeaderSize = - spark.sparkContext.range(0, 1).map(_ => Platform.BYTE_ARRAY_OFFSET).collect.head.toInt -assert(driverArrayHeaderSize > executorArrayHeaderSize) + val df = spark.range(71773).select((col("id") % lit(10)).cast(IntegerType) as "v") +.groupBy(array(col("v"))).agg(count(col("*"))) + val plan = df.queryExecution.executedPlan + assert(plan.exists(_.isInstanceOf[WholeStageCodegenExec])) -val df = spark.range(71773).select((col("id") % lit(10)).cast(IntegerType) as "v") - .groupBy(array(col("v"))).agg(count(col("*"))) -val plan = df.queryExecution.executedPlan -assert(plan.exists(_.isInstanceOf[WholeStageCodegenExec])) + val expectedAnswer = +Row(Array(0), 7178) :: + Row(Array(1), 7178) :: + Row(Array(2), 7178) :: + Row(Array(3), 7177) :: + Row(Array(4), 7177) :: + Row(Array(5), 7177) :: + Row(Array(6), 7177) :: + Row(Array(7), 7177) :: + Row(Array(8), 7177) :: + Row(Array(9), 7177) :: Nil -val expectedAnswer = - Row(Array(0), 7178) :: -Row(Array(1), 7178) :: -Row(Array(2), 7178) :: -Row(Array(3), 7177) :: -Row(Array(4), 7177) :: -Row(Array(5), 7177) :: -Row(Array(6), 7177) :: -Row(Array(7), 7177) :: -Row(Array(8), 7177) :: -Row(Array(9), 7177) :: Nil -val result = df.collect -QueryTest.sameRows(result.toSeq, expectedAnswer) match { - case Some(errMs
[spark] branch branch-3.5 updated: [SPARK-45568][TESTS] Fix flaky WholeStageCodegenSparkSubmitSuite
This is an automated email from the ASF dual-hosted git repository. yao pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 6a5747d66e5 [SPARK-45568][TESTS] Fix flaky WholeStageCodegenSparkSubmitSuite 6a5747d66e5 is described below commit 6a5747d66e53ed0d934cdd9ca5c9bd9fde6868e6 Author: Kent Yao AuthorDate: Tue Oct 17 22:19:18 2023 +0800 [SPARK-45568][TESTS] Fix flaky WholeStageCodegenSparkSubmitSuite ### What changes were proposed in this pull request? WholeStageCodegenSparkSubmitSuite is [flaky](https://github.com/apache/spark/actions/runs/6479534195/job/17593342589) because SHUFFLE_PARTITIONS(200) creates 200 reducers for one total core and improper stop progress causes executor launcher reties. The heavy load and reties might result in timeout test failures. ### Why are the changes needed? CI robustness ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing WholeStageCodegenSparkSubmitSuite ### Was this patch authored or co-authored using generative AI tooling? no Closes #43394 from yaooqinn/SPARK-45568. Authored-by: Kent Yao Signed-off-by: Kent Yao (cherry picked from commit f00ec39542a5f9ac75d8c24f0f04a7be703c8d7c) Signed-off-by: Kent Yao --- .../WholeStageCodegenSparkSubmitSuite.scala| 57 -- 1 file changed, 30 insertions(+), 27 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala index e253de76221..69145d890fc 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala @@ -26,6 +26,7 @@ import org.apache.spark.deploy.SparkSubmitTestUtils import org.apache.spark.internal.Logging import org.apache.spark.sql.{QueryTest, Row, SparkSession} import org.apache.spark.sql.functions.{array, col, count, lit} +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.IntegerType import org.apache.spark.tags.ExtendedSQLTest import org.apache.spark.unsafe.Platform @@ -70,39 +71,41 @@ class WholeStageCodegenSparkSubmitSuite extends SparkSubmitTestUtils object WholeStageCodegenSparkSubmitSuite extends Assertions with Logging { - var spark: SparkSession = _ - def main(args: Array[String]): Unit = { TestUtils.configTestLog4j2("INFO") -spark = SparkSession.builder().getOrCreate() +val spark = SparkSession.builder() + .config(SQLConf.SHUFFLE_PARTITIONS.key, "2") + .getOrCreate() + +try { + // Make sure the test is run where the driver and the executors uses different object layouts + val driverArrayHeaderSize = Platform.BYTE_ARRAY_OFFSET + val executorArrayHeaderSize = +spark.sparkContext.range(0, 1).map(_ => Platform.BYTE_ARRAY_OFFSET).collect().head + assert(driverArrayHeaderSize > executorArrayHeaderSize) -// Make sure the test is run where the driver and the executors uses different object layouts -val driverArrayHeaderSize = Platform.BYTE_ARRAY_OFFSET -val executorArrayHeaderSize = - spark.sparkContext.range(0, 1).map(_ => Platform.BYTE_ARRAY_OFFSET).collect.head.toInt -assert(driverArrayHeaderSize > executorArrayHeaderSize) + val df = spark.range(71773).select((col("id") % lit(10)).cast(IntegerType) as "v") +.groupBy(array(col("v"))).agg(count(col("*"))) + val plan = df.queryExecution.executedPlan + assert(plan.exists(_.isInstanceOf[WholeStageCodegenExec])) -val df = spark.range(71773).select((col("id") % lit(10)).cast(IntegerType) as "v") - .groupBy(array(col("v"))).agg(count(col("*"))) -val plan = df.queryExecution.executedPlan -assert(plan.exists(_.isInstanceOf[WholeStageCodegenExec])) + val expectedAnswer = +Row(Array(0), 7178) :: + Row(Array(1), 7178) :: + Row(Array(2), 7178) :: + Row(Array(3), 7177) :: + Row(Array(4), 7177) :: + Row(Array(5), 7177) :: + Row(Array(6), 7177) :: + Row(Array(7), 7177) :: + Row(Array(8), 7177) :: + Row(Array(9), 7177) :: Nil -val expectedAnswer = - Row(Array(0), 7178) :: -Row(Array(1), 7178) :: -Row(Array(2), 7178) :: -Row(Array(3), 7177) :: -Row(Array(4), 7177) :: -Row(Array(5), 7177) :: -Row(Array(6), 7177) :: -Row(Array(7), 7177) :: -Row(Array(8), 7177) :: -Row(Array(9), 7177) :: Nil -val result = df.collect -QueryTest.sameRows(result.toSeq, expectedAnswer) match { - case Some(errMsg) =>
[spark] branch master updated: [SPARK-45568][TESTS] Fix flaky WholeStageCodegenSparkSubmitSuite
This is an automated email from the ASF dual-hosted git repository. yao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f00ec39542a [SPARK-45568][TESTS] Fix flaky WholeStageCodegenSparkSubmitSuite f00ec39542a is described below commit f00ec39542a5f9ac75d8c24f0f04a7be703c8d7c Author: Kent Yao AuthorDate: Tue Oct 17 22:19:18 2023 +0800 [SPARK-45568][TESTS] Fix flaky WholeStageCodegenSparkSubmitSuite ### What changes were proposed in this pull request? WholeStageCodegenSparkSubmitSuite is [flaky](https://github.com/apache/spark/actions/runs/6479534195/job/17593342589) because SHUFFLE_PARTITIONS(200) creates 200 reducers for one total core and improper stop progress causes executor launcher reties. The heavy load and reties might result in timeout test failures. ### Why are the changes needed? CI robustness ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing WholeStageCodegenSparkSubmitSuite ### Was this patch authored or co-authored using generative AI tooling? no Closes #43394 from yaooqinn/SPARK-45568. Authored-by: Kent Yao Signed-off-by: Kent Yao --- .../WholeStageCodegenSparkSubmitSuite.scala| 57 -- 1 file changed, 30 insertions(+), 27 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala index e253de76221..69145d890fc 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala @@ -26,6 +26,7 @@ import org.apache.spark.deploy.SparkSubmitTestUtils import org.apache.spark.internal.Logging import org.apache.spark.sql.{QueryTest, Row, SparkSession} import org.apache.spark.sql.functions.{array, col, count, lit} +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.IntegerType import org.apache.spark.tags.ExtendedSQLTest import org.apache.spark.unsafe.Platform @@ -70,39 +71,41 @@ class WholeStageCodegenSparkSubmitSuite extends SparkSubmitTestUtils object WholeStageCodegenSparkSubmitSuite extends Assertions with Logging { - var spark: SparkSession = _ - def main(args: Array[String]): Unit = { TestUtils.configTestLog4j2("INFO") -spark = SparkSession.builder().getOrCreate() +val spark = SparkSession.builder() + .config(SQLConf.SHUFFLE_PARTITIONS.key, "2") + .getOrCreate() + +try { + // Make sure the test is run where the driver and the executors uses different object layouts + val driverArrayHeaderSize = Platform.BYTE_ARRAY_OFFSET + val executorArrayHeaderSize = +spark.sparkContext.range(0, 1).map(_ => Platform.BYTE_ARRAY_OFFSET).collect().head + assert(driverArrayHeaderSize > executorArrayHeaderSize) -// Make sure the test is run where the driver and the executors uses different object layouts -val driverArrayHeaderSize = Platform.BYTE_ARRAY_OFFSET -val executorArrayHeaderSize = - spark.sparkContext.range(0, 1).map(_ => Platform.BYTE_ARRAY_OFFSET).collect.head.toInt -assert(driverArrayHeaderSize > executorArrayHeaderSize) + val df = spark.range(71773).select((col("id") % lit(10)).cast(IntegerType) as "v") +.groupBy(array(col("v"))).agg(count(col("*"))) + val plan = df.queryExecution.executedPlan + assert(plan.exists(_.isInstanceOf[WholeStageCodegenExec])) -val df = spark.range(71773).select((col("id") % lit(10)).cast(IntegerType) as "v") - .groupBy(array(col("v"))).agg(count(col("*"))) -val plan = df.queryExecution.executedPlan -assert(plan.exists(_.isInstanceOf[WholeStageCodegenExec])) + val expectedAnswer = +Row(Array(0), 7178) :: + Row(Array(1), 7178) :: + Row(Array(2), 7178) :: + Row(Array(3), 7177) :: + Row(Array(4), 7177) :: + Row(Array(5), 7177) :: + Row(Array(6), 7177) :: + Row(Array(7), 7177) :: + Row(Array(8), 7177) :: + Row(Array(9), 7177) :: Nil -val expectedAnswer = - Row(Array(0), 7178) :: -Row(Array(1), 7178) :: -Row(Array(2), 7178) :: -Row(Array(3), 7177) :: -Row(Array(4), 7177) :: -Row(Array(5), 7177) :: -Row(Array(6), 7177) :: -Row(Array(7), 7177) :: -Row(Array(8), 7177) :: -Row(Array(9), 7177) :: Nil -val result = df.collect -QueryTest.sameRows(result.toSeq, expectedAnswer) match { - case Some(errMsg) => fail(errMsg) - case _ => + QueryTest.checkAnswer(df, expectedAnswer) +} finally { + spark.s
[spark] branch master updated: [SPARK-45512][CORE][SQL][SS][DSTREAM] Fix compilation warnings related to `other-nullary-override`
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 3b46cc81614 [SPARK-45512][CORE][SQL][SS][DSTREAM] Fix compilation warnings related to `other-nullary-override` 3b46cc81614 is described below commit 3b46cc816143d5bb553e86e8b716c28982cb5748 Author: YangJie AuthorDate: Tue Oct 17 07:34:06 2023 -0500 [SPARK-45512][CORE][SQL][SS][DSTREAM] Fix compilation warnings related to `other-nullary-override` ### What changes were proposed in this pull request? This PR fixes two compilation warnings related to `other-nullary-override` ``` [error] /Users/yangjie01/SourceCode/git/spark-mine-sbt/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/CloseableIterator.scala:36:16: method with a single empty parameter list overrides method hasNext in trait Iterator defined without a parameter list [quickfixable] [error] Applicable -Wconf / nowarn filters for this fatal warning: msg=, cat=other-nullary-override, site=org.apache.spark.sql.connect.client.WrappedCloseableIterator [error] override def hasNext(): Boolean = innerIterator.hasNext [error]^ [error] /Users/yangjie01/SourceCode/git/spark-mine-sbt/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ExecutePlanResponseReattachableIterator.scala:136:16: method without a parameter list overrides method hasNext in class WrappedCloseableIterator defined with a single empty parameter list [quickfixable] [error] Applicable -Wconf / nowarn filters for this fatal warning: msg=, cat=other-nullary-override, site=org.apache.spark.sql.connect.client.ExecutePlanResponseReattachableIterator [error] override def hasNext: Boolean = synchronized { [error]^ [error] /Users/yangjie01/SourceCode/git/spark-mine-sbt/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/GrpcExceptionConverter.scala:73:20: method without a parameter list overrides method hasNext in class WrappedCloseableIterator defined with a single empty parameter list [quickfixable] [error] Applicable -Wconf / nowarn filters for this fatal warning: msg=, cat=other-nullary-override, site=org.apache.spark.sql.connect.client.GrpcExceptionConverter.convertIterator [error] override def hasNext: Boolean = { [error]^ [error] /Users/yangjie01/SourceCode/git/spark-mine-sbt/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/GrpcRetryHandler.scala:77:18: method without a parameter list overrides method next in class WrappedCloseableIterator defined with a single empty parameter list [quickfixable] [error] Applicable -Wconf / nowarn filters for this fatal warning: msg=, cat=other-nullary-override, site=org.apache.spark.sql.connect.client.GrpcRetryHandler.RetryIterator [error] override def next: U = { [error] ^ [error] /Users/yangjie01/SourceCode/git/spark-mine-sbt/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/GrpcRetryHandler.scala:81:18: method without a parameter list overrides method hasNext in class WrappedCloseableIterator defined with a single empty parameter list [quickfixable] [error] Applicable -Wconf / nowarn filters for this fatal warning: msg=, cat=other-nullary-override, site=org.apache.spark.sql.connect.client.GrpcRetryHandler.RetryIterator [error] override def hasNext: Boolean = { [error] ``` and removes the corresponding suppression rules from the compilation options ``` "-Wconf:cat=other-nullary-override:wv", ``` On the other hand, the code corresponding to the following three suppression rules no longer exists, so the corresponding suppression rules were also cleaned up in this pr. ``` "-Wconf:cat=lint-multiarg-infix:wv", "-Wconf:msg=method with a single empty parameter list overrides method without any parameter list:s", "-Wconf:msg=method without a parameter list overrides a method with a single empty one:s", ``` ### Why are the changes needed? Code clean up. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #43332 from LuciferYang/other-nullary-override. Lead-authored-by: YangJie Co-authored-by: yangjie01 Signed-off-by: Sean Owen --- .../org/apache/spark/sql/avro/AvroRowReaderSuite.scala | 10 +- .../spark/sql/connect/client/CloseableIterator.scala | 2 +- .../ExecutePlanResponseReattachableIterator.scala | 4 ++-- .../spark/sql/connect/client/GrpcRetryHandler.scala| 2 +- ..
Re: [PR] update the canonical link, due to a change in some addresses in the latest version of the document [spark-website]
panbingkun commented on PR #483: URL: https://github.com/apache/spark-website/pull/483#issuecomment-1766288284 cc @allanf-db @HyukjinKwon @zhengruifeng @allisonwang-db @srowen -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[PR] update the canonical link, due to a change in some addresses in the latest version of the document [spark-website]
panbingkun opened a new pull request, #483: URL: https://github.com/apache/spark-website/pull/483 The pr is followup https://github.com/apache/spark-website/pull/482. https://github.com/apache/spark-website/pull/482#issuecomment-1765322679 As discussed above, due to changes in some document addresses after version `3.3.0`, `the canonical link` is incorrect. We are now correcting it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-45572][PS][DOCS] Enable doctest of Frame.transpose
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7af95b49b89 [SPARK-45572][PS][DOCS] Enable doctest of Frame.transpose 7af95b49b89 is described below commit 7af95b49b89e556c57eb4a0b3ac476c8051c11de Author: Ruifeng Zheng AuthorDate: Tue Oct 17 20:47:27 2023 +0900 [SPARK-45572][PS][DOCS] Enable doctest of Frame.transpose ### What changes were proposed in this pull request? Enable doctest of Frame.transpose ### Why are the changes needed? for better test coverage ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #43399 from zhengruifeng/ps_enable_transpose_doctest. Authored-by: Ruifeng Zheng Signed-off-by: Hyukjin Kwon --- python/pyspark/pandas/frame.py | 14 ++ 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py index 7d93af0485f..8b20abf9652 100644 --- a/python/pyspark/pandas/frame.py +++ b/python/pyspark/pandas/frame.py @@ -2654,8 +2654,6 @@ defaultdict(, {'col..., 'col...})] psdf._to_internal_pandas(), self.to_latex, pd.DataFrame.to_latex, args ) -# TODO: enable doctests once we drop Spark 2.3.x (due to type coercion logic -# when creating arrays) def transpose(self) -> "DataFrame": """ Transpose index and columns. @@ -2707,8 +2705,8 @@ defaultdict(, {'col..., 'col...})] 0 1 3 1 2 4 ->>> df1_transposed = df1.T.sort_index() # doctest: +SKIP ->>> df1_transposed # doctest: +SKIP +>>> df1_transposed = df1.T.sort_index() +>>> df1_transposed 0 1 col1 1 2 col2 3 4 @@ -2720,7 +2718,7 @@ defaultdict(, {'col..., 'col...})] col1int64 col2int64 dtype: object ->>> df1_transposed.dtypes # doctest: +SKIP +>>> df1_transposed.dtypes 0int64 1int64 dtype: object @@ -2736,8 +2734,8 @@ defaultdict(, {'col..., 'col...})] 09.5 0 12 18.0 0 22 ->>> df2_transposed = df2.T.sort_index() # doctest: +SKIP ->>> df2_transposed # doctest: +SKIP +>>> df2_transposed = df2.T.sort_index() +>>> df2_transposed 0 1 age12.0 22.0 kids0.0 0.0 @@ -2752,7 +2750,7 @@ defaultdict(, {'col..., 'col...})] ageint64 dtype: object ->>> df2_transposed.dtypes # doctest: +SKIP +>>> df2_transposed.dtypes 0float64 1float64 dtype: object - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-45550][PS] Remove deprecated APIs from Pandas API on Spark
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 5280d492ad6 [SPARK-45550][PS] Remove deprecated APIs from Pandas API on Spark 5280d492ad6 is described below commit 5280d492ad636782ca910a3c0bf0f0cb5bce2223 Author: Haejoon Lee AuthorDate: Tue Oct 17 19:40:12 2023 +0800 [SPARK-45550][PS] Remove deprecated APIs from Pandas API on Spark ### What changes were proposed in this pull request? This PR proposes to remove deprecated APIs from Pandas API on Spark: - Remove `DataFrame.to_spark_io`, use `DataFrame.spark.to_spark_io` instead. - Remove `(Index|Series).is_monotonic`, use `(Index|Series).is_monotonic_increasing` instead. ### Why are the changes needed? To cleanup API surface ### Does this PR introduce _any_ user-facing change? Remove APIs no longer available from Spark 4.x. ### How was this patch tested? The existing CI should pass ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43384 from itholic/SPARK-45550. Authored-by: Haejoon Lee Signed-off-by: Ruifeng Zheng --- .../source/migration_guide/pyspark_upgrade.rst | 2 + .../docs/source/reference/pyspark.pandas/frame.rst | 1 - .../source/reference/pyspark.pandas/indexing.rst | 1 - python/docs/source/reference/pyspark.pandas/io.rst | 2 +- .../source/reference/pyspark.pandas/series.rst | 1 - python/pyspark/pandas/base.py | 83 -- python/pyspark/pandas/frame.py | 23 -- python/pyspark/pandas/generic.py | 1 - python/pyspark/pandas/indexing.py | 4 +- python/pyspark/pandas/namespace.py | 7 +- python/pyspark/pandas/spark/accessors.py | 7 +- .../pandas/tests/test_dataframe_spark_io.py| 4 +- 12 files changed, 13 insertions(+), 123 deletions(-) diff --git a/python/docs/source/migration_guide/pyspark_upgrade.rst b/python/docs/source/migration_guide/pyspark_upgrade.rst index d081275dc83..933fa936f70 100644 --- a/python/docs/source/migration_guide/pyspark_upgrade.rst +++ b/python/docs/source/migration_guide/pyspark_upgrade.rst @@ -51,6 +51,8 @@ Upgrading from PySpark 3.5 to 4.0 * In Spark 4.0, ``Index.asi8`` has been removed from pandas API on Spark, use ``Index.astype`` instead. * In Spark 4.0, ``Index.is_type_compatible`` has been removed from pandas API on Spark, use ``Index.isin`` instead. * In Spark 4.0, ``col_space`` parameter from ``DataFrame.to_latex`` and ``Series.to_latex`` has been removed from pandas API on Spark. +* In Spark 4.0, ``DataFrame.to_spark_io`` has been removed from pandas API on Spark, use ``DataFrame.spark.to_spark_io`` instead. +* In Spark 4.0, ``Series.is_monotonic`` and ``Index.is_monotonic`` have been removed from pandas API on Spark, use ``Series.is_monotonic_increasing`` or ``Index.is_monotonic_increasing`` instead respectively. Upgrading from PySpark 3.3 to 3.4 diff --git a/python/docs/source/reference/pyspark.pandas/frame.rst b/python/docs/source/reference/pyspark.pandas/frame.rst index a22078f86e2..911999b56be 100644 --- a/python/docs/source/reference/pyspark.pandas/frame.rst +++ b/python/docs/source/reference/pyspark.pandas/frame.rst @@ -276,7 +276,6 @@ Serialization / IO / Conversion DataFrame.to_table DataFrame.to_delta DataFrame.to_parquet - DataFrame.to_spark_io DataFrame.to_csv DataFrame.to_orc DataFrame.to_pandas diff --git a/python/docs/source/reference/pyspark.pandas/indexing.rst b/python/docs/source/reference/pyspark.pandas/indexing.rst index d6be57ee9c8..08f5e224e06 100644 --- a/python/docs/source/reference/pyspark.pandas/indexing.rst +++ b/python/docs/source/reference/pyspark.pandas/indexing.rst @@ -36,7 +36,6 @@ Properties .. autosummary:: :toctree: api/ - Index.is_monotonic Index.is_monotonic_increasing Index.is_monotonic_decreasing Index.is_unique diff --git a/python/docs/source/reference/pyspark.pandas/io.rst b/python/docs/source/reference/pyspark.pandas/io.rst index b39a4e8778a..118dd49a4ad 100644 --- a/python/docs/source/reference/pyspark.pandas/io.rst +++ b/python/docs/source/reference/pyspark.pandas/io.rst @@ -69,7 +69,7 @@ Generic Spark I/O :toctree: api/ read_spark_io - DataFrame.to_spark_io + DataFrame.spark.to_spark_io Flat File / CSV --- diff --git a/python/docs/source/reference/pyspark.pandas/series.rst b/python/docs/source/reference/pyspark.pandas/series.rst index 7b658d45d4b..eb4a499c054 100644 --- a/python/docs/source/reference/pyspark.pandas/series.rst +++ b/python/docs/source/reference/pyspark.pandas/series.rst @@ -170,7 +170,6 @@ Computations / Descriptive Stats Series.value_co
[spark] branch master updated: [SPARK-45562][SQL] XML: Make 'rowTag' a required option
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4d63ca6394f [SPARK-45562][SQL] XML: Make 'rowTag' a required option 4d63ca6394f is described below commit 4d63ca6394fe8692e1f9bceb93606a86b88b5dc1 Author: Sandip Agarwala <131817656+sandip...@users.noreply.github.com> AuthorDate: Tue Oct 17 20:38:02 2023 +0900 [SPARK-45562][SQL] XML: Make 'rowTag' a required option ### What changes were proposed in this pull request? User can specify `rowTag` option that is the name of the XML element that maps to a `DataFrame Row`. A non-existent `rowTag` will not infer any schema or generate any `DataFrame` rows. Currently, not specifying `rowTag` option results in picking up its default value of `ROW`, which won't match a real XML element in most scenarios. This results in an empty dataframe and confuse new users. This PR makes `rowTag` a required option for both read and write. XML built-in functions (from_xml/schema_of_xml) ignore `rowTag` option. ### Why are the changes needed? See above ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New unit tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #43389 from sandip-db/xml-rowTag. Authored-by: Sandip Agarwala <131817656+sandip...@users.noreply.github.com> Signed-off-by: Hyukjin Kwon --- .../apache/spark/sql/catalyst/xml/XmlOptions.scala | 4 +- .../execution/datasources/xml/XmlFileFormat.scala | 2 + .../execution/datasources/xml/JavaXmlSuite.java| 10 +- .../sql/execution/datasources/xml/XmlSuite.scala | 125 +++-- .../xml/parsers/StaxXmlGeneratorSuite.scala| 4 +- 5 files changed, 103 insertions(+), 42 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlOptions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlOptions.scala index d0cfff87279..0dedbec58e1 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlOptions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlOptions.scala @@ -63,8 +63,8 @@ private[sql] class XmlOptions( } val compressionCodec = parameters.get(COMPRESSION).map(CompressionCodecs.getCodecClassName) - val rowTag = parameters.getOrElse(ROW_TAG, XmlOptions.DEFAULT_ROW_TAG) - require(rowTag.nonEmpty, s"'$ROW_TAG' option should not be empty string.") + val rowTag = parameters.getOrElse(ROW_TAG, XmlOptions.DEFAULT_ROW_TAG).trim + require(rowTag.nonEmpty, s"'$ROW_TAG' option should not be an empty string.") require(!rowTag.startsWith("<") && !rowTag.endsWith(">"), s"'$ROW_TAG' should not include angle brackets") val rootTag = parameters.getOrElse(ROOT_TAG, XmlOptions.DEFAULT_ROOT_TAG) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XmlFileFormat.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XmlFileFormat.scala index baacf7f0748..4342711b00f 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XmlFileFormat.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XmlFileFormat.scala @@ -42,6 +42,8 @@ class XmlFileFormat extends TextBasedFileFormat with DataSourceRegister { def getXmlOptions( sparkSession: SparkSession, parameters: Map[String, String]): XmlOptions = { +val rowTagOpt = parameters.get(XmlOptions.ROW_TAG) +require(rowTagOpt.isDefined, s"'${XmlOptions.ROW_TAG}' option is required.") new XmlOptions(parameters, sparkSession.sessionState.conf.sessionLocalTimeZone, sparkSession.sessionState.conf.columnNameOfCorruptRecord) diff --git a/sql/core/src/test/java/test/org/apache/spark/sql/execution/datasources/xml/JavaXmlSuite.java b/sql/core/src/test/java/test/org/apache/spark/sql/execution/datasources/xml/JavaXmlSuite.java index b3f39180843..c773459dc4c 100644 --- a/sql/core/src/test/java/test/org/apache/spark/sql/execution/datasources/xml/JavaXmlSuite.java +++ b/sql/core/src/test/java/test/org/apache/spark/sql/execution/datasources/xml/JavaXmlSuite.java @@ -82,7 +82,7 @@ public final class JavaXmlSuite { public void testXmlParser() { Map options = new HashMap<>(); options.put("rowTag", booksFileTag); -Dataset df = spark.read().options(options).format("xml").load(booksFile); +Dataset df = spark.read().options(options).xml(booksFile); String prefix = XmlOptions.DEFAULT_ATTRIBUTE_PREFIX(); long result = df.select(prefix + "id").count(); Assertions.assertEquals(result, numBooks); @@ -92,7 +92,7 @@ public final class JavaXmlSuite { public voi
[spark] branch master updated: [SPARK-45485][CONNECT] User agent improvements: Use SPARK_CONNECT_USER_AGENT env variable and include environment specific attributes
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new bd627503f96 [SPARK-45485][CONNECT] User agent improvements: Use SPARK_CONNECT_USER_AGENT env variable and include environment specific attributes bd627503f96 is described below commit bd627503f96758edae028a269b7a6ac203a8d941 Author: Robert Dillitz AuthorDate: Tue Oct 17 19:32:29 2023 +0900 [SPARK-45485][CONNECT] User agent improvements: Use SPARK_CONNECT_USER_AGENT env variable and include environment specific attributes ### What changes were proposed in this pull request? With this PR similar to the[ Python client](https://github.com/apache/spark/blob/2cc1ee4d3a05a641d7a245f015ef824d8f7bae8b/python/pyspark/sql/connect/client/core.py#L284) the Scala client's user agent now: 1. Uses the SPARK_CONNECT_USER_AGENT environment variable if set 2. Includes the OS, JVM version, Scala version, and Spark version ### Why are the changes needed? Feature parity with the Python client. Better observability of Scala Spark Connect clients. ### Does this PR introduce _any_ user-facing change? By default, the user agent string now contains more useful information. Before: `_SPARK_CONNECT_SCALA` After: `_SPARK_CONNECT_SCALA spark/4.0.0-SNAPSHOT scala/2.13.12 jvm/17.0.8.1 os/darwin` ### How was this patch tested? Tests added & adjusted. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43313 from dillitz/user-agent-improvements. Authored-by: Robert Dillitz Signed-off-by: Hyukjin Kwon --- .../SparkConnectClientBuilderParseTestSuite.scala | 8 +++--- .../connect/client/SparkConnectClientSuite.scala | 13 -- .../sql/connect/client/SparkConnectClient.scala| 30 +++--- 3 files changed, 42 insertions(+), 9 deletions(-) diff --git a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/SparkConnectClientBuilderParseTestSuite.scala b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/SparkConnectClientBuilderParseTestSuite.scala index e1d4a18d0ff..68d2e86b19d 100644 --- a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/SparkConnectClientBuilderParseTestSuite.scala +++ b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/SparkConnectClientBuilderParseTestSuite.scala @@ -47,7 +47,7 @@ class SparkConnectClientBuilderParseTestSuite extends ConnectFunSuite { argumentTest("token", "azbycxdwev1234567890", _.token.get) argumentTest("user_id", "U1238", _.userId.get) argumentTest("user_name", "alice", _.userName.get) - argumentTest("user_agent", "MY APP", _.userAgent) + argumentTest("user_agent", "robert", _.userAgent.split(" ")(0)) argumentTest("session_id", UUID.randomUUID().toString, _.sessionId.get) test("Argument - remote") { @@ -95,7 +95,7 @@ class SparkConnectClientBuilderParseTestSuite extends ConnectFunSuite { "Q12") assert(builder.host === "localhost") assert(builder.port === 1507) - assert(builder.userAgent === "U8912") + assert(builder.userAgent.contains("U8912")) assert(!builder.sslEnabled) assert(builder.token.isEmpty) assert(builder.userId.contains("Q12")) @@ -113,7 +113,7 @@ class SparkConnectClientBuilderParseTestSuite extends ConnectFunSuite { "cluster=mycl") assert(builder.host === "localhost") assert(builder.port === 15002) - assert(builder.userAgent == "_SPARK_CONNECT_SCALA") + assert(builder.userAgent.contains("_SPARK_CONNECT_SCALA")) assert(builder.sslEnabled) assert(builder.token.isEmpty) assert(builder.userId.isEmpty) @@ -124,7 +124,7 @@ class SparkConnectClientBuilderParseTestSuite extends ConnectFunSuite { val builder = build("--token", "thisismysecret") assert(builder.host === "localhost") assert(builder.port === 15002) - assert(builder.userAgent === "_SPARK_CONNECT_SCALA") + assert(builder.userAgent.contains("_SPARK_CONNECT_SCALA")) assert(builder.sslEnabled) assert(builder.token.contains("thisismysecret")) assert(builder.userId.isEmpty) diff --git a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/SparkConnectClientSuite.scala b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/SparkConnectClientSuite.scala index a3df39da4a8..b3ff4eb0bb2 100644 --- a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/SparkConnectClientSuite.scala +++ b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/SparkConnectClientSuite.scala @@ -270,7 +270,7 @@ class SparkCon
[spark] branch master updated: [SPARK-45567][CONNECT] Remove redundant if in org.apache.spark.sql.connect.execution.ExecuteGrpcResponseSender#run
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c7b20b5dbfb [SPARK-45567][CONNECT] Remove redundant if in org.apache.spark.sql.connect.execution.ExecuteGrpcResponseSender#run c7b20b5dbfb is described below commit c7b20b5dbfbe7eb89f77b3f49854c90b6640a9c3 Author: zhaomin AuthorDate: Tue Oct 17 19:31:16 2023 +0900 [SPARK-45567][CONNECT] Remove redundant if in org.apache.spark.sql.connect.execution.ExecuteGrpcResponseSender#run ### What changes were proposed in this pull request? remove redundant ```if``` ### Why are the changes needed? it is redundant. https://issues.apache.org/jira/browse/SPARK-45567?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? pass ci. ### Was this patch authored or co-authored using generative AI tooling? No Closes #43395 from zhaomin1423/45567. Authored-by: zhaomin Signed-off-by: Hyukjin Kwon --- .../spark/sql/connect/execution/ExecuteGrpcResponseSender.scala | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteGrpcResponseSender.scala b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteGrpcResponseSender.scala index ba5ecc7a045..115cedfe112 100644 --- a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteGrpcResponseSender.scala +++ b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteGrpcResponseSender.scala @@ -124,11 +124,9 @@ private[connect] class ExecuteGrpcResponseSender[T <: Message]( execute(lastConsumedStreamIndex) } finally { executeHolder.removeGrpcResponseSender(this) -if (!executeHolder.reattachable) { - // Non reattachable executions release here immediately. - // (Reattachable executions release with ReleaseExecute RPC.) - SparkConnectService.executionManager.removeExecuteHolder(executeHolder.key) -} +// Non reattachable executions release here immediately. +// (Reattachable executions release with ReleaseExecute RPC.) + SparkConnectService.executionManager.removeExecuteHolder(executeHolder.key) } } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-45565][UI] Unnecessary JSON.stringify and JSON.parse loop for task list on stage detail
This is an automated email from the ASF dual-hosted git repository. yao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ac70daf7337 [SPARK-45565][UI] Unnecessary JSON.stringify and JSON.parse loop for task list on stage detail ac70daf7337 is described below commit ac70daf7337324d38742034c1d6afc2f0243b600 Author: Kent Yao AuthorDate: Tue Oct 17 18:17:31 2023 +0800 [SPARK-45565][UI] Unnecessary JSON.stringify and JSON.parse loop for task list on stage detail ### What changes were proposed in this pull request? `dataSrc` returns a json value, we don't need to stringify it and parse it back ### Why are the changes needed? performance improvements for UI rendering ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? build and verify the stage page locally ### Was this patch authored or co-authored using generative AI tooling? no Closes #43392 from yaooqinn/SPARK-45565. Authored-by: Kent Yao Signed-off-by: Kent Yao --- core/src/main/resources/org/apache/spark/ui/static/stagepage.js | 6 +- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/core/src/main/resources/org/apache/spark/ui/static/stagepage.js b/core/src/main/resources/org/apache/spark/ui/static/stagepage.js index ad3eca06a0c..4b6b7e219e1 100644 --- a/core/src/main/resources/org/apache/spark/ui/static/stagepage.js +++ b/core/src/main/resources/org/apache/spark/ui/static/stagepage.js @@ -841,11 +841,7 @@ $(document).ready(function () { data.length = totalTasksToShow; } }, -"dataSrc": function (jsons) { - var jsonStr = JSON.stringify(jsons); - var tasksToShow = JSON.parse(jsonStr); - return tasksToShow.aaData; -}, +"dataSrc": (jsons) => jsons.aaData, "error": function (_ignored_jqXHR, _ignored_textStatus, _ignored_errorThrown) { alert("Unable to connect to the server. Looks like the Spark " + "application must have ended. Please Switch to the history UI."); - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.5 updated: [SPARK-45283][CORE][TESTS][3.5] Make StatusTrackerSuite less fragile
This is an automated email from the ASF dual-hosted git repository. yangjie01 pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 22a83caa489 [SPARK-45283][CORE][TESTS][3.5] Make StatusTrackerSuite less fragile 22a83caa489 is described below commit 22a83caa4896a8d03ec7e76b3e7a3bd08930adcb Author: Bo Xiong AuthorDate: Tue Oct 17 18:05:23 2023 +0800 [SPARK-45283][CORE][TESTS][3.5] Make StatusTrackerSuite less fragile ### Why are the changes needed? It's discovered from [Github Actions](https://github.com/xiongbo-sjtu/spark/actions/runs/6270601155/job/17028788767) that StatusTrackerSuite can run into random failures, as shown by the following error message. The proposed fix is to update the unit test to remove the nondeterministic behavior. The fix has been made to the master branch in https://github.com/apache/spark/pull/43194. This PR is meant to patch branch-3.5 only. ``` [info] StatusTrackerSuite: [info] - basic status API usage (99 milliseconds) [info] - getJobIdsForGroup() (56 milliseconds) [info] - getJobIdsForGroup() with takeAsync() (48 milliseconds) [info] - getJobIdsForGroup() with takeAsync() across multiple partitions (58 milliseconds) [info] - getJobIdsForTag() *** FAILED *** (10 seconds, 77 milliseconds) [info] The code passed to eventually never returned normally. Attempted 651 times over 10.00505994401 seconds. Last failure message: Set(3, 2, 1) was not equal to Set(1, 2). (StatusTrackerSuite.scala:148) ``` Full trace can be found [here](https://issues.apache.org/jira/browse/SPARK-45283). ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ``` build/mvn package -DskipTests -pl core build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.StatusTrackerSuite test ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #43388 from xiongbo-sjtu/branch-3.5. Authored-by: Bo Xiong Signed-off-by: yangjie01 --- core/src/test/scala/org/apache/spark/StatusTrackerSuite.scala | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/core/src/test/scala/org/apache/spark/StatusTrackerSuite.scala b/core/src/test/scala/org/apache/spark/StatusTrackerSuite.scala index 0817abbc6a3..9019ea484b3 100644 --- a/core/src/test/scala/org/apache/spark/StatusTrackerSuite.scala +++ b/core/src/test/scala/org/apache/spark/StatusTrackerSuite.scala @@ -140,16 +140,19 @@ class StatusTrackerSuite extends SparkFunSuite with Matchers with LocalSparkCont } sc.removeJobTag("tag1") + // takeAsync() across multiple partitions val thirdJobFuture = sc.parallelize(1 to 1000, 2).takeAsync(999) -val thirdJobId = eventually(timeout(10.seconds)) { - thirdJobFuture.jobIds.head +val thirdJobIds = eventually(timeout(10.seconds)) { + // Wait for the two jobs triggered by takeAsync + thirdJobFuture.jobIds.size should be(2) + thirdJobFuture.jobIds } eventually(timeout(10.seconds)) { sc.statusTracker.getJobIdsForTag("tag1").toSet should be ( Set(firstJobId, secondJobId)) sc.statusTracker.getJobIdsForTag("tag2").toSet should be ( -Set(secondJobId, thirdJobId)) +Set(secondJobId) ++ thirdJobIds) } } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org