(spark) branch master updated: [SPARK-45685][CORE][SQL] Use `LazyList` instead of `Stream`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b92265a98f2 [SPARK-45685][CORE][SQL] Use `LazyList` instead of `Stream` b92265a98f2 is described below commit b92265a98f241b333467a02f4fffc9889ad3e7da Author: yangjie01 AuthorDate: Sun Oct 29 22:43:54 2023 -0700 [SPARK-45685][CORE][SQL] Use `LazyList` instead of `Stream`

### What changes were proposed in this pull request?
This PR changes the code to use `LazyList` instead of `Stream`, because `Stream` has been deprecated since Scala 2.13.0.

- `class Stream`

```scala
@deprecated("Use LazyList (which is fully lazy) instead of Stream (which has a lazy tail only)", "2.13.0")
@SerialVersionUID(3L)
sealed abstract class Stream[+A] extends AbstractSeq[A]
  with LinearSeq[A]
  with LinearSeqOps[A, Stream, Stream[A]]
  with IterableFactoryDefaults[A, Stream]
  with Serializable {
  ...
  @deprecated("The `append` operation has been renamed `lazyAppendedAll`", "2.13.0")
  @inline final def append[B >: A](rest: => IterableOnce[B]): Stream[B] = lazyAppendedAll(rest)
```

- `object Stream`

```scala
@deprecated("Use LazyList (which is fully lazy) instead of Stream (which has a lazy tail only)", "2.13.0")
@SerialVersionUID(3L)
object Stream extends SeqFactory[Stream] {
```

- `type Stream` and value `Stream`

```scala
@deprecated("Use LazyList instead of Stream", "2.13.0")
type Stream[+A] = scala.collection.immutable.Stream[A]

@deprecated("Use LazyList instead of Stream", "2.13.0")
val Stream = scala.collection.immutable.Stream
```

- method `toStream` in trait `IterableOnceOps`

```scala
@deprecated("Use .to(LazyList) instead of .toStream", "2.13.0")
@inline final def toStream: immutable.Stream[A] = to(immutable.Stream)
```

### Why are the changes needed?
Clean up deprecated Scala API usage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43563 from LuciferYang/stream-2-lazylist.
Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- .../input/WholeTextFileInputFormatSuite.scala | 2 +- .../input/WholeTextFileRecordReaderSuite.scala | 2 +- .../spark/storage/FlatmapIteratorSuite.scala | 6 ++-- .../expressions/collectionOperations.scala | 2 +- .../plans/AliasAwareOutputExpression.scala | 4 +-- .../spark/sql/catalyst/plans/QueryPlan.scala | 2 +- .../apache/spark/sql/catalyst/trees/TreeNode.scala | 42 +++--- .../sql/catalyst/plans/LogicalPlanSuite.scala | 4 +-- .../spark/sql/catalyst/trees/TreeNodeSuite.scala | 36 +-- .../spark/sql/RelationalGroupedDataset.scala | 4 +-- .../sql/execution/AliasAwareOutputExpression.scala | 2 +- .../sql/execution/WholeStageCodegenExec.scala | 2 +- .../execution/joins/BroadcastHashJoinExec.scala| 2 +- .../apache/spark/sql/DataFrameAggregateSuite.scala | 2 +- .../spark/sql/DataFrameWindowFunctionsSuite.scala | 2 +- .../scala/org/apache/spark/sql/GenTPCDSData.scala | 13 --- .../apache/spark/sql/GeneratorFunctionSuite.scala | 2 +- .../scala/org/apache/spark/sql/JoinSuite.scala | 4 +-- .../apache/spark/sql/execution/PlannerSuite.scala | 2 +- .../org/apache/spark/sql/execution/SortSuite.scala | 2 +- .../sql/execution/WholeStageCodegenSuite.scala | 4 +-- 21 files changed, 70 insertions(+), 71 deletions(-) diff --git a/core/src/test/scala/org/apache/spark/input/WholeTextFileInputFormatSuite.scala b/core/src/test/scala/org/apache/spark/input/WholeTextFileInputFormatSuite.scala index 417e711e9c0..26c1be259ad 100644 --- a/core/src/test/scala/org/apache/spark/input/WholeTextFileInputFormatSuite.scala +++ b/core/src/test/scala/org/apache/spark/input/WholeTextFileInputFormatSuite.scala @@ -83,6 +83,6 @@ object WholeTextFileInputFormatSuite { private val fileLengths = Array(10, 100, 1000) private val files = fileLengths.zip(fileNames).map { case (upperBound, filename) => -filename -> Stream.continually(testWords.toList.toStream).flatten.take(upperBound).toArray +filename -> LazyList.continually(testWords.toList.to(LazyList)).flatten.take(upperBound).toArray }.toMap } diff --git a/core/src/test/scala/org/apache/spark/input/WholeTextFileRecordReaderSuite.scala b/core/src/test/scala/org/apache/spark/input/WholeTextFileRecordReaderSuite.scala index c833d22b3be..e64ebe2a551 100644 --- a/core/src/test/scala/org/apache/spark/input/WholeTextFileRecordReaderSuite.scala +++ b/core/src/test/scala/org/apache/spark/input/WholeT
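The migration pattern applied in the diff above, as a minimal before/after sketch (plain Scala 2.13; the deprecated forms still compile but emit warnings):

```scala
// Before: Stream is lazy only in its tail (deprecated since Scala 2.13.0).
val before: Stream[Int] = Stream.continually(1).take(3)
val viaToStream = List(1, 2, 3).toStream

// After: LazyList is fully lazy in both head and tail.
val after: LazyList[Int] = LazyList.continually(1).take(3)
val viaTo = List(1, 2, 3).to(LazyList)
```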
(spark) branch master updated: [SPARK-45707][SQL] Simplify `DataFrameStatFunctions.countMinSketch` with `CountMinSketchAgg`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4339e0c0e4d [SPARK-45707][SQL] Simplify `DataFrameStatFunctions.countMinSketch` with `CountMinSketchAgg` 4339e0c0e4d is described below commit 4339e0c0e4d7e502ae6cafa90444cd153017cb1a Author: Ruifeng Zheng AuthorDate: Sun Oct 29 22:42:01 2023 -0700 [SPARK-45707][SQL] Simplify `DataFrameStatFunctions.countMinSketch` with `CountMinSketchAgg` ### What changes were proposed in this pull request? Simplify `DataFrameStatFunctions.countMinSketch` with `CountMinSketchAgg` ### Why are the changes needed? to make it consistent with sql functions ### Does this PR introduce _any_ user-facing change? better error messages: `IllegalArgumentException` -> `AnalysisException` ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #43560 from zhengruifeng/sql_reimpl_stat_countMinSketch. Authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- .../apache/spark/sql/DataFrameStatFunctions.scala | 44 +++--- .../org/apache/spark/sql/DataFrameStatSuite.scala | 2 +- 2 files changed, 14 insertions(+), 32 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala index de3b100cd6a..f3690773f6d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala @@ -22,9 +22,8 @@ import java.{lang => jl, util => ju} import scala.jdk.CollectionConverters._ import org.apache.spark.annotation.Stable -import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.expressions.Literal -import org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregate +import org.apache.spark.sql.catalyst.expressions.aggregate.{BloomFilterAggregate, CountMinSketchAgg} import org.apache.spark.sql.execution.stat._ import org.apache.spark.sql.functions.col import org.apache.spark.sql.types._ @@ -483,7 +482,9 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) { * @since 2.0.0 */ def countMinSketch(col: Column, depth: Int, width: Int, seed: Int): CountMinSketch = { -countMinSketch(col, CountMinSketch.create(depth, width, seed)) +val eps = 2.0 / width +val confidence = 1 - 1 / Math.pow(2, depth) +countMinSketch(col, eps, confidence, seed) } /** @@ -497,35 +498,16 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) { * @since 2.0.0 */ def countMinSketch(col: Column, eps: Double, confidence: Double, seed: Int): CountMinSketch = { -countMinSketch(col, CountMinSketch.create(eps, confidence, seed)) - } - - private def countMinSketch(col: Column, zero: CountMinSketch): CountMinSketch = { -val singleCol = df.select(col) -val colType = singleCol.schema.head.dataType - -val updater: (CountMinSketch, InternalRow) => Unit = colType match { - // For string type, we can get bytes of our `UTF8String` directly, and call the `addBinary` - // instead of `addString` to avoid unnecessary conversion. 
- case StringType => (sketch, row) => sketch.addBinary(row.getUTF8String(0).getBytes) - case ByteType => (sketch, row) => sketch.addLong(row.getByte(0)) - case ShortType => (sketch, row) => sketch.addLong(row.getShort(0)) - case IntegerType => (sketch, row) => sketch.addLong(row.getInt(0)) - case LongType => (sketch, row) => sketch.addLong(row.getLong(0)) - case _ => -throw new IllegalArgumentException( - s"Count-min Sketch only supports string type and integral types, " + -s"and does not support type $colType." -) -} - -singleCol.queryExecution.toRdd.aggregate(zero)( - (sketch: CountMinSketch, row: InternalRow) => { -updater(sketch, row) -sketch - }, - (sketch1, sketch2) => sketch1.mergeInPlace(sketch2) +val countMinSketchAgg = new CountMinSketchAgg( + col.expr, + Literal(eps, DoubleType), + Literal(confidence, DoubleType), + Literal(seed, IntegerType) ) +val bytes = df.select( + Column(countMinSketchAgg.toAggregateExpression(false)) +).head().getAs[Array[Byte]](0) +countMinSketchAgg.deserialize(bytes) } /** diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala index 1dece5c8285..430e3622102 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/Data
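The rewritten depth/width overload derives `(eps, confidence)` from the standard Count-Min Sketch relations shown in the diff; a small numeric illustration (plain Scala, values are hypothetical):

```scala
// eps = 2 / width, confidence = 1 - 1 / 2^depth, matching the new overload.
val depth = 10
val width = 2000
val eps = 2.0 / width                       // 0.001
val confidence = 1 - 1 / math.pow(2, depth) // 0.9990234375
```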
(spark) branch master updated: [SPARK-45682][CORE][SQL][ML][MLLIB][GRAPHX][YARN][DSTREAM][UI][EXAMPLES] Fix "method + in class Byte/Short/Char/Long/Double/Int is deprecated"
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 59a9ee77657 [SPARK-45682][CORE][SQL][ML][MLLIB][GRAPHX][YARN][DSTREAM][UI][EXAMPLES] Fix "method + in class Byte/Short/Char/Long/Double/Int is deprecated" 59a9ee77657 is described below commit 59a9ee776570de987c36e3e8e995d067017064b5 Author: yangjie01 AuthorDate: Sun Oct 29 22:37:49 2023 -0700 [SPARK-45682][CORE][SQL][ML][MLLIB][GRAPHX][YARN][DSTREAM][UI][EXAMPLES] Fix "method + in class Byte/Short/Char/Long/Double/Int is deprecated"

### What changes were proposed in this pull request?
There are some compilation warnings similar to the following:

```
[error] /Users/yangjie01/SourceCode/git/spark-mine-sbt/mllib/src/test/scala/org/apache/spark/mllib/regression/LassoSuite.scala:65:58: method + in class Double is deprecated (since 2.13.0): Adding a number and a String is deprecated. Use the string interpolation `s"$num$str"`
[error] Applicable -Wconf / nowarn filters for this fatal warning: msg=, cat=deprecation, site=org.apache.spark.mllib.regression.LassoSuite, origin=scala.Double.+, version=2.13.0
[error] assert(weight1 >= -1.60 && weight1 <= -1.40, weight1 + " not in [-1.6, -1.4]")
[error] ^
[error] /Users/yangjie01/SourceCode/git/spark-mine-sbt/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:1314:50: method + in class Int is deprecated (since 2.13.0): Adding a number and a String is deprecated. Use the string interpolation `s"$num$str"`
[error] Applicable -Wconf / nowarn filters for this fatal warning: msg=, cat=deprecation, site=org.apache.spark.sql.hive.execution.SQLQuerySuiteBase, origin=scala.Int.+, version=2.13.0
[error] checkAnswer(df, (0 until 5).map(i => Row(i + "#", i + "#")))
[error] ^
```

This PR fixes them as the warning suggests: "Adding a number and a String is deprecated. Use the string interpolation `s"$num$str"`".

### Why are the changes needed?
Clean up deprecated Scala API usage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43573 from LuciferYang/SPARK-45682.
Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- .../main/scala/org/apache/spark/SparkConf.scala| 2 +- .../spark/internal/config/ConfigBuilder.scala | 4 ++-- .../scala/org/apache/spark/scheduler/TaskSet.scala | 2 +- .../org/apache/spark/ui/ConsoleProgressBar.scala | 4 ++-- .../main/scala/org/apache/spark/ui/UIUtils.scala | 2 +- .../scala/org/apache/spark/ui/jobs/StagePage.scala | 6 +++--- .../scala/org/apache/spark/util/Distribution.scala | 4 ++-- .../scala/org/apache/spark/rdd/PipedRDDSuite.scala | 2 +- .../spark/util/collection/AppendOnlyMapSuite.scala | 6 +++--- .../spark/util/collection/OpenHashMapSuite.scala | 6 +++--- .../collection/PrimitiveKeyOpenHashMapSuite.scala | 6 +++--- .../apache/spark/examples/graphx/Analytics.scala | 2 +- .../apache/spark/graphx/util/GraphGenerators.scala | 2 +- .../apache/spark/mllib/util/MFDataGenerator.scala | 4 ++-- .../spark/ml/classification/LinearSVCSuite.scala | 2 +- .../classification/LogisticRegressionSuite.scala | 10 +- .../GeneralizedLinearRegressionSuite.scala | 22 +++--- .../ml/regression/LinearRegressionSuite.scala | 8 .../apache/spark/mllib/regression/LassoSuite.scala | 12 ++-- .../spark/deploy/yarn/ExecutorRunnable.scala | 2 +- .../yarn/YarnShuffleServiceMetricsSuite.scala | 2 +- .../spark/sql/catalyst/expressions/literals.scala | 10 +- .../expressions/IntervalExpressionsSuite.scala | 4 ++-- .../sql/catalyst/util/IntervalUtilsSuite.scala | 2 +- .../test/scala/org/apache/spark/sql/UDFSuite.scala | 2 +- .../sql/execution/datasources/FileIndexSuite.scala | 2 +- .../parquet/ParquetColumnIndexSuite.scala | 6 +++--- .../datasources/parquet/ParquetFilterSuite.scala | 2 +- .../datasources/v2/V2PredicateSuite.scala | 2 +- .../spark/sql/hive/execution/SQLQuerySuite.scala | 4 ++-- .../apache/spark/streaming/CheckpointSuite.scala | 2 +- .../apache/spark/streaming/InputStreamsSuite.scala | 4 ++-- .../spark/streaming/StreamingContextSuite.scala| 2 +- 33 files changed, 76 insertions(+), 76 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/SparkConf.scala b/core/src/main/scala/org/apache/spark/SparkConf.scala index b688604beea..b8fd2700771 100644 --- a/core/src/main/scala/org/apache/spark/SparkConf.scala
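The fix pattern, using the `LassoSuite` assertion quoted in the warning above as an example (`weight1` stands in for any numeric value):

```scala
val weight1 = -1.45

// Deprecated since Scala 2.13.0: the implicit any2stringadd conversion turns
// Double + String into string concatenation.
assert(weight1 >= -1.60 && weight1 <= -1.40, weight1 + " not in [-1.6, -1.4]")

// Replacement applied throughout this commit: explicit string interpolation.
assert(weight1 >= -1.60 && weight1 <= -1.40, s"$weight1 not in [-1.6, -1.4]")
```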
(spark) branch master updated: [SPARK-45664][SQL] Introduce a mapper for orc compression codecs
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f12bc05e440 [SPARK-45664][SQL] Introduce a mapper for orc compression codecs f12bc05e440 is described below commit f12bc05e44099f24c470466bf777473744ab893d Author: Jiaan Geng AuthorDate: Sun Oct 29 22:22:07 2023 -0700 [SPARK-45664][SQL] Introduce a mapper for orc compression codecs

### What changes were proposed in this pull request?
Currently, Spark supports all the ORC compression codecs, but the codecs ORC supports and the ones Spark supports do not map one-to-one, because Spark introduces two extra codec names, `NONE` and `UNCOMPRESSED`. In addition, there are many magic strings copied from the ORC compression codecs, so developers must maintain their consistency by hand, which is error-prone and reduces development efficiency.

### Why are the changes needed?
Make ORC compression codecs easier for developers to use correctly.

### Does this PR introduce _any_ user-facing change?
'No'. Introduce a new class.

### How was this patch tested?
Existing test cases.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #43528 from beliefer/SPARK-45664.

Authored-by: Jiaan Geng Signed-off-by: Dongjoon Hyun
---
 .../datasources/orc/OrcCompressionCodec.java | 56 ++
 .../sql/execution/datasources/orc/OrcOptions.scala | 11 ++---
 .../sql/execution/datasources/orc/OrcUtils.scala | 12 ++---
 .../BuiltInDataSourceWriteBenchmark.scala | 4 +-
 .../benchmark/DataSourceReadBenchmark.scala | 4 +-
 .../benchmark/FilterPushdownBenchmark.scala | 3 +-
 .../datasources/FileSourceCodecSuite.scala | 5 +-
 .../execution/datasources/orc/OrcQuerySuite.scala | 26 +-
 .../execution/datasources/orc/OrcSourceSuite.scala | 29 +++
 .../spark/sql/hive/CompressionCodecSuite.scala | 23 ++---
 .../spark/sql/hive/execution/HiveDDLSuite.scala | 7 +--
 .../sql/hive/orc/OrcHadoopFsRelationSuite.scala | 6 +--
 .../spark/sql/hive/orc/OrcReadBenchmark.scala | 5 +-
 13 files changed, 134 insertions(+), 57 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcCompressionCodec.java b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcCompressionCodec.java
new file mode 100644
index 000..c8e57969068
--- /dev/null
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcCompressionCodec.java
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc;
+
+import java.util.Arrays;
+import java.util.Locale;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import org.apache.orc.CompressionKind;
+
+/**
+ * A mapper class from Spark supported orc compression codecs to orc compression codecs.
+ */
+public enum OrcCompressionCodec {
+  NONE(CompressionKind.NONE),
+  UNCOMPRESSED(CompressionKind.NONE),
+  ZLIB(CompressionKind.ZLIB),
+  SNAPPY(CompressionKind.SNAPPY),
+  LZO(CompressionKind.LZO),
+  LZ4(CompressionKind.LZ4),
+  ZSTD(CompressionKind.ZSTD);
+
+  private final CompressionKind compressionKind;
+
+  OrcCompressionCodec(CompressionKind compressionKind) {
+    this.compressionKind = compressionKind;
+  }
+
+  public CompressionKind getCompressionKind() {
+    return this.compressionKind;
+  }
+
+  public static final Map<String, String> codecNameMap =
+    Arrays.stream(OrcCompressionCodec.values()).collect(
+      Collectors.toMap(codec -> codec.name(), codec -> codec.name().toLowerCase(Locale.ROOT)));
+
+  public String lowerCaseName() {
+    return codecNameMap.get(this.name());
+  }
+}
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOptions.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOptions.scala
index 1c819f07038..4bed600fa4e 100644
--- a/sql/
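Based on the enum shown above, a short sketch of how the mapper is meant to be used in place of magic strings (assumes the ORC `CompressionKind` class is on the classpath):

```scala
import org.apache.orc.CompressionKind
import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec

// Spark's two extra codec names, NONE and UNCOMPRESSED, map to the same ORC kind.
assert(OrcCompressionCodec.NONE.getCompressionKind == CompressionKind.NONE)
assert(OrcCompressionCodec.UNCOMPRESSED.getCompressionKind == CompressionKind.NONE)

// lowerCaseName() returns the canonical lower-case spelling, e.g. "snappy",
// replacing hand-written string literals.
assert(OrcCompressionCodec.SNAPPY.lowerCaseName() == "snappy")
```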
(spark) branch master updated: [SPARK-44790][SQL] XML: to_xml implementation and bindings for python, connect and SQL
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 00d2b4fa2de [SPARK-44790][SQL] XML: to_xml implementation and bindings for python, connect and SQL 00d2b4fa2de is described below commit 00d2b4fa2def948e7517bacfce7c75be6a37eb20 Author: Sandip Agarwala <131817656+sandip...@users.noreply.github.com> AuthorDate: Mon Oct 30 14:00:31 2023 +0900 [SPARK-44790][SQL] XML: to_xml implementation and bindings for python, connect and SQL

### What changes were proposed in this pull request?
to_xml: Converts a `StructType` to an XML output string. Bindings for Python, Connect, and SQL.

### Why are the changes needed?
to_xml: Converts a `StructType` to an XML output string.

### Does this PR introduce _any_ user-facing change?
Yes, new to_xml API.

### How was this patch tested?
New unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43503 from sandip-db/to_xml.

Authored-by: Sandip Agarwala <131817656+sandip...@users.noreply.github.com> Signed-off-by: Hyukjin Kwon --- .../scala/org/apache/spark/sql/functions.scala | 31 +++ .../org/apache/spark/sql/FunctionTestSuite.scala | 6 + .../source/reference/pyspark.sql/functions.rst | 1 + python/pyspark/sql/connect/functions.py| 10 + python/pyspark/sql/functions.py| 36 +++ .../sql/tests/connect/test_connect_function.py | 14 + sql/catalyst/pom.xml | 4 + .../sql/catalyst/analysis/FunctionRegistry.scala | 3 +- .../sql/catalyst/expressions/xmlExpressions.scala | 90 ++- .../spark/sql/catalyst/xml/StaxXmlGenerator.scala | 295 - .../apache/spark/sql/catalyst/xml/XmlOptions.scala | 5 + sql/core/pom.xml | 4 - .../datasources/xml/XmlOutputWriter.scala | 53 +--- .../scala/org/apache/spark/sql/functions.scala | 30 +++ .../sql-functions/sql-expression-schema.md | 1 + .../analyzer-results/xml-functions.sql.out | 122 + .../resources/sql-tests/inputs/xml-functions.sql | 18 +- .../sql-tests/results/xml-functions.sql.out| 134 ++ .../sql/execution/datasources/xml/XmlSuite.scala | 25 ++ 19 files changed, 696 insertions(+), 186 deletions(-) diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala index 9c5adca7e28..1c8f5993d29 100644 --- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala +++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala @@ -7470,6 +7470,37 @@ object functions { fnWithOptions("schema_of_xml", options.asScala.iterator, xml) } + // scalastyle:off line.size.limit + + /** + * (Java-specific) Converts a column containing a `StructType` into a XML string with the + * specified schema. Throws an exception, in the case of an unsupported type. + * + * @param e + * a column containing a struct. + * @param options + * options to control how the struct column is converted into a XML string. It accepts the + * same options as the XML data source. See <a href="https://spark.apache.org/docs/latest/sql-data-sources-xml.html#data-source-option">Data + * Source Option</a> in the version you use.
+ * @group xml_funcs + * @since 4.0.0 + */ + // scalastyle:on line.size.limit + def to_xml(e: Column, options: java.util.Map[String, String]): Column = +fnWithOptions("to_xml", options.asScala.iterator, e) + + /** + * Converts a column containing a `StructType` into a XML string with the specified schema. + * Throws an exception, in the case of an unsupported type. + * + * @param e + * a column containing a struct. + * @group xml_funcs + * @since 4.0.0 + */ + def to_xml(e: Column): Column = to_xml(e, Collections.emptyMap()) + /** * Returns the total number of elements in the array. The function returns null for null input. * diff --git a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/FunctionTestSuite.scala b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/FunctionTestSuite.scala index e350bde9946..748843ec991 100644 --- a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/FunctionTestSuite.scala +++ b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/FunctionTestSuite.scala @@ -237,6 +237,12 @@ class FunctionTestSuite extends ConnectFunSuite { from_xml(a, schema, Map.empty[String, String].asJava), from_xml(a, schema, Collections.emptyMap[String, String]), from_xml(a, lit(schema.json), Collections.e
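A hedged usage sketch of the new Scala binding (assumes an active `SparkSession` named `spark`; the exact XML shape depends on the writer's default options, such as the row tag):

```scala
import org.apache.spark.sql.functions.{struct, to_xml}
import spark.implicits._

// Pack the columns to serialize into a struct, then render each row as XML text.
val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
df.select(to_xml(struct($"id", $"name"))).show(truncate = false)
```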
Re: [PR] Add Matomo analytics to the published documents [spark-website]
allisonwang-db commented on PR #485: URL: https://github.com/apache/spark-website/pull/485#issuecomment-1784451642 This is awesome! Thanks for adding it to ALL API docs! Let's see if it works for 3.1.1.
Re: [PR] Add Matomo analytics to the published documents [spark-website]
panbingkun commented on PR #485: URL: https://github.com/apache/spark-website/pull/485#issuecomment-1784382189 cc @zhengruifeng @itholic @allisonwang-db @HyukjinKwon @srowen
Re: [PR] Add Matomo analytics to the published documents [spark-website]
panbingkun commented on PR #485: URL: https://github.com/apache/spark-website/pull/485#issuecomment-1784379925 At present, this PR has only made changes to the historical document of version 3.1.1. If there are no issues with similar modifications, I will use tools to make similar modifications to other versions. PS: This PR will involve a lot of files.
[PR] Add Matomo analytics to the published documents [spark-website]
panbingkun opened a new pull request, #485: URL: https://github.com/apache/spark-website/pull/485 The PR aims to add Matomo analytics to the published documents. include:
(spark) branch master updated (4af4ddea116 -> 0245b842ccc)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 4af4ddea116 [SPARK-45552][PS] Introduce flexible parameters to `assertDataFrameEqual` add 0245b842ccc [SPARK-45554][PYTHON] Introduce flexible parameter to `assertSchemaEqual` No new revisions were added by this update. Summary of changes: python/pyspark/testing/utils.py | 90 ++--- 1 file changed, 85 insertions(+), 5 deletions(-)
(spark) branch master updated: [SPARK-45552][PS] Introduce flexible parameters to `assertDataFrameEqual`
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4af4ddea116 [SPARK-45552][PS] Introduce flexible parameters to `assertDataFrameEqual` 4af4ddea116 is described below commit 4af4ddea116d26086550596693ce09674e75bfa3 Author: Haejoon Lee AuthorDate: Mon Oct 30 11:07:01 2023 +0900 [SPARK-45552][PS] Introduce flexible parameters to `assertDataFrameEqual`

### What changes were proposed in this pull request?
This PR proposes to add six new parameters to `assertDataFrameEqual`: `ignoreNullable`, `ignoreColumnOrder`, `ignoreColumnName`, `ignoreColumnType`, `maxErrors`, and `showOnlyDiff`, to provide users with more flexibility in DataFrame testing.

### Why are the changes needed?
To enhance the utility of `assertDataFrameEqual` by accommodating various common DataFrame comparison scenarios that users might encounter, without necessitating manual adjustments or workarounds.

### Does this PR introduce _any_ user-facing change?
Yes. `assertDataFrameEqual` now has the option to use the six new parameters:

| Parameter | Type | Comment |
| -- | -- | -- |
| ignoreNullable | Boolean [optional] | Specifies whether a column’s nullable property is included when checking for schema equality. When set to True (default), the nullable property of the columns being compared is not taken into account and the columns will be considered equal even if they have different nullable settings. When set to False, columns are considered equal only if they have the same nullable setting. |
| ignoreColumnOrder | Boolean [optional] | Specifies whether to compare columns in the order they appear in the DataFrames or by column name. When set to False (default), columns are compared in the order they appear in the DataFrames. When set to True, a column in the expected DataFrame is compared to the column with the same name in the actual DataFrame. ignoreColumnOrder cannot be set to True if ignoreColumnNames is also set to True. |
| ignoreColumnName | Boolean [optional] | Specifies whether to fail the initial schema equality check if the column names in the two DataFrames are different. When set to False (default), column names are checked and the function fails if they are different. When set to True, the function will succeed even if column names are different. Column data types are compared for columns in the order they appear in the DataFrames. ignoreColumnNames cannot be set to [...] |
| ignoreColumnType | Boolean [optional] | Specifies whether to ignore the data type of the columns when comparing. When set to False (default), column data types are checked and the function fails if they are different. When set to True, the schema equality check will succeed even if column data types are different and the function will attempt to compare rows. |
| maxErrors | Integer [optional] | The maximum number of row comparison failures to encounter before returning. When this number of row comparisons have failed, the function returns independent of how many rows have been compared. Set to None by default, which means compare all rows independent of the number of failures. |
| showOnlyDiff | Boolean [optional] | If set to True, the error message will only include rows that are different. If set to False (default), the error message will include all rows (when there is at least one row that is different). |

### How was this patch tested?
Added usage examples into doctest for each parameter. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43433 from itholic/SPARK-45552. Authored-by: Haejoon Lee Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/tests/test_utils.py | 68 +++ python/pyspark/testing/utils.py| 215 +++-- 2 files changed, 274 insertions(+), 9 deletions(-) diff --git a/python/pyspark/sql/tests/test_utils.py b/python/pyspark/sql/tests/test_utils.py index a2cad4e83bd..421043a41bb 100644 --- a/python/pyspark/sql/tests/test_utils.py +++ b/python/pyspark/sql/tests/test_utils.py @@ -1238,6 +1238,9 @@ class UtilsTestsMixin: assertDataFrameEqual(df1, df2) +with self.assertRaises(PySparkAssertionError): +assertDataFrameEqual(df1, df2, ignoreNullable=False) + def test_schema_ignore_nullable_array_equal(self): s1 = StructType([StructField("names", ArrayType(DoubleType(), True), True)]) s2 = StructType([StructField("names", ArrayType(DoubleType(), False), False)]) @@ -1611,6 +1614,71 @@ class UtilsTestsMixin: message_parameters={"error_msg": error_msg}, ) +def test_dataframe_ignore_column_order(self): +df1 = self.spark.createD
(spark) branch master updated: [SPARK-45544][CORE] Integrate SSL support into TransportContext
This is an automated email from the ASF dual-hosted git repository. mridulm80 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 884f6f71172 [SPARK-45544][CORE] Integrate SSL support into TransportContext 884f6f71172 is described below commit 884f6f71172156ccc7d95ed022c8fb8baadc3c0a Author: Hasnain Lakhani AuthorDate: Sun Oct 29 20:58:18 2023 -0500 [SPARK-45544][CORE] Integrate SSL support into TransportContext ### What changes were proposed in this pull request? This integrates SSL support into TransportContext and related modules so that the RPC SSL functionality can work when properly configured. ### Why are the changes needed? This is needed in order to support SSL for RPC connections. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI Ran the following tests: ``` build/sbt -P yarn > project network-common > testOnly > project network-shuffle > testOnly > project core > testOnly *Ssl* > project yarn > testOnly org.apache.spark.network.yarn.SslYarnShuffleServiceWithRocksDBBackendSuite ``` I verified traffic was encrypted using TLS using two mechanisms: * Enabled trace level logging for Netty and JDK SSL and saw logs confirming TLS handshakes were happening * I ran wireshark on my machine and snooped on traffic while sending queries shuffling a fixed string. Without any encryption, I could find that string in the network traffic. With this encryption enabled, that string did not show up, and wireshark logs confirmed a TLS handshake was happening. ### Was this patch authored or co-authored using generative AI tooling? No Closes #43541 from hasnain-db/spark-tls-final. Authored-by: Hasnain Lakhani Signed-off-by: Mridul Muralidharan gmail.com> --- .../org/apache/spark/network/TransportContext.java | 70 -- .../network/client/TransportClientFactory.java | 26 +++- .../spark/network/server/TransportServer.java | 2 +- .../apache/spark/network/util/TransportConf.java | 8 --- .../spark/network/ChunkFetchIntegrationSuite.java | 6 +- .../network/SslChunkFetchIntegrationSuite.java | 22 --- .../client/SslTransportClientFactorySuite.java | 29 + .../client/TransportClientFactorySuite.java| 8 +-- .../network/shuffle/ShuffleTransportContext.java | 10 ++-- .../shuffle/ExternalShuffleIntegrationSuite.java | 29 + .../shuffle/ExternalShuffleSecuritySuite.java | 14 - .../shuffle/ShuffleTransportContextSuite.java | 33 +- .../SslExternalShuffleIntegrationSuite.java| 44 ++ .../shuffle/SslExternalShuffleSecuritySuite.java | 35 +++ .../shuffle/SslShuffleTransportContextSuite.java | 28 + .../network/yarn/SslYarnShuffleServiceSuite.scala | 2 +- 16 files changed, 265 insertions(+), 101 deletions(-) diff --git a/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java b/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java index 51d074a4ddb..90ca4f4c46a 100644 --- a/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java +++ b/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java @@ -23,13 +23,17 @@ import io.netty.handler.codec.MessageToMessageDecoder; import java.io.Closeable; import java.util.ArrayList; import java.util.List; +import javax.annotation.Nullable; import com.codahale.metrics.Counter; import io.netty.channel.Channel; import io.netty.channel.ChannelPipeline; import io.netty.channel.EventLoopGroup; import io.netty.channel.socket.SocketChannel; 
+import io.netty.handler.ssl.SslHandler; +import io.netty.handler.stream.ChunkedWriteHandler; import io.netty.handler.timeout.IdleStateHandler; +import io.netty.handler.codec.MessageToMessageEncoder; import org.slf4j.Logger; import org.slf4j.LoggerFactory; @@ -37,6 +41,8 @@ import org.apache.spark.network.client.TransportClient; import org.apache.spark.network.client.TransportClientBootstrap; import org.apache.spark.network.client.TransportClientFactory; import org.apache.spark.network.client.TransportResponseHandler; +import org.apache.spark.network.protocol.Message; +import org.apache.spark.network.protocol.SslMessageEncoder; import org.apache.spark.network.protocol.MessageDecoder; import org.apache.spark.network.protocol.MessageEncoder; import org.apache.spark.network.server.ChunkFetchRequestHandler; @@ -45,6 +51,7 @@ import org.apache.spark.network.server.TransportChannelHandler; import org.apache.spark.network.server.TransportRequestHandler; import org.apache.spark.network.server.TransportServer; import org.apache.spark
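As background for what "integrating SSL" means at the Netty level, a purely illustrative sketch (the helper names below are hypothetical and are not Spark's actual internals):

```scala
import java.io.File
import io.netty.channel.socket.SocketChannel
import io.netty.handler.ssl.{SslContext, SslContextBuilder}

// Hypothetical helper: build a server-side SslContext from a cert/key pair.
def buildServerSslContext(cert: File, key: File): SslContext =
  SslContextBuilder.forServer(cert, key).build()

// Hypothetical helper: the SslHandler is installed first, so every later
// handler in the pipeline (frame decoding, message handling) sees decrypted bytes.
def installSsl(ch: SocketChannel, sslCtx: SslContext): Unit =
  ch.pipeline().addFirst("ssl", sslCtx.newHandler(ch.alloc()))
```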
(spark) branch master updated: [SPARK-45605][CORE][SQL][SS][CONNECT][MLLIB][GRAPHX][DSTREAM][PROTOBUF][EXAMPLES] Replace `s.c.MapOps.mapValues` with `s.c.MapOps.view.mapValues`
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 89ca8b6065e [SPARK-45605][CORE][SQL][SS][CONNECT][MLLIB][GRAPHX][DSTREAM][PROTOBUF][EXAMPLES] Replace `s.c.MapOps.mapValues` with `s.c.MapOps.view.mapValues` 89ca8b6065e is described below commit 89ca8b6065e9f690a492c778262080741d50d94d Author: yangjie01 AuthorDate: Sun Oct 29 09:19:30 2023 -0500 [SPARK-45605][CORE][SQL][SS][CONNECT][MLLIB][GRAPHX][DSTREAM][PROTOBUF][EXAMPLES] Replace `s.c.MapOps.mapValues` with `s.c.MapOps.view.mapValues`

### What changes were proposed in this pull request?
This PR replaces `s.c.MapOps.mapValues` with `s.c.MapOps.view.mapValues`, because `s.c.MapOps.mapValues` has been marked as deprecated since Scala 2.13.0: https://github.com/scala/scala/blob/bf45e199e96383b96a6955520d7d2524c78e6e12/src/library/scala/collection/Map.scala#L256-L262

```scala
@deprecated("Use .view.mapValues(f). A future version will include a strict version of this method (for now, .view.mapValues(f).toMap).", "2.13.0")
def mapValues[W](f: V => W): MapView[K, W] = new MapView.MapValues(this, f)
```

### Why are the changes needed?
Clean up deprecated API usage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions
- Packaged the client, manually tested `DFSReadWriteTest/MiniReadWriteTest/PowerIterationClusteringExample`.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43448 from LuciferYang/SPARK-45605.

Lead-authored-by: yangjie01 Co-authored-by: YangJie Signed-off-by: Sean Owen --- .../spark/util/sketch/CountMinSketchSuite.scala| 2 +- .../org/apache/spark/sql/avro/AvroUtils.scala | 1 + .../scala/org/apache/spark/sql/SparkSession.scala | 2 +- .../spark/sql/ClientDataFrameStatSuite.scala | 2 +- .../org/apache/spark/sql/connect/dsl/package.scala | 2 +- .../sql/connect/planner/SparkConnectPlanner.scala | 13 ++ .../sql/kafka010/KafkaMicroBatchSourceSuite.scala | 3 ++- .../apache/spark/sql/kafka010/KafkaTestUtils.scala | 2 +- .../streaming/kafka010/ConsumerStrategy.scala | 6 ++--- .../kafka010/DirectKafkaInputDStream.scala | 2 +- .../kafka010/DirectKafkaStreamSuite.scala | 2 +- .../spark/streaming/kafka010/KafkaTestUtils.scala | 2 +- .../spark/streaming/kinesis/KinesisTestUtils.scala | 2 +- .../kinesis/KPLBasedKinesisTestUtils.scala | 2 +- .../kinesis/KinesisBackedBlockRDDSuite.scala | 4 +-- .../spark/sql/protobuf/utils/ProtobufUtils.scala | 1 + .../org/apache/spark/api/java/JavaPairRDD.scala| 4 +-- .../apache/spark/api/java/JavaSparkContext.scala | 2 +- .../spark/api/python/PythonWorkerFactory.scala | 2 +- .../apache/spark/scheduler/InputFormatInfo.scala | 2 +- .../apache/spark/scheduler/TaskSchedulerImpl.scala | 2 +- .../cluster/CoarseGrainedSchedulerBackend.scala| 2 +- ...plicationEnvironmentInfoWrapperSerializer.scala | 5 ++-- .../ExecutorSummaryWrapperSerializer.scala | 3 ++- .../status/protobuf/JobDataWrapperSerializer.scala | 2 +- .../protobuf/StageDataWrapperSerializer.scala | 6 ++--- .../org/apache/spark/SparkThrowableSuite.scala | 2 +- .../apache/spark/rdd/PairRDDFunctionsSuite.scala | 2 +- .../test/scala/org/apache/spark/rdd/RDDSuite.scala | 1 + .../scheduler/ExecutorResourceInfoSuite.scala | 1 + .../BlockManagerDecommissionIntegrationSuite.scala | 2 +- .../storage/ShuffleBlockFetcherIteratorSuite.scala | 2 +- .../util/collection/ExternalSorterSuite.scala | 2 +-
.../apache/spark/examples/DFSReadWriteTest.scala | 1 + .../apache/spark/examples/MiniReadWriteTest.scala | 1 + .../mllib/PowerIterationClusteringExample.scala| 2 +- .../spark/graphx/lib/ShortestPathsSuite.scala | 2 +- .../spark/ml/evaluation/ClusteringMetrics.scala| 1 + .../apache/spark/ml/feature/VectorIndexer.scala| 2 +- .../org/apache/spark/ml/feature/Word2Vec.scala | 2 +- .../apache/spark/ml/tree/impl/RandomForest.scala | 4 +-- .../spark/mllib/clustering/BisectingKMeans.scala | 2 +- .../mllib/linalg/distributed/BlockMatrix.scala | 4 +-- .../apache/spark/mllib/stat/test/ChiSqTest.scala | 1 + .../apache/spark/ml/recommendation/ALSSuite.scala | 8 +++--- .../apache/spark/mllib/feature/Word2VecSuite.scala | 12 - .../org/apache/spark/sql/types/Metadata.scala | 2 +- .../spark/sql/catalyst/analysis/Analyzer.scala | 3 ++- .../catalyst/catalog/ExternalCatalogUtils.scala| 2 +- .../sql/catalyst/catalog/SessionCatalog.scala | 2 +- .../spark/sql/catalyst/expressions/package.scala | 2 +- .../catalyst/optimizer/NestedCo
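The replacement pattern this commit applies throughout the listed files, in a minimal sketch (plain Scala 2.13):

```scala
val m = Map("a" -> 1, "b" -> 2)

// Deprecated since Scala 2.13.0; mapValues returns a lazy MapView, which can
// surprise callers expecting a strict Map.
// val doubled = m.mapValues(_ * 2)

// Replacement: go through .view explicitly, and add .toMap wherever a strict
// Map is actually required.
val doubledView = m.view.mapValues(_ * 2)
val doubledMap  = m.view.mapValues(_ * 2).toMap
```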
(spark) branch master updated: [SPARK-45636][BUILD] Upgrade jersey to 2.41
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4ae99f9320c [SPARK-45636][BUILD] Upgrade jersey to 2.41 4ae99f9320c is described below commit 4ae99f9320ca29193f7c0d6d54d61e5d3fd0b323 Author: YangJie AuthorDate: Sun Oct 29 09:18:07 2023 -0500 [SPARK-45636][BUILD] Upgrade jersey to 2.41

### What changes were proposed in this pull request?
This PR aims to upgrade Jersey from 2.40 to 2.41.

### Why are the changes needed?
The new version brings some improvements, like:
- https://github.com/eclipse-ee4j/jersey/pull/5350
- https://github.com/eclipse-ee4j/jersey/pull/5365
- https://github.com/eclipse-ee4j/jersey/pull/5436
- https://github.com/eclipse-ee4j/jersey/pull/5296

and some bug fixes, like:
- https://github.com/eclipse-ee4j/jersey/pull/5359
- https://github.com/eclipse-ee4j/jersey/pull/5405
- https://github.com/eclipse-ee4j/jersey/pull/5423
- https://github.com/eclipse-ee4j/jersey/pull/5435
- https://github.com/eclipse-ee4j/jersey/pull/5445

The full release notes are as follows:
- https://github.com/eclipse-ee4j/jersey/releases/tag/2.41

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43490 from LuciferYang/SPARK-45636.

Lead-authored-by: YangJie Co-authored-by: yangjie01 Signed-off-by: Sean Owen
---
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 12 ++--
 pom.xml | 2 +-
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3
index c6fa77c84ca..2bfd94b9d46 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -122,12 +122,12 @@ jaxb-runtime/2.3.2//jaxb-runtime-2.3.2.jar
 jcl-over-slf4j/2.0.9//jcl-over-slf4j-2.0.9.jar
 jdo-api/3.0.1//jdo-api-3.0.1.jar
 jdom2/2.0.6//jdom2-2.0.6.jar
-jersey-client/2.40//jersey-client-2.40.jar
-jersey-common/2.40//jersey-common-2.40.jar
-jersey-container-servlet-core/2.40//jersey-container-servlet-core-2.40.jar
-jersey-container-servlet/2.40//jersey-container-servlet-2.40.jar
-jersey-hk2/2.40//jersey-hk2-2.40.jar
-jersey-server/2.40//jersey-server-2.40.jar
+jersey-client/2.41//jersey-client-2.41.jar
+jersey-common/2.41//jersey-common-2.41.jar
+jersey-container-servlet-core/2.41//jersey-container-servlet-core-2.41.jar
+jersey-container-servlet/2.41//jersey-container-servlet-2.41.jar
+jersey-hk2/2.41//jersey-hk2-2.41.jar
+jersey-server/2.41//jersey-server-2.41.jar
 jettison/1.5.4//jettison-1.5.4.jar
 jetty-util-ajax/9.4.53.v20231009//jetty-util-ajax-9.4.53.v20231009.jar
 jetty-util/9.4.53.v20231009//jetty-util-9.4.53.v20231009.jar
diff --git a/pom.xml b/pom.xml
index 6488918326f..71c3044dd42 100644
--- a/pom.xml
+++ b/pom.xml
@@ -206,7 +206,7 @@
 Please don't upgrade the version to 3.0.0+, Because it transitions Jakarta REST API from javax to jakarta package. -->
-<jersey.version>2.40</jersey.version>
+<jersey.version>2.41</jersey.version>
 2.12.5
 3.5.2
 3.0.0
(spark) branch master updated: [SPARK-45674][CONNECT][PYTHON] Improve error message for JVM-dependent attributes on Spark Connect
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1f8422b77ae [SPARK-45674][CONNECT][PYTHON] Improve error message for JVM-dependent attributes on Spark Connect 1f8422b77ae is described below commit 1f8422b77ae9fd920a30c55947712a5fb388c480 Author: Haejoon Lee AuthorDate: Sun Oct 29 19:26:50 2023 +0900 [SPARK-45674][CONNECT][PYTHON] Improve error message for JVM-dependent attributes on Spark Connect

### What changes were proposed in this pull request?
This PR proposes to improve the error message for JVM-dependent attributes on Spark Connect.

### Why are the changes needed?
To improve the usability of error messages by using a proper error class.

### Does this PR introduce _any_ user-facing change?
No API change, but only user-facing error messages are improved as below:

**Before**
```
>>> spark.sparkContext
PySparkNotImplementedError: [NOT_IMPLEMENTED] sparkContext() is not implemented.
```

**After**
```
>>> spark.sparkContext
PySparkAttributeError: [JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `sparkContext` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session.
```

### How was this patch tested?
The existing CI should pass

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43537 from itholic/SPARK-45674.

Authored-by: Haejoon Lee Signed-off-by: Hyukjin Kwon --- python/pyspark/errors/error_classes.py | 2 +- python/pyspark/sql/connect/dataframe.py| 4 +--- python/pyspark/sql/connect/session.py | 6 +- python/pyspark/sql/tests/connect/test_connect_basic.py | 14 -- 4 files changed, 3 insertions(+), 23 deletions(-) diff --git a/python/pyspark/errors/error_classes.py b/python/pyspark/errors/error_classes.py index 53a7279fde2..70e88c18f9d 100644 --- a/python/pyspark/errors/error_classes.py +++ b/python/pyspark/errors/error_classes.py @@ -354,7 +354,7 @@ ERROR_CLASSES_JSON = """ }, "JVM_ATTRIBUTE_NOT_SUPPORTED" : { "message" : [ - "Attribute `<attr_name>` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session." + "Attribute `<attr_name>` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session. Visit https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession for creating regular Spark Session in detail."
] }, "KEY_VALUE_PAIR_REQUIRED" : { diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py index b322ded84a4..75cecd5f610 100644 --- a/python/pyspark/sql/connect/dataframe.py +++ b/python/pyspark/sql/connect/dataframe.py @@ -1639,13 +1639,11 @@ class DataFrame: sampleBy.__doc__ = PySparkDataFrame.sampleBy.__doc__ def __getattr__(self, name: str) -> "Column": -if name in ["_jseq", "_jdf", "_jmap", "_jcols"]: +if name in ["_jseq", "_jdf", "_jmap", "_jcols", "rdd", "toJSON"]: raise PySparkAttributeError( error_class="JVM_ATTRIBUTE_NOT_SUPPORTED", message_parameters={"attr_name": name} ) elif name in [ -"rdd", -"toJSON", "checkpoint", "localCheckpoint", ]: diff --git a/python/pyspark/sql/connect/session.py b/python/pyspark/sql/connect/session.py index 53bf19b78c8..09bd60606c7 100644 --- a/python/pyspark/sql/connect/session.py +++ b/python/pyspark/sql/connect/session.py @@ -694,14 +694,10 @@ class SparkSession: streams.__doc__ = PySparkSession.streams.__doc__ def __getattr__(self, name: str) -> Any: -if name in ["_jsc", "_jconf", "_jvm", "_jsparkSession"]: +if name in ["_jsc", "_jconf", "_jvm", "_jsparkSession", "sparkContext", "newSession"]: raise PySparkAttributeError( error_class="JVM_ATTRIBUTE_NOT_SUPPORTED", message_parameters={"attr_name": name} ) -elif name in ["newSession", "sparkContext"]: -raise PySparkNotImplementedError( -error_class="NOT_IMPLEMENTED", message_parameters={"feature": f"{name}()"} -) return object.__getattribute__(self, name) @property diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py b/python/pyspark/sql/tests/connect/test_connect_basic.py index c96d08b5bbe..34bd314c76f 100755 --- a/python/pyspark/sql/tests/connect/test_connect_basic.py +++ b/python/pyspark/sql/tests/connect/test_con