This is an automated email from the ASF dual-hosted git repository.

dongjoon-hyun pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/spark-connect-swift.git


The following commit(s) were added to refs/heads/main by this push:
     new d57fe8c  [SPARK-57061] Support `xml(DataFrame)` in `DataFrameReader`
d57fe8c is described below

commit d57fe8c7eabd9403cbd7b7b07522950cdfb09acf
Author: Dongjoon Hyun <[email protected]>
AuthorDate: Mon May 25 18:34:37 2026 -0700

    [SPARK-57061] Support `xml(DataFrame)` in `DataFrameReader`
    
    ### What changes were proposed in this pull request?
    
    This PR aims to support the `xml(DataFrame)` overload in `DataFrameReader`.
    
    ### Why are the changes needed?
    
    For feature parity with PySpark/Scala `spark.read.xml(xmlDataset)`, and to exercise the newly added `Spark_Connect_Parse.ParseFormat.xml` (Apache Spark 4.2.0+).
    - https://github.com/apache/spark/pull/55332
    
    ### Does this PR introduce _any_ user-facing change?
    
    No behavior change. New public overload added.
    
    ### How was this patch tested?
    
    Pass the CIs with the newly added test case.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    Generated-by: Claude Opus 4.7
    
    Closes #389 from dongjoon-hyun/SPARK-57061.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
---
 Sources/SparkConnect/DataFrameReader.swift         | 21 +++++++++++++++++++++
 Sources/SparkConnect/TypeAliases.swift             |  1 +
 Tests/SparkConnectTests/DataFrameReaderTests.swift | 14 ++++++++++++++
 3 files changed, 36 insertions(+)
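
The overload added in this commit could be called roughly as in the sketch below, which mirrors the test case in the diff. It assumes a Spark Connect server running Apache Spark 4.2.0+ is reachable with the session builder's default settings; the function name `countPeopleFromXML` is hypothetical and only wraps the calls for illustration.

```swift
import SparkConnect

// Hypothetical helper, not part of the commit. Requires a reachable
// Spark Connect server (assumption); nothing runs against local files.
func countPeopleFromXML() async throws {
  let spark = try await SparkSession.builder.getOrCreate()

  // Same guard as the test in this commit (a plain string comparison).
  if await spark.version >= "4.2.0" {
    // A DataFrame with a single string column; each row is one XML document.
    let xmlDF = try await spark.sql(
      "SELECT * FROM VALUES "
        + "('<person><name>Alice</name><age>25</age></person>'), "
        + "('<person><name>Bob</name><age>30</age></person>') AS T(value)"
    )

    // `rowTag` selects the XML element that becomes one output row.
    let rowCount = try await spark.read.option("rowTag", "person").xml(xmlDF).count()
    print(rowCount)
  }

  await spark.stop()
}
```

Reader options such as `rowTag` are set before calling `xml(_:)` because the method copies the reader's accumulated options into the `Parse` message at call time.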

diff --git a/Sources/SparkConnect/DataFrameReader.swift b/Sources/SparkConnect/DataFrameReader.swift
index abfd8c6..5182ec7 100644
--- a/Sources/SparkConnect/DataFrameReader.swift
+++ b/Sources/SparkConnect/DataFrameReader.swift
@@ -226,6 +226,27 @@ public actor DataFrameReader: Sendable {
     return load(paths)
   }
 
+  /// Loads an XML dataset and returns the result as a ``DataFrame``.
+  /// The input ``DataFrame`` must have a single string column whose values are XML documents.
+  /// - Parameter xmlDataset: A ``DataFrame`` with a single string column.
+  /// - Returns: A ``DataFrame``.
+  public func xml(_ xmlDataset: DataFrame) async -> DataFrame {
+    var parse = Parse()
+    parse.format = .xml
+    parse.options = self.extraOptions.toStringDictionary()
+    if case .root(let input) = await xmlDataset.plan.opType {
+      parse.input = input
+    }
+
+    var relation = Relation()
+    relation.parse = parse
+
+    var plan = Plan()
+    plan.opType = .root(relation)
+
+    return DataFrame(spark: sparkSession, plan: plan)
+  }
+
   /// Loads an ORC file and returns the result as a ``DataFrame``.
   /// - Parameter path: A path string
   /// - Returns: A ``DataFrame``.
diff --git a/Sources/SparkConnect/TypeAliases.swift b/Sources/SparkConnect/TypeAliases.swift
index d5ee301..3f15f56 100644
--- a/Sources/SparkConnect/TypeAliases.swift
+++ b/Sources/SparkConnect/TypeAliases.swift
@@ -45,6 +45,7 @@ typealias NamedTable = Spark_Connect_Read.NamedTable
 typealias OneOf_Analyze = AnalyzePlanRequest.OneOf_Analyze
 typealias OneOf_CatType = Spark_Connect_Catalog.OneOf_CatType
 typealias OutputType = Spark_Connect_OutputType
+typealias Parse = Spark_Connect_Parse
 typealias Plan = Spark_Connect_Plan
 typealias Project = Spark_Connect_Project
 typealias Range = Spark_Connect_Range
diff --git a/Tests/SparkConnectTests/DataFrameReaderTests.swift b/Tests/SparkConnectTests/DataFrameReaderTests.swift
index 6bec415..f755879 100644
--- a/Tests/SparkConnectTests/DataFrameReaderTests.swift
+++ b/Tests/SparkConnectTests/DataFrameReaderTests.swift
@@ -61,6 +61,20 @@ struct DataFrameReaderTests {
     await spark.stop()
   }
 
+  @Test
+  func xmlDataset() async throws {
+    let spark = try await SparkSession.builder.getOrCreate()
+    if await spark.version >= "4.2.0" {
+      let xmlDF = try await spark.sql(
+        "SELECT * FROM VALUES "
+          + "('<person><name>Alice</name><age>25</age></person>'), "
+          + "('<person><name>Bob</name><age>30</age></person>') AS T(value)"
+      )
+      #expect(try await spark.read.option("rowTag", "person").xml(xmlDF).count() == 2)
+    }
+    await spark.stop()
+  }
+
   @Test
   func orc() async throws {
     let spark = try await SparkSession.builder.getOrCreate()

