[PR] [SPARK-57479][SQL] Read and infer XML schema from tar archives [spark]

via GitHub Wed, 17 Jun 2026 11:37:32 -0700


akshatshenoi-db opened a new pull request, #56572:
URL: https://github.com/apache/spark/pull/56572


   ### What changes were proposed in this pull request?
   
   SPARK-57135 added reading CSV files packed in tar archives 
(`.tar`/`.tar.gz`/`.tgz`), SPARK-57321 added CSV schema inference, SPARK-57419 
extended both to JSON, and SPARK-57478 extended them to text, all gated by 
`spark.sql.files.archive.reader.enabled`. This PR extends the same capability 
to the XML data source.
   
   When the flag is enabled, the V1 XML data source reads a tar archive as if 
it were a directory of its entries: each entry is streamed through 
`ArchiveReader` (never unpacked to disk) and parsed exactly like a standalone 
XML file (`XmlDataSource.readArchive` -> `StaxXmlParser.parseStream`). Schema 
inference reads every archive entry together with any loose files in a single 
`XmlInferSchema` pass (`inferWithArchives`), so the inferred schema matches a 
directory read of the same files. The whole archive is one non-splittable unit 
(`XmlFileFormat.isSplitable` returns false), and a corrupt/missing archive is 
skipped as a unit under `ignoreCorruptFiles`/`ignoreMissingFiles`. XML has no 
DSv2 reader, so the archive scan is V1-only and no `Table` change is needed.
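
The read path described above (stream each entry, parse it like a standalone XML file, never unpack to disk) can be pictured with a small stdlib-only Python sketch. This is not Spark's `ArchiveReader` or `StaxXmlParser`; the names `build_tar_gz` and `read_archive_rows`, the `row_tag` parameter, and the `_` prefix on attribute keys are assumptions made for illustration.

```python
import io
import tarfile
import xml.etree.ElementTree as ET


def build_tar_gz(files):
    """Build an in-memory .tar.gz from {name: xml_text} (fixture for the sketch)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_archive_rows(archive_bytes, row_tag):
    """Yield one dict per <row_tag> element, visiting entries in archive order."""
    # Mode "r|*" reads the tar as a forward-only stream with transparent
    # compression detection, so no entry is ever extracted to disk.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            entry = tar.extractfile(member)
            # Each entry is parsed as if it were a standalone XML file.
            for _, elem in ET.iterparse(entry, events=("end",)):
                if elem.tag == row_tag:
                    # Hypothetical naming choice: attributes get a "_" prefix.
                    row = {"_" + k: v for k, v in elem.attrib.items()}
                    row.update({child.tag: child.text for child in elem})
                    yield row
```

Because the whole archive is consumed front to back as one stream, it is natural to treat it as a single non-splittable unit, which is the behavior the description attributes to `XmlFileFormat.isSplitable`.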
   
   This also adjusts the `readArchive` entry point (JSON and XML) to take a 
parser factory and build a fresh parser for each archive entry -- matching the 
per-file parser of a non-archive read -- rather than sharing one parser across 
all entries.
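
As a rough illustration of the parser-factory change, the following Python sketch builds a fresh parser for every entry instead of reusing one. It is not the actual `readArchive` signature; `parse_entries` and `parser_factory` are hypothetical names, and Python's `xml.etree.ElementTree.XMLParser` stands in for Spark's XML parser.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterable, List, Tuple


def parse_entries(
    entries: Iterable[Tuple[str, bytes]],
    parser_factory: Callable[[], ET.XMLParser],
) -> List[Tuple[str, ET.Element]]:
    """Parse each (name, payload) entry with its own parser instance."""
    results = []
    for name, payload in entries:
        # A new parser per entry: no state carries over between entries,
        # mirroring the one-parser-per-file shape of a non-archive read.
        parser = parser_factory()
        parser.feed(payload)
        results.append((name, parser.close()))  # close() returns the root element
    return results
```

Passing a factory rather than a parser instance keeps construction in the caller's hands while making the per-entry lifetime explicit.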
   
   ### Why are the changes needed?
   
   To let XML ingestion read tar archives without unpacking them to disk, 
matching the CSV, JSON, and text behavior already in Spark.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. With `spark.sql.files.archive.reader.enabled=true` (default false), the 
XML data source can read and infer schemas from `.tar`/`.tar.gz`/`.tgz` files.
   
   ### How was this patch tested?
   
   New `XMLTarArchiveReadSuite` (mixing `XMLArchiveReadBase` with the shared 
`ArchiveReadSuiteBase` and `TarArchiveReadBase`), exercising the shared archive 
read/inference/complex-type tests plus XML-specific tests: multi-line records, 
attributes, and single-pass null-field widening against a loose file.
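
The "single-pass null-field widening" case can be pictured with a simplified Python sketch: a field that is always empty in the archive entries takes its type from a loose file when both are inferred in one pass. The type lattice below (`null` < `long` < `double` < `string`) and the names `infer_value`, `merge`, and `infer_schema` are assumptions for illustration, not the rules implemented by `XmlInferSchema`.

```python
from typing import Dict, Iterable, Optional

NULL, LONG, DOUBLE, STRING = "null", "long", "double", "string"
_RANK = {NULL: 0, LONG: 1, DOUBLE: 2, STRING: 3}


def infer_value(text: Optional[str]) -> str:
    """Infer a type name for one field value (simplified lattice)."""
    if text is None or text.strip() == "":
        return NULL
    try:
        int(text)
        return LONG
    except ValueError:
        pass
    try:
        float(text)
        return DOUBLE
    except ValueError:
        return STRING


def merge(a: str, b: str) -> str:
    """Return the wider of two type names."""
    return a if _RANK[a] >= _RANK[b] else b


def infer_schema(rows: Iterable[Dict[str, Optional[str]]]) -> Dict[str, str]:
    """Fold every row (archive entries and loose files alike) into one schema."""
    schema: Dict[str, str] = {}
    for row in rows:
        for field, text in row.items():
            schema[field] = merge(schema.get(field, NULL), infer_value(text))
    return schema
```

Inferring the archive rows alone would leave the empty field at `null`; folding in the loose file's rows during the same pass widens it to the concrete type.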
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code
   

