akshatshenoi-db opened a new pull request, #56572: URL: https://github.com/apache/spark/pull/56572
### What changes were proposed in this pull request?

SPARK-57135 added reading CSV files packed in tar archives (`.tar`/`.tar.gz`/`.tgz`), SPARK-57321 added CSV schema inference, SPARK-57419 extended both to JSON, and SPARK-57478 to text, all gated by `spark.sql.files.archive.reader.enabled`. This extends the same capability to the XML data source.

When the flag is enabled, the V1 XML data source reads a tar archive as if it were a directory of its entries: each entry is streamed through `ArchiveReader` (never unpacked to disk) and parsed exactly like a standalone XML file (`XmlDataSource.readArchive` -> `StaxXmlParser.parseStream`).

Schema inference reads every archive entry together with any loose files in a single `XmlInferSchema` pass (`inferWithArchives`), so the inferred schema matches a directory read of the same files.

The whole archive is one non-splittable unit (`XmlFileFormat.isSplitable` returns false), and a corrupt/missing archive is skipped as a unit under `ignoreCorruptFiles`/`ignoreMissingFiles`.

XML has no DSv2 reader, so the archive scan is V1-only and no `Table` change is needed.

This also adjusts the `readArchive` entry point (JSON and XML) to take a parser factory and build a fresh parser for each archive entry, matching the per-file parser of a non-archive read, rather than sharing one parser across all entries.

### Why are the changes needed?

To let XML ingestion read tar archives without unpacking them to disk, matching the CSV, JSON, and text behavior already in Spark.

### Does this PR introduce _any_ user-facing change?

Yes. With `spark.sql.files.archive.reader.enabled=true` (default false), the XML data source can read and infer schemas from `.tar`/`.tar.gz`/`.tgz` files.

### How was this patch tested?
New `XMLTarArchiveReadSuite` (mixing `XMLArchiveReadBase` with the shared `ArchiveReadSuiteBase` and `TarArchiveReadBase`), exercising the shared archive read/inference/complex-type tests plus XML-specific tests: multi-line records, attributes, and single-pass null-field widening against a loose file.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code
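
### Illustrative sketch of the read path (not code from this patch)

The read behavior described under "What changes were proposed" (stream each tar entry without unpacking it to disk, build a fresh parser per entry, and skip a corrupt archive as a unit) can be sketched outside Spark. The following is a minimal illustration using only the Python standard library; it is not Spark's implementation and does not use `ArchiveReader` or `StaxXmlParser`. The names `read_xml_rows_from_tar`, `make_tar_gz`, and the `ignore_corrupt` parameter are hypothetical, and the row-to-dict mapping is a simplification rather than Spark's XML-to-row conversion.

```python
import io
import tarfile
import xml.etree.ElementTree as ET


def read_xml_rows_from_tar(archive_bytes, row_tag, ignore_corrupt=False):
    """Return one dict per <row_tag> element across all entries of a tar archive.

    Entries are read from the archive stream; nothing is extracted to disk.
    If anything in the archive fails to read or parse and ignore_corrupt is
    True, the whole archive contributes no rows (it is skipped as a unit).
    """
    rows = []
    try:
        # "r:*" lets tarfile detect plain or compressed (e.g. gzip) archives.
        with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
            for member in tar:
                if not member.isfile():
                    continue  # skip directories, links, and other non-file entries
                entry_stream = tar.extractfile(member)
                # Each iterparse call builds its own parser, so no parser
                # state is shared between archive entries.
                for _event, elem in ET.iterparse(entry_stream, events=("end",)):
                    if elem.tag == row_tag:
                        row = dict(elem.attrib)
                        row.update({child.tag: child.text for child in elem})
                        rows.append(row)
                        elem.clear()  # release the subtree once the row is built
    except (tarfile.TarError, ET.ParseError, EOFError, OSError):
        if ignore_corrupt:
            return []  # drop partial results: the archive is one unit
        raise
    return rows


def make_tar_gz(files):
    """Build an in-memory .tar.gz from a {name: xml_text} mapping (demo helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


if __name__ == "__main__":
    archive = make_tar_gz({
        "a.xml": '<books><book id="1"><title>A</title></book></books>',
        "b.xml": '<books><book id="2"><title>B</title></book></books>',
    })
    for row in read_xml_rows_from_tar(archive, "book"):
        print(row)
```

Returning an empty list on failure, rather than the rows parsed so far, mirrors the "skipped as a unit" wording above: because the archive is treated as a single non-splittable input, a failure partway through discards everything read from it.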
