akshatshenoi-db opened a new pull request, #56480: URL: https://github.com/apache/spark/pull/56480
### What changes were proposed in this pull request?

SPARK-57135 added reading CSV files packed in tar archives (`.tar`/`.tar.gz`/`.tgz`) and SPARK-57321 added schema inference for them, both gated by `spark.sql.files.archive.reader.enabled`. This extends the same capability to the JSON data source.

When the flag is enabled, the V1 JSON data source reads a tar archive as if it were a directory of its entries: each entry is streamed through `ArchiveReader` (never unpacked to disk) and parsed exactly like a standalone JSON file, for both line-delimited and multi-line JSON (`JsonDataSource.readArchive`/`readStream`). Schema inference reads every archive entry together with any loose files in a single `JsonInferSchema` pass (`inferWithArchives`), so the inferred schema matches a directory read of the same files. The whole archive is one non-splittable unit (`JsonFileFormat.isSplitable` returns false), and a corrupt/missing archive is skipped as a unit under `ignoreCorruptFiles`/`ignoreMissingFiles`. The DSv2 reader cannot read archives, so `JsonTable` passes `supportsArchiveScan = false` and refuses to infer a schema for archive inputs (raising `UNABLE_TO_INFER_SCHEMA`).

Unlike CSV, JSON needs no per-entry header handling (records are self-describing, so one parser serves every entry) and no `mergeSchema`-style branching (`JsonInferSchema` already merges record types by field name across all inputs, so one pass is itself the union).

This also unifies the archive test suites: the format-agnostic inference and complex-type tests are hoisted into `ArchiveReadSuiteBase` behind capability hooks (`supportsSchemaInference`, `supportsComplexTypes`), so CSV, JSON, and future archive formats share them instead of each duplicating them.

### Why are the changes needed?

To let JSON ingestion read tar archives without unpacking them to disk, matching the CSV behavior already in Spark.

### Does this PR introduce _any_ user-facing change?

Yes.
With `spark.sql.files.archive.reader.enabled=true` (default false), the JSON data source can read and infer schemas from `.tar`/`.tar.gz`/`.tgz` files.

### How was this patch tested?

New `JSONTarArchiveReadSuite` (mixing `JSONArchiveReadBase` with the shared `ArchiveReadSuiteBase` and `TarArchiveReadBase`), plus the hoisted shared inference and complex-type tests now also exercised by the CSV suites.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code
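
For readers unfamiliar with the idea, the two behaviors described above (reading archive entries without unpacking them, and a single inference pass that unions fields by name) can be illustrated with a small standalone sketch. This is not Spark code and does not reflect how `ArchiveReader` or `JsonInferSchema` are implemented; it uses only the Python standard library, and the helper names `build_demo_archive`, `read_archive`, and `union_schema` are hypothetical, chosen for illustration.

```python
import io
import json
import tarfile


def build_demo_archive(entries):
    """Pack {entry_name: bytes} into an in-memory .tar.gz (demo data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in entries.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_archive(data, multi_line=False):
    """Yield one parsed JSON record at a time from every regular entry.

    Entries are read straight from the archive stream; nothing is
    extracted to disk.
    """
    # "r|*" opens the archive as a forward-only stream and detects
    # compression transparently, so .tar and .tar.gz are handled alike.
    with tarfile.open(fileobj=io.BytesIO(data), mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            payload = tar.extractfile(member).read()
            if multi_line:
                # Multi-line mode: the whole entry is one JSON document.
                yield json.loads(payload)
            else:
                # Line-delimited mode: one JSON record per non-blank line.
                for line in payload.splitlines():
                    if line.strip():
                        yield json.loads(line)


def union_schema(records):
    """Toy stand-in for inference: union field names and value types."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


archive = build_demo_archive({
    "a.json": b'{"id": 1, "name": "x"}\n{"id": 2}\n',
    "b.json": b'{"id": 3, "score": 0.5}\n',
})
records = list(read_archive(archive))
schema = union_schema(records)
```

Because every record carries its own field names, the same parser serves every entry and one pass over all records already yields the union of fields, which is the property the description relies on when it says JSON needs no header handling or `mergeSchema`-style branching.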
