[PR] [SPARK-57419][SQL] Read and infer JSON schema from tar archives [spark]

via GitHub Fri, 12 Jun 2026 11:58:25 -0700


akshatshenoi-db opened a new pull request, #56480:
URL: https://github.com/apache/spark/pull/56480


   ### What changes were proposed in this pull request?
   
   SPARK-57135 added reading CSV files packed in tar archives 
(`.tar`/`.tar.gz`/`.tgz`) and SPARK-57321 added schema inference for them, both 
gated by `spark.sql.files.archive.reader.enabled`. This extends the same 
capability to the JSON data source.
   
   When the flag is enabled, the V1 JSON data source reads a tar archive as if 
it were a directory of its entries: each entry is streamed through 
`ArchiveReader` (never unpacked to disk) and parsed exactly like a standalone 
JSON file, for both line-delimited and multi-line JSON 
(`JsonDataSource.readArchive`/`readStream`). Schema inference reads every 
archive entry together with any loose files in a single `JsonInferSchema` pass 
(`inferWithArchives`), so the inferred schema matches a directory read of the 
same files. The whole archive is one non-splittable unit 
(`JsonFileFormat.isSplitable` returns false), and a corrupt/missing archive is 
skipped as a unit under `ignoreCorruptFiles`/`ignoreMissingFiles`. The DSv2 
reader cannot read archives, so `JsonTable` passes `supportsArchiveScan = 
false` and refuses to infer a schema for archive inputs (raising 
`UNABLE_TO_INFER_SCHEMA`).
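
The "archive as a directory of entries" read described above can be illustrated with a small standalone Python sketch. This is not Spark's `ArchiveReader` or `JsonDataSource.readArchive`; it only shows the general technique of streaming tar entries and parsing each one as line-delimited JSON without extracting anything to disk. The helper names `build_demo_archive` and `read_archive_records` are hypothetical.

```python
import io
import json
import tarfile


def build_demo_archive() -> bytes:
    """Build a small .tar.gz in memory holding two line-delimited JSON entries."""
    entries = {
        "part-0.json": b'{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n',
        "part-1.json": b'{"id": 3, "tags": ["x"]}\n',
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in entries.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_archive_records(archive_bytes: bytes):
    """Yield (entry name, record) pairs, reading entries as streams."""
    # "r|*" opens the archive as a forward-only stream and detects
    # compression automatically, so .tar and .tar.gz take the same path.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            stream = tar.extractfile(member)  # file-like object, no disk writes
            for line in stream:
                if line.strip():
                    yield member.name, json.loads(line)


records = list(read_archive_records(build_demo_archive()))
```

Because the archive is consumed front to back as one stream, it cannot be divided among readers, which is the same reason the PR treats a whole archive as one non-splittable unit.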
   
   Unlike CSV, JSON needs no per-entry header handling (records are 
self-describing, so one parser serves every entry) and no `mergeSchema`-style 
branching (`JsonInferSchema` already merges record types by field name across 
all inputs, so a single pass already produces the union schema).
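
To make the "one pass is the union" point concrete, here is a toy Python sketch of merging inferred types by field name across several inputs. It is not `JsonInferSchema`: the type names and widening rules in `infer_type` and `merge_types` are simplified assumptions chosen for illustration, and all function names are hypothetical.

```python
import json


def infer_type(value):
    """Map a parsed JSON value to a coarse type name (toy rules)."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "struct"
    raise TypeError(f"unexpected JSON value: {value!r}")


def merge_types(a, b):
    """Pick a type compatible with both inputs (simplified widening)."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if {a, b} == {"long", "double"}:
        return "double"
    return "string"  # assumed fallback for incompatible types


def infer_schema(inputs):
    """Fold every record of every input into one field-name -> type map."""
    schema = {}
    for text in inputs:
        for line in text.splitlines():
            if not line.strip():
                continue
            for field, value in json.loads(line).items():
                inferred = infer_type(value)
                schema[field] = (
                    merge_types(schema[field], inferred) if field in schema else inferred
                )
    return schema


# Archive entries and loose files are simply concatenated into one input list.
archive_entries = ['{"id": 1, "name": "a"}\n{"id": 2.5}']
loose_files = ['{"id": 3, "tags": ["x"]}']
schema = infer_schema(archive_entries + loose_files)
```

Since the merge works per field name and does not depend on which input a record came from, feeding archive entries and loose files through the same fold gives the same result as reading them from one directory.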
   
   This also unifies the archive test suites: the format-agnostic inference and 
complex-type tests are hoisted into `ArchiveReadSuiteBase` behind capability 
hooks (`supportsSchemaInference`, `supportsComplexTypes`), so CSV, JSON, and 
future archive formats share them instead of each duplicating them.
   
   ### Why are the changes needed?
   
   To let JSON ingestion read tar archives without unpacking them to disk, 
matching the CSV behavior already in Spark.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. With `spark.sql.files.archive.reader.enabled=true` (default false), the 
JSON data source can read and infer schemas from `.tar`/`.tar.gz`/`.tgz` files.
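
A hedged PySpark usage sketch of this user-facing change follows. It assumes a Spark build that includes this PR and that the flag can be set at runtime through `spark.conf.set`; neither is confirmed here beyond the description above. The helper name `read_json_archive` is hypothetical, while `spark.conf.set` and `spark.read.json` are standard PySpark calls.

```python
ARCHIVE_READER_FLAG = "spark.sql.files.archive.reader.enabled"


def read_json_archive(spark, path, multi_line=False):
    """Enable the archive reader flag, then read `path` with the JSON source.

    `spark` is expected to be a pyspark.sql.SparkSession. Setting the flag at
    runtime is an assumption about how the config is registered.
    """
    spark.conf.set(ARCHIVE_READER_FLAG, "true")
    # With the flag on, `path` may point at a .tar, .tar.gz, or .tgz file.
    return spark.read.json(path, multiLine=multi_line)
```

With a real session this would be called as `read_json_archive(spark, "/data/events.tar.gz")`, where the path is a placeholder.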
   
   ### How was this patch tested?
   
   New `JSONTarArchiveReadSuite` (mixing `JSONArchiveReadBase` with the shared 
`ArchiveReadSuiteBase` and `TarArchiveReadBase`), plus the hoisted shared 
inference and complex-type tests now also exercised by the CSV suites.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

