[PR] [SPARK-57481][SQL] Read and infer Avro schema from tar archives [spark]

via GitHub Tue, 23 Jun 2026 15:23:06 -0700


akshatshenoi-db opened a new pull request, #56709:
URL: https://github.com/apache/spark/pull/56709


   ### What changes were proposed in this pull request?
   
   Extend the Avro data source to read and infer schema from 
`.tar`/`.tar.gz`/`.tgz` archives when `spark.sql.files.archive.reader.enabled` 
is set, continuing the tar-archive reader series: SPARK-57135 (CSV read), 
SPARK-57321 (CSV inference), SPARK-57419 (JSON), SPARK-57478 (text), 
SPARK-57479 (XML).
   
   Reads stream each archive entry through a forward-only `DataFileStream` and 
deserialize it like a standalone Avro file, so the archive is never unpacked to 
disk and memory stays bounded. Inference is also streamed: it reads each 
entry's writer schema from its Avro header without scanning records, and it 
uses the first readable schema, as a directory read does, because Avro does not 
merge schemas across files. The whole archive is a single split (see 
`isSplitable`), so the dispatch lives on the V1 read path; Avro has no V2 
reader, so the archive scan is V1-only. `ArchiveReader` now opens the first 
entry eagerly and releases the stream if opening a corrupt archive fails, so 
the driver-side header-only inference path does not leak the stream.
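
   To illustrate the header-only, forward-only inference described above, here 
is a minimal standalone sketch in Python using only the standard library. It is 
not the code in this PR and does not use Spark or `DataFileStream`: the names 
`read_writer_schema`, `infer_schema`, `avro_header`, and `build_archive` are 
hypothetical, and skipping an entry whose header cannot be parsed is an 
assumption about what "first readable schema" means. The header layout (magic 
bytes, metadata map, sync marker) follows the Avro object container file format.

```python
import io
import json
import tarfile

AVRO_MAGIC = b"Obj\x01"

def read_exact(stream, n):
    """Read exactly n bytes from a forward-only stream, or raise EOFError."""
    buf = b""
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:
            raise EOFError("truncated Avro header")
        buf += chunk
    return buf

def read_long(stream):
    """Decode one Avro long (zigzag-encoded varint)."""
    shift, acc = 0, 0
    while True:
        b = read_exact(stream, 1)[0]
        acc |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1)

def read_writer_schema(stream):
    """Parse only the container header; record blocks are never touched."""
    if read_exact(stream, 4) != AVRO_MAGIC:
        raise ValueError("not an Avro container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:
            break                      # end of the metadata map
        if count < 0:
            read_long(stream)          # block byte size, unused here
            count = -count
        for _ in range(count):
            key = read_exact(stream, read_long(stream)).decode("utf-8")
            meta[key] = read_exact(stream, read_long(stream))
    return json.loads(meta["avro.schema"])

def infer_schema(archive_bytes):
    """Return the first readable writer schema, streaming the tar forward-only."""
    # Mode "r|*" reads the archive as a non-seekable stream and auto-detects gzip.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            try:
                return read_writer_schema(tar.extractfile(member))
            except (EOFError, ValueError, KeyError):
                continue               # assumption: skip unreadable entries
    return None

# --- Demo helpers (hypothetical): build header-only Avro entries in memory ---
def write_long(n):
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def avro_header(schema):
    meta = {"avro.schema": json.dumps(schema).encode(), "avro.codec": b"null"}
    out = AVRO_MAGIC + write_long(len(meta))
    for k, v in meta.items():
        kb = k.encode()
        out += write_long(len(kb)) + kb + write_long(len(v)) + v
    return out + write_long(0) + b"\x00" * 16   # map terminator, sync marker

def build_archive(entries):
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in entries:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

schema_a = {"type": "record", "name": "A", "fields": [{"name": "id", "type": "long"}]}
schema_b = {"type": "record", "name": "B", "fields": [{"name": "s", "type": "string"}]}
archive = build_archive([
    ("broken.avro", b"not avro"),
    ("a.avro", avro_header(schema_a)),
    ("b.avro", avro_header(schema_b)),
])
inferred = infer_schema(archive)
```

   In this sketch `inferred` is the schema of `a.avro`: the unreadable first 
entry is skipped, and `b.avro` is never consulted because no merging happens.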
   
   On the test side, `ArchiveReadSuiteBase.encodeFile` becomes a shared default 
(the format's single part-file bytes via `readOptions ++ writeOptions`), so the 
CSV/JSON/XML traits drop their identical overrides. Because Avro infers from an 
embedded header rather than from content, the Avro suite runs the 
inference-parity tests but excludes the type-widening test and opts out of the 
schema-merge tests. It also adds two streaming-reader regression tests (a 
truncated entry must fail fast, a drip-fed entry must read fully) that have no 
format-agnostic analogue.
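
   The two streaming behaviors named above (fail fast on truncation, read 
fully from a drip-fed source) can be sketched in isolation as follows. This is 
an illustrative Python example, not the Scala tests added by this PR; 
`DripStream` and `read_fully` are hypothetical names introduced here.

```python
class DripStream:
    """Hypothetical source that hands back at most one byte per read call."""

    def __init__(self, data):
        self._data = data
        self._pos = 0

    def read(self, n=-1):
        if n == 0 or self._pos >= len(self._data):
            return b""
        chunk = self._data[self._pos:self._pos + 1]
        self._pos += 1
        return chunk

def read_fully(stream, n):
    """Loop until exactly n bytes arrive; a short read alone is not an error."""
    out = bytearray()
    while len(out) < n:
        chunk = stream.read(n - len(out))
        if not chunk:
            # End of stream before n bytes: the entry is truncated.
            raise EOFError(f"expected {n} bytes, got {len(out)}")
        out += chunk
    return bytes(out)

payload = b"0123456789"

# Drip-fed entry: every read returns one byte, yet the full payload is read.
drip_result = read_fully(DripStream(payload), len(payload))

# Truncated entry: only 4 of the expected 10 bytes exist, so reading fails.
try:
    read_fully(DripStream(payload[:4]), len(payload))
    truncated_failed = False
except EOFError:
    truncated_failed = True
```

   The design point is that a reader must treat a short read as "keep going" 
and only an empty read as end of data; conflating the two would either drop 
bytes from a slow source or hang on a truncated one.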
   
   ### Why are the changes needed?
   
   The archive reader already supports CSV, JSON, text, and XML. Avro is a 
common container for archived data, and extending the same opt-in archive path 
to Avro lets users read and infer schema from Avro files packed in a tar 
archive without unpacking them first, with the same directory-read parity the 
rest of the series guarantees.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. When `spark.sql.files.archive.reader.enabled` is set (default `false`), 
the Avro data source can now read and infer schema from `.tar`/`.tar.gz`/`.tgz` 
archives. With the flag at its default, behavior is unchanged.
   
   ### How was this patch tested?
   
   Added `AvroTarArchiveReadSuite`, which runs the shared archive 
read/inference tests from `ArchiveReadSuiteBase` (bound to Avro via 
`AvroArchiveReadBase`) over tar containers via `TarArchiveReadBase`, plus the 
two Avro-specific streaming-reader regression tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

