akshatshenoi-db opened a new pull request, #56709: URL: https://github.com/apache/spark/pull/56709
### What changes were proposed in this pull request?

Extend the Avro data source to read and infer schema from `.tar`/`.tar.gz`/`.tgz` archives when `spark.sql.files.archive.reader.enabled` is set, continuing the tar-archive reader series: SPARK-57135 (CSV read), SPARK-57321 (CSV inference), SPARK-57419 (JSON), SPARK-57478 (text), SPARK-57479 (XML).

Reads stream each archive entry through a forward-only `DataFileStream`, deserializing it like a standalone Avro file. The archive is never unpacked to disk and memory stays bounded.

Inference reads each entry's writer schema from its Avro header (records are never scanned), also streamed, and uses the first readable schema like a directory read -- Avro does not merge schemas across files.

The whole archive is a single split (see `isSplitable`), so the dispatch lives on the V1 read path; Avro has no V2 reader, so the archive scan is V1-only.

`ArchiveReader` now opens the first entry eagerly and releases the stream if opening a corrupt archive fails, so the driver-side header-only inference path does not leak it.

On the test side, `ArchiveReadSuiteBase.encodeFile` becomes a shared default (the format's single part-file bytes via `readOptions ++ writeOptions`), so the CSV/JSON/XML traits drop their identical overrides. Because Avro infers from an embedded header rather than from content, it runs the inference-parity tests but excludes the type-widening test and opts out of the schema-merge tests, and it adds two streaming-reader regression tests (a truncated entry must fail fast, a drip-fed entry must read fully) that have no format-agnostic analogue.

### Why are the changes needed?

The archive reader already supports CSV, JSON, text, and XML. Avro is a common container for archived data, and extending the same opt-in archive path to Avro lets users read and infer schema from Avro files packed in a tar archive without unpacking them first, with the same directory-read parity the rest of the series guarantees.
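To illustrate the header-only, forward-only inference described under "What changes were proposed", here is a minimal Python sketch. It is an analogy built on the standard library, not the Scala code in this PR: the function names (`read_writer_schema`, `infer_schema_from_tar`) are hypothetical, and the choice of which errors make an entry "unreadable" is an assumption. It relies only on the Avro object container layout (4-byte magic, then a metadata map holding `avro.schema`) and on `tarfile`'s stream mode, which reads entries in order without unpacking them to disk.

```python
import json
import tarfile

AVRO_MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def _read_exact(stream, n):
    """Read exactly n bytes, looping over short reads (a drip-fed stream)
    and raising EOFError as soon as the stream ends early (a truncated entry)."""
    if n < 0:
        raise ValueError("negative length in Avro header")
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError(f"stream ended {remaining} bytes short")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def _read_long(stream):
    """Decode one Avro long: a little-endian base-128 varint, zigzag-encoded."""
    acc = 0
    shift = 0
    while True:
        byte = _read_exact(stream, 1)[0]
        acc |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1)


def read_writer_schema(stream):
    """Return the writer schema from an Avro header without touching records."""
    if _read_exact(stream, 4) != AVRO_MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = _read_long(stream)
        if count == 0:  # a zero-count block terminates the metadata map
            break
        if count < 0:  # negative count: a block byte size follows; skip it
            _read_long(stream)
            count = -count
        for _ in range(count):
            key = _read_exact(stream, _read_long(stream)).decode("utf-8")
            meta[key] = _read_exact(stream, _read_long(stream))
    return json.loads(meta["avro.schema"])


def infer_schema_from_tar(fileobj):
    """Return the first readable writer schema in a tar stream, or None.

    Mode "r|*" is tarfile's forward-only stream mode with transparent
    decompression, so entries are visited in order and never extracted.
    """
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            entry = tar.extractfile(member)
            try:
                return read_writer_schema(entry)
            except (EOFError, ValueError, KeyError):
                # Assumption: treat these as "unreadable" and try the next entry.
                continue
    return None
```

Calling `infer_schema_from_tar(open("data.tar.gz", "rb"))` (the path is a placeholder) returns the parsed schema of the first entry whose header decodes. No schemas are merged, mirroring the first-readable-schema behavior described above.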
### Does this PR introduce _any_ user-facing change?

Yes. When `spark.sql.files.archive.reader.enabled` is set (default `false`), the Avro data source can now read and infer schema from `.tar`/`.tar.gz`/`.tgz` archives. With the flag at its default, behavior is unchanged.

### How was this patch tested?

Added `AvroTarArchiveReadSuite`, which runs the shared archive read/inference tests from `ArchiveReadSuiteBase` (bound to Avro via `AvroArchiveReadBase`) over tar containers via `TarArchiveReadBase`, plus the two Avro-specific streaming-reader regression tests.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code
