Akshat Shenoi created SPARK-57591:
-------------------------------------
Summary: [SQL] Read and infer Avro schema from tar archives
Key: SPARK-57591
URL: https://issues.apache.org/jira/browse/SPARK-57591
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.3.0
Reporter: Akshat Shenoi
Assignee: Akshat Shenoi
Fix For: 4.3.0
Following the CSV (SPARK-57135 / SPARK-57321) and JSON (SPARK-57419)
tar-archive support, extend the Avro data source to read and infer schema from
.tar/.tar.gz/.tgz archives, gated by spark.sql.files.archive.reader.enabled.
When the flag is enabled, the V1 Avro source treats a tar archive as a
directory of its entries. Reads stream each entry through a forward-only
DataFileStream, so entries are never unpacked to disk and memory stays bounded.
A fresh datum reader and deserializer are built per entry because each entry
carries its own writer schema in its header. Schema inference reads each
entry's writer schema from the Avro header (records are never scanned) and,
like a directory read, uses the first readable schema for the whole dataset.
Avro does not merge schemas across files (schema evolution is not supported).
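
As an illustration of the header-only inference described above, the following
Python sketch walks a tar stream forward-only and pulls the writer schema out
of each entry's Avro container header without decoding any records. It uses
only the standard library and hand-parses the header layout from the Avro
specification (magic bytes, metadata map, sync marker). The function names
read_writer_schema and infer_schema_from_tar are hypothetical, and this is not
Spark's implementation, which goes through DataFileStream.

```python
import json
import tarfile

AVRO_MAGIC = b"Obj\x01"


def _read_exact(stream, n):
    """Read exactly n bytes or raise EOFError."""
    data = stream.read(n)
    if data is None or len(data) != n:
        raise EOFError("truncated Avro header")
    return data


def _read_long(stream):
    """Decode one Avro long (zig-zag encoded varint)."""
    shift, acc = 0, 0
    while True:
        byte = _read_exact(stream, 1)[0]
        acc |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1)


def read_writer_schema(stream):
    """Return the writer schema (parsed JSON) from an Avro container header.

    Only the header is consumed; record blocks are never touched.
    """
    if _read_exact(stream, 4) != AVRO_MAGIC:
        raise ValueError("not an Avro container file")
    meta = {}
    while True:
        count = _read_long(stream)
        if count == 0:  # a zero count terminates the metadata map
            break
        if count < 0:  # negative count is followed by the block size in bytes
            count = -count
            _read_long(stream)
        for _ in range(count):
            key = _read_exact(stream, _read_long(stream)).decode("utf-8")
            meta[key] = _read_exact(stream, _read_long(stream))
    return json.loads(meta["avro.schema"])


def infer_schema_from_tar(fileobj):
    """Return the first readable writer schema among the archive's entries."""
    # "r|*" is tarfile's forward-only stream mode with transparent
    # compression detection, so nothing is unpacked to disk.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for entry in archive:
            if not entry.isfile():
                continue
            try:
                return read_writer_schema(archive.extractfile(entry))
            except (ValueError, EOFError, KeyError):
                continue  # unreadable entry: try the next one
    return None
```

Returning at the first readable schema mirrors the "first readable schema for
the whole dataset" rule: later entries are never opened, and no attempt is
made to merge schemas.
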
The shared ArchiveReader is hardened to open the first entry eagerly and to
release the stream when the archive is corrupt. Corruption therefore surfaces,
and the stream is cleaned up, even on the driver-side, header-only inference
path, which has no task-completion listener.
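
The hardening described above can be pictured with the following Python
sketch. It is only an analogy for the shared ArchiveReader, not its actual
code: the constructor advances to the first entry immediately, and if that
fails it closes the underlying stream before re-raising, so a caller with no
cleanup hook does not leak the handle. The class name EagerTarReader and its
methods are hypothetical.

```python
import tarfile


class EagerTarReader:
    """Forward-only tar reader that validates the archive on construction."""

    def __init__(self, fileobj):
        self._fileobj = fileobj
        self._archive = None
        try:
            # Stream mode: entries can only be visited once, in order.
            self._archive = tarfile.open(fileobj=fileobj, mode="r|*")
            # Fetch the first header during construction so a corrupt
            # archive fails here rather than on a later call.
            self._pending = self._archive.next()
        except Exception:
            self.close()  # release the stream before propagating
            raise

    def entries(self):
        """Yield (name, file-like) pairs for regular files, in archive order."""
        entry = self._pending
        while entry is not None:
            if entry.isfile():
                yield entry.name, self._archive.extractfile(entry)
            entry = self._archive.next()

    def close(self):
        """Close the tar reader (if it was opened) and the caller's stream."""
        if self._archive is not None:
            self._archive.close()
        self._fileobj.close()
```

Because the failure is raised from the constructor and the stream is already
closed by then, a caller that only wants the header needs no try/finally of
its own around construction.
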
The whole archive is one non-splittable unit. Avro has no DSv2 reader, so the
archive scan is V1-only.