Akshat Shenoi created SPARK-57591:
-------------------------------------

             Summary: [SQL] Read and infer Avro schema from tar archives
                 Key: SPARK-57591
                 URL: https://issues.apache.org/jira/browse/SPARK-57591
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.3.0
            Reporter: Akshat Shenoi
            Assignee: Akshat Shenoi
             Fix For: 4.3.0


Following the CSV (SPARK-57135 / SPARK-57321) and JSON (SPARK-57419) 
tar-archive support, extend the Avro data source to read and infer schema from 
.tar/.tar.gz/.tgz archives, gated by spark.sql.files.archive.reader.enabled.
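
For illustration, a minimal sketch of how a user might opt in from PySpark. The 
flag name comes from this issue; the helper name read_avro_archive and the 
example path are hypothetical, and the sketch assumes the flag can be set at 
runtime on the session and that the Avro module is on the classpath.

```python
# Flag name as given in this issue.
ARCHIVE_FLAG = "spark.sql.files.archive.reader.enabled"


def read_avro_archive(spark, path):
    """Enable the archive reader, then load a tar archive as an Avro dataset.

    `spark` is expected to be a SparkSession. Setting the flag per session
    is an assumption; it may need to be supplied at startup instead.
    """
    spark.conf.set(ARCHIVE_FLAG, "true")
    # The archive path is passed where a file or directory path would go.
    return spark.read.format("avro").load(path)
```

Usage would look like read_avro_archive(spark, "/data/events.tar.gz"), with the 
returned DataFrame treated like any other Avro read.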

When the flag is enabled, the V1 Avro source treats a tar archive as a 
directory of its entries. Reads stream each entry through a forward-only 
DataFileStream; entries are never unpacked to disk and memory stays bounded. A 
fresh datum reader and deserializer are built per entry, since each entry 
carries its own writer schema in its header. Schema inference reads each 
entry's writer schema from the Avro header (records are never scanned) and, 
like a directory read, uses the first readable schema for the whole dataset. 
Avro does not merge schemas across files, so schema evolution across entries is 
not supported.
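
The header-only inference described above can be sketched in plain Python. This 
is an illustration of the idea, not the Spark implementation: it walks a tar 
stream forward-only, parses just the Avro container header of each entry (magic 
bytes, then the metadata map holding avro.schema), and returns the first schema 
it can read. The function names are hypothetical, and skipping unreadable 
entries is an assumption about what "first readable" means here.

```python
import json
import tarfile

AVRO_MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def _read_exact(stream, n):
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("truncated Avro header")
    return data


def _read_long(stream):
    """Decode one Avro long (variable-length, zigzag encoded)."""
    shift, acc = 0, 0
    while True:
        byte = _read_exact(stream, 1)[0]
        acc |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1)


def read_writer_schema(stream):
    """Read only the header metadata and return the parsed writer schema."""
    if _read_exact(stream, 4) != AVRO_MAGIC:
        raise ValueError("not an Avro container file")
    meta = {}
    while True:
        count = _read_long(stream)
        if count == 0:  # a zero-count block ends the metadata map
            break
        if count < 0:  # negative count is followed by the block size in bytes
            count = -count
            _read_long(stream)
        for _ in range(count):
            key = _read_exact(stream, _read_long(stream)).decode("utf-8")
            meta[key] = _read_exact(stream, _read_long(stream))
    if "avro.schema" not in meta:
        raise ValueError("header has no avro.schema entry")
    return json.loads(meta["avro.schema"])


def first_readable_schema(archive):
    """Return (entry name, schema) for the first entry with a readable header."""
    # "r|*" opens a forward-only stream and auto-detects compression.
    with tarfile.open(fileobj=archive, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            try:
                return member.name, read_writer_schema(tar.extractfile(member))
            except ValueError:
                continue  # assumption: unreadable entries are skipped
    return None
```

Because only the header bytes of each entry are consumed, the cost of inference 
in this sketch does not depend on how many records an entry holds.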

The shared ArchiveReader is hardened to open the first entry eagerly and 
release the stream on a corrupt archive, so corruption surfaces (and is cleaned 
up) on the driver-side, header-only inference path that has no task-completion 
listener.
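
As a rough analogy for the eager-open behavior, here is a Python sketch using 
the standard tarfile module rather than Spark's ArchiveReader. The function name 
open_archive_eagerly is hypothetical. The point it shows is that the first entry 
is probed at open time and the underlying stream is released immediately if the 
archive is corrupt, so no later cleanup hook is needed.

```python
import tarfile


def open_archive_eagerly(fileobj):
    """Open a tar stream and probe its first entry right away.

    On a corrupt archive the caller-supplied stream is closed before the
    error propagates, so nothing relies on a later cleanup callback.
    """
    tar = None
    try:
        tar = tarfile.open(fileobj=fileobj, mode="r|*")
        first = tar.next()  # None for an archive with no entries
    except Exception:
        if tar is not None:
            tar.close()
        fileobj.close()  # tarfile does not close a stream it was handed
        raise
    return tar, first
```

On success the caller owns the returned archive handle and is responsible for 
closing it once the entries have been consumed.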

The whole archive is one non-splittable unit. Avro has no DSv2 reader, so the 
archive scan is V1-only.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
