[
https://issues.apache.org/jira/browse/SPARK-57481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wenchen Fan resolved SPARK-57481.
---------------------------------
Resolution: Fixed
Issue resolved by pull request 56709
[https://github.com/apache/spark/pull/56709]
> [SQL] Read and infer Avro schema from tar archives
> --------------------------------------------------
>
> Key: SPARK-57481
> URL: https://issues.apache.org/jira/browse/SPARK-57481
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.3.0
> Reporter: Akshat Shenoi
> Assignee: Akshat Shenoi
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.3.0
>
>
> Following the CSV (SPARK-57135 / SPARK-57321) and JSON (SPARK-57419)
> tar-archive support, extend the Avro data source to read and infer schema
> from .tar/.tar.gz/.tgz archives, gated by
> spark.sql.files.archive.reader.enabled.
>
> When the flag is enabled, the V1 Avro source treats a tar archive as a
> directory of its entries. Reads stream each entry through a forward-only
> DataFileStream, so the archive is never unpacked to disk and memory stays
> bounded. A fresh datum reader and deserializer are built per entry, since
> each entry carries its own writer schema in its header. Schema inference
> reads each entry's writer schema from the Avro header without scanning any
> records and, like a directory read, uses the first readable schema for the
> whole dataset: Avro does not merge schemas across files (schema evolution
> is not supported).
>
> The shared ArchiveReader is hardened to open the first entry eagerly and to
> release the stream when the archive is corrupt. Corruption therefore
> surfaces, and is cleaned up, on the driver-side, header-only inference path,
> which has no task-completion listener.
>
> The whole archive is one non-splittable unit. Avro has no DSv2 reader, so the
> archive scan is V1-only.
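
The header-only inference described above can be illustrated with a small,
standalone sketch. This is not the Spark implementation: it uses Python's
standard tarfile module in forward-only stream mode and a hand-written parser
for the Avro object container header. All helper names (write_long, read_long,
avro_header, build_tar_gz, read_writer_schema, infer_schema_from_tar) are
hypothetical and exist only for this example. Entries are read in archive
order, nothing is unpacked to disk, and the first entry whose header parses
supplies the schema.

```python
import io
import json
import tarfile

MAGIC = b"Obj\x01"  # Avro object container file magic bytes


def write_long(n: int) -> bytes:
    """Encode an int as an Avro long (zigzag, then base-128 varint)."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def read_long(stream) -> int:
    """Decode one Avro long from a forward-only byte stream."""
    shift = result = 0
    while True:
        b = stream.read(1)
        if not b:
            raise EOFError("truncated Avro header")
        result |= (b[0] & 0x7F) << shift
        if not b[0] & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def avro_header(schema: dict) -> bytes:
    """Build a header-only Avro container (no data blocks) for the demo."""
    meta = {"avro.schema": json.dumps(schema).encode(), "avro.codec": b"null"}
    out = bytearray(MAGIC)
    out += write_long(len(meta))
    for key, value in meta.items():
        kb = key.encode()
        out += write_long(len(kb)) + kb + write_long(len(value)) + value
    out += write_long(0)  # end of the metadata map
    out += bytes(16)      # sync marker (zeros here; real files use random bytes)
    return bytes(out)


def build_tar_gz(entries) -> io.BytesIO:
    """Pack (name, payload) pairs into an in-memory .tar.gz archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as archive:
        for name, payload in entries:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


def read_writer_schema(stream) -> dict:
    """Read only the container header and return the writer schema."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:
            break
        if count < 0:  # a negative count is followed by the block byte size
            count = -count
            read_long(stream)
        for _ in range(count):
            key = stream.read(read_long(stream)).decode("utf-8")
            meta[key] = stream.read(read_long(stream))
    return json.loads(meta["avro.schema"])


def infer_schema_from_tar(fileobj):
    """Return (entry name, schema) for the first readable entry, else None."""
    # "r|*" opens the archive as a non-seekable stream with transparent
    # compression detection, so entries are visited strictly in order.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for member in archive:
            if not member.isfile():
                continue
            entry = archive.extractfile(member)
            try:
                return member.name, read_writer_schema(entry)
            except (ValueError, EOFError, KeyError):
                continue  # unreadable entry: try the next one
    return None


user = {"type": "record", "name": "User",
        "fields": [{"name": "id", "type": "long"}]}
event = {"type": "record", "name": "Event",
         "fields": [{"name": "ts", "type": "long"}]}
archive = build_tar_gz([
    ("README.txt", b"not avro"),
    ("part-0.avro", avro_header(user)),
    ("part-1.avro", avro_header(event)),
])
name, schema = infer_schema_from_tar(archive)
print(name, schema["name"])  # part-0.avro User
```

In this sketch the second Avro entry has a different schema, and it is ignored
once the first readable schema is found, which mirrors the "first readable
schema wins, no merging" behaviour the description states. How Spark handles
unreadable entries in practice is not covered by the description; skipping
them here is an assumption made for the example.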