[ 
https://issues.apache.org/jira/browse/SPARK-57481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akshat Shenoi updated SPARK-57481:
----------------------------------
    Description: 
Following the CSV (SPARK-57135 / SPARK-57321) and JSON (SPARK-57419) 
tar-archive support, extend the Avro data source to read and infer schema from 
.tar/.tar.gz/.tgz archives, gated by spark.sql.files.archive.reader.enabled.

When the flag is enabled, the V1 Avro source treats a tar archive as a 
directory of its entries. Each entry is read through a forward-only 
DataFileStream, so nothing is unpacked to disk and memory stays bounded. A 
fresh datum reader and deserializer are built per entry because each entry 
carries its own writer schema in its header. Schema inference reads only each 
entry's writer schema from the Avro header (records are never scanned) and, 
like a directory read, uses the first readable schema for the whole dataset: 
the Avro source does not merge schemas across files, so schema evolution 
across entries is not supported.
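
As an illustration only (this is not the Spark code), the header-only 
inference described above can be sketched in plain Python. The function names 
are hypothetical, and skipping entries whose header cannot be parsed is an 
assumption about what "first readable schema" means. The sketch walks the 
archive as a forward-only stream, parses only the Avro object container 
header of each entry, and returns the first writer schema it finds.

```python
import io
import json
import tarfile

AVRO_MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def _read_long(stream):
    """Decode one Avro long (zigzag-encoded varint) from a binary stream."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("truncated Avro header")
        accum |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)


def _read_bytes(stream):
    """Read one length-prefixed Avro bytes/string value."""
    length = _read_long(stream)
    if length < 0:
        raise ValueError("negative length in Avro header")
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("truncated Avro header")
    return data


def read_writer_schema(stream):
    """Return the writer schema from an Avro header; data blocks are not read."""
    if stream.read(4) != AVRO_MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = _read_long(stream)
        if count == 0:  # a zero count terminates the metadata map
            break
        if count < 0:  # negative count: a block byte size follows
            _read_long(stream)
            count = -count
        for _ in range(count):
            key = _read_bytes(stream).decode("utf-8")
            meta[key] = _read_bytes(stream)
    return json.loads(meta["avro.schema"])


def infer_schema_from_tar(fileobj):
    """Hypothetical helper: first readable writer schema in a tar archive."""
    # "r|*" reads the archive as a forward-only stream and detects
    # compression transparently, so .tar, .tar.gz and .tgz all work.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            try:
                return read_writer_schema(tar.extractfile(member))
            except (ValueError, EOFError, KeyError):
                continue  # assumption: unreadable entries are skipped
    return None
```

Because only the header of each entry is consumed, the cost of inference in 
this sketch does not depend on how many records an entry holds.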

The shared ArchiveReader is hardened to open the first entry eagerly and 
release the stream on a corrupt archive, so corruption surfaces (and is cleaned 
up) on the driver-side, header-only inference path that has no task-completion 
listener.
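
The eager-open behaviour can be sketched the same way. EagerArchiveReader 
below is a hypothetical stand-in, not the actual ArchiveReader: it touches the 
first entry in its constructor and, if that fails, closes the underlying 
stream before re-raising, so a caller with no completion hook to fall back on 
does not leak the stream.

```python
import io
import tarfile


class EagerArchiveReader:
    """Hypothetical stand-in for the shared ArchiveReader (illustration only)."""

    def __init__(self, raw):
        self._raw = raw
        self._tar = None
        try:
            # In read mode tarfile.open already parses the first header, so a
            # corrupt archive fails here rather than on the first read.
            self._tar = tarfile.open(fileobj=raw, mode="r|*")
            # next() hands back that already-parsed first member.
            self._first = self._tar.next()
        except Exception:
            self.close()  # release the stream before surfacing the error
            raise

    def entries(self):
        """Yield (name, stream) for each regular file, in archive order."""
        member = self._first
        while member is not None:
            if member.isfile():
                yield member.name, self._tar.extractfile(member)
            member = self._tar.next()

    def close(self):
        try:
            if self._tar is not None:
                self._tar.close()
        finally:
            # tarfile does not close a caller-supplied file object.
            self._raw.close()
```

With this shape, constructing the reader on a corrupt archive raises 
tarfile.ReadError and leaves the raw stream closed; on a valid archive the 
caller iterates entries() and then calls close().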

The whole archive is one non-splittable unit. Avro has no DSv2 reader, so the 
archive scan is V1-only.

  was:
Following the CSV (SPARK-57135 / SPARK-57321) and JSON (SPARK-57419) 
tar-archive support, extend the XML data source to read and infer schema from 
.tar/.tar.gz/.tgz archives, gated by spark.sql.files.archive.reader.enabled.

When the flag is enabled, the V1 XML source treats a tar archive as a directory 
of its entries: each entry is streamed through the StaxXmlParser (never 
unpacked to disk) and tokenized into its rowTag-delimited records, exactly like 
a standalone XML file. Schema inference makes a single XmlInferSchema pass over 
every archive entry together with any loose files, so the inferred schema 
matches a directory read of the same files (a field typed in one input but 
absent in another widens; a NullType field survives to one final 
canonicalization rather than being collapsed per input).

The whole archive is one non-splittable unit. XML has no DSv2 reader, so the 
archive scan is V1-only.


> [SQL] Read and infer Avro schema from tar archives
> --------------------------------------------------
>
>                 Key: SPARK-57481
>                 URL: https://issues.apache.org/jira/browse/SPARK-57481
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.3.0
>            Reporter: Akshat Shenoi
>            Assignee: Akshat Shenoi
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.3.0
>
>
> Following the CSV (SPARK-57135 / SPARK-57321) and JSON (SPARK-57419) 
> tar-archive support, extend the Avro data source to read and infer schema 
> from .tar/.tar.gz/.tgz archives, gated by 
> spark.sql.files.archive.reader.enabled.
> When the flag is enabled, the V1 Avro source treats a tar archive as a 
> directory of its entries. Each entry is read through a forward-only 
> DataFileStream, so nothing is unpacked to disk and memory stays bounded. A 
> fresh datum reader and deserializer are built per entry because each entry 
> carries its own writer schema in its header. Schema inference reads only 
> each entry's writer schema from the Avro header (records are never scanned) 
> and, like a directory read, uses the first readable schema for the whole 
> dataset: the Avro source does not merge schemas across files, so schema 
> evolution across entries is not supported.
> The shared ArchiveReader is hardened to open the first entry eagerly and 
> release the stream on a corrupt archive, so corruption surfaces (and is 
> cleaned up) on the driver-side, header-only inference path that has no 
> task-completion listener.
> The whole archive is one non-splittable unit. Avro has no DSv2 reader, so the 
> archive scan is V1-only.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

