[ 
https://issues.apache.org/jira/browse/SPARK-57705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akshat Shenoi updated SPARK-57705:
----------------------------------
    Description: 
The streaming archive reader (ArchiveReader) currently supports tar archives 
(.tar/.tar.gz/.tgz), letting supported data sources read each entry as a 
separate file when spark.sql.files.archive.reader.enabled is set. This extends 
it to zip (.zip) archives.

Zip is one of the most common containers for shipped/archived data. Adding it 
lets users read and infer schema from files packed in a .zip without unpacking 
them to disk first, with the same parity against a plain directory read that 
the rest of the archive-reader series guarantees.

Because the read/inference integration is format-agnostic (every data source 
dispatches through ArchiveReader.isArchivePath and 
ArchiveReader(path).readEntries(...)), zip support applies to every data source 
already wired up (CSV, JSON, text, XML, Avro) with no per-data-source changes.
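
To illustrate the dispatch idea above, here is a minimal sketch of suffix-based archive detection. It is not the code from the patch: the class ArchiveDispatchSketch, the Format enum, and the detect method are hypothetical names, and the case-insensitive match is an assumption of this sketch. Only the suffixes (.tar, .tar.gz, .tgz, .zip) and the isArchivePath name come from the issue text.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical illustration of suffix-based archive dispatch; not Spark's actual code. */
final class ArchiveDispatchSketch {
    /** Container formats named in the issue: tar, gzip-compressed tar, and zip. */
    enum Format { TAR, TAR_GZ, ZIP }

    /** Maps a path to a format by suffix, or empty if the path is not an archive. */
    static Optional<Format> detect(String path) {
        // Assumption: suffix matching ignores case.
        String p = path.toLowerCase(Locale.ROOT);
        if (p.endsWith(".tar.gz") || p.endsWith(".tgz")) return Optional.of(Format.TAR_GZ);
        if (p.endsWith(".tar")) return Optional.of(Format.TAR);
        if (p.endsWith(".zip")) return Optional.of(Format.ZIP);
        return Optional.empty();
    }

    /** A data source only needs this yes/no answer before handing off to a reader. */
    static boolean isArchivePath(String path) {
        return detect(path).isPresent();
    }
}
```

Because a data source only asks whether a path is an archive and then iterates over entries, adding a new container format in a design like this touches the detection and the reader, not the data sources.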

Scope:
- The shared entry-streaming engine is hoisted from the tar reader into the 
abstract ArchiveReader base, since tar and zip both stream via the 
commons-compress ArchiveInputStream. A new ZipArchiveReader opens a 
ZipArchiveInputStream. Zip entries are individually deflated, so no Hadoop 
codec layer is applied, and the reader streams local file headers sequentially, 
matching the tar reader's pure-streaming model.
- isArchivePath and apply recognize .zip paths and dispatch them to the zip 
reader. No new config flag is added; zip rides the existing, default-off 
spark.sql.files.archive.reader.enabled gate.
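
The sequential, header-by-header streaming model described in the scope can be sketched with the JDK alone. This example swaps in java.util.zip.ZipInputStream as a stand-in for the commons-compress ZipArchiveInputStream that the patch uses; both read entries in stream order without seeking to the central directory. The class ZipStreamingSketch and its methods are hypothetical, and materializing each entry as a String is a simplification for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical sketch of streaming zip entries one at a time; not the patch's code. */
final class ZipStreamingSketch {
    /** Builds a small zip in memory so the example needs no files on disk. */
    static byte[] buildZip(Map<String, String> files) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, String> f : files.entrySet()) {
                zip.putNextEntry(new ZipEntry(f.getKey()));
                zip.write(f.getValue().getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed here, so the archive is complete.
        return bytes.toByteArray();
    }

    /** Reads entries in stream order; each entry is inflated as it is encountered. */
    static Map<String, String> readEntries(InputStream in) {
        Map<String, String> out = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            byte[] chunk = new byte[8192];
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) continue; // skip directory markers
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry, not the archive.
                while ((n = zip.read(chunk)) != -1) buf.write(chunk, 0, n);
                out.put(entry.getName(), new String(buf.toByteArray(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

Since each zip entry carries its own compression, the loop needs no outer decompression step, which mirrors the point above that no Hadoop codec layer is applied for zip.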

This continues the archive-reader series: SPARK-57135 (CSV read), SPARK-57321 
(CSV inference), SPARK-57419 (JSON), SPARK-57478 (text), SPARK-57479 (XML), 
SPARK-57481 (Avro).

With the flag at its default (false), behavior is unchanged.

> [SQL] Zip support for Archive Reader
> ------------------------------------
>
>                 Key: SPARK-57705
>                 URL: https://issues.apache.org/jira/browse/SPARK-57705
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.3.0
>            Reporter: Akshat Shenoi
>            Assignee: Akshat Shenoi
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.3.0
>
>



