[
https://issues.apache.org/jira/browse/SPARK-57705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Akshat Shenoi updated SPARK-57705:
----------------------------------
Description:
The streaming archive reader (ArchiveReader) currently supports tar archives
(.tar/.tar.gz/.tgz), letting supported data sources read each entry as a
separate file when spark.sql.files.archive.reader.enabled is set. This extends
it to zip (.zip) archives.
Zip is one of the most common containers for shipped/archived data. Adding it
lets users read and infer schema from files packed in a .zip without unpacking
them to disk first, with the same directory-read parity as the rest of the
archive-reader series: reading an archive should give the same result as
reading a directory that holds its unpacked entries.
Because the read/inference integration is format-agnostic — every data source
dispatches through ArchiveReader.isArchivePath and
ArchiveReader(path).readEntries(...) — zip support applies to every data source
already wired up (CSV, JSON, text, XML, Avro) with no per-data-source changes.
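
As a rough illustration of why a format-agnostic entry point needs no
per-data-source changes, here is a minimal Python analogue of the dispatch
described above. It is not Spark's implementation: is_archive_path and
read_entries are hypothetical stand-ins for ArchiveReader.isArchivePath and
readEntries, the case-insensitive suffix match is an assumption, and the
standard-library tarfile/zipfile modules replace commons-compress.

```python
import io
import tarfile
import zipfile

# Suffixes named in this issue; case-insensitive matching is an assumption.
TAR_SUFFIXES = (".tar", ".tar.gz", ".tgz")
ZIP_SUFFIXES = (".zip",)


def is_archive_path(path):
    """Return True when the path ends with a recognized archive suffix."""
    return path.lower().endswith(TAR_SUFFIXES + ZIP_SUFFIXES)


def read_entries(path, fileobj):
    """Yield (entry_name, entry_bytes) for every regular file in the archive.

    Callers see the same shape for tar and zip, so a consumer written
    against this function does not need to know the container format.
    """
    lower = path.lower()
    if lower.endswith(ZIP_SUFFIXES):
        with zipfile.ZipFile(fileobj) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    yield info.filename, zf.read(info)
    elif lower.endswith(TAR_SUFFIXES):
        # "r:*" lets tarfile detect gzip compression transparently.
        with tarfile.open(fileobj=fileobj, mode="r:*") as tf:
            for member in tf:
                if member.isfile():
                    yield member.name, tf.extractfile(member).read()
    else:
        raise ValueError("not a recognized archive path: " + path)
```

A consumer such as a CSV parser would call read_entries once per archive and
treat each yielded pair as if it were a separate file.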
Scope:
- The shared entry-streaming engine is hoisted from the tar reader into the
abstract ArchiveReader base (tar and zip both stream via commons-compress
ArchiveInputStream); a new ZipArchiveReader opens a ZipArchiveInputStream. Zip
entries are individually deflated, so no Hadoop codec layer is applied, and the
reader streams local file headers sequentially (matching the tar reader's
pure-streaming model).
- isArchivePath/apply recognize and dispatch .zip. No new config flag — zip
rides the existing, default-off spark.sql.files.archive.reader.enabled gate.
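
To make "streams local file headers sequentially" concrete, the sketch below
walks a zip forward-only in Python, never consulting the central directory.
It is a simplified illustration under stated assumptions, not the
ZipArchiveReader code: iter_zip_entries and build_demo_zip are hypothetical
names, only stored and deflated entries are handled, and entries that defer
their sizes to a trailing data descriptor (or use zip64) are rejected rather
than supported.

```python
import io
import struct
import zipfile
import zlib

# Fixed 30-byte portion of a zip local file header (little-endian).
LOCAL_HEADER = struct.Struct("<IHHHHHIIIHH")
LOCAL_HEADER_SIG = 0x04034B50
FLAG_DATA_DESCRIPTOR = 0x0008  # sizes are written after the entry data
FLAG_UTF8_NAME = 0x0800        # entry name is UTF-8 instead of CP437


def iter_zip_entries(stream):
    """Yield (name, data) per file entry, reading the stream strictly forward."""
    while True:
        raw = stream.read(LOCAL_HEADER.size)
        if len(raw) < LOCAL_HEADER.size:
            return  # end of stream
        (sig, _version, flags, method, _mtime, _mdate, _crc,
         comp_size, _size, name_len, extra_len) = LOCAL_HEADER.unpack(raw)
        if sig != LOCAL_HEADER_SIG:
            return  # central directory reached: no more entries
        if flags & FLAG_DATA_DESCRIPTOR:
            raise NotImplementedError("data-descriptor entries not handled")
        encoding = "utf-8" if flags & FLAG_UTF8_NAME else "cp437"
        name = stream.read(name_len).decode(encoding)
        stream.read(extra_len)  # skip the extra field
        payload = stream.read(comp_size)
        if method == 0:      # stored
            data = payload
        elif method == 8:    # deflate: raw stream, no zlib header
            data = zlib.decompress(payload, -15)
        else:
            raise NotImplementedError("compression method %d" % method)
        if not name.endswith("/"):  # skip directory entries
            yield name, data


def build_demo_zip(files):
    """Build a small deflated zip in memory for trying out the reader."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files:
            zf.writestr(name, data)
    return buf.getvalue()
```

Because each entry carries its own compression method and is inflated on its
own, nothing wraps the stream as a whole, which is the point made above about
not applying a Hadoop codec layer.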
This continues the archive-reader series: SPARK-57135 (CSV read), SPARK-57321
(CSV inference), SPARK-57419 (JSON), SPARK-57478 (text), SPARK-57479 (XML),
SPARK-57481 (Avro).
With the flag at its default (false), behavior is unchanged.
> [SQL] Zip support for Archive Reader
> ------------------------------------
>
> Key: SPARK-57705
> URL: https://issues.apache.org/jira/browse/SPARK-57705
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.3.0
> Reporter: Akshat Shenoi
> Assignee: Akshat Shenoi
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.3.0
>