[PR] [SPARK-57135][SQL] Add ArchiveFormat for reading .tar/.tar.gz/.tgz archives as files [spark]

via GitHub Thu, 28 May 2026 14:30:06 -0700


akshatshenoi-eng opened a new pull request, #56193:
URL: https://github.com/apache/spark/pull/56193


   ### What changes were proposed in this pull request?
   
   Adds `ArchiveFormat`, a format-agnostic archive layer that lets any V1 
`FileFormat` transparently read tar archives (`.tar`, `.tar.gz`, `.tgz`) 
without per-format changes.
   
   **New file `ArchiveFormat.scala`:**
   - `readArchive` materializes one entry at a time to a local temp file and 
invokes the caller-supplied `readFn` against a synthetic `PartitionedFile` 
pointing at it. Only one entry's bytes live on disk at a time per task; the 
temp dir is cleaned up on iterator close and on task completion.
   - `expandArchives` does the same materialization on the driver for 
`inferSchema`, substituting entry `FileStatus`es for the archive's.
   - `isArchivePath` matches `.tar`, `.tar.gz`, and `.tgz`. `.tar.gz` is 
auto-decompressed by Hadoop's `CompressionCodecFactory` via `CodecStreams`; 
`.tgz` is not a registered Hadoop codec extension, so the gzip layer is 
unwrapped explicitly with `GZIPInputStream`.
   - Entries whose basename starts with `.` are skipped (covers macOS 
AppleDouble sidecars, `.DS_Store`, etc.).
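
   The bullets above can be illustrated with a minimal Python sketch. This is 
not the PR's Scala code: the function names (`is_archive_path`, 
`is_hidden_entry`, `open_tar_bytes`, `read_archive`) are hypothetical, the 
suffix match is assumed to be case-sensitive, and both `.tar.gz` and `.tgz` 
are unwrapped with `gzip.open` here, whereas the PR relies on Hadoop's codec 
factory for `.tar.gz`. It only shows the general shape: one entry on disk at a 
time, dot-files skipped, cleanup when the iterator finishes or is closed.

```python
import gzip
import os
import posixpath
import shutil
import tarfile
import tempfile
from typing import Callable, Iterable, Iterator

ARCHIVE_SUFFIXES = (".tar", ".tar.gz", ".tgz")
GZIPPED_SUFFIXES = (".tar.gz", ".tgz")


def is_archive_path(path: str) -> bool:
    """True for the three suffixes named in the PR description (assumed case-sensitive)."""
    return path.endswith(ARCHIVE_SUFFIXES)


def is_hidden_entry(entry_name: str) -> bool:
    """True when the entry's basename starts with '.', e.g. '.DS_Store' or '._a.csv'."""
    return posixpath.basename(entry_name).startswith(".")


def open_tar_bytes(path: str):
    """Open the archive as a plain tar byte stream, unwrapping gzip explicitly if needed."""
    if path.endswith(GZIPPED_SUFFIXES):
        return gzip.open(path, "rb")
    return open(path, "rb")


def read_archive(archive_path: str,
                 read_fn: Callable[[str], Iterable[str]]) -> Iterator[str]:
    """Lazily yield rows from each entry, materializing one entry at a time."""
    tmp_dir = tempfile.mkdtemp(prefix="archive-entry-")
    try:
        # mode="r|" reads the tar strictly front to back, like a stream.
        with open_tar_bytes(archive_path) as raw, \
                tarfile.open(fileobj=raw, mode="r|") as tar:
            for member in tar:
                if not member.isfile() or is_hidden_entry(member.name):
                    continue
                # Keep the basename so a format reader can still see the extension.
                local_path = os.path.join(tmp_dir, posixpath.basename(member.name))
                with tar.extractfile(member) as src, open(local_path, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                try:
                    yield from read_fn(local_path)
                finally:
                    # Remove the entry before the next one is written.
                    os.remove(local_path)
    finally:
        # Runs on exhaustion and on generator close(), mirroring "iterator close".
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

   Because `read_archive` is a generator, nothing is extracted until the 
caller asks for the first row, and closing the generator early still runs both 
`finally` blocks.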
   
   Materializing to disk (rather than streaming) means formats that need random 
access (Parquet/ORC footers) work without modification, at the cost of one 
entry's worth of disk per task.
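
   To make the random-access point concrete: a Parquet file ends with a 4-byte 
little-endian footer length followed by the magic bytes `PAR1`, so a reader 
must seek to the end of the file before it can locate the metadata. The Python 
sketch below (with a hypothetical helper name, and not taken from the PR) 
reads that trailer; it needs a seekable local file, which a forward-only tar 
or gzip stream does not provide.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"
TRAILER_SIZE = 8  # 4-byte footer length + 4-byte magic


def read_parquet_footer_length(path: str) -> int:
    """Return the footer length stored in the last 8 bytes of a Parquet file."""
    if os.path.getsize(path) < TRAILER_SIZE:
        raise ValueError("file too small to be a Parquet file")
    with open(path, "rb") as f:
        # Jump straight to the trailer; this is the step that needs random access.
        f.seek(-TRAILER_SIZE, os.SEEK_END)
        trailer = f.read(TRAILER_SIZE)
    if trailer[4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic at end of file")
    (footer_length,) = struct.unpack("<I", trailer[:4])
    return footer_length
```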
   
   **Pipeline hooks:**
   - `PartitionedFileUtil.splitFiles`: archive paths are forced to a single 
split so the archive layer can stream entries sequentially.
   - `FileScanRDD.readCurrentFile`: archive paths are routed through 
`ArchiveFormat.readArchive` instead of calling the format reader directly.
   - `DataSource.resolve`: both `inferSchema` call sites expand archives before 
delegating to the format.
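
   The single-split rule in the first hook can be sketched as follows. This is 
a simplified Python illustration, assuming a hypothetical `split_file` helper 
that returns `(offset, length)` byte ranges; the real 
`PartitionedFileUtil.splitFiles` operates on Spark's own file and partition 
types and is not reproduced here.

```python
from typing import List, Tuple

ARCHIVE_SUFFIXES = (".tar", ".tar.gz", ".tgz")


def is_archive_path(path: str) -> bool:
    """True for the archive suffixes named in the PR description."""
    return path.endswith(ARCHIVE_SUFFIXES)


def split_file(path: str, length: int, max_split_bytes: int) -> List[Tuple[int, int]]:
    """Return (offset, length) ranges; archives always get one range covering the file."""
    if is_archive_path(path):
        # Tar entries can only be found by walking headers from byte 0,
        # so a split starting mid-archive would be unreadable.
        return [(0, length)]
    return [
        (offset, min(max_split_bytes, length - offset))
        for offset in range(0, length, max_split_bytes)
    ]
```

   For a 250-byte file and a 100-byte split size, a plain CSV would yield 
three ranges, while the same bytes under a `.tar` name would yield one.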
   
   The feature is gated behind `spark.sql.files.archive.enabled` (default 
`false`).
   
   ### Why are the changes needed?
   
   A common ingestion pattern stores many small files inside tar archives to 
reduce namespace pressure (fewer files in object stores or HDFS). Today there 
is no way to read these without first unpacking them externally. This change 
lets users point any V1 datasource reader directly at a tar archive and have 
the entries read transparently.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. When `spark.sql.files.archive.enabled=true`, `.tar`, `.tar.gz`, and 
`.tgz` paths are transparently expanded during both schema inference and scan. 
The config defaults to `false`, so existing behaviour is unchanged.
   
   ### How was this patch tested?
   
   Two new test suites:
   - `ArchiveFormatSuite`: unit tests for `isArchivePath`, `readArchive` 
(empty, single, multi, gzip, lazy materialization, sidecar skipping, temp-file 
lifecycle, partition value propagation, TaskContext cleanup), and 
`expandArchives`.
   - `ArchiveReadSuite`: end-to-end reads through CSV, JSON, Parquet, and ORC 
readers against `.tar`, `.tar.gz`, and `.tgz` inputs, plus schema inference 
parity, splittability, and mixed-archive/non-archive partition layout tests.
   
   ### Was this patch authored or co-authored by anyone who is not a member of 
the [Spark community](https://spark.apache.org/committers.html)?
   
   No


[PR] [SPARK-57135][SQL] Add ArchiveFormat for reading .tar/.tar.gz/.tgz archives as files [spark]
