akshatshenoi-eng opened a new pull request, #56254:
URL: https://github.com/apache/spark/pull/56254

   ### What changes were proposed in this pull request?
   
   > **Stacked on #56193** (CSV tar archive read support). Please review/merge that PR first;
   > until it merges, this PR's diff will also include its commits. The inference-specific
   > changes are in the top commit.
   
   Adds CSV **schema inference** for tar archives (`.tar`/`.tar.gz`/`.tgz`), building on the
   archive read support in #56193. When `spark.sql.files.archive.enabled` is set to `true` and
   an input path is a tar archive, `CSVDataSource.inferSchema` splits the input paths into
   archives and non-archives, infers a schema for each archive by streaming its entries
   (entries are tokenized like standalone CSV files and are never unpacked to disk), and
   merges the result with the schema inferred from any non-archive files. The enablement flag
   is read from `FileSourceOptions.archiveFormatEnabled` (added in the base PR). The config
   doc is updated to note that archives are supported during both scan and schema inference.
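The flow described above (split inputs, stream archive entries, merge schemas) can be illustrated with a minimal standalone sketch. It is not the Spark implementation: it uses only the Python standard library, every function name in it (`infer_value`, `infer_rows`, `merge`, `infer_archive`, `infer_schema`) is hypothetical, and the three-type widening order `int < float < str` is an assumption made for brevity rather than Spark's actual type lattice.

```python
import csv
import io
import tarfile

ARCHIVE_SUFFIXES = (".tar", ".tar.gz", ".tgz")

# Widening order assumed for this sketch only: int < float < str.
_RANK = {"int": 0, "float": 1, "str": 2}


def infer_value(text):
    """Return the narrowest type name that can hold one CSV field."""
    for name, parse in (("int", int), ("float", float)):
        try:
            parse(text)
            return name
        except ValueError:
            pass
    return "str"


def infer_rows(lines):
    """Infer {column: type} from an iterable of text lines with a header row."""
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return {}
    schema = {}
    for row in reader:
        for column, field in zip(header, row):
            seen = infer_value(field)
            schema[column] = max(schema.get(column, seen), seen, key=_RANK.get)
    # Columns that never received a value fall back to the widest type.
    for column in header:
        schema.setdefault(column, "str")
    return schema


def merge(left, right):
    """Merge two schemas, widening any column that appears in both."""
    merged = dict(left)
    for column, kind in right.items():
        merged[column] = max(merged.get(column, kind), kind, key=_RANK.get)
    return merged


def infer_archive(path):
    """Read each regular-file entry through a file object; nothing is written to disk."""
    schema = {}
    with tarfile.open(path, mode="r:*") as archive:
        for entry in archive:
            if not entry.isfile():
                continue
            stream = archive.extractfile(entry)
            lines = io.TextIOWrapper(stream, encoding="utf-8", newline="")
            schema = merge(schema, infer_rows(lines))
    return schema


def infer_schema(paths):
    """Split paths into archives and plain files, infer each group, merge the results."""
    archives = [p for p in paths if p.endswith(ARCHIVE_SUFFIXES)]
    plain = [p for p in paths if not p.endswith(ARCHIVE_SUFFIXES)]
    schema = {}
    for path in archives:
        schema = merge(schema, infer_archive(path))
    for path in plain:
        with open(path, encoding="utf-8", newline="") as handle:
            schema = merge(schema, infer_rows(handle))
    return schema
```

Because `merge` only ever widens a column, the order in which archives, entries, and plain files are visited does not change the result in this sketch.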
   
   ### Why are the changes needed?
   
   The archive feature was split into two PRs to keep each reviewable: #56193 adds reading,
   and this PR adds schema inference so that `inferSchema=true` works for archives the same
   way it does for a directory of CSV files.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. The capability is behind the `spark.sql.files.archive.enabled` config (default `false`,
   introduced in #56193); this PR only extends that opt-in feature to schema inference.
   
   ### How was this patch tested?
   
   Added inference parity tests to `CSVArchiveReadBase` (run by `CSVTarArchiveReadSuite`):
   - an archive infers the same schema as a directory of the same files;
   - all archive formats (`.tar`/`.tar.gz`/`.tgz`) infer the same schema.
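The two parity properties listed above can be sketched outside Spark as follows. This is an illustration only, not the test code in `CSVArchiveReadBase`: the helper names (`column_types`, `schema_of_directory`, `schema_of_archive`, `check_parity`) are hypothetical, and the inference rule is deliberately reduced to "`int` unless some value is not an integer".

```python
import csv
import io
import os
import tarfile
import tempfile


def column_types(lines, schema):
    """Fold one CSV text stream into schema: 'int' unless a value is not an integer."""
    reader = csv.reader(lines)
    header = next(reader, [])
    for row in reader:
        for column, field in zip(header, row):
            if field.lstrip("-").isdigit():
                schema.setdefault(column, "int")
            else:
                schema[column] = "str"
    return schema


def schema_of_directory(directory):
    """Infer a schema across every file in a directory."""
    schema = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8", newline="") as handle:
            column_types(handle, schema)
    return schema


def schema_of_archive(path):
    """Infer a schema across every regular-file entry of a tar archive."""
    schema = {}
    with tarfile.open(path, mode="r:*") as archive:
        for entry in archive:
            if entry.isfile():
                raw = archive.extractfile(entry)
                column_types(io.TextIOWrapper(raw, encoding="utf-8", newline=""), schema)
    return schema


def check_parity(files):
    """Assert each archive format infers the directory's schema, then return it."""
    with tempfile.TemporaryDirectory() as root:
        source = os.path.join(root, "csv")
        os.mkdir(source)
        for name, text in files.items():
            with open(os.path.join(source, name), "w", encoding="utf-8", newline="") as handle:
                handle.write(text)
        expected = schema_of_directory(source)
        for suffix, mode in ((".tar", "w"), (".tar.gz", "w:gz"), (".tgz", "w:gz")):
            path = os.path.join(root, "data" + suffix)
            with tarfile.open(path, mode) as archive:
                for name in sorted(files):
                    archive.add(os.path.join(source, name), arcname=name)
            assert schema_of_archive(path) == expected, suffix
        return expected
```

Calling `check_parity` with a small mapping of file names to CSV text exercises both bullets at once: the directory schema is the reference, and each of the three archive formats is compared against it.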
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes, authored with assistance from generative AI tooling.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

