akshatshenoi-eng opened a new pull request, #56254: URL: https://github.com/apache/spark/pull/56254
### What changes were proposed in this pull request?

> **Stacked on #56193** (CSV tar archive read support). Please review/merge that PR first;
> until it merges, this PR's diff will also include its commits. The inference-specific
> changes are in the top commit.

Adds CSV **schema inference** for tar archives (`.tar`/`.tar.gz`/`.tgz`), building on the archive read support in #56193.

When `spark.sql.files.archive.enabled` is set and an input path is a tar archive, `CSVDataSource.inferSchema` partitions inputs into archives vs non-archives, infers each archive by streaming its entries (entries are tokenized like standalone CSV files and never unpacked to disk), and merges the result with any non-archive files' inferred schema.

The enablement flag is read from `FileSourceOptions.archiveFormatEnabled` (added in the base PR). The config doc is updated to note archives are supported during both scan and schema inference.

### Why are the changes needed?

The archive feature was split into two PRs to keep each reviewable: #56193 adds reading, this PR adds schema inference so that `inferSchema`/`inferSchema=true` works for archives the same way it does for a directory of CSV files.

### Does this PR introduce _any_ user-facing change?

No. The capability is behind the `spark.sql.files.archive.enabled` config (default `false`, introduced in #56193); this PR only extends that opt-in feature to schema inference.

### How was this patch tested?

Added inference parity tests to `CSVArchiveReadBase` (run by `CSVTarArchiveReadSuite`):

- an archive infers the same schema as a directory of the same files;
- all archive formats (`.tar`/`.tar.gz`/`.tgz`) infer the same schema.

### Was this patch authored or co-authored using generative AI tooling?

Yes, authored with assistance from generative AI tooling.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
