Re: [PR] [SPARK-57321][SQL] Infer CSV schema from tar archives [spark]

via GitHub Wed, 10 Jun 2026 11:38:07 -0700


cloud-fan commented on code in PR #56254:
URL: https://github.com/apache/spark/pull/56254#discussion_r3390679821



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala:
##########
@@ -130,6 +136,59 @@ abstract class CSVDataSource extends Serializable {
       parseEntry(parser, headerChecker, in)
     }
   }
+
+  /**
+   * Infers a CSV schema when at least one input is a tar archive. Every archive entry is streamed
+   * (never unpacked to disk) and every loose file is tokenized, and all of them feed a single
+   * [[CSVInferSchema]] pass keyed on the first input's header -- the same one-pass model a
+   * directory read uses, so the inferred schema matches reading the entries and files loose
+   * (column count fixed by the first header, `NullType` columns surviving to the final
+   * `toStructFields`). A corrupt/missing input is skipped as a unit (a whole archive or a whole
+   * file) when `ignoreCorruptFiles`/`ignoreMissingFiles` are set.
+   */
+  private def inferWithArchives(
+      sparkSession: SparkSession,
+      inputPaths: Seq[FileStatus],
+      parsedOptions: CSVOptions): StructType = {
+    val baseRdd = CSVDataSource.createBaseRdd(sparkSession, inputPaths, parsedOptions)
+    def tokens(dropHeader: Boolean): RDD[Array[String]] = baseRdd.flatMap { stream =>

Review Comment:
   In default mode (`multiLine=false`) this infers from records the scan will 
never produce. `tokenizeStream` parses each input as one continuous stream, so 
a quoted field with an embedded newline is one record here — but the V1 scan 
parses archive entries line-by-line (`TextInputCSVDataSource.readArchive` -> 
`entryLines`), which splits that field across rows. The inferred schema 
mis-describes what the read actually returns.
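   
   To make the mismatch concrete, here is a minimal sketch in plain Scala. It is not code from this PR: `streamRecords` and `lineRecords` are hypothetical stand-ins for the stream-tokenized and line-based models, and the quote handling is deliberately naive.
   
   ```scala
   object QuotedNewlineDemo {
     // Hypothetical stand-in for stream tokenization:
     // a newline ends a record only when it is outside quotes.
     def streamRecords(input: String): Seq[String] = {
       val records = scala.collection.mutable.ArrayBuffer.empty[String]
       val current = new StringBuilder
       var inQuotes = false
       input.foreach { c =>
         if (c == '"') inQuotes = !inQuotes
         if (c == '\n' && !inQuotes) {
           records += current.toString
           current.clear()
         } else {
           current.append(c)
         }
       }
       if (current.nonEmpty) records += current.toString
       records.toSeq
     }
   
     // Hypothetical stand-in for the line-based model:
     // every newline ends a record, quoted or not.
     def lineRecords(input: String): Seq[String] = input.split("\n").toSeq
   
     // One header row, one row with a quoted embedded newline, one plain row.
     val data: String = "id,note\n1,\"first\nsecond\"\n2,plain"
   }
   ```
   
   On `data`, the stream model sees the quoted field as part of a single record, while the line model cuts it into two rows, so the two models do not even agree on how many rows exist.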
   
   Two more default-mode divergences from the directory parity the Scaladoc 
claims:
   - Loose files beside an archive switch inference model: alone they infer 
line-based (`inferFromDataset`), here they're stream-tokenized.
   - Header dropping changes semantics: per-input first-row drop here vs. 
`CSVUtils.filterHeaderLine`'s drop-lines-equal-to-the-first-header in the line 
model, so mismatched-header inputs infer differently than the same files in a 
plain directory.
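   
   The header-dropping difference can be sketched the same way. This is an illustration only, assuming each input is already split into lines; the two helper names are hypothetical and neither is a Spark API.
   
   ```scala
   object HeaderDropDemo {
     // Two inputs whose headers do not match.
     val fileA: Seq[String] = Seq("id,name", "1,a")
     val fileB: Seq[String] = Seq("key,label", "2,b")
   
     // Per-input model: drop the first row of every input.
     def dropFirstRowPerInput(inputs: Seq[Seq[String]]): Seq[String] =
       inputs.flatMap(_.drop(1))
   
     // Line model: take the first header line overall and drop
     // every line that is equal to it.
     def dropLinesEqualToFirstHeader(inputs: Seq[Seq[String]]): Seq[String] = {
       val all = inputs.flatten
       all.headOption match {
         case Some(header) => all.filter(_ != header)
         case None => all
       }
     }
   }
   ```
   
   With `Seq(fileA, fileB)`, the per-input model never sees `key,label`, while the line model keeps it as a data row, which is enough to change the inferred column types.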
   
   Also note `asParserSettings` is mode-tuned — 
`setLineSeparatorDetectionEnabled(lineSeparatorInRead.isEmpty && multiLine)` — 
so stream-parsing with non-multiLine settings is a combination nothing else 
exercises.
   
   I'd make per-input tokenization mode-specific, the same way `readArchive` 
already is: keep `inferWithArchives` as the shared template (first-header 
keying, sampling, one `CSVInferSchema` pass) and delegate input -> token 
iterator to the mode object — `MultiLineCSVDataSource` keeps `tokenizeStream`; 
`TextInputCSVDataSource` tokenizes line-by-line 
(`entryLines`/`ArchiveReader.lineIterator` + `filterCommentAndEmpty` + 
`parseLine`), matching its scan. Then the Scaladoc's parity claim holds, and 
default-mode tests (a quoted-newline entry; mismatched headers) can pin it. 
(The multiLine side is an exact match of `MultiLineCSVDataSource.infer` — that 
part checks out.)
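   
   A rough sketch of the shape I have in mind, in plain Scala with no Spark types. Every name below is hypothetical, header keying and sampling are left out, and the shared "pass" only counts records so the example stays self-contained.
   
   ```scala
   object ModeSpecificTokenization {
     // Hypothetical shared template: everything except "one input -> records".
     trait ArchiveInferenceTemplate {
       // The only mode-specific step.
       protected def records(rawInput: String): Seq[String]
   
       // Shared step: feed every input's records into a single pass.
       final def recordsSeen(inputs: Seq[String]): Int =
         inputs.map(records(_).size).sum
     }
   
     // multiLine-style: a newline ends a record only outside quotes.
     object StreamMode extends ArchiveInferenceTemplate {
       protected def records(rawInput: String): Seq[String] = {
         val out = scala.collection.mutable.ArrayBuffer.empty[String]
         val current = new StringBuilder
         var inQuotes = false
         rawInput.foreach { c =>
           if (c == '"') inQuotes = !inQuotes
           if (c == '\n' && !inQuotes) {
             out += current.toString
             current.clear()
           } else {
             current.append(c)
           }
         }
         if (current.nonEmpty) out += current.toString
         out.toSeq
       }
     }
   
     // Default-style: every newline ends a record, as in a line-based scan.
     object LineMode extends ArchiveInferenceTemplate {
       protected def records(rawInput: String): Seq[String] =
         rawInput.split("\n").toSeq.filter(_.nonEmpty)
     }
   }
   ```
   
   The template stays identical for both modes; only `records` changes, so each mode's inference sees the same rows its own scan would produce.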



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala:
##########
@@ -141,6 +200,33 @@ object CSVDataSource extends Logging {
     }
   }
 
+  /**
+   * One `PortableDataStream` per input file (the whole file, never split), shared by the multiLine
+   * scan-inference path and the archive inference path.

Review Comment:
   `createBaseRdd` serves only schema inference (`MultiLineCSVDataSource.infer` 
and `inferWithArchives`) — no scan path uses it.
   ```suggestion
      * schema-inference path and the archive inference path.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

