jubins opened a new pull request, #56581:
URL: https://github.com/apache/spark/pull/56581

   ### What is the purpose of the change?
   
   Fixes [SPARK-57515](https://issues.apache.org/jira/browse/SPARK-57515).
   When reading a CSV file with `header=true` whose header line has more columns than `maxColumns`
   (default 20480, user-configurable via `.option("maxColumns", N)`), Spark crashes with an internal
   `java.lang.ArrayIndexOutOfBoundsException` instead of raising a clean `MALFORMED_CSV_RECORD` error.
   
   [SPARK-57195](https://issues.apache.org/jira/browse/SPARK-57195) (merged 2026-06-14) fixed the
   same `ArrayIndexOutOfBoundsException` for data rows and explicitly called out the remaining gap:
   _"Header rows are out of scope from this PR. A header over `maxColumns` still surfaces the raw
   AIOOBE (`CSVHeaderChecker`), a pre-existing gap."_ This PR closes that gap.
   
   The bug affects all three CSV read paths handled by `CSVHeaderChecker`:
   - **Non-multiLine file read**: `tokenizer.parseLine(header)` was called directly, bypassing the
     AIOOBE guard that `UnivocityParser.parseLine` provides.
   - **MultiLine file read**: `tokenizer.parseNext()` was unguarded during header consumption.
   - **`Dataset[String]` `csv()`**: a fresh `CsvParser` was created and `parser.parseLine(line)`
     was called directly.
   
   ### Brief change log
   
   - `CSVHeaderChecker.checkHeaderColumnNames(line: String)`: replaced `parser.parseLine(line)`
     with `UnivocityParser.parseLine(parser, line)` to reuse the existing safe wrapper from
     SPARK-57195.
   - `CSVHeaderChecker.checkHeaderColumnNames(tokenizer)`: wrapped `tokenizer.parseNext()` in a
     try/catch that translates `ArrayIndexOutOfBoundsException` (bare or wrapped in
     `TextParsingException`) into `MALFORMED_CSV_RECORD`.
   - `CSVHeaderChecker.checkHeaderColumnNames(lines, tokenizer)`: wrapped
     `tokenizer.parseLine(header)` in the same try/catch.
   - Added private helper `malformedCsvHeaderRecord` (mirrors `UnivocityParser.malformedCsvRecord`)
     with the same bounded-record truncation to `MAX_ERROR_CONTENT_LENGTH`.
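The translation described in the change log can be illustrated with a small, dependency-free sketch. It is not the code from this PR: `TextParsingLikeException`, `MalformedCsvRecordException`, `HeaderGuardSketch`, the toy `tokenize`, and the limit of 1000 characters are all hypothetical stand-ins for univocity's `TextParsingException`, Spark's `MALFORMED_CSV_RECORD` error, and `MAX_ERROR_CONTENT_LENGTH`, whose real definitions are not shown here.

```scala
// Hypothetical stand-in for a parser exception that may wrap the overflow.
final class TextParsingLikeException(message: String, cause: Throwable)
  extends RuntimeException(message, cause)

// Hypothetical stand-in for a user-facing MALFORMED_CSV_RECORD error.
final class MalformedCsvRecordException(val badRecord: String, cause: Throwable)
  extends RuntimeException(s"[MALFORMED_CSV_RECORD] $badRecord", cause)

object HeaderGuardSketch {
  // Assumed cap on how much of the offending record is echoed back.
  val MaxErrorContentLength: Int = 1000

  // Toy tokenizer: copies fields into a fixed-size buffer, so a line with
  // more than maxColumns fields overflows it, like the raw AIOOBE in the report.
  def tokenize(line: String, maxColumns: Int): Array[String] = {
    val buffer = new Array[String](maxColumns)
    val parts = line.split(",", -1)
    var i = 0
    while (i < parts.length) {
      buffer(i) = parts(i) // throws ArrayIndexOutOfBoundsException on overflow
      i += 1
    }
    buffer.take(parts.length)
  }

  // True for a bare overflow or one wrapped in the parser's own exception.
  private def isColumnOverflow(e: Throwable): Boolean = e match {
    case _: ArrayIndexOutOfBoundsException => true
    case w: TextParsingLikeException =>
      w.getCause.isInstanceOf[ArrayIndexOutOfBoundsException]
    case _ => false
  }

  // Runs `parse` and converts an overflow into the clean error, truncating
  // the echoed record so the message stays bounded.
  def guarded(record: String)(parse: String => Array[String]): Array[String] =
    try parse(record)
    catch {
      case e: RuntimeException if isColumnOverflow(e) =>
        throw new MalformedCsvRecordException(record.take(MaxErrorContentLength), e)
    }
}
```

Under these assumptions, `HeaderGuardSketch.guarded("a,b,c")(HeaderGuardSketch.tokenize(_, 2))` raises `MalformedCsvRecordException` with the original exception kept as its cause, while any other failure propagates unchanged.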
   
   ### Verifying this change
   
   Three new tests were added to `CSVSuite`, one per affected path:
   
   - **SPARK-57515: non-multiLine CSV read with header exceeding maxColumns surfaces
     MALFORMED_CSV_RECORD**: writes a 3-column CSV, reads it with `maxColumns=2`, and asserts
     `MALFORMED_CSV_RECORD` instead of AIOOBE.
   - **SPARK-57515: multiLine CSV read with header exceeding maxColumns surfaces
     MALFORMED_CSV_RECORD**: the same scenario with `multiLine=true`.
   - **SPARK-57515: Dataset[String] CSV read with header exceeding maxColumns surfaces
     MALFORMED_CSV_RECORD**: uses the `spark.createDataset` path and asserts that the header line
     appears in the error message.
   
   ### Does this PR potentially affect one of the following areas?
   
   - Dependencies: no
   - Public API: no — `CSVHeaderChecker` is internal
   - Serializers: no
   - Runtime per-record code paths (performance): no; only the header-parsing path is touched,
     and it runs once per file
   - Deployment or recovery: no
   - S3 connector: no
   
   ### Documentation
   
   This PR does not introduce a new feature. No documentation changes needed.
   
   ### Was generative AI tooling used to co-author this PR?
   
   - [x] Yes. Claude Code was used as a pair-programming assistant. All code was written,
     understood, and verified by the author.
   
   Generated-by: Claude Opus 4.8


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

