jubins opened a new pull request, #56581: URL: https://github.com/apache/spark/pull/56581
### What is the purpose of the change?

Fixes [SPARK-57515](https://issues.apache.org/jira/browse/SPARK-57515).

When a CSV file is read with `header=true` and the header line has more columns than `maxColumns` (default 20480, user-configurable via `.option("maxColumns", N)`), Spark crashes with an internal `java.lang.ArrayIndexOutOfBoundsException` instead of a clean `MALFORMED_CSV_RECORD` error.

[SPARK-57195](https://issues.apache.org/jira/browse/SPARK-57195) (merged 2026-06-14) fixed the same `ArrayIndexOutOfBoundsException` for data rows and explicitly called out the remaining gap: _"Header rows are out of scope from this PR. A header over `maxColumns` still surfaces the raw AIOOBE (`CSVHeaderChecker`), a pre-existing gap."_ This PR closes that gap.

The bug affects all three CSV read paths handled by `CSVHeaderChecker`:

- **Non-multiLine file read**: `tokenizer.parseLine(header)` was called directly, bypassing the AIOOBE guard provided by `UnivocityParser.parseLine`.
- **MultiLine file read**: `tokenizer.parseNext()` during header consumption was unguarded.
- **`Dataset[String]` `csv()`**: a fresh `CsvParser` was created and `parser.parseLine(line)` was called directly.

### Brief change log

- `CSVHeaderChecker.checkHeaderColumnNames(line: String)`: replaced `parser.parseLine(line)` with `UnivocityParser.parseLine(parser, line)` to reuse the existing safe wrapper from SPARK-57195.
- `CSVHeaderChecker.checkHeaderColumnNames(tokenizer)`: wrapped `tokenizer.parseNext()` in a try/catch that translates `ArrayIndexOutOfBoundsException` (bare or wrapped in `TextParsingException`) into `MALFORMED_CSV_RECORD`.
- `CSVHeaderChecker.checkHeaderColumnNames(lines, tokenizer)`: wrapped `tokenizer.parseLine(header)` in the same try/catch.
- Added private helper `malformedCsvHeaderRecord` (mirrors `UnivocityParser.malformedCsvRecord`) with the same bounded-record truncation to `MAX_ERROR_CONTENT_LENGTH`.
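For reviewers who want to see the try/catch translation from the change log in isolation, here is a minimal, self-contained sketch. It does not use Spark or univocity and is not the code in this PR. `MalformedCsvRecord`, `WrappedParsingException`, `toyParseLine`, `guarded`, and the value of `MaxErrorContentLength` are all hypothetical stand-ins, chosen only to show the shape of the guard: catch a bare or wrapped `ArrayIndexOutOfBoundsException` and rethrow a bounded, user-facing error.

```scala
object HeaderGuardSketch {
  // Hypothetical stand-in for Spark's MALFORMED_CSV_RECORD error condition.
  final case class MalformedCsvRecord(badRecord: String)
    extends RuntimeException(s"[MALFORMED_CSV_RECORD] Malformed CSV record: $badRecord")

  // Hypothetical stand-in for a parser exception that wraps the real cause,
  // playing the role TextParsingException plays in the PR description.
  final class WrappedParsingException(cause: Throwable)
    extends RuntimeException("error parsing input", cause)

  // Assumed cap for this sketch only; the PR text does not state the real
  // value of MAX_ERROR_CONTENT_LENGTH.
  val MaxErrorContentLength = 32

  // Toy tokenizer: copies fields into a fixed-size buffer of `maxColumns`
  // slots, so a line with too many fields overflows the buffer and throws
  // ArrayIndexOutOfBoundsException, similar in spirit to the reported bug.
  def toyParseLine(line: String, maxColumns: Int): Array[String] = {
    val buffer = new Array[String](maxColumns)
    val fields = line.split(",", -1)
    var i = 0
    while (i < fields.length) {
      buffer(i) = fields(i) // throws once i reaches maxColumns
      i += 1
    }
    buffer.take(fields.length)
  }

  // Bound the record content that ends up in the error message.
  private def truncate(s: String): String =
    if (s.length <= MaxErrorContentLength) s
    else s.take(MaxErrorContentLength) + "..."

  // Runs `parse` and translates an AIOOBE (bare, or as the direct cause of
  // another RuntimeException) into MalformedCsvRecord. Anything else propagates.
  def guarded[T](record: String)(parse: => T): T =
    try parse
    catch {
      case _: ArrayIndexOutOfBoundsException =>
        throw MalformedCsvRecord(truncate(record))
      case e: RuntimeException
          if e.getCause.isInstanceOf[ArrayIndexOutOfBoundsException] =>
        throw MalformedCsvRecord(truncate(record))
    }
}
```

With this sketch, `HeaderGuardSketch.guarded("a,b,c")(HeaderGuardSketch.toyParseLine("a,b,c", 2))` throws `MalformedCsvRecord` carrying the header text rather than a raw `ArrayIndexOutOfBoundsException`, while a header that fits within the limit is returned unchanged.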
### Verifying this change

Three new tests added to `CSVSuite`, one per affected path:

- **SPARK-57515: non-multiLine CSV read with header exceeding maxColumns surfaces MALFORMED_CSV_RECORD**: writes a 3-column CSV with `maxColumns=2`, asserts `MALFORMED_CSV_RECORD` instead of AIOOBE.
- **SPARK-57515: multiLine CSV read with header exceeding maxColumns surfaces MALFORMED_CSV_RECORD**: same with `multiLine=true`.
- **SPARK-57515: Dataset[String] CSV read with header exceeding maxColumns surfaces MALFORMED_CSV_RECORD**: uses the `spark.createDataset` path, asserts the header line appears in the error message.

### Does this PR potentially affect one of the following areas?

- Dependencies: no
- Public API: no (`CSVHeaderChecker` is internal)
- Serializers: no
- Runtime per-record code paths (performance): no (only the header-parsing path, which runs once per file)
- Deployment or recovery: no
- S3 connector: no

### Documentation

This PR does not introduce a new feature. No documentation changes needed.

### Was generative AI tooling used to co-author this PR?

- [x] Yes. Claude Code was used as a pair-programming assistant. All code was written, understood, and verified by the author.

Generated-by: Claude Opus 4.8
