[PR] [SPARK-57195][SQL] Surface MALFORMED_CSV_RECORD instead of ArrayIndexOutOfBoundsException in CSV schema inference [spark]

via GitHub Mon, 01 Jun 2026 16:24:15 -0700


yashtc opened a new pull request, #56260:
URL: https://github.com/apache/spark/pull/56260


   ### What changes were proposed in this pull request?
   
   CSV schema inference can fail with an uncaught 
`java.lang.ArrayIndexOutOfBoundsException` when a row has more columns than 
`maxColumns`. SPARK-49444 (https://issues.apache.org/jira/browse/SPARK-49444) 
added handling for the per-line `UnivocityParser.parseLine` path, but the 
schema-inference paths that tokenize with a raw Univocity `CsvParser` were 
never covered, so they still surface the internal exception.
   
   This PR translates that exception into a `MALFORMED_CSV_RECORD` error across 
the remaining paths:
   
   - Adds a shared `UnivocityParser.parseLine(tokenizer, line)` helper that 
converts Univocity's `ArrayIndexOutOfBoundsException` (raised bare or wrapped 
in a `TextParsingException`) into `MALFORMED_CSV_RECORD`.
   - Guards the streaming tokenizer `UnivocityParser.convertStream`, used by 
`multiLine` reads and `multiLine` schema inference.
   - Routes the non-`multiLine` inference path 
(`TextInputCSVDataSource.inferFromDataset`) and the single-variant-column 
header read through the same helper.
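
The translation described in the first bullet can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR. `MalformedCsvRecordException`, `WrappingParseException`, and `toyTokenizer` are hypothetical stand-ins for Spark's `MALFORMED_CSV_RECORD` error, Univocity's `TextParsingException`, and a raw `CsvParser`, so that the sketch runs without Spark or Univocity on the classpath.

```java
import java.util.Arrays;
import java.util.function.Function;

class MalformedCsvSketch {
    /** Hypothetical stand-in for Spark's MALFORMED_CSV_RECORD error. */
    static class MalformedCsvRecordException extends RuntimeException {
        final String badRecord;
        MalformedCsvRecordException(String badRecord, Throwable cause) {
            super("[MALFORMED_CSV_RECORD] Malformed CSV record: " + badRecord, cause);
            this.badRecord = badRecord;
        }
    }

    /** Hypothetical stand-in for Univocity's TextParsingException. */
    static class WrappingParseException extends RuntimeException {
        WrappingParseException(Throwable cause) {
            super(cause);
        }
    }

    /**
     * Runs the tokenizer on one line and converts an
     * ArrayIndexOutOfBoundsException, raised bare or wrapped,
     * into the malformed-record error. Other failures pass through.
     */
    static String[] parseLine(Function<String, String[]> tokenizer, String line) {
        try {
            return tokenizer.apply(line);
        } catch (ArrayIndexOutOfBoundsException e) {
            throw new MalformedCsvRecordException(line, e);
        } catch (WrappingParseException e) {
            if (e.getCause() instanceof ArrayIndexOutOfBoundsException) {
                throw new MalformedCsvRecordException(line, e);
            }
            throw e;
        }
    }

    /**
     * Toy tokenizer (an assumption, not Univocity's algorithm): copies fields
     * into a buffer of maxColumns slots, so an over-wide row overflows it.
     */
    static Function<String, String[]> toyTokenizer(int maxColumns, boolean wrap) {
        return line -> {
            String[] parts = line.split(",", -1);
            String[] buffer = new String[maxColumns];
            try {
                for (int i = 0; i < parts.length; i++) {
                    buffer[i] = parts[i];
                }
            } catch (ArrayIndexOutOfBoundsException e) {
                if (wrap) {
                    throw new WrappingParseException(e);
                }
                throw e;
            }
            return Arrays.copyOf(buffer, parts.length);
        };
    }

    public static void main(String[] args) {
        // A row within the limit tokenizes normally.
        System.out.println(Arrays.toString(parseLine(toyTokenizer(3, false), "a,b,c")));
        // A row over the limit surfaces the translated error.
        try {
            parseLine(toyTokenizer(3, true), "a,b,c,d");
        } catch (MalformedCsvRecordException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Passing the tokenizer in as an argument mirrors the `parseLine(tokenizer, line)` shape named above: every path that holds its own tokenizer can share one translation point instead of repeating the catch logic.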
   
   ### Why are the changes needed?
   
   Schema inference and `multiLine` reads crash with an internal 
`ArrayIndexOutOfBoundsException` for input that should produce a clean 
`MALFORMED_CSV_RECORD` error. SPARK-49444 only covered 
`UnivocityParser.parseLine`; the inference paths construct a raw `CsvParser` 
and call `parseLine` on it directly, bypassing that handling.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. When a CSV row has more columns than `maxColumns` during schema 
inference (or a `multiLine` read), the surfaced error changes from an internal 
`java.lang.ArrayIndexOutOfBoundsException` to `MALFORMED_CSV_RECORD` (SQLSTATE 
`KD000`). This matches what the non-`multiLine` per-row read path has reported 
since SPARK-49444. It is a behavior change relative to all released versions.
   
   ### How was this patch tested?
   
   Added unit tests in `CSVSuite` covering schema inference for both the 
`multiLine` and non-`multiLine` paths, asserting `MALFORMED_CSV_RECORD` instead 
of a raw `ArrayIndexOutOfBoundsException`.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Anthropic), Claude Opus 4.8


