Jubin Soni created SPARK-57515:
----------------------------------
Summary: Surface MALFORMED_CSV_RECORD instead of
ArrayIndexOutOfBoundsException when CSV header exceeds maxColumns
Key: SPARK-57515
URL: https://issues.apache.org/jira/browse/SPARK-57515
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.0.3
Reporter: Jubin Soni
When a CSV file is read with {{header=true}} and its header row contains more
columns than {{maxColumns}} (default: {{20480}}, user-configurable), Spark
throws an internal {{java.lang.ArrayIndexOutOfBoundsException}} instead of
returning a structured {{MALFORMED_CSV_RECORD}} error.
This occurs because CSV header validation paths invoke Univocity parsing APIs
directly without the malformed-record handling introduced for data rows.
*Affected Code Paths*
The issue affects all CSV read paths that validate headers:
# *Non-multiLine file read*
** {{CSVHeaderChecker}} calls {{tokenizer.parseLine(header)}} directly.
# *MultiLine file read*
** {{CSVHeaderChecker}} calls {{tokenizer.parseNext()}} directly.
# *Dataset[String] csv()*
** {{CSVHeaderChecker}} creates a new {{CsvParser}} and calls
{{parser.parseLine(line)}} directly.
In all three cases, a header exceeding {{maxColumns}} surfaces a raw
{{ArrayIndexOutOfBoundsException}}.
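One possible shape for a fix is to route each raw parse call through a small wrapper that converts the overflow into a structured error. The following is a minimal, self-contained sketch in plain Scala, not the actual Spark or Univocity code: {{tokenize}} is a toy stand-in that only imitates the overflow, and {{MalformedCsvRecordError}}, {{HeaderParseSketch}}, and {{parseHeader}} are hypothetical names introduced for illustration.

```scala
// Hypothetical stand-in for a structured Spark error. The real error type and
// how it is constructed are not specified in this report.
final class MalformedCsvRecordError(val record: String, cause: Throwable)
  extends RuntimeException(s"[MALFORMED_CSV_RECORD] Malformed CSV record: $record", cause)

object HeaderParseSketch {
  // Toy tokenizer that imitates the failure mode described above: it writes
  // into a fixed buffer of maxColumns slots and overflows on wider input.
  def tokenize(line: String, maxColumns: Int): Array[String] = {
    val buffer = new Array[String](maxColumns)
    val fields = line.split(",", -1)
    var i = 0
    while (i < fields.length) {
      // Throws ArrayIndexOutOfBoundsException once i reaches maxColumns.
      buffer(i) = fields(i)
      i += 1
    }
    buffer.take(fields.length)
  }

  // Wrap the raw tokenizer call so that an overflow becomes a structured
  // error that keeps the original exception as its cause.
  def parseHeader(header: String, maxColumns: Int): Array[String] =
    try tokenize(header, maxColumns)
    catch {
      case e: ArrayIndexOutOfBoundsException =>
        throw new MalformedCsvRecordError(header, e)
    }
}
```

With this sketch, {{HeaderParseSketch.parseHeader("a,b,c", 2)}} raises {{MalformedCsvRecordError}} rather than leaking the raw exception, while a header that fits within the limit is tokenized normally. Applying one shared wrapper at all three call sites would keep header handling consistent with data row handling.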
*Background*
SPARK-57195 (merged 2026-06-14) fixed the same
{{ArrayIndexOutOfBoundsException}} issue for CSV data rows by converting parser
failures into {{MALFORMED_CSV_RECORD}} errors.
The SPARK-57195 discussion explicitly noted:
{quote}Header rows are out of scope from this PR. A header over maxColumns
still surfaces the raw AIOOBE (CSVHeaderChecker), a pre-existing gap.
{quote}
As a result, header parsing remains inconsistent with data row parsing.
*Expected Behavior*
When the header row exceeds {{maxColumns}}, Spark should fail with a
structured {{MALFORMED_CSV_RECORD}} error, consistent with data row handling.
*Actual Behavior*
Spark throws:
{{java.lang.ArrayIndexOutOfBoundsException}}
originating from Univocity parser internals.
*Steps to Reproduce*
{{import java.nio.file.\{Files, Paths}
import java.nio.charset.StandardCharsets

val path = "/tmp/test_header.csv"
Files.write(
  Paths.get(path),
  "a,b,c\n1,2,3\n".getBytes(StandardCharsets.UTF_8)
)

spark.read
  .option("header", "true")
  .option("maxColumns", "2")
  .csv(path)
  .collect()}}
*Result*
{{java.lang.ArrayIndexOutOfBoundsException}}
*Expected Result*
A Spark {{MALFORMED_CSV_RECORD}} error indicating that the CSV record exceeds
the configured {{maxColumns}} limit.