Jubin Soni created SPARK-57515:
----------------------------------

             Summary: Surface MALFORMED_CSV_RECORD instead of 
ArrayIndexOutOfBoundsException when CSV header exceeds maxColumns
                 Key: SPARK-57515
                 URL: https://issues.apache.org/jira/browse/SPARK-57515
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.0.3
            Reporter: Jubin Soni


When Spark reads a CSV file with {{header=true}} and the header row contains 
more columns than {{maxColumns}} (default: {{20480}}, user-configurable), it 
throws an internal {{java.lang.ArrayIndexOutOfBoundsException}} instead of 
returning a structured {{MALFORMED_CSV_RECORD}} error.

This occurs because CSV header validation paths invoke Univocity parsing APIs 
directly without the malformed-record handling introduced for data rows.

*Affected Code Paths*

The issue affects all CSV read paths that validate headers:
 # *Non-multiLine file read*
 ** {{CSVHeaderChecker}} calls {{tokenizer.parseLine(header)}} directly.
 # *MultiLine file read*
 ** {{CSVHeaderChecker}} calls {{tokenizer.parseNext()}} directly.
 # *Dataset[String] csv()*
 ** {{CSVHeaderChecker}} creates a new {{CsvParser}} and calls 
{{parser.parseLine(line)}} directly.

In all three cases, a header exceeding {{maxColumns}} surfaces a raw 
{{ArrayIndexOutOfBoundsException}}.
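
One way to picture a fix for the three paths above, as a minimal sketch rather than the actual Spark change: route each direct tokenizer call through a helper that converts the raw exception into a structured error. {{MalformedCsvHeader}}, {{parseHeaderSafely}}, and {{toyParser}} are hypothetical names introduced only for illustration; the {{parse}} argument stands in for the Univocity call, and Spark's real error class and plumbing will differ.

```scala
// Hypothetical stand-in for Spark's structured MALFORMED_CSV_RECORD error.
final class MalformedCsvHeader(val header: String, cause: Throwable)
  extends RuntimeException(s"[MALFORMED_CSV_RECORD] Malformed CSV header: $header", cause)

object HeaderParsing {
  // `parse` stands in for a direct tokenizer call such as parseLine(header).
  // Only the raw ArrayIndexOutOfBoundsException is converted; other
  // failures propagate unchanged.
  def parseHeaderSafely(header: String)(parse: String => Array[String]): Array[String] =
    try parse(header)
    catch {
      case e: ArrayIndexOutOfBoundsException =>
        throw new MalformedCsvHeader(header, e)
    }

  // Toy parser (not Univocity) that mimics the failure mode: it copies
  // tokens into a fixed-size buffer and overflows on too many columns.
  def toyParser(maxColumns: Int)(line: String): Array[String] = {
    val out = new Array[String](maxColumns)
    line.split(",", -1).zipWithIndex.foreach { case (tok, i) => out(i) = tok }
    out
  }
}
```

With {{maxColumns = 2}}, {{parseHeaderSafely("a,b,c")(toyParser(2))}} throws the structured error and keeps the original exception as its cause, while a two-column header parses normally.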

*Background*

SPARK-57195 (merged 2026-06-14) fixed the same 
{{ArrayIndexOutOfBoundsException}} issue for CSV data rows by converting parser 
failures into {{MALFORMED_CSV_RECORD}} errors.

The SPARK-57195 discussion explicitly noted:
{quote}Header rows are out of scope from this PR. A header over maxColumns 
still surfaces the raw AIOOBE (CSVHeaderChecker), a pre-existing gap.
{quote}
As a result, header parsing remains inconsistent with data row parsing.

*Expected Behavior*

When the header row exceeds {{maxColumns}}, Spark should fail with a 
structured {{MALFORMED_CSV_RECORD}} error, consistent with data row handling.

*Actual Behavior*

Spark throws {{java.lang.ArrayIndexOutOfBoundsException}}, originating from 
Univocity parser internals.

*Steps to Reproduce*

 

{{import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets

val path = "/tmp/test_header.csv"

Files.write(
  Paths.get(path),
  "a,b,c\n1,2,3\n".getBytes(StandardCharsets.UTF_8)
)

spark.read
  .option("header", "true")
  .option("maxColumns", "2")
  .csv(path)
  .collect()}}

*Result*

 

{{java.lang.ArrayIndexOutOfBoundsException}}

*Expected Result*

A Spark {{MALFORMED_CSV_RECORD}} error indicating that the CSV record exceeds 
the configured {{maxColumns}} limit.
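
A regression test would need to tell the two outcomes apart even if Spark wraps the failure in another exception. The following is a minimal sketch using only the JDK; {{causeChain}} and {{mentionsMalformedRecord}} are hypothetical helper names, not Spark APIs, and the check assumes the error condition name appears somewhere in a rendered exception message.

```scala
object CauseChain {
  // Collects the exception and all of its causes, outermost first.
  // The cap guards against an unexpectedly cyclic cause chain.
  def causeChain(t: Throwable): List[Throwable] =
    Iterator.iterate(t)(_.getCause).takeWhile(_ != null).take(64).toList

  // True if any exception in the chain mentions the structured error name.
  // Assumption: the condition name is included in the message text.
  def mentionsMalformedRecord(t: Throwable): Boolean =
    causeChain(t).exists { e =>
      Option(e.getMessage).exists(_.contains("MALFORMED_CSV_RECORD"))
    }
}
```

A test could wrap the reproduction above in {{try}}/{{catch}} and assert {{mentionsMalformedRecord(e)}} on the caught exception.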


