[ 
https://issues.apache.org/jira/browse/SPARK-57195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Botadra updated SPARK-57195:
---------------------------------
    Description: 
When inferring the schema of a CSV file with multiLine=true and 
inferSchema=true, a data row with more columns than maxColumns causes a raw 
java.lang.ArrayIndexOutOfBoundsException to propagate and fail the query 
instead of raising a user-facing error.

SPARK-49444 made the per-line parse path (UnivocityParser.parseLine) translate 
this into a MALFORMED_CSV_RECORD error. The streaming path used by multiLine 
reads and schema inference (UnivocityParser.tokenizeStream / convertStream, via 
tokenizer.parseNext()) was not covered, so the same input still throws a raw 
ArrayIndexOutOfBoundsException.
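
To illustrate the kind of translation described above, here is a minimal, 
self-contained sketch. It is not Spark or univocity code: the class 
MalformedRecordSketch, the toy tokenize method, and the 
MalformedCsvRecordException type are all hypothetical stand-ins. It only 
shows the general idea of catching the raw exception at the point where 
records are pulled from a stream and rethrowing it as a user-facing error 
that names the offending record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these types are Spark or univocity classes.
final class MalformedRecordSketch {

    /** Stand-in for a user-facing error that carries an error class name. */
    static final class MalformedCsvRecordException extends RuntimeException {
        MalformedCsvRecordException(String badRecord, Throwable cause) {
            super("[MALFORMED_CSV_RECORD] Malformed CSV record: " + badRecord, cause);
        }
    }

    /** Toy tokenizer that writes fields into a buffer capped at maxColumns. */
    static String[] tokenize(String line, int maxColumns) {
        String[] buffer = new String[maxColumns];
        String[] fields = line.split(",", -1);
        for (int i = 0; i < fields.length; i++) {
            // Throws ArrayIndexOutOfBoundsException once i reaches maxColumns.
            buffer[i] = fields[i];
        }
        return Arrays.copyOf(buffer, fields.length);
    }

    /** Streaming-style read: tokenizes each record and translates the raw exception. */
    static List<String[]> tokenizeAll(List<String> records, int maxColumns) {
        List<String[]> rows = new ArrayList<>();
        for (String record : records) {
            try {
                rows.add(tokenize(record, maxColumns));
            } catch (ArrayIndexOutOfBoundsException e) {
                // Keep the original exception as the cause for debugging.
                throw new MalformedCsvRecordException(record, e);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Same shape as the repro input: the third row exceeds maxColumns=2.
        List<String> records = Arrays.asList("a,b", "c,d", "1,2,3");
        try {
            tokenizeAll(records, 2);
        } catch (MalformedCsvRecordException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The sketch wraps the call site rather than the tokenizer itself, so the 
original exception stays available as the cause. Whether the actual fix 
takes this shape is an assumption.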

 

Repro:
{code:java}
// Input file at `path`; the last row has 3 columns, one more than maxColumns=2:
//   a,b
//   c,d
//   1,2,3

spark.read
    .option("header", "false")
    .option("inferSchema", "true")
    .option("multiLine", "true")
    .option("maxColumns", "2")
    .csv(path)

// Expected: MALFORMED_CSV_RECORD (SQLSTATE KD000).
// Actual: java.lang.ArrayIndexOutOfBoundsException.
{code}

How introduced: the streaming parseNext() path predates SPARK-49444 and was 
missed when that fix was applied to parseLine.

> CSV multiLine schema inference throws ArrayIndexOutOfBoundsException for a 
> row exceeding maxColumns
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-57195
>                 URL: https://issues.apache.org/jira/browse/SPARK-57195
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.0.2, 3.5.8, 4.1.2
>            Reporter: Yash Botadra
>            Priority: Major
>              Labels: correctness



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
