[ https://issues.apache.org/jira/browse/SPARK-24133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Herman van Hovell updated SPARK-24133:
--------------------------------------
    Fix Version/s: 2.3.1

> Reading Parquet files containing large strings can fail with 
> java.lang.ArrayIndexOutOfBoundsException
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24133
>                 URL: https://issues.apache.org/jira/browse/SPARK-24133
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Ala Luszczak
>            Assignee: Ala Luszczak
>            Priority: Major
>             Fix For: 2.3.1, 2.4.0
>
>
> ColumnVectors store string data in one big byte array. Since the array size 
> is capped at just under Integer.MAX_VALUE, a single ColumnVector cannot store 
> more than 2GB of string data.
> However, since Parquet files commonly contain large blobs stored as strings, 
> and a ColumnVector by default holds 4096 values, it is entirely possible to 
> exceed that limit.
> In such cases the capacity requested from WritableColumnVector.reserve() 
> overflows to a negative number. The call silently succeeds, because the 
> negative requested capacity compares as smaller than the capacity already 
> allocated, and java.lang.ArrayIndexOutOfBoundsException is then thrown when 
> the reader actually attempts to put the data into the array.
> This behavior is hard for users to troubleshoot. Instead, Spark should check 
> for a negative requested capacity in WritableColumnVector.reserve() and throw 
> a more informative error, instructing the user to reduce the ColumnarBatch 
> size (a sketch of such a guard follows below).
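
For concreteness, here is a minimal, self-contained Java sketch of both the
overflow and the kind of guard the description proposes. The 4096 batch size
comes from the description; the per-value size, class name, and method body
are hypothetical illustrations, not Spark's actual code (see SPARK-24133 for
the real patch):

    public class ReserveOverflowSketch {
        // Pretend 1 MB of string data is already allocated in the vector.
        private static int capacity = 1 << 20;

        // Hypothetical guard of the kind the description proposes: fail fast
        // on a negative (overflowed) request instead of returning silently.
        static void reserve(int requiredCapacity) {
            if (requiredCapacity < 0) {
                throw new RuntimeException("Requested capacity overflowed to "
                    + requiredCapacity + " bytes; try reducing the "
                    + "ColumnarBatch size so fewer values are read per batch.");
            }
            if (requiredCapacity > capacity) {
                capacity = requiredCapacity; // grow the backing array (elided)
            }
        }

        public static void main(String[] args) {
            int batchSize = 4096;          // default rows per ColumnarBatch
            int avgStringBytes = 600_000;  // hypothetical ~600 KB blob per value
            // 4096 * 600_000 = 2_457_600_000 > Integer.MAX_VALUE, so the
            // 32-bit multiplication wraps around to a negative number.
            int required = batchSize * avgStringBytes;
            System.out.println("requested capacity = " + required); // -1837367296
            reserve(required); // with the guard, this throws a clear error;
                               // without it, reserve() would return silently.
        }
    }

Without the negative check, the overflowed request compares as smaller than
the current capacity, reserve() returns without growing the array, and the
later out-of-bounds write produces exactly the confusing
ArrayIndexOutOfBoundsException described above.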



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
