[ https://issues.apache.org/jira/browse/SPARK-24133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Herman van Hovell updated SPARK-24133:
--------------------------------------
    Fix Version/s: 2.3.1

> Reading Parquet files containing large strings can fail with java.lang.ArrayIndexOutOfBoundsException
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24133
>                 URL: https://issues.apache.org/jira/browse/SPARK-24133
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Ala Luszczak
>            Assignee: Ala Luszczak
>            Priority: Major
>             Fix For: 2.3.1, 2.4.0
>
>
> ColumnVectors store string data in one big byte array. Since the array size
> is capped at just under Integer.MAX_VALUE, a single ColumnVector cannot store
> more than 2GB of string data.
> However, since Parquet files commonly contain large blobs stored as strings,
> and ColumnVectors by default carry 4096 values, it is entirely possible to go
> past that limit.
> In such cases a negative capacity is requested from
> WritableColumnVector.reserve(). The call succeeds (the requested capacity is
> smaller than what is already allocated), and consequently a
> java.lang.ArrayIndexOutOfBoundsException is thrown when the reader actually
> attempts to put the data into the array.
> This behavior is hard for users to troubleshoot. Spark should instead check
> for a negative requested capacity in WritableColumnVector.reserve() and throw
> a more informative error, instructing the user to tweak the ColumnarBatch
> size.
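>
> To illustrate the failure mode and the proposed guard, here is a minimal
> sketch, not Spark's actual WritableColumnVector code: a simplified growable
> byte vector whose GrowableByteVector/appendBytes() names and error message
> are hypothetical. Once more than ~2GB has been appended, the int sum wraps
> negative; the guard turns the later ArrayIndexOutOfBoundsException into an
> actionable error.
>
> import java.util.Arrays;
>
> public class GrowableByteVector {
>   // JVM arrays cannot be allocated much beyond Integer.MAX_VALUE elements.
>   private static final int MAX_CAPACITY = Integer.MAX_VALUE - 15;
>
>   private byte[] data = new byte[64];
>   private int used = 0;
>
>   public void reserve(int requiredCapacity) {
>     if (requiredCapacity < 0) {
>       // The requested capacity overflowed int: more than 2GB of string
>       // data was appended to a single vector. Without this check the
>       // request "succeeds" (a negative number is smaller than the current
>       // capacity), and the subsequent array write throws
>       // ArrayIndexOutOfBoundsException instead.
>       throw new RuntimeException("Cannot reserve more than " + MAX_CAPACITY
>           + " bytes in a single vector; try reducing the number of values"
>           + " per ColumnarBatch.");
>     }
>     if (requiredCapacity > data.length) {
>       // Double the buffer, capped at the maximum array size.
>       int newCapacity = (int) Math.min(MAX_CAPACITY, requiredCapacity * 2L);
>       data = Arrays.copyOf(data, newCapacity);
>     }
>   }
>
>   public void appendBytes(byte[] bytes) {
>     // Past ~2GB of appended data, used + bytes.length wraps negative.
>     reserve(used + bytes.length);
>     System.arraycopy(bytes, 0, data, used, bytes.length);
>     used += bytes.length;
>   }
> }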
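>
> The mitigation the description points at is shrinking the batch size so that
> 4096 values times the average string size stays well under 2GB. A hedged
> usage sketch, assuming Spark 2.4+, where the vectorized Parquet reader batch
> size is exposed as spark.sql.parquet.columnarReaderBatchSize (default 4096);
> the input path is a placeholder:
>
> import org.apache.spark.sql.SparkSession;
>
> public class SmallBatchRead {
>   public static void main(String[] args) {
>     SparkSession spark = SparkSession.builder()
>         .appName("small-batch-read")
>         .master("local[*]")
>         // 256 rows per batch: even ~4MB strings keep a single vector's
>         // byte buffer near 1GB, well under the array size limit.
>         .config("spark.sql.parquet.columnarReaderBatchSize", "256")
>         .getOrCreate();
>     spark.read().parquet("/path/to/large-strings.parquet").show();
>     spark.stop();
>   }
> }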