GitHub user ala opened a pull request:
https://github.com/apache/spark/pull/21206
[SPARK-24133][SQL] Check for integer overflows when resizing WritableColumnVectors
## What changes were proposed in this pull request?
`ColumnVector`s store string data in one big byte array. Since the array
size is capped at just under Integer.MAX_VALUE, a single `ColumnVector` cannot
store more than 2GB of string data.
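As a back-of-the-envelope illustration (the numbers here are mine, not from the patch), the per-value budget under the default batch size is only about half a megabyte:

```java
// Illustrative arithmetic only, not code from the PR.
public class CapacityBudget {
  public static void main(String[] args) {
    int batchSize = 4096;                    // default values per batch
    int cap = Integer.MAX_VALUE;             // array size limit, ~2 GB
    System.out.println(cap / batchSize);     // 524287 bytes per value
    // An average string above ~512 KB per row exhausts the shared array.
  }
}
```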
But Parquet files commonly contain large blobs stored as strings, and `ColumnVector`s by default carry 4096 values, so it is entirely possible to exceed that limit. In such cases a negative capacity is requested from `WritableColumnVector.reserve()`. The call succeeds (the requested capacity is smaller than the already allocated capacity), and consequently `java.lang.ArrayIndexOutOfBoundsException` is thrown when the reader actually attempts to put the data into the array.
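A minimal sketch of the failure mode, using illustrative names and numbers rather than Spark's exact code:

```java
// Sketch of the overflow: the caller computes a required capacity that
// wraps past Integer.MAX_VALUE into a negative int.
public class OverflowDemo {
  public static void main(String[] args) {
    int bytesAppendedSoFar = 2_000_000_000;  // bytes already in the array
    int nextValueLength = 200_000_000;       // next string to append
    int requiredCapacity = bytesAppendedSoFar + nextValueLength;
    System.out.println(requiredCapacity);    // -2094967296: wrapped around
    // reserve(requiredCapacity) then compares this negative request
    // against the (positive) current capacity, decides nothing needs to
    // grow, and the subsequent array write throws
    // java.lang.ArrayIndexOutOfBoundsException.
  }
}
```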
This change introduces a simple check for integer overflow to
`WritableColumnVector.reserve()`, which should help catch the error earlier and
provide a more informative exception. Additionally, the error message in
`WritableColumnVector.throwUnsupportedException()` was corrected, as it
previously encouraged users to increase rather than reduce the batch size.
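For reference, a hedged sketch of what such a guard can look like; the exact code in the PR may differ, and `MAX_CAPACITY`, `capacity`, and `reserveInternal()` stand in for the vector's internals:

```java
// Sketch of the overflow check described above, not the PR verbatim.
public void reserve(int requiredCapacity) {
  if (requiredCapacity < 0) {
    // A negative request can only result from an int overflow upstream;
    // fail fast with an informative message instead of letting a later
    // array write surface as ArrayIndexOutOfBoundsException.
    throwUnsupportedException(requiredCapacity, null);
  } else if (requiredCapacity > capacity) {
    // Double the capacity, computing in long so the new size itself
    // cannot overflow, then clamp to the maximum allowed.
    int newCapacity = (int) Math.min(MAX_CAPACITY, requiredCapacity * 2L);
    if (requiredCapacity <= newCapacity) {
      reserveInternal(newCapacity);
    } else {
      throwUnsupportedException(requiredCapacity, null);
    }
  }
}
```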
## How was this patch tested?
New unit tests were added.
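For illustration, a hypothetical JUnit-style sketch of the behavior such a test pins down; the PR's actual tests may be structured differently:

```java
// Hypothetical test sketch, not the tests from the PR.
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
import org.apache.spark.sql.types.DataTypes;
import org.junit.Assert;
import org.junit.Test;

public class ReserveOverflowSuite {
  @Test
  public void reserveRejectsNegativeCapacity() {
    try (OnHeapColumnVector vector =
             new OnHeapColumnVector(4096, DataTypes.StringType)) {
      // A negative capacity models an upstream int overflow; reserve()
      // should now fail fast instead of silently succeeding.
      Assert.assertThrows(RuntimeException.class, () -> vector.reserve(-1));
    }
  }
}
```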
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ala/spark overflow-reserve
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21206.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21206
----
commit d754175e2fb853befd807578c269aabafb311802
Author: Ala Luszczak <ala@...>
Date: 2018-05-01T09:29:31Z
add check for negative capacity, better error msg, tests
commit 17e2d0270c3edfa9a7fcfd602283eb916b5e8f6a
Author: Ala Luszczak <ala@...>
Date: 2018-05-01T09:35:58Z
include defaults for reference
----