GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/9760
[SPARK-11767] [SQL] limit the size of cached batch
Currently the size of a cached batch is controlled only by `batchSize`
(default value is 10000 rows), which does not account for the size of the
serialized columns (for example, complex types). Because the memory used to
build the batch is not tracked, it is easy to hit OOM (especially after
unified memory management).
This PR introduces a hard limit of 4 MB on the total size of the columns in a
batch (enough for up to 50 columns of uncompressed primitive values).
It also changes the way the column buffers grow: a buffer is doubled each time
it fills up, then trimmed to its actual size once the batch is finished.
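The following is a minimal Scala sketch of the two ideas described above, not the actual Spark code: a column buffer that grows by doubling and is trimmed when the batch is built, plus a batch-closing check against a 4 MB cap. All names here (`GrowableColumnBuffer`, `BatchSizing`, `shouldCloseBatch`) are hypothetical.

```scala
import java.nio.ByteBuffer

// Hypothetical growable column buffer: doubles its capacity whenever it runs
// out of room, then trims to the bytes actually written once the batch is done.
class GrowableColumnBuffer(initialCapacity: Int = 1024) {
  private var buffer = ByteBuffer.allocate(initialCapacity)

  // Make sure there is room for `extra` more bytes, doubling capacity as needed.
  private def ensureFree(extra: Int): Unit = {
    if (buffer.remaining() < extra) {
      var newCapacity = buffer.capacity()
      while (newCapacity - buffer.position() < extra) newCapacity *= 2
      val grown = ByteBuffer.allocate(newCapacity)
      buffer.flip()
      grown.put(buffer)
      buffer = grown
    }
  }

  def putLong(v: Long): Unit = { ensureFree(8); buffer.putLong(v) }

  def sizeInBytes: Int = buffer.position()

  // Trim the buffer down to the bytes actually written.
  def build(): ByteBuffer = {
    buffer.flip()
    val trimmed = ByteBuffer.allocate(buffer.remaining())
    trimmed.put(buffer)
    trimmed.flip()
    trimmed
  }
}

// Hypothetical batching check: a batch is closed either after `batchSize` rows
// or once the accumulated column bytes reach the 4 MB cap, whichever comes first.
object BatchSizing {
  val MaxBatchBytes: Long = 4L * 1024 * 1024 // hard limit assumed from the PR description
  val BatchSize: Int = 10000                 // existing row-count limit

  def shouldCloseBatch(rowCount: Int, columns: Seq[GrowableColumnBuffer]): Boolean =
    rowCount >= BatchSize || columns.map(_.sizeInBytes.toLong).sum >= MaxBatchBytes
}
```

Doubling keeps the number of reallocations logarithmic in the final buffer size, and trimming on build avoids caching the unused slack that doubling leaves behind.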
cc @liancheng
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/davies/spark cache_limit
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/9760.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #9760
----
commit d57c180afdd46a7824dc860e9ed3c60eb5e9ad7c
Author: Davies Liu <[email protected]>
Date: 2015-11-17T06:29:42Z
limit the size of caced batch
----