GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/9760

    [SPARK-11767] [SQL] limit the size of cached batch

    Currently the size of cached batch in only controlled by `batchSize` 
(default value is 10000), which does not work well with the size of serialized 
columns (for example, complex types). The memory used to build the batch is not 
accounted, it's easy to OOM (especially after unified memory management).
    
    This PR introduces a hard limit of 4 MB on the total size of the columns 
in a batch (enough for up to 50 uncompressed primitive columns).
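
    A minimal sketch of the batching check, using hypothetical names 
(`batchIsFull`, `columnBytes`, and `maxBatchBytes` are illustrative, not the 
actual Spark internals):

        // Hypothetical sketch: close the current batch once either the
        // row-count limit or the new size cap is reached.
        val batchSize = 10000                  // existing row-count limit
        val maxBatchBytes = 4 * 1024 * 1024    // proposed hard cap on column size

        def batchIsFull(rowCount: Int, columnBytes: Long): Boolean =
          rowCount >= batchSize || columnBytes >= maxBatchBytes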
    
    This also changes the way buffers are grown: double the capacity each 
time, then trim the buffer once the batch is finished.
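
    A rough standalone sketch of the grow-then-trim strategy using 
java.nio.ByteBuffer (not the actual ColumnBuilder code):

        import java.nio.ByteBuffer

        // Double the capacity until `extra` more bytes fit, copying the
        // already-written bytes into the new buffer.
        def ensureCapacity(buf: ByteBuffer, extra: Int): ByteBuffer = {
          if (buf.remaining() >= extra) {
            buf
          } else {
            var newCapacity = math.max(buf.capacity() * 2, 16)
            while (newCapacity - buf.position() < extra) newCapacity *= 2
            val grown = ByteBuffer.allocate(newCapacity)
            buf.flip()
            grown.put(buf)
          }
        }

        // Once the batch is finished, copy only the bytes actually written,
        // dropping the unused tail of the over-allocated buffer.
        def trim(buf: ByteBuffer): ByteBuffer = {
          val trimmed = ByteBuffer.allocate(buf.position())
          buf.flip()
          trimmed.put(buf)
          trimmed.flip()
          trimmed
        }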
    
    cc @liancheng 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark cache_limit

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9760.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9760
    
----
commit d57c180afdd46a7824dc860e9ed3c60eb5e9ad7c
Author: Davies Liu <[email protected]>
Date:   2015-11-17T06:29:42Z

    limit the size of cached batch

----

