GitHub user andrewor14 opened a pull request:
https://github.com/apache/spark/pull/1083
[SPARK-1201] Do not fully materialize partitions for StorageLevel.MEMORY_*_SER
The deserialized version of a partition may occupy much more space than the
serialized version. Therefore, if a partition is to be cached with
`StorageLevel.MEMORY_*_SER`, we do not need to fully unroll it into an
`ArrayBuffer`; instead, we can unroll it into a potentially much smaller
`ByteBuffer`.
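As a rough illustration of the idea (not the code in this patch), the sketch
below contrasts the two unrolling strategies on a plain Scala iterator. It
assumes Java serialization through `ObjectOutputStream`; the object
`UnrollSketch` and the helpers `unrollDeserialized` and `unrollSerialized` are
hypothetical names and do not exist in Spark.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.nio.ByteBuffer
import scala.collection.mutable.ArrayBuffer

// Hypothetical helpers for illustration only; not part of Spark.
object UnrollSketch {
  // Deserialized unrolling: every element is kept as a live object.
  def unrollDeserialized[T](values: Iterator[T]): ArrayBuffer[T] = {
    val buffer = new ArrayBuffer[T]
    values.foreach(v => buffer += v)
    buffer
  }

  // Serialized unrolling: each element is written to a byte stream as it is
  // consumed, and the returned buffer holds only the serialized bytes.
  def unrollSerialized[T <: AnyRef](values: Iterator[T]): ByteBuffer = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try {
      values.foreach(v => out.writeObject(v))
    } finally {
      out.close()
    }
    ByteBuffer.wrap(bytes.toByteArray)
  }
}
```

Calling `UnrollSketch.unrollSerialized(Iterator("alpha", "beta"))` yields a
`ByteBuffer` that can be read back with an `ObjectInputStream`. How much
smaller it is than the `ArrayBuffer` form depends on the element type and the
serializer in use.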
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/andrewor14/spark unroll-them-partitions
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1083.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1083
----
commit a8f181d6483b509c29900de5f325a01ea0ef824f
Author: Andrew Or <[email protected]>
Date: 2014-06-14T03:49:18Z
Add special handling for StorageLevel.MEMORY_*_SER
We only unroll the serialized form of each partition for this case,
because the deserialized form may be much larger and may not fit in
memory.
This commit also abstracts out part of the logic of getOrCompute to
make it more readable.
commit 2941c89baacacfc7573cde35a694bc18a7f5fd4f
Author: Andrew Or <[email protected]>
Date: 2014-06-14T03:52:31Z
Clean up BlockStore (minor)
commit 44ef28246ad4f8116155b0db4969898cc09e5e5e
Author: Andrew Or <[email protected]>
Date: 2014-06-14T03:53:25Z
Actually return updated blocks in putBytes
Previously we never returned the updated blocks in MemoryStore's
putBytes. This is a simple bug with a simple fix.
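To illustrate the kind of contract this fix restores, here is a minimal toy
sketch, not Spark's `MemoryStore`. It assumes a store that evicts the oldest
blocks to make room and reports them to the caller; `ToyMemoryStore` and
`ToyPutResult` are hypothetical names invented for this example.

```scala
import java.nio.ByteBuffer
import scala.collection.mutable

// Hypothetical result type: the stored size plus the ids of evicted blocks.
final case class ToyPutResult(size: Long, droppedBlocks: Seq[String])

// Hypothetical toy store for illustration only; not Spark's MemoryStore.
class ToyMemoryStore(maxBytes: Long) {
  // LinkedHashMap keeps insertion order, so the head is the oldest block.
  private val entries = mutable.LinkedHashMap.empty[String, (ByteBuffer, Long)]
  private var usedBytes = 0L

  def putBytes(blockId: String, bytes: ByteBuffer): ToyPutResult = {
    val size = bytes.remaining().toLong
    // Replacing an existing block first releases its old space.
    entries.remove(blockId).foreach { case (_, oldSize) => usedBytes -= oldSize }
    val dropped = mutable.ArrayBuffer.empty[String]
    // Evict the oldest blocks until the new one fits or nothing is left.
    while (usedBytes + size > maxBytes && entries.nonEmpty) {
      val (oldId, (_, oldSize)) = entries.head
      entries.remove(oldId)
      usedBytes -= oldSize
      dropped += oldId
    }
    // Store the block only if it fits after eviction.
    if (usedBytes + size <= maxBytes) {
      entries.put(blockId, (bytes, size))
      usedBytes += size
    }
    // Return the evicted ids rather than an empty list, so callers can
    // update their bookkeeping for those blocks.
    ToyPutResult(size, dropped.toList)
  }

  def contains(blockId: String): Boolean = entries.contains(blockId)
}
```

With a 10-byte store, putting a 6-byte block `"a"` and then a 6-byte block
`"b"` evicts `"a"`, and the second call's `droppedBlocks` names it.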
----