GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/11748
Store serialized blocks as multiple chunks in MemoryStore
This patch modifies the BlockManager, MemoryStore, and several other
storage components so that cached, serialized blocks are stored as multiple
small chunks rather than as a single contiguous ByteBuffer.
This change will help to improve the efficiency of memory allocation and
the accuracy of memory accounting when serializing blocks. Our current
serialization code uses a ByteBufferOutputStream, which doubles and
re-allocates its backing byte array; this increases the peak memory
requirements during serialization (since we need to hold extra memory while
expanding the array). In addition, we currently don't account for the extra
wasted space at the end of the ByteBuffer's backing array, so a 129 megabyte
serialized block may actually consume 256 megabytes of memory. After switching
to storing blocks in multiple chunks, we'll be able to efficiently trim the
backing buffers so that no space is wasted.
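To illustrate the accounting argument above, here is a minimal sketch in Java. It is not the ChunkedByteBuffer implementation from this patch; the class name `ChunkedWriteSketch`, its methods, the 1 MB initial capacity, and the chunk size are all hypothetical and chosen only for illustration. It compares the capacity a doubling buffer ends up with against the total capacity of fixed-size chunks whose last chunk is sized exactly.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class ChunkedWriteSketch {
    // Capacity a doubling buffer reaches when it grows from `initial`
    // until it can hold `needed` bytes (assumed growth policy: double each time).
    public static long doublingCapacity(long initial, long needed) {
        long cap = initial;
        while (cap < needed) {
            cap *= 2;
        }
        return cap;
    }

    // Copy `data` into chunks of at most `chunkSize` bytes.
    // The final chunk is allocated at exactly the remaining length,
    // so no chunk carries unused trailing capacity.
    public static List<ByteBuffer> toChunks(byte[] data, int chunkSize) {
        List<ByteBuffer> chunks = new ArrayList<>();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            ByteBuffer chunk = ByteBuffer.allocate(len);
            chunk.put(data, off, len);
            chunk.flip(); // make the chunk readable from position 0
            chunks.add(chunk);
        }
        return chunks;
    }

    // Sum of the capacities of all chunks, tracked as a long.
    public static long totalCapacity(List<ByteBuffer> chunks) {
        long total = 0;
        for (ByteBuffer c : chunks) {
            total += c.capacity();
        }
        return total;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 129 MB payload in a doubling buffer that started at 1 MB.
        System.out.println(doublingCapacity(mb, 129 * mb) / mb + " MB allocated");

        // A 10-byte payload split into 4-byte chunks: 4 + 4 + 2 bytes.
        List<ByteBuffer> chunks = toChunks(new byte[10], 4);
        System.out.println(chunks.size() + " chunks, "
            + totalCapacity(chunks) + " bytes allocated");
    }
}
```

In this sketch the doubling buffer over-allocates by nearly a factor of two for the 129 megabyte case, while the chunked layout allocates exactly as many bytes as the payload contains.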
This change is also a prerequisite to being able to cache blocks which are
larger than 2GB (although full support for that depends on several other
changes which have not been implemented yet).
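As background for the 2GB remark: a single java.nio.ByteBuffer is indexed by an int, so its capacity cannot exceed Integer.MAX_VALUE bytes, whereas a list of chunks can track its total size as a long. The sketch below is only an illustration of that arithmetic; the class name `LargeBlockSketch`, the method `chunkCount`, and the 4 MB chunk size are hypothetical and not taken from this patch.

```java
public class LargeBlockSketch {
    // Number of chunks needed to hold `totalBytes` when each chunk
    // holds at most `chunkSize` bytes (ceiling division on longs).
    public static long chunkCount(long totalBytes, int chunkSize) {
        return (totalBytes + chunkSize - 1) / chunkSize;
    }

    public static void main(String[] args) {
        long threeGb = 3L * 1024 * 1024 * 1024;
        int chunkSize = 4 * 1024 * 1024; // assumed chunk size, for illustration

        // A 3 GB block does not fit in one int-indexed buffer...
        System.out.println(threeGb > Integer.MAX_VALUE);
        // ...but it can be described as a sequence of smaller chunks.
        System.out.println(chunkCount(threeGb, chunkSize) + " chunks");
    }
}
```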
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark chunked-block-serialization
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11748.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11748
----
commit 735eca68d8efcd150d47631644cf848b4d98603e
Author: Josh Rosen <[email protected]>
Date: 2016-03-15T04:57:16Z
Split MemoryEntry into two separate classes (serialized and deserialized)
commit 8f0828986b72ce722cfe0360ae863971547fc58b
Author: Josh Rosen <[email protected]>
Date: 2016-03-15T18:53:54Z
Add ChunkedByteBuffer and use it in storage layer.
commit 79b1a6a31236b81c444dda1e8ee1cfdf2f3c36ae
Author: Josh Rosen <[email protected]>
Date: 2016-03-15T20:53:27Z
Add test cases and fix bug in ChunkedByteBuffer.toInputStream()
commit 7dbcd5a9ef0c669f5db97990af944d8b63300e97
Author: Josh Rosen <[email protected]>
Date: 2016-03-15T22:05:23Z
WIP towards understanding destruction.
commit 3fbec212d9f714386121b4aed791d6c9fb1359a2
Author: Josh Rosen <[email protected]>
Date: 2016-03-15T22:39:27Z
Small fixes to dispose behavior.
commit e5e663f22094333dac6e184c78176ee658e3441e
Author: Josh Rosen <[email protected]>
Date: 2016-03-15T22:49:24Z
Modify BlockManager.dataSerialize to write ChunkedByteBuffers.
----