GitHub user wypoon opened a pull request:
https://github.com/apache/spark/pull/23058
[SPARK-25905][CORE] When getting a remote block, avoid forcing a conversion
to a ChunkedByteBuffer
## What changes were proposed in this pull request?
In `BlockManager`, `getRemoteValues` gets a `ChunkedByteBuffer` (by calling
`getRemoteBytes`) and creates an `InputStream` from it. `getRemoteBytes`, in
turn, gets a `ManagedBuffer` and converts it to a `ChunkedByteBuffer`.
Instead, expose a `getRemoteManagedBuffer` method so `getRemoteValues` can
just get this `ManagedBuffer` and use its `InputStream`.
This significantly reduces heap memory usage when reading a remote cached
block from disk, because the block no longer has to be converted to a
`ChunkedByteBuffer` before it is read.
Retain `getRemoteBytes` for other callers.
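
The difference between the two read paths can be sketched in plain Java with toy types. This is only an illustration, not Spark's `BlockManager` or `ManagedBuffer` code: `FetchedBlock`, `syntheticBlock`, `viaFullCopy`, `viaDirectStream`, and `countBytes` are hypothetical names introduced here, and the "block" is a synthetic stream standing in for a block stored on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class RemoteBlockSketch {

    /** Hypothetical stand-in for a fetched block; NOT Spark's ManagedBuffer API. */
    public interface FetchedBlock {
        InputStream openStream();
    }

    /** A block whose bytes are produced lazily, loosely imitating a block on disk. */
    public static FetchedBlock syntheticBlock(final long size) {
        return () -> new InputStream() {
            private long remaining = size;

            @Override
            public int read() {
                if (remaining <= 0) {
                    return -1;
                }
                remaining--;
                return 0x2A;
            }
        };
    }

    /** Old shape: copy the whole block onto the heap, then stream from the copy. */
    public static InputStream viaFullCopy(FetchedBlock block) {
        ByteArrayOutputStream copy = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        try (InputStream in = block.openStream()) {
            int n;
            while ((n = in.read(buf)) != -1) {
                copy.write(buf, 0, n); // heap use grows with the block size
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ByteArrayInputStream(copy.toByteArray());
    }

    /** New shape: hand the block's own stream straight to the reader. */
    public static InputStream viaDirectStream(FetchedBlock block) {
        return block.openStream();
    }

    /** Drains a stream through a fixed 8 KiB buffer and returns the byte count. */
    public static long countBytes(InputStream in) {
        byte[] buf = new byte[8192];
        long total = 0;
        try (InputStream s = in) {
            int n;
            while ((n = s.read(buf)) != -1) {
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        FetchedBlock block = syntheticBlock(1L << 20); // 1 MiB toy block
        System.out.println(countBytes(viaFullCopy(block)));
        System.out.println(countBytes(viaDirectStream(block)));
    }
}
```

Both paths deliver the same bytes to the reader. The full-copy path holds the entire block in memory before the first byte is consumed, while the direct path in this sketch only ever needs the reader's fixed-size buffer.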
## How was this patch tested?
Imran Rashid wrote an application
(https://github.com/squito/spark_2gb_test/blob/master/src/main/scala/com/cloudera/sparktest/LargeBlocks.scala)
that, among other things, tests reading remote cached blocks. I ran this
application with 2500 MB blocks to test reading a cached block from disk.
Without this change, with `--executor-memory 5g`, the test fails with
`java.lang.OutOfMemoryError: Java heap space`. With the change, the test passes
with `--executor-memory 2g`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wypoon/spark SPARK-25905
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/23058.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #23058
----
commit 2516ec61d9395a2ed36185affc4018a4a4f9b7ca
Author: Wing Yew Poon <wypoon@...>
Date: 2018-11-16T02:47:16Z
[SPARK-25905][CORE] When getting a remote block, avoid forcing a conversion
to a ChunkedByteBuffer
----