GitHub user wypoon opened a pull request:

    https://github.com/apache/spark/pull/23058

    [SPARK-25905][CORE] When getting a remote block, avoid forcing a conversion to a ChunkedByteBuffer

    ## What changes were proposed in this pull request?
    
    In `BlockManager`, `getRemoteValues` gets a `ChunkedByteBuffer` by calling `getRemoteBytes` and creates an `InputStream` from it. `getRemoteBytes`, in turn, gets a `ManagedBuffer` and converts it to a `ChunkedByteBuffer`.
    Instead, expose a `getRemoteManagedBuffer` method so that `getRemoteValues` can get the `ManagedBuffer` directly and use its `InputStream`.
    When reading a remote cache block from disk, this significantly reduces heap memory usage, because the conversion to a `ChunkedByteBuffer` is avoided.
    `getRemoteBytes` is retained for other callers.
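    
    The difference between the two read paths can be sketched in isolation. The following is a minimal illustration only, not the code in this patch: `BlockBuffer`, `FileBlockBuffer`, `openViaBytes`, and `openViaStream` are hypothetical stand-ins for `ManagedBuffer`, a disk-backed block, and the old and new shapes of `getRemoteValues`. It assumes the block is a plain file on local disk.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class RemoteBlockReadSketch {
    // Hypothetical stand-in for a managed buffer; NOT Spark's ManagedBuffer API.
    interface BlockBuffer {
        byte[] toByteArray() throws IOException;            // materializes the whole block on the heap
        InputStream createInputStream() throws IOException; // streams the block from its source
    }

    // Hypothetical file-backed buffer, loosely modeling a cache block stored on disk.
    static final class FileBlockBuffer implements BlockBuffer {
        private final Path file;

        FileBlockBuffer(Path file) {
            this.file = file;
        }

        @Override
        public byte[] toByteArray() throws IOException {
            return Files.readAllBytes(file);
        }

        @Override
        public InputStream createInputStream() throws IOException {
            return Files.newInputStream(file);
        }
    }

    // Old shape: copy the entire block into memory, then wrap the copy in a stream.
    static InputStream openViaBytes(BlockBuffer buf) throws IOException {
        return new ByteArrayInputStream(buf.toByteArray());
    }

    // New shape: return the buffer's own stream, with no full in-memory copy.
    static InputStream openViaStream(BlockBuffer buf) throws IOException {
        return buf.createInputStream();
    }

    // Consumes a stream through a small fixed-size chunk and returns the sum of its bytes.
    static long sumBytes(InputStream in) throws IOException {
        try (InputStream s = in) {
            byte[] chunk = new byte[8192];
            long sum = 0;
            int n;
            while ((n = s.read(chunk)) != -1) {
                for (int i = 0; i < n; i++) {
                    sum += chunk[i] & 0xFF;
                }
            }
            return sum;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("block", ".bin");
        try {
            Files.write(tmp, new byte[] {1, 2, 3, 4});
            BlockBuffer buf = new FileBlockBuffer(tmp);
            System.out.println(sumBytes(openViaBytes(buf)));
            System.out.println(sumBytes(openViaStream(buf)));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

    Both paths yield the same bytes to the consumer. In this sketch, the streaming path only ever holds the 8 KB chunk in memory, whereas the byte-array path holds a full copy of the block until the stream is discarded.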
    
    ## How was this patch tested?
    
    Imran Rashid wrote an application (https://github.com/squito/spark_2gb_test/blob/master/src/main/scala/com/cloudera/sparktest/LargeBlocks.scala) that, among other things, tests reading remote cache blocks. I ran this application with 2500MB blocks to test reading a cache block from disk. Without this change, with `--executor-memory 5g`, the test fails with `java.lang.OutOfMemoryError: Java heap space`. With the change, the test passes with `--executor-memory 2g`.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wypoon/spark SPARK-25905

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23058.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #23058
    
----
commit 2516ec61d9395a2ed36185affc4018a4a4f9b7ca
Author: Wing Yew Poon <wypoon@...>
Date:   2018-11-16T02:47:16Z

    [SPARK-25905][CORE] When getting a remote block, avoid forcing a conversion to a ChunkedByteBuffer
    
    In BlockManager, getRemoteValues gets a ChunkedByteBuffer (by calling
    getRemoteBytes) and creates an InputStream from it. getRemoteBytes, in
    turn, gets a ManagedBuffer and converts it to a ChunkedByteBuffer.
    Instead, expose a getRemoteManagedBuffer method so getRemoteValues can
    just get this ManagedBuffer and use its InputStream.
    When reading a remote cache block from disk, this reduces heap memory
    usage significantly.
    Retain getRemoteBytes for other callers.

----

