GitHub user squito opened a pull request:
https://github.com/apache/spark/pull/21451
[SPARK-24296][CORE][WIP] Replicate large blocks as a stream.
When replicating large cached RDD blocks, it can be helpful to replicate
them as a stream, to avoid using large amounts of memory during the
transfer. This also allows blocks larger than 2GB to be replicated.
Added unit tests in DistributedSuite. Also ran tests on a cluster for
blocks > 2GB.
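As a rough illustration of the idea (not the actual Spark code; streamBlock
is a hypothetical helper), streaming keeps only one fixed-size chunk in
memory at a time, so memory use no longer scales with block size:

    import java.io.{InputStream, OutputStream}

    // Copy a block to a peer in fixed-size chunks; only `chunkSize`
    // bytes are buffered at any moment, regardless of block size.
    def streamBlock(in: InputStream, out: OutputStream,
                    chunkSize: Int = 64 * 1024): Long = {
      val buf = new Array[Byte](chunkSize)
      var total = 0L
      var n = in.read(buf)
      while (n != -1) {
        out.write(buf, 0, n)
        total += n
        n = in.read(buf)
      }
      total
    }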
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/squito/spark clean_replication
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21451.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21451
----
commit 05967808f5440835919f02d6c5d0d3563482d304
Author: Imran Rashid <irashid@...>
Date: 2018-05-02T14:55:15Z
[SPARK-6237][NETWORK] Network-layer changes to allow stream upload.
These changes allow an RpcHandler to receive an upload as a stream of
data, without having to buffer the entire message in the FrameDecoder.
The primary use case is for replicating large blocks.
Added unit tests.
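To sketch the shape of such a handler in Scala (illustrative only;
UploadCallback and CountingSink are hypothetical names, not Spark's
actual network API), the receiver is handed each frame as it arrives
rather than one fully buffered message:

    import java.nio.ByteBuffer

    // Callback interface: invoked once per incoming chunk.
    trait UploadCallback {
      def onData(streamId: String, buf: ByteBuffer): Unit
      def onComplete(streamId: String): Unit
      def onFailure(streamId: String, cause: Throwable): Unit
    }

    // Toy sink that just counts bytes; a real handler would hand each
    // chunk to storage as it arrives instead of accumulating it.
    class CountingSink extends UploadCallback {
      private var total = 0L
      override def onData(streamId: String, buf: ByteBuffer): Unit =
        total += buf.remaining()
      override def onComplete(streamId: String): Unit =
        println(s"stream $streamId complete: $total bytes")
      override def onFailure(streamId: String, cause: Throwable): Unit =
        cause.printStackTrace()
    }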
commit 43658df6d6b7dffacd528a2573e8846ab6469e81
Author: Imran Rashid <irashid@...>
Date: 2018-05-23T03:59:40Z
[SPARK-24307][CORE] Support reading remote cached partitions > 2gb
(1) Netty's ByteBuf cannot support data > 2GB, so to transfer data from a
ChunkedByteBuffer over the network, we use a custom version of
FileRegion which is backed by the ChunkedByteBuffer.
(2) On the receiving end, we need to expose all the data in a
FileSegmentManagedBuffer as a ChunkedByteBuffer. We do that by
memory-mapping the entire file in chunks.
Added unit tests. Also tested on a cluster with remote cache reads >
2GB (in memory and on disk).
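The chunked mapping in (2) is needed because a single java.nio
MappedByteBuffer is capped at Integer.MAX_VALUE bytes. A minimal Scala
sketch of the idea (mapFileInChunks is a hypothetical helper, not
Spark's ChunkedByteBuffer API):

    import java.io.File
    import java.nio.MappedByteBuffer
    import java.nio.channels.FileChannel
    import java.nio.file.StandardOpenOption

    // Map a file as a sequence of read-only regions, each at most
    // `chunkSize` bytes, so files larger than 2GB can be exposed as
    // multiple buffers.
    def mapFileInChunks(file: File,
                        chunkSize: Long = Integer.MAX_VALUE.toLong)
        : Seq[MappedByteBuffer] = {
      val channel = FileChannel.open(file.toPath, StandardOpenOption.READ)
      try {
        val length = channel.size()
        (0L until length by chunkSize).map { pos =>
          val size = math.min(chunkSize, length - pos)
          channel.map(FileChannel.MapMode.READ_ONLY, pos, size)
        }
      } finally {
        channel.close()  // mappings stay valid after the channel closes
      }
    }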
commit 7e517e4ea0ff66dc57121b54fdd71f8391edd8f2
Author: Imran Rashid <irashid@...>
Date: 2018-05-15T16:48:51Z
[SPARK-24296][CORE] Replicate large blocks as a stream.
When replicating large cached RDD blocks, it can be helpful to replicate
them as a stream, to avoid using large amounts of memory during the
transfer. This also allows blocks larger than 2GB to be replicated.
Added unit tests in DistributedSuite. Also ran tests on a cluster for
blocks > 2GB.
----