GitHub user squito opened a pull request:
https://github.com/apache/spark/pull/21451
[SPARK-24296][CORE][WIP] Replicate large blocks as a stream.
When replicating large cached RDD blocks, it can be helpful to replicate
them as a stream, to avoid using large amounts of memory during the
transfer. This also allows blocks larger than 2GB to be replicated.
Added unit tests in DistributedSuite. Also ran tests on a cluster for
blocks > 2GB.
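As a rough illustration of the idea (not the actual Spark code; streamBlock
is a hypothetical helper), streaming keeps only one fixed-size chunk in
memory at a time, so memory use no longer scales with block size:

    import java.io.{InputStream, OutputStream}

    // Copy a block to a peer in fixed-size chunks; only `chunkSize`
    // bytes are buffered at any moment, regardless of block size.
    def streamBlock(in: InputStream, out: OutputStream,
                    chunkSize: Int = 64 * 1024): Long = {
      val buf = new Array[Byte](chunkSize)
      var total = 0L
      var n = in.read(buf)
      while (n != -1) {
        out.write(buf, 0, n)
        total += n
        n = in.read(buf)
      }
      total
    }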
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/squito/spark clean_replication
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21451.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21451
----
commit 05967808f5440835919f02d6c5d0d3563482d304
Author: Imran Rashid <irashid@...>
Date: 2018-05-02T14:55:15Z
[SPARK-6237][NETWORK] Network-layer changes to allow stream upload.
These changes allow an RpcHandler to receive an upload as a stream of
data, without having to buffer the entire message in the FrameDecoder.
The primary use case is for replicating large blocks.
Added unit tests.
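To sketch the shape of such a handler in Scala (illustrative only;
UploadCallback and CountingSink are hypothetical names, not Spark's
actual network API), the receiver is handed each frame as it arrives
rather than one fully buffered message:

    import java.nio.ByteBuffer

    // Callback interface: invoked once per incoming chunk.
    trait UploadCallback {
      def onData(streamId: String, buf: ByteBuffer): Unit
      def onComplete(streamId: String): Unit
      def onFailure(streamId: String, cause: Throwable): Unit
    }

    // Toy sink that just counts bytes; a real handler would hand each
    // chunk to storage as it arrives instead of accumulating it.
    class CountingSink extends UploadCallback {
      private var total = 0L
      override def onData(streamId: String, buf: ByteBuffer): Unit =
        total += buf.remaining()
      override def onComplete(streamId: String): Unit =
        println(s"stream $streamId complete: $total bytes")
      override def onFailure(streamId: String, cause: Throwable): Unit =
        cause.printStackTrace()
    }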
commit 43658df6d6b7dffacd528a2573e8846ab6469e81
Author: Imran Rashid <irashid@...>
Date: 2018-05-23T03:59:40Z
[SPARK-24307][CORE] Support reading remote cached partitions > 2gb
(1) Netty's ByteBuf cannot support data > 2GB, so to transfer data from a
ChunkedByteBuffer over the network, we use a custom version of
FileRegion which is backed by the ChunkedByteBuffer.
(2) On the receiving end, we need to expose all the data in a
FileSegmentManagedBuffer as a ChunkedByteBuffer. We do that by
memory-mapping the entire file in chunks.
Added unit tests. Also tested on a cluster with remote cache reads >
2GB (in memory and on disk).
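The chunked mapping in (2) is needed because a single java.nio
MappedByteBuffer is capped at Integer.MAX_VALUE bytes. A minimal Scala
sketch of the idea (mapFileInChunks is a hypothetical helper, not
Spark's ChunkedByteBuffer API):

    import java.io.File
    import java.nio.MappedByteBuffer
    import java.nio.channels.FileChannel
    import java.nio.file.StandardOpenOption

    // Map a file as a sequence of read-only regions, each at most
    // `chunkSize` bytes, so files larger than 2GB can be exposed as
    // multiple buffers.
    def mapFileInChunks(file: File,
                        chunkSize: Long = Integer.MAX_VALUE.toLong)
        : Seq[MappedByteBuffer] = {
      val channel = FileChannel.open(file.toPath, StandardOpenOption.READ)
      try {
        val length = channel.size()
        (0L until length by chunkSize).map { pos =>
          val size = math.min(chunkSize, length - pos)
          channel.map(FileChannel.MapMode.READ_ONLY, pos, size)
        }
      } finally {
        channel.close()  // mappings stay valid after the channel closes
      }
    }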
commit 7e517e4ea0ff66dc57121b54fdd71f8391edd8f2
Author: Imran Rashid <irashid@...>
Date: 2018-05-15T16:48:51Z
[SPARK-24296][CORE] Replicate large blocks as a stream.
When replicating large cached RDD blocks, it can be helpful to replicate
them as a stream, to avoid using large amounts of memory during the
transfer. This also allows blocks larger than 2GB to be replicated.
Added unit tests in DistributedSuite. Also ran tests on a cluster for
blocks > 2GB.
----