GitHub user squito opened a pull request:
https://github.com/apache/spark/pull/21440
[SPARK-24307][CORE] Support reading remote cached partitions > 2gb
(1) Netty's ByteBuf cannot hold more than 2 GB of data. To transfer the
contents of a ChunkedByteBuffer over the network, we therefore use a
custom FileRegion implementation that is backed by the ChunkedByteBuffer.
(2) On the receiving end, we need to expose all the data in a
FileSegmentManagedBuffer as a ChunkedByteBuffer. We do that by
memory-mapping the entire file in chunks.
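
To illustrate point (2): a single FileChannel.map call cannot map more
than Integer.MAX_VALUE bytes, so a larger file segment has to be mapped
as a sequence of smaller buffers. The following is a minimal sketch of
that idea, not the code from this patch. The class name
ChunkedFileMapper, the method mapInChunks, and the maxChunkSize
parameter are hypothetical names chosen for the example.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not part of Spark).
final class ChunkedFileMapper {
    private ChunkedFileMapper() {}

    /**
     * Maps the byte range [offset, offset + length) of a file read-only,
     * as a list of buffers of at most maxChunkSize bytes each.
     */
    static List<MappedByteBuffer> mapInChunks(
            Path file, long offset, long length, int maxChunkSize) throws IOException {
        if (offset < 0 || length < 0 || maxChunkSize <= 0) {
            throw new IllegalArgumentException("offset/length must be >= 0, maxChunkSize > 0");
        }
        List<MappedByteBuffer> chunks = new ArrayList<>();
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long position = offset;
            long remaining = length;
            while (remaining > 0) {
                // Each mapping stays within the int-sized limit of a ByteBuffer.
                long chunkSize = Math.min(remaining, (long) maxChunkSize);
                chunks.add(channel.map(FileChannel.MapMode.READ_ONLY, position, chunkSize));
                position += chunkSize;
                remaining -= chunkSize;
            }
        }
        // A mapping remains valid after the channel that created it is closed.
        return chunks;
    }
}
```

The resulting list of buffers plays the role of the chunks inside a
ChunkedByteBuffer: no single buffer exceeds the per-buffer limit, while
the total can be arbitrarily large.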
Added unit tests, and ran the randomized test a couple of hundred times
on my laptop. The tests cover the equivalent of SPARK-24107 for
ChunkedByteBufferFileRegion. Also tested on a cluster with remote cache
reads > 2 GB (both in memory and on disk).
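
As a rough illustration of what a randomized test for a chunk-backed
region has to exercise: the target channel may accept only part of each
write, so the region must track its progress with a long counter and
resume across chunk boundaries. The sketch below is an assumption about
the general technique, not the ChunkedByteBufferFileRegion from this
patch. ChunkedRegion, ioChunkSize, and the method names are hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.util.List;

// Hypothetical sketch of a region backed by several ByteBuffers.
final class ChunkedRegion {
    private final ByteBuffer[] chunks;
    private final long size;        // total bytes; may exceed Integer.MAX_VALUE
    private final int ioChunkSize;  // max bytes offered to the channel per write
    private long transferred = 0;
    private int current = 0;

    ChunkedRegion(List<ByteBuffer> source, int ioChunkSize) {
        if (ioChunkSize <= 0) {
            throw new IllegalArgumentException("ioChunkSize must be > 0");
        }
        this.ioChunkSize = ioChunkSize;
        this.chunks = new ByteBuffer[source.size()];
        long total = 0;
        for (int i = 0; i < chunks.length; i++) {
            // Duplicate so the caller's positions and limits are never modified.
            chunks[i] = source.get(i).duplicate();
            total += chunks[i].remaining();
        }
        this.size = total;
    }

    long count() { return size; }

    long transferred() { return transferred; }

    /** Writes as much as the target accepts; position must equal transferred(). */
    long transferTo(WritableByteChannel target, long position) throws IOException {
        if (position != transferred) {
            throw new IllegalArgumentException("non-sequential transfer");
        }
        long written = 0;
        while (current < chunks.length) {
            ByteBuffer chunk = chunks[current];
            if (!chunk.hasRemaining()) {
                current++;
                continue;
            }
            int originalLimit = chunk.limit();
            int toWrite = Math.min(chunk.remaining(), ioChunkSize);
            chunk.limit(chunk.position() + toWrite);
            int n;
            try {
                n = target.write(chunk);
            } finally {
                // Restore the limit so the rest of the chunk stays visible.
                chunk.limit(originalLimit);
            }
            written += n;
            if (n < toWrite) {
                break; // target accepted a partial write; the caller retries later
            }
        }
        transferred += written;
        return written;
    }
}
```

A randomized test can then drive transferTo in a loop against a channel
that accepts a random number of bytes per call, and compare the bytes
received with the original data.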
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/squito/spark chunked_bb_file_region
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21440.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21440
----
commit 4373e27c2ec96b77a2311f5c5997ae5ca84bf6c5
Author: Imran Rashid <irashid@...>
Date: 2018-05-23T03:59:40Z
[SPARK-24307][CORE] Support reading remote cached partitions > 2gb
(1) Netty's ByteBuf cannot support data > 2gb. So to transfer data from a
ChunkedByteBuffer over the network, we use a custom version of
FileRegion which is backed by the ChunkedByteBuffer.
(2) On the receiving end, we need to expose all the data in a
FileSegmentManagedBuffer as a ChunkedByteBuffer. We do that by memory
mapping the entire file in chunks.
Added unit tests. Also tested on a cluster with remote cache reads >
2gb (in memory and on disk).
----