GitHub user jerryshao opened a pull request:
https://github.com/apache/spark/pull/19476
[SPARK-22062][CORE] Spill large block to disk in BlockManager's remote fetch to avoid OOM
## What changes were proposed in this pull request?
In the current `BlockManager#getRemoteBytes`, Spark calls
`BlockTransferService#fetchBlockSync` to fetch the remote block. In
`fetchBlockSync`, Spark allocates a temporary `ByteBuffer` to hold the
entire fetched block. This can lead to OOM when the block is very large or
when several blocks are fetched simultaneously on the same executor.
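For illustration, here is a minimal Scala sketch of the pattern at issue (not Spark's actual code): a synchronous fetch that materializes the whole block in one on-heap buffer.

```scala
import java.nio.ByteBuffer

object FetchSketch {
  // Hypothetical synchronous fetch: the entire remote block is buffered
  // in memory before it is handed back to the caller.
  def fetchBlockSyncSketch(blockSize: Int, read: ByteBuffer => Unit): ByteBuffer = {
    // One allocation for the whole block: a multi-GB block, or many
    // concurrent fetches, can exhaust the executor's heap right here.
    val buf = ByteBuffer.allocate(blockSize)
    read(buf)   // fill the buffer from the network
    buf.flip()  // prepare the buffer for reading by the caller
    buf
  }
}
```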
So, borrowing the idea from shuffle fetch, this change spills large blocks
to local disk before they are consumed by upstream code. The behavior is
controlled by a newly added configuration: if the block size is below the
threshold, the block is kept in memory; otherwise it is first spilled to
disk and then read back from the disk file.
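A hedged sketch of that threshold behavior follows; the threshold value, helper functions, and return shape are illustrative assumptions, not the PR's exact API.

```scala
import java.io.File
import java.nio.ByteBuffer

object SpillSketch {
  // Assumed threshold; in practice this would come from the new configuration.
  val fetchToMemThreshold: Long = 200L * 1024 * 1024

  def fetchRemoteBlock(blockSize: Long,
                       fetchToBuffer: () => ByteBuffer,
                       fetchToFile: File => Unit): Either[File, ByteBuffer] = {
    if (blockSize < fetchToMemThreshold) {
      Right(fetchToBuffer())   // small block: keep it in memory
    } else {
      val tmp = File.createTempFile("remote-block", ".tmp")
      fetchToFile(tmp)         // large block: stream it to disk first
      Left(tmp)                // caller reads it back from the file
    }
  }
}
```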
To achieve this, the patch:
1. Renames `TempShuffleFileManager` to `TempFileManager`, since it is no
longer used only by shuffle.
2. Adds a new `TempFileManager` to manage the files holding fetched remote
blocks; the files are tracked by weak references and deleted once they are
no longer referenced (see the sketch below).
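Below is a hedged sketch of weak-reference-based file tracking, under the assumption that each temp file is keyed by a weak reference to the object using it; the class and method names are hypothetical, not the PR's `TempFileManager`.

```scala
import java.io.File
import java.lang.ref.{Reference, ReferenceQueue, WeakReference}
import scala.collection.mutable

class WeakFileTracker {
  private val queue = new ReferenceQueue[AnyRef]()
  private val files = mutable.Map.empty[Reference[_ <: AnyRef], File]

  // Associate a temp file with the object that currently uses it.
  def track(owner: AnyRef, file: File): Unit = synchronized {
    files(new WeakReference[AnyRef](owner, queue)) = file
  }

  // Delete files whose owners have been garbage collected: once an owner
  // is unreachable, its weak reference surfaces on the queue.
  def cleanUp(): Unit = synchronized {
    var ref = queue.poll()
    while (ref != null) {
      files.remove(ref).foreach(_.delete())
      ref = queue.poll()
    }
  }
}
```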
## How was this patch tested?
Tested by adding unit tests, plus manual verification in a local test that
forces GC to confirm the files are cleaned up.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jerryshao/apache-spark SPARK-22062
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19476.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19476
----
commit f50a7b75c303bd2cf261dfb1b4fe74fa5498ca4b
Author: jerryshao <[email protected]>
Date: 2017-10-12T01:47:35Z
Spill large blocks to disk during remote fetches in BlockManager
----