Github user ankurdave commented on the pull request:
https://github.com/apache/spark/pull/2679#issuecomment-58939148
@uncleGen We implemented in-memory shuffle for benchmarking in
5ec645d945783457baed9e151337b2735c1b307f around the time of Spark 0.9.1, but we
didn't merge it back, and Spark's shuffle code has since changed substantially,
so the patch no longer applies. It simply stores shuffle output blocks in the
BlockManager at an in-memory storage level rather than writing them to disk. I
[ported it forward](https://github.com/ankurdave/spark/compare/in-memory-shuffle-3)
a while ago, but it has again fallen a bit out of date.
The main problem with merging it for general use is that it doesn't handle
the memory-constrained case: shuffle output can cause blocks of important RDDs
to be dropped. To avoid this, it requires the user to explicitly clear the
shuffle outputs from memory, such as after each iteration in Pregel, though
this interferes with fault recovery.
It might be possible to solve this by introducing block-level priorities in
the block manager, preventing it from dropping RDD blocks to make room for
shuffle output, and perhaps by asynchronously spilling the shuffle outputs to
disk.
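To make the priority idea concrete, here is a minimal sketch of a block store
where every block carries a priority and eviction only ever removes blocks of
strictly lower priority than the incoming one, so cached RDD blocks can never
be pushed out by shuffle output. All names here (`PriorityBlockStore`,
`putBlock`, the priority levels) are hypothetical and are not Spark's actual
BlockManager API:

```scala
import scala.collection.mutable

// Hypothetical priority levels: shuffle output is evictable, RDD blocks are not.
object Priority {
  val Shuffle = 0 // shuffle output blocks, safe to drop under memory pressure
  val Rdd     = 1 // cached RDD blocks, protected from shuffle pressure
}

// Toy store: evicts oldest blocks of strictly lower priority until the new
// block fits; refuses the put if only equal- or higher-priority blocks remain.
class PriorityBlockStore(capacityBytes: Long) {
  private case class Entry(size: Long, priority: Int)
  private val blocks = mutable.LinkedHashMap[String, Entry]() // insertion order
  private var used = 0L

  /** Try to store a block; returns false if it cannot fit without
    * evicting an equal- or higher-priority block. */
  def putBlock(id: String, size: Long, priority: Int): Boolean = {
    while (used + size > capacityBytes) {
      // Oldest block with strictly lower priority is the eviction victim.
      val victim = blocks.collectFirst {
        case (vid, e) if e.priority < priority => vid
      }
      victim match {
        case Some(vid) => used -= blocks.remove(vid).get.size
        case None      => return false // nothing droppable: refuse the put
      }
    }
    blocks(id) = Entry(size, priority)
    used += size
    true
  }

  def contains(id: String): Boolean = blocks.contains(id)
}
```

Under this scheme an RDD block may evict shuffle output, but a shuffle block
that doesn't fit is simply rejected, which is the point where asynchronous
spilling to disk would kick in instead.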