[
https://issues.apache.org/jira/browse/SPARK-22229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752451#comment-16752451
]
Yuval Degani commented on SPARK-22229:
--------------------------------------
Great questions, [~tgraves]
* SparkRDMA (starting version 2.0) supports ODP (On-Demand Paging) for RDMA
buffers, meaning that it can handle memory buffers that are not necessarily
pinned to physical memory. This allows SparkRDMA buffers and mapped shuffle
files to be swapped out and thus occupy more space than can fit in memory.
Further reading:
[https://community.mellanox.com/s/article/understanding-on-demand-paging--odp-x]
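To illustrate what ODP buys here, below is a toy Python model (not real verbs code, and not SparkRDMA's implementation) contrasting conventional pinned registration with on-demand paging: with pinning the whole buffer must fit in physical memory at registration time, while with ODP registration always succeeds and pages are faulted in on first access, so registered buffers can exceed physical memory, as with mapped shuffle files.

```python
# Toy model of pinned RDMA registration vs. On-Demand Paging (ODP).
# All names and numbers are illustrative, not part of SparkRDMA.

PHYS_MEM_PAGES = 4  # pretend the machine has 4 physical pages

def register_pinned(num_pages):
    """Conventional registration: the whole buffer is pinned up front."""
    if num_pages > PHYS_MEM_PAGES:
        raise MemoryError("cannot pin more pages than physical memory")
    return {"pinned": True, "resident": set(range(num_pages))}

def register_odp(num_pages):
    """ODP registration: always succeeds; nothing is resident yet."""
    return {"pinned": False, "size": num_pages, "resident": set()}

def access(region, page):
    """ODP: a simulated page fault brings a page in on first access."""
    if not region["pinned"] and page not in region["resident"]:
        region["resident"].add(page)  # page faulted in on demand
    return page in region["resident"]
```

In this model, `register_pinned(8)` fails on a 4-page machine, while `register_odp(8)` succeeds and only the pages actually touched become resident.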
* Jobs will, of course, perform better if they can fit in memory - this is
true for SparkRDMA and Spark in general. [~prudenko] can also share some
results with memory over-subscription that show further value for SparkRDMA.
* SparkRDMA works seamlessly on both InfiniBand and Ethernet fabrics, though
InfiniBand delivers better performance than Ethernet. One example is our
joint work with Microsoft Azure on their InfiniBand HPC clusters:
[https://databricks.com/session/accelerated-spark-on-azure-seamless-and-scalable-hardware-offloads-in-the-cloud]
* SparkRDMA is considered GA and production ready. It has been under
continuous development since 2016 while integrating with various customers with
a variety of workload patterns and sizes.
* Re redundancy of MapStatuses - SparkRDMA offers an alternative protocol for
obtaining MapStatuses and translating them into remote memory addresses, also
over RDMA. The driver collects an RDMA-readable table of remote memory
addresses per mapper (each mapper holds another table mapping reduceIds to
the memory addresses that contain the shuffle data for that reduceId).
SparkRDMA uses RDMA-Read to obtain information from these tables, which
removes significant overhead from the driver and reduces its role as a
bottleneck. It also reduces overhead on executors, since they use RDMA-Read
to obtain translations instead of costly RPCs. The SparkRDMA translation
protocol is fully compliant with Spark's recovery mechanisms for crashed
tasks/executors.
* True, SparkRDMA does not support the external shuffle service at this time,
although that is planned for the next version. And yes, that means dynamic
allocation is not yet supported either.
* If an executor crashes, its shuffle files remain on disk, though they lose
their RDMA mapping. SparkRDMA does not recover them as of now, but rather
requires rerunning the map tasks. I believe, however, that this is also the
case for the traditional shuffle engine in Spark (unless it was changed
recently).
* Re off-heap memory - yes, the user will be required to provide an adequate
amount of memory to the JVM so that it can contain the memory that's needed.
As far as I have seen so far, this usually requires changes to YARN configs
only and not to Spark itself. [~prudenko], please correct me if I'm wrong.
> SPIP: RDMA Accelerated Shuffle Engine
> -------------------------------------
>
> Key: SPARK-22229
> URL: https://issues.apache.org/jira/browse/SPARK-22229
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0, 2.4.0, 3.0.0
> Reporter: Yuval Degani
> Priority: Major
> Attachments:
> SPARK-22229_SPIP_RDMA_Accelerated_Shuffle_Engine_Rev_1.0.pdf
>
>
> An RDMA-accelerated shuffle engine can provide enormous performance benefits
> to shuffle-intensive Spark jobs, as demonstrated in the “SparkRDMA” plugin
> open-source project ([https://github.com/Mellanox/SparkRDMA]).
> Using RDMA for shuffle improves CPU utilization significantly and reduces I/O
> processing overhead by bypassing the kernel and networking stack, as well as
> avoiding memory copies entirely. Those valuable CPU cycles are then consumed
> directly by the actual Spark workloads, helping to reduce job runtime
> significantly.
> This performance gain is demonstrated both with the industry-standard HiBench
> TeraSort benchmark (a 1.5x speedup in sorting) and with shuffle-intensive
> customer applications.
> SparkRDMA will be presented at Spark Summit 2017 in Dublin
> ([https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/]).
> Please see attached proposal document for more information.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)