[
https://issues.apache.org/jira/browse/SPARK-22229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752451#comment-16752451
]
Yuval Degani commented on SPARK-22229:
--------------------------------------
Great questions, [~tgraves]
* SparkRDMA (starting version 2.0) supports ODP (On-Demand Paging) for RDMA
buffers, meaning that it can handle memory buffers that are not necessarily
pinned to physical memory. This allows SparkRDMA buffers and mapped shuffle
files to be swapped out and thus occupy more space than can fit in memory.
Further reading:
[https://community.mellanox.com/s/article/understanding-on-demand-paging--odp-x]
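To illustrate what ODP buys here, below is a toy Python model (not real verbs code, and not SparkRDMA's implementation) contrasting conventional pinned registration with on-demand paging: with pinning the whole buffer must fit in physical memory at registration time, while with ODP registration always succeeds and pages are faulted in on first access, so registered buffers can exceed physical memory, as with mapped shuffle files.

```python
# Toy model of pinned RDMA registration vs. On-Demand Paging (ODP).
# All names and numbers are illustrative, not part of SparkRDMA.

PHYS_MEM_PAGES = 4  # pretend the machine has 4 physical pages

def register_pinned(num_pages):
    """Conventional registration: the whole buffer is pinned up front."""
    if num_pages > PHYS_MEM_PAGES:
        raise MemoryError("cannot pin more pages than physical memory")
    return {"pinned": True, "resident": set(range(num_pages))}

def register_odp(num_pages):
    """ODP registration: always succeeds; nothing is resident yet."""
    return {"pinned": False, "size": num_pages, "resident": set()}

def access(region, page):
    """ODP: a simulated page fault brings a page in on first access."""
    if not region["pinned"] and page not in region["resident"]:
        region["resident"].add(page)  # page faulted in on demand
    return page in region["resident"]
```

In this model, `register_pinned(8)` fails on a 4-page machine, while `register_odp(8)` succeeds and only the pages actually touched become resident.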
* Jobs will, of course, perform better if they can fit in memory - this is
true for SparkRDMA and Spark in general. [~prudenko] can also share some
results with memory over-subscription that show further value for SparkRDMA.
* SparkRDMA works seamlessly on both InfiniBand and Ethernet fabrics, though
InfiniBand delivers better performance than Ethernet. One example is our
joint work with Microsoft Azure on their InfiniBand HPC clusters:
[https://databricks.com/session/accelerated-spark-on-azure-seamless-and-scalable-hardware-offloads-in-the-cloud]
* SparkRDMA is considered GA and production ready. It has been under
continuous development since 2016 while integrating with various customers with
a variety of workload patterns and sizes.
* Re redundancy of MapStatuses - SparkRDMA offers an alternative protocol for
obtaining MapStatuses and translating them into remote memory addresses, also
over RDMA. The driver collects an RDMA-readable table of remote memory
addresses per mapper (each mapper holds another table mapping reduceIds to
the memory addresses that contain the shuffle data for that reduceId).
SparkRDMA uses RDMA-Read to obtain information from these tables, which
removes significant overhead from the driver and reduces its role as a
bottleneck. It also reduces overhead on executors, since they use RDMA-Read
to obtain translations instead of costly RPCs. The SparkRDMA translation
protocol is fully compliant with Spark's recovery mechanisms for crashed
tasks/executors.
* True, SparkRDMA does not support the external shuffle service at this time,
although that is planned for the next version. And yes, that means dynamic
allocation is not yet supported either.
* If an executor crashes, its shuffle files remain on disk, though they lose
their RDMA mapping. SparkRDMA does not recover them as of now, but rather
requires rerunning the map tasks. I believe, however, that this is also the
case for the traditional shuffle engine in Spark (unless it was changed
recently).
* Re off-heap memory - yes, the user will be required to provide an adequate
amount of memory to the JVM so that it can contain the memory that's needed.
As far as I have seen so far, this usually requires changes to YARN configs
only and not to Spark itself. [~prudenko], please correct me if I'm wrong.
> SPIP: RDMA Accelerated Shuffle Engine
> -------------------------------------
>
> Key: SPARK-22229
> URL: https://issues.apache.org/jira/browse/SPARK-22229
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0, 2.4.0, 3.0.0
> Reporter: Yuval Degani
> Priority: Major
> Attachments:
> SPARK-22229_SPIP_RDMA_Accelerated_Shuffle_Engine_Rev_1.0.pdf
>
>
> An RDMA-accelerated shuffle engine can provide enormous performance benefits
> to shuffle-intensive Spark jobs, as demonstrated in the “SparkRDMA” plugin
> open-source project ([https://github.com/Mellanox/SparkRDMA]).
> Using RDMA for shuffle improves CPU utilization significantly and reduces I/O
> processing overhead by bypassing the kernel and networking stack, as well as
> avoiding memory copies entirely. Those valuable CPU cycles are then consumed
> directly by the actual Spark workloads, helping to reduce job runtime
> significantly.
> This performance gain is demonstrated both with the industry-standard HiBench
> TeraSort benchmark (a 1.5x speedup in sorting) and with shuffle-intensive
> customer applications.
> SparkRDMA will be presented at Spark Summit 2017 in Dublin
> ([https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/]).
> Please see attached proposal document for more information.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)