Hi Peter, we're using a part of Crail - its core library, called DiSNI ( https://github.com/zrlio/disni/). We couldn't reproduce the results from that blog post. In any case, Crail is more of a platform-level approach (it comes with its own file system), while SparkRDMA is a pluggable approach - it's just a plugin that you can enable or disable for a particular workload, you can use any Hadoop vendor, etc.
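For reference, enabling or disabling it per workload comes down to a few Spark properties - a minimal sketch only (the jar path is a placeholder, and the exact shuffle-manager class name should be double-checked against the SparkRDMA README):

  # spark-defaults.conf, or pass the same via --conf on spark-submit
  spark.driver.extraClassPath    /path/to/spark-rdma.jar
  spark.executor.extraClassPath  /path/to/spark-rdma.jar
  # RDMA shuffle manager class - verify the name in the SparkRDMA README
  spark.shuffle.manager          org.apache.spark.shuffle.rdma.RdmaShuffleManager

Drop those properties and the job falls back to Spark's default sort-based shuffle, which is what makes it easy to compare a particular workload with and without RDMA.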
The best optimization for shuffle between local JVMs would be something like HDFS short-circuit local reads ( https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html), i.e. using a Unix domain socket for local communication, or directly reading the relevant part of the other JVM's shuffle file. But yes, it's not available in Spark out of the box. (A minimal config sketch for short-circuit reads is appended after the quoted thread below.)

Thanks,
Peter Rudenko

Fri, 19 Oct 2018 at 16:54, Peter Liu <peter.p...@gmail.com> wrote:

> Hi Peter,
>
> thank you for the reply and the detailed information! Would this be
> something comparable to Crail? (
> http://crail.incubator.apache.org/blog/2017/11/rdmashuffle.html)
> I was more looking for something simple/quick to make the shuffle between
> the local JVMs faster (like the idea of using a local RAM disk) for my
> simple use case.
>
> Of course, a general and thorough implementation should cover the shuffle
> between nodes as the major focus. Hmm, it looks like there is no such
> implementation within Spark itself yet.
>
> Very much appreciated!
>
> Peter
>
> On Fri, Oct 19, 2018 at 9:38 AM Peter Rudenko <petro.rude...@gmail.com>
> wrote:
>
>> Hey Peter, in the SparkRDMA shuffle plugin (
>> https://github.com/Mellanox/SparkRDMA) we mmap the shuffle file to do
>> Remote Direct Memory Access. If the shuffle data is bigger than RAM,
>> Mellanox NICs support On-Demand Paging, where the OS invalidates
>> translations which are no longer valid due to either non-present pages or
>> mapping changes. So if you have an RDMA-capable NIC (or you can try it on
>> the Azure cloud:
>> https://azure.microsoft.com/en-us/blog/introducing-the-new-hb-and-hc-azure-vm-sizes-for-hpc/
>> ), have a try. For network-intensive apps you should get better
>> performance.
>>
>> Thanks,
>> Peter Rudenko
>>
>> Thu, 18 Oct 2018 at 18:07, Peter Liu <peter.p...@gmail.com> wrote:
>>
>>> I would be very interested in the initial question here:
>>>
>>> Is there a production-level implementation of memory-only shuffle,
>>> configurable similarly to the MEMORY_ONLY and MEMORY_AND_DISK storage
>>> levels, as mentioned in this ticket:
>>> https://github.com/apache/spark/pull/5403 ?
>>>
>>> It would be a quite practical and useful option/feature. Not sure what
>>> the status of that ticket's implementation is?
>>>
>>> Thanks!
>>>
>>> Peter
>>>
>>> On Thu, Oct 18, 2018 at 6:51 AM ☼ R Nair <ravishankar.n...@gmail.com>
>>> wrote:
>>>
>>>> Thanks.. great info. Will try and let all know.
>>>>
>>>> Best
>>>>
>>>> On Thu, Oct 18, 2018, 3:12 AM onmstester onmstester <
>>>> onmstes...@zoho.com> wrote:
>>>>
>>>>> Create the ramdisk:
>>>>> mount tmpfs /mnt/spark -t tmpfs -o size=2G
>>>>>
>>>>> Then point spark.local.dir to the ramdisk; how depends on your
>>>>> deployment strategy - for me it was through the SparkConf object before
>>>>> passing it to SparkContext:
>>>>> conf.set("spark.local.dir","/mnt/spark")
>>>>>
>>>>> To validate that Spark is actually using your ramdisk (by default it
>>>>> uses /tmp), ls the ramdisk after running some jobs and you should see
>>>>> Spark directories (with the date in the directory name) on your ramdisk.
>>>>>
>>>>> ---- On Wed, 17 Oct 2018 18:57:14 +0330 ☼ R Nair
>>>>> <ravishankar.n...@gmail.com> wrote ----
>>>>>
>>>>> What are the steps to configure this?
>>>>> Thanks
>>>>>
>>>>> On Wed, Oct 17, 2018, 9:39 AM onmstester onmstester <
>>>>> onmstes...@zoho.com.invalid> wrote:
>>>>>
>>>>> Hi,
>>>>> I failed to configure Spark for in-memory shuffle, so currently I'm
>>>>> just using a Linux memory-mapped directory (tmpfs) as Spark's working
>>>>> directory, and everything is fast.
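As a footnote to the short-circuit suggestion at the top of this message, a minimal hdfs-site.xml sketch based on the Hadoop page linked there (it goes on both the DataNode and the client, requires the libhadoop native library, and the socket path below is only an example):

  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.domain.socket.path</name>
    <value>/var/lib/hadoop-hdfs/dn_socket</value>
  </property>

Note this only short-circuits HDFS reads on the same node - as said above, Spark's own shuffle files don't get anything comparable out of the box.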
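And for completeness, onmstester's tmpfs steps rolled into one small runnable snippet - a sketch only, assuming the ramdisk is already mounted at /mnt/spark as shown above and that you run in local/standalone mode, where spark.local.dir from SparkConf is honored:

  import org.apache.spark.{SparkConf, SparkContext}

  object TmpfsShuffleCheck {
    def main(args: Array[String]): Unit = {
      // Assumes the ramdisk exists, e.g.: mount tmpfs /mnt/spark -t tmpfs -o size=2G
      val conf = new SparkConf()
        .setMaster("local[2]")                 // quick local check
        .setAppName("tmpfs-shuffle-check")
        .set("spark.local.dir", "/mnt/spark")  // shuffle/spill files land here instead of /tmp
      val sc = new SparkContext(conf)

      // A tiny job with a shuffle stage (groupByKey), so Spark's scratch
      // directories appear under /mnt/spark and can be checked with `ls`.
      sc.parallelize(1 to 100000)
        .map(i => (i % 10, 1))
        .groupByKey()
        .mapValues(_.size)
        .collect()
        .foreach(println)

      sc.stop()
    }
  }

After the job runs, an `ls /mnt/spark` should show the Spark scratch directories mentioned above.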