HDFS.jl is separate from DArray; it is a map-reduce framework for working with HDFS. I just mentioned it since you asked about options for working with distributed data.
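For reference, a minimal sketch of the map-reduce idea using only Julia's built-in @parallel reduction (the 4-worker count and the toy sum are just placeholders); this is not HDFS.jl's actual API, only the general pattern of mapping work over chunks on the workers and reducing the partial results on the master:

    addprocs(4)                     # say, one worker per local core

    # "map": each worker computes squares for its share of the range;
    # "reduce": (+) folds the per-worker partial sums back on the master.
    total = @parallel (+) for i = 1:1000000
        i^2
    end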
On Wed, Dec 18, 2013 at 12:50 PM, David C Cohen <[email protected]> wrote:

> On Wednesday, December 18, 2013 6:16:54 AM UTC, Amit Murthy wrote:
>>
>> DArrays are horizontally scalable in the sense that each worker holds its
>> part of the DArray, and the workers can exist across different machines.
>> There is support for HDFS via https://github.com/tanmaykm/HDFS.jl that
>> you can check out.
>>
>
> Does HDFS.jl work with DArray, or is it a replacement for DArray?
>
>>
>> Out of the box, Julia has a variety of parallel processing constructs, as
>> documented here - http://docs.julialang.org/en/latest/manual/parallel-computing/ .
>> Some of the currently available external packages - full list here -
>> http://docs.julialang.org/en/latest/packages/packagelist/ - are useful
>> for distributed data.
>>
>> On Wed, Dec 18, 2013 at 11:31 AM, David C Cohen <[email protected]> wrote:
>>
>>> On Wednesday, December 18, 2013 4:34:48 AM UTC, Amit Murthy wrote:
>>>>
>>>> distribute creates a new DArray from a regular array. It does this by
>>>> allocating parts of the regular array on each of the workers.
>>>> "remotecall_fetch(owner, ()->fetch(rr)[I...])" is the regular init
>>>> function passed to the DArray constructor. On each worker, it pulls in
>>>> its part of the regular array.
>>>>
>>>> While the DArray itself is created in parallel, 1000 workers is a lot;
>>>> typically one would expect 1 worker per CPU core. Of course, if you do
>>>> have access to 1000 cores, it will be a good idea to try it out, though
>>>> I suspect we may see other issues too as a result of it.
>>>>
>>>
>>> I assume this means that DArrays are not horizontally scalable. What
>>> other solutions are out there for Julia when working with big data? Is
>>> Julia itself considered a complete system when working with distributed
>>> data, or do people use Julia on top of other frameworks?
>>>
>>>>
>>>> On Tue, Dec 17, 2013 at 5:13 PM, David C Cohen <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> According to this
>>>>> <https://github.com/JuliaLang/julia/blob/b4fa86124dd1cb298373c3bef3f98c060cbb19b8/base/darray.jl#L160-L167>,
>>>>> distribute is defined as:
>>>>>
>>>>> function distribute(a::AbstractArray)
>>>>>     owner = myid()
>>>>>     rr = RemoteRef()
>>>>>     put(rr, a)
>>>>>     DArray(size(a)) do I
>>>>>         remotecall_fetch(owner, ()->fetch(rr)[I...])
>>>>>     end
>>>>> end
>>>>>
>>>>> I'm trying to find out how this sends data to the workers.
>>>>>
>>>>> - Here rr is the remote reference to the local machine. Then
>>>>>   put(rr, a) sends array a to the local machine. That doesn't make
>>>>>   sense.
>>>>> - When, in whatever way that doesn't make sense to me, data of array a
>>>>>   is sent to workers, does the network IO happen in parallel, or in
>>>>>   series?
>>>>> - If we have 1000 workers, parallel data sending means network
>>>>>   overload, and serial data sending means taking a long time. What's a
>>>>>   good way of working with larger distributed arrays, or distributing
>>>>>   an array over many, many workers?
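To make the mechanism in the quoted distribute source concrete, here is a minimal sketch that mirrors it (the worker count and array size are arbitrary examples; the API names are the ones used in that source):

    addprocs(2)                # two workers besides the master

    a  = rand(1000)            # ordinary array, exists only on the master
    owner = myid()             # the master's process id
    rr = RemoteRef()           # a reference owned by the master
    put(rr, a)                 # stash `a` under that reference - no network IO yet

    # The DArray constructor picks an index range I for each worker and runs
    # this closure *on that worker*. Each worker therefore calls back to the
    # owner, which evaluates fetch(rr)[I...] locally and ships only that
    # slice over the wire.
    d = DArray(size(a)) do I
        remotecall_fetch(owner, ()->fetch(rr)[I...])
    end

So put(rr, a) does not send the array anywhere; it just registers it on the master so the workers can pull their pieces later. How much those pulls overlap in time depends on how the constructor schedules the per-worker calls, but either way each chunk of a is shipped from the master exactly once, so with very many workers the master's outbound network link is the likely bottleneck.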
