On Wednesday, December 18, 2013 6:16:54 AM UTC, Amit Murthy wrote:
>
> DArrays are horizontally scalable in the sense that each worker holds its
> part of the DArray, and the workers can exist across different machines.
> There is support for HDFS via https://github.com/tanmaykm/HDFS.jl that
> you can check out.
>
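For illustration, a minimal sketch of the layout described above, with each worker holding only its own chunk. It assumes the Julia 0.2-era API (addprocs, dzeros, localpart, procs, as in base/darray.jl of that time):

    addprocs(4)               # 4 local workers; machine specs would put them on other hosts

    d = dzeros(1000, 1000)    # a DArray split across the 4 workers

    # Ask each worker for the size of the piece it holds locally.
    for w in procs(d)
        sz = remotecall_fetch(w, () -> size(localpart(d)))
        println("worker $w holds a chunk of size $sz")
    end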
Does HDFS.jl work with DArray, or is it a replacement for DArray?

> Out-of-the-box, Julia has a variety of parallel processing constructs, as
> documented here -
> http://docs.julialang.org/en/latest/manual/parallel-computing/ . Some of
> the currently available external packages - full list here:
> http://docs.julialang.org/en/latest/packages/packagelist/ - are useful
> for distributed data.
>
> On Wed, Dec 18, 2013 at 11:31 AM, David C Cohen <[email protected]> wrote:
>
>> On Wednesday, December 18, 2013 4:34:48 AM UTC, Amit Murthy wrote:
>>>
>>> distribute creates a new DArray from a regular array. It does this by
>>> allocating parts of the regular array to each of the workers.
>>> "remotecall_fetch(owner, ()->fetch(rr)[I...])" is the init function
>>> passed to the DArray constructor. On each worker, it pulls in that
>>> worker's part of the regular array.
>>>
>>> While the DArray itself is created in parallel, 1000 workers is a lot;
>>> typically one would expect 1 worker per CPU core. Of course, if you do
>>> have access to 1000 cores, it would be a good idea to try it out,
>>> though I suspect we may see other issues as a result of it.
>>>
>> I assume this means that DArrays are not horizontally scalable. What
>> other solutions are out there for Julia when working with big data? Is
>> Julia itself considered a complete system when working with distributed
>> data, or do people use Julia on top of other frameworks?
>>
>>> On Tue, Dec 17, 2013 at 5:13 PM, David C Cohen <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> According to this
>>>> <https://github.com/JuliaLang/julia/blob/b4fa86124dd1cb298373c3bef3f98c060cbb19b8/base/darray.jl#L160-L167>,
>>>> distribute is defined as:
>>>>
>>>> function distribute(a::AbstractArray)
>>>>     owner = myid()
>>>>     rr = RemoteRef()
>>>>     put(rr, a)
>>>>     DArray(size(a)) do I
>>>>         remotecall_fetch(owner, ()->fetch(rr)[I...])
>>>>     end
>>>> end
>>>>
>>>> I'm trying to find out how this sends data to the workers.
>>>>
>>>> - Here rr is the remote reference to the local machine. Then
>>>>   put(rr, a) sends array a to the local machine. That doesn't make
>>>>   sense to me.
>>>> - When, in whatever way, the data of array a is sent to the workers,
>>>>   does the network I/O happen in parallel or in series?
>>>> - If we have 1000 workers, parallel sending means network overload,
>>>>   while serial sending means taking a long time. What's a good way of
>>>>   working with larger distributed arrays, or of distributing an array
>>>>   over many, many workers?
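To make the put/fetch mechanics concrete: put(rr, a) does not send the array anywhere. It stores a on the process that owns rr (the master), and each worker's init function then calls back to the owner and pulls only its own slice, so slicing happens before anything crosses the network. A minimal sketch of that round trip, again assuming the 0.2-era API; the worker id and index range are hypothetical:

    addprocs(2)

    a = rand(100)
    rr = RemoteRef()      # created on, and owned by, the master (myid() == 1)
    put(rr, a)            # stores a where rr lives; no network traffic yet

    owner = myid()
    I = (1:50,)           # the slice this worker would own (hypothetical)

    # What the init function does on worker 2: call back to the owner,
    # index a there, and receive only that slice over the wire.
    chunk = remotecall_fetch(2,
        () -> remotecall_fetch(owner, () -> fetch(rr)[I...]))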

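As for the last question, one approach the thread does not spell out: for arrays too large to funnel through a single master, the same do-block constructor lets each worker load its own chunk directly from shared storage, so no single network link carries the whole array. A sketch under the same 0.2-era API; the path, element type, and length are hypothetical:

    # A flat file of Float64s on storage every worker can reach.
    d = DArray((1000000,)) do I
        open("/shared/data.bin") do io
            # I is a tuple of index ranges; read only this worker's range.
            seek(io, (first(I[1]) - 1) * sizeof(Float64))
            read(io, Float64, length(I[1]))
        end
    end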