DArrays are horizontally scalable in the sense that each worker holds its
own part of the DArray, and the workers can live on different machines.
There is also support for HDFS via https://github.com/tanmaykm/HDFS.jl,
which you can check out.
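
As a rough sketch of what that looks like in practice (hypothetical
hostnames "node1" and "node2", assuming passwordless ssh and the same
Julia build on each machine, current 0.2-era API):

# launch one worker process on each of two remote machines over ssh
addprocs(["node1", "node2"])

a = rand(10000)
d = distribute(a)   # each worker now owns one contiguous chunk of a

size(d)             # same shape as a, but the data lives on the workers
d[3]                # scalar indexing fetches that element from whichever worker owns it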

Out of the box, Julia has a variety of parallel processing constructs,
documented here:
http://docs.julialang.org/en/latest/manual/parallel-computing/
Some of the currently available external packages (full list here:
http://docs.julialang.org/en/latest/packages/packagelist/ ) are also
useful for distributed data.
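
For example, a quick sketch of the built-in primitives (0.2-era API,
local workers only):

addprocs(3)                      # start 3 additional local worker processes

# run a closure on a specific worker and wait for its result
remotecall_fetch(2, ()->myid())

# map a function over a collection using all available workers
pmap(x -> x^2, 1:10)

# parallel for loop with a (+) reduction across the workers
nheads = @parallel (+) for i = 1:100000
    rand() < 0.5 ? 1 : 0
end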



On Wed, Dec 18, 2013 at 11:31 AM, David C Cohen <[email protected]> wrote:

>
>
> On Wednesday, December 18, 2013 4:34:48 AM UTC, Amit Murthy wrote:
>>
>> distribute creates a new DArray from a regular array. It does this by
>> allocating parts of the regular array on each of the workers.
>> "remotecall_fetch(owner, ()->fetch(rr)[I...])" is the init function
>> passed to the DArray constructor. On each worker, it pulls in that
>> worker's part of the regular array.
>>
>> While the DArray itself is created in parallel, 1000 workers is a lot;
>> typically one would expect 1 worker per CPU core. Of course, if you do have
>> access to 1000 cores, it would be a good idea to try it out, though I
>> suspect we may see other issues as a result of it.
>>
>
> I assume this means that DArrays are not horizontally scalable. What other
> solutions are out there for Julia when working with big data? Is Julia
> itself considered a complete system when working with distributed data, or
> do people use Julia on top of other frameworks?
>
>
>>
>> On Tue, Dec 17, 2013 at 5:13 PM, David C Cohen <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> According to
>>> https://github.com/JuliaLang/julia/blob/b4fa86124dd1cb298373c3bef3f98c060cbb19b8/base/darray.jl#L160-L167
>>> distribute is defined as:
>>>
>>> function distribute(a::AbstractArray)
>>>     owner = myid()
>>>     rr = RemoteRef()
>>>     put(rr, a)
>>>     DArray(size(a)) do I
>>>         remotecall_fetch(owner, ()->fetch(rr)[I...])
>>>     end
>>> end
>>>
>>>
>>> I'm trying to find out how this sends data to the workers.
>>>
>>>    - Here rr is a remote reference on the local machine, and put(rr, a)
>>>    stores array a on the local machine. That doesn't make sense to me.
>>>    - When the data of array a is sent to the workers (in whatever way
>>>    that I'm not seeing), does the network IO happen in parallel or in
>>>    series?
>>>    - If we have 1000 workers, parallel data sending means network
>>>    overload, and serial data sending means taking a long time. What's a
>>>    good way of working with larger distributed arrays, or of distributing
>>>    an array over many, many workers?
>>>
>>>
>>
