On Wednesday, December 18, 2013 6:16:54 AM UTC, Amit Murthy wrote:
>
> DArrays are horizontally scalable in the sense that each worker holds its
> part of the DArray, and the workers can exist across different machines.
> There is support for HDFS via https://github.com/tanmaykm/HDFS.jl that
> you can check out.
>
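For illustration, a minimal sketch of the layout described above, with each worker holding only its own chunk. It assumes the Julia 0.2-era API (addprocs, dzeros, localpart, procs, as in base/darray.jl of that time):

    addprocs(4)               # 4 local workers; machine specs would put them on other hosts

    d = dzeros(1000, 1000)    # a DArray split across the 4 workers

    # Ask each worker for the size of the piece it holds locally.
    for w in procs(d)
        sz = remotecall_fetch(w, () -> size(localpart(d)))
        println("worker $w holds a chunk of size $sz")
    end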
Does HDFS.jl work with DArray, or is it a replacement for DArray?

> Out-of-the-box, Julia has a variety of parallel processing constructs, as
> documented here -
> http://docs.julialang.org/en/latest/manual/parallel-computing/ . Some of
> the currently available external packages - full list here:
> http://docs.julialang.org/en/latest/packages/packagelist/ - are useful
> for distributed data.
>
> On Wed, Dec 18, 2013 at 11:31 AM, David C Cohen <[email protected]> wrote:
>
>> On Wednesday, December 18, 2013 4:34:48 AM UTC, Amit Murthy wrote:
>>>
>>> distribute creates a new DArray from a regular array. It does this by
>>> allocating parts of the regular array to each of the workers.
>>> "remotecall_fetch(owner, ()->fetch(rr)[I...])" is the init function
>>> passed to the DArray constructor. On each worker, it pulls in that
>>> worker's part of the regular array.
>>>
>>> While the DArray itself is created in parallel, 1000 workers is a lot;
>>> typically one would expect 1 worker per CPU core. Of course, if you do
>>> have access to 1000 cores, it would be a good idea to try it out,
>>> though I suspect we may see other issues as a result of it.
>>>
>> I assume this means that DArrays are not horizontally scalable. What
>> other solutions are out there for Julia when working with big data? Is
>> Julia itself considered a complete system when working with distributed
>> data, or do people use Julia on top of other frameworks?
>>
>>> On Tue, Dec 17, 2013 at 5:13 PM, David C Cohen <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> According to this
>>>> <https://github.com/JuliaLang/julia/blob/b4fa86124dd1cb298373c3bef3f98c060cbb19b8/base/darray.jl#L160-L167>,
>>>> distribute is defined as:
>>>>
>>>> function distribute(a::AbstractArray)
>>>>     owner = myid()
>>>>     rr = RemoteRef()
>>>>     put(rr, a)
>>>>     DArray(size(a)) do I
>>>>         remotecall_fetch(owner, ()->fetch(rr)[I...])
>>>>     end
>>>> end
>>>>
>>>> I'm trying to find out how this sends data to the workers.
>>>>
>>>> - Here rr is the remote reference to the local machine. Then
>>>>   put(rr, a) sends array a to the local machine. That doesn't make
>>>>   sense to me.
>>>> - When, in whatever way, the data of array a is sent to the workers,
>>>>   does the network I/O happen in parallel or in series?
>>>> - If we have 1000 workers, parallel sending means network overload,
>>>>   while serial sending means taking a long time. What's a good way of
>>>>   working with larger distributed arrays, or of distributing an array
>>>>   over many, many workers?
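To make the put/fetch mechanics concrete: put(rr, a) does not send the array anywhere. It stores a on the process that owns rr (the master), and each worker's init function then calls back to the owner and pulls only its own slice, so slicing happens before anything crosses the network. A minimal sketch of that round trip, again assuming the 0.2-era API; the worker id and index range are hypothetical:

    addprocs(2)

    a = rand(100)
    rr = RemoteRef()      # created on, and owned by, the master (myid() == 1)
    put(rr, a)            # stores a where rr lives; no network traffic yet

    owner = myid()
    I = (1:50,)           # the slice this worker would own (hypothetical)

    # What the init function does on worker 2: call back to the owner,
    # index a there, and receive only that slice over the wire.
    chunk = remotecall_fetch(2,
        () -> remotecall_fetch(owner, () -> fetch(rr)[I...]))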

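As for the last question, one approach the thread does not spell out: for arrays too large to funnel through a single master, the same do-block constructor lets each worker load its own chunk directly from shared storage, so no single network link carries the whole array. A sketch under the same 0.2-era API; the path, element type, and length are hypothetical:

    # A flat file of Float64s on storage every worker can reach.
    d = DArray((1000000,)) do I
        open("/shared/data.bin") do io
            # I is a tuple of index ranges; read only this worker's range.
            seek(io, (first(I[1]) - 1) * sizeof(Float64))
            read(io, Float64, length(I[1]))
        end
    end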