HDFS.jl is separate from DArray; it is a map-reduce framework for working with HDFS. I just mentioned it since you asked about options for working with distributed data.
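For reference, a minimal sketch of the map-reduce idea using only Julia's built-in @parallel reduction (the 4-worker count and the toy sum are just placeholders); this is not HDFS.jl's actual API, only the general pattern of mapping work over chunks on the workers and reducing the partial results on the master:

    addprocs(4)                     # say, one worker per local core

    # "map": each worker computes squares for its share of the range;
    # "reduce": (+) folds the per-worker partial sums back on the master.
    total = @parallel (+) for i = 1:1000000
        i^2
    end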
On Wed, Dec 18, 2013 at 12:50 PM, David C Cohen <[email protected]> wrote:

> On Wednesday, December 18, 2013 6:16:54 AM UTC, Amit Murthy wrote:
>>
>> DArrays are horizontally scalable in the sense that each worker holds its
>> part of the DArray, and the workers can exist across different machines.
>> There is support for HDFS via https://github.com/tanmaykm/HDFS.jl that
>> you can check out.
>>
>
> Does HDFS.jl work with DArray, or is it a replacement for DArray?
>
>>
>> Out of the box, Julia has a variety of parallel processing constructs, as
>> documented here - http://docs.julialang.org/en/latest/manual/parallel-computing/ .
>> Some of the currently available external packages - full list here -
>> http://docs.julialang.org/en/latest/packages/packagelist/ - are useful
>> for distributed data.
>>
>> On Wed, Dec 18, 2013 at 11:31 AM, David C Cohen <[email protected]> wrote:
>>
>>> On Wednesday, December 18, 2013 4:34:48 AM UTC, Amit Murthy wrote:
>>>>
>>>> distribute creates a new DArray from a regular array. It does this by
>>>> allocating parts of the regular array on each of the workers.
>>>> "remotecall_fetch(owner, ()->fetch(rr)[I...])" is the regular init
>>>> function passed to the DArray constructor. On each worker, it pulls in
>>>> its part of the regular array.
>>>>
>>>> While the DArray itself is created in parallel, 1000 workers is a lot;
>>>> typically one would expect 1 worker per CPU core. Of course, if you do
>>>> have access to 1000 cores, it will be a good idea to try it out, though
>>>> I suspect we may see other issues too as a result of it.
>>>>
>>>
>>> I assume this means that DArrays are not horizontally scalable. What
>>> other solutions are out there for Julia when working with big data? Is
>>> Julia itself considered a complete system when working with distributed
>>> data, or do people use Julia on top of other frameworks?
>>>
>>>>
>>>> On Tue, Dec 17, 2013 at 5:13 PM, David C Cohen <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> According to this
>>>>> <https://github.com/JuliaLang/julia/blob/b4fa86124dd1cb298373c3bef3f98c060cbb19b8/base/darray.jl#L160-L167>,
>>>>> distribute is defined as:
>>>>>
>>>>> function distribute(a::AbstractArray)
>>>>>     owner = myid()
>>>>>     rr = RemoteRef()
>>>>>     put(rr, a)
>>>>>     DArray(size(a)) do I
>>>>>         remotecall_fetch(owner, ()->fetch(rr)[I...])
>>>>>     end
>>>>> end
>>>>>
>>>>> I'm trying to find out how this sends data to the workers.
>>>>>
>>>>> - Here rr is the remote reference to the local machine. Then
>>>>>   put(rr, a) sends array a to the local machine. That doesn't make
>>>>>   sense.
>>>>> - When, in whatever way that doesn't make sense to me, data of array a
>>>>>   is sent to workers, does the network IO happen in parallel, or in
>>>>>   series?
>>>>> - If we have 1000 workers, parallel data sending means network
>>>>>   overload, and serial data sending means taking a long time. What's a
>>>>>   good way of working with larger distributed arrays, or distributing
>>>>>   an array over many, many workers?
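To make the mechanism in the quoted distribute source concrete, here is a minimal sketch that mirrors it (the worker count and array size are arbitrary examples; the API names are the ones used in that source):

    addprocs(2)                # two workers besides the master

    a  = rand(1000)            # ordinary array, exists only on the master
    owner = myid()             # the master's process id
    rr = RemoteRef()           # a reference owned by the master
    put(rr, a)                 # stash `a` under that reference - no network IO yet

    # The DArray constructor picks an index range I for each worker and runs
    # this closure *on that worker*. Each worker therefore calls back to the
    # owner, which evaluates fetch(rr)[I...] locally and ships only that
    # slice over the wire.
    d = DArray(size(a)) do I
        remotecall_fetch(owner, ()->fetch(rr)[I...])
    end

So put(rr, a) does not send the array anywhere; it just registers it on the master so the workers can pull their pieces later. How much those pulls overlap in time depends on how the constructor schedules the per-worker calls, but either way each chunk of a is shipped from the master exactly once, so with very many workers the master's outbound network link is the likely bottleneck.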
