On Wednesday, December 18, 2013 4:34:48 AM UTC, Amit Murthy wrote:
>
> distribute creates a new DArray from a regular array. It does this by
> allocating parts of the regular array on each of the workers.
> "remotecall_fetch(owner, ()->fetch(rr)[I...])" is the init function
> passed to the DArray constructor. On each worker, it pulls in its part
> of the regular array.
>
> While the DArray itself is created in parallel, 1000 workers is a lot;
> typically one would expect 1 worker per CPU core. Of course, if you do
> have access to 1000 cores, it will be a good idea to try it out, though
> I suspect we may see other issues too as a result of it.
>
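To see the pull mechanism Amit describes in action, here is a small
experiment, sketched against the 0.2-era API used in the quoted source
(RemoteRef, put, and remotecall_fetch(pid, f) were later renamed), with
localpart assumed as the accessor for a worker's local chunk:

    # Pull-based distribution: each worker fetches only its own chunk.
    addprocs(3)                 # rule of thumb: one worker per CPU core

    a = rand(12)                # an ordinary local array on this process
    d = distribute(a)           # workers pull their chunks from here

    # Ask each worker how much of the array it ended up holding.
    for w in workers()
        n = remotecall_fetch(w, () -> length(localpart(d)))
        println("worker $w holds $n of $(length(a)) elements")
    end
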
I assume this means that DArrays are not horizontally scalable. What other
solutions are out there for Julia when working with big data? Is Julia
itself considered a complete system for working with distributed data, or
do people use Julia on top of other frameworks?

> On Tue, Dec 17, 2013 at 5:13 PM, David C Cohen <[email protected]> wrote:
>
>> Hi,
>>
>> According to this
>> <https://github.com/JuliaLang/julia/blob/b4fa86124dd1cb298373c3bef3f98c060cbb19b8/base/darray.jl#L160-L167>,
>> distribute is defined as:
>>
>> function distribute(a::AbstractArray)
>>     owner = myid()
>>     rr = RemoteRef()
>>     put(rr, a)
>>     DArray(size(a)) do I
>>         remotecall_fetch(owner, ()->fetch(rr)[I...])
>>     end
>> end
>>
>> I'm trying to find out how this sends data to the workers.
>>
>> - Here rr is a remote reference to the local machine, and put(rr, a)
>>   then sends array a to the local machine. That doesn't make sense to
>>   me.
>> - When, in whatever way I'm missing, the data of array a is sent to
>>   the workers, does the network I/O happen in parallel or in series?
>> - If we have 1000 workers, sending the data in parallel means network
>>   overload, and sending it in series means taking a long time. What's
>>   a good way of working with larger distributed arrays, or of
>>   distributing an array over very many workers?
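
On the first two bullets: as far as I can tell, put(rr, a) does not send
anything anywhere; it just stores a in the RemoteRef on the calling
process. The network transfer happens later, when each worker runs the
init function: remotecall_fetch(owner, ...) executes the closure on the
owner, which slices the array locally, so only the requested slice crosses
the wire. A sketch of that single hop (same 0.2-era API; worker id 2 and
the 1:5 slice are arbitrary choices for illustration):

    addprocs(2)

    a = rand(10)
    owner = myid()
    rr = RemoteRef()
    put(rr, a)          # stores `a` in the ref locally; no network traffic yet

    # What a worker's init call effectively does: run the closure on the
    # owner, which does fetch(rr)[1:5] locally and ships back the slice.
    chunk = fetch(@spawnat 2 remotecall_fetch(owner, () -> fetch(rr)[1:5]))

And on parallel vs. serial: the DArray constructor appears to launch the
per-chunk init with remotecall, which returns immediately, so the pulls
overlap rather than run one after another. Every byte still originates
from the single owner process, though, which is presumably where 1000
workers would start to hurt.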
