Hi Fabian, It looks like that SO answer is moving data into the global scope of each worker. It is probably worth experimenting with but I'd be worried about performance implications of non-const global variables. It's probably the case that this is still a win for my use case though. Thanks for the link.
I'm using alm2map and map2alm from LibHealpix.jl ( https://github.com/mweastwood/LibHealpix.jl) to do the spherical harmonic transforms. Michael On Thursday, May 19, 2016 at 12:45:05 AM UTC-7, Fabian Gans wrote: > > Hi Michael, > > I recently had a similar problem and this SO thread helped me a lot: > http://stackoverflow.com/questions/27677399/julia-how-to-copy-data-to-another-processor-in-julia > > As a side question: Which code are you using to calculate spherical > harmonics transforms. I was looking for a julia package some time ago and > did not find any, so if your code is publicly available, could you point me > to it? > > Thanks > Fabian > > > > On Wednesday, May 18, 2016 at 9:28:34 PM UTC+2, Michael Eastwood wrote: >> >> Hi julia-users! >> >> I need some help with distributing a workload across tens of workers >> across several machines. >> >> The problem I am trying to solve involves calculating the elements of a >> large block diagonal matrix. The total size of the blocks is >500 GB, so I >> cannot store the entire thing in memory. The way the calculation works, I >> do a bunch of spherical harmonic transforms and the results give me one row >> in each block of the matrix. >> >> The following code illustrates what I am doing currently. I am >> distributing the spherical harmonic transforms amongst all the workers and >> bringing the data back to the master process to write the results to disk >> (the master process has each matrix block mmapped to disk). >> >> idx = 1 >> limit = 10000 >> nextidx() = (myidx = idx; idx += 1; myidx) >> @sync for worker in workers() >> @async while true >> myidx = nextidx() >> myidx ≤ limit || break >> coefficients = >> remotecall_fetch(spherical_harmonic_transforms, worker, input[myidx]) >> write_results_to_disk(coefficients) >> end >> end >> >> Each spherical harmonic transform takes O(10 seconds) so I thought the >> data movement cost would be negligible compared to this. However, if I have >> three machines each with 16 workers, machine 1 will have all 16 workers >> working hard (the master process is on machine 1) and machines 2&3 will >> have most of their workers idling. My hypothesis is that the cost of moving >> the data to and from the workers is preventing machines 2&3 from being >> fully utilized. >> >> coefficients is a vector of a million Complex128s (16 MB) >> input is composed of two parts: 1) a vector of 10 million Float64s (100 >> MB) and 2) a small amount of additional information that is negligibly >> small compared to the first part. >> >> The trick is that the first part of input (the 100 MB vector) doesn't >> change between iterations. So I could alleviate most of the data movement >> problem by moving that part to each worker once. Problem is that I can't >> seem to figure out how to do that. The manual ( >> http://docs.julialang.org/en/release-0.4/manual/parallel-computing/#remoterefs-and-abstractchannels) >> >> is a little thin on how to use RemoteRefs. >> >> So how do you move data to workers in a way that it can be re-used on >> subsequent iterations? An example in the manual would be very helpful! >> >> Thanks, >> Michael >> >
