That's not a problem for me; all of my data is numeric. To summarize a long post, I'm interested in understanding
1) good programming paradigms for using shared memory together with parallel maps. In particular, can a shared array and other non-shared data structures be combined into a single data structure and "passed" in a remote call without unnecessarily copying the shared array? And 2) possibilities for extending shared memory in Julia to other data types, and even to user-defined types.

On Tuesday, January 21, 2014 11:17:10 PM UTC-8, Amit Murthy wrote:
>
> I have not gone through your post in detail, but would like to point out
> that SharedArray can only be used for bitstypes.
>
> On Wed, Jan 22, 2014 at 12:23 PM, Madeleine Udell
> <[email protected]> wrote:
>
>> # Say I have a list of tasks, eg tasks i=1:n
>> # For each task I want to call a function foo
>> # that depends on that task and some fixed data
>> # I have many types of fixed data: eg, arrays, dictionaries, integers, etc
>>
>> # Imagine the data comes from eg loading a file based on user input,
>> # so we can't hard code the data into the function foo
>> # although it's constant during program execution
>>
>> # If I were doing this in serial, I'd do the following
>>
>> type MyData
>>     myint
>>     mydict
>>     myarray
>> end
>>
>> function foo(task, data::MyData)
>>     data.myint + data.myarray[data.mydict[task]]
>> end
>>
>> n = 10
>> const data = MyData(rand(), Dict(1:n, randperm(n)), randperm(n))
>>
>> results = zeros(n)
>> for i = 1:n
>>     results[i] = foo(i, data)
>> end
>>
>> # What's the right way to do this in parallel?
>> # Here are a number of ideas.
>>
>> # To use @parallel or pmap, we have to first copy all the code and data everywhere
>> # I'd like to avoid that, since the data is huge (10 - 100 GB)
>>
>> @everywhere begin
>>     type MyData
>>         myint
>>         mydict
>>         myarray
>>     end
>>
>>     function foo(task, data::MyData)
>>         data.myint + data.myarray[data.mydict[task]]
>>     end
>>
>>     n = 10
>>     const data = MyData(rand(), Dict(1:n, randperm(n)), randperm(n))
>> end
>>
>> ## @parallel
>> results = zeros(n)
>> @parallel for i = 1:n
>>     results[i] = foo(i, data)
>> end
>>
>> ## pmap
>> @everywhere foo(task) = foo(task, data)
>> results = pmap(foo, 1:n)
>>
>> # To avoid copying data, I can make myarray a shared array
>> # In that case, I don't want to use @everywhere to put data on each processor,
>> # since that would reinstantiate the shared array.
>> # My current solution is to rewrite my data structure to *not* include myarray,
>> # and pass the array to the function foo separately.
>> # But the code gets much less pretty as I tear apart my data structure,
>> # especially if I have a large number of shared arrays.
>> # Is there a way for me to avoid this while using shared memory?
>> # Really, I'd like to be able to define my own shared memory data types...
>>
>> @everywhere begin
>>     type MySmallerData
>>         myint
>>         mydict
>>     end
>>
>>     function foo(task, data::MySmallerData, myarray::SharedArray)
>>         data.myint + myarray[data.mydict[task]]
>>     end
>>
>>     n = 10
>>     const data = MySmallerData(rand(), Dict(1:n, randperm(n)))
>> end
>>
>> myarray = SharedArray(randperm(n))
>>
>> ## @parallel
>> results = zeros(n)
>> @parallel for i = 1:n
>>     results[i] = foo(i, data, myarray)
>> end
>>
>> ## pmap
>> @everywhere foo(task) = foo(task, data, myarray)
>> results = pmap(foo, 1:n)
>>
>> # Finally, what can I do to avoid copying mydict to each processor?
>> # Is there a way to use shared memory for it?
>> # Once again, I'd really like to be able to define my own shared memory data types...
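To make question 1 concrete, here is an untested sketch of one possible approach, assuming the SharedArray API on Julia master in early 2014 (`SharedArray(T, dims)` constructor) and assuming that serializing a SharedArray ships only its shared-segment metadata, not the underlying data. The names `MySharedData` and `foo2` are hypothetical. It also flattens the Dict: since the keys here are exactly 1:n, the mapping can live in a second SharedArray used as a lookup table.

```julia
addprocs(2)

# Composite type mixing a plain bits field with SharedArray fields.
# Assumption: when this struct is serialized in a remote call, the
# SharedArray fields re-attach to the same shared segments on each
# worker instead of copying the data.
@everywhere type MySharedData
    myint::Float64
    lookup::SharedArray   # replaces mydict: lookup[task] == mydict[task]
    myarray::SharedArray
end

@everywhere function foo2(task, data::MySharedData)
    data.myint + data.myarray[data.lookup[task]]
end

n = 10
lookup  = SharedArray(Int, n); lookup[:]  = randperm(n)
myarray = SharedArray(Int, n); myarray[:] = randperm(n)
data = MySharedData(rand(), lookup, myarray)

# pmap serializes `data` once per task; only myint and the shared-memory
# metadata should travel over the wire.
results = pmap(i -> foo2(i, data), 1:n)
```

For a Dict with non-integer keys this trick does not apply directly; one would have to encode keys and values as parallel shared arrays, or accept a one-time copy of the Dict to each worker. Truly shared user-defined types (question 2) run into the bitstype restriction Amit mentions above.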
