Hi all,

Q: how to organize parallel access to a huge common read-only Python data structure?
Details: I have a huge data structure that takes >50% of RAM. My goal is to have many computational threads (or processes) with efficient read access to this huge and complex data structure. "Efficient" means in particular "without serialization" and "without unneeded locking on read-only data".

As far as I can see, there are the following strategies:

1. multi-processing:
   a. child processes get their own *copies* of the huge data structure -- bad, and not possible at all in my case;
   b. child processes communicate with the parent process via some IPC -- bad (serialization);
   c. child processes access the huge structure via some shared-memory approach -- feasible without serialization?! (copy-on-write does not work well here in CPython on Linux, since reference counting writes to the shared pages; a minimal sketch follows in the P.S. below);
2. multi-threading:
   d. CPython is said to have problems here because of the GIL -- any comments? (see the second sketch in the P.S. below);
   e. GIL-less implementations have their own issues -- any hot recommendations?

I am a big fan of the parallel map() approach -- either multiprocessing.Pool.map or, even better, pprocess.pmap. However, this no longer works in a straightforward way when "huge data" means >50% of RAM ;-)

Comments and ideas are highly welcome!!

Here is the workbench example of my case:

######################
import time
from multiprocessing import Pool

def f(_):
    time.sleep(5)            # just to emulate the time used by my computation
    res = sum(parent_x)      # my sophisticated formula goes here
    return res

if __name__ == '__main__':
    parent_x = [1./i for i in xrange(1, 10000000)]  # my huge read-only data :o)
    p = Pool(7)
    res = list(p.map(f, xrange(10)))
    # switch to ps and see how fast your free memory is getting wasted...
    print res
######################

Kind regards
Valery
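P.S. To make option 1(c) more concrete, here is a minimal sketch of the shared-memory idea, assuming -- just for illustration, my real structure of course cannot -- that the huge data could be flattened into a plain array of C doubles. The names (shared_x, n) are mine, not part of the real code. Reading a shared ctypes array does not touch per-object reference counts, so the shared pages are not copied:

######################
# sketch of option 1(c): shared memory via multiprocessing.Array
import time
from multiprocessing import Pool, Array

def f(_):
    time.sleep(5)             # emulate the real computation
    return sum(shared_x)      # read-only access: no copy, no pickling of the data

if __name__ == '__main__':
    n = 10000000
    # lock=False gives a raw shared ctypes array without a synchronisation
    # wrapper -- fine here, since the workers only read it
    shared_x = Array('d', n, lock=False)
    for i in xrange(1, n):
        shared_x[i] = 1. / i
    p = Pool(7)               # created *after* shared_x, so children inherit it on fork
    print list(p.map(f, xrange(10)))
######################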
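P.P.S. And a minimal sketch of option 2(d), plain threads: the one structure is shared directly, so there are no copies and no serialization, but since sum() over a Python list is CPU-bound, the GIL keeps the computation effectively serial on CPython. The results list is just my way of collecting return values:

######################
# sketch of option 2(d): plain threads share parent_x without copies,
# but the GIL serializes the CPU-bound part on CPython
import time
import threading

def f(results, i):
    time.sleep(5)               # emulate the real computation
    results[i] = sum(parent_x)  # reads the one shared structure directly

if __name__ == '__main__':
    parent_x = [1./i for i in xrange(1, 10000000)]   # the huge read-only data
    results = [None] * 10
    threads = [threading.Thread(target=f, args=(results, i)) for i in xrange(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print results
######################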