Hi all,

When doing single-node, multi-CPU parallel machine learning (e.g. grid search, one-vs-all SGD, random forests), it would be great to avoid duplicating memory, especially for the input dataset, which is used as a read-only resource in most of our common use cases.

This could be done either with shared memory (which is not natively provided by numpy and needs a separate extension such as numpy-sharedmem [1]) or with file-backed memory-mapped numpy arrays. Apparently both solutions are currently hard to implement (the latter maybe because of a bug). There is a discussion here; if you are interested in fixing this problem one way or another, feel free to drop by:

https://github.com/scikit-learn/scikit-learn/issues/936

Or reply to this email with new solutions.
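To make the memmap option concrete, here is a minimal sketch of the pattern I have in mind (the file name and the train_fold worker are made up for illustration): the parent writes the dataset to disk once, and each worker re-opens it read-only so the OS shares the underlying pages instead of copying the data into each process.

import multiprocessing
import os
import tempfile

import numpy as np

def train_fold(args):
    # Hypothetical worker: re-open the file-backed array read-only.
    # The OS maps the same pages into every process, so the dataset
    # is not duplicated.
    filename, shape, fold = args
    X = np.memmap(filename, dtype=np.float64, mode='r', shape=shape)
    return fold, float(X[fold::4].sum())  # stand-in for real training

if __name__ == '__main__':
    # Dump the dataset to a temporary file once, in the parent.
    filename = os.path.join(tempfile.mkdtemp(), 'dataset.mmap')
    X = np.memmap(filename, dtype=np.float64, mode='w+', shape=(10000, 100))
    X[:] = np.random.rand(10000, 100)
    X.flush()

    pool = multiprocessing.Pool(4)
    print(pool.map(train_fold, [(filename, X.shape, fold) for fold in range(4)]))
    pool.close()
    pool.join()

A nice side effect is that this pattern would also keep working when the dataset is larger than RAM, since pages are only loaded on access.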
BTW, has anyone tried to use the multiprocessing Array class [2] in a numpy / scipy context?
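To clarify what I mean by [2]: multiprocessing.Array allocates a ctypes buffer in shared memory, and numpy can wrap that buffer without a copy via np.frombuffer. A minimal, untested sketch of what I would expect to work (column_mean is just a placeholder task, and this relies on fork semantics, so POSIX only):

import ctypes
import multiprocessing

import numpy as np

n_samples, n_features = 1000, 50

# Allocate a flat double buffer in shared memory. lock=False is fine
# for a read-only dataset and avoids synchronization overhead.
shared = multiprocessing.Array(ctypes.c_double,
                               n_samples * n_features, lock=False)

# Wrap the shared buffer as a regular numpy array (no copy is made).
X = np.frombuffer(shared, dtype=np.float64).reshape(n_samples, n_features)
X[:] = np.random.rand(n_samples, n_features)

def column_mean(j):
    # On POSIX, workers inherit X through fork and read the very same
    # shared pages; the data itself is never pickled.
    return X[:, j].mean()

if __name__ == '__main__':
    pool = multiprocessing.Pool(2)
    print(pool.map(column_mean, range(n_features)))
    pool.close()
    pool.join()

Note that with lock=False the Array is a raw ctypes array that numpy can wrap directly; with the default lock=True one has to go through its get_obj() method, and the lock buys nothing for a read-only dataset anyway.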
[1] https://bitbucket.org/cleemesser/numpy-sharedmem/
[2] http://docs.python.org/library/multiprocessing.html#sharing-state-between-processes

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel