Hi all,

When doing single-node, multi-CPU parallel machine learning (e.g.
grid search, one-vs-all SGD, random forests), it would be great to
avoid duplicating memory, especially for the input dataset, which is
used as a read-only resource in most of our common use cases.

This could be done either with shared memory (which is not natively
provided by numpy and needs a separate extension such as
numpy-sharedmem [1]) or with file-backed memory-mapped numpy arrays.

Apparently both solutions are currently hard to implement (the latter
maybe because of a bug).

There is a discussion here; if you are interested in fixing this
problem one way or another, feel free to drop by:

https://github.com/scikit-learn/scikit-learn/issues/936

Or reply to this email with new solutions.

BTW, has anyone tried to use the multiprocessing Array class [2] in a
numpy / scipy context?

[1] https://bitbucket.org/cleemesser/numpy-sharedmem/
[2] 
http://docs.python.org/library/multiprocessing.html#sharing-state-between-processes

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general