2012/5/7 Darren Govoni <[email protected]>:
> Good point. I'm no expert in the details of the algorithms per se, but I
> wonder how the Apache Mahout folks are doing it using map/reduce. Is
> there a data model in scikit that would be suitable for a map/reduce
> algorithm approach?
>
> I know ipython can do map/reduce rather easily.

MapReduce is no silver bullet for machine learning and iterative
algorithms in general (although some particular algorithms can be
implemented efficiently using MapReduce): having to pass both the
dataset chunks and the partially updated model parameters as a
stream of (k, v) pairs is really not practical and induces disk
serializations that are completely unnecessary and can kill
performance. Read the following blog post for more insights on this:

http://hunch.net/?p=2094

MapReduce is a nice paradigm for data preprocessing tasks like cleanup
and computing aggregate statistics on a dataset though. If you want to
do this, you should have a look at disco: http://discoproject.org/
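
Something like the classical word count job from the disco tutorial
(quoted from memory, so double check the exact API against the disco
docs before relying on it):

    from disco.core import Job, result_iterator

    def map(line, params):
        for word in line.split():
            yield word, 1

    def reduce(iter, params):
        from disco.util import kvgroup
        for word, counts in kvgroup(sorted(iter)):
            yield word, sum(counts)

    if __name__ == '__main__':
        job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                        map=map, reduce=reduce)
        for word, count in result_iterator(job.wait()):
            print(word, count)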

But keep in mind that MapReduce was designed to work on big data use
cases (bigdata == google-scale problems): hundreds or thousands of
nodes (e.g. a complete datacenter) with several terabytes of data
stored on each node in a replicated manner across the datacenter.
MapReduce tries to solve many problems that are linked to those
scales, such as tolerating hardware failures without restarting the
whole computation or losing data, limiting network bandwidth usage
through the data locality features of the distributed filesystem and
task scheduler, and maximizing IO throughput from and to the disks.
You probably don't have those issues, so you probably don't want to
waste your time trying to fit your problem into a constraining
programming paradigm that was only designed to address them.

Most scikit-learn machine learning algorithms are CPU bound and
assume that the whole model fits in the main memory of a single node
(and often the dataset as well). This is very different from the
typical MapReduce use cases.
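
That said, some estimators can already learn out-of-core through
online updates with partial_fit: the data is streamed in minibatches
so that only the model parameters have to stay in memory, not the
dataset. A minimal sketch (the minibatch generator below is just a
synthetic stand-in for a real out-of-core data source):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    def iter_minibatches(n_chunks=100, chunk_size=1000, n_features=20):
        # Stand-in for a real out-of-core source (e.g. chunks read
        # from disk or fetched from a database cursor).
        rng = np.random.RandomState(0)
        for _ in range(n_chunks):
            X = rng.randn(chunk_size, n_features)
            y = (X[:, 0] > 0).astype(np.int64)
            yield X, y

    clf = SGDClassifier(loss="hinge", alpha=1e-4)
    for X_chunk, y_chunk in iter_minibatches():
        # The full set of class labels must be given on the first call.
        clf.partial_fit(X_chunk, y_chunk, classes=np.array([0, 1]))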

Also keep in mind that each algorithm is (or is not) amenable to
different kinds of scalability patterns through online updates and /
or parallelization (SIMD or GPU-style, multicore, or cluster
distributed). Hence you should not think in terms of a single
parallelization pattern that would make all the algorithms
implemented in scikit-learn scalable at once. Take the alternative
approach instead: start from each individual algorithm and
understand what could possibly make it scalable by analyzing its IO,
CPU and memory usage patterns and the frequency of synchronization
barriers between the model components.
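
For instance, embarrassingly parallel ensembles such as random
forests can already use all the cores of a single machine through
joblib; a minimal sketch on synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=10000, n_features=50,
                               random_state=0)

    # Each tree is fit independently of the others, so the work can
    # be dispatched to all available cores with n_jobs=-1.
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
    clf.fit(X, y)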

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
