2012/5/7 Darren Govoni <[email protected]>:
> Good point. I'm no expert in the details of the algorithms per se, but I
> wonder how the Apache Mahout folks are doing it using map/reduce. Is
> there a data model in scikit that would be suitable for a map/reduce
> algorithm approach?
>
> I know ipython can do map/reduce rather easily.
MapReduce is no silver bullet for machine learning and iterative
algorithms in general (although some particular algorithms can be
implemented efficiently with MapReduce): having to pass both the dataset
chunks and the partially updated model parameters around as a stream of
(k, v) pairs is really not practical, and it induces disk serializations
that are completely unnecessary and can kill performance (a toy rendition
of this pattern is sketched in the postscript below). Read the following
blog post for more insights on this:

http://hunch.net/?p=2094

MapReduce is a nice paradigm for data preprocessing tasks such as cleanup
and computing aggregate statistics on a dataset, though. If you want to
do that, you should have a look at disco (see the word-count sketch in
the postscript):

http://discoproject.org/

But keep in mind that MapReduce was designed for big data use cases
(big data == google-scale problems): hundreds or thousands of nodes
(e.g. a complete datacenter) with several terabytes of data stored on
each node, replicated across the datacenter. MapReduce tries to solve
the problems that come with those scales: tolerating hardware failures
without restarting the whole computation or losing data, limiting
network bandwidth usage through the data locality features of the
distributed filesystem and task scheduler, and maximizing IO throughput
from and to the disks. You probably don't have those issues, so you
probably don't want to waste your time trying to fit your problem into
a constraining programming paradigm that was only designed to address
them.

Most scikit-learn machine learning algorithms are CPU bound and assume
that the whole model fits in the main memory of a single node (and often
the dataset as well). This is very different from the typical MapReduce
use cases.

Also keep in mind that each algorithm is (or is not) amenable to
different kinds of scalability patterns: online updates and / or
parallelization (SIMD or GPU-style, multicore, or cluster distributed).
Hence you should not think in terms of a single parallelization pattern
that would make all the algorithms implemented in scikit-learn scalable
at once. Take the alternative approach instead: start from each
individual algorithm and work out which solution could make it scalable
by analyzing its IO, CPU and memory usage patterns and the frequency of
the synchronization barriers between the model components (the
partial_fit sketch at the end of the postscript shows this kind of
per-algorithm solution for a linear model).

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
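
PS: to make the first point more concrete, here is a toy, single-process
rendition (plain numpy, no framework) of batch gradient descent for least
squares written in map/reduce style. Note how the full parameter vector w
has to be shipped to every map call and how the partial gradients come
back as (k, v) pairs on every single iteration; in a real MapReduce
framework those pairs would be serialized to disk between each phase:

  import numpy as np

  def map_gradient(w, X_chunk, y_chunk):
      # each mapper needs the complete, up-to-date parameter vector w
      grad = X_chunk.T.dot(X_chunk.dot(w) - y_chunk)
      yield "grad", grad  # emitted as a (k, v) pair

  def reduce_gradients(pairs):
      # sum the partial gradients coming back from all the mappers
      return sum(v for k, v in pairs)

  rng = np.random.RandomState(0)
  chunks = [(rng.randn(100, 5), rng.randn(100)) for _ in range(10)]

  w = np.zeros(5)
  for it in range(50):  # one full map/reduce round trip per iteration
      pairs = [kv for X_c, y_c in chunks
                  for kv in map_gradient(w, X_c, y_c)]
      w -= 1e-4 * reduce_gradients(pairs)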
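
The word-count sketch for disco, only lightly adapted from the tutorial
in the disco documentation (the input URL is the sample file used by
that tutorial, and you need a running disco master for the job to
actually execute):

  from disco.core import Job, result_iterator

  def map(line, params):
      # emit a (word, 1) pair for every word of every input line
      for word in line.split():
          yield word, 1

  def reduce(iter, params):
      # group the pairs by word and sum the counts
      from disco.util import kvgroup
      for word, counts in kvgroup(sorted(iter)):
          yield word, sum(counts)

  if __name__ == '__main__':
      job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                      map=map,
                      reduce=reduce)
      for word, count in result_iterator(job.wait(show=True)):
          print("%s %d" % (word, count))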
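
Finally, an example of the per-algorithm approach: SGDClassifier
supports out-of-core learning through its partial_fit method, so you can
stream a dataset that does not fit in RAM in chunks and update the
linear model online, without any MapReduce machinery. The
iter_minibatches loader below is a hypothetical stand-in that generates
random chunks; in practice it would read successive chunks of your
dataset from disk:

  import numpy as np
  from sklearn.linear_model import SGDClassifier

  rng = np.random.RandomState(42)

  def iter_minibatches(n_chunks=100, chunk_size=1000, n_features=20):
      # hypothetical chunk loader: replace with code that reads
      # successive (X, y) chunks of the real dataset from disk
      for _ in range(n_chunks):
          X = rng.randn(chunk_size, n_features)
          y = (X[:, 0] > 0).astype(int)
          yield X, y

  clf = SGDClassifier(loss="hinge", alpha=1e-4)
  classes = np.array([0, 1])  # all classes must be listed on the first call

  for X_chunk, y_chunk in iter_minibatches():
      clf.partial_fit(X_chunk, y_chunk, classes=classes)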
