2012/7/19 Viktor Pekar <[email protected]>:
> Dear all,
>
> I am trying to find information on the use of scikit-learn on very large
> datasets, i.e. if and how it can be used in a distributed processing setup.
> I saw that PiCloud has scikit-learn installed in their environment, and this
> post suggests it can be deployed on PiCloud:
> http://stackoverflow.com/questions/9653060/amazon-ec2-vs-picloud. But I
> can't find any details on how scalable it is, and whether it is advisable at
> all to use scikit-learn in such situations. So any advice and pointers will
> be much appreciated.

It depends on what you want to achieve. Some tasks in machine learning are
embarrassingly parallel (grid searching for optimal parameters with cross
validation for model selection, or fitting random forests). Others are not
that easily parallelizable (e.g. fitting a model with stochastic gradient
descent, as you need synchronization steps, a.k.a. inter-node communication,
for averaging the parameters while learning). Others are not parallelizable
at all (e.g. fitting a kernel SVM with the SMO algorithm, AFAIK).

Right now we have neither high-level tools nor documentation to achieve
this, but it is on my personal todo list, and probably on that of other
scikit-learn developers as well.

BTW, what is a "very large dataset" in your case? What is the task you want
to achieve with it? Supervised classification or regression? In general it
is very costly to come by very large labeled datasets for supervised
learning.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
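
To make the "embarrassingly parallel" case above concrete, here is a minimal sketch of a grid search with cross validation in which every (parameter, fold) pair is an independent task. It is only an illustration under stated assumptions, not scikit-learn code: the toy 1-D ridge model, the synthetic data, and the names make_data, fit_ridge_1d, evaluate and grid_search are all hypothetical, and a thread pool stands in for what would be worker processes or cluster nodes in a real setup.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
import random


def make_data(n=200, seed=0):
    # Hypothetical synthetic data: y = 3 * x + Gaussian noise.
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def fit_ridge_1d(xs, ys, alpha):
    # Closed-form ridge fit of y ~ w * x (no intercept):
    # w = sum(x * y) / (sum(x^2) + alpha)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


def evaluate(task):
    # One independent unit of work: fit on all folds but one and score
    # on the held-out fold. No communication with other tasks is needed.
    alpha, fold, n_folds, xs, ys = task
    pairs = list(zip(xs, ys))
    train = [p for i, p in enumerate(pairs) if i % n_folds != fold]
    test = [p for i, p in enumerate(pairs) if i % n_folds == fold]
    w = fit_ridge_1d([x for x, _ in train], [y for _, y in train], alpha)
    mse = sum((y - w * x) ** 2 for x, y in test) / len(test)
    return alpha, mse


def grid_search(xs, ys, alphas, n_folds=5, max_workers=4):
    # Build the full list of (parameter, fold) tasks up front.
    tasks = [(a, f, n_folds, xs, ys)
             for a, f in product(alphas, range(n_folds))]
    # Threads are used here only to keep the sketch self-contained;
    # the tasks could equally be shipped to processes or remote nodes.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(evaluate, tasks))
    # The only "reduce" step: average the fold scores per parameter.
    scores = {a: 0.0 for a in alphas}
    for a, mse in results:
        scores[a] += mse / n_folds
    best = min(scores, key=scores.get)
    return best, scores


xs, ys = make_data()
alphas = [0.0, 0.1, 1.0, 10.0, 100.0]
best_alpha, scores = grid_search(xs, ys, alphas)
print("best alpha:", best_alpha)
```

The tasks only meet again in the final averaging step, which is why this kind of workload spreads easily over many workers. On a single machine, scikit-learn's own GridSearchCV exposes an n_jobs parameter for the same purpose. The SGD case is harder precisely because the workers would have to exchange parameters during the fit rather than once at the end.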
