2012/7/19 Viktor Pekar <[email protected]>:
> Dear all,
>
> I am trying to find information on the use of scikit-learn on very large
> datasets, i.e. if and how it can be used in a distributed processing setup.
> I saw that PiCloud has scikit-learn installed in their environment, and this
> post suggests it can be deployed on PiCloud:
> http://stackoverflow.com/questions/9653060/amazon-ec2-vs-picloud. But I
> can't find any details on how scalable it is, and whether it is advisable at
> all to use scikit-learn in such situations. So any advice and pointers will
> be much appreciated.

It depends on what you want to achieve. Some machine learning tasks
are embarrassingly parallel (grid search over parameters with
cross-validation for model selection, or fitting random forests), others
are not that easily parallelizable (e.g. fitting a model with stochastic
gradient descent, as you need synchronization steps, a.k.a. inter-node
communication, to average the parameters while learning), and others are
not parallelizable at all (e.g. fitting a kernel SVM with the SMO
algorithm, AFAIK).
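
To make the first two cases concrete, here is a rough sketch in plain
Python. It is only an illustration, not scikit-learn code: every function
name in it is made up for the example, the model is a toy one-parameter
linear regression, and a thread pool stands in for the nodes of a cluster.
The grid search workers never talk to each other, while the averaged SGD
loop has to stop and combine the parameters after every round.

```python
from concurrent.futures import ThreadPoolExecutor
import random


def make_data(n, seed):
    """Toy data for y = 3 * x + noise (hypothetical example problem)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3.0 * x + rng.gauss(0, 0.1)) for x in xs]


def sgd_epoch(w, shard, lr):
    """One pass of plain SGD on squared error over a single data shard."""
    for x, y in shard:
        w -= lr * (w * x - y) * x
    return w


def validation_error(lr, train, valid):
    """Fit with one learning rate and return the mean squared error."""
    w = sgd_epoch(0.0, train, lr)
    return sum((w * x - y) ** 2 for x, y in valid) / len(valid)


def parallel_grid_search(lrs, train, valid):
    """Embarrassingly parallel: each candidate is evaluated independently."""
    k = len(lrs)
    with ThreadPoolExecutor() as pool:
        errors = list(pool.map(validation_error, lrs, [train] * k, [valid] * k))
    # Pick the learning rate with the lowest validation error.
    return min(zip(errors, lrs))[1]


def averaged_sgd(shards, rounds=20, lr=0.1):
    """Parameter averaging: workers must synchronize after every round."""
    k = len(shards)
    w = 0.0
    with ThreadPoolExecutor(max_workers=k) as pool:
        for _ in range(rounds):
            # Each worker starts from the shared w and sees only its shard.
            local = list(pool.map(sgd_epoch, [w] * k, shards, [lr] * k))
            # Synchronization step: on a cluster this is the inter-node
            # communication that makes SGD harder to distribute.
            w = sum(local) / k
    return w


shards = [make_data(200, seed) for seed in range(4)]
best_lr = parallel_grid_search([0.001, 0.1], shards[0], shards[1])
w = averaged_sgd(shards)
```

The cost of the second pattern grows with the number of rounds, since
every round waits for the slowest worker before averaging. The first
pattern has no such barrier, which is why it scales out so easily.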

Right now we have neither high-level tools nor documentation for
distributed setups, but this is on my personal todo list and it's
probably the same for other scikit-learn developers.

BTW, what counts as a "very large dataset" in your case? What is the
task you want to achieve with it? Supervised classification or
regression? In general it's very costly to come by very large labeled
datasets for supervised learning.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
