2013/7/17 Harold Nguyen <[email protected]>:
> Hi Oliver,
>
> Thank you very much. Could this potentially take a long time? Is there a
> way to do batch processing, or parallel computing? (a la Mahout-ish?)

Some algorithms can be parallelized, for instance on a small to medium
IPython.parallel cluster.

See this talk for instance:

http://lanyrd.com/2013/pydata/scfxpf/

and this tutorial:

https://github.com/ogrisel/parallel_ml_tutorial

You can also use PySpark to fit linear models in parallel, as discussed
in this thread:

https://groups.google.com/d/msg/spark-users/qyWltnB4NW0/QgXseskiVWsJ

It really all depends on what you are trying to achieve, what kind of
features you have and how much labeled data you have. If you give
us more details we might be able to give more specific insights.

But my advice, again, is to first work on an offline extract of
your database that fits in memory and do your analytics there. You can
fit a lot of data in RAM on beefy machines nowadays. Then think about
scaling the promising feature extraction methods + predictive
modelling algorithms afterwards.

You should really focus on making something simple and correct on a
single machine on a random sub-sample of your data first and then
think about scaling it later. Otherwise you will waste many CPU hours
running and debugging a scalable program that yields crappy predictive
accuracy.
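
As an illustration of the random sub-sample step, here is a minimal
sketch of reservoir sampling, which is one possible way (not the only
one) to draw a uniform random sample from a database dump that is too
large to load at once. It assumes the rows can be read as a stream; the
function name reservoir_sample and the fake row generator are
hypothetical and only serve the example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int,
                     seed: Optional[int] = None) -> List[T]:
    """Return up to k rows chosen uniformly at random from a stream.

    The stream is read once and only k rows are kept in memory, so the
    full data set never has to fit in RAM (hypothetical helper).
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Row i replaces a kept row with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


# Stand-in for rows streamed from a database export (assumption).
rows = ((i, i % 7) for i in range(100_000))
subset = reservoir_sample(rows, k=1_000, seed=0)
```

The resulting subset fits in memory and can then be handed to whatever
feature extraction and model you want to try on a single machine.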

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
