2013/7/17 Harold Nguyen <[email protected]>:
> Hi Oliver,
>
> Thank you very much. Could this potentially take a long time? Is there a way
> to do batch processing, or parallel computing? (a la Mahout-ish?)

Some algorithms can be parallelized, for instance on a small to medium
IPython.parallel cluster. See this talk for instance:

http://lanyrd.com/2013/pydata/scfxpf/

and this tutorial:

https://github.com/ogrisel/parallel_ml_tutorial

You can also use PySpark to fit linear models in parallel, as discussed in
this thread:

https://groups.google.com/d/msg/spark-users/qyWltnB4NW0/QgXseskiVWsJ

It really all depends on what you are trying to achieve, what kind of
features you have and how much labeled data you have. If you give us more
details we might be able to give more specific insights.

But my advice is, again, to first work on an offline extraction of your
database that fits in memory and do your analytics there. You can fit a lot
of data in RAM on beefy machines nowadays. Then think about scaling the
promising feature extraction methods + predictive modelling algorithms
afterwards.

You should really focus on making something simple and correct on a single
machine, on a random sub-sample of your data, first and then think about
scaling it later. Otherwise you will waste many CPU hours debugging a
scalable program that yields crappy predictive accuracy.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
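
PS: to make the "random sub-sample that fits in memory" advice concrete, here
is a minimal sketch of one possible way to do the extraction. It uses reservoir
sampling, which is my choice for illustration and not something prescribed
above, and only the Python standard library. The function name
`reservoir_sample`, the seed and the fake `rows` generator are all
hypothetical; in practice `rows` would be whatever iterator your database
driver gives you.

```python
import random


def reservoir_sample(rows, k, seed=0):
    """Return up to k rows drawn uniformly at random from an iterable.

    The iterable is read once and its length does not need to be known,
    so it can be a database cursor that is too large to load entirely.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # randint bounds are inclusive: 0 <= j <= i.
            j = rng.randint(0, i)
            if j < k:
                # Row i replaces a kept row with probability k / (i + 1).
                sample[j] = row
    return sample


# Hypothetical stand-in for a cursor yielding (feature, label) rows.
rows = ((float(i), i % 2) for i in range(1_000_000))
subsample = reservoir_sample(rows, k=10_000)
```

The resulting `subsample` is a plain list that fits in RAM, so you can iterate
quickly on feature extraction and model fitting on a single machine before
deciding whether anything needs to be scaled out.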
