2013/11/26 Uri Laserson <[email protected]>:
> Hi all,
>
> I was wondering whether there has been any organized effort to create
> scikit-learn estimators that are backed by Spark clusters. Rather than
> using the PySpark API to call sklearn functions, you would instantiate
> sklearn estimators that end up calling PySpark functionality in their .fit()
> methods.

Most sklearn models expect the data to fit in a contiguous chunk of
memory, both for simplicity and for efficiency reasons. Building learning
algorithms that can deal efficiently with cluster-partitioned data is much
more complex and will probably never make it into scikit-learn directly
(sklearn will stay a library and not become a framework with dependencies
on external cluster computing frameworks like Spark, Hadoop YARN or
IPython.parallel).

However, that does not mean that it is impossible to train sklearn models
on distributed memory and CPU resources with Spark. There are several ways
to do it, for instance:

- Example #1: Models that naturally support out-of-core learning and model
  averaging, such as linear models:

  Write a PySpark map-able function that uses the partial_fit method of
  models such as SGDClassifier to incrementally learn a bunch of linear
  models in parallel on buffered chunks of a large, cluster-partitioned
  RDD, and return the incrementally updated linear models after having
  seen a configurable number of samples.

  Write a PySpark reduce-able function that averages a sequence of linear
  models (compute the mean coef_ attribute and the mean intercept_
  attribute) and returns the averaged linear model.

  Make it possible to ship the averaged model as a broadcast variable to
  warm-start the first map-side function and iterate for another pass over
  the data.

  Wrap all of the above as a Spark-aware helper class (possibly
  implementing the fit / predict API but taking a Spark RDD as input data
  instead of in-memory numpy arrays).

- Example #2: Ensembles of in-core models trained on random partitions of
  the data:

  Write a PySpark-aware helper function that shuffles the samples of an
  RDD with a fixed cluster-wide PRNG seed (if that does not already exist;
  it is probably data format dependent).

  Write a PySpark map-able function that buffers a largish (e.g. a couple
  of GB) segment of the RDD into a homogeneous numpy array and trains
  in-core a sub-ensemble model that requires fast random access to the
  memory, such as an ExtraTreesClassifier with a small number of
  sub-estimators.

  Write a PySpark reduce function that aggregates a sequence of
  sub-ensembles into a final bigger ensemble, such as a large forest of
  trees. It's quite easy to do with the current sklearn API:

  https://github.com/pydata/pyrallel/blob/master/pyrallel/ensemble.py#L27-L59

  Wrap all of the above as a Spark-aware helper class (possibly
  implementing the fit / predict API but taking a Spark RDD as input data
  instead of in-memory numpy arrays).

At some point some helpers, like a public common model aggregation API,
might make it into scikit-learn, while Spark-specific stuff will likely
live outside of the main sklearn codebase, just as IPython.parallel-specific
stuff currently lives outside of it in my experimental pyrallel pet project:

https://github.com/pydata/pyrallel/

If people are interested in working on such a spark-sklearn project on
github, please let us know on this ML. Maybe we could organize a coding
sprint right after Strata in February, for instance?

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
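
For what it's worth, the map and reduce steps of Example #1 could look roughly like the sketch below. It runs locally, with plain Python lists standing in for RDD partitions, so it is only an illustration of the averaging idea and not a tested Spark integration. The names fit_partition and merge_models, the (features, label) row layout and the chunk_size parameter are all hypothetical choices made for this sketch.

```python
import copy
from functools import reduce

import numpy as np
from sklearn.linear_model import SGDClassifier


def fit_partition(rows, base_model, classes, chunk_size=1000):
    """Map side: incrementally fit a copy of base_model on one partition.

    rows is an iterable of (features, label) pairs (assumed layout).
    Returns a list holding one (model, weight) pair, or an empty list
    if the partition had no rows.
    """
    model = copy.deepcopy(base_model)  # keeps warm-start coefficients, if any
    X_buf, y_buf, seen = [], [], 0
    for features, label in rows:
        X_buf.append(features)
        y_buf.append(label)
        if len(X_buf) == chunk_size:
            model.partial_fit(np.asarray(X_buf), np.asarray(y_buf), classes=classes)
            seen += len(X_buf)
            X_buf, y_buf = [], []
    if X_buf:  # flush the last, possibly smaller, chunk
        model.partial_fit(np.asarray(X_buf), np.asarray(y_buf), classes=classes)
        seen += len(X_buf)
    return [(model, 1)] if seen else []


def merge_models(left, right):
    """Reduce side: running mean of coef_ and intercept_.

    Each argument is a (model, weight) pair where weight counts how many
    models were already averaged into it, so the final result is the
    plain mean over all per-partition models.
    """
    model_a, n_a = left
    model_b, n_b = right
    total = n_a + n_b
    merged = copy.deepcopy(model_a)
    merged.coef_ = (n_a * model_a.coef_ + n_b * model_b.coef_) / total
    merged.intercept_ = (n_a * model_a.intercept_ + n_b * model_b.intercept_) / total
    return merged, total


# Local stand-in for a partitioned dataset: 4 "partitions" of synthetic rows.
rng = np.random.RandomState(0)
X = rng.randn(2000, 5)
y = (X[:, 0] - X[:, 1] > 0).astype(int)
classes = np.array([0, 1])
partitions = [list(zip(X[i::4], y[i::4])) for i in range(4)]

# First pass: fit one model per partition, then average them.
base = SGDClassifier(random_state=0)
fitted = [pair for part in partitions
          for pair in fit_partition(part, base, classes, chunk_size=100)]
avg_model, n_models = reduce(merge_models, fitted)

# Second pass: warm-start every partition from the averaged model.
fitted_2 = [pair for part in partitions
            for pair in fit_partition(part, avg_model, classes, chunk_size=100)]
avg_model_2, _ = reduce(merge_models, fitted_2)
print("accuracy after two passes:", avg_model_2.score(X, y))
```

On a real cluster the two list comprehensions would presumably become something like rdd.mapPartitions(...) followed by .reduce(merge_models), with the averaged model shipped back as a broadcast variable for the next pass, but I have not run that part.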

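
Similarly, the reduce step of Example #2 can be sketched by concatenating the fitted trees of several small forests. This is a local illustration in the spirit of the pyrallel code linked above, not a copy of it: fit_sub_ensemble and combine are hypothetical names, the partitions are plain numpy slices rather than RDD segments, and it assumes every partition contains samples from every class so that the sub-ensembles agree on classes_.

```python
import copy
from functools import reduce

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier


def fit_sub_ensemble(X, y, n_estimators=5, seed=0):
    """Map side: train a small in-core forest on one buffered partition."""
    model = ExtraTreesClassifier(n_estimators=n_estimators, random_state=seed)
    return model.fit(X, y)


def combine(left, right):
    """Reduce side: merge two fitted forests into one bigger forest."""
    if not np.array_equal(left.classes_, right.classes_):
        raise ValueError("sub-ensembles were trained on different class sets")
    merged = copy.copy(left)  # shallow copy: the fitted trees are shared, not duplicated
    merged.estimators_ = left.estimators_ + right.estimators_
    merged.n_estimators = len(merged.estimators_)
    return merged


# Local stand-in for shuffled RDD segments: 4 interleaved slices of the data.
rng = np.random.RandomState(0)
X = rng.randn(800, 6)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
partitions = [(X[i::4], y[i::4]) for i in range(4)]

# A different seed per partition so that the sub-ensembles are not identical.
subs = [fit_sub_ensemble(Xp, yp, n_estimators=5, seed=i)
        for i, (Xp, yp) in enumerate(partitions)]
forest = reduce(combine, subs)
print("trees in the combined forest:", len(forest.estimators_))
```

Because every sub-ensemble here has the same number of trees, the combined forest's predict_proba is just the mean of the sub-ensembles' probabilities. This relies on the forest reading its trees from estimators_ at predict time, which is an implementation detail rather than a documented aggregation API.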