2013/11/26 Uri Laserson <[email protected]>:
> Hi all,
>
> I was wondering whether there has been any organized effort to create
> scikit-learn estimators that are backed by Spark clusters.  Rather than
> using the PySpark API to call sklearn functions, you would instantiate
> sklearn estimators that end up calling PySpark functionality in their .fit()
> methods.

Most sklearn models expects the data to fit in a contiguous chunk of
memory both for simplicity and efficiency reason. Building learning
algorithms that are able to efficiently deal with cluster partitioned
data is much more complex and will probably never make it into
scikit-learn directly (sklearn will stay a lib and not become a
framework with dependencies on external cluster computing framework
like spark, hadoop yarn or IPython.parallel).

However that does not mean that it is not possible to train sklearn
models on distributed memory and CPU resources with and spark. There
are several ways to do it, for instance:

- Example #1: Models that naturally support out-of-core learning and
model averaging such as linear models:

Write a PySpark map-able function that uses the partial_fit method of
models such as SGDClassifier to incrementally learn a bunch of linear
models on buffered chunks of a large (cluster partitioned RDD) in
parallel and return the incrementally updated linear models after
having seen a configurable number of samples.

Write a PySpark reduce-able function that averages a sequence of
linear models (compute the mean coef_ attribute and the mean
intercept_ attribute) and return the average linear model.

Make it possible to ship the averaged model as a broadcast variable to
warm-start the first map-side function and iterate for another pass
over the data.

Wrap all of the above as Spark-aware helper class (possibly
implementing the fit / predict API but taking spark RDD as input data
instead of in-memory numpy arrays).

- Example #2: Ensembles of in-core models trained on random partitions
of the data:

Write a PySpark aware helper function that shuffles the samples of an
RDD with for a fixed cluster-wide PRNG seed (if that does not already
exists, probably data format dependent).

Write a PySpark map-able function that buffers a largish (e.g. couple
of GB) segment of the RDD into an homogeneous numpy array and train
in-core a sub ensemble model such as ExtraTreesClassifier with a small
number of sub estimators that requires fast random access to the
memory.

Write a PySpark reduce function that aggregates a sequence of
sub-ensembles into a final big ensemble such as a bigger ensemble such
as a large forest of trees, it's quite easy to do it with the current
sklearn API:

https://github.com/pydata/pyrallel/blob/master/pyrallel/ensemble.py#L27-L59

Wrap all of the above as Spark-aware helper class (possibly
implementing the fit / predict API but taking spark RDD as input data
instead of in-memory numpy arrays).

At some point some helpers like a public common model aggregation API
might make it into scikit-learn, while spark specific stuff will
likely live outside of the main sklearn codebase such as
IPython.parallel specific stuff currently live outside of the main
sklearn codebase such as my experimental pyrallel pet-project:
https://github.com/pydata/pyrallel/

If people are interested in working on such a spark-sklearn project on
github, please let us know on this ML. Maybe we could organize a
coding sprint right after Strata in February for instance?

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

------------------------------------------------------------------------------
Rapidly troubleshoot problems before they affect your business. Most IT 
organizations don't have a clear picture of how application performance 
affects their revenue. With AppDynamics, you get 100% visibility into your 
Java,.NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro!
http://pubads.g.doubleclick.net/gampad/clk?id=84349351&iu=/4140/ostg.clktrk
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Reply via email to