Hi,

> But the question is how to make the scikit-learn code, decisionTree Regressor 
> for example, running in distributed computing mode, to benefit the power of 
> Spark?

I am sorry, but you can't. The tree implementation in scikit-learn was
not designed for this use case.

Maybe you should have a look at MLlib
(https://spark.apache.org/mllib/), which implements a bunch of machine
learning algorithms (including forests) on top of Spark.
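
For what it is worth, here is a minimal sketch of what a decision tree
regressor could look like with the DataFrame-based MLlib API (pyspark.ml).
It assumes a recent Spark (2.0 or later) with pyspark installed. The helper
names make_rows and fit_tree are made up for illustration, and the
"local[2]" master is only a placeholder for a real cluster URL.

```python
def make_rows(features, targets):
    """Pair each feature list with its target as (features, label) tuples."""
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same length")
    return [([float(v) for v in x], float(y)) for x, y in zip(features, targets)]


def fit_tree(rows, max_depth=3, master="local[2]"):
    """Fit an MLlib decision tree regressor and return its training predictions.

    Hypothetical helper: `master` is a placeholder, point it at your cluster.
    """
    # Imported here so the helpers above can be used without pyspark installed.
    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.regression import DecisionTreeRegressor

    spark = SparkSession.builder.master(master).appName("tree-sketch").getOrCreate()
    try:
        # MLlib expects one vector column ("features") and one numeric "label".
        df = spark.createDataFrame(
            [(Vectors.dense(x), y) for x, y in rows],
            ["features", "label"],
        )
        # The training itself is distributed by Spark across the executors.
        model = DecisionTreeRegressor(maxDepth=max_depth).fit(df)
        predictions = model.transform(df).select("prediction").collect()
        return [row.prediction for row in predictions]
    finally:
        # Stops the session obtained above; drop this if you share a session.
        spark.stop()
```

Calling fit_tree(make_rows(X, y)) on a few rows is enough to try it on one
machine. Note that this is a different implementation from the scikit-learn
one, so do not expect the same parameters or exactly the same trees.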

Best,
Gilles

On 12 September 2015 at 20:11, Rex X <dnsr...@gmail.com> wrote:
> What is the best way to migrate existing scikit-learn code to a PySpark
> cluster? Then we could bring together the full power of both scikit-learn
> and Spark to do scalable machine learning.
>
> Currently I use Python's multiprocessing module to speed things up, but
> this only works on a single node and only while the data set is small.
>
> For many real cases, we may need to deal with gigabytes or even terabytes of
> data, with thousands of raw categorical attributes, which can lead to
> millions of discrete features, using 1-of-k representation.
>
> For these cases, one solution is to use distributed memory. That's why I am
> considering Spark. And Spark supports Python!
> With PySpark, we can import scikit-learn.
>
> But the question is how to make the scikit-learn code, decisionTree
> Regressor for example, running in distributed computing mode, to benefit the
> power of Spark?
>
>
> Best,
> Rex
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
