What is the best way to migrate existing scikit-learn code to a PySpark
cluster? Then we could bring together the full power of both scikit-learn
and Spark to do scalable machine learning.

Currently I use Python's multiprocessing module to boost the speed, but
this only works on a single node, and only while the data set is small.
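To show what I mean, here is roughly what I do today (a minimal sketch;
the toy data and the train_one helper are made up for illustration, my
real code trains on the full data set):

```python
import numpy as np
from multiprocessing import Pool
from sklearn.tree import DecisionTreeRegressor

def train_one(max_depth):
    # Toy stand-in for my real training set (illustration only).
    rng = np.random.RandomState(0)
    X = rng.rand(200, 5)
    y = rng.rand(200)
    # Each process fits one independent model for one parameter setting.
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    model.fit(X, y)
    return max_depth, model.score(X, y)

if __name__ == "__main__":
    # Fan the parameter grid out over local worker processes.
    with Pool(4) as pool:
        results = pool.map(train_one, [2, 4, 6, 8])
    for depth, score in results:
        print(depth, score)
```

This gives a speedup, but only up to the cores and memory of one machine.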

For many real cases, we may need to deal with gigabytes or even terabytes
of data, with thousands of raw categorical attributes, which can lead to
millions of discrete features under a 1-of-k (one-hot) representation.

For these cases, one solution is to use distributed memory; that is why I
am considering Spark. Spark supports Python, and with PySpark we can
import scikit-learn.

But the question is: how can scikit-learn code, DecisionTreeRegressor for
example, be made to run in distributed mode so that it benefits from the
power of Spark?
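The only pattern I can think of so far is the one below (a sketch only;
fit_and_score and the toy data are assumptions of mine): broadcast the
training set and let each Spark task train one independent scikit-learn
model, e.g. one per hyperparameter setting. But that only distributes
embarrassingly-parallel work; it does not distribute a single tree fit
over a data set too large for one node, which is what I am really after.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_and_score(max_depth, X, y):
    # Runs on a Spark executor: trains one independent model.
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    model.fit(X, y)
    return max_depth, model.score(X, y)

if __name__ == "__main__":
    from pyspark import SparkContext
    sc = SparkContext(appName="sklearn-grid")
    # Toy stand-in for the training set (illustration only).
    rng = np.random.RandomState(0)
    X = rng.rand(1000, 5)
    y = rng.rand(1000)
    # Broadcast the (small) training set once to every executor.
    bX, by = sc.broadcast(X), sc.broadcast(y)
    # One Spark task per hyperparameter setting.
    results = (sc.parallelize([2, 4, 6, 8, 10])
                 .map(lambda d: fit_and_score(d, bX.value, by.value))
                 .collect())
    print(results)
    sc.stop()
```

Is there a better way than this, or does distributing a single model
require switching to Spark's own MLlib implementation?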


Best,
Rex
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
