What is the best way to migrate existing scikit-learn code to a PySpark cluster? That way we could bring together the full power of both scikit-learn and Spark to do scalable machine learning.
Currently I use Python's multiprocessing module to boost speed, but that only works on a single node, and only while the data set is small. Many real cases involve gigabytes or even terabytes of data, with thousands of raw categorical attributes that can expand into millions of discrete features under a 1-of-k encoding. For those cases one solution is distributed memory, which is why I am considering Spark; and Spark supports Python. With PySpark we can import scikit-learn. But the question is how to make scikit-learn code, DecisionTreeRegressor for example, run in distributed mode so that it benefits from the power of Spark?

Best,
Rex
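P.S. To make the question concrete, here is a minimal sketch of the kind of pattern I have in mind: broadcast a (small, in-memory) data set and let Spark parallelize many independent scikit-learn fits, e.g. a max_depth search for DecisionTreeRegressor. This is only an illustration under my own assumptions (an existing Spark installation, toy data from make_regression, and a made-up depth grid), not a claim about the recommended way to do it:

from pyspark import SparkContext
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

sc = SparkContext(appName="sklearn-depth-search")

# Toy data, small enough to broadcast to every worker; for data that
# does not fit on one machine this pattern no longer applies.
X, y = make_regression(n_samples=1000, n_features=20, random_state=0)
bX = sc.broadcast(X)
by = sc.broadcast(y)

def evaluate(max_depth):
    # Runs on a worker: one independent, purely local scikit-learn fit.
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    score = cross_val_score(model, bX.value, by.value, cv=3).mean()
    return (max_depth, score)

# Spark distributes the candidate depths; each task trains locally.
results = sc.parallelize([2, 4, 6, 8, 10]).map(evaluate).collect()
best_depth, best_score = max(results, key=lambda t: t[1])
print("best max_depth=%d, CV score=%.3f" % (best_depth, best_score))

sc.stop()

Of course, this only parallelizes model selection, not a single fit; for one tree trained on terabytes, I suspect a natively distributed learner is needed (MLlib has its own DecisionTree), which is really the heart of my question.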