Hello community, I would like to introduce a new package that should be of interest to scikit-learn users who work with the Spark framework or with distributed systems.
It provides the following, among other tools:

- training and evaluation of multiple scikit-learn models in parallel
- seamless conversion of Spark DataFrames into NumPy arrays
- (experimental) distribution of SciPy sparse matrices as a dataset of sparse vectors

Spark-sklearn focuses on problems that have a small amount of data and that can be run in parallel. Note that this package distributes simple tasks such as grid-search cross-validation; it does not distribute individual learning algorithms (unlike Spark MLlib).

If you want to use it, see the instructions on the package page: https://github.com/databricks/spark-sklearn

This blog post contains more details: https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-spark.html

Let us know if you have any questions. Documentation and code contributions are also most welcome (Apache 2.0 license).

Cheers,
Tim and Joseph
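P.S. To see why grid-search cross-validation distributes so naturally, note that each point in the parameter grid is an independent model fit. Here is a minimal stdlib-only sketch of expanding a scikit-learn-style parameter grid into such independent tasks (the function name `expand_grid` is our own, not part of spark-sklearn's API):

```python
from itertools import product

def expand_grid(param_grid):
    """Expand a scikit-learn-style parameter grid (dict of lists) into
    the list of independent parameter combinations."""
    keys = sorted(param_grid)
    return [dict(zip(keys, values))
            for values in product(*(param_grid[k] for k in keys))]

# Each dict below is one independent fit -- the unit of work that a
# distributed grid search can ship to a separate worker.
tasks = expand_grid({"max_depth": [3, 5], "n_estimators": [10, 50]})
print(len(tasks))  # 4 combinations, each trainable in parallel
```

Because the fits share no state, they can be scheduled on a cluster in any order and the best-scoring combination collected at the end.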