Hi Arijit,

PySpark and SystemML are complementary and serve different purposes. PySpark primarily operates on a collection of datapoints (i.e. an RDD) or a DataFrame and exposes the Spark programming model (i.e. transformations and actions). SystemML primarily operates on matrices and provides a wide variety of linear algebra operators required for implementing machine learning algorithms. Personally, I would use PySpark for data preprocessing and SystemML for training/prediction (YMMV!!). As an example, in our breast cancer project we use PySpark APIs in
https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/Preprocessing.ipynb
and SystemML APIs in
https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/MachineLearning.ipynb

Yes, some operations (such as distinct) can be done in both SystemML and PySpark, in which case you should choose the one that best fits your needs.
PySpark ML (or MLlib) is closer to SystemML in scope. I agree with you that there are not enough comparisons out there, probably because benchmarking ML systems is non-trivial. For an apples-to-apples comparison, you need to compare both the accuracy and the runtime performance of a given ML model on a variety of datasets. I am using the term "accuracy" broadly, so please refer to http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics. Also, since different ML systems use different optimization algorithms (e.g. SGD, conjugate gradient, direct solve, ...), one needs to reason about hyperparameters as well as convergence behavior before making a judgement.

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar

PS: SystemML has recently added support for frames ( http://apache.github.io/incubator-systemml/dml-language-reference.html#frames ), which simplifies common data transformation operations such as recoding, dummy coding, binning and handling of missing values.

From: arijit chakraborty <ak...@hotmail.com>
To: "dev@systemml.incubator.apache.org" <dev@systemml.incubator.apache.org>
Date: 04/17/2017 08:50 AM
Subject: Distinct Item of a column

Hi,

I'm curious to know what's the advantage of systemML over pyspark? Especially in terms of performance. I tried looking for some reading on it, but hardly could find one.

Thank you!

Arijit