Hi Arijit,

PySpark and SystemML are complementary and serve different purposes.
PySpark primarily operates on a collection of data points (i.e. an RDD) or a
DataFrame and exposes the Spark programming model (i.e. transformations and
actions). SystemML primarily operates on matrices and provides a wide variety
of linear algebra operators required for implementing Machine Learning
algorithms. Personally, I would use PySpark for data preprocessing and
SystemML for training/prediction (YMMV!!). As an example: in our breast
cancer project, we use PySpark APIs in
https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/Preprocessing.ipynb
 and SystemML APIs in
https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/MachineLearning.ipynb
 ... Yes, some operations (such as distinct) can be done in both SystemML
and PySpark, in which case you should choose the one that best fits your
needs.

PySpark ML (or MLlib) is closer to SystemML. I agree with you that
there are not enough comparisons out there, probably because benchmarking ML
systems is non-trivial. For an apples-to-apples comparison, you need to compare
both the accuracy and the runtime performance of a given ML model on a variety
of datasets. I am using the term "accuracy" broadly, so please refer to
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.
Also, since different ML systems use different optimization algorithms
(e.g. SGD, conjugate gradient, direct solve, ...), one needs to reason
about hyperparameters as well as convergence behavior before making a
judgement.
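
To make the comparison idea concrete, here is a minimal sketch of a benchmark
loop in plain Python. It is only an illustration: the two "systems" below
(majority_class and threshold_at_mean) and the "toy" dataset are hypothetical
stand-ins, not SystemML or PySpark ML calls, and plain accuracy is used as the
metric only for simplicity. In a real comparison each train function would
wrap the actual library, and you would also match hyperparameters and check
convergence before reading anything into the numbers.

```python
import time
from statistics import mean


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return mean(1.0 if t == p else 0.0 for t, p in zip(y_true, y_pred))


def benchmark(systems, datasets):
    """Train and score every system on every dataset.

    systems:  dict name -> train(X, y) that returns a predict(X) callable
    datasets: dict name -> (X_train, y_train, X_test, y_test)
    Returns a list of (system, dataset, accuracy, train_seconds) rows.
    """
    rows = []
    for ds_name, (X_tr, y_tr, X_te, y_te) in datasets.items():
        for sys_name, train in systems.items():
            start = time.perf_counter()
            predict = train(X_tr, y_tr)
            elapsed = time.perf_counter() - start
            rows.append((sys_name, ds_name, accuracy(y_te, predict(X_te)), elapsed))
    return rows


# Hypothetical stand-ins for the ML systems under comparison.
def majority_class(X, y):
    """Always predict the most frequent training label."""
    label = max(set(y), key=y.count)
    return lambda X_new: [label for _ in X_new]


def threshold_at_mean(X, y):
    """Predict 1 when the first feature exceeds its training mean."""
    cut = mean(x[0] for x in X)
    return lambda X_new: [1 if x[0] > cut else 0 for x in X_new]


systems = {
    "majority_class": majority_class,
    "threshold_at_mean": threshold_at_mean,
}

# Made-up toy data, for illustration only.
datasets = {
    "toy": (
        [[0.1], [0.2], [0.3], [0.8], [0.9]],  # X_train
        [0, 0, 0, 1, 1],                      # y_train
        [[0.15], [0.85]],                     # X_test
        [0, 1],                               # y_test
    ),
}

for sys_name, ds_name, acc, secs in benchmark(systems, datasets):
    print(f"{sys_name:18s} {ds_name:5s} accuracy={acc:.2f} train_time={secs:.6f}s")
```

Running every system over the same train/test splits is what keeps the
comparison apples to apples. Timings on data this small are meaningless, so
treat the runtime column as a placeholder for measurements on real datasets.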

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar

PS: SystemML has recently added support for frames (
http://apache.github.io/incubator-systemml/dml-language-reference.html#frames
) that simplify common data transformation operations such as recoding,
dummy coding, binning and handling of missing values.



From:   arijit chakraborty <ak...@hotmail.com>
To:     "dev@systemml.incubator.apache.org"
            <dev@systemml.incubator.apache.org>
Date:   04/17/2017 08:50 AM
Subject:        Distinct Item of a column



Hi,


I'm curious to know what the advantage of SystemML over PySpark is,
especially in terms of performance. I tried looking for some reading on it,
but could hardly find any.


Thank you!

Arijit

