Thank you, Niketan! Your answer completely answers my question.

Regards,

Arijit

________________________________
From: Niketan Pansare <npan...@us.ibm.com>
Sent: Tuesday, April 18, 2017 12:55:28 AM
To: dev@systemml.incubator.apache.org
Subject: Re: Distinct Item of a column


Hi Arijit,

PySpark and SystemML are complementary and serve different purposes. 
PySpark primarily operates on a collection of datapoints (i.e. RDD) or a 
DataFrame and exposes the Spark programming model (i.e. transformations and 
actions). SystemML primarily operates on matrices and provides a wide variety of 
linear algebra operators required for implementing Machine Learning algorithms. 
Personally, I would use PySpark for data preprocessing and SystemML for 
training/prediction (YMMV!!). As an example: in our breast cancer project, we 
use PySpark APIs in 
https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/Preprocessing.ipynb
 and SystemML APIs in 
https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/MachineLearning.ipynb
 ... Yes, some operations (such as distinct) can be done in both SystemML and 
PySpark, in which case you should choose the one that best fits your needs.

PySpark ML (or MLLib) is closer to SystemML. I agree with you that there 
are not enough comparisons out there, probably because benchmarking ML systems 
is non-trivial. For an apples-to-apples comparison, you need to compare both 
accuracy and runtime performance of a given ML model on a variety of datasets. 
I am using the term "accuracy" broadly, so please refer to 
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics. 
Also, since different ML systems use different optimization algorithms (e.g. 
SGD, conjugate gradient, direct solve, ...), one needs to reason about 
hyperparameters as well as convergence behavior before making a judgement.
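
To make the point about optimizers and hyperparameters concrete, here is a 
minimal sketch in plain Python (standard library only; it uses neither 
SystemML nor PySpark). The synthetic data, the function names and the chosen 
learning rates are all assumptions made up for illustration. It fits the same 
one-weight linear model with a direct solve and with gradient descent under 
two learning rates, and records test error and wall-clock time for each.

```python
import random
import time


def make_data(n=200, true_w=3.0, noise=0.1, seed=0):
    """Synthetic 1-D regression data y = true_w * x + noise (illustrative only)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_w * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def fit_direct(xs, ys):
    """Closed-form least squares for a single weight: w = sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def fit_gradient_descent(xs, ys, learning_rate, iterations):
    """Batch gradient descent on mean squared error, starting from w = 0."""
    w = 0.0
    n = len(xs)
    for _ in range(iterations):
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def benchmark(name, fit, train, test):
    """Return (name, test MSE, wall-clock seconds) for one fitting routine."""
    start = time.perf_counter()
    w = fit(*train)
    elapsed = time.perf_counter() - start
    return name, mse(w, *test), elapsed


# Same train/test split for every method, so the comparison is like for like.
train = make_data(seed=0)
test = make_data(seed=1)

results = [
    benchmark("direct solve", fit_direct, train, test),
    benchmark("GD lr=0.5, 200 iters",
              lambda xs, ys: fit_gradient_descent(xs, ys, 0.5, 200), train, test),
    benchmark("GD lr=0.001, 200 iters",
              lambda xs, ys: fit_gradient_descent(xs, ys, 0.001, 200), train, test),
]

for name, error, seconds in results:
    print(f"{name:24s} test MSE={error:.4f} time={seconds * 1e3:.3f} ms")
```

With the same iteration budget, the small learning rate stops far from the 
optimum, so judging "gradient descent" by that run alone would be misleading. 
A comparison between real systems would need the same care across several 
datasets and metrics.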

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar

PS: SystemML has recently added support for frames 
(http://apache.github.io/incubator-systemml/dml-language-reference.html#frames) 
that simplifies common data transformation operations such as recoding, dummy 
coding, binning and handling of missing values.


From: arijit chakraborty <ak...@hotmail.com>
To: "dev@systemml.incubator.apache.org" <dev@systemml.incubator.apache.org>
Date: 04/17/2017 08:50 AM
Subject: Distinct Item of a column

________________________________



Hi,


I'm curious to know what the advantage of SystemML over PySpark is, especially 
in terms of performance. I tried looking for some reading on it, but could 
hardly find any.


Thank you!

Arijit


