[Architecture] Spark Pipeline Support for ML

Supun Sethunga Mon, 18 Jan 2016 22:17:11 -0800

Hi all,

This is a follow-up to, and a summary of, the discussion [1] and the
related findings.

The basic idea behind $subject is to start using Spark DataFrames
(replacing RDDs) and the "Pipeline" concept in WSO2 ML. The reasons for
moving to DataFrames and Pipelines are:

- DataFrames are generally faster than RDDs, since DataFrame operations
are planned by Spark's query optimizer (Catalyst).
- A Pipeline lets us define and plug in the different stages of an ML
model-building scenario (such as preprocessing, transforming, training,
validating, etc.). It eliminates the need to wire these steps together
manually, and it also ensures that training and test data go through
identical steps.
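
To illustrate the last point, here is a minimal plain-Java sketch of the
idea. It is NOT Spark code: ToyPipeline, Stage, MeanCenter and MaxAbsScale
are hypothetical names made up for this example. Each stage learns its
parameters from the training data only, and the fitted stages are then
applied unchanged to any other data set. (In Spark, Pipeline.fit() plays
the role of fit() here and returns a PipelineModel that is used for the
transform step.)

```java
import java.util.Arrays;
import java.util.List;

/** Toy illustration of the pipeline idea; all names are hypothetical, not Spark API. */
public class ToyPipeline {

    /** A stage learns parameters from training data, then transforms any data set. */
    public interface Stage {
        void fit(double[] data);
        double[] transform(double[] data);
    }

    /** Subtracts the mean learned from the training data. */
    public static class MeanCenter implements Stage {
        private double mean;
        public void fit(double[] data) {
            mean = Arrays.stream(data).average().orElse(0.0);
        }
        public double[] transform(double[] data) {
            return Arrays.stream(data).map(v -> v - mean).toArray();
        }
    }

    /** Divides by the largest absolute value seen in the training data. */
    public static class MaxAbsScale implements Stage {
        private double maxAbs = 1.0;
        public void fit(double[] data) {
            double m = Arrays.stream(data).map(Math::abs).max().orElse(1.0);
            maxAbs = (m == 0.0) ? 1.0 : m; // avoid division by zero
        }
        public double[] transform(double[] data) {
            return Arrays.stream(data).map(v -> v / maxAbs).toArray();
        }
    }

    private final List<Stage> stages;

    public ToyPipeline(Stage... stages) {
        this.stages = Arrays.asList(stages);
    }

    /** Fits each stage in order on the output of the stages before it. */
    public void fit(double[] train) {
        double[] current = train;
        for (Stage stage : stages) {
            stage.fit(current);
            current = stage.transform(current);
        }
    }

    /** Applies the already fitted stages, in the same order, to any data. */
    public double[] transform(double[] data) {
        double[] current = data;
        for (Stage stage : stages) {
            current = stage.transform(current);
        }
        return current;
    }

    public static void main(String[] args) {
        ToyPipeline pipeline = new ToyPipeline(new MeanCenter(), new MaxAbsScale());
        pipeline.fit(new double[] {2.0, 4.0, 6.0});
        // Training and test data pass through exactly the same fitted steps.
        System.out.println(Arrays.toString(pipeline.transform(new double[] {2.0, 4.0, 6.0})));
        System.out.println(Arrays.toString(pipeline.transform(new double[] {8.0})));
    }
}
```

The test value 8.0 is centered and scaled with the mean (4.0) and maximum
absolute value (2.0) learned from the training data, not with statistics
of its own, which is exactly the guarantee we want from a pipeline.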

Following are the pros and cons that we have identified in integrating the
above into the ML product.

*Pros:*

- Can import CSV files directly into DataFrames. (Removes several steps
from the current implementation.)
- Can perform transformations and encodings directly on a DataFrame. No
need to convert DataFrames to RDDs for transformations.
- Provides a rich set of built-in operations/transformations:
   - Can add/remove columns to/from an existing DataFrame.
   - Can drop rows containing NULLs in a DataFrame. Alternatively, can
   filter out only the set of rows that satisfy a certain condition.
   - Supports a set of mathematical operations on Columns in
   DataFrames, hence advanced transformations can be implemented.
- Can address NLP scenarios. Basically, can do ANY transformation using
UDFs on a DataFrame (sample topic mapper using regex: [2]).
- Can merge two or more DataFrames.

*Cons:*

- Not all the algorithms currently available in WSO2 ML are supported
by the pipeline OOB. [3]
   - But it is possible to wrap those unsupported algorithms in a
   subclass of org.apache.spark.ml.PredictionModel
   <http://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/ml/PredictionModel.html>
   and use them in the pipeline ([4] is a sample).
- The output model from a pipeline does not support PMML.
- The Spark runtime is needed for prediction (i.e. for the Prediction API
and the CEP extension). Reason: for prediction, a DataFrame has to be
passed to the pipeline model, and a Spark Context (i.e. SQL Context) is
needed to create DataFrames.

Thus, we would like to know how we are going ahead with this.

[1] "[Engineering]Invitation: ML Pipeline Discussion @ Tue Jan 12, 2016 2pm
- 3pm"
[2]
https://github.com/SupunS/play-ground/blob/aaa5697e3c5f40c492382128130a86dc1def8e09/test.spark.client_2/src/main/java/RegexTransformer.java
[3]
https://docs.google.com/spreadsheets/d/16mH5VacNhRWoRtTk2fEnZ5g5ABtLNXaJwoKXxGgLh2A/edit#gid=0
[4]
https://github.com/SupunS/play-ground/blob/master/test.spark.client_2/src/main/java/SVMClassifier.java

Regards,
Supun

--
*Supun Sethunga*
Software Engineer
WSO2, Inc.
http://wso2.com/
lean | enterprise | middleware
Mobile : +94 716546324

_______________________________________________
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
