Hi all,

This is a follow-up to summarize the discussion [1] and the related
findings.

The basic idea behind $subject was to start using Spark DataFrames
(replacing RDDs) and the "Pipeline" concept in WSO2 ML. The reasons for
moving to DataFrames and Pipelines are:

   - DataFrames are much faster than RDDs, since operations on them go
   through the Catalyst query optimizer.
   - A Pipeline makes it possible to define and plug in different stages of
   the ML model-building flow (such as preprocessing, transforming,
   training, and validating). It eliminates the need to wire these steps up
   manually, and it also ensures that training and test data go through
   identical steps. (A minimal sketch of a pipeline follows this list.)
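
To make the Pipeline concept concrete, here is a minimal sketch in Java
against the Spark 1.6 API. The column names and the choice of
LogisticRegression are only illustrative; WSO2 ML would plug in its own
stages:

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.DataFrame;

public class PipelineSketch {

    public static PipelineModel buildModel(DataFrame training) {
        // Index the string label column into numeric labels
        StringIndexer labelIndexer = new StringIndexer()
                .setInputCol("species")
                .setOutputCol("label");

        // Assemble the numeric feature columns into a single vector column
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{"sepalLength", "sepalWidth"})
                .setOutputCol("features");

        // Learner stage; consumes the "features" and "label" columns
        LogisticRegression lr = new LogisticRegression().setMaxIter(10);

        // Chain the stages; fit() runs them in order on the training data
        Pipeline pipeline = new Pipeline()
                .setStages(new PipelineStage[]{labelIndexer, assembler, lr});
        return pipeline.fit(training);
    }
}

The returned PipelineModel replays the exact same transformations on any
DataFrame passed to its transform() method, which is how the identical
handling of training and test data is guaranteed.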

Following are the pros and cons that we have identified in integrating the
above into the ML product.

*Pros:*

   - Can import CSV files directly into DataFrames. (Reduces several steps
   from the current implementation; see the sketch after this list.)
   - Can perform transformations and encodings directly on a DataFrame. No
   need to convert DataFrames to RDDs for transformations.
   - Provides a rich set of built-in operations/transformations:
      - Can add/remove columns to/from an existing DataFrame.
      - Can drop rows containing NULLs in a DataFrame, or filter out only
      the set of rows that satisfies a certain condition.
      - Supports a set of mathematical operations on the columns of a
      DataFrame, hence advanced transformations can be implemented.
   - Can address NLP scenarios. Basically, ANY transformation can be done
   using UDFs on a DataFrame (sample topic mapper using regex: [2]).
   - Can merge two or more DataFrames.
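
Most of the above operations can be seen in the following rough sketch
(again against Spark 1.6, using the external spark-csv package for the CSV
import; the file path, column names, and the UDF are made up for the
example):

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

import static org.apache.spark.sql.functions.callUDF;

public class DataFrameOpsSketch {

    public static DataFrame preprocess(SQLContext sqlContext) {
        // Import a CSV file directly into a DataFrame (spark-csv package)
        DataFrame df = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .option("inferSchema", "true")
                .load("/path/to/data.csv");

        // Drop rows containing NULLs, then keep only the rows that
        // satisfy a condition
        DataFrame cleaned = df.na().drop().filter(df.col("age").gt(18));

        // Register a UDF, derive a new column with it, and drop the old one
        sqlContext.udf().register("toUpper", new UDF1<String, String>() {
            @Override
            public String call(String value) {
                return value.toUpperCase();
            }
        }, DataTypes.StringType);

        return cleaned
                .withColumn("nameUpper", callUDF("toUpper", cleaned.col("name")))
                .drop("name");
    }
}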


*Cons:*

   - Not all the algorithms that are currently available in WSO2 ML are
   supported by the pipeline OOB. [3]
      - However, it is possible to wrap those unsupported algorithms with
      the org.apache.spark.ml.PredictionModel
      <http://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/ml/PredictionModel.html>
      abstract class and use them in the pipeline. ([4] is a sample; a
      minimal sketch also appears after this list.)
   - The output model from a pipeline does not support PMML export.
   - A Spark runtime is needed for prediction (i.e. for the Prediction API
   and the CEP extension). The reason: at prediction time, a DataFrame has
   to be passed to the pipeline model, and a Spark context (i.e. an
   SQLContext) is needed to create DataFrames. (A sketch of this is given
   below as well.)
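
For reference, a minimal sketch of the wrapping approach mentioned under
the first con is given below (shaped after [4]). Only the model side is
shown; the corresponding Estimator (a Predictor subclass) is omitted for
brevity, and the class name is made up:

import org.apache.spark.ml.PredictionModel;
import org.apache.spark.ml.param.ParamMap;
import org.apache.spark.mllib.classification.SVMModel;
import org.apache.spark.mllib.linalg.Vector;

// Expose an mllib SVMModel through the spark.ml PredictionModel
// contract, so that it can sit inside a Pipeline.
public class SVMPipelineModel extends PredictionModel<Vector, SVMPipelineModel> {

    private final String uid;
    private final SVMModel svmModel; // the wrapped mllib model

    public SVMPipelineModel(String uid, SVMModel svmModel) {
        this.uid = uid;
        this.svmModel = svmModel;
    }

    @Override
    public String uid() {
        return uid;
    }

    @Override
    public double predict(Vector features) {
        // Delegate the actual prediction to the underlying mllib model
        return svmModel.predict(features);
    }

    @Override
    public SVMPipelineModel copy(ParamMap extra) {
        return copyValues(new SVMPipelineModel(uid, svmModel), extra);
    }
}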

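And the following sketch shows why a Spark runtime is unavoidable at
prediction time (the last con): even a single row to be predicted has to be
wrapped in a DataFrame before it can be handed to the PipelineModel, and
creating that DataFrame requires an SQLContext. Schema and column names are
again illustrative:

import java.util.Collections;

import org.apache.spark.ml.PipelineModel;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class PredictionSketch {

    public static double predictOne(SQLContext sqlContext, PipelineModel model,
                                    double sepalLength, double sepalWidth) {
        // Build a single-row DataFrame carrying the feature values
        StructType schema = new StructType(new StructField[]{
                DataTypes.createStructField("sepalLength", DataTypes.DoubleType, false),
                DataTypes.createStructField("sepalWidth", DataTypes.DoubleType, false)
        });
        Row row = RowFactory.create(sepalLength, sepalWidth);
        DataFrame input = sqlContext.createDataFrame(
                Collections.singletonList(row), schema);

        // transform() appends a "prediction" column; read it back
        return model.transform(input).select("prediction").first().getDouble(0);
    }
}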

Thus, would like to know how we are going ahead with this.

[1] "[Engineering]Invitation: ML Pipeline Discussion @ Tue Jan 12, 2016 2pm
- 3pm"
[2]
https://github.com/SupunS/play-ground/blob/aaa5697e3c5f40c492382128130a86dc1def8e09/test.spark.client_2/src/main/java/RegexTransformer.java
[3]
https://docs.google.com/spreadsheets/d/16mH5VacNhRWoRtTk2fEnZ5g5ABtLNXaJwoKXxGgLh2A/edit#gid=0
[4]
https://github.com/SupunS/play-ground/blob/master/test.spark.client_2/src/main/java/SVMClassifier.java


Regards,
Supun

-- 
*Supun Sethunga*
Software Engineer
WSO2, Inc.
http://wso2.com/
lean | enterprise | middleware
Mobile : +94 716546324