Hi all,

This is to follow up on and summarize the discussion [1] and the related findings.

The basic idea behind $subject was to start using Spark DataFrames (replacing RDDs) and the "Pipeline" concept in WSO2 ML. The reasons for moving to DataFrames and Pipelines are:

- DataFrames are much faster than RDDs.
- A Pipeline makes it possible to define and plug in the different stages of an ML model building scenario (such as preprocessing, transforming, training, validating, etc.). It eliminates the need to create these steps manually. It also ensures that training and test data go through identical steps.

Following are the pros and cons that we have identified in integrating the above into the ML product.

*Pros:*

- Can import CSV files directly into DataFrames. (Removes several steps from the current implementation.)
- Can perform transformations and encodings directly on a DataFrame. No need to convert DataFrames to RDDs for transformations.
- Provides a rich set of built-in operations/transformations:
   - Can add/remove columns to/from an existing DataFrame.
   - Can drop rows containing NULLs in a DataFrame, or else filter out only the set of rows that satisfy a certain condition.
   - Supports a set of mathematical operations on the Columns of a DataFrame, hence advanced transformations can be implemented.
   - Can address NLP scenarios. Basically, can do ANY transformation on a DataFrame using UDFs (sample topic mapper using regex: [2]).
   - Can merge two or more DataFrames.

*Cons:*

- Not all the algorithms that are currently available in WSO2 ML are supported by the pipeline OOB. [3]
   - But it is possible to wrap those unsupported algorithms with the org.apache.spark.ml.PredictionModel <http://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/ml/PredictionModel.html> interface and use them in the pipeline. ([4] is a sample.)
- The output model from a pipeline does not support PMML.
- A Spark runtime is needed for prediction (i.e. for the Prediction API and the CEP extension). Reason: at prediction time a DataFrame has to be passed to the pipeline model, and a Spark Context (i.e. SQL Context) is needed to create DataFrames.
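
To make the Pipeline point above a bit more concrete, here is a minimal plain-Java sketch of the idea. Note that this is NOT the Spark ML API: the class PipelineSketch, the Stage interface and the two sample stages are all made up for illustration, and rows are simplified to plain strings. It only shows why defining the stages once and reusing them guarantees that training and test data go through identical steps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Plain-Java illustration of the pipeline idea. This is NOT the Spark ML API. */
public class PipelineSketch {

    /** Hypothetical stage: takes a list of rows and returns the transformed rows. */
    public interface Stage extends UnaryOperator<List<String>> {}

    /** Stand-in for a preprocessing step: drops null or blank rows. */
    public static Stage dropEmpty() {
        return rows -> {
            List<String> kept = new ArrayList<>();
            for (String row : rows) {
                if (row != null && !row.trim().isEmpty()) {
                    kept.add(row);
                }
            }
            return kept;
        };
    }

    /** Stand-in for a transformation step: trims and lower-cases every row. */
    public static Stage lowerCase() {
        return rows -> {
            List<String> out = new ArrayList<>();
            for (String row : rows) {
                out.add(row.trim().toLowerCase(Locale.ROOT));
            }
            return out;
        };
    }

    /** Applies the stages in order, so every data set gets identical treatment. */
    public static List<String> run(List<Stage> stages, List<String> rows) {
        List<String> current = rows;
        for (Stage stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        // The stages are defined once...
        List<Stage> stages = Arrays.asList(dropEmpty(), lowerCase());

        // ...and reused unchanged for both training and test data.
        List<String> training = Arrays.asList("Sunny", " ", "RAINY", null);
        List<String> test = Arrays.asList("", "Cloudy");

        System.out.println(run(stages, training));
        System.out.println(run(stages, test));
    }
}
```

In a real Spark pipeline the stages would operate on DataFrames rather than lists of strings, but the design point is the same: the sequence of steps lives in one place instead of being repeated by hand for each data set.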

Thus, we would like to know how we are going ahead with this.

[1] "[Engineering]Invitation: ML Pipeline Discussion @ Tue Jan 12, 2016 2pm - 3pm"
[2] https://github.com/SupunS/play-ground/blob/aaa5697e3c5f40c492382128130a86dc1def8e09/test.spark.client_2/src/main/java/RegexTransformer.java
[3] https://docs.google.com/spreadsheets/d/16mH5VacNhRWoRtTk2fEnZ5g5ABtLNXaJwoKXxGgLh2A/edit#gid=0
[4] https://github.com/SupunS/play-ground/blob/master/test.spark.client_2/src/main/java/SVMClassifier.java

Regards,
Supun

--
*Supun Sethunga*
Software Engineer
WSO2, Inc.
http://wso2.com/
lean | enterprise | middleware
Mobile : +94 716546324
_______________________________________________
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture