Hi,

While working on a Spark pipeline PoC (a regex transformer), I've run into the following limitations when using DataFrames and defining new pipeline stages:
1. There's no way to load a CSV file directly into a DataFrame, so we have to read the file as an RDD, map it into an RDD of Rows, and pair that with a schema to create the DataFrame (sketch [1] at the end of this mail).

2. When implementing the Transformer interface to define a new pipeline stage, transform() is the method that returns a DataFrame, but it also only accepts a DataFrame. If we think of a DataFrame as a table in a relational database: in my case I have to convert the DataFrame into an RDD to replace the matching values with the given label, and I couldn't find a way to convert that RDD back into a DataFrame without going through an SQLContext (sketch [2]). There's another way to do this inside a pipeline stage without using an RDD (by collecting the DataFrame into an array of Rows), but I'm still working on converting that back into a DataFrame.

3. Since transformSchema() is what returns the schema at the start of each stage, we have to keep a record of every change we make to the schema (e.g. in a separate list) so that we can pass it to the next stage; otherwise the next stage receives the original schema instead of the changed one (sketch [3]).

I'm still working on this task and will update the thread accordingly. Any suggestions and comments are appreciated.

Thanks,

--
Nethaji Chandrasiri
Software Engineering Intern; WSO2, Inc.; http://wso2.com
Mobile: +94 (0) 779171059
Email: [email protected]
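
[1] A minimal sketch of the CSV-to-DataFrame workaround from point 1 (Spark 1.x APIs; the file path, column names, and the naive comma split are placeholders I made up for the example):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object CsvToDataFrame {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("csv-poc").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Read the file as plain text and split each line into fields
    // (naive split: no quoting/escaping support)
    val rowRDD = sc.textFile("data/input.csv")
      .map(_.split(",", -1))
      .map(fields => Row(fields: _*))

    // The schema has to be declared by hand
    val schema = StructType(Seq(
      StructField("id", StringType, nullable = false),
      StructField("text", StringType, nullable = true)))

    val df = sqlContext.createDataFrame(rowRDD, schema)
    df.show()

    sc.stop()
  }
}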
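[2] Roughly what my transform() does for point 2: drop down to the RDD, do the regex replacement, and rebuild the DataFrame through the SQLContext reachable from the input DataFrame, which is the round trip I mentioned. This is only a sketch against Spark 1.4-style spark.ml APIs; the class and parameter names are mine, not library classes:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.StructType

// Hypothetical stage: wherever `pattern` matches in `inputCol`,
// substitute `label`
class RegexLabelTransformer(override val uid: String,
                            pattern: String,
                            label: String,
                            inputCol: String) extends Transformer {

  def this(pattern: String, label: String, inputCol: String) =
    this(Identifiable.randomUID("regexLabel"), pattern, label, inputCol)

  override def transform(dataset: DataFrame): DataFrame = {
    val idx = dataset.schema.fieldIndex(inputCol)
    val regex = pattern.r

    // Drop down to the underlying RDD[Row] to do the replacement...
    val replaced = dataset.rdd.map { row =>
      val values = row.toSeq.toArray
      val cell = row.getString(idx)
      if (cell != null) values(idx) = regex.replaceAllIn(cell, label)
      Row(values: _*)
    }

    // ...and rebuild the DataFrame; the SQLContext hanging off the
    // input DataFrame is the handle for the round trip
    dataset.sqlContext.createDataFrame(replaced, transformSchema(dataset.schema))
  }

  // Values change but the column types don't, so pass the schema through
  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}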
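[3] On point 3: if I understand the contract correctly, whatever transformSchema() returns is what the pipeline hands to the next stage, so declaring the change inside transformSchema() itself may avoid the separate bookkeeping list. A sketch of that (again Spark 1.4-style, made-up names), this time writing the labeled text to a new output column via withColumn and a UDF, which would also sidestep both the RDD round trip and the Row array since the work stays distributed:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical variant: append a new column instead of mutating rows,
// so no RDD (or Row array) is needed at all
class RegexLabelColumn(override val uid: String,
                       pattern: String,
                       label: String,
                       inputCol: String,
                       outputCol: String) extends Transformer {

  def this(pattern: String, label: String, inputCol: String, outputCol: String) =
    this(Identifiable.randomUID("regexLabelCol"), pattern, label, inputCol, outputCol)

  override def transform(dataset: DataFrame): DataFrame = {
    // Null-safe regex replacement as a UDF
    val replace = udf { (s: String) =>
      Option(s).map(v => pattern.r.replaceAllIn(v, label)).orNull
    }
    dataset.withColumn(outputCol, replace(dataset(inputCol)))
  }

  // The schema returned here is what the next stage sees, so the new
  // column must be declared here
  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField(outputCol, StringType, nullable = true))

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}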
