Hi,

While working on a Spark pipeline PoC (a regex transformer), I've run into the following limitations when using DataFrames and defining new pipeline stages:
1. There's no way to load a CSV file directly into a DataFrame, so we have to read the file as an RDD, map it into an RDD of Rows, and pair that with a schema to create the DataFrame (sketch [1] at the end of this mail).

2. When implementing the Transformer interface to define a new pipeline stage, transform() is the method that returns a DataFrame, but it also only accepts a DataFrame. If we think of a DataFrame as a table in a relational database: in my case I have to convert the DataFrame into an RDD to replace the matching values with the given label, and I couldn't find a way to convert that RDD back into a DataFrame without going through an SQLContext (sketch [2]). There's another way to do this inside a pipeline stage without using an RDD (by collecting the DataFrame into an array of Rows), but I'm still working on converting that back into a DataFrame.

3. Since transformSchema() is what returns the schema at the start of each stage, we have to keep a record of every change we make to the schema (e.g. in a separate list) so that we can pass it to the next stage; otherwise the next stage receives the original schema instead of the changed one (sketch [3]).

I'm still working on this task and will update the thread accordingly. Any suggestions and comments are appreciated.

Thanks,

--
Nethaji Chandrasiri
Software Engineering Intern; WSO2, Inc.; http://wso2.com
Mobile: +94 (0) 779171059
Email: [email protected]
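
[1] A minimal sketch of the CSV-to-DataFrame workaround from point 1 (Spark 1.x APIs; the file path, column names, and the naive comma split are placeholders I made up for the example):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object CsvToDataFrame {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("csv-poc").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Read the file as plain text and split each line into fields
    // (naive split: no quoting/escaping support)
    val rowRDD = sc.textFile("data/input.csv")
      .map(_.split(",", -1))
      .map(fields => Row(fields: _*))

    // The schema has to be declared by hand
    val schema = StructType(Seq(
      StructField("id", StringType, nullable = false),
      StructField("text", StringType, nullable = true)))

    val df = sqlContext.createDataFrame(rowRDD, schema)
    df.show()

    sc.stop()
  }
}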
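[2] Roughly what my transform() does for point 2: drop down to the RDD, do the regex replacement, and rebuild the DataFrame through the SQLContext reachable from the input DataFrame, which is the round trip I mentioned. This is only a sketch against Spark 1.4-style spark.ml APIs; the class and parameter names are mine, not library classes:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.StructType

// Hypothetical stage: wherever `pattern` matches in `inputCol`,
// substitute `label`
class RegexLabelTransformer(override val uid: String,
                            pattern: String,
                            label: String,
                            inputCol: String) extends Transformer {

  def this(pattern: String, label: String, inputCol: String) =
    this(Identifiable.randomUID("regexLabel"), pattern, label, inputCol)

  override def transform(dataset: DataFrame): DataFrame = {
    val idx = dataset.schema.fieldIndex(inputCol)
    val regex = pattern.r

    // Drop down to the underlying RDD[Row] to do the replacement...
    val replaced = dataset.rdd.map { row =>
      val values = row.toSeq.toArray
      val cell = row.getString(idx)
      if (cell != null) values(idx) = regex.replaceAllIn(cell, label)
      Row(values: _*)
    }

    // ...and rebuild the DataFrame; the SQLContext hanging off the
    // input DataFrame is the handle for the round trip
    dataset.sqlContext.createDataFrame(replaced, transformSchema(dataset.schema))
  }

  // Values change but the column types don't, so pass the schema through
  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}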
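[3] On point 3: if I understand the contract correctly, whatever transformSchema() returns is what the pipeline hands to the next stage, so declaring the change inside transformSchema() itself may avoid the separate bookkeeping list. A sketch of that (again Spark 1.4-style, made-up names), this time writing the labeled text to a new output column via withColumn and a UDF, which would also sidestep both the RDD round trip and the Row array since the work stays distributed:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical variant: append a new column instead of mutating rows,
// so no RDD (or Row array) is needed at all
class RegexLabelColumn(override val uid: String,
                       pattern: String,
                       label: String,
                       inputCol: String,
                       outputCol: String) extends Transformer {

  def this(pattern: String, label: String, inputCol: String, outputCol: String) =
    this(Identifiable.randomUID("regexLabelCol"), pattern, label, inputCol, outputCol)

  override def transform(dataset: DataFrame): DataFrame = {
    // Null-safe regex replacement as a UDF
    val replace = udf { (s: String) =>
      Option(s).map(v => pattern.r.replaceAllIn(v, label)).orNull
    }
    dataset.withColumn(outputCol, replace(dataset(inputCol)))
  }

  // The schema returned here is what the next stage sees, so the new
  // column must be declared here
  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField(outputCol, StringType, nullable = true))

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}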
