For me, option 1 sounds less error-prone.

On Tue, May 12, 2015 at 10:20 AM, Supun Sethunga <[email protected]> wrote:
> Hi all,
>
> I was looking at how Spark treats categorical data. It seems Spark expects
> the categorical features to be one-hot encoded, and uses the sparsity of
> the vector to identify whether a feature is categorical or continuous.
> (Decision Tree is an exception. There we have to explicitly define
> which features are categorical.)
>
> Hence we need to do $subject. I could think of three possible options:
>
> 1. Encode the features through a Spark transformation, when the data
>    points are passed to train the model.
>    - *Limitation*: If we encode the data during training, we need to
>      store the encoding details, to be used when predicting. BUT, when we
>      pass a data point to a Spark transformation, what it returns is the
>      transformed data points (to be sent to the next filter/transformation).
>      Hence we cannot retrieve the encoding details, unless we store them
>      in a third-party location during training.
> 2. Provide a separate functionality for data encoding.
>    - Physically encode the data, then and there.
>    - OR, ask the user how to encode each categorical feature, but do not
>      encode at that point. Rather, use that information and encode as in
>      (1). So we can use the same information to encode during prediction
>      as well. (This way we don't have to iterate twice through the
>      dataset either.)
>
> WDYT? (2.b sounds like a good option to me.) What other options do we have?
>
> Thanks,
> Supun
>
> --
> *Supun Sethunga*
> Software Engineer
> WSO2, Inc.
> http://wso2.com/
> lean | enterprise | middleware
> Mobile : +94 716546324
>

--
Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
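
For reference, here is a minimal sketch of the idea behind option 2.b: the encoding information is collected up front and kept as a plain lookup, so the same lookup can be applied to rows at training time and again at prediction time. It is plain Python for illustration only, not Spark's API or our implementation. The names `build_encoders` and `encode_row` are hypothetical, as is the assumption that the user supplies the list of category values per column.

```python
from typing import Dict, List, Sequence


def build_encoders(categories: Dict[int, Sequence[str]]) -> Dict[int, Dict[str, int]]:
    """Map each categorical column index to a {category value: slot} lookup.

    `categories` is the (assumed) user-supplied information: for every
    categorical column, the ordered list of values it can take.
    """
    return {
        col: {value: slot for slot, value in enumerate(values)}
        for col, values in categories.items()
    }


def encode_row(row: Sequence[object], encoders: Dict[int, Dict[str, int]]) -> List[float]:
    """One-hot encode the categorical columns of a row; copy the others as floats."""
    encoded: List[float] = []
    for col, value in enumerate(row):
        if col in encoders:
            slots = [0.0] * len(encoders[col])
            # Raises KeyError for a category that was not declared up front.
            slots[encoders[col][value]] = 1.0
            encoded.extend(slots)
        else:
            encoded.append(float(value))
    return encoded


# The same lookup is reused for training rows and for prediction rows.
encoders = build_encoders({1: ["red", "green", "blue"]})
training_row = encode_row([5.1, "green", 2.0], encoders)
prediction_row = encode_row([4.0, "blue", 1.5], encoders)
```

Since `encoders` is an ordinary dictionary, it could be stored alongside the trained model, which is the "encoding details" concern raised under option 1.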
_______________________________________________
Dev mailing list
[email protected]
http://wso2.org/cgi-bin/mailman/listinfo/dev
