For me, option 1 sounds less error prone.
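
To make the trade-off concrete, here is a minimal sketch of what option 1 could look like if the encoding details are kept alongside the model. It is plain Python rather than Spark code, and the `OneHotEncoder` class, its `fit`/`transform` methods, and the sample rows are all hypothetical names made up for illustration. The point is only that the category-to-index mapping learned during training has to be stored and reused at prediction time.

```python
from typing import Dict, List, Sequence


class OneHotEncoder:
    """Hypothetical helper: learns category -> index maps and reuses them."""

    def __init__(self, categorical_columns: Sequence[int]):
        self.categorical_columns = list(categorical_columns)
        # The "encoded details" that must survive until prediction time.
        self.mappings: Dict[int, Dict[str, int]] = {}

    def fit(self, rows: List[list]) -> "OneHotEncoder":
        # One pass per categorical column to collect its distinct values.
        for col in self.categorical_columns:
            values = sorted({row[col] for row in rows})
            self.mappings[col] = {v: i for i, v in enumerate(values)}
        return self

    def transform(self, row: list) -> List[float]:
        encoded: List[float] = []
        for col, value in enumerate(row):
            mapping = self.mappings.get(col)
            if mapping is None:
                # Continuous feature: pass through unchanged.
                encoded.append(float(value))
                continue
            one_hot = [0.0] * len(mapping)
            # Assumption: a category unseen in training encodes as all zeros.
            if value in mapping:
                one_hot[mapping[value]] = 1.0
            encoded.extend(one_hot)
        return encoded


# Training: learn the mapping once, then encode every training row.
training_rows = [["red", 1.5], ["green", 0.2], ["red", 3.0]]
encoder = OneHotEncoder(categorical_columns=[0]).fit(training_rows)
encoded_training = [encoder.transform(r) for r in training_rows]

# Prediction: the same stored mapping encodes the incoming data point.
encoded_query = encoder.transform(["green", 2.0])
```

If the user declares up front which columns are categorical (as in 2.b), the same object could be built from that declaration and applied inside the training transformation, so the mapping is available in both phases.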

On Tue, May 12, 2015 at 10:20 AM, Supun Sethunga <[email protected]> wrote:

> Hi all,
>
> I was looking at how Spark treats categorical data. It seems Spark expects
> the categorical features to be one-hot encoded, and uses the sparsity of
> the vector to identify whether a feature is categorical or continuous.
> (Decision Tree is an exception: there we have to explicitly define
> which features are categorical.)
>
> Hence we need to do $subject. I could think of three possible options:
>
>    1. Encode the features through a Spark transformation, when the data
>    points are passed to train the model.
>       - *Limitation*: If we encode the data during training, we need to
>       store the encoding details, to be used when predicting. BUT, when we
>       pass a data point to a Spark transformation, what it returns is the
>       transformed data point (to be sent to the next filter/transformation).
>       Hence we cannot retrieve the encoding details, unless we store them in
>       a third-party location during training.
>    2. Provide a separate functionality for data encoding.
>       - Physically encode the data, then and there.
>       - OR, ask the user how to encode each categorical feature, but do not
>       encode at that point; rather, use that information and encode as in
>       (1). So we can use the same information to encode during prediction
>       as well. (This way we don't have to iterate twice through the dataset
>       either.)
>
> WDYT? (2.b sounds like a good option to me.) What other options do we have?
>
> Thanks,
> Supun
>
> --
> *Supun Sethunga*
> Software Engineer
> WSO2, Inc.
> http://wso2.com/
> lean | enterprise | middleware
> Mobile : +94 716546324
>



-- 

Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
_______________________________________________
Dev mailing list
[email protected]
http://wso2.org/cgi-bin/mailman/listinfo/dev
