Hi all,
I was looking at how Spark treats categorical data. It seems Spark expects
categorical features to be one-hot encoded, and uses the sparsity of the
vector to identify whether a feature is categorical or continuous.
(Decision Tree is an exception: there we have to explicitly define which
features are categorical.)
Hence we need to do $subject. I could think of three possible options:
1. Encode the features through a Spark transformation, when the data
points are passed to train the model.
- *Limitation*: If we encode the data during training, we need to
store the encoding details so they can be reused when predicting. BUT, when
we pass a data point to a Spark transformation, what it returns is the
transformed data point (to be sent to the next filter/transformation). Hence
we cannot retrieve the encoding details unless we store them in a
third-party location during training.
2. Provide separate functionality for data encoding.
- Physically encode the data then and there.
- OR, ask the user how to encode each categorical feature, but do not
encode at that point. Instead, use that information to encode as in
(1). That way we can use the same information to encode during
prediction as well. (This way we don't have to iterate twice through the
dataset either.)
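
To make the second sub-option of (2) a bit more concrete, here is a rough sketch in plain Python. It is not Spark API and not a proposed implementation: the names `EncodingSpec` and `encode_row` are made up for illustration, and I am assuming the user declares the list of categories for each categorical column up front. The point is only that the same stored spec drives the encoding at both training and prediction time.

```python
from typing import Dict, List, Sequence

# Hypothetical encoding spec: column index -> ordered list of categories.
# The assumption is that this is collected from the user up front and stored
# with the model, so no extra pass over the dataset is needed to build it.
EncodingSpec = Dict[int, List[str]]


def encode_row(row: Sequence[str], spec: EncodingSpec) -> List[float]:
    """One-hot encode the categorical columns of a single data point."""
    encoded: List[float] = []
    for index, value in enumerate(row):
        if index in spec:
            categories = spec[index]
            one_hot = [0.0] * len(categories)
            # list.index raises ValueError for a category the user did not declare.
            one_hot[categories.index(value)] = 1.0
            encoded.extend(one_hot)
        else:
            # Columns not in the spec are treated as continuous features.
            encoded.append(float(value))
    return encoded


# Column 1 is categorical; columns 0 and 2 are continuous.
spec: EncodingSpec = {1: ["red", "green", "blue"]}

# The same spec is applied when training and when predicting.
training_point = encode_row(["5.1", "green", "0.3"], spec)
prediction_point = encode_row(["4.8", "blue", "0.9"], spec)
```

In a Spark job, a function like this could presumably be applied per data point inside a map transformation, with the spec persisted next to the trained model so that prediction does not depend on anything computed during training.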
WDYT? (2.b, i.e. the second sub-option under 2, sounds like a good option to me.) What other options do we have?
Thanks,
Supun
--
*Supun Sethunga*
Software Engineer
WSO2, Inc.
http://wso2.com/
lean | enterprise | middleware
Mobile : +94 716546324