[Dev] One-Hot encoding for ML input data

Supun Sethunga Mon, 11 May 2015 21:52:10 -0700

Hi all,

I was looking at how Spark treats categorical data. It seems Spark expects
the categorical features to be one-hot encoded, and uses the sparsity of
the vector to identify whether a feature is categorical or continuous.
(Decision Tree is an exception: there we have to explicitly define
which features are categorical.)


Hence we need to do $subject. I can think of three possible options:

   1. Encode the features through a Spark transformation, when the data
   points are passed to train the model.
      - *Limitation*: If we encode the data during training, we need to
      store the encoding details, to be used when predicting. BUT, when
      we pass a data point to a Spark transformation, what it returns is
      the transformed data point (to be sent to the next
      filter/transformation). Hence we cannot retrieve the encoding
      details, unless we store them in a third-party location during
      training.
   2. Provide a separate functionality for data encoding.
      - Physically encode the data, then and there.
      - OR, ask the user how to encode each categorical feature, but do
      not encode at that point; rather, use that information to encode as
      in (1). That way we can use the same information to encode during
      prediction as well. (This way we don't have to iterate twice through
      the dataset either.)
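
To make option 2.b a bit more concrete, here is a rough sketch in plain
Python (not Spark code). It assumes the user-supplied encoding information
can be kept as a simple map from feature index to its list of categories.
The names EncodingSpec, encode_row and spec are made up for illustration,
and treating an unseen category as an all-zero block is just one possible
choice.

```python
from typing import Any, Dict, List, Sequence

# Hypothetical encoding spec supplied by the user up front:
# feature index -> ordered list of that feature's categories.
EncodingSpec = Dict[int, List[str]]


def encode_row(row: Sequence[Any], spec: EncodingSpec) -> List[float]:
    """One-hot encode the categorical columns of a single data point."""
    encoded: List[float] = []
    for index, value in enumerate(row):
        if index in spec:
            # Categorical feature: emit one slot per known category.
            # An unseen category gives an all-zero block (assumption).
            encoded.extend(1.0 if value == c else 0.0 for c in spec[index])
        else:
            # Continuous feature: pass through unchanged.
            encoded.append(float(value))
    return encoded


# The spec lives outside the transformation, so it can be stored with
# the model and reused later.
spec: EncodingSpec = {1: ["red", "green", "blue"]}

# Training: encode each point inside the per-row transformation
# (with Spark this could be the function passed to a map over the data).
training_rows = [(5.1, "red"), (4.9, "blue")]
encoded_training = [encode_row(r, spec) for r in training_rows]

# Prediction: the same spec encodes the incoming data point.
encoded_query = encode_row((6.0, "green"), spec)
```

Since the spec is known before training starts, the encoding happens in
the same pass that feeds the model, and nothing has to be recovered from
inside the transformation afterwards.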

WDYT? (2.b sounds like a good option to me.) What other options do we have?

Thanks,
Supun

-- 
*Supun Sethunga*
Software Engineer
WSO2, Inc.
http://wso2.com/
lean | enterprise | middleware
Mobile : +94 716546324

_______________________________________________
Dev mailing list
[email protected]
http://wso2.org/cgi-bin/mailman/listinfo/dev
