One-hot encoding by nature requires the input at transform time to have
the same feature dimension as the input used for fitting.

Take a look at DictVectorizer (
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html#sklearn.feature_extraction.DictVectorizer),
which ignores unknown (new) feature values at transform time, so they
simply come out as zeros and the output width matches what was seen at fit.
Also FeatureHasher (
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher),
which is an approximate (and bounded-memory) version of DictVectorizer: it
hashes feature names into a fixed number of columns, so it needs no fitting.
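
To make that concrete, here is a minimal sketch of both options. The
column names ("city", "plan") and the toy rows are made up for
illustration and are not from your data, and n_features=16 is an
arbitrary choice; I am also assuming your categorical columns can be
turned into one dict per row.

```python
from sklearn.feature_extraction import DictVectorizer, FeatureHasher

# Hypothetical toy data: one dict per sample, categorical values as strings.
train_rows = [
    {"city": "London", "plan": "free"},
    {"city": "Paris", "plan": "pro"},
]
# "Berlin" never appears in the training rows.
new_rows = [{"city": "Berlin", "plan": "pro"}]

# DictVectorizer one-hot encodes string values as "name=value" columns.
vec = DictVectorizer(sparse=True)
X_train = vec.fit_transform(train_rows)
X_new = vec.transform(new_rows)  # unseen "city=Berlin" is dropped silently

print(X_train.shape, X_new.shape)  # both have the same number of columns

# FeatureHasher is stateless: the number of columns is fixed up front.
# The "name=value" keys are built by hand here so that each category
# hashes to its own (possibly colliding) column.
def to_tokens(row):
    return {"%s=%s" % (key, value): 1 for key, value in row.items()}

hasher = FeatureHasher(n_features=16, input_type="dict")
H_train = hasher.transform(to_tokens(row) for row in train_rows)
H_new = hasher.transform(to_tokens(row) for row in new_rows)

print(H_train.shape, H_new.shape)
```

With DictVectorizer you would save the fitted vectorizer alongside the
model and reuse it for new data. With FeatureHasher there is nothing to
save, at the cost of possible hash collisions and columns you cannot map
back to feature names.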

On Tue, Nov 17, 2015 at 8:19 AM, Startup Hire <blrstartuph...@gmail.com>
wrote:

> Hi Pypers,
>
> Hope you are doing well.
>
> I am doing multi label classification in which my X and Y are sparse
> matrices with Y properly binarized.
>
> I was able to train a multi label classifier with 12338 features. I
> saved the model and tried to use it for prediction on new data.
>
> This is the issue I am facing:
>
>
>    - The number of features in the model is quite different from the
>      number in the new data. This is because OneHotEncoding of the
>      categorical variables leads to a different # of features on the
>      training data vs the new data.
>
>
> Let me know in what ways this can be resolved. Should I make any
> upstream changes?
>
>
> Regards,
>
> Sanant
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
