One-hot encoding by nature requires the feature dimension at transform time to match the one seen during fitting.
Take a look at DictVectorizer
(http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html#sklearn.feature_extraction.DictVectorizer),
which ignores unknown (new) feature values at transform time, encoding them
as all zeros, so the output keeps the same number of columns. Also consider
FeatureHasher
(http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher),
which is an approximate (and bounded-memory) version of DictVectorizer. A
minimal sketch of both follows the quoted message below.

On Tue, Nov 17, 2015 at 8:19 AM, Startup Hire <blrstartuph...@gmail.com> wrote:
> Hi Pypers,
>
> Hope you are doing well.
>
> I am doing multi-label classification, in which my X and Y are sparse
> matrices, with Y properly binarized.
>
> I was able to complete multi-label classification with 12338 features. I
> saved the model and used it for prediction on new data.
>
> This is the issue I am facing:
>
> - The number of features in the model is quite different from that of the
>   new data. This is because one-hot encoding of the categorical variables
>   leads to a different number of features on the training data vs. the
>   new data.
>
> Let me know in what ways this can be resolved. Should I make any
> upstream changes?
>
> Regards,
> Sanant
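For illustration, here is a minimal sketch of both approaches. The toy data,
feature names, and the n_features value are made up for the example:

from sklearn.feature_extraction import DictVectorizer, FeatureHasher

# Toy data: the test set contains a category ("Tokyo") never seen at fit time.
train = [{"city": "London", "temp": 12.0}, {"city": "Paris", "temp": 18.0}]
test = [{"city": "Tokyo", "temp": 15.0}]

# DictVectorizer learns the feature space at fit time; an unseen category
# at transform time is silently dropped (its columns stay zero), so the
# matrix width does not change between train and test.
vec = DictVectorizer()
X_train = vec.fit_transform(train)
X_test = vec.transform(test)
print(X_train.shape, X_test.shape)  # (2, 3) (1, 3)

# FeatureHasher is stateless: the output width is fixed by n_features,
# so train and test always agree regardless of which categories appear.
# String values are hashed as "feature=value" pairs with an implicit 1.
hasher = FeatureHasher(n_features=1024)
X_train_h = hasher.transform(train)
X_test_h = hasher.transform(test)
print(X_train_h.shape, X_test_h.shape)  # (2, 1024) (1, 1024)

The trade-off: DictVectorizer keeps exact, inspectable feature names but
drops never-seen categories, while FeatureHasher can never hit a dimension
mismatch but may merge unrelated features through hash collisions.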