Hello, I have a question about dummy coding using DictVectorizer or FeatureHasher.
```
>>> from sklearn.feature_extraction import DictVectorizer, FeatureHasher
>>> D = [{'age': 23, 'gender': 'm'},
...      {'age': 34, 'gender': 'f'},
...      {'age': 18, 'gender': 'f'},
...      {'age': 50, 'gender': 'm'}]
>>> m1 = FeatureHasher(n_features=10)
>>> m1.fit_transform(D).toarray()
array([[  0.,   0.,  -1.,   0.,   0.,   0.,   0.,   0.,   0.,  23.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,  34.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,  18.],
       [  0.,   0.,  -1.,   0.,   0.,   0.,   0.,   0.,   0.,  50.]])
>>> m2 = DictVectorizer(sparse=False)
>>> m2.fit_transform(D)
array([[ 23.,   0.,   1.],
       [ 34.,   1.,   0.],
       [ 18.,   1.,   0.],
       [ 50.,   0.,   1.]])
>>> m2.feature_names_
['age', 'gender=f', 'gender=m']
```

Since both DictVectorizer and FeatureHasher generate separate dimensions for 'gender=m' and 'gender=f', these dimensions are perfectly correlated. This is because, by default, both DictVectorizer and FeatureHasher generate n dimensions for the n categorical values of a single feature. My questions are as follows:

1. I would expect them to generate n-1 dimensions for n categorical values. Is there any way to do this with DictVectorizer or FeatureHasher?
2. How should I handle these correlated dimensions? As I understand it, training on data with collinearity makes predictions unstable. Will L1 or L2 regularization work around this problem?

If there is any issue or article related to these questions, would you please send me the URL?

Thank you.

Regards,
Yusuke
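To make the redundancy concrete, here is a small sketch (using numpy alongside the DictVectorizer output above; the manual column-dropping workaround is just my own guess, not an official API) showing that the two gender columns always sum to one, and how one of them could be removed by hand:

```python
# Sketch: demonstrate that 'gender=f' and 'gender=m' are perfectly
# collinear in the DictVectorizer output, then drop one column by hand.
import numpy as np
from sklearn.feature_extraction import DictVectorizer

D = [{'age': 23, 'gender': 'm'},
     {'age': 34, 'gender': 'f'},
     {'age': 18, 'gender': 'f'},
     {'age': 50, 'gender': 'm'}]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(D)
# columns (sorted by name): ['age', 'gender=f', 'gender=m']

# The two indicator columns always sum to 1, so either one is
# fully determined by the other -- this is the collinearity.
print(np.all(X[:, 1] + X[:, 2] == 1))  # True

# Hypothetical n-1 workaround: delete one indicator column manually,
# keeping only 'age' and 'gender=m'.
X_reduced = np.delete(X, 1, axis=1)
print(X_reduced.shape)  # (4, 2)
```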
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn