Hi, I'm trying to make the HashingVectorizer work for online learning. To do this, I need it to return actual token counts.
The HashingVectorizer in scikit-learn doesn't return raw token counts; by default it returns normalized counts (l2 by default, with l1 also available). I need the raw token counts, so I set norm=None. After doing this I no longer get decimals, but I still get negative numbers. It seems the negatives can be removed by setting non_negative=True, which takes the absolute value of each entry. However, I don't understand why the negatives are there in the first place, or what they mean, and I'm not sure whether the absolute values correspond to the token counts.

Can someone please help explain what the HashingVectorizer is doing? How do I get the HashingVectorizer to return token counts?

You can replicate my results with the following code. I'm using the 20newsgroups dataset, which comes with scikit-learn:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import HashingVectorizer

    twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)

    # default settings (norm='l2'): produces normalized, non-integer values
    cv = HashingVectorizer(stop_words='english')
    X_train = cv.fit_transform(twenty_train.data)
    print(X_train)

    # norm=None: produces integer results, both positive and negative
    cv = HashingVectorizer(stop_words='english', norm=None)
    X_train = cv.fit_transform(twenty_train.data)
    print(X_train)

    # non_negative=True: produces only positive results,
    # but not sure if they correspond to counts
    cv = HashingVectorizer(stop_words='english', norm=None, non_negative=True)
    X_train = cv.fit_transform(twenty_train.data)
    print(X_train)
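To make the question more concrete, here is a pure-Python sketch of how I currently imagine a signed hashing trick could produce negative values. This is only my guess at the mechanism, not scikit-learn's actual code: the function name `signed_hash_counts`, its `alternate_sign` flag, the use of `zlib.crc32` as the hash, and the choice of the top hash bit as the sign are all assumptions made for illustration.

```python
import zlib
from collections import Counter


def signed_hash_counts(tokens, n_features=16, alternate_sign=True):
    """Toy hashing trick: each token picks a column and adds +1 or -1 there.

    Hypothetical helper for illustration only; not scikit-learn's API.
    """
    row = [0] * n_features
    for tok in tokens:
        # Stand-in hash (assumption): CRC32, returned as an unsigned 32-bit int.
        h = zlib.crc32(tok.encode("utf-8"))
        col = h % n_features
        # Assumption: one bit of the hash decides the sign of the update.
        sign = -1 if (alternate_sign and (h >> 31) & 1) else 1
        row[col] += sign
    return row


tokens = "the cat sat on the mat the end".split()

signed = signed_hash_counts(tokens)                          # may contain negatives
unsigned = signed_hash_counts(tokens, alternate_sign=False)  # only non-negative

print("signed:  ", signed)
print("unsigned:", unsigned)
print("true counts:", Counter(tokens))
```

If this mental model is roughly right, then with the sign turned off each nonzero entry would be a token count only as long as no two tokens land in the same column, and taking absolute values of the signed row could undercount when tokens with opposite signs collide. I would appreciate confirmation or correction on whether that matches what HashingVectorizer really does.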
