Hi,

After upgrading to scikit-learn 0.18, HashingVectorizer is about 10 times slower.
Before:

    scikit-learn 0.17. Numpy 1.11.2. Python 3.5.2 AMD64
    Vectorizing 20newsgroup 11314 documents
    Vectorization completed in 4.594092130661011 seconds, resulting shape (11314, 1048576)

After upgrade:

    scikit-learn 0.18. Numpy 1.11.2. Python 3.5.2 AMD64
    Vectorizing 20newsgroup 11314 documents
    Vectorization completed in 43.587692737579346 seconds, resulting shape (11314, 1048576)

Code:

    import time, sklearn, platform, numpy
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import HashingVectorizer

    data_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)

    print('scikit-learn {}. Numpy {}. Python {} {}'.format(
        sklearn.__version__, numpy.version.full_version,
        platform.python_version(), platform.machine()))

    vectorizer = HashingVectorizer()
    print("Vectorizing 20newsgroup {} documents".format(len(data_train.data)))
    start = time.time()
    data = vectorizer.fit_transform(data_train.data)
    print("Vectorization completed in ", time.time() - start,
          ' seconds, resulting shape ', data.shape)

Should I submit a bug report?

Thank you,
Gabriel Trautmann