2013/4/10 Terry Peng <[email protected]>:
> Hi Lars Buitinck,

Replying to the ML, please send this kind of message there next time.

> I thought the order of words is the same as the order of the indices after
> FeatureHasher.transform, but it turns out it's not. The reason is
> sum_duplicates in FeatureHasher:
>
> X = sp.csr_matrix((values, indices, indptr), dtype=self.dtype,
>                   shape=(n_samples, self.n_features))
> X.sum_duplicates()  # also sorts the indices
>
> which was added by your change f5a4ad2bfc3c7e487c3855abfc8f83b670d89d0c (ENH
> speed up hashing and reduce memory usage by 1/3).
> sum_duplicates not only sums the values of duplicated indices, it also
> sorts the indices in natural order (from small to large). I think it's more
> convenient not to sort the indices, so we can easily get the feature back
> from the indices.

I'm not sure what effect that would have on dot products performed with
FeatureHasher output. In the best case, they'd be much slower. In the
worst case, they'd break. Before we implement anything, I'd like to see
how slow/broken the resulting CSR matrices become. Feel free to try it
out and send us a report.

--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
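
For reference, the behaviour discussed in this thread (duplicates summed, indices sorted) can be modelled on a single CSR row in pure Python. This is only a sketch of the effect, not SciPy's implementation: the helper name sum_duplicates_row and the sample indices and values are made up for illustration.

```python
def sum_duplicates_row(indices, values):
    """Model the effect on one CSR row: merge duplicate column
    indices by summing their values, then sort by column index."""
    totals = {}
    for idx, val in zip(indices, values):
        totals[idx] = totals.get(idx, 0.0) + val
    sorted_indices = sorted(totals)
    return sorted_indices, [totals[i] for i in sorted_indices]


# Hypothetical hashed column indices for four tokens, in token order.
# The first and third token collide on column 5.
row_indices = [5, 2, 5, 0]
row_values = [1.0, -1.0, 1.0, 1.0]

merged_indices, merged_values = sum_duplicates_row(row_indices, row_values)
print(merged_indices)  # [0, 2, 5]
print(merged_values)   # [1.0, -1.0, 2.0]
```

After the merge, the position of an index in the row no longer corresponds to the position of the token that produced it, which is the loss of ordering described in the quoted message.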
