Re: [Scikit-learn-general] inconvenient side effect of FeatureHasher.transform after f5a4ad2bfc3c7e487c3855abfc8f83b670d89d0c

Lars Buitinck Wed, 10 Apr 2013 02:48:34 -0700

2013/4/10 Terry Peng <[email protected]>:
> Hi Lars Buitinck,

Replying to the ML; please send this kind of message there next time.


> I thought the order of the words would be the same as the order of the
> indices after FeatureHasher.transform, but it turns out it is not. The
> reason is the sum_duplicates call in FeatureHasher:
>
>         X = sp.csr_matrix((values, indices, indptr), dtype=self.dtype,
>                           shape=(n_samples, self.n_features))
>         X.sum_duplicates()  # also sorts the indices
>
> which was added by your change f5a4ad2bfc3c7e487c3855abfc8f83b670d89d0c (ENH
> speed up hashing and reduce memory usage by 1/3).
> sum_duplicates not only sums the values of duplicated indices, it also
> sorts the indices in ascending order. I think it would be more
> convenient not to sort the indices, so that we can easily get the features
> back from the indices.

I'm not sure what effect that would have on dot products performed
with FeatureHasher output. In the best case, they'd be much slower. In
the worst case, they'd break. Before we implement anything, I'd like
to see how slow/broken the resulting CSR matrices become. Feel free to
try it out and send us a report.
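
For reference, a minimal sketch of the behaviour under discussion, using
scipy.sparse directly rather than FeatureHasher itself. The values, indices
and shape are made up for illustration; only the csr_matrix constructor and
sum_duplicates call mirror the quoted snippet.

```python
import numpy as np
import scipy.sparse as sp

# One sample whose hashed features arrive in "word order":
# column 5 occurs twice, and the columns are not sorted.
values = np.array([1.0, 2.0, 3.0, 4.0])
indices = np.array([5, 2, 5, 0], dtype=np.int32)
indptr = np.array([0, 4], dtype=np.int32)

X = sp.csr_matrix((values, indices, indptr), shape=(1, 8))

# Copy the column order before the in-place call below changes it.
order_before = X.indices.copy()

# Merges the two entries for column 5 and sorts the column indices.
X.sum_duplicates()

order_after = X.indices.copy()
data_after = X.data.copy()

print("before:", order_before)
print("after: ", order_after, data_after)
```

After the call, the original insertion order can no longer be recovered
from X.indices, which is the side effect described above. Dropping the call
would leave both the duplicates and the unsorted indices in place, so any
timing or correctness report should be run on matrices in that state.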

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

