I think tries might be an interesting data structure here, but it really depends on where the bottleneck is. I'm surprised they are not used more; maybe that's just because implementations are missing?
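For what it's worth, a minimal pure-Python sketch of what trie-based
token counting could look like (purely illustrative -- any real speedup
would need a C-level trie):

    class TrieNode:
        __slots__ = ("children", "count")

        def __init__(self):
            self.children = {}
            self.count = 0

    def count_tokens(tokens):
        # Insert each token character by character; the final node of
        # a token carries its occurrence count.
        root = TrieNode()
        for token in tokens:
            node = root
            for ch in token:
                node = node.children.setdefault(ch, TrieNode())
            node.count += 1
        return root

    root = count_tokens("the cat sat on the mat".split())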

On 11/26/18 8:39 AM, Roman Yurchak via scikit-learn wrote:
Hi Matthieu,

if you are interested in general questions regarding improving
scikit-learn performance, you might want to have a look at the draft
roadmap
https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018 --
there are a lot of topics where suggestions / PRs on improving
performance would be very welcome.

For the particular case of TfidfVectorizer, it is a bit different from
the rest of the scikit-learn code base in that it is limited not by the
performance of numerical computation but by that of string processing
and counting. TfidfVectorizer is equivalent to CountVectorizer +
TfidfTransformer, and the latter has only a marginal computational cost.
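A minimal check of that equivalence (a sketch, with made-up documents):

    import numpy as np
    from sklearn.feature_extraction.text import (
        CountVectorizer, TfidfTransformer, TfidfVectorizer)

    docs = ["the cat sat on the mat", "the dog sat on the log"]

    # Two-step pipeline: token counting, then tf-idf weighting.
    X_two_step = TfidfTransformer().fit_transform(
        CountVectorizer().fit_transform(docs))
    # One-step equivalent.
    X_one_step = TfidfVectorizer().fit_transform(docs)

    assert np.allclose(X_two_step.toarray(), X_one_step.toarray())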
As to CountVectorizer, last time I checked, its profile was roughly,
   - one part regexp tokenization (see token_pattern.findall)
   - one part token counting (see CountVectorizer._count_vocab)
   - and a comparable part for everything else
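That split can be reproduced along these lines (a sketch, assuming the
20 newsgroups data as a test corpus):

    import cProfile
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer

    docs = fetch_20newsgroups(subset="train").data
    # Sort by cumulative time to see where fit_transform spends its time.
    cProfile.runctx("CountVectorizer().fit_transform(docs)",
                    globals(), locals(), sort="cumtime")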

Because of that split, porting it to Cython is not straightforward: one
would still be using CPython regexps and counting tokens in a dict. For
instance, HashingVectorizer implements its token counting in Cython --
it's faster, but not that much faster. Using C++ maps or some less
common structures has been discussed in
https://github.com/scikit-learn/scikit-learn/issues/2639
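If someone wants to check the "faster but not that much faster" claim on
their own data, a rough timing sketch (again using 20 newsgroups as an
example corpus; numbers will vary):

    from timeit import timeit
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import (
        CountVectorizer, HashingVectorizer)

    docs = fetch_20newsgroups(subset="train").data

    # One pass each; HashingVectorizer does its token counting in Cython.
    t_count = timeit(lambda: CountVectorizer().fit_transform(docs), number=1)
    t_hash = timeit(lambda: HashingVectorizer().fit_transform(docs), number=1)
    print(f"CountVectorizer:   {t_count:.1f} s")
    print(f"HashingVectorizer: {t_hash:.1f} s")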

Currently, I think, there are ~3 main ways performance could be improved:
   1. Optimize the current implementation while remaining in Python.
Possible, but IMO it would require some effort, because there is not
much low-hanging fruit left there. Though a fresh look would definitely
be good.

   2. Parallelize computations. There was some earlier discussion about
this in scikit-learn issues, but at present the best way would probably
be to add it in dask-ml (see
https://github.com/dask/dask-ml/issues/5). HashingVectorizer is already
supported; someone would need to implement CountVectorizer.
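To illustrate why HashingVectorizer is the easy case: it is stateless,
so documents can be vectorized in independent chunks and the results
stacked. A sketch with joblib rather than dask (chunk size made up):

    import scipy.sparse as sp
    from joblib import Parallel, delayed
    from sklearn.feature_extraction.text import HashingVectorizer

    vec = HashingVectorizer()

    def parallel_transform(docs, n_jobs=4, chunk_size=10000):
        # Each chunk is transformed independently; no shared state needed.
        chunks = [docs[i:i + chunk_size]
                  for i in range(0, len(docs), chunk_size)]
        X_chunks = Parallel(n_jobs=n_jobs)(
            delayed(vec.transform)(chunk) for chunk in chunks)
        return sp.vstack(X_chunks)

CountVectorizer, by contrast, needs a vocabulary shared across chunks,
which is exactly what a dask-ml implementation would have to handle.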

   3. Rewrite part of the implementation in a lower-level language (e.g.
Cython). The question is how maintainable that would be, and whether the
performance gains would be worth it. Now that Python 2 will be dropped,
at least not having to deal with Py2/3 string compatibility in Cython
might make things a bit easier. On the other hand, doing the processing
in Cython might make using custom tokenizers/analyzers more difficult.
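For context, custom tokenizers/analyzers are plain Python callables,
e.g. (with a made-up analyzer):

    from sklearn.feature_extraction.text import CountVectorizer

    def char_bigrams(doc):
        # Hypothetical analyzer: overlapping character bigrams.
        return [doc[i:i + 2] for i in range(len(doc) - 1)]

    vec = CountVectorizer(analyzer=char_bigrams)
    X = vec.fit_transform(["hello world", "hello there"])

A Cython fast path would either have to call back into such Python
callables (losing most of the gain) or be bypassed when they are used.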

On a related topic, I have lately been experimenting with implementing
part of this processing in Rust:
https://github.com/rth/text-vectorize. So far it looks promising,
though of course it will remain a separate project because of the
language constraints in scikit-learn.

In general, if you have thoughts on things that can be improved, don't
hesitate to open issues.
