Hi Matthieu,

If you are interested in general questions regarding improving scikit-learn performance, you might want to have a look at the draft roadmap https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018 -- there are a lot of topics where suggestions / PRs on improving performance would be very welcome.
For the particular case of TfidfVectorizer, it is a bit different from the rest of the scikit-learn code base in the sense that it is limited not by the performance of numerical calculations but by that of string processing and counting. TfidfVectorizer is equivalent to CountVectorizer + TfidfTransformer, and the latter has only a marginal computational cost (snippet 1 at the end of this message checks this equivalence).

As to CountVectorizer, last time I checked, its profile was something along the lines of:
 - one part regexp tokenization (see token_pattern.findall),
 - one part token counting (see CountVectorizer._count_vocab),
 - and a comparable part for all the rest
(snippet 2 below shows one way to reproduce such a profile).

Because of that, porting it to Cython is not straightforward, since one would still end up calling the CPython regexp engine and counting tokens in a dict (snippet 3 below sketches that hot loop). For instance, HashingVectorizer implements token counting in Cython -- it is faster, but not that much faster. Using C++ maps or some less common data structures has been discussed in https://github.com/scikit-learn/scikit-learn/issues/2639

Currently, I think, there are three main ways performance could be improved:

1. Optimize the current implementation while remaining in Python. Possible, but IMO it would require some effort, because there is not much low-hanging fruit left there. Though a fresh look would definitely be good.

2. Parallelize the computations. There was some earlier discussion about this in the scikit-learn issues, but at present the better way would probably be to add it to dask-ml (see https://github.com/dask/dask-ml/issues/5). HashingVectorizer is already supported; someone would need to implement CountVectorizer (snippet 4 below sketches the chunk-wise approach).

3. Rewrite part of the implementation in a lower-level language (e.g. Cython). The question is how maintainable that would be, and whether the performance gains would be worth it. Now that Python 2 support is being dropped, at least not having to deal with Py2/3 string compatibility in Cython might make things a bit easier. On the other hand, if the processing happens in Cython, it might make using custom tokenizers/analyzers more difficult.

On a related topic, I have lately been experimenting with implementing part of this processing in Rust: https://github.com/rth/text-vectorize. So far it looks promising, though of course it will remain a separate project because of the language constraints in scikit-learn.

In general, if you have thoughts on things that can be improved, don't hesitate to open issues.

--
Roman

On 25/11/2018 10:59, Matthieu Brucher wrote:
> Hi all,
>
> I've noticed a few questions online (mainly SO) on TfidfVectorizer
> speed, and I was wondering about the global effort on speeding up sklearn.
> Is there something I can help on this topic (Cython?), as well as a
> discussion on this tough subject?
>
> Cheers,
>
> Matthieu
> --
> Quantitative analyst, Ph.D.
> Blog: http://blog.audio-tk.com/
> LinkedIn: http://www.linkedin.com/in/matthieubrucher
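Snippet 1 -- a minimal check of the CountVectorizer + TfidfTransformer
equivalence mentioned above (the three toy documents are just an
illustration):

    import numpy as np
    from sklearn.feature_extraction.text import (
        CountVectorizer, TfidfTransformer, TfidfVectorizer)

    docs = ["the quick brown fox", "the lazy dog", "the quick dog"]

    # TfidfVectorizer in one step ...
    X_direct = TfidfVectorizer().fit_transform(docs)

    # ... is the same as CountVectorizer followed by TfidfTransformer
    counts = CountVectorizer().fit_transform(docs)
    X_two_step = TfidfTransformer().fit_transform(counts)

    assert np.allclose(X_direct.toarray(), X_two_step.toarray())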
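Snippet 2 -- one way to reproduce the rough profile quoted above,
using the standard-library cProfile (20 newsgroups is just a
convenient corpus here, any large enough text collection would do):

    import cProfile
    import pstats
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer

    docs = fetch_20newsgroups(subset="train").data

    profiler = cProfile.Profile()
    profiler.enable()
    CountVectorizer().fit_transform(docs)
    profiler.disable()

    # look for the time spent in the regexp findall (tokenization)
    # vs CountVectorizer._count_vocab (counting)
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)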
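Snippet 3 -- a stripped-down, pure-Python sketch of the hot loop
discussed above (the real _count_vocab also builds the vocabulary
and the CSR matrix structure). This is what a Cython port would
mostly be fighting against: both the regexp engine and dict updates
already run as C code under CPython.

    import re
    from collections import defaultdict

    # scikit-learn's default token_pattern
    token_pattern = re.compile(r"(?u)\b\w\w+\b")

    def count_tokens(docs):
        # tokenize each document, then count token occurrences
        counts = defaultdict(int)
        for doc in docs:
            for token in token_pattern.findall(doc):
                counts[token] += 1
        return counts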
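Snippet 4 -- a sketch of the chunk-wise parallelism from point 2,
here with joblib rather than dask (chunk_size and n_jobs are
arbitrary). It relies on HashingVectorizer being stateless, so each
chunk can be transformed independently and the results stacked;
doing the same for CountVectorizer is harder because the vocabulary
is shared state.

    import scipy.sparse as sp
    from joblib import Parallel, delayed
    from sklearn.feature_extraction.text import HashingVectorizer

    def vectorize_in_parallel(docs, n_jobs=4, chunk_size=10000):
        vec = HashingVectorizer()
        chunks = [docs[i:i + chunk_size]
                  for i in range(0, len(docs), chunk_size)]
        results = Parallel(n_jobs=n_jobs)(
            delayed(vec.transform)(chunk) for chunk in chunks)
        # stateless transform -> safe to stack per-chunk outputs
        return sp.vstack(results)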