Hi Matthieu,

If you are interested in general questions regarding improving scikit-learn performance, you might want to have a look at the draft roadmap https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018 -- there are a lot of topics where suggestions / PRs on improving performance would be very welcome.
For the particular case of TfidfVectorizer, it is a bit different from the rest of the scikit-learn code base in the sense that it is limited not by the performance of numerical calculations but by that of string processing and counting. TfidfVectorizer is equivalent to CountVectorizer + TfidfTransformer, and the latter has only a marginal computational cost (snippet 1 at the end of this message checks this equivalence).

As to CountVectorizer, last time I checked, its profile was something along the lines of:
 - one part regexp tokenization (see token_pattern.findall),
 - one part token counting (see CountVectorizer._count_vocab),
 - and a comparable part for all the rest
(snippet 2 below shows one way to reproduce such a profile).

Because of that, porting it to Cython is not straightforward, since one would still end up calling the CPython regexp engine and counting tokens in a dict (snippet 3 below sketches that hot loop). For instance, HashingVectorizer implements token counting in Cython -- it is faster, but not that much faster. Using C++ maps or some less common data structures has been discussed in https://github.com/scikit-learn/scikit-learn/issues/2639

Currently, I think, there are three main ways performance could be improved:

1. Optimize the current implementation while remaining in Python. Possible, but IMO it would require some effort, because there is not much low-hanging fruit left there. Though a fresh look would definitely be good.

2. Parallelize the computations. There was some earlier discussion about this in the scikit-learn issues, but at present the better way would probably be to add it to dask-ml (see https://github.com/dask/dask-ml/issues/5). HashingVectorizer is already supported; someone would need to implement CountVectorizer (snippet 4 below sketches the chunk-wise approach).

3. Rewrite part of the implementation in a lower-level language (e.g. Cython). The question is how maintainable that would be, and whether the performance gains would be worth it. Now that Python 2 support is being dropped, at least not having to deal with Py2/3 string compatibility in Cython might make things a bit easier. On the other hand, if the processing happens in Cython, it might make using custom tokenizers/analyzers more difficult.

On a related topic, I have lately been experimenting with implementing part of this processing in Rust: https://github.com/rth/text-vectorize. So far it looks promising, though of course it will remain a separate project because of the language constraints in scikit-learn.

In general, if you have thoughts on things that can be improved, don't hesitate to open issues.

--
Roman

On 25/11/2018 10:59, Matthieu Brucher wrote:
> Hi all,
>
> I've noticed a few questions online (mainly SO) on TfidfVectorizer
> speed, and I was wondering about the global effort on speeding up sklearn.
> Is there something I can help on this topic (Cython?), as well as a
> discussion on this tough subject?
>
> Cheers,
>
> Matthieu
> --
> Quantitative analyst, Ph.D.
> Blog: http://blog.audio-tk.com/
> LinkedIn: http://www.linkedin.com/in/matthieubrucher
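Snippet 1 -- a minimal check of the CountVectorizer + TfidfTransformer
equivalence mentioned above (the three toy documents are just an
illustration):

    import numpy as np
    from sklearn.feature_extraction.text import (
        CountVectorizer, TfidfTransformer, TfidfVectorizer)

    docs = ["the quick brown fox", "the lazy dog", "the quick dog"]

    # TfidfVectorizer in one step ...
    X_direct = TfidfVectorizer().fit_transform(docs)

    # ... is the same as CountVectorizer followed by TfidfTransformer
    counts = CountVectorizer().fit_transform(docs)
    X_two_step = TfidfTransformer().fit_transform(counts)

    assert np.allclose(X_direct.toarray(), X_two_step.toarray())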
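Snippet 2 -- one way to reproduce the rough profile quoted above,
using the standard-library cProfile (20 newsgroups is just a
convenient corpus here, any large enough text collection would do):

    import cProfile
    import pstats
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer

    docs = fetch_20newsgroups(subset="train").data

    profiler = cProfile.Profile()
    profiler.enable()
    CountVectorizer().fit_transform(docs)
    profiler.disable()

    # look for the time spent in the regexp findall (tokenization)
    # vs CountVectorizer._count_vocab (counting)
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)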
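Snippet 3 -- a stripped-down, pure-Python sketch of the hot loop
discussed above (the real _count_vocab also builds the vocabulary
and the CSR matrix structure). This is what a Cython port would
mostly be fighting against: both the regexp engine and dict updates
already run as C code under CPython.

    import re
    from collections import defaultdict

    # scikit-learn's default token_pattern
    token_pattern = re.compile(r"(?u)\b\w\w+\b")

    def count_tokens(docs):
        # tokenize each document, then count token occurrences
        counts = defaultdict(int)
        for doc in docs:
            for token in token_pattern.findall(doc):
                counts[token] += 1
        return counts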
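Snippet 4 -- a sketch of the chunk-wise parallelism from point 2,
here with joblib rather than dask (chunk_size and n_jobs are
arbitrary). It relies on HashingVectorizer being stateless, so each
chunk can be transformed independently and the results stacked;
doing the same for CountVectorizer is harder because the vocabulary
is shared state.

    import scipy.sparse as sp
    from joblib import Parallel, delayed
    from sklearn.feature_extraction.text import HashingVectorizer

    def vectorize_in_parallel(docs, n_jobs=4, chunk_size=10000):
        vec = HashingVectorizer()
        chunks = [docs[i:i + chunk_size]
                  for i in range(0, len(docs), chunk_size)]
        results = Parallel(n_jobs=n_jobs)(
            delayed(vec.transform)(chunk) for chunk in chunks)
        # stateless transform -> safe to stack per-chunk outputs
        return sp.vstack(results)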