Hi qll, Sorry for the late reply, lots of things to work on currently.
I'll have a look at the roadmap and the pointers to see what could be done to enhance the situation. Cheers, Matthieu Le lun. 26 nov. 2018 à 20:09, Roman Yurchak via scikit-learn < scikit-learn@python.org> a écrit : > Tries are interesting, but it appears that while they use less memory > that dicts/maps they are generally slower than dicts for a large number > of elements. See e.g. > https://github.com/pytries/marisa-trie/blob/master/docs/benchmarks.rst. > This is also consistent with the results in the below linked > CountVectorizer PR that aimed to use tries, I think. > > Though maybe e.g. MARISA-Trie (and generally trie libraries available in > python) did improve significantly in 5 years since > https://github.com/scikit-learn/scikit-learn/issues/2639 was done. > > The thing is also that even HashingVecorizer that doesn't need to handle > the vocabulary is only a moderately faster, so using a better data > structure for the vocabulary might give us its performance at best.. > > -- > Roman > > On 26/11/2018 16:f28, Andreas Mueller wrote: > > I think tries might be an interesting datastructure, but it really > > depends on where the bottleneck is. > > I'm really surprised they are not used more, but maybe that's just > > because implementations are missing? > > > > On 11/26/18 8:39 AM, Roman Yurchak via scikit-learn wrote: > >> Hi Matthieu, > >> > >> if you are interested in general questions regarding improving > >> scikit-learn performance, you might be want to have a look at the draft > >> roadmap > >> https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018 -- > >> there is a lot topics where suggestions / PRs on improving performance > >> would be very welcome. > >> > >> For the particular case of TfidfVectorizer, it is a bit different from > >> the rest of the scikit-learn code base in the sense that it's not > >> limited by the performance of numerical calculation but rather that of > >> string processing and counting. TfidfVectorizer is equivalent to > >> CountVectorizer + TfidfTransformer and the later has only a marginal > >> computational cost. As to CountVectorizer, last time I checked, its > >> profiling was something along the lines of, > >> - part regexp for tokenization (see token_pattern.findall) > >> - part token counting (see CountVectorizer._count_vocab) > >> - and a comparable part for all the rest > >> > >> Because of that, porting it to Cython is not that immediate, as one is > >> still going to use CPython regexp and token counting in a dict. For > >> instance, HashingVectorizer implements token counting in Cython -- it's > >> faster but not that much faster. Using C++ maps or some less common > >> structures have been discussed in > >> https://github.com/scikit-learn/scikit-learn/issues/2639 > >> > >> Currently, I think, there are ~3 main ways performance could be > improved, > >> 1. Optimize the current implementation while remaining in Python. > >> Possible but IMO would require some effort, because there are not much > >> low hanging fruits left there. Though a new look would definitely be > good. > >> > >> 2. Parallelize computations. There was some earlier discussion about > >> this in scikit-learn issues, but at present, the better way would > >> probably be to add it in dask-ml (see > >> https://github.com/dask/dask-ml/issues/5). HashingVectorizer is already > >> supported. Someone would need to implement CountVectorizer. > >> > >> 3. Rewrite part of the implementation in a lower level language > (e.g. > >> Cython). The question is how maintainable that would be, and whether the > >> performance gains would be worth it. Now that Python 2 will be dropped, > >> at least not having to deal with Py2/3 compatibility for strings in > >> Cython might make things a bit easier. Though, if the processing is in > >> Cython it might also make using custom tokenizers/analyzers more > difficult. > >> > >> On a related topic, I have been experimenting with implementing > part > >> of this processing in Rust lately: > >> https://github.com/rth/text-vectorize. So far it looks promising. > >> Though, of course, it will remain a separate project because of language > >> constraints in scikit-learn. > >> > >> In general if you have thoughts on things that can be improved, don't > >> hesitate to open issues, > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn@python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn@python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Quantitative analyst, Ph.D. Blog: http://blog.audio-tk.com/ LinkedIn: http://www.linkedin.com/in/matthieubrucher
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn