Tries are interesting, but it appears that while they use less memory that dicts/maps they are generally slower than dicts for a large number of elements. See e.g. https://github.com/pytries/marisa-trie/blob/master/docs/benchmarks.rst. This is also consistent with the results in the below linked CountVectorizer PR that aimed to use tries, I think.
Though maybe e.g. MARISA-Trie (and generally trie libraries available in python) did improve significantly in 5 years since https://github.com/scikit-learn/scikit-learn/issues/2639 was done. The thing is also that even HashingVecorizer that doesn't need to handle the vocabulary is only a moderately faster, so using a better data structure for the vocabulary might give us its performance at best.. -- Roman On 26/11/2018 16:f28, Andreas Mueller wrote: > I think tries might be an interesting datastructure, but it really > depends on where the bottleneck is. > I'm really surprised they are not used more, but maybe that's just > because implementations are missing? > > On 11/26/18 8:39 AM, Roman Yurchak via scikit-learn wrote: >> Hi Matthieu, >> >> if you are interested in general questions regarding improving >> scikit-learn performance, you might be want to have a look at the draft >> roadmap >> https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018 -- >> there is a lot topics where suggestions / PRs on improving performance >> would be very welcome. >> >> For the particular case of TfidfVectorizer, it is a bit different from >> the rest of the scikit-learn code base in the sense that it's not >> limited by the performance of numerical calculation but rather that of >> string processing and counting. TfidfVectorizer is equivalent to >> CountVectorizer + TfidfTransformer and the later has only a marginal >> computational cost. As to CountVectorizer, last time I checked, its >> profiling was something along the lines of, >> - part regexp for tokenization (see token_pattern.findall) >> - part token counting (see CountVectorizer._count_vocab) >> - and a comparable part for all the rest >> >> Because of that, porting it to Cython is not that immediate, as one is >> still going to use CPython regexp and token counting in a dict. For >> instance, HashingVectorizer implements token counting in Cython -- it's >> faster but not that much faster. Using C++ maps or some less common >> structures have been discussed in >> https://github.com/scikit-learn/scikit-learn/issues/2639 >> >> Currently, I think, there are ~3 main ways performance could be improved, >> 1. Optimize the current implementation while remaining in Python. >> Possible but IMO would require some effort, because there are not much >> low hanging fruits left there. Though a new look would definitely be good. >> >> 2. Parallelize computations. There was some earlier discussion about >> this in scikit-learn issues, but at present, the better way would >> probably be to add it in dask-ml (see >> https://github.com/dask/dask-ml/issues/5). HashingVectorizer is already >> supported. Someone would need to implement CountVectorizer. >> >> 3. Rewrite part of the implementation in a lower level language (e.g. >> Cython). The question is how maintainable that would be, and whether the >> performance gains would be worth it. Now that Python 2 will be dropped, >> at least not having to deal with Py2/3 compatibility for strings in >> Cython might make things a bit easier. Though, if the processing is in >> Cython it might also make using custom tokenizers/analyzers more difficult. >> >> On a related topic, I have been experimenting with implementing part >> of this processing in Rust lately: >> https://github.com/rth/text-vectorize. So far it looks promising. >> Though, of course, it will remain a separate project because of language >> constraints in scikit-learn. >> >> In general if you have thoughts on things that can be improved, don't >> hesitate to open issues, > > _______________________________________________ > scikit-learn mailing list > scikit-learn@python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn