Hi Dawid, Yes, the performance differences between Lucene's FST, morfologik's fsa and the PatriciaTrie are rather small.
I've managed to get something working well for my use-case. Thanks Dawid and Michael for all your insight! On 20 February 2017 at 19:21, Dawid Weiss <dawid.we...@gmail.com> wrote: > > PatriciaTrie. In particular building an FST with doShareSuffix = false is > > the fastest of any option, > > If you don't share the suffix then you are building a kind of Patricia > trie... But suffix sharing is cheap and can give you a memory saving > (and resulting cache locality sometimes) that is non-trivial to > neglect. > > > and morfologik's fsa has the lowest heap usage of > > all (both in terms of garbage and the final size of the FSA). > > I think these would be minuscule differences, really. Something not > worth the effort of maintaining a compatibility with two libraries, > for example. Lucene's FST are transducers, so they do have an > additional benefit in that you can store something extra with each > entry (this may be handy sometimes -- for example frequency > information attached to each term, etc.). > > > to work with BytesRef) and it supports both prefix > > (MatchResult.SEQUENCE_IS_A_PREFIX) and exact matching > > (MatchResult.EXACT_MATCH). > > So does Lucene's FST, although at a lower level (you'd need to descend > manually). > > > Given the efficiencies and functionality similarity of the FST vs > Automation > > in Lucene, I'm kind of curious as to why you would ever use Automaton? > > The Automaton class was ported from Brics library and it supports a > much more generic class of automata, including optimizations from > NFA->DFA, etc. The FST class is built directly from one specific kind > of data (unique sorted keys on input) and it's heavily optimized > towards this case. > > Dawid >