Hi Mikhail,

I don't have any spectacular suggestions, just a few observations from experience.
1) While the problem is intellectually interesting, I have rarely found anybody who is comfortable using infix suggestions. People are very used to "completions" happening on a prefix of one or multiple words (see my note below, though).

2) Wouldn't it be better/more efficient to maintain an FST/index of word suffix(es) -> complete word instead of offsets within the block? This can be combined with term frequency to limit the suggested words to certain categories (or to the most frequent terms), which would make the FST smaller still.

3) I'd never try to store infixes shorter than 2 or 3 characters (you said you did it: "I even limited suffixes length to reduce their number"). This requires folks to type longer input, but it prevents FST bloat and in general leads to higher-quality suggestions (since there would otherwise be so many of them).

> Otherwise, with many smaller segments fully scanning term dictionaries is
> comparable to seeking suffixes FST and scanning certain blocks.

Yeah, I'd expect the automaton here to be huge. The complexity of the vocabulary and the number of characters in the language will also play a key role.

4) IntelliJ IDEA has this kind of "search everywhere" functionality which greps for infixes (it is really nice). I recall looking at the (open source) engine to see how it was done, and my conclusion from glancing over the code was that it's a fixed, coarse, n-gram based index of consecutive letters pointing at potential matches, which are then revalidated against the query. So you have a super-simple index with a very fast lookup, and the cost of verifying and finding exact matches is shifted to the point where you already have a candidate list. While this doesn't help with Lucene indexes, perhaps it's a sign that for this particular task a different index/search paradigm is needed?

Dawid
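
P.S. To make point 4 a bit more concrete, here is a minimal Python sketch of a coarse n-gram index with a verification step. This is only my illustration of the general idea, not IntelliJ's actual code and not anything Lucene provides; the names `ngrams` and `NGramIndex`, the choice of trigrams, and the lowercasing are all assumptions made for the example.

```python
from collections import defaultdict


def ngrams(text, n=3):
    """Return the set of all n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class NGramIndex:
    """Coarse index: n-gram -> ids of terms containing it (hypothetical helper)."""

    def __init__(self, terms, n=3):
        self.n = n
        self.terms = list(terms)
        self.postings = defaultdict(set)
        for term_id, term in enumerate(self.terms):
            for gram in ngrams(term.lower(), n):
                self.postings[gram].add(term_id)

    def search(self, query):
        """Return the terms that contain query as an infix, in insertion order."""
        q = query.lower()
        if len(q) < self.n:
            # Too short to use the index; refuse rather than scan everything.
            return []
        # Step 1: cheap candidate lookup by intersecting posting sets.
        candidates = None
        for gram in ngrams(q, self.n):
            ids = self.postings.get(gram, set())
            candidates = ids if candidates is None else candidates & ids
            if not candidates:
                return []
        # Step 2: revalidate, because sharing every n-gram does not imply a match.
        return [self.terms[i] for i in sorted(candidates)
                if q in self.terms[i].lower()]
```

The index itself stays small and dumb: it only narrows the search to a candidate list, and the exact substring check happens afterwards. Queries shorter than `n` are rejected here, which is one simple way to enforce the minimum infix length from point 3.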
