Hi Mikhail,

I don't have any spectacular suggestions but something stemming from experience.

1) While the problem is intellectually interesting, I rarely found
anybody who'd be comfortable with using infix suggestions - people are
very used to "completions" happening on a prefix of one or multiple
words (see my note below, though).

2) Wouldn't it be better/ more efficient to maintain an fst/ index of
word suffix(es) -> complete word instead of offsets within the block?
This can be combined with term frequency to limit the number of
suggested words to just certain categories (or most frequent terms)
which would make the fst smaller still.

3) I'd never try to store infixes shorter than 2, 3 characters (you
said you did it - "I even limited suffixes length to reduce their
number"). This requires folks to type in longer input but prevents fst
bloat and in general leads to higher-quality suggestions (since
there'll be so many of them).

> Otherwise, with many smaller segments fully scanning term dictionaries is 
> comparable to seeking suffixes FST and scanning certain blocks.

Yeah, I'd expect the automaton here to be huge. The complexity of the
vocabulary and number of characters in the language will also play a
key role.

4) IntelliJ idea has this kind of "search everywhere" functionality
which greps for infixes (it is really nice). I recall looking at the
(open source engine) to see how it was done and my conclusion from
glancing over the code was that it's a fixed, coarse, n-gram based
index of consecutive letters pointing at potential matches, which are
then revalidated against the query. So you have a super-simple index,
with a very fast lookup and the cost of verifying and finding exact
matches is shifted to once you have a candidate list. While this
doesn't help with Lucene indexes, perhaps it's a sign that for this
particular task a different index/search paradigm is needed?


Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to