Hello, Dawid.

Thanks for your answers. Let me comment below.

On Tue, Apr 26, 2022 at 3:13 PM Dawid Weiss <[email protected]> wrote:
> Hi Mikhail,
>
> I don't have any spectacular suggestions but something stemming from
> experience.
>
> 1) While the problem is intellectually interesting, I rarely found
> anybody who'd be comfortable with using infix suggestions - people are
> very used to "completions" happening on a prefix of one or multiple
> words (see my note below, though).
>

It's interesting that I asked about generic search for *foo* queries, but you read it as a question about infix suggestions. It may sound odd, but I often meet customers who ask for generic *infix* search - find me everything that contains the letters 'foo'. I usually try to convince them that they are focusing only on the positive results: such a high-recall search is prone to false positives, and that makes it quite useless.

>
> 2) Wouldn't it be better/ more efficient to maintain an fst/ index of
> word suffix(es) -> complete word instead of offsets within the block?
> This can be combined with term frequency to limit the number of
> suggested words to just certain categories (or most frequent terms)
> which would make the fst smaller still.
>

Well, I built a prototype which uses an infix suggester for query expansion. It looks quite good, but it is a small Lucene index, not an FST with term outputs. Also, for such odd requirements pruning is undesirable - find me everything, you know.

>
> 3) I'd never try to store infixes shorter than 2, 3 characters (you
> said you did it - "I even limited suffixes length to reduce their
> number"). This requires folks to type in longer input but prevents fst
> bloat and in general leads to higher-quality suggestions (since
> there'll be so many of them).
>

Good point. Short infixes are of little use.

>
> > Otherwise, with many smaller segments fully scanning term dictionaries
> > is comparable to seeking suffixes FST and scanning certain blocks.
>
> Yeah, I'd expect the automaton here to be huge. The complexity of the
> vocabulary and number of characters in the language will also play a
> key role.
>
> 4) IntelliJ idea has this kind of "search everywhere" functionality
> which greps for infixes (it is really nice). I recall looking at the
> (open source engine) to see how it was done and my conclusion from
> glancing over the code was that it's a fixed, coarse, n-gram based
> index of consecutive letters pointing at potential matches, which are
> then revalidated against the query. So you have a super-simple index,
> with a very fast lookup and the cost of verifying and finding exact
> matches is shifted to once you have a candidate list. While this
> doesn't help with Lucene indexes, perhaps it's a sign that for this
> particular task a different index/search paradigm is needed?
>
>
> Dawid
>

-- 
Sincerely yours
Mikhail Khludnev
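
For illustration, the approach described in point 4 (a coarse n-gram index whose candidates are then revalidated against the query) could be sketched roughly as below. This is a toy Python sketch under stated assumptions, not IntelliJ's or Lucene's actual code: the names `ngrams`, `TrigramIndex`, `add` and `search` are hypothetical, and the n-gram size of 3 is an assumption.

```python
from collections import defaultdict


def ngrams(text, n=3):
    """Return the set of all n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class TrigramIndex:
    """Toy n-gram index (hypothetical): n-gram -> ids of terms containing it."""

    def __init__(self, n=3):
        self.n = n
        self.terms = []                   # term id -> term
        self.postings = defaultdict(set)  # n-gram -> set of term ids

    def add(self, term):
        term_id = len(self.terms)
        self.terms.append(term)
        for gram in ngrams(term.lower(), self.n):
            self.postings[gram].add(term_id)

    def search(self, infix):
        infix = infix.lower()
        grams = ngrams(infix, self.n)
        if not grams:
            # Query shorter than n: nothing to look up, so scan every term.
            candidates = range(len(self.terms))
        else:
            # Cheap lookup: intersect the postings of all query n-grams.
            lists = [self.postings.get(g, set()) for g in grams]
            candidates = set.intersection(*lists)
        # Verification: sharing all n-grams is necessary but not sufficient,
        # so each candidate is rechecked against the actual query.
        return sorted(self.terms[i] for i in candidates
                      if infix in self.terms[i].lower())


index = TrigramIndex()
for term in ["foobar", "barfoo", "food", "boofar", "abcxbcd"]:
    index.add(term)

print(index.search("foo"))   # terms containing "foo"
print(index.search("abcd"))  # "abcxbcd" shares both trigrams but is rejected
```

The index stays small because it stores only fixed-length n-grams; the price is the verification pass over the candidate list, which is where false positives such as "abcxbcd" for the query "abcd" are dropped.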
