Re: FST codec for infix queries. No luck so far.

Mikhail Khludnev Tue, 26 Apr 2022 14:27:48 -0700

Hello, Dawid.
Thanks for your answers. Let me comment below.

On Tue, Apr 26, 2022 at 3:13 PM Dawid Weiss wrote:


> Hi Mikhail,
>
> I don't have any spectacular suggestions but something stemming from
> experience.
>
> 1) While the problem is intellectually interesting, I rarely found
> anybody who'd be comfortable with using infix suggestions - people are
> very used to "completions" happening on a prefix of one or multiple
> words (see my note below, though).
>
It's interesting that I asked about generic search for *foo* queries, but
you read it as a question about infix suggestions.
It's a little odd, but I often meet customers who ask for generic *infix*
search: find me everything that includes the letters 'foo'.
I usually try to convince them that they are focusing only on the positive
results, while such a high-recall search is prone to false positives, which
makes it quite useless.


>
> 2) Wouldn't it be better/ more efficient to maintain an fst/ index of
> word suffix(es) -> complete word instead of offsets within the block?
> This can be combined with term frequency to limit the number of
> suggested words to just certain categories (or most frequent terms)
> which would make the fst smaller still.
>
Well, I did a prototype which uses an infix suggester for query expansion. It
looks quite good. But it is a small Lucene index, not an FST with term outputs.
Also, for such odd requirements pruning is undesirable: find me
everything, you know.
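
For illustration only, here is a minimal Python sketch of the "word suffix -> complete word" lookup from point 2. It is not the prototype mentioned above and not Lucene code: a sorted list of pairs stands in for the FST, and the names build_suffix_index, expand_infix and MIN_INFIX are hypothetical. It relies on the fact that every infix of a term is a prefix of one of that term's suffixes, and nothing is pruned by frequency.

```python
from bisect import bisect_left

MIN_INFIX = 3  # assumed cutoff; shorter suffixes are not stored


def build_suffix_index(terms, min_len=MIN_INFIX):
    """Return a sorted list of (suffix, term) pairs.

    Only suffixes of at least min_len characters are kept.
    A real implementation might use an FST; a sorted list is an assumption
    made here to keep the sketch self-contained.
    """
    pairs = set()
    for term in terms:
        for start in range(len(term) - min_len + 1):
            pairs.add((term[start:], term))
    return sorted(pairs)


def expand_infix(index, infix, min_len=MIN_INFIX):
    """Return all terms containing infix, using a prefix scan over suffixes."""
    if len(infix) < min_len:
        # Suffixes shorter than min_len were never indexed, so a short
        # query could silently miss matches near the end of a term.
        raise ValueError("infix shorter than the indexed minimum length")
    # (infix,) sorts before every (suffix, term) pair whose suffix >= infix.
    lo = bisect_left(index, (infix,))
    found = set()
    for suffix, term in index[lo:]:
        if not suffix.startswith(infix):
            break  # matching suffixes are contiguous in sorted order
        found.add(term)
    return sorted(found)


terms = ["foobar", "barfoo", "food", "bar"]
index = build_suffix_index(terms)
expanded = expand_infix(index, "foo")
```

The expanded terms could then be OR-ed into an ordinary term query. The index holds one entry per (suffix, term) pair, which is the size concern raised in this thread.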


>
> 3) I'd never try to store infixes shorter than 2, 3 characters (you
> said you did it - "I even limited suffixes length to reduce their
> number"). This requires folks to type in longer input but prevents fst
> bloat and in general leads to higher-quality suggestions (since
> there'll be so many of them).
>
Good point. Short infixes are of no use.


>
> > Otherwise, with many smaller segments fully scanning term dictionaries
> > is comparable to seeking suffixes FST and scanning certain blocks.
>
> Yeah, I'd expect the automaton here to be huge. The complexity of the
> vocabulary and number of characters in the language will also play a
> key role.
>
> 4) IntelliJ idea has this kind of "search everywhere" functionality
> which greps for infixes (it is really nice). I recall looking at the
> (open source engine) to see how it was done and my conclusion from
> glancing over the code was that it's a fixed, coarse, n-gram based
> index of consecutive letters pointing at potential matches, which are
> then revalidated against the query. So you have a super-simple index,
> with a very fast lookup and the cost of verifying and finding exact
> matches is shifted to once you have a candidate list. While this
> doesn't help with Lucene indexes, perhaps it's a sign that for this
> particular task a different index/search paradigm is needed?
>
>
> Dawid
>
>
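
The approach described in point 4 (a coarse n-gram index that yields candidates, which are then revalidated against the query) can be sketched roughly as follows. This is a guess at the general technique, not IntelliJ IDEA's actual code; the gram size N = 3 and the function names are assumptions made for the example.

```python
from collections import defaultdict

N = 3  # assumed gram size


def ngrams(text, n=N):
    """All distinct runs of n consecutive characters in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_ngram_index(terms, n=N):
    """Map each n-gram to the set of terms that contain it."""
    index = defaultdict(set)
    for term in terms:
        for gram in ngrams(term, n):
            index[gram].add(term)
    return index


def candidates(index, query, n=N):
    """Terms containing every n-gram of the query (may include false positives)."""
    grams = ngrams(query, n)
    return set.intersection(*(index.get(g, set()) for g in grams))


def search_infix(index, terms, query, n=N):
    """Exact infix search: cheap candidate lookup, then verification."""
    if len(query) < n:
        # Too short for the index: fall back to scanning every term.
        return sorted(t for t in terms if query in t)
    return sorted(t for t in candidates(index, query, n) if query in t)


terms = ["foobar", "fooxoob", "barfoo", "bar"]
index = build_ngram_index(terms)
hits = search_infix(index, terms, "foob")
```

Here "fooxoob" contains both "foo" and "oob", so it survives the candidate step for the query "foob" and is dropped only by the final substring check. That is the verification cost being shifted to the candidate list.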

-- 
Sincerely yours
Mikhail Khludnev
