Hi, Michael.

On Tue, Apr 26, 2022 at 10:45 PM Michael Sokolov <[email protected]> wrote:

> I'm not sure under which scenario ngrams (edgengrams) would not be an
> option?

Edge ngrams multiply index size several times over, since postings are repeated
for every derived term. Some systems can't afford such a large footprint.
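To make the blowup concrete, here's a rough sketch (plain Java, no Lucene,
names are my own) of the kind of prefix expansion an edge-ngram filter performs:

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNgrams {
    // Emit all prefixes of the term between minGram and maxGram characters,
    // roughly what an edge-ngram token filter does. Each emitted gram becomes
    // its own indexed term carrying its own posting for the document.
    static List<String> edgeNgrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int max = Math.min(maxGram, term.length());
        for (int len = minGram; len <= max; len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // A single 10-character term expands into 9 indexed terms (minGram=2).
        List<String> grams = edgeNgrams("suggestion", 2, 20);
        System.out.println(grams);
        System.out.println(grams.size());
    }
}
```

So a term of length n costs roughly n postings instead of one, which is where
the "few times bigger" index comes from.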


> Another to try maybe would be something like BPE (byte pair
> encoding). In this encoding, you train a set of tokens from a
> vocabulary based on frequency of occurrence, and agglomerate them
> iteratively until you have the vocabulary at a size you like. You tend
> to end up with commonly-ocurring subwords (morphemes) that can
> possibly be good indexing choices for this sort of thing?
>
It's a productive idea, but some queries would yield no results
because of this pruning.
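For reference, the core of BPE is just an iterated "merge the most frequent
adjacent pair" loop. A minimal toy sketch (my own code, not any particular
library's implementation):

```java
import java.util.*;

public class Bpe {
    // One BPE merge step: count adjacent symbol pairs across all tokenized
    // words, then fuse the most frequent pair into a single symbol.
    static List<List<String>> mergeOnce(List<List<String>> words) {
        Map<String, Integer> pairCounts = new HashMap<>();
        for (List<String> w : words) {
            for (int i = 0; i + 1 < w.size(); i++) {
                pairCounts.merge(w.get(i) + "\u0000" + w.get(i + 1), 1, Integer::sum);
            }
        }
        if (pairCounts.isEmpty()) return words;
        String best = Collections.max(pairCounts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
        String left = best.split("\u0000")[0];
        String right = best.split("\u0000")[1];
        List<List<String>> out = new ArrayList<>();
        for (List<String> w : words) {
            List<String> merged = new ArrayList<>();
            for (int i = 0; i < w.size(); i++) {
                if (i + 1 < w.size() && w.get(i).equals(left)
                        && w.get(i + 1).equals(right)) {
                    merged.add(left + right);
                    i++; // skip the fused right symbol
                } else {
                    merged.add(w.get(i));
                }
            }
            out.add(merged);
        }
        return out;
    }

    // Split a word into single-character symbols (the starting vocabulary).
    static List<String> chars(String s) {
        List<String> out = new ArrayList<>();
        for (char c : s.toCharArray()) out.add(String.valueOf(c));
        return out;
    }
}
```

Repeating mergeOnce until the vocabulary reaches a target size yields the
frequent subword units Michael describes; the pruning issue above is that any
infix not representable from those units never matches.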


>
> On Tue, Apr 26, 2022 at 9:07 AM Michael McCandless
> <[email protected]> wrote:
> >
> > One small datapoint: Amazon's customer facing product search now
> includes some infix suggestions (using Lucene's AnalyzingInfixSuggester),
> but only in fallback cases when the prefix suggesters didn't find
> compelling options.
> >
> > And I think Netflix's suggester used to be primarily infix, but when I
> tested it just now, I got no suggestions at all, only live search results,
> which I like less :)
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Tue, Apr 26, 2022 at 8:13 AM Dawid Weiss <[email protected]>
> wrote:
> >>
> >> Hi Mikhail,
> >>
> >> I don't have any spectacular suggestions but something stemming from
> experience.
> >>
> >> 1) While the problem is intellectually interesting, I rarely found
> >> anybody who'd be comfortable with using infix suggestions - people are
> >> very used to "completions" happening on a prefix of one or multiple
> >> words (see my note below, though).
> >>
> >> 2) Wouldn't it be better/ more efficient to maintain an fst/ index of
> >> word suffix(es) -> complete word instead of offsets within the block?
> >> This can be combined with term frequency to limit the number of
> >> suggested words to just certain categories (or most frequent terms)
> >> which would make the fst smaller still.
> >>
> >> 3) I'd never try to store infixes shorter than 2, 3 characters (you
> >> said you did it - "I even limited suffixes length to reduce their
> >> number"). This requires folks to type in longer input but prevents fst
> >> bloat and in general leads to higher-quality suggestions (otherwise
> >> there'd be far too many of them).
> >>
> >> > Otherwise, with many smaller segments, fully scanning term
> dictionaries is comparable to seeking the suffix FST and scanning certain
> blocks.
> >>
> >> Yeah, I'd expect the automaton here to be huge. The complexity of the
> >> vocabulary and number of characters in the language will also play a
> >> key role.
> >>
> >> 4) IntelliJ IDEA has this kind of "search everywhere" functionality
> >> which greps for infixes (it is really nice). I recall looking at the
> >> (open source) engine to see how it was done, and my conclusion from
> >> glancing over the code was that it's a fixed, coarse, n-gram based
> >> index of consecutive letters pointing at potential matches, which are
> >> then revalidated against the query. So you have a super-simple index,
> >> with a very fast lookup and the cost of verifying and finding exact
> >> matches is shifted to once you have a candidate list. While this
> >> doesn't help with Lucene indexes, perhaps it's a sign that for this
> >> particular task a different index/search paradigm is needed?
> >>
> >>
> >> Dawid
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
>
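A toy version of the candidate-then-revalidate scheme Dawid describes in point
4 (class and method names are my own guess, not IDEA's actual engine):

```java
import java.util.*;

public class TrigramIndex {
    // Coarse index: trigram -> ids of entries containing it.
    private final Map<String, Set<Integer>> grams = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    void add(String text) {
        int id = docs.size();
        docs.add(text);
        for (int i = 0; i + 3 <= text.length(); i++) {
            grams.computeIfAbsent(text.substring(i, i + 3),
                    k -> new HashSet<>()).add(id);
        }
    }

    // Intersect candidate sets for the query's trigrams, then revalidate
    // each surviving candidate with a plain substring check.
    List<String> search(String query) {
        Set<Integer> candidates = null;
        for (int i = 0; i + 3 <= query.length(); i++) {
            Set<Integer> ids = grams.getOrDefault(query.substring(i, i + 3),
                    Collections.emptySet());
            if (candidates == null) candidates = new HashSet<>(ids);
            else candidates.retainAll(ids);
        }
        List<String> hits = new ArrayList<>();
        if (candidates == null) return hits;
        for (int id : candidates) {
            if (docs.get(id).contains(query)) hits.add(docs.get(id));
        }
        return hits;
    }
}
```

The index stays small and lookups are cheap; the exact-match cost is deferred
to the (hopefully short) candidate list, which is exactly the shift of work
Dawid points out.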


-- 
Sincerely yours
Mikhail Khludnev
