Re: FST codec for infix queries. No luck so far.

Walter Underwood Tue, 26 Apr 2022 14:25:05 -0700

I built the original Netflix autocomplete. That used edge Ngrams running on 
Solr 1.3.
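
For readers unfamiliar with the technique, here is a minimal sketch of edge
n-gram matching in plain Python. It is not the Solr analysis chain described
above; the function names (edge_ngrams, build_index, suggest) and the
min_len/max_len parameters are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def edge_ngrams(token, min_len=1, max_len=20):
    """Return the leading prefixes of token, shortest first."""
    upper = min(len(token), max_len)
    return [token[:n] for n in range(min_len, upper + 1)]

def build_index(titles):
    """Map every edge n-gram of every word to the titles containing that word."""
    index = defaultdict(set)
    for title in titles:
        for word in title.lower().split():
            for gram in edge_ngrams(word):
                index[gram].add(title)
    return index

def suggest(index, typed):
    """Return titles in which every typed word is a prefix of some word."""
    words = typed.lower().split()
    if not words:
        return []
    # index.get avoids creating empty entries in the defaultdict
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)
```

The lookup is a plain dictionary hit per typed word, which is why the
approach stays fast: all the prefix expansion cost is paid at index time.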


It isn’t a really big index. There just aren’t that many movies and TV shows. I 
think we had 70k titles and 150k people (actors, directors, …).

We handled one corner case in the client code. Movies with a one-character 
title must show up for that character or they are unmatchable. You can’t type 
more characters to match A, M, X, or Z (all movies). That special case still 
works on dvd.netflix.com, but not on the streaming site. 
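
One way such a special case could look is sketched below. This is an
assumption, not the actual client code: it supposes that suggestions are
ranked by a popularity score and cut off at a limit, so a one-character
title would otherwise be pushed out by more popular titles sharing that
first letter. The name rank_suggestions and the (title, popularity) tuples
are hypothetical.

```python
def rank_suggestions(typed, candidates, limit=10):
    """Order (title, popularity) candidates by popularity, most popular first.

    Assumed special case: when exactly one character has been typed, a title
    that is exactly that character is moved to the front, because no longer
    input could ever select it.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    titles = [title for title, _ in ranked]
    if len(typed) == 1:
        key = typed.lower()
        exact = [t for t in titles if t.lower() == key]
        rest = [t for t in titles if t.lower() != key]
        titles = exact + rest
    return titles[:limit]
```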

wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/  (my blog)

> On Apr 26, 2022, at 12:45 PM, Michael Sokolov <[email protected]> wrote:
> 
> I'm not sure under which scenario ngrams (edgengrams) would not be an
> option? Another thing to try might be something like BPE (byte pair
> encoding). In this encoding, you train a set of tokens from a
> vocabulary based on frequency of occurrence, and agglomerate them
> iteratively until you have the vocabulary at a size you like. You tend
> to end up with commonly occurring subwords (morphemes) that can
> possibly be good indexing choices for this sort of thing?
> 
> On Tue, Apr 26, 2022 at 9:07 AM Michael McCandless
> <[email protected]> wrote:
>> 
>> One small datapoint: Amazon's customer facing product search now includes 
>> some infix suggestions (using Lucene's AnalyzingInfixSuggester), but only in 
>> fallback cases when the prefix suggesters didn't find compelling options.
>> 
>> And I think Netflix's suggester used to be primarily infix, but now when I 
>> tested it, I get no suggestions at all, only live search results, which I 
>> like less :)
>> 
>> Mike McCandless
>> 
>> http://blog.mikemccandless.com
>> 
>> 
>> On Tue, Apr 26, 2022 at 8:13 AM Dawid Weiss <[email protected]> wrote:
>>> 
>>> Hi Mikhail,
>>> 
>>> I don't have any spectacular suggestions but something stemming from 
>>> experience.
>>> 
>>> 1) While the problem is intellectually interesting, I rarely found
>>> anybody who'd be comfortable with using infix suggestions - people are
>>> very used to "completions" happening on a prefix of one or multiple
>>> words (see my note below, though).
>>> 
>>> 2) Wouldn't it be better/ more efficient to maintain an fst/ index of
>>> word suffix(es) -> complete word instead of offsets within the block?
>>> This can be combined with term frequency to limit the number of
>>> suggested words to just certain categories (or most frequent terms)
>>> which would make the fst smaller still.
>>> 
>>> 3) I'd never try to store infixes shorter than 2 or 3 characters (you
>>> said you did it - "I even limited suffixes length to reduce their
>>> number"). This requires folks to type in longer input but prevents fst
>>> bloat and in general leads to higher-quality suggestions (since
>>> otherwise there'd be too many of them).
>>> 
>>>> Otherwise, with many smaller segments fully scanning term dictionaries is 
>>>> comparable to seeking suffixes FST and scanning certain blocks.
>>> 
>>> Yeah, I'd expect the automaton here to be huge. The complexity of the
>>> vocabulary and number of characters in the language will also play a
>>> key role.
>>> 
>>> 4) IntelliJ IDEA has this kind of "search everywhere" functionality
>>> which greps for infixes (it is really nice). I recall looking at the
>>> (open source) engine to see how it was done and my conclusion from
>>> glancing over the code was that it's a fixed, coarse, n-gram based
>>> index of consecutive letters pointing at potential matches, which are
>>> then revalidated against the query. So you have a super-simple index,
>>> with a very fast lookup and the cost of verifying and finding exact
>>> matches is shifted to once you have a candidate list. While this
>>> doesn't help with Lucene indexes, perhaps it's a sign that for this
>>> particular task a different index/search paradigm is needed?
>>> 
>>> 
>>> Dawid
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>> 
> 
>
