I built the original Netflix autocomplete. That used edge Ngrams running on Solr 1.3.
It isn’t a really big index. There just aren’t that many movies and TV shows. I think we had 70k titles and 150k people (actors, directors, …). We handled one corner case in the client code. Movies with a one-character title must show up for that character or they are unmatchable. You can’t type more characters to match A, M, X, or Z (all movies). That special case still works on dvd.netflix.com, but not on the streaming site. wunder Walter Underwood [email protected] http://observer.wunderwood.org/ (my blog) > On Apr 26, 2022, at 12:45 PM, Michael Sokolov <[email protected]> wrote: > > I'm not sure under which scenario ngrams (edgengrams) would not be an > option? Another to try maybe would be something like BPE (byte pair > encoding). In this encoding, you train a set of tokens from a > vocabulary based on frequency of occurrence, and agglomerate them > iteratively until you have the vocabulary at a size you like. You tend > to end up with commonly-ocurring subwords (morphemes) that can > possibly be good indexing choices for this sort of thing? > > On Tue, Apr 26, 2022 at 9:07 AM Michael McCandless > <[email protected]> wrote: >> >> One small datapoint: Amazon's customer facing product search now includes >> some infix suggestions (using Lucene's AnalyzingInfixSuggester), but only in >> fallback cases when the prefix suggesters didn't find compelling options. >> >> And I think Netflix's suggester used to be primarily infix, but now when I >> tested it, I get no suggestions at all, only live search results, which I >> like less :) >> >> Mike McCandless >> >> http://blog.mikemccandless.com >> >> >> On Tue, Apr 26, 2022 at 8:13 AM Dawid Weiss <[email protected]> wrote: >>> >>> Hi Mikhail, >>> >>> I don't have any spectacular suggestions but something stemming from >>> experience. >>> >>> 1) While the problem is intellectually interesting, I rarely found >>> anybody who'd be comfortable with using infix suggestions - people are >>> very used to "completions" happening on a prefix of one or multiple >>> words (see my note below, though). >>> >>> 2) Wouldn't it be better/ more efficient to maintain an fst/ index of >>> word suffix(es) -> complete word instead of offsets within the block? >>> This can be combined with term frequency to limit the number of >>> suggested words to just certain categories (or most frequent terms) >>> which would make the fst smaller still. >>> >>> 3) I'd never try to store infixes shorter than 2, 3 characters (you >>> said you did it - "I even limited suffixes length to reduce their >>> number"). This requires folks to type in longer input but prevents fst >>> bloat and in general leads to higher-quality suggestions (since >>> there'll be so many of them). >>> >>>> Otherwise, with many smaller segments fully scanning term dictionaries is >>>> comparable to seeking suffixes FST and scanning certain blocks. >>> >>> Yeah, I'd expect the automaton here to be huge. The complexity of the >>> vocabulary and number of characters in the language will also play a >>> key role. >>> >>> 4) IntelliJ idea has this kind of "search everywhere" functionality >>> which greps for infixes (it is really nice). I recall looking at the >>> (open source engine) to see how it was done and my conclusion from >>> glancing over the code was that it's a fixed, coarse, n-gram based >>> index of consecutive letters pointing at potential matches, which are >>> then revalidated against the query. So you have a super-simple index, >>> with a very fast lookup and the cost of verifying and finding exact >>> matches is shifted to once you have a candidate list. While this >>> doesn't help with Lucene indexes, perhaps it's a sign that for this >>> particular task a different index/search paradigm is needed? >>> >>> >>> Dawid >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: [email protected] >>> For additional commands, e-mail: [email protected] >>> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] >
