Hi, Michael. On Tue, Apr 26, 2022 at 10:45 PM Michael Sokolov <[email protected]> wrote:
> I'm not sure under which scenario ngrams (edgengrams) would not be an > option? Edgengrams bumps index size a few times, since postings are repeated per every derived term. Some systems can't afford such a big footprint. > Another to try maybe would be something like BPE (byte pair > encoding). In this encoding, you train a set of tokens from a > vocabulary based on frequency of occurrence, and agglomerate them > iteratively until you have the vocabulary at a size you like. You tend > to end up with commonly-ocurring subwords (morphemes) that can > possibly be good indexing choices for this sort of thing? > It's a productive idea, but there will be some queries which yield no results due to this pruning. > > On Tue, Apr 26, 2022 at 9:07 AM Michael McCandless > <[email protected]> wrote: > > > > One small datapoint: Amazon's customer facing product search now > includes some infix suggestions (using Lucene's AnalyzingInfixSuggester), > but only in fallback cases when the prefix suggesters didn't find > compelling options. > > > > And I think Netflix's suggester used to be primarily infix, but now when > I tested it, I get no suggestions at all, only live search results, which I > like less :) > > > > Mike McCandless > > > > http://blog.mikemccandless.com > > > > > > On Tue, Apr 26, 2022 at 8:13 AM Dawid Weiss <[email protected]> > wrote: > >> > >> Hi Mikhail, > >> > >> I don't have any spectacular suggestions but something stemming from > experience. > >> > >> 1) While the problem is intellectually interesting, I rarely found > >> anybody who'd be comfortable with using infix suggestions - people are > >> very used to "completions" happening on a prefix of one or multiple > >> words (see my note below, though). > >> > >> 2) Wouldn't it be better/ more efficient to maintain an fst/ index of > >> word suffix(es) -> complete word instead of offsets within the block? > >> This can be combined with term frequency to limit the number of > >> suggested words to just certain categories (or most frequent terms) > >> which would make the fst smaller still. > >> > >> 3) I'd never try to store infixes shorter than 2, 3 characters (you > >> said you did it - "I even limited suffixes length to reduce their > >> number"). This requires folks to type in longer input but prevents fst > >> bloat and in general leads to higher-quality suggestions (since > >> there'll be so many of them). > >> > >> > Otherwise, with many smaller segments fully scanning term > dictionaries is comparable to seeking suffixes FST and scanning certain > blocks. > >> > >> Yeah, I'd expect the automaton here to be huge. The complexity of the > >> vocabulary and number of characters in the language will also play a > >> key role. > >> > >> 4) IntelliJ idea has this kind of "search everywhere" functionality > >> which greps for infixes (it is really nice). I recall looking at the > >> (open source engine) to see how it was done and my conclusion from > >> glancing over the code was that it's a fixed, coarse, n-gram based > >> index of consecutive letters pointing at potential matches, which are > >> then revalidated against the query. So you have a super-simple index, > >> with a very fast lookup and the cost of verifying and finding exact > >> matches is shifted to once you have a candidate list. While this > >> doesn't help with Lucene indexes, perhaps it's a sign that for this > >> particular task a different index/search paradigm is needed? > >> > >> > >> Dawid > >> > >> --------------------------------------------------------------------- > >> To unsubscribe, e-mail: [email protected] > >> For additional commands, e-mail: [email protected] > >> > -- Sincerely yours Mikhail Khludnev
