Stemming is not 1:1. There are contractions that go to two words (won't -> will not), German decompounding can create a nearly arbitrary number of subwords, and there are two-token sequences that stem to a single word.
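To make the not-1:1 point concrete, here is a toy sketch in Python. It is not Lucene code or a real stemmer: the rule tables, the `analyze` function, the German compound, and the "wi fi" pair are all made up for illustration. Each output token records the span of input tokens it came from, which shows one-to-many and many-to-one mappings side by side.

```python
# Toy analysis rules. Every entry here is an illustrative assumption,
# not taken from any real analyzer.
CONTRACTIONS = {"won't": ["will", "not"]}                        # 1 token -> 2 tokens
COMPOUNDS = {"donaudampfschiff": ["donau", "dampf", "schiff"]}   # 1 token -> many
PAIRS = {("wi", "fi"): "wifi"}                                   # 2 tokens -> 1 token


def analyze(tokens):
    """Return (output_token, start, end) tuples; [start, end) is the input span."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i].lower()
        # Look ahead one token to catch two-token sequences.
        pair = (tok, tokens[i + 1].lower()) if i + 1 < len(tokens) else None
        if pair in PAIRS:
            out.append((PAIRS[pair], i, i + 2))
            i += 2
            continue
        # Expand contractions and compounds; otherwise pass the token through.
        parts = CONTRACTIONS.get(tok) or COMPOUNDS.get(tok) or [tok]
        for part in parts:
            out.append((part, i, i + 1))
        i += 1
    return out
```

With these rules, `analyze(["won't"])` yields two output tokens over one input position, while `analyze(["wi", "fi"])` yields one output token spanning two input positions, so no fixed term-to-term mapping covers both cases.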
Synonyms also are often multi-word. I just added a symmetrical synonym for "A&M" and "A & M" to our college name search.

wunder

On Jun 6, 2013, at 4:01 AM, Dmitry Kan wrote:

> Thanks!
>
> Yes, it could be that allowing single term to point to several posting lists
> is good e.g. for synonyms. So that there would be a single entry point for
> one synonym (term) of the synonym set and it would find all doc ids where
> synonyms of the entry point occur. Or is it being done like this already?
>
> For the exact / inexact matching, the implementation we have now would
> suggest all surface forms occurred in the doc corpus of a word and its stem
> to be pointing to a single posting list. Which potentially makes the inverted
> index more compact. But maybe maintaining N lists + mergesort is faster?
>
> For the reverse expansion idea, which I personally like as well, we could
>
>
> 2013/6/6 Michael McCandless <[email protected]>
> Neat idea!
>
> Would this idea allow a single term to point to (the union of) N other
> posting lists? It seems like that's necessary e.g. to handle the
> exact/inexact case.
>
> And then, to produce the Docs/AndPositionsEnum you'd need to do the
> merge sort across those N posting lists?
>
> Such a thing might also be do-able as runtime only wrapper around the
> postings API (FieldsProducer), if you could at runtime do the reverse
> expansion (e.g. stem -> all of its surface forms).
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Jun 6, 2013 at 3:51 AM, Dmitry Kan <[email protected]> wrote:
> >
> > Robert Muir and I have discussed what Robert eventually named "postings
> > lists deduplication" at bbuzz 2013 conference in Berlin.
> >
> > The idea is to allow multiple terms to point to the same postings list to
> > save space.
> >
> > The application / impact of this is positive for synonyms, exact / inexact
> > terms, leading wildcard support via storing reversed term etc.
> >
> > At the moment, when supporting exact (unstemmed) and inexact (stemmed)
> > searches, we store both unstemmed and stemmed variant of a word form and
> > that leads to index bloating. For example, we had to remove the leading
> > wildcard support via reversing a token on index and query time because of
> > the same index size considerations.
> >
> > Would you like a jira for this?
> >
> > Thanks,
> >
> > Dmitry Kan
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

--
Walter Underwood
[email protected]
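
The "N lists + mergesort" alternative raised in the quoted thread can be sketched roughly as follows. This is a toy in-memory model in Python, not Lucene's postings API: the `POSTINGS` data, the `SURFACE_FORMS` reverse-expansion table, and the function names are all assumptions made up for illustration. Each surface form keeps its own sorted list of doc ids, exact search reads one list, and inexact search merges the lists of every surface form of the stem.

```python
import heapq

# Toy index: one sorted doc-id list per surface form (made-up data).
POSTINGS = {
    "run": [1, 4, 9],
    "running": [2, 4, 7],
    "runs": [4, 8],
}

# Hypothetical reverse expansion table: stem -> all of its surface forms.
SURFACE_FORMS = {"run": ["run", "running", "runs"]}


def union_postings(lists):
    """Merge already-sorted doc-id lists into one sorted list without duplicates."""
    merged = []
    for doc_id in heapq.merge(*lists):
        # heapq.merge yields ids in ascending order, so duplicates are adjacent.
        if not merged or merged[-1] != doc_id:
            merged.append(doc_id)
    return merged


def exact_docs(term):
    """Exact (unstemmed) search reads a single postings list."""
    return POSTINGS.get(term, [])


def inexact_docs(stem):
    """Inexact (stemmed) search unions the lists of all surface forms."""
    forms = SURFACE_FORMS.get(stem, [stem])
    return union_postings([POSTINGS.get(form, []) for form in forms])
```

Here `exact_docs("running")` returns `[2, 4, 7]` and `inexact_docs("run")` returns `[1, 2, 4, 7, 8, 9]`. Nothing is stored twice, at the cost of an N-way merge at query time; whether that is faster than a single combined list is the open question from the thread, and this sketch does not answer it.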
