I've seen this behavior in commercial tokenizers and stemmers that I've used in other products. I would not be surprised if the Basistech package for Lucene did this.

wunder

On Jun 6, 2013, at 8:44 AM, Dmitry Kan wrote:

> Walter,
>
> How are cases like (won't -> will not) handled now? Doesn't it depend on the
> tokenizer before the stemmer kicks in? I.e. in the example, if ' gets removed
> by the tokenizer, we end up having won and t as separate tokens? Is there any
> Lucene filter able to do the expansion?
>
> Dmitry
>
>
> 2013/6/6 Walter Underwood <[email protected]>
> Stemming is not 1:1. There are contractions that go to two words (won't ->
> will not), German decompounding can create a nearly arbitrary number of
> subwords, and there are two-token sequences that stem to a single word.
>
> Synonyms also are often multi-word. I just added a symmetrical synonym for
> "A&M" and "A & M" to our college name search.
>
> wunder
>
> On Jun 6, 2013, at 4:01 AM, Dmitry Kan wrote:
>
>> Thanks!
>>
>> Yes, it could be that allowing a single term to point to several posting
>> lists is good, e.g. for synonyms. There would then be a single entry point
>> for one synonym (term) of the synonym set, and it would find all doc ids
>> where synonyms of the entry point occur. Or is it being done like this
>> already?
>>
>> For exact / inexact matching, the implementation we have now would suggest
>> that all surface forms of a word that occur in the doc corpus, together
>> with its stem, point to a single posting list, which potentially makes the
>> inverted index more compact. But maybe maintaining N lists + mergesort is
>> faster?
>>
>> For the reverse expansion idea, which I personally like as well, we could
>>
>>
>> 2013/6/6 Michael McCandless <[email protected]>
>> Neat idea!
>>
>> Would this idea allow a single term to point to (the union of) N other
>> posting lists? It seems like that's necessary e.g. to handle the
>> exact/inexact case.
>>
>> And then, to produce the Docs/AndPositionsEnum you'd need to do the
>> merge sort across those N posting lists?
>>
>> Such a thing might also be doable as a runtime-only wrapper around the
>> postings API (FieldsProducer), if you could at runtime do the reverse
>> expansion (e.g. stem -> all of its surface forms).
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, Jun 6, 2013 at 3:51 AM, Dmitry Kan <[email protected]> wrote:
>>
>> > Robert Muir and I have discussed what Robert eventually named "postings
>> > lists deduplication" at the bbuzz 2013 conference in Berlin.
>> >
>> > The idea is to allow multiple terms to point to the same postings list to
>> > save space.
>> >
>> > The application / impact of this is positive for synonyms, exact / inexact
>> > terms, leading wildcard support via storing the reversed term, etc.
>> >
>> > At the moment, when supporting exact (unstemmed) and inexact (stemmed)
>> > searches, we store both the unstemmed and stemmed variants of a word form,
>> > and that leads to index bloat. For example, we had to remove the leading
>> > wildcard support via reversing a token at index and query time because of
>> > the same index size considerations.
>> >
>> > Would you like a jira for this?
>> >
>> > Thanks,
>> >
>> > Dmitry Kan
>
> --
> Walter Underwood
> [email protected]

--
Walter Underwood
[email protected]
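
To make the two ideas in the thread concrete (several terms sharing one postings list, and one term resolving to the union of N postings lists via a merge), here is a minimal sketch in Python. It is a toy in-memory model, not Lucene's postings format or API; the names `ToyIndex`, `add_postings`, `point_term_at`, and `docs` are hypothetical and chosen only for illustration.

```python
import heapq
from typing import Dict, Iterable, Iterator, List


class ToyIndex:
    """Toy inverted index (hypothetical; not Lucene's on-disk format)."""

    def __init__(self) -> None:
        self._lists: List[List[int]] = []        # list id -> sorted doc ids
        self._terms: Dict[str, List[int]] = {}   # term -> ids of its lists

    def add_postings(self, doc_ids: Iterable[int]) -> int:
        """Store one postings list once and return its id."""
        self._lists.append(sorted(set(doc_ids)))
        return len(self._lists) - 1

    def point_term_at(self, term: str, *list_ids: int) -> None:
        """Let a term reference one or more existing lists (no copying)."""
        self._terms[term] = list(list_ids)

    def docs(self, term: str) -> Iterator[int]:
        """Yield the union of the term's lists in doc-id order."""
        lists = [self._lists[i] for i in self._terms.get(term, [])]
        previous = None
        # heapq.merge performs the k-way merge of the sorted lists lazily.
        for doc_id in heapq.merge(*lists):
            if doc_id != previous:   # drop docs that occur in several lists
                yield doc_id
                previous = doc_id


index = ToyIndex()

# Deduplication: both synonym spellings reference the same stored list.
am = index.add_postings([2, 5, 9])
index.point_term_at("A&M", am)
index.point_term_at("A & M", am)

# Exact / inexact: each surface form keeps its own list; the stem entry
# (an assumed naming scheme) is resolved as the union of those lists.
run = index.add_postings([3])
runs = index.add_postings([4, 7])
running = index.add_postings([1, 4])
index.point_term_at("run", run)
index.point_term_at("runs", runs)
index.point_term_at("running", running)
index.point_term_at("stem:run", run, runs, running)

print(list(index.docs("A & M")))
print(list(index.docs("stem:run")))
```

In this sketch the stemmed entry costs only a few list references instead of a second copy of the postings, and the price is the merge at query time. Whether that trade is faster than a single precomputed list is the open question raised in the thread.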
