Re: postings lists deduplication

Walter Underwood Thu, 06 Jun 2013 08:12:30 -0700

Stemming is not 1:1. There are contractions that go to two words (won't -> will 
not), German decompounding can create a nearly arbitrary number of subwords, 
and there are two-token sequences that stem to a single word.


Synonyms also are often multi-word. I just added a symmetrical synonym for 
"A&M" and "A & M" to our college name search.
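
To make the many-to-many point concrete, here is a minimal toy sketch in Python (not Lucene code) of the runtime reverse expansion mentioned in the quoted thread: each surface form keeps its own postings list, and a stem is resolved by merging the lists of all its surface forms. The names `postings`, `expansions`, and `inexact_docs`, and all doc ids, are made up for illustration.

```python
import heapq

# Hypothetical toy data: one sorted postings list (doc ids) per surface form.
postings = {
    "run": [1, 4, 7],
    "running": [2, 4],
    "ran": [7, 9],
    "will": [5],
    "not": [3, 6],
    "won't": [3],
}

# Hypothetical reverse expansion: stem -> all of its surface forms.
# "won't" appears under both "will" and "not", so the mapping is not 1:1.
expansions = {
    "run": ["run", "running", "ran"],
    "will": ["will", "won't"],
    "not": ["not", "won't"],
}

def exact_docs(form):
    """Exact (unstemmed) lookup: a single postings list."""
    return postings.get(form, [])

def inexact_docs(stem):
    """Inexact (stemmed) lookup: merge N sorted lists, dropping duplicate doc ids."""
    lists = [postings[f] for f in expansions.get(stem, [stem]) if f in postings]
    merged = []
    for doc in heapq.merge(*lists):
        if not merged or merged[-1] != doc:
            merged.append(doc)
    return merged

print(inexact_docs("run"))   # [1, 2, 4, 7, 9]
print(inexact_docs("will"))  # [3, 5]
print(inexact_docs("not"))   # [3, 6]
```

Because "won't" feeds two stems, it cannot simply share one deduplicated list with a single stem; the merge at query time handles that case at the cost of an N-way merge per lookup.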

wunder

On Jun 6, 2013, at 4:01 AM, Dmitry Kan wrote:

> Thanks!
> 
> Yes, it could be that allowing a single term to point to several posting
> lists is good, e.g. for synonyms. Then there would be a single entry point
> for one synonym (term) of the synonym set, and it would find all doc ids
> where synonyms of the entry point occur. Or is it being done like this already?
> 
> For exact / inexact matching, the implementation we have now would
> suggest that all surface forms of a word that occur in the doc corpus,
> together with its stem, point to a single posting list, which potentially
> makes the inverted index more compact. But maybe maintaining N lists +
> mergesort is faster?
> 
> For the reverse expansion idea, which I personally like as well, we could 
> 
> 
> 2013/6/6 Michael McCandless <[email protected]>
> Neat idea!
> 
> Would this idea allow a single term to point to (the union of) N other
> posting lists?  It seems like that's necessary e.g. to handle the
> exact/inexact case.
> 
> And then, to produce the Docs/AndPositionsEnum you'd need to do the
> merge sort across those N posting lists?
> 
> Such a thing might also be doable as a runtime-only wrapper around the
> postings API (FieldsProducer), if you could at runtime do the reverse
> expansion (e.g. stem -> all of its surface forms).
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Thu, Jun 6, 2013 at 3:51 AM, Dmitry Kan <[email protected]> wrote:
> 
> > Robert Muir and I discussed what Robert eventually named "postings
> > lists deduplication" at the bbuzz 2013 conference in Berlin.
> >
> > The idea is to allow multiple terms to point to the same postings list to
> > save space.
> >
> > The impact of this is positive for synonyms, exact / inexact terms,
> > leading wildcard support via storing reversed terms, etc.
> >
> > At the moment, to support exact (unstemmed) and inexact (stemmed)
> > searches, we store both the unstemmed and stemmed variants of a word
> > form, which leads to index bloat. For example, we had to remove the
> > leading wildcard support via reversing a token at index and query time
> > because of the same index size considerations.
> >
> > Would you like a jira for this?
> >
> > Thanks,
> >
> > Dmitry Kan
> 

--
Walter Underwood
[email protected]
