Re: postings lists deduplication

Dmitry Kan Thu, 06 Jun 2013 04:01:51 -0700

Thanks!

Yes, allowing a single term to point to several posting lists could be
useful, e.g. for synonyms: one synonym (term) of the synonym set would serve
as the single entry point, and a lookup on it would find all doc ids where
any synonym of that entry point occurs. Or is it being done like this
already?
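
To make the question concrete, here is a minimal sketch of a term dictionary
whose entries refer to postings lists instead of owning them. It is a toy
in-memory model, not how Lucene stores postings; the class
SharedPostingsIndex, its methods, and the sample terms are all hypothetical
names chosen for illustration.

```python
from typing import Dict, List, Tuple


class SharedPostingsIndex:
    """Toy index: a term holds ids of postings lists, not the lists themselves."""

    def __init__(self) -> None:
        self._lists: List[List[int]] = []             # list id -> sorted doc ids
        self._terms: Dict[str, Tuple[int, ...]] = {}  # term -> list ids

    def add_list(self, doc_ids: List[int]) -> int:
        """Store one postings list and return its id."""
        self._lists.append(sorted(set(doc_ids)))
        return len(self._lists) - 1

    def point(self, term: str, *list_ids: int) -> None:
        """Make `term` refer to one or more already stored postings lists."""
        self._terms[term] = tuple(list_ids)

    def docs(self, term: str) -> List[int]:
        """Return the sorted union of every list the term refers to."""
        merged = set()
        for list_id in self._terms.get(term, ()):
            merged.update(self._lists[list_id])
        return sorted(merged)

    def stored_postings(self) -> int:
        """Count doc ids physically stored, regardless of how many terms share them."""
        return sum(len(postings) for postings in self._lists)


index = SharedPostingsIndex()
car = index.add_list([1, 4, 7])
automobile = index.add_list([2, 4])

index.point("car", car, automobile)    # entry point: one term, several lists
index.point("automobile", automobile)  # ordinary term with its own list
index.point("auto", automobile)        # deduplication: second term, same list
```

In this sketch, looking up "car" returns the docs of both synonyms, while
"auto" and "automobile" share one stored list, so the second term costs only
a dictionary entry.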


For exact / inexact matching, the implementation we have now suggests that
all surface forms of a word occurring in the doc corpus, together with its
stem, would point to a single posting list, which potentially makes the
inverted index more compact. But maybe maintaining N lists + mergesort is
faster?
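
As a rough illustration of the "N lists + mergesort" alternative, the sketch
below unions N sorted doc-id lists lazily using the k-way merge from
Python's standard heapq module. The function name union_postings and the
sample surface forms are assumptions made up for this example; it ignores
positions and frequencies, which a real postings enum would also have to
merge.

```python
import heapq
from typing import Iterable, Iterator, List


def union_postings(lists: Iterable[List[int]]) -> Iterator[int]:
    """Lazily merge N sorted doc-id lists, emitting each doc id once."""
    last = None
    for doc_id in heapq.merge(*lists):
        if doc_id != last:  # skip docs that occur in more than one list
            yield doc_id
            last = doc_id


# Hypothetical per-surface-form postings for the stem "run".
surface_forms = {
    "run": [1, 5, 9],
    "runs": [2, 5],
    "running": [3, 9, 12],
}

stemmed = list(union_postings(surface_forms.values()))
```

The merge does extra heap work per posting at query time, whereas a single
shared list is read straight through; which one wins in practice would need
to be measured.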

For the reverse expansion idea, which I personally like as well, we could


2013/6/6 Michael McCandless <[email protected]>

> Neat idea!
>
> Would this idea allow a single term to point to (the union of) N other
> posting lists?  It seems like that's necessary e.g. to handle the
> exact/inexact case.
>
> And then, to produce the Docs/AndPositionsEnum you'd need to do the
> merge sort across those N posting lists?
>
> Such a thing might also be do-able as a runtime-only wrapper around the
> postings API (FieldsProducer), if you could at runtime do the reverse
> expansion (e.g. stem -> all of its surface forms).
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Jun 6, 2013 at 3:51 AM, Dmitry Kan <[email protected]>
> wrote:
>
> > Robert Muir and I discussed what Robert eventually named "postings
> > lists deduplication" at the bbuzz 2013 conference in Berlin.
> >
> > The idea is to allow multiple terms to point to the same postings list to
> > save space.
> >
> > The application / impact of this is positive for synonyms, exact / inexact
> > terms, leading wildcard support via storing reversed term etc.
> >
> > At the moment, when supporting exact (unstemmed) and inexact (stemmed)
> > searches, we store both the unstemmed and the stemmed variant of a word
> > form, which bloats the index. For example, we had to remove leading
> > wildcard support (done by reversing each token at index and query time)
> > because of the same index size considerations.
> >
> > Would you like a jira for this?
> >
> > Thanks,
> >
> > Dmitry Kan
>
