Re: postings lists deduplication

Dmitry Kan Thu, 06 Jun 2013 08:04:09 -0700

Mike, Robert,

Is a *Pluggable Codec* a good way to set up this PostingsFormat experiment?


Dmitry


2013/6/6 Dmitry Kan <[email protected]>

> Thanks!
>
> Yes, allowing a single term to point to several postings lists could be
> useful, e.g. for synonyms. There would then be a single entry point for
> one synonym (term) of the synonym set, and it would find all doc ids
> where synonyms of the entry point occur. Or is it already done like
> this?
>
> For exact / inexact matching, the implementation we have now would
> suggest that all surface forms of a word that occur in the document
> corpus, together with its stem, point to a single postings list. That
> potentially makes the inverted index more compact. But maybe maintaining
> N lists + merge sort is faster?
>
> For the reverse expansion idea, which I personally like as well, we could
>
>
> 2013/6/6 Michael McCandless <[email protected]>
>
>> Neat idea!
>>
>> Would this idea allow a single term to point to (the union of) N other
>> posting lists?  It seems like that's necessary e.g. to handle the
>> exact/inexact case.
>>
>> And then, to produce the DocsEnum / DocsAndPositionsEnum you'd need to
>> do the merge sort across those N postings lists?
>>
>> Such a thing might also be doable as a runtime-only wrapper around the
>> postings API (FieldsProducer), if you could at runtime do the reverse
>> expansion (e.g. stem -> all of its surface forms).
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, Jun 6, 2013 at 3:51 AM, Dmitry Kan <[email protected]>
>> wrote:
>>
>> > Robert Muir and I discussed what Robert eventually named "postings
>> > lists deduplication" at the bbuzz 2013 conference in Berlin.
>> >
>> > The idea is to allow multiple terms to point to the same postings list
>> > to save space.
>> >
>> > This would benefit synonyms, exact / inexact terms, leading wildcard
>> > support via storing reversed terms, etc.
>> >
>> > At the moment, to support exact (unstemmed) and inexact (stemmed)
>> > searches, we store both the unstemmed and stemmed variants of a word
>> > form, and that leads to index bloat. For example, we had to remove
>> > leading wildcard support (reversing a token at index and query time)
>> > because of the same index size considerations.
>> >
>> > Would you like a jira for this?
>> >
>> > Thanks,
>> >
>> > Dmitry Kan
>>
>
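
To make the two ideas in this thread concrete, here is a minimal sketch in plain Java. It is not Lucene code and uses no Lucene API: the class PostingsSketch and the methods addPostings, pointTermAt and docs are hypothetical names chosen for illustration only. It assumes postings lists are sorted arrays of doc ids held in memory, lets several terms point at the same list (deduplication), and lets one term point at N lists whose union is produced by a heap-based merge.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical in-memory model of the idea; none of these names exist in Lucene.
final class PostingsSketch {
    // Each postings list is a sorted array of doc ids, stored exactly once.
    private final List<int[]> postingsLists = new ArrayList<>();
    // A term maps to one or more list ids; several terms may share an id.
    private final Map<String, List<Integer>> termToListIds = new HashMap<>();

    // Stores a sorted postings list and returns its id.
    int addPostings(int... sortedDocIds) {
        postingsLists.add(sortedDocIds);
        return postingsLists.size() - 1;
    }

    // Makes a term point at existing lists instead of copying their doc ids.
    void pointTermAt(String term, int... listIds) {
        List<Integer> ids = new ArrayList<>();
        for (int id : listIds) {
            ids.add(id);
        }
        termToListIds.put(term, ids);
    }

    // Returns the sorted union of all lists the term points at (N-way merge).
    int[] docs(String term) {
        List<Integer> ids = termToListIds.get(term);
        if (ids == null) {
            return new int[0];
        }
        // Heap entries are cursors {listId, position}, ordered by current doc id.
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(
                postingsLists.get(a[0])[a[1]], postingsLists.get(b[0])[b[1]]));
        for (int id : ids) {
            if (postingsLists.get(id).length > 0) {
                heap.add(new int[] {id, 0});
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] cursor = heap.poll();
            int[] list = postingsLists.get(cursor[0]);
            int doc = list[cursor[1]];
            // Skip doc ids already emitted by another list.
            if (merged.isEmpty() || merged.get(merged.size() - 1) != doc) {
                merged.add(doc);
            }
            cursor[1]++;
            if (cursor[1] < list.length) {
                heap.add(cursor);
            }
        }
        int[] result = new int[merged.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = merged.get(i);
        }
        return result;
    }
}
```

In this model, the exact terms "running" and "runs" would each point at their own list, while the stem "run" would point at both lists and pay the merge cost at query time. Two synonyms would simply point at the same list id. Whether that trade-off beats storing one pre-merged list per stem is the open question raised above.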
