Re: postings lists deduplication

Dmitry Kan Thu, 06 Jun 2013 08:44:35 -0700

Walter,

How are cases like (won't -> will not) handled now? Doesn't it depend
on the tokenizer, which runs before the stemmer kicks in? I.e. in the
example, if the ' gets removed by the tokenizer, do we end up with won and t
as separate tokens? Is there any Lucene filter able to do the expansion?
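
To make the question concrete, here is a minimal sketch in plain Python. It
is not Lucene code: naive_tokenize, expand_contractions and the CONTRACTIONS
table are hypothetical names used only for illustration. It assumes a
tokenizer that treats every non-letter as a separator, and compares
tokenizing the raw text with tokenizing after an expansion step.

```python
import re

# Hypothetical lookup table; a real one would need many more entries.
CONTRACTIONS = {"won't": "will not", "can't": "can not"}

def naive_tokenize(text):
    """Split on anything that is not a letter, so ' acts as a separator."""
    return [tok for tok in re.split(r"[^a-z]+", text.lower()) if tok]

def expand_contractions(text):
    """Replace known contractions with their expansion before tokenizing."""
    words = text.lower().split()
    return " ".join(CONTRACTIONS.get(word, word) for word in words)

# Without expansion the apostrophe splits the contraction in two.
print(naive_tokenize("I won't go"))
# With expansion applied first, the tokenizer never sees the apostrophe.
print(naive_tokenize(expand_contractions("I won't go")))
```

In this toy setup the expansion has to run before the tokenizer; once won
and t are separate tokens, a later per-token step can no longer tell that
they came from a contraction.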


Dmitry


2013/6/6 Walter Underwood <[email protected]>

> Stemming is not 1:1. There are contractions that go to two words (won't ->
> will not), German decompounding can create a nearly arbitrary number of
> subwords, and there are two-token sequences that stem to a single word.
>
> Synonyms also are often multi-word. I just added a symmetrical synonym for
> "A&M" and "A & M" to our college name search.
>
> wunder
>
> On Jun 6, 2013, at 4:01 AM, Dmitry Kan wrote:
>
> Thanks!
>
> Yes, it could be that allowing a single term to point to several posting
> lists is good, e.g. for synonyms, so that there would be a single entry
> point for one synonym (term) of the synonym set, and it would find all doc
> ids where synonyms of the entry point occur. Or is it being done like this
> already?
>
> For the exact / inexact matching, the implementation we have now would
> suggest that all surface forms of a word occurring in the doc corpus, and
> its stem, point to a single posting list, which potentially makes the
> inverted index more compact. But maybe maintaining N lists + mergesort is
> faster?
>
> For the reverse expansion idea, which I personally like as well, we could
>
>
> 2013/6/6 Michael McCandless <[email protected]>
>
>> Neat idea!
>>
>> Would this idea allow a single term to point to (the union of) N other
>> posting lists?  It seems like that's necessary e.g. to handle the
>> exact/inexact case.
>>
>> And then, to produce the Docs/AndPositionsEnum you'd need to do the
>> merge sort across those N posting lists?
>>
>> Such a thing might also be do-able as runtime only wrapper around the
>> postings API (FieldsProducer), if you could at runtime do the reverse
>> expansion (e.g. stem -> all of its surface forms).
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, Jun 6, 2013 at 3:51 AM, Dmitry Kan <[email protected]>
>> wrote:
>>
>> > Robert Muir and I have discussed what Robert eventually named "postings
>> > lists deduplication" at bbuzz 2013 conference in Berlin.
>> >
>> > The idea is to allow multiple terms to point to the same postings list
>> > to save space.
>> >
>> > The application / impact of this is positive for synonyms, exact /
>> > inexact terms, leading wildcard support via storing reversed term etc.
>> >
>> > At the moment, when supporting exact (unstemmed) and inexact (stemmed)
>> > searches, we store both unstemmed and stemmed variant of a word form and
>> > that leads to index bloating. For example, we had to remove the leading
>> > wildcard support via reversing a token on index and query time because
>> > of the same index size considerations.
>> >
>> > Would you like a jira for this?
>> >
>> > Thanks,
>> >
>> > Dmitry Kan
>>
>
> --
> Walter Underwood
> [email protected]
>
>
>
>
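
The two ideas in the quoted thread (several terms sharing one postings list,
and one term resolving to the union of N lists) can be sketched in plain
Python as below. This is a toy model under stated assumptions, not Lucene's
postings format or API: POSTINGS, TERM_TO_LIST, STEM_TO_SURFACE and both
functions are hypothetical names, and a postings list is reduced to a sorted
list of doc ids without positions.

```python
import heapq

# Physical postings lists: sorted doc ids, each stored exactly once.
POSTINGS = [
    [1, 4, 9],    # list 0
    [2, 4],       # list 1
    [3, 9, 12],   # list 2
]

# Deduplication: a term and its reversed form (kept for leading-wildcard
# support) point to the same list id instead of duplicating the doc ids.
TERM_TO_LIST = {
    "walking": 0, "gniklaw": 0,
    "walks": 1, "sklaw": 1,
    "walked": 2, "deklaw": 2,
}

# Reverse expansion: stem -> surface forms seen in the corpus.
STEM_TO_SURFACE = {"walk": ["walking", "walks", "walked"]}

def docs_for_term(term):
    """Exact match: look up the single shared postings list."""
    return POSTINGS[TERM_TO_LIST[term]]

def docs_for_stem(stem):
    """Inexact match: merge the N sorted lists and drop repeated doc ids."""
    lists = [docs_for_term(term) for term in STEM_TO_SURFACE[stem]]
    merged = []
    for doc_id in heapq.merge(*lists):
        if not merged or merged[-1] != doc_id:
            merged.append(doc_id)
    return merged

print(docs_for_term("gniklaw"))
print(docs_for_stem("walk"))
```

The sketch trades index size for query-time work: the stemmed term stores no
postings of its own, but every inexact query pays for an N-way merge, which
is the compactness versus speed question raised in the thread.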
