Re: postings lists deduplication

Dmitry Kan Fri, 07 Jun 2013 05:03:58 -0700

Thanks for your input Walter, it is valuable.


I noticed that one of my earlier messages in this thread got cut off abruptly. The full paragraph was:

For the reverse expansion idea, which I personally like as well, we could
open up an opportunity for folks who experiment with generating surface
forms from lemmas at query time, based on POS tags and other grammatical
features.
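
To make the reverse expansion idea a bit more concrete, here is a minimal
sketch in Python (not Lucene code). It assumes a toy in-memory index where
every surface form keeps its own sorted postings list, and a hypothetical
stem -> surface forms table; POSTINGS, SURFACE_FORMS and all function names
are made up for illustration. An inexact query expands the stem at query
time and merges the N sorted lists into their union; an exact query reads a
single list.

```python
import heapq

# Hypothetical toy index: surface form -> sorted list of doc ids.
POSTINGS = {
    "run": [1, 4, 9],
    "runs": [2, 4],
    "running": [3, 9, 12],
    "walk": [5],
}

# Hypothetical reverse-expansion table: stem -> surface forms seen in the corpus.
SURFACE_FORMS = {
    "run": ["run", "runs", "running"],
    "walk": ["walk"],
}


def expand(stem):
    """Return the surface forms for a stem (the stem itself if unknown)."""
    return SURFACE_FORMS.get(stem, [stem])


def union_postings(terms):
    """Merge the sorted postings lists of the given terms into their union."""
    lists = [POSTINGS[t] for t in terms if t in POSTINGS]
    merged = []
    # heapq.merge performs a k-way merge over already-sorted inputs.
    for doc_id in heapq.merge(*lists):
        if not merged or merged[-1] != doc_id:  # skip duplicate doc ids
            merged.append(doc_id)
    return merged


def search_inexact(stem):
    """Stemmed search: expand at query time, then union the N lists."""
    return union_postings(expand(stem))


def search_exact(term):
    """Unstemmed search: read the single postings list of the surface form."""
    return POSTINGS.get(term, [])
```

With this layout nothing is stored twice, and the cost moves to query time:
search_inexact("run") has to merge three lists, while search_exact("runs")
touches only one.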

Where should we take this? I was thinking of setting up a codec experiment,
if that is a good starting point.
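
As a rough picture of what such an experiment would try to achieve, here is
a small Python sketch of a term dictionary in which several terms share one
stored postings list. It is only an illustration under my own assumptions,
not the Lucene codec API: the DedupIndex class and the "rev:" / "stem:" term
prefixes are hypothetical.

```python
class DedupIndex:
    """Toy term dictionary where identical postings lists are stored once."""

    def __init__(self):
        self._lists = []       # unique postings lists, stored as tuples
        self._by_content = {}  # postings tuple -> position in self._lists
        self._terms = {}       # term -> position in self._lists

    def add(self, term, doc_ids):
        key = tuple(sorted(set(doc_ids)))
        list_id = self._by_content.get(key)
        if list_id is None:
            # First time this exact list is seen: store it.
            list_id = len(self._lists)
            self._lists.append(key)
            self._by_content[key] = list_id
        # Otherwise the term simply points at the already stored list.
        self._terms[term] = list_id

    def postings(self, term):
        list_id = self._terms.get(term)
        return [] if list_id is None else list(self._lists[list_id])

    def stored_lists(self):
        return len(self._lists)


idx = DedupIndex()
idx.add("dog", [1, 3, 7])
idx.add("rev:god", [1, 3, 7])      # reversed term for leading wildcards
idx.add("dogs", [3, 8])
idx.add("rev:sgod", [3, 8])
idx.add("stem:dog", [1, 3, 7, 8])  # stemmed (inexact) term
```

Here five terms end up pointing at three stored lists, because each reversed
term reuses the list of its original token. The stemmed term still needs its
own list in this sketch, since its postings differ from those of any single
surface form.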

Dmitry


2013/6/6 Walter Underwood <[email protected]>

> I've seen this behavior in commercial tokenizers and stemmers that I've
> used in other products. I would not be surprised if the Basistech package
> for Lucene did this.
>
> wunder
>
> On Jun 6, 2013, at 8:44 AM, Dmitry Kan wrote:
>
> Walter,
>
> How are cases like (won't -> will not) handled now? Doesn't it depend on
> the tokenizer, which runs before the stemmer kicks in? I.e. in this
> example, if ' gets removed by the tokenizer, do we end up with won and t as
> separate tokens? Is there any Lucene filter able to do the expansion?
>
> Dmitry
>
>
> 2013/6/6 Walter Underwood <[email protected]>
>
>> Stemming is not 1:1. There are contractions that go to two words (won't
>> -> will not), German decompounding can create a nearly arbitrary number of
>> subwords, and there are two-token sequences that stem to a single word.
>>
>> Synonyms also are often multi-word. I just added a symmetrical synonym
>> for "A&M" and "A & M" to our college name search.
>>
>> wunder
>>
>> On Jun 6, 2013, at 4:01 AM, Dmitry Kan wrote:
>>
>> Thanks!
>>
>> Yes, it could be that allowing a single term to point to several posting
>> lists is good, e.g. for synonyms. There would then be a single entry
>> point for one synonym (term) of the synonym set, and it would find all doc
>> ids where synonyms of the entry point occur. Or is it already being done
>> like this?
>>
>> For the exact / inexact matching, the implementation we have now would
>> suggest that all surface forms of a word that occur in the doc corpus,
>> together with its stem, point to a single posting list, which potentially
>> makes the inverted index more compact. But maybe maintaining N lists +
>> mergesort is faster?
>>
>> For the reverse expansion idea, which I personally like as well, we could
>>
>>
>> 2013/6/6 Michael McCandless <[email protected]>
>>
>>> Neat idea!
>>>
>>> Would this idea allow a single term to point to (the union of) N other
>>> posting lists?  It seems like that's necessary e.g. to handle the
>>> exact/inexact case.
>>>
>>> And then, to produce the Docs/AndPositionsEnum you'd need to do the
>>> merge sort across those N posting lists?
>>>
>>> Such a thing might also be doable as a runtime-only wrapper around the
>>> postings API (FieldsProducer), if you could at runtime do the reverse
>>> expansion (e.g. stem -> all of its surface forms).
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Thu, Jun 6, 2013 at 3:51 AM, Dmitry Kan <[email protected]>
>>> wrote:
>>>
>>> > Robert Muir and I discussed what Robert eventually named "postings
>>> > lists deduplication" at the bbuzz 2013 conference in Berlin.
>>> >
>>> > The idea is to allow multiple terms to point to the same postings list
>>> > to save space.
>>> >
>>> > The application / impact of this is positive for synonyms, exact /
>>> > inexact terms, leading wildcard support via storing reversed term etc.
>>> >
>>> > At the moment, when supporting exact (unstemmed) and inexact (stemmed)
>>> > searches, we store both the unstemmed and the stemmed variant of a word
>>> > form, and that leads to index bloat. For example, we had to remove the
>>> > leading wildcard support via reversing a token at index and query time
>>> > because of the same index size considerations.
>>> >
>>> > Would you like a jira for this?
>>> >
>>> > Thanks,
>>> >
>>> > Dmitry Kan
>>>
>>
>>  --
>> Walter Underwood
>> [email protected]
