Re: postings lists deduplication

Walter Underwood Thu, 06 Jun 2013 09:27:20 -0700

I've seen this behavior in commercial tokenizers and stemmers that I've used in 
other products. I would not be surprised if the Basistech package for Lucene 
did this.


wunder
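
A minimal sketch of the contraction case discussed in this thread, written in plain Python rather than against Lucene's API. The names `naive_tokenize`, `tokenize_keep_apostrophes`, `expand_contractions`, and the `CONTRACTIONS` table are hypothetical and exist only for illustration. It shows why the expansion depends on the tokenizer: once the apostrophe is dropped, a later filter sees "won" and "t" and has nothing left to expand.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical lookup table; a real analyzer would ship a much larger one.
CONTRACTIONS: Dict[str, List[str]] = {
    "won't": ["will", "not"],
    "can't": ["can", "not"],
}


def naive_tokenize(text: str) -> List[str]:
    """Split on anything that is not a letter, so the apostrophe is lost."""
    return re.findall(r"[A-Za-z]+", text)


def tokenize_keep_apostrophes(text: str) -> List[str]:
    """Keep one word-internal apostrophe so contractions survive as one token."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)


def expand_contractions(tokens: Iterable[str]) -> List[str]:
    """One-to-many filter: a known contraction becomes several tokens."""
    expanded: List[str] = []
    for token in tokens:
        expanded.extend(CONTRACTIONS.get(token.lower(), [token]))
    return expanded


if __name__ == "__main__":
    text = "I won't go"
    print(naive_tokenize(text))                                  # ['I', 'won', 't', 'go']
    print(expand_contractions(tokenize_keep_apostrophes(text)))  # ['I', 'will', 'not', 'go']
```

With the naive tokenizer the filter has no contraction to match, so the expansion has to happen in, or immediately after, a tokenizer that preserves the apostrophe.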

On Jun 6, 2013, at 8:44 AM, Dmitry Kan wrote:

> Walter,
> 
> How are cases like (won't -> will not) handled now? Doesn't it depend on the
> tokenizer, before the stemmer kicks in? I.e., in the example, if ' gets removed
> by the tokenizer, do we end up with "won" and "t" as separate tokens? Is there
> any Lucene filter able to do the expansion?
> 
> Dmitry
> 
> 
> 2013/6/6 Walter Underwood <[email protected]>
> Stemming is not 1:1. There are contractions that go to two words (won't -> 
> will not), German decompounding can create a nearly arbitrary number of 
> subwords, and there are two-token sequences that stem to a single word.
> 
> Synonyms also are often multi-word. I just added a symmetrical synonym for 
> "A&M" and "A & M" to our college name search.
> 
> wunder
> 
> On Jun 6, 2013, at 4:01 AM, Dmitry Kan wrote:
> 
>> Thanks!
>> 
>> Yes, it could be that allowing a single term to point to several posting lists 
>> is good, e.g. for synonyms. There would then be a single entry point for 
>> one synonym (term) of the synonym set, and it would find all doc ids where 
>> synonyms of the entry point occur. Or is it already done like this?
>> 
>> For exact / inexact matching, the implementation we have now would 
>> suggest that all surface forms of a word that occur in the doc corpus, and 
>> its stem, point to a single posting list, which potentially makes the 
>> inverted index more compact. But maybe maintaining N lists + mergesort is 
>> faster?
>> 
>> For the reverse expansion idea, which I personally like as well, we could 
>> 
>> 
>> 2013/6/6 Michael McCandless <[email protected]>
>> Neat idea!
>> 
>> Would this idea allow a single term to point to (the union of) N other
>> posting lists?  It seems like that's necessary e.g. to handle the
>> exact/inexact case.
>> 
>> And then, to produce the Docs/AndPositionsEnum you'd need to do the
>> merge sort across those N posting lists?
>> 
>> Such a thing might also be doable as a runtime-only wrapper around the
>> postings API (FieldsProducer), if you could at runtime do the reverse
>> expansion (e.g. stem -> all of its surface forms).
>> 
>> Mike McCandless
>> 
>> http://blog.mikemccandless.com
>> 
>> On Thu, Jun 6, 2013 at 3:51 AM, Dmitry Kan <[email protected]> wrote:
>> 
>> > Robert Muir and I discussed what Robert eventually named "postings
>> > lists deduplication" at the bbuzz 2013 conference in Berlin.
>> >
>> > The idea is to allow multiple terms to point to the same postings list to
>> > save space.
>> >
>> > The impact of this is positive for synonyms, exact / inexact terms,
>> > leading wildcard support via storing the reversed term, etc.
>> >
>> > At the moment, when supporting exact (unstemmed) and inexact (stemmed)
>> > searches, we store both the unstemmed and the stemmed variant of a word
>> > form, and that leads to index bloat. For example, we had to remove the
>> > leading wildcard support via reversing a token at index and query time
>> > because of the same index size considerations.
>> >
>> > Would you like a jira for this?
>> >
>> > Thanks,
>> >
>> > Dmitry Kan
>> 
>> 
> 
> --
> Walter Underwood
> [email protected]
> 
> 
> 
> 

--
Walter Underwood
[email protected]
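
The two ideas in this thread can be sketched together: several terms sharing one postings list (Dmitry's deduplication), and one term pointing to the union of N lists, read through a merge (Mike's question). This is a toy in-memory model in Python, not Lucene's postings API; the `DedupIndex` class and its `add_postings`, `alias`, and `docs` methods are hypothetical names, and positions are left out entirely.

```python
import heapq
from typing import Dict, Iterator, List, Tuple


class DedupIndex:
    """Toy index in which many terms may reference the same postings list."""

    def __init__(self) -> None:
        self.postings: List[List[int]] = []               # list id -> sorted doc ids
        self.term_to_lists: Dict[str, Tuple[int, ...]] = {}

    def add_postings(self, doc_ids: List[int]) -> int:
        """Store one postings list and return its id."""
        self.postings.append(sorted(set(doc_ids)))
        return len(self.postings) - 1

    def alias(self, term: str, *list_ids: int) -> None:
        """Point a term at one or more existing postings lists (no copying)."""
        self.term_to_lists[term] = tuple(list_ids)

    def docs(self, term: str) -> Iterator[int]:
        """Yield the union of the term's lists in doc id order, without duplicates."""
        lists = [self.postings[i] for i in self.term_to_lists.get(term, ())]
        previous = None
        for doc_id in heapq.merge(*lists):  # k-way merge of already sorted lists
            if doc_id != previous:
                yield doc_id
                previous = doc_id


if __name__ == "__main__":
    index = DedupIndex()
    am = index.add_postings([3, 7])
    index.alias("A&M", am)       # both synonym spellings share one list
    index.alias("A & M", am)

    running = index.add_postings([1, 4, 9])
    runs = index.add_postings([2, 4])
    index.alias("running", running)   # exact (unstemmed) terms
    index.alias("runs", runs)
    index.alias("run", running, runs)  # stem -> union of its surface forms

    print(list(index.docs("A & M")))  # [3, 7]
    print(list(index.docs("run")))    # [1, 2, 4, 9]
```

The trade-off raised in the thread shows up directly: aliasing costs no extra postings storage, but a term backed by N lists pays for the merge at read time.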
