[jira] [Created] (LUCENE-5422) Postings lists deduplication

Dmitry Kan (JIRA) Thu, 30 Jan 2014 01:30:32 -0800

Dmitry Kan created LUCENE-5422:
----------------------------------

             Summary: Postings lists deduplication
                 Key: LUCENE-5422
                 URL: https://issues.apache.org/jira/browse/LUCENE-5422
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/codecs, core/index
            Reporter: Dmitry Kan



The context:
http://markmail.org/thread/tywtrjjcfdbzww6f

Robert Muir and I have discussed what Robert eventually named "postings
lists deduplication" at Berlin Buzzwords 2013 conference.

The idea is to allow multiple terms to point to the same postings list to
save space. This can be achieved by new index codec implementation, but this 
jira is open to other ideas as well.

The application / impact of this is positive for synonyms, exact / inexact
terms, leading wildcard support via storing reversed term etc.

For example, at the moment, when supporting exact (unstemmed) and inexact 
(stemmed)
searches, we store both unstemmed and stemmed variant of a word form and
that leads to index bloating. That is why we had to remove the leading
wildcard support via reversing a token on index and query time because of
the same index size considerations.

Comment from Mike McCandless:
Neat idea!

Would this idea allow a single term to point to (the union of) N other
posting lists?  It seems like that's necessary e.g. to handle the
exact/inexact case.

And then, to produce the Docs/AndPositionsEnum you'd need to do the
merge sort across those N posting lists?

Such a thing might also be do-able as runtime only wrapper around the
postings API (FieldsProducer), if you could at runtime do the reverse
expansion (e.g. stem -> all of its surface forms).


Comment from Robert Muir:
I think the exact/inexact is trickier (detecting it would be the hard
part), and you are right, another solution might work better.

but for the reverse wildcard and synonyms situation, it seems we could even
detect it on write if we created some hash of the previous terms postings.
if the hash matches for the current term, we know it might be a "duplicate"
and would have to actually do the costly check they are the same.

maybe there are better ways to do it, but it might be a fun postingformat
experiment to try.





--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Created] (LUCENE-5422) Postings lists deduplication

Reply via email to