[jira] [Commented] (LUCENE-5422) Postings lists deduplication

Vishmi Money (JIRA) Tue, 18 Mar 2014 23:01:20 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940214#comment-13940214
 ]


Vishmi Money commented on LUCENE-5422:
--------------------------------------

[~dmitry_key],
yes, I agree with you. The project needs better scoping, and expert input 
would help with that. As [~mikemccand] said, a clearer use case may solve the 
problem. If we can come up with a clear use case that determines when 
deduplication should actually happen, it will go a long way toward achieving 
this objective, so any ideas are welcome.

I have not yet reported on my progress, so I would like to let you know that 
I am currently analyzing Lucene's tokenization and indexing, since that, 
rather than searching, is the main area I have to work in. To improve or at 
least preserve performance, I am also learning how search queries in Lucene 
can be improved, in line with our objective here. For understanding and 
debugging, I use Luke, an index browser tool for Lucene. Please let me know 
if there are other areas I should look into.

I would also like to remind you about reviewing my proposal; your feedback 
would be a great help to me.

> Postings lists deduplication
> ----------------------------
>
>                 Key: LUCENE-5422
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5422
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs, core/index
>            Reporter: Dmitry Kan
>              Labels: gsoc2014
>
> The context:
> http://markmail.org/thread/tywtrjjcfdbzww6f
> Robert Muir and I discussed what Robert eventually named "postings
> lists deduplication" at the Berlin Buzzwords 2013 conference.
> The idea is to allow multiple terms to point to the same postings list to
> save space. This can be achieved with a new index codec implementation, but
> this JIRA issue is open to other ideas as well.
> This would benefit synonyms, exact / inexact terms, leading wildcard
> support via storing the reversed term, etc.
> For example, at the moment, when supporting exact (unstemmed) and inexact
> (stemmed) searches, we store both the unstemmed and stemmed variants of a
> word form, and that leads to index bloat. The same index size
> considerations are why we had to remove leading wildcard support, which
> worked by reversing a token at index and query time.
> Comment from Mike McCandless:
> Neat idea!
> Would this idea allow a single term to point to (the union of) N other
> posting lists?  It seems like that's necessary e.g. to handle the
> exact/inexact case.
> And then, to produce the Docs/AndPositionsEnum you'd need to do the
> merge sort across those N posting lists?
> Such a thing might also be doable as a runtime-only wrapper around the
> postings API (FieldsProducer), if you could at runtime do the reverse
> expansion (e.g. stem -> all of its surface forms).
> Comment from Robert Muir:
> I think the exact/inexact is trickier (detecting it would be the hard
> part), and you are right, another solution might work better.
> But for the reverse wildcard and synonyms situation, it seems we could even
> detect it on write if we created some hash of the previous term's postings.
> If the hash matches for the current term, we know it might be a "duplicate"
> and would have to actually do the costly check that they are the same.
> Maybe there are better ways to do it, but it might be a fun PostingsFormat
> experiment to try.
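
To make the write-time detection idea from Robert's comment concrete, here is 
a minimal sketch in plain Java. It does not use any Lucene API: the class 
PostingsDedupSketch, its methods, and the representation of a postings list 
as a sorted int[] of document IDs are all assumptions made for illustration. 
It also compares each new term against every list written so far rather than 
only the previous term's postings, which is a simplification of what was 
suggested. A hash is used as a cheap filter, and the costly element-by-element 
comparison runs only when the hash matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of Lucene and not a PostingsFormat.
class PostingsDedupSketch {
    // Term dictionary: each term maps to the id of a (possibly shared) postings list.
    private final Map<String, Integer> termToList = new TreeMap<>();
    // Physically stored postings lists, indexed by list id.
    private final List<int[]> lists = new ArrayList<>();
    // Hash of a postings list -> ids of stored lists that have this hash.
    private final Map<Integer, List<Integer>> hashToListIds = new HashMap<>();

    // Registers a term and returns the id of the list it points to.
    int addTerm(String term, int[] docIds) {
        int hash = Arrays.hashCode(docIds);
        List<Integer> candidates =
            hashToListIds.computeIfAbsent(hash, h -> new ArrayList<>());
        for (int id : candidates) {
            // A matching hash only means "might be a duplicate",
            // so do the costly full comparison before sharing the list.
            if (Arrays.equals(lists.get(id), docIds)) {
                termToList.put(term, id);
                return id;
            }
        }
        // No identical list stored yet: store a new one.
        int id = lists.size();
        lists.add(docIds.clone());
        candidates.add(id);
        termToList.put(term, id);
        return id;
    }

    // Returns the postings of a term, or null if the term is unknown.
    int[] postings(String term) {
        Integer id = termToList.get(term);
        return id == null ? null : lists.get(id);
    }

    int storedLists() { return lists.size(); }

    int termCount() { return termToList.size(); }

    public static void main(String[] args) {
        PostingsDedupSketch index = new PostingsDedupSketch();
        index.addTerm("lucene", new int[] {1, 4, 9});
        // A reversed term for leading wildcard support has identical postings.
        index.addTerm("enecul", new int[] {1, 4, 9});
        index.addTerm("index", new int[] {2, 4});
        System.out.println(index.termCount() + " terms, "
            + index.storedLists() + " stored postings lists");
    }
}
```

This covers only the case of exactly identical postings (reversed terms, 
full synonyms). The exact/inexact case, where one term would need to point to 
the union of several lists, is not addressed by this sketch.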



--
This message was sent by Atlassian JIRA
(v6.2#6252)

