[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286707#comment-13286707 ]
Mark Harwood commented on LUCENE-4069:
--------------------------------------

bq. To solve what you speak of we just need to resolve LUCENE-4093.

Presumably the main objective here is that, in order to cut down on the number of files we store, content consumers of various types should aim to consolidate multiple fields' contents into a single file (if they share common config choices).

bq. Then multiple postings format instances that are 'the same' will be deduplicated correctly.

The complication in this case is that we essentially have two consumers (Bloom and Lucene40), one wrapped in the other, with different but overlapping choices of fields: e.g. we want a single Lucene40 to process all fields, but we want Bloom to handle only a subset of them. This will be a tough one for PerFieldPostingsFormat to untangle while we are stuck with a delegating model for composing consumers.

This may be made easier if, instead of delegating a single stream, we have a *stream-splitting* capability via a multicast subscription: e.g. the Bloom filtering consumer registers interest in content streams for fields A and B, while Lucene40 consolidates content from fields A, B, C and D. A broadcast mechanism feeds each consumer a copy of the relevant stream, and each consumer is responsible for inventing its own file-naming convention that avoids muddling files.

While that may help for writing streams, it doesn't solve the re-assembly of "producer" streams at read time, where BloomFilter absolutely has to position itself in front of the standard Lucene40 producer in order to offer fast-fail lookups. In the absence of a fancy optimised routing mechanism (this all may be overkill), my current solution was to put BloomFilter in the delegate chain, armed with a subset of field names to observe as a larger array of fields flows past to a common delegate. I added some Javadocs to describe the need to do it this way for an efficient configuration.
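The delegate-chain arrangement above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual Lucene API: the class and method names (BloomFilteringTermSource, observe, mightContain) and the single-hash BitSet filter are all hypothetical stand-ins, with a plain Predicate standing in for the common Lucene40 delegate.

```java
import java.util.BitSet;
import java.util.Set;
import java.util.function.Predicate;

// Illustrative stand-in for a Bloom-filtering wrapper sitting in front of a
// common delegate. Only the configured subset of fields is bloom-filtered;
// all lookups still fall through to the delegate on a possible hit.
class BloomFilteringTermSource {
    private final Set<String> bloomFields;    // subset of fields to observe
    private final BitSet bloom = new BitSet(1 << 16);
    private final Predicate<String> delegate; // common delegate seeing all fields

    BloomFilteringTermSource(Set<String> bloomFields, Predicate<String> delegate) {
        this.bloomFields = bloomFields;
        this.delegate = delegate;
    }

    // Write path: record terms only for the observed fields as the larger
    // stream of fields flows past to the common delegate.
    void observe(String field, String term) {
        if (bloomFields.contains(field)) {
            bloom.set(hash(field, term));
        }
    }

    // Read path: fast-fail before any delegate (i.e. disk) access.
    boolean mightContain(String field, String term) {
        if (bloomFields.contains(field) && !bloom.get(hash(field, term))) {
            return false; // definite miss: skip the delegate entirely
        }
        return delegate.test(field + ":" + term);
    }

    private int hash(String field, String term) {
        return Math.floorMod((field + ":" + term).hashCode(), bloom.size());
    }
}
```

The point of the sketch is the ordering constraint from the comment: the Bloom layer must sit in front of the delegate at read time so a definite miss never reaches it, while at write time it merely observes its subset of the fields flowing to the shared consumer.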
You are right that this is messy (i.e. open to bad configuration), but operating this deep down in Lucene, that's always a possibility regardless of what we put in place.

> Segment-level Bloom filters for a 2x speed up on rare term searches
> -------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6, 4.0
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 4.0, 3.6.1
>
>         Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields, e.g. primary keys on big indexes with many segments, but it also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play, just add a field with "_blm" on the end of the name to invoke the special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to the APIs to configure the service properly.
> Also, a patch for the Lucene 4.0 codebase introducing a new PostingsFormat