[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13428839#comment-13428839
 ] 

Michael McCandless commented on LUCENE-4069:


Hmm in testing Bloom + Lucene40 PF on luceneutil's id field I don't see any 
gains with what was committed (base=Lucene40 for id field, comp=Bloom + 
Lucene40 for id field; all other fields are just Lucene40):

{noformat}
            Task  QPS base  StdDev base  QPS bloom  StdDev bloom     Pct diff
        PKLookup    196.71         9.29     192.94          5.40    -8% -  5%
HighSloppyPhrase      0.75         0.04       0.75          0.04   -10% -  9%
      HighPhrase      1.78         0.02       1.76          0.01    -2% -  0%
       MedPhrase      5.00         0.09       4.95          0.15    -5% -  3%
      AndHighMed     74.66         3.52      74.02          4.50   -11% - 10%
     AndHighHigh     16.94         0.82      16.83          0.78    -9% -  9%
 MedSloppyPhrase      4.10         0.11       4.08          0.12    -5% -  4%
       LowPhrase     13.35         0.19      13.31          0.24    -3% -  2%
 LowSloppyPhrase     31.91         0.52      31.90          0.58    -3% -  3%
      AndHighLow   1114.88        43.26    1121.52         33.72    -6% -  7%
     LowSpanNear      9.67         0.22       9.76          0.26    -4% -  5%
         LowTerm    421.55        14.96     426.55          5.23    -3% -  6%
         MedTerm    176.45         7.66     179.40          2.84    -4% -  7%
    HighSpanNear      1.43         0.03       1.45          0.05    -3% -  7%
     MedSpanNear      2.51         0.05       2.56          0.09    -3% -  7%
        HighTerm     15.52         0.95      15.90          0.24    -4% - 10%
         Respell     69.77         3.20      71.65          3.61    -6% - 13%
          Fuzzy1     88.19         2.49      91.04          3.06    -2% -  9%
          Fuzzy2     36.18         1.26      37.50          1.44    -3% - 11%
        Wildcard     45.33         1.52      47.04          1.10    -1% -  9%
         Prefix3     65.65         2.53      68.26          1.99    -2% - 11%
       OrHighLow     37.45         0.88      39.87          2.48    -2% - 15%
      OrHighHigh      9.58         0.10      10.23          0.64     0% - 14%
       OrHighMed     14.89         0.14      15.96          1.01     0% - 15%
          IntNRQ     10.24         1.04      11.07          1.33   -13% - 34%
{noformat}

I just used the DefaultBloomFilterFactory, which should be good here since it 
targets 10% fill based on doc count.
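
As a rough illustration of what a 10% fill target implies for filter size, here is a minimal sketch. It assumes one hash per key, uniformly distributed hashes, and a power-of-two bit count; the class and method names are hypothetical and this is not the DefaultBloomFilterFactory code.

```java
/** Hypothetical sizing helper; an illustration only, not Lucene's DefaultBloomFilterFactory. */
final class BloomSizing {
    /** Approximate fraction of bits set after inserting n keys into m bits, one hash per key. */
    static double expectedFill(long n, long m) {
        return 1.0 - Math.exp(-(double) n / (double) m);
    }

    /** Smallest power-of-two bit count (starting at 64) whose expected fill is at or below targetFill. */
    static long bitsForTargetFill(long n, double targetFill) {
        long m = 64;
        while (expectedFill(n, m) > targetFill) {
            m <<= 1; // double the bit set until the target saturation is met
        }
        return m;
    }
}
```

Under these assumptions, 1,000,000 keys need 2^24 bits (2 MiB), because 2^23 bits would end up roughly 11% full. With a single hash, the fill fraction is also the chance that an absent key passes the filter.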

Maybe even slight loss (but could easily be noise).  I confirmed that I'm 
getting .blm files and that at search time the filter does in fact work 
(quickly returns false from seekExact).
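
The fast-fail behaviour can be sketched as follows. This is a toy model under stated assumptions: the class name, the TreeMap standing in for the terms dictionary, and the use of String.hashCode() as the hash are all hypothetical, and none of it is the BloomFilteringPostingsFormat code.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for a per-segment terms dictionary guarded by a Bloom filter. */
final class BloomGuardedTerms {
    private final BitSet bits;
    private final int numBits;
    private final Map<String, Integer> terms = new TreeMap<>(); // stands in for the real terms dictionary
    int dictionaryLookups = 0; // counts how often the slow path was taken

    BloomGuardedTerms(int numBits) {
        this.numBits = numBits;
        this.bits = new BitSet(numBits);
    }

    private int slot(String term) {
        // String.hashCode() is only a placeholder hash for this sketch.
        return Math.floorMod(term.hashCode(), numBits);
    }

    void add(String term, int docId) {
        bits.set(slot(term));
        terms.put(term, docId);
    }

    /** Returns false without touching the dictionary when the filter rules the term out. */
    boolean seekExact(String term) {
        if (!bits.get(slot(term))) {
            return false; // definite miss: a Bloom filter has no false negatives
        }
        dictionaryLookups++;
        return terms.containsKey(term); // may still be a miss (false positive)
    }
}
```

The saving is whatever the skipped dictionary lookup would have cost, so it shrinks when that lookup is already cheap.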

The differences vs. the dedicated benchmark on this issue are: 1) luceneutil is 
only testing lookups, 2) the id field is just one field (i.e. we also have e.g. 
body/title/date for searching), 3) luceneutil is testing a bunch of other 
queries in other threads in addition to PKLookup.

Out of curiosity I also tried Memory vs Bloom + Memory, and Bloom was ~15% 
slower, I guess because computing the hash and doing the lookup is a 
non-trivial cost vs just walking the FST to find out the term doesn't exist.

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 4.0, 5.0

 Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, 
 LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
 MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java


 An addition to each segment which stores a Bloom filter for selected fields 
 in order to give fast-fail to term searches, helping avoid wasted disk access.
 Best suited for low-frequency fields e.g. primary keys on big indexes with 
 many segments but also speeds up general searching in my tests.
 Overview slideshow here: 
 http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
 Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
 Patch based on 3.6 codebase attached.
 There are no 3.6 API changes currently - to play just add a field with _blm 
 on the end of the name to invoke special indexing/querying capability. 
 Clearly a 

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13427317#comment-13427317
 ] 

Robert Muir commented on LUCENE-4069:
-

Hi Mark: I noticed this was committed only to the 4.x branch. 

Can you also merge the change to trunk?


 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 4.0

 Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, 
 LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
 MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java


 An addition to each segment which stores a Bloom filter for selected fields 
 in order to give fast-fail to term searches, helping avoid wasted disk access.
 Best suited for low-frequency fields e.g. primary keys on big indexes with 
 many segments but also speeds up general searching in my tests.
 Overview slideshow here: 
 http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
 Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
 Patch based on 3.6 codebase attached.
 There are no 3.6 API changes currently - to play just add a field with _blm 
 on the end of the name to invoke special indexing/querying capability. 
 Clearly a new Field or schema declaration(!) would need adding to APIs to 
 configure the service properly.
 Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-02 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13427318#comment-13427318
 ] 

Adrien Grand commented on LUCENE-4069:
--

Mark, is there a reason why this patch hasn't been committed to trunk too?


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-02 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13427322#comment-13427322
 ] 

Mark Harwood commented on LUCENE-4069:
--

Will do.


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-19 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418314#comment-13418314
 ] 

Mark Harwood commented on LUCENE-4069:
--

One more remaining issue before I commit, which has appeared sporadically and 
looks to be consistently raised by this test:

ant test -Dtestcase=TestIndexWriterCommit -Dtests.method=testCommitThreadSafety -Dtests.seed=EA320250471B75AE -Dtests.slow=true -Dtests.postingsformat=TestBloomFilteredLucene40Postings -Dtests.locale=no -Dtests.timezone=Europe/Belfast -Dtests.file.encoding=ISO-8859-1

The error it produces is this: 
[junit4:junit4] Caused by: java.lang.IllegalStateException: Missing file:_9_TestBloomFilteredLucene40Postings_0.blm
[junit4:junit4]    at org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat$BloomFilteredFieldsProducer.init(BloomFilteringPostingsFormat.java:175)
[junit4:junit4]    at org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat.fieldsProducer(BloomFilteringPostingsFormat.java:156)


MockDirectoryWrapper looks to be randomly deleting files (probably my blm 
file shown above) to simulate the effects of crashes.
Presumably I am doing the right thing in always throwing an exception if the 
.blm file is missing? The alternative would be to silently ignore the missing 
file, which seems undesirable.
If MDW is intended to only delete uncommitted files, I'm not sure how we end up 
in a scenario where BloomPF is being asked to open the uncommitted segment?









[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418393#comment-13418393
 ] 

Michael McCandless commented on LUCENE-4069:


bq. Presumably I am doing the right thing in always throwing an exception if 
the .blm file is missing? 

Definitely!  This should be considered a corrupt index.
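
A minimal sketch of that fail-fast policy, using only JDK file APIs. The class, method, and file-naming scheme are hypothetical stand-ins rather than the Lucene Directory API, and the exception type simply mirrors the one in the stack trace quoted earlier in the thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical reader-side check: a segment without its Bloom file is treated as broken. */
final class BloomFileCheck {
    static byte[] openBloomFile(Path segmentDir, String segmentName) throws IOException {
        Path blm = segmentDir.resolve(segmentName + ".blm");
        if (!Files.isRegularFile(blm)) {
            // Fail loudly instead of silently searching without the filter.
            throw new IllegalStateException("Missing file:" + blm.getFileName());
        }
        return Files.readAllBytes(blm);
    }
}
```

Silently skipping the filter would still return correct results, but it would hide the fact that a file the commit claims to contain has gone missing.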

bq. IF MDW is intended to only delete uncommitted files I'm not sure how we end 
up in a scenario where BloomPF is being asked to open the uncommitted segment?

Very strange... IW will sync all files written by the codec, and only once all 
files for a given segment are sync'd will it write a segments_N referencing 
that segment.  Not sure offhand why this file isn't being sync'd... I wonder if 
it has to do w/ only opening the file in the close() method (most other PFs 
open files well before then).  It should be fine to do that ... but it's 
something different.
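
The ordering described above (sync every segment file first, publish the commit point last) can be sketched with plain JDK calls. This is an illustration under assumptions, not IndexWriter's code: the class name, the decimal generation suffix, and the commit file's contents are invented for the example.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

/** Hypothetical sync-then-publish commit; names and file layout are assumptions. */
final class TwoPhaseCommitSketch {
    /** Forces the file's content and metadata to stable storage. */
    static void fsync(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            ch.force(true);
        }
    }

    /** Syncs every listed segment file, then writes and syncs a commit point that names them. */
    static Path commit(Path dir, List<String> segmentFiles, long generation) throws IOException {
        for (String name : segmentFiles) {
            Path f = dir.resolve(name);
            if (!Files.isRegularFile(f)) {
                throw new IllegalStateException("Cannot commit, missing file: " + name);
            }
            fsync(f);
        }
        // Only now does the segment become visible to readers.
        Path commitPoint = dir.resolve("segments_" + generation);
        Files.write(commitPoint, String.join("\n", segmentFiles).getBytes(StandardCharsets.UTF_8));
        fsync(commitPoint);
        return commitPoint;
    }
}
```

With this ordering, a crash before the last step leaves only unreferenced files behind, which is why a committed segment with a missing .blm file is surprising.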


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-19 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418411#comment-13418411
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. I wonder if it has to do w/ only opening the file in the close() method

Just tried opening the file earlier (in BloomFilteredConsumer constructor) and 
that didn't fix it.
I previously also added an extra Directory.fileExists() sanity check 
immediately after closing the IndexOutput and all was well so I think it's 
something happening after that. Will need to dig deeper.
I'm running on WinXP 64bit if that is of any significance to MDW's behaviour.



[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13416007#comment-13416007
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. At a minimum I think before committing we should make the SegmentWriteState 
accessible.

OK. Will that be the subject of a new Jira?

bq. Hmm why is anonymity at search time important?

It would seem to be an established design principle - see 
https://issues.apache.org/jira/browse/LUCENE-4069#comment-13285726

It would be a pain if user config settings require a custom SPI-registered 
class around just to decode the index contents. There's the resource/classpath 
hell, the chance for misconfiguration and running Luke suddenly gets more 
complex.
The line to be drawn is between what are just config settings (field names, 
memory limits) and what are fundamentally different file formats (e.g. codec 
choices).
The design principle that looks to be adopted is that the former ought to be 
accommodated without the need for custom SPI-registered classes and the latter 
would need to locate an implementation via SPI to decode stored content. Seems 
reasonable.
The choice of hash algo does not fundamentally alter the on-disk format (they 
all produce an int) so I would suggest we treat this as a config setting rather 
than a fundamentally different choice of file format.








[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13416014#comment-13416014
 ] 

Uwe Schindler commented on LUCENE-4069:
---

{quote}
It would be a pain if user config settings require a custom SPI-registered 
class around just to decode the index contents. There's the resource/classpath 
hell, the chance for misconfiguration and running Luke suddenly gets more 
complex.
The line to be drawn is between what are just config settings (field names, 
memory limits) and what are fundamentally different file formats (e.g. codec 
choices).
The design principle that looks to be adopted is that the former ought to be 
accommodated without the need for custom SPI-registered classes and the latter 
would need to locate an implementation via SPI to decode stored content. Seems 
reasonable.
The choice of hash algo does not fundamentally alter the on-disk format (they 
all produce an int) so I would suggest we treat this as a config setting rather 
than a fundamentally different choice of file format.
{quote}

The design principle here is very easy: We must follow the SPI pattern, if you 
write an index that could otherwise not be read with default settings and 
produces e.g. CorruptIndexException. If you have a codec that writes some 
special things for specific fields, it is required to write this information 
about the fields to the index. If you want to open this index using IndexReader 
again, there must not be any requirement for configuration settings on the 
reader itself - a simple DirectoryReader.open() must be possible and a query 
must be able to execute. The IndexReader must be able to get all this 
information from the index files. If a special decoder for foobar is needed, it 
must be loadable by SPI. This is similar to postings. A new postings format 
needs a new SPI, otherwise you cannot read the index.
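
The lookup-by-name idea can be sketched as a small registry: the writer stores only a name in the index, and the reader resolves that name at open time or fails. Everything here is hypothetical (class name, method names, error text). A real SPI would discover implementations through java.util.ServiceLoader and META-INF/services entries rather than explicit register() calls.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToIntFunction;

/** Hypothetical name-to-implementation registry for hash functions. */
final class HashRegistry {
    private static final Map<String, ToIntFunction<byte[]>> BY_NAME = new ConcurrentHashMap<>();

    /** Makes an implementation available under the name that gets written to the index. */
    static void register(String name, ToIntFunction<byte[]> fn) {
        BY_NAME.put(name, fn);
    }

    /** Resolves the name read back from the index; fails if no implementation is present. */
    static ToIntFunction<byte[]> forName(String name) {
        ToIntFunction<byte[]> fn = BY_NAME.get(name);
        if (fn == null) {
            throw new IllegalArgumentException(
                "No hash function named '" + name + "' is registered; is its JAR on the classpath?");
        }
        return fn;
    }
}
```

The reader needs no configuration of its own: the name comes from the index, and a missing implementation surfaces as an error at open time.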

And it is not true that Luke is more complex to configure. Just put the JAR 
file into classpath that contains the SPI and you are fine. Setting up a build 
environment is more complicated but that's more the problem of shitty Eclipse 
resource handling. ANT/MAVEN/IDEA is easy.


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13416037#comment-13416037
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq.  If a special decoder for foobar is needed, it must be loadable by SPI. 

I think we are in agreement on the broad principles. The fundamental question 
here though is do you want to treat an index's choice of Hash algo as something 
that would require a new SPI-registered PostingsFormat to decode or can that be 
handled as I have done here with a general purpose SPI framework for hashing 
algos? 

Actually, re-thinking this, I suspect rather than creating our own, I can use 
Java's existing SPI framework for hashing in the form of MessageDigest. I'll 
take a closer look into that...




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13416039#comment-13416039
 ] 

Uwe Schindler commented on LUCENE-4069:
---

bq. Actually, re-thinking this, I suspect rather than creating our own, I can 
use Java's existing SPI framework for hashing in the form of MessageDigest. 
I'll take a closer look into that...

That's a good idea. MessageDigest.getInstance(name) should be the way to go. 
And this name string is encoded to the index. If the MessageDigest API is 
enough then one can e.g. use bouncycastle by just plugging it into the classpath. 
If, on opening the index, this is not available, the user will get an exception 
just like when a codec/postingsformat is missing.
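
A minimal sketch of that approach: resolve the algorithm name stored in the index with MessageDigest.getInstance and fold the digest down to an int. The class name, the exception type, and the choice of the first four digest bytes are assumptions made for illustration only.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical adapter from a named MessageDigest to the int hash a Bloom filter needs. */
final class DigestHash {
    static int hash(String algorithmName, byte[] term) {
        final MessageDigest md;
        try {
            md = MessageDigest.getInstance(algorithmName);
        } catch (NoSuchAlgorithmException e) {
            // Analogous to a missing codec: the index names something this JVM cannot provide.
            throw new IllegalStateException(
                "Hash algorithm '" + algorithmName + "' named in the index is not available", e);
        }
        byte[] d = md.digest(term);
        // Use the first four digest bytes, big-endian, as the int hash.
        return ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16) | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
    }
}
```

Creating a MessageDigest per call keeps the sketch thread-safe, but a cryptographic digest costs far more than a purpose-built hash for short terms.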

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0

 Attachments: BloomFilterPostingsBranch4x.patch, 
 LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
 MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java


 An addition to each segment which stores a Bloom filter for selected fields 
 in order to give fast-fail to term searches, helping avoid wasted disk access.
 Best suited for low-frequency fields e.g. primary keys on big indexes with 
 many segments but also speeds up general searching in my tests.
 Overview slideshow here: 
 http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
 Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
 Patch based on 3.6 codebase attached.
 There are no 3.6 API changes currently - to play just add a field with _blm 
 on the end of the name to invoke special indexing/querying capability. 
 Clearly a new Field or schema declaration(!) would need adding to APIs to 
 configure the service properly.
 Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13416084#comment-13416084
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. MessageDigest.getInstance(name) should be the way to go

I'm less keen now - a quick scan of the docs around MessageDigest throws up 
some issues:
1) SPI registration of MessageDigest providers looks to get into permissions 
hell as it is closely related to security - see 
http://docs.oracle.com/javase/1.4.2/docs/guide/security/CryptoSpec.html#ProviderInstalling
 which talks about the steps required to approve a trusted provider.
2) MessageDigest as an interface is designed to stream content in potentially 
many method calls past the hashing algo. MurmurHash2.java is not currently 
written to process content this way and suits our needs in hashing small blocks 
of content in one hit. 

For these 2 reasons it looks like MessageDigest may be a pain to adopt and the 
existing approach proposed in this patch may be preferable.
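
To illustrate point 2, here is a rough sketch of the mismatch. MessageDigest 
is stateful: content is fed in through update() calls and finished with 
digest(). The OneShotHash interface is a hypothetical stand-in for a "hash 
this small block in one hit" contract, not the patch's actual class, and the 
adapter is only an assumption about how the two could be bridged.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class HashShapes {

  /** Hypothetical one-shot contract: hash a small block of bytes in a single call. */
  interface OneShotHash {
    int hash(byte[] bytes, int offset, int length);
  }

  private static MessageDigest newDigest(String algorithm) {
    try {
      return MessageDigest.getInstance(algorithm);
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Streaming style: content may arrive over many update() calls before digest(). */
  static byte[] streamed(String algorithm, byte[]... chunks) {
    MessageDigest md = newDigest(algorithm);
    for (byte[] chunk : chunks) {
      md.update(chunk);
    }
    return md.digest();
  }

  /** Adapter: hides the streaming API behind the one-shot contract. */
  static OneShotHash adapt(String algorithm) {
    return (bytes, offset, length) -> {
      MessageDigest md = newDigest(algorithm); // fresh stateful object per call
      md.update(bytes, offset, length);
      byte[] d = md.digest();
      // Fold the first four digest bytes into an int.
      return ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16) | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
    };
  }
}
```

The adapter works, but every term lookup pays for a digest object and a 
full-width digest of which only four bytes are used.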




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13416290#comment-13416290
 ] 

Michael McCandless commented on LUCENE-4069:


{quote}
bq. At a minimum I think before committing we should make the SegmentWriteState 
accessible.

OK. Will that be the subject of a new Jira?
{quote}

No, I mean we shouldn't commit this patch until SegmentWriteState is
accessible when creating the FuzzySet.  I think we can just pass it to
BloomFilterFactory.getSetForField?  This way if the app knows it's a
PK field then it can use maxDoc to always size an appropriate
bit set up front.
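
A rough sketch of what such up-front sizing could look like. It assumes one 
hash per key, so n keys in m bits are expected to set a fraction of about 
1 - e^(-n/m) of the bits; for a target saturation f that gives 
m >= -n / ln(1 - f). BloomSizing and bitsFor are hypothetical names, and the 
64-bit minimum and power-of-two rounding are assumptions.

```java
final class BloomSizing {

  /**
   * Smallest power-of-two bit count such that inserting maxDoc distinct keys,
   * one hash each, is expected to set at most targetSaturation of the bits.
   */
  static long bitsFor(int maxDoc, double targetSaturation) {
    if (maxDoc <= 0) {
      return 64; // assumed minimum: one long word
    }
    double needed = -maxDoc / Math.log(1.0 - targetSaturation);
    long bits = 64;
    while (bits < needed) {
      bits <<= 1; // power of two, so a hash can be reduced with a mask
    }
    return bits;
  }
}
```

With a 10% target, a 1,000-doc NRT segment gets 16,384 bits (2 KB) and a 
1,000,000-doc segment gets 16,777,216 bits (2 MB), instead of a fixed 8 MB 
in both cases.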

bq. I think we are in agreement on the broad principles. The fundamental 
question here though is do you want to treat an index's choice of Hash algo as 
something that would require a new SPI-registered PostingsFormat to decode or 
can that be handled as I have done here with a general purpose SPI framework 
for hashing algos?

+1, that's exactly the question.

Ie, where to draw the line between config of an existing PF and
different PF.

But I guess swapping in different hash impl should be seen as simple
config change, so I think using SPI to find it at read time is OK.

I still don't like how trappy this approach is: the default hardwired
(8 MB) can be way too big (silently slows down your NRT reopens,
especially if you bloom all fields) or way too small (silently turns
off bloom filter for fields that have too many unique terms).

I also don't think this PF should be per-field: we have
PerFieldPostingsFormat for that, and if there are limitations in PFPF,
we should address them rather than having to make all future PFs
handle per-field-ness themselves.  This PF should really handle one
field.

But I don't think these issues need to hold up commit (except for
making SegmentWriteState accessible)... we can improve over time.  I
think we may simply want to fold this into the terms dict somehow.

Can you add @lucene.experimental to all the new APIs?





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13416383#comment-13416383
 ] 

Mark Harwood commented on LUCENE-4069:
--

A quick benchmark suggests that the new right-sized bitset, as opposed to the 
old worst-case-scenario-sized bitset, is buying us a small performance 
improvement.

bq. I also don't think this PF should be per-field

There was a lengthy discussion earlier on this topic. The approach presented 
here seems reasonable.
For the average user there is the DefaultBloomFilterFactory default, which now 
has reasonable sizing for all fields passed its way (using a heuristic that 
anticipates numKeys = numDocs). Expert users can provide a BloomFilterFactory 
with a custom choice of sizing heuristic per field and can also simply return 
null for non-bloomed fields.

Having a single, carefully configured BloomPF wrapper is preferable because you 
can channel appropriately configured bloom settings to a common PF delegate and 
avoid creating multiple .tii, .tis files etc because the PerFieldPF isn't smart 
enough to figure out that these Bloom-ing choices do not require different 
physical files for all the delegated tii etc structures.

You don't *have* to use the Per-field stuff in BloomPF but there are benefits 
to be had in doing so which can't otherwise be achieved.


bq. Can you add @lucene.experimental to all the new APIs?

Done.




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13416417#comment-13416417
 ] 

Michael McCandless commented on LUCENE-4069:


Thanks Mark.  +1 to commit.

I still think we could/should somehow fold this into the terms dict but we can 
do that another day... this is a good step forward.

bq. because the PerFieldPF isn't smart enough to figure out that these 
Bloom-ing choices do not require different physical files for all the delegated 
tii etc structures.

We need to fix that, so that every PF wrapper that comes along doesn't feel 
like it must handle per-field-ness itself.




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13415219#comment-13415219
 ] 

Michael McCandless commented on LUCENE-4069:


bq. Estimating key volumes in context 1 is probably hard without some 
additional hints from the end user.

Actually, the indexer knows exactly how many unique terms it's about to
flush.  It's the unique term count (for this one segment) that you
need right?  We just have to plumb it somehow (add to
SegmentWriteState?).

bq. 2) Merged segments e.g. guessing how many unique keys survive a merge 
operation

Seems like LUCENE-4198 needs to solve this same problem.

bq. Not sure how we get the OneMerge instance fed through the call stack - 
could that be held somewhere on a ThreadLocal as generally useful context?

Yeah I'm not sure either ... but I agree an approx heuristic based on
merging segments unique term counts is better than nothing.

I don't like how you silently just get bad perf now if your up-front
guess was too small ... I also don't like that the default guessing is
static (8 MB): flushes of tiny NRT segments are going to pay an absurd
price.  I think we need a solution for this...

Maybe another option is to simply make a 2nd pass after the segment
(flush or merge) is written, to build up the bit set; this way we know
exactly how many unique terms we have.
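
A minimal sketch of the two-pass idea, under stated assumptions: the written 
segment's terms can be iterated twice, each term appears once in the terms 
dict, and Arrays.hashCode stands in for the real hash (the patch uses 
MurmurHash2). TwoPassBloom and the roughly-10-bits-per-term sizing are 
hypothetical choices for illustration only.

```java
import java.util.Arrays;
import java.util.BitSet;

final class TwoPassBloom {
  private final BitSet bits;
  private final int mask;

  private TwoPassBloom(int numBits) {
    this.bits = new BitSet(numBits);
    this.mask = numBits - 1; // numBits is a power of two
  }

  private static int slot(byte[] term, int mask) {
    int h = Arrays.hashCode(term); // stand-in for the real hash function
    h ^= (h >>> 16);               // mix high bits into the low bits
    return h & mask;
  }

  static TwoPassBloom build(Iterable<byte[]> segmentTerms) {
    // Pass 1: count the unique terms actually written for this field.
    long uniqueTerms = 0;
    for (byte[] ignored : segmentTerms) {
      uniqueTerms++;
    }
    // Size from the exact count: about 10 bits per term, power of two, capped.
    int numBits = 64;
    while (numBits < uniqueTerms * 10 && numBits < (1 << 30)) {
      numBits <<= 1;
    }
    // Pass 2: populate the right-sized bit set.
    TwoPassBloom set = new TwoPassBloom(numBits);
    for (byte[] term : segmentTerms) {
      set.bits.set(slot(term, set.mask));
    }
    return set;
  }

  int numBits() {
    return mask + 1;
  }

  /** False means the term is definitely absent; true means it may be present. */
  boolean mayContain(byte[] term) {
    return bits.get(slot(term, mask));
  }
}
```

The cost is a second iteration over the terms, but no guess is needed and a 
tiny flushed segment ends up with a tiny bit set.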





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13415221#comment-13415221
 ] 

Michael McCandless commented on LUCENE-4069:


I don't think we should use a separate factory class to provide the
bloom filters; I think we should simplify.  EG add overridable methods
to BloomFilteringPF; this way the default impl (8 MB, isSaturated)
would be in BloomFilteringPF.

Also, why do we need to use SPI to find the HashFunction?  Seems like
overkill... we don't (yet) have a bunch of hash functions that are
vying here right?  We can make this pluggable at a later time... can't
the postings format impl pass in an instance of HashFunction when
making the FuzzySet?  Default would just be MurmurHash.

Can you move the imports under the copyright header? (eg in
HashFunction.java but maybe others).





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-16 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13415362#comment-13415362
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. It's the unique term count (for this one segment) that you need right? 

Yes, I need it before I start processing the stream of terms being flushed.
 
bq. Seems like LUCENE-4198 needs to solve this same problem.

Another possibly related point on more access to merge context - custom 
codecs have a great opportunity at merge time to piggy-back some analysis on 
the data being streamed e.g. to spot trending terms whose term frequencies 
differ drastically between the merging source segments. This would require 
access to source segment as term postings are streamed to observe the change 
in counts. 

bq. Also, why do we need to use SPI to find the HashFunction? Seems like 
overkill... we don't (yet) have a bunch of hash functions that are vying here 
right?

There's already a MurmurHash3 algo - we're currently using v2 and so could 
anticipate an upgrade at some stage. This patch provides that future proofing.

bq. can't the postings format impl pass in an instance of HashFunction when 
making the FuzzySet

I don't think that is going to work. Currently all PostingFormat impls that 
extend BloomFilterPostingsFormat can be anonymous (i.e. unregistered via SPI). 
All their settings (fields, hash algo, thresholds) etc are recorded at write 
time by the base class in the segment. At read-time it is the 
BloomFilterPostingsFormat base class that is instantiated, not the write-time 
subclass and so we need to store the hash algo choice. We can't rely on the 
original subclass being around and configured appropriately with the original 
write-time choice of hashing function.
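
A rough sketch of the "record the name, resolve it at read time" idea. Note 
that this uses a plain in-memory map as a stand-in for SPI discovery; 
HashRegistry, register and forName are hypothetical names and not the patch's 
API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToIntFunction;

final class HashRegistry {
  // Stand-in for SPI lookup: hash implementations keyed by a stable name.
  private static final Map<String, ToIntFunction<byte[]>> BY_NAME = new ConcurrentHashMap<>();

  static void register(String name, ToIntFunction<byte[]> fn) {
    BY_NAME.put(name, fn);
  }

  /**
   * Read time: only the name stored in the segment is known, so the
   * implementation must be found without the write-time subclass.
   */
  static ToIntFunction<byte[]> forName(String storedName) {
    ToIntFunction<byte[]> fn = BY_NAME.get(storedName);
    if (fn == null) {
      throw new IllegalArgumentException("No hash function registered as '" + storedName + "'");
    }
    return fn;
  }
}
```

At write time the segment would record only the name string; a later upgrade 
(say from a v2 to a v3 hash) is then a new name, and old segments keep 
resolving the old one.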

I think the current way feels safer overall and also allows other Lucene 
functions to safely record hashes along with a hashname string that can be used 
to reconstitute results. 

bq. Can you move the imports under the copyright header?

Will do






[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13415604#comment-13415604
 ] 

Michael McCandless commented on LUCENE-4069:


{quote}
bq. It's the unique term count (for this one segment) that you need right? 

Yes, I need it before I start processing the stream of terms being flushed.
{quote}

At a minimum I think before committing we should make the
SegmentWriteState accessible.  EG then at least I can use numDocs for
my primary key field.

I really don't like how easy it is to silently mis-configure this PF:
the default 8 MB is way too high for an NRT setting and way too low for
a large index.

bq. Currently all PostingFormat impls that extend BloomFilterPostingsFormat can 
be anonymous (i.e. unregistered via SPI). 

Hmm why is anonymity at search time important?  I think that's a
non-feature, and we shouldn't make our core code more complex for it?

Ie, it's fine to require the app to have to make a named PF (that is
accessible via SPI), implementing all their custom bloom logic (which
fields are bloom'd, what hash to use, etc.).

When an app makes a custom Codec/PostingsFormat, it's expected that
that class is accessible via SPI at both index time and search time.





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-10 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13410145#comment-13410145
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. So now we are close to 1M lookups/sec for a single thread!

Cool!

bq. I wonder if somehow we can do a better job picking the right sized bit 
vector up front? 
bq. You basically need to know up front how many unique terms will be in the 
given field for this segment right?

Yes - the job of anticipating the number of unique keys probably has 2 
different contexts:
1) Net new segments e.g. guessing up front how many docs/keys a user is likely 
to generate in a new segment before the flush settings kick in.
2) Merged segments e.g. guessing how many unique keys survive a merge operation

Estimating key volumes in context 1 is probably hard without some additional 
hints from the end user. Arguably the BloomFilterFactory.getSetForField() 
method already represents where this setting can be controlled.
In context 2 where potentially large merges occur we could look at adding an 
extra method to BloomFilterFactory to handle this different context e.g. 
something like
   FuzzySet getSetForMergeOpOnField(FieldInfo fi, OneMerge mergeContext)
Based on the size of the segments being merged and volumes of deletes a more 
appropriate size of Bloom bitset could be allocated based on a worst-case 
estimate.
Not sure how we get the OneMerge instance fed through the call stack - could 
that be held somewhere on a ThreadLocal as generally useful context?
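
A minimal sketch of the worst-case estimate for context 2, assuming a primary 
key field where every live document carries one key and no key is shared 
between the merging segments. SegmentStats is a hypothetical holder; in 
practice the numbers would have to come from whatever merge context gets 
plumbed through.

```java
import java.util.List;

final class MergeSizing {

  /** Hypothetical per-segment numbers for the segments taking part in a merge. */
  static final class SegmentStats {
    final int maxDoc;
    final int delCount;

    SegmentStats(int maxDoc, int delCount) {
      this.maxDoc = maxDoc;
      this.delCount = delCount;
    }
  }

  /** Upper bound on the unique keys surviving the merge: the sum of live docs. */
  static long worstCaseKeys(List<SegmentStats> merging) {
    long total = 0;
    for (SegmentStats s : merging) {
      total += s.maxDoc - s.delCount; // deleted docs do not survive the merge
    }
    return total;
  }
}
```

The result would then be fed into whatever sizing rule the factory uses for a 
freshly flushed segment.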





 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostingsBranch4x.patch, 
 LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
 MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409676#comment-13409676
 ] 

Michael McCandless commented on LUCENE-4069:


bq. That's tightened performance but that looks a scary amount of code for the 
optimal solution of this basic incrementing operation

Well the code now looks horribly complex since we have a ton of booleans to do 
the same thing in different ways :)

From my limited poking-around testing the fastest way seems to be 
lookupByReader=false, shareEnums=true, doSortKeys=true, useDocValues=true, 
useBase36=true, doDirectDelete=true.  Once you specialize the code for those 
booleans then it looks much more sane.

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostingsBranch4x.patch, 
 LUCENE-4069-tryDeleteDocument.patch, MHBloomFilterOn3.6Branch.patch, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409679#comment-13409679
 ] 

Michael McCandless commented on LUCENE-4069:


Also: in digging into this I found and fixed a silly performance bug in the 
nightly PKLookup perf test:

http://people.apache.org/~mikemccand/lucenebench/PKLookup.html

So now we are close to 1M lookups/sec for a single thread!




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409687#comment-13409687
 ] 

Michael McCandless commented on LUCENE-4069:


bq. I added a 100mb start size for the BloomFilter for large-scale tests 
because without this it gets saturated and there were occasional big spikes in 
batch times.

I wonder if somehow we can do a better job picking the right sized bit vector 
up front?  You basically need to know up front how many unique terms will be in 
the given field for this segment right?  This seems similar to LUCENE-4198, 
where we want to improve Codec API so it has access to collection stats (like 
number of unique terms in this field) up front.
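
As a rough illustration of picking the size up front: assuming the filter sets 
one bit per term (a single hash function) and rounds the bit count up to a 
power of two so that a hash can be mapped to a bit with a mask, the expected 
saturation after n unique terms in m bits is about 1 - exp(-n/m).  Both of 
those are assumptions of this sketch, and BloomSizing and its methods are 
hypothetical names, not code from the patch.

```java
// Hypothetical helper, not part of the LUCENE-4069 patch.
// Assumes a single hash function per term and power-of-two bit counts.
final class BloomSizing {

  /** Smallest power-of-two bit count whose expected saturation is at or below the target. */
  static long bitsForTerms(long uniqueTerms, double targetSaturation) {
    if (targetSaturation <= 0.0 || targetSaturation >= 1.0) {
      throw new IllegalArgumentException("targetSaturation must be in (0, 1)");
    }
    if (uniqueTerms <= 0) {
      return 1;
    }
    // Solve 1 - exp(-n/m) = s for m:  m = -n / ln(1 - s)
    double minBits = -uniqueTerms / Math.log(1.0 - targetSaturation);
    long bits = Long.highestOneBit((long) Math.ceil(minBits));
    // highestOneBit rounds down, so step up once if that fell below the minimum.
    return bits < minBits ? bits << 1 : bits;
  }

  /** Expected fraction of set bits after inserting uniqueTerms terms into a filter of the given size. */
  static double expectedSaturation(long uniqueTerms, long bits) {
    return 1.0 - Math.exp(-(double) uniqueTerms / bits);
  }
}
```

For example, 10M unique terms at a 10% target need at least ~94.9M bits under 
these assumptions, which rounds up to 2^27 bits (16 MB) and an expected 
saturation of roughly 7%.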




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409693#comment-13409693
 ] 

Michael McCandless commented on LUCENE-4069:


I'll open a separate issue for IndexWriter.tryDeleteDocument; I think it's 
worth exploring...




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409746#comment-13409746
 ] 

Michael McCandless commented on LUCENE-4069:


I ran with the fastest configuration (lookupByReader=false,
shareEnums=true, doSortKeys=true, useDocValues=true,
doDirectDelete=true, useBase36=true).

I used MMapDir, 10M numDocs and keySpaceSize, Java 1.7.0_04, 2 GB
starting/max heap.

With doWorstCaseFlushing=false:

{noformat}
  Lucene40
55978 ms

  Bloom
49880 ms

  Memory
55965 ms

  Pulsing
56059 ms
{noformat}

With doWorstCaseFlushing=true:

{noformat}
  Lucene40
61097 ms

  Bloom
53698 ms

  Memory
61066 ms

  Pulsing
61355 ms
{noformat}

So Bloom is ~11-12% faster.  Very curious that Lucene40, Pulsing and
Memory are basically the same!
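
The ~11-12% figure can be reproduced from the timings above, reading "faster" 
as the reduction in elapsed time relative to Lucene40.  SpeedupCheck is a 
hypothetical name used only for this check.

```java
// Hypothetical helper: recomputes the speedup from the elapsed times reported above.
final class SpeedupCheck {

  /** Percentage reduction in elapsed time going from baseMs to candidateMs. */
  static double percentFaster(long baseMs, long candidateMs) {
    return 100.0 * (baseMs - candidateMs) / baseMs;
  }

  public static void main(String[] args) {
    // Lucene40 vs Bloom, doWorstCaseFlushing=false
    System.out.printf("doWorstCaseFlushing=false: %.1f%%%n", percentFaster(55978, 49880));
    // Lucene40 vs Bloom, doWorstCaseFlushing=true
    System.out.printf("doWorstCaseFlushing=true:  %.1f%%%n", percentFaster(61097, 53698));
  }
}
```

That gives about 10.9% and 12.1% respectively.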





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409762#comment-13409762
 ] 

Michael McCandless commented on LUCENE-4069:


I opened LUCENE-4203 for IndexWriter.tryDeleteDocument.

Urgh, patch is on the wrong issue ...

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostingsBranch4x.patch, 
 LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
 MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-06 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13408097#comment-13408097
 ] 

Mark Harwood commented on LUCENE-4069:
--

Thanks for the extra tests, Mike. That's tightened performance but that looks 
like a scary amount of code for the optimal solution of this basic incrementing 
operation :)

I've done some more benchmarks with the updated test and the performance 
characteristics are becoming clearer as shown in these results: 
http://goo.gl/dtWSb
Bloom performance is better than Pulsing but the gap narrows with the volumes 
of deletes lying around in old segments, caused by updates. In these cases the 
BloomFilter gives a false positive and falls back to the equivalent operations 
of Pulsing. I added a 100mb start size for the BloomFilter for large-scale 
tests because without this it gets saturated and there were occasional big 
spikes in batch times.
So overall there still looks to be a benefit and especially in low-frequency 
update scenarios.
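
To make the fast-fail and fallback behaviour concrete, here is a minimal 
sketch of a per-segment lookup guarded by a Bloom-style bit set.  It is an 
illustration only: BloomGuardedSegment is a hypothetical class, a HashMap 
stands in for the segment's term dictionary, and String.hashCode() stands in 
for the real hash function.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not the BloomFilteringPostingsFormat code.
final class BloomGuardedSegment {
  private final BitSet bits;
  private final int mask;
  // Stand-in for the (comparatively expensive) term dictionary of one segment.
  private final Map<String, Integer> termDict = new HashMap<>();
  // Counts how often the filter could not rule a key out.
  int dictLookups = 0;

  BloomGuardedSegment(int numBits) {
    if (Integer.bitCount(numBits) != 1) {
      throw new IllegalArgumentException("numBits must be a power of two");
    }
    this.bits = new BitSet(numBits);
    this.mask = numBits - 1;
  }

  void add(String key, int docId) {
    termDict.put(key, docId);
    bits.set(key.hashCode() & mask);
  }

  /** Returns the doc id for key, or null if this segment does not contain it. */
  Integer lookup(String key) {
    if (!bits.get(key.hashCode() & mask)) {
      return null; // fast-fail: the key is definitely not in this segment
    }
    dictLookups++; // "maybe present": fall back to the normal lookup
    return termDict.get(key); // still null on a false positive
  }
}
```

The filter only ever saves work on misses; whenever it answers "maybe", the 
cost is the same as the unfiltered lookup plus the bit test.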

I'll wait for the dust to settle on LUCENE-4190 (given this Codec introduces a 
new file) before thinking about committing.

Cheers
Mark






[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397425#comment-13397425
 ] 

Michael McCandless commented on LUCENE-4069:


Hmm somehow I messed up when testing the bloom postings format: all of my 
on-disk bloom bit sets are 8 MB, regardless of the size of the segment... so 
something is wrong (I would expect segments w/ more terms to have a larger bit 
set...).  This is what I did for the codec:
{noformat}
final Codec codec = new Lucene40Codec() {
  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    final String pfName = field.equals("id") ? idFieldPostingsFormat : defaultPostingsFormat;
    if (pfName.equals("Bloom")) {
      // Wrap Lucene40 with a Bloom filter that may use up to 100 MB while indexing.
      return new BloomFilteringPostingsFormat(PostingsFormat.forName("Lucene40"),
          new BloomFilterFactory() {
            @Override
            public FuzzySet getSetForField(FieldInfo info) {
              return FuzzySet.createSetBasedOnMaxMemory(100 * 1024 * 1024, new MurmurHash2());
            }
          });
    } else {
      return PostingsFormat.forName(pfName);
    }
  }
};
{noformat}

I thought that would use plenty of RAM (100 MB), and then downsize to the 
appropriately sized bit set (~10% saturation) when writing ... but somehow it 
didn't.
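
One common way to downsize a power-of-two bit set without re-adding the 
terms is to fold it: every set bit i in the large set is mapped to 
i & (smallerSize - 1) in the small one.  The sketch below illustrates that 
idea on java.util.BitSet; it is not the FuzzySet code from the patch, and 
BitSetFolding is a hypothetical name.

```java
import java.util.BitSet;

// Hypothetical illustration of folding a bit set to a smaller power-of-two size.
final class BitSetFolding {

  /** Folds large down to smallerSize bits; smallerSize must be a power of two. */
  static BitSet fold(BitSet large, int smallerSize) {
    if (Integer.bitCount(smallerSize) != 1) {
      throw new IllegalArgumentException("smallerSize must be a power of two");
    }
    int mask = smallerSize - 1;
    BitSet small = new BitSet(smallerSize);
    // Visit only the set bits of the large set and wrap each index into the small range.
    for (int i = large.nextSetBit(0); i >= 0; i = large.nextSetBit(i + 1)) {
      small.set(i & mask);
    }
    return small;
  }

  /** Fraction of bits set, given the logical size of the set. */
  static double saturation(BitSet bits, int size) {
    return (double) bits.cardinality() / size;
  }
}
```

Because a lookup against the folded set uses the same hash with the smaller 
mask, any key that was present in the large set is still reported as "maybe 
present"; folding only increases saturation and hence the false-positive rate.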




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397445#comment-13397445
 ] 

Michael McCandless commented on LUCENE-4069:


OK I think I found the problem w/ the patch: in 
BloomFilteringPostingsFormat.saveAppropriatelySizedBloomFilter, it should be 
{{rightSizedSet.serialize(bloomOutput);}} not 
{{bloomFilter.serialize(bloomOutput);}} (it was saving the un-downsized 
version).  Now I see my .blm files varying in size according to how large the 
segment is...




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397457#comment-13397457
 ] 

Michael McCandless commented on LUCENE-4069:


Results after fixing the not downsizing bug:

{noformat}
            Task  QPS base  StdDev base  QPS bloom  StdDev bloom       Pct diff
          Fuzzy1     97.02         1.43      88.67          3.08  -13% -  -4%
          Fuzzy2     42.50         1.15      40.32          1.75  -11% -   1%
     AndHighHigh     17.99         0.29      17.74          0.47   -5% -   2%
         Respell     82.76         2.96      81.66          3.54   -8% -   6%
      AndHighMed     55.77         1.02      55.06          1.51   -5% -   3%
     TermGroup1M     42.35         0.79      41.98          0.85   -4% -   3%
    TermBGroup1M     49.55         0.58      49.14          0.60   -3% -   1%
        SpanNear     12.53         0.15      12.49          0.09   -2% -   1%
  TermBGroup1M1P     64.36         0.83      64.34          0.83   -2% -   2%
         Prefix3     34.18         1.25      34.45          1.38   -6% -   8%
        Wildcard     34.42         1.12      34.75          1.35   -6% -   8%
    SloppyPhrase     11.48         0.17      11.61          0.22   -2% -   4%
          Phrase      8.47         0.30       8.57          0.25   -5% -   7%
            Term     94.06         4.50      95.45          2.34   -5% -   9%
          IntNRQ      9.84         0.63      10.07          0.71  -10% -  16%
       OrHighMed     34.57         2.07      35.78          2.24   -8% -  16%
      OrHighHigh     11.66         0.64      12.11          0.73   -7% -  16%
        PKLookup    129.75         2.37     153.80          2.31   14% -  22%
{noformat}

Basically same as before ... which is good (it means target 10% saturation 
doesn't lose any perf).




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-19 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13396934#comment-13396934
 ] 

Mark Harwood commented on LUCENE-4069:
--

Mike, currently having various issues getting this benchmark framework up and 
running on my Windows platform here - is it easy for you to kick off another 
run with the latest patch on your setup? The latest change to the patch 
shouldn't require an index rebuild from your last run.

No worries if this is too much hassle for you - I'll probably just try 
switching to testing on OSX at home.

Cheers,
Mark




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397010#comment-13397010
 ] 

Michael McCandless commented on LUCENE-4069:


bq. Mike, currently having various issues getting this benchmark framework up 
and running on my Windows platform here

Alas it's not easy ... please report back on how to make it easier to set up!

bq. is it easy for you to kick off another run with the latest patch on your 
setup?

No problem: I'll run perf test again.  It's easy...




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-19 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397054#comment-13397054
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. No problem: I'll run perf test again. It's easy...

Great, thanks.

bq. Alas it's not easy ... please report back on how to make it easier to set 
up!

My Windows-based woes were:
1) Had to install python (used 2.7)
2) Figure out python proxy settings for Wikipedia download
3) PySVN missing - downloaded install exe but it claimed Python 2.7 wasn't 
installed/available so gave up and did svn checkout manually
4) Ran first python test and it aborted with complaint about GnuPlot missing

I imagine most of what is needed here comes out of the box on typical OSX/Linux 
setup.



 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostingsBranch4x.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java


 An addition to each segment which stores a Bloom filter for selected fields 
 in order to give fast-fail to term searches, helping avoid wasted disk access.
 Best suited for low-frequency fields e.g. primary keys on big indexes with 
 many segments but also speeds up general searching in my tests.
 Overview slideshow here: 
 http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
 Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
 Patch based on 3.6 codebase attached.
 There are no 3.6 API changes currently - to play just add a field with _blm 
 on the end of the name to invoke special indexing/querying capability. 
 Clearly a new Field or schema declaration(!) would need adding to APIs to 
 configure the service properly.
 Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat
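
The fast-fail idea in the description above can be sketched as follows. This is a minimal illustration only, not the code from the attached patches: the class name SegmentBloomSketch, the use of String.hashCode() and the single hash function per term are all assumptions made for the example.

```java
import java.util.BitSet;

/** Minimal Bloom filter sketch: one hash function per term (a simplifying assumption). */
final class SegmentBloomSketch {
  private final BitSet bits;
  private final int numBits;

  SegmentBloomSketch(int numBits) {
    this.numBits = numBits;
    this.bits = new BitSet(numBits);
  }

  /** Maps a term to a bit position; floorMod keeps the index non-negative. */
  private int position(String term) {
    return Math.floorMod(term.hashCode(), numBits);
  }

  /** Would be called at index time for every term of a filtered field. */
  void add(String term) {
    bits.set(position(term));
  }

  /**
   * false means the term is definitely not in this segment, so the term
   * dictionary (and the disk) need not be touched; true only means "maybe".
   */
  boolean mightContain(String term) {
    return bits.get(position(term));
  }

  public static void main(String[] args) {
    SegmentBloomSketch filter = new SegmentBloomSketch(1 << 16);
    filter.add("doc-42");
    System.out.println("doc-42 maybe present: " + filter.mightContain("doc-42"));
    System.out.println("doc-43 maybe present: " + filter.mightContain("doc-43"));
  }
}
```

A primary-key lookup against an index with many segments would consult one such filter per segment and only seek in the segments that answer "maybe".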




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397082#comment-13397082
 ] 

Michael McCandless commented on LUCENE-4069:


Results from last patch:
{noformat}
            Task  QPS base StdDev base  QPS bloom StdDev bloom      Pct diff
          IntNRQ     11.35        1.27      10.14         0.62   -24% -   6%
          Fuzzy1    108.52        3.34     101.82         2.90   -11% -   0%
         Prefix3     64.87        2.17      61.55         1.61   -10% -   0%
        Wildcard     43.18        1.74      41.33         1.17   -10% -   2%
          Fuzzy2     41.76        1.40      40.05         1.00    -9% -   1%
            Term    151.71        4.38     147.24         4.42    -8% -   2%
        SpanNear      5.23        0.09       5.11         0.12    -6% -   1%
       OrHighMed     12.60        0.88      12.34         0.48   -11% -   9%
    SloppyPhrase      8.25        0.20       8.09         0.07    -5% -   1%
    TermBGroup1M     69.98        0.68      68.80         1.13    -4% -   0%
      OrHighHigh     10.06        0.66       9.93         0.39   -11% -   9%
          Phrase     12.73        0.30      12.57         0.35    -6% -   3%
     TermGroup1M     35.44        0.42      35.08         0.67    -4% -   2%
      AndHighMed     63.40        2.27      62.90         1.11    -5% -   4%
         Respell     93.11        3.70      92.81         2.33    -6% -   6%
  TermBGroup1M1P     50.93        1.53      50.96         1.75    -6% -   6%
     AndHighHigh     15.86        0.71      15.93         0.27    -5% -   6%
        PKLookup    127.44        2.15     134.85         8.68    -2% -  14%
{noformat}

Looks like FuzzyN/Respell is good again ... PKLookup is a bit faster ... the 
rest is likely noise.
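
For reading the "Pct diff" column: the ranges in the table are consistent with a worst case of (bloom QPS minus one StdDev) against (base QPS plus one StdDev), and a best case with the signs swapped, truncated to whole percent. That formula is inferred from the numbers above rather than taken from the luceneutil source, so treat the sketch below (with the hypothetical class name PctDiffSketch) as an assumption.

```java
/** Reproduces a "Pct diff" range from mean QPS and standard deviation (inferred formula). */
final class PctDiffSketch {

  /** Pessimistic end: comparison one StdDev slow, baseline one StdDev fast. */
  static int worstCasePct(double baseQps, double baseStdDev, double compQps, double compStdDev) {
    double ratio = (compQps - compStdDev) / (baseQps + baseStdDev);
    return (int) (100.0 * (ratio - 1.0)); // the cast truncates toward zero
  }

  /** Optimistic end: comparison one StdDev fast, baseline one StdDev slow. */
  static int bestCasePct(double baseQps, double baseStdDev, double compQps, double compStdDev) {
    double ratio = (compQps + compStdDev) / (baseQps - baseStdDev);
    return (int) (100.0 * (ratio - 1.0));
  }

  public static void main(String[] args) {
    // PKLookup row: base 127.44 +/- 2.15, bloom 134.85 +/- 8.68
    System.out.println(worstCasePct(127.44, 2.15, 134.85, 8.68) + "% - "
        + bestCasePct(127.44, 2.15, 134.85, 8.68) + "%");
  }
}
```

A range that straddles 0% (as most rows here do) means the two runs are within one standard deviation of each other, which is why those rows are treated as noise.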




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397174#comment-13397174
 ] 

Michael McCandless commented on LUCENE-4069:


{quote}
My Windows-based woes were:
 1) Had to install python (used 2.7)
{quote}

OK.

bq. 2) Figure out python proxy settings for Wikipedia download

Hmmm ... you can also manually download separately.

bq. 3) PySVN missing - downloaded install exe but it claimed Python 2.7 wasn't 
installed/available so gave up and did svn checkout manually

Oh I see: setup.py uses PySVN; hmm.  Just checking out manually is better...

bq. 4) Ran first python test and it aborted with complaint about GnuPlot missing

Hmm we should default printIndexCharts to False (I'll fix).

bq. I imagine most of what is needed here comes out of the box on typical 
OSX/Linux setup.

Hmm I don't think pysvn does, nor gnuplot ...




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-18 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13395773#comment-13395773
 ] 

Mark Harwood commented on LUCENE-4069:
--

Interesting results, Mike - thanks for taking the time to run them.

bq.  BloomFilteredFieldsProducer should just pass through intersect to the 
delegate?

I have tried to make the BloomFilteredFieldsProducer get out of the way of the 
client app and the delegate PostingsFormat as soon as it is safe to do so, i.e. 
when the user is safely focused on a non-filtered field. As long as there is a 
chance the client may end up calling TermsEnum.seekExact(..) on a filtered 
field, I need a wrapper object in place that is in a position to intercept 
this call. For all other method invocations I just end up delegating, so I 
wonder if all these extra method calls are the cause of the slowdown you see, 
e.g. when Fuzzy is enumerating over many terms. 
The only alternatives to endlessly wrapping in this way are:
a) API change - e.g. allow TermsEnum.seekExact to have a pluggable call-out for 
just this one method.
b) Mess around with byte-code manipulation techniques to weave in Bloom 
filtering (the sort of thing I recall Hibernate resorts to).

Neither of these seems a particularly appealing option, so I think we may have 
to live with fuzzy+bloom not being as fast as straight fuzzy.

For completeness' sake - I don't have access to your benchmarking code, but I 
would hope that PostingsFormat.fieldsProducer() isn't called more than once for 
the same segment, as that's where the Bloom filters get loaded from disk, so 
there's an inherent cost there too. I can't imagine this is the case.

BTW I've just finished a long-running set of tests which mixes up reads and 
writes here: http://goo.gl/KJmGv
This benchmark represents how graph databases such as Neo4j use Lucene for an 
index when loading (I typically use the Wikipedia links as a test set). Bloom 
filtering looks to give a 3.5 x speed up on Lucene 4, and nearly a 9 x speed up 
on the comparatively slower 3.6 codebase.
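
The wrapping approach described above can be sketched roughly as follows. None of these types are Lucene classes: TermDictionary, SortedTermDictionary and BloomFilteredDictionary are hypothetical stand-ins, and the Bloom filter is reduced to a Predicate so the example stays self-contained.

```java
import java.util.Iterator;
import java.util.TreeSet;
import java.util.function.Predicate;

/** Hypothetical stand-in for a per-field term dictionary; not a Lucene type. */
interface TermDictionary {
  boolean seekExact(String term);
  Iterator<String> iterator();
}

/** Plain delegate; counts lookups so the effect of the filter is visible. */
final class SortedTermDictionary implements TermDictionary {
  private final TreeSet<String> terms = new TreeSet<>();
  int seeks = 0;

  void add(String term) {
    terms.add(term);
  }

  @Override
  public boolean seekExact(String term) {
    seeks++; // stands in for the disk access a real lookup may cost
    return terms.contains(term);
  }

  @Override
  public Iterator<String> iterator() {
    return terms.iterator();
  }
}

/** Wrapper that intercepts seekExact and forwards everything else untouched. */
final class BloomFilteredDictionary implements TermDictionary {
  private final TermDictionary delegate;
  private final Predicate<String> mightContain;

  BloomFilteredDictionary(TermDictionary delegate, Predicate<String> mightContain) {
    this.delegate = delegate;
    this.mightContain = mightContain;
  }

  @Override
  public boolean seekExact(String term) {
    if (!mightContain.test(term)) {
      return false; // fast fail: the delegate is never consulted
    }
    return delegate.seekExact(term);
  }

  @Override
  public Iterator<String> iterator() {
    return delegate.iterator(); // pure pass-through, no per-term wrapping
  }
}
```

Only the one intercepted method pays for the filter check; every other call costs one extra level of indirection, which is the overhead being discussed here.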





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13395830#comment-13395830
 ] 

Michael McCandless commented on LUCENE-4069:


bq. Interesting results, Mike - thanks for taking the time to run them.

You're welcome!

{quote}
bq. BloomFilteredFieldsProducer should just pass through intersect to the 
delegate?

I have tried to make the BloomFilteredFieldsProducer get out of the way of the 
client app and the delegate PostingsFormat as soon as it is safe to do so i.e. 
when the user is safely focused on a non-filtered field. While there is a 
chance the client may end up making a call to TermsEnum.seekExact(..) on a 
filtered field then I need to have a wrapper object in place which is in a 
position to intercept this call. In all other method invocations I just end up 
delegating calls so I wonder if all these extra method calls are the cause of 
the slowdown you see e.g. when Fuzzy is enumerating over many terms. 
 The only other alternatives to endlessly wrapping in this way are:
 a) API change - e.g. allow TermsEnum.seekExact to have a pluggable call-out 
for just this one method.
 b) Mess around with byte-code manipulation techniques to weave in Bloom 
filtering(the sort of thing I recall Hibernate resorts to)

Neither of these seem particularly appealing options so I think we may have to 
live with fuzzy+bloom not being as fast as straight fuzzy.
{quote}

I think the fix is simple: you are not overriding Terms.intersect now,
in BloomFilteredTerms.  I think you should override it and immediately
delegate and then FuzzyN/Respell performance should be just as good as
Lucene40 codec.
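
A rough sketch of why the missing override matters, using hypothetical stand-in classes (TermsSketch, SpecializedTerms, WrappingTerms) rather than the real Lucene types: if the base class offers a generic intersect that scans all terms, a wrapper that does not override it silently falls back to that generic path and never reaches the delegate's specialized implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.function.Predicate;

/** Hypothetical stand-in for a terms API with a generic default intersect. */
abstract class TermsSketch {
  abstract Iterable<String> allTerms();

  /** Generic fallback: scan every term and test it. */
  List<String> intersect(Predicate<String> matcher) {
    List<String> hits = new ArrayList<>();
    for (String term : allTerms()) {
      if (matcher.test(term)) {
        hits.add(term);
      }
    }
    return hits;
  }
}

/** Delegate with its own intersect; here it only records that it was used. */
final class SpecializedTerms extends TermsSketch {
  final TreeSet<String> terms = new TreeSet<>();
  int specializedCalls = 0;

  @Override
  Iterable<String> allTerms() {
    return terms;
  }

  @Override
  List<String> intersect(Predicate<String> matcher) {
    specializedCalls++;
    // A real implementation would prune the term space instead of scanning it.
    return super.intersect(matcher);
  }
}

/** Wrapper with the suggested fix: override intersect and delegate immediately. */
final class WrappingTerms extends TermsSketch {
  private final TermsSketch delegate;

  WrappingTerms(TermsSketch delegate) {
    this.delegate = delegate;
  }

  @Override
  Iterable<String> allTerms() {
    return delegate.allTerms();
  }

  @Override
  List<String> intersect(Predicate<String> matcher) {
    return delegate.intersect(matcher); // without this override the generic scan above runs
  }
}
```

Removing the override in WrappingTerms leaves the results unchanged but bypasses SpecializedTerms.intersect entirely, which is the kind of silent slowdown seen in the FuzzyN/Respell rows.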

bq. For completeness sake - I don't have access to your benchmarking code

All the benchmarking code is here: 
http://code.google.com/a/apache-extras.org/p/luceneutil/

I run it nightly (trunk) and publish the results here: 
http://people.apache.org/~mikemccand/lucenebench/

bq. but I would hope that PostingsFormat.fieldsProducer() isn't called more 
than once for the same segment as that's where the Bloom filters get loaded 
from disk so there's inherent cost there too. I can't imagine this is the case.

It's only called once on init'ing the SegmentReader (or at least it
better be!).

{quote}
BTW I've just finished a long-running set of tests which mixes up reads and 
writes here: http://goo.gl/KJmGv
 This benchmark represents how graph databases such as Neo4j use Lucene for an 
index when loading (I typically use the Wikipedia links as a test set). I look 
to get a 3.5 x speed up in Lucene 4 and Lucene 3.6 gets nearly 9 x speedup over 
the comparatively slower 3.6 codebase.
{quote}

Nice results!  It looks like bloom(3.6) is faster than bloom(4.0)?
Why is that...

Also I wonder why you see such sizable (3.5X speedup) gains on PK
lookup but in my benchmark I see only ~13% - 24%.  My index has 5
segments per level...





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-18 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13395857#comment-13395857
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. I think the fix is simple: you are not overriding Terms.intersect now, in 
BloomFilteredTerms

Good catch - a quick test indeed shows a speed up on fuzzy queries. 
I'll prepare a new patch.

I'm not sure on why 3.6+Bloom is faster than 4+Bloom in my tests. I'll take a 
closer look at your benchmark.




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13395980#comment-13395980
 ] 

Michael McCandless commented on LUCENE-4069:


bq. Didn't get too far with running the Wikipedia perf tests due to missing 
data file (see 
http://code.google.com/a/apache-extras.org/p/luceneutil/issues/detail?id=7 )

Woops sorry: I posted a comment there (new, more recent wikipedia export).




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-15 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13295748#comment-13295748
 ] 

Michael McCandless commented on LUCENE-4069:


I ran a benchmark on a 10 M Wikipedia index; for the factory I used 
createSetBasedOnMemory and passed it 100 MB; I think that's enough to ensure we 
get the 10% saturation on save ...:

{noformat}
            Task  QPS base StdDev base  QPS bloom StdDev bloom      Pct diff
          Fuzzy1    102.47        3.67      41.95         0.78   -61% - -56%
          Fuzzy2     38.36        1.76      18.68         0.37   -54% - -47%
         Respell     89.89        4.38      44.09         0.52   -53% - -47%
        Wildcard     40.48        2.82      36.20         0.64   -17% -  -2%
    SloppyPhrase      7.96        0.28       8.07         0.07    -3% -   5%
         Prefix3     61.94        5.34      63.35         0.37    -6% -  12%
    TermBGroup1M     71.37        6.79      73.73         1.55    -7% -  16%
      AndHighMed     64.09        5.51      66.73         1.75    -6% -  16%
  TermBGroup1M1P     49.55        3.78      51.75         2.67    -7% -  18%
     AndHighHigh     16.05        1.12      16.77         0.53    -5% -  15%
     TermGroup1M     35.87        3.07      37.56         0.74    -5% -  16%
      OrHighHigh      9.60        1.38      10.15         0.65   -13% -  31%
       OrHighMed     11.93        1.91      12.63         0.93   -15% -  35%
          IntNRQ      9.12        1.25       9.68         0.11    -7% -  24%
            Term    154.55       19.60     165.32         0.97    -5% -  23%
          Phrase     11.40        0.33      12.21         0.18     2% -  11%
        SpanNear      4.31        0.07       4.73         0.03     7% -  12%
        PKLookup    122.78        1.42     145.95         5.22    13% -  24%
{noformat}

Baseline is Lucene40 PostingsFormat even for the id field ... so PKLookup gets 
a good improvement.  This is on an index w/ 5 segments at each level.

Other queries seem to speed up as well (eg Term, Or*).

The queries that rely on Terms.intersect got much worse: maybe the 
BloomFilteredFieldsProducer should just pass through intersect to the delegate?
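
As a back-of-the-envelope check on the "100 MB is enough for 10% saturation" remark, the expected fraction of set bits after inserting n keys into m bits with k hash functions is approximately 1 - exp(-k*n/m). The sketch below assumes one hash function per key and treats all 10 M ids as if they went into a single 100 MB filter; both are simplifications (the filters are per segment, and the patch may size them differently on save), and the class name SaturationSketch is made up for the example.

```java
/** Expected Bloom filter fill under the standard 1 - exp(-k*n/m) approximation. */
final class SaturationSketch {

  /** Expected fraction of bits set after inserting numKeys keys into numBits bits. */
  static double expectedSaturation(long numKeys, long numBits, int hashesPerKey) {
    return 1.0 - Math.exp(-(double) hashesPerKey * numKeys / numBits);
  }

  public static void main(String[] args) {
    long numBits = 100L * 1024 * 1024 * 8; // 100 MB expressed in bits
    long numKeys = 10_000_000L;            // assumed: one unique id per document
    double fill = expectedSaturation(numKeys, numBits, 1);
    System.out.printf("expected saturation: %.2f%%%n", 100.0 * fill);
  }
}
```

Under these assumptions the fill comes out at roughly 1%, comfortably under the 10% target, so the memory budget is unlikely to be the limiting factor in this run.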




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-01 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13287258#comment-13287258
 ] 

Mark Harwood commented on LUCENE-4069:
--

I've thought some more about option 2 (PerFieldPF reusing wrapped PFs) and it 
looks to get very ugly very quickly.
There's only so much PerFieldPF can do to rationalize a random jumble of PF 
instances presented to it by clients. I think the right place to draw the line 
is LUCENE-4093, i.e. a simple .equals() comparison on top-level PFs to eliminate 
any duplicates. Any other approach that also tries to de-dup nested PFs looks 
to be adding a lot of complexity, especially when you consider what that does 
to the model of read-time object instantiation. This would be significant added 
complexity to solve a problem you have already suggested is insignificant (i.e. 
too many files doesn't really matter when using CFS).
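
To make the "top-level .equals() only" line concrete, here is a small sketch with made-up types (FormatSketch, PerFieldGrouping); it is not the PerFieldPostingsFormat code. Fields whose top-level formats are equal share one group, while a format nested inside a wrapper is deliberately not merged with an equal stand-alone format.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical postings-format descriptor; equality covers the name and the wrapped format. */
final class FormatSketch {
  final String name;
  final FormatSketch wrapped; // null for a leaf format

  FormatSketch(String name, FormatSketch wrapped) {
    this.name = name;
    this.wrapped = wrapped;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof FormatSketch)) {
      return false;
    }
    FormatSketch other = (FormatSketch) o;
    return name.equals(other.name) && Objects.equals(wrapped, other.wrapped);
  }

  @Override
  public int hashCode() {
    return Objects.hash(name, wrapped);
  }
}

final class PerFieldGrouping {
  /** Assigns one group id per distinct top-level format, in first-seen order. */
  static Map<String, Integer> assign(Map<String, FormatSketch> fieldToFormat) {
    Map<FormatSketch, Integer> groupIds = new LinkedHashMap<>();
    Map<String, Integer> fieldToGroup = new LinkedHashMap<>();
    for (Map.Entry<String, FormatSketch> entry : fieldToFormat.entrySet()) {
      Integer id = groupIds.get(entry.getValue());
      if (id == null) {
        id = groupIds.size();
        groupIds.put(entry.getValue(), id);
      }
      fieldToGroup.put(entry.getKey(), id);
    }
    return fieldToGroup;
  }
}
```

With fields mapped to Bloom(Lucene40), Lucene40, Lucene40 and Bloom(Lucene40), this yields two groups; the Lucene40 inside the Bloom wrapper still writes separately from the plain Lucene40, which is the extra set of files being traded off against complexity here.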

I can remove the per-field stuff from BloomPF if you want but I imagine I will 
routinely subclass it to add this optimisation back in to my apps.




 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostings40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286555#comment-13286555
 ] 

Robert Muir commented on LUCENE-4069:
-

I don't think the abstract class should be registered in the SPI.

Instead I think the concrete Bloom+Lucene40 that you have in tests should be 
moved into src/java and registered there; just call it Bloom40 or something. 
The abstract API is still available for someone who wants to do something more 
specialized.

This is just like how pulsing (another wrapper) is implemented.

As far as disabling this for certain tests, import 
o.a.l.util.LuceneTestCase.SuppressCodecs and put something like this at class 
level:

{code}
@SuppressCodecs("Bloom40")
public class TestFoo...

@SuppressCodecs({"Bloom40", "Memory"})
public class TestBar...
{code}

The strings in here can be the names of codecs or postings formats.




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286562#comment-13286562
 ] 

Robert Muir commented on LUCENE-4069:
-

Seeing the tests in question though, I don't think you want to disable this for 
these entire test classes.

We don't have a way to disable this on a per-method basis, and I think it's 
generally not possible because many classes create indexes in @BeforeClass etc.

An alternative would be to just pick this less often in RandomCodec: see the 
SimpleText hack :)




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286598#comment-13286598
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. Instead i think the concrete Bloom+Lucene40 that you have in tests should 
be moved into src/java and registered there

What problem would that be trying to solve? Registration (or creation) of any 
BloomFilteringPostingsFormat subclasses is not necessary to decode index 
contents. Offering a Bloom40 would only buy users a pairing of 
Lucene40Postings and Bloom filtering but they would still have to declare which 
fields they want Bloom filtering on at write time. This isn't too hard using 
the code in the existing patch:

{code:title=ThisWorks.java}
final Set<String> bloomFilteredFields = new HashSet<String>();
bloomFilteredFields.add(PRIMARY_KEY_FIELD_NAME);

iwc.setCodec(new Lucene40Codec() {
  // One shared instance: Bloom filtering applies only to the listed fields.
  BloomFilteringPostingsFormat postingOptions = new BloomFilteringPostingsFormat(
      new Lucene40PostingsFormat(), bloomFilteredFields);

  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    return postingOptions;
  }
});
{code}
No extra subclasses/registration required here to read the index built with the 
above setup.





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286599#comment-13286599
 ] 

Robert Muir commented on LUCENE-4069:
-

I don't understand why this handles fields. Someone should just pick that with 
PerFieldPostingsFormat.

So you have the abstract wrapper (takes the wrapped postings format, and a 
String name), not registered.
And you have a concrete impl registered that is just abstractWrapper(lucene40, 
Bloom40): done.




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286600#comment-13286600
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. An alternative would be to just pick this less often in RandomCodec: see 
the SimpleText hack 

Another option might be to make the TestBloomFilteredLucene40Postings pick a 
ludicrously small Bitset sizing option for each field so that we can 
accommodate tests that create silly numbers of fields. The bitsets being so 
small will just quickly reach saturation and force all reads to hit the 
underlying FieldsProducer.
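
A minimal sketch of the saturation effect described above, assuming a toy single-hash filter. The class name TinyBloom, the FNV-1a hash and the sizes are illustrative assumptions and are not taken from the patch:

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Toy single-hash Bloom filter; NOT the implementation in the patch. */
class TinyBloom {
    private final BitSet bits;
    private final int size;

    TinyBloom(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    /** FNV-1a over the UTF-8 bytes, reduced to a bit position. */
    private int slot(String term) {
        int h = 0x811C9DC5;
        for (byte b : term.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return Math.floorMod(h, size);
    }

    void add(String term) {
        bits.set(slot(term));
    }

    /** false = definitely absent; true = the delegate must be consulted. */
    boolean mightContain(String term) {
        return bits.get(slot(term));
    }

    /** Fraction of bits that are set. */
    double saturation() {
        return bits.cardinality() / (double) size;
    }
}
```

With an 8-bit filter, adding the terms "0" to "7" sets every bit for this particular hash, so mightContain answers true for any term and every read falls through to the delegate. The filter stays correct but stops saving any work, which is the behaviour the test configuration would rely on.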




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286616#comment-13286616
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. I don't understand why this handles fields. Someone should just pick that 
with PerFieldPostingsFormat.

That would be inefficient because your PFPF will see 
BloomFilteringPostingsFormat(field1 + Lucene40) and 
BloomFilteringPostingsFormat(field2 + Lucene40) as fundamentally different 
PostingsFormat instances and consequently create multiple files named 
differently because it assumes these instances may be capable of using 
radically different file structures.
In reality, the choice of BloomFilter with field 1 or BloomFilter with field 2 
or indeed no BloomFilter does not fundamentally alter the underlying delegate 
PostingsFormat's file format - it only adds a supplementary blm file on the 
side with the field summaries. For this reason it is a mistake to configure 
separate BloomFilteringPostingsFormat instances on a per-field basis if they can 
share a common delegate.






[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286619#comment-13286619
 ] 

Robert Muir commented on LUCENE-4069:
-

{quote}
That would be inefficient because your PFPF will see 
BloomFilteringPostingsFormat(field1 + Lucene40) and 
BloomFilteringPostingsFormat(field2 + Lucene40) as fundamentally different 
PostingsFormat instances and consequently create multiple files named 
differently because it assumes these instances may be capable of using 
radically different file structures.
{quote}

But adding per-field handling here is not the way to solve this: it's messy. 

Per-Field handling should all be handled at a level above in 
PerFieldPostingsFormat.

To solve what you speak of we just need to resolve LUCENE-4093. Then multiple 
postings format instances that are 'the same' will be deduplicated correctly.





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286707#comment-13286707
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq.  To solve what you speak of we just need to resolve LUCENE-4093. 

Presumably the main objective here is that in order to cut down on the number 
of files we store, content consumers of various types should aim to consolidate 
multiple fields' contents into a single file (if they share common config 
choices). 

bq. Then multiple postings format instances that are 'the same' will be 
deduplicated correctly.

The complication in this case is that we essentially have 2 consumers (Bloom 
and Lucene40), one wrapped in the other with different but overlapping choices 
of fields, e.g. we want a single Lucene40 to process all fields but we want 
Bloom to handle only a subset of these fields. This will be a tough one for 
PFPF to untangle while we are stuck with a delegating model for composing 
consumers. 

This may be made easier if instead of delegating a single stream we have a 
*stream-splitting* capability via a multicast subscription e.g. Bloom filtering 
consumer registers interest in content streams for fields A and B while 
Lucene40 is consolidating content from fields A, B, C and D. A broadcast 
mechanism feeds each consumer a copy of the relevant stream and each consumer 
is responsible for inventing their own file-naming convention that avoids 
muddling files.
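
The multicast idea above can be sketched roughly as follows. This is a hypothetical model only: FieldBroadcaster, TermSink and RecordingSink are invented names for illustration, nothing like them exists in the patch, and real consumers would receive postings rather than plain strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical broadcaster that copies each field's stream to interested consumers. */
class FieldBroadcaster {
    /** Illustrative consumer interface receiving (field, term) pairs. */
    interface TermSink {
        void accept(String field, String term);
    }

    // Consumer -> fields it subscribed to; a null set means "all fields".
    private final Map<TermSink, Set<String>> subscriptions = new LinkedHashMap<>();

    void subscribe(TermSink sink, Set<String> fields) {
        subscriptions.put(sink, fields);
    }

    /** Feeds the term to every consumer subscribed to this field. */
    void publish(String field, String term) {
        for (Map.Entry<TermSink, Set<String>> e : subscriptions.entrySet()) {
            Set<String> wanted = e.getValue();
            if (wanted == null || wanted.contains(field)) {
                e.getKey().accept(field, term);
            }
        }
    }
}

/** Consumer that simply records what it was sent. */
class RecordingSink implements FieldBroadcaster.TermSink {
    final List<String> seen = new ArrayList<>();

    @Override
    public void accept(String field, String term) {
        seen.add(field + ":" + term);
    }
}
```

In this model a Bloom consumer would subscribe to fields A and B while a Lucene40 consumer subscribes to everything, and neither needs to know about the other at write time.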

While that may help for writing streams it doesn't solve the re-assembly of 
producer streams at read-time where BloomFilter absolutely has to position 
itself in front of the standard Lucene40 producer in order to offer fast-fail 
lookups. 
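
To illustrate why the filter has to sit in front of the producer at read time, here is a toy model, assuming a single hash and an in-memory map standing in for the on-disk terms dictionary. FastFailTerms and its members are illustrative names, not classes from the patch:

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a filter placed in front of a slower term dictionary. */
class FastFailTerms {
    private static final int BITS = 1 << 16;
    private final BitSet filter = new BitSet(BITS);
    // Stand-in for the wrapped producer's on-disk terms.
    private final Map<String, Integer> delegate = new TreeMap<>();
    // Counts how often the "expensive" path was taken.
    int delegateLookups = 0;

    private static int slot(String term) {
        return Math.floorMod(term.hashCode(), BITS);
    }

    void index(String term, int docId) {
        filter.set(slot(term));
        delegate.put(term, docId);
    }

    /** Returns the doc id, or -1 if the term is absent. */
    int lookup(String term) {
        if (!filter.get(slot(term))) {
            return -1; // fast-fail: the delegate is never consulted
        }
        delegateLookups++;
        Integer doc = delegate.get(term);
        return doc == null ? -1 : doc;
    }
}
```

A miss that the filter can rule out never touches the delegate, which is where the saving comes from; a hit or a false positive costs one extra bit test on top of the normal lookup.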

In the absence of a fancy optimised routing mechanism (this all may be 
overkill) my current solution was to put BloomFilter in the delegate chain 
armed with a subset of fieldnames to observe as a larger array of fields flow 
past to a common delegate. I added some Javadocs to describe the need to do it 
this way for an efficient configuration.
You are right that this is messy (i.e. open to bad configuration), but when 
operating this deep down in Lucene that's always a possibility regardless of 
what we put in place.








[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286712#comment-13286712
 ] 

Robert Muir commented on LUCENE-4069:
-

{quote}
but overlapping choices of fields e.g we want a single Lucene40 to process all 
fields but we want Bloom to handle only a subset of these fields.
{quote}

That's not true: I disagree. It's an implementation detail that Bloom as a 
postingsformat wraps another one (that's just the abstract implementation), and 
the file formats should not expose this in general for any format.

This is true for a number of reasons: e.g. in the pulsing case the wrapped 
writer only gets a subset of the postings: therefore the wrapped writer's files 
are incomplete and an implementation detail.

It's enough here that if you have 5 fields, 2 bloom and 3 not, we detect 
there are only two postings formats in use, regardless of whether you have 2 or 
5 actual object instances.
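
A rough sketch of that detection, assuming formats are considered 'the same' when they share a name. FormatGrouping and its nested Format class are illustrative stand-ins and do not reflect how PerFieldPostingsFormat or LUCENE-4093 is actually implemented:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FormatGrouping {
    /** Minimal stand-in for a postings format identified only by its name. */
    static final class Format {
        final String name;

        Format(String name) {
            this.name = name;
        }
    }

    /** Groups fields by format name, ignoring how many object instances exist. */
    static Map<String, Set<String>> groupByName(Map<String, Format> fieldToFormat) {
        Map<String, Set<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, Format> e : fieldToFormat.entrySet()) {
            groups.computeIfAbsent(e.getValue().name, k -> new TreeSet<>())
                  .add(e.getKey());
        }
        return groups;
    }
}
```

Five fields that each carry their own Format object, two named "Bloom40" and three named "Lucene40", collapse into two groups here, so only two sets of files would need to be written.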





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286717#comment-13286717
 ] 

Robert Muir commented on LUCENE-4069:
-

And separately, you can always contain the number of files even today by:
* using only unique instances yourself when writing (rather than waiting on 
LUCENE-4093)
* using the compound file format.

The purpose of LUCENE-4093 is just to make this simpler, but I opened it as a 
separate issue because it's really solely an optimization, and only for a 
pretty rare case where people are customizing the index format for different 
fields.




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286754#comment-13286754
 ] 

Mark Harwood commented on LUCENE-4069:
--

It's true to say that Bloom is a different case to Pulsing - Bloom does not 
interfere in any way with the normal recording of content in the wrapped 
delegate whereas Pulsing does.
It may prove useful for us to mark a formal distinction between these 
mutating/non-mutating types so we can treat them differently and provide 
optimisations?


bq. And separately, you can always contain the number of files even today by 
using only unique instances yourself when writing

Contained but not optimal - roughly double the number of required files if I 
want the common case of a primary key indexed with Bloom. I can't see a way of 
indexing with Bloom-plus-Lucene40 on field A and indexing with just Lucene40 
on fields B,C and D and winding up with only one Lucene40 set of files with a 
common segment suffix. The way I did find of achieving this was to add a 
bloomFilteredFields set into my single Bloom+Lucene40 instance used for all 
fields. Is there any other option here currently? 

Looking to the future, 4093 may have more capabilities at optimising if it 
understands the distinction between mutating wrappers and non-mutating ones and 
how they are composed?






[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286756#comment-13286756
 ] 

Robert Muir commented on LUCENE-4069:
-

{quote}
Contained but not optimal - roughly double the number of required files if I 
want the common case of a primary key indexed with Bloom.
{quote}

Then use CFS, it's always optimal (1). 

I really don't think we should make this complex to save 2 or 3 files total 
(even in a complex config with many fields). It's not worth the complexity.




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286815#comment-13286815
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. It's not worth the complexity

There's no real added complexity in BloomFilteringPostingsFormat - it has to be 
capable of storing blooms for 1 field anyway and using the fieldname set is 
roughly 2 extra lines of code to see if a TermsConsumer needs wrapping or not.


From the client side you don't have to use this feature - the fieldname set can 
be null, in which case it will wrap all fields sent its way. If you do choose to 
supply a set, the wrapped PostingsFormat will have the advantage of being 
shared for bloomed and non-bloomed fields. We could add a constructor that 
removes the set and mark the others expert.
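
The check in question amounts to something like the following sketch. WrapDecision and shouldWrap are hypothetical names chosen for illustration; the patch's actual code may be structured differently:

```java
import java.util.Set;

class WrapDecision {
    /**
     * Decides whether a field's terms consumer gets a Bloom-filtering wrapper.
     * A null set means "wrap every field"; otherwise only listed fields are wrapped.
     */
    static boolean shouldWrap(Set<String> bloomFilteredFields, String field) {
        return bloomFilteredFields == null || bloomFilteredFields.contains(field);
    }
}
```

Fields that fail the check would be passed straight through to the shared delegate, which is how bloomed and non-bloomed fields end up in the same set of delegate files.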

For me this falls into one of the many faster-if-you-know-about-it 
optimisations like FieldSelectors or recycling certain objects. Basically a 
useful hint to Lucene to save some extra effort but one which you don't *need* 
to use.

LUCENE-4093 may in future resolve the multi-file issue but I'm not sure it will 
do so without significant complication.




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286868#comment-13286868
 ] 

Simon Willnauer commented on LUCENE-4069:
-

bq. I really don't think we should make this complex to save 2 or 3 files total 
(even in a complex config with many fields). It's not worth the complexity.

I agree. I think those postings formats should only deal with encoding and not 
with handling certain fields differently. A user / app should handle this in 
the codec. Ideally you don't have any conditions in the relevant methods like 
termsConsumer etc. 

bq. For me this falls into one of the many faster-if-you-know-about-it 
optimisations like FieldSelectors or recycling certain objects. Basically a 
useful hint to Lucene to save some extra effort but one which you don't need to 
use.

Why is this a speed improvement? Reading from one file vs. multiple is not 
really faster though.

Anyway, I think we should make this patch as simple as possible and not 
handle fields in the PF. We can still open another issue or wait until 
LUCENE-4093 is in to discuss this issue?

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostings40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip


 An addition to each segment which stores a Bloom filter for selected fields 
 in order to give fast-fail to term searches, helping avoid wasted disk access.
 Best suited for low-frequency fields e.g. primary keys on big indexes with 
 many segments but also speeds up general searching in my tests.
 Overview slideshow here: 
 http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
 Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
 Patch based on 3.6 codebase attached.
 There are no 3.6 API changes currently - to play just add a field with _blm 
 on the end of the name to invoke special indexing/querying capability. 
 Clearly a new Field or schema declaration(!) would need adding to APIs to 
 configure the service properly.
 Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat
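
The fast-fail idea in the description above can be illustrated with a minimal, self-contained sketch. This is not the code from the attached patches: SimpleBloomFilter and SegmentSketch are hypothetical names, the seeded FNV-1a hashing is an assumption chosen for brevity, and an in-memory set stands in for the on-disk term dictionary.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

/** Minimal Bloom filter sketch (hypothetical; not the implementation in the patch). */
final class SimpleBloomFilter {
  private final BitSet bits;
  private final int numBits;
  private final int numHashes;

  SimpleBloomFilter(int numBits, int numHashes) {
    this.bits = new BitSet(numBits);
    this.numBits = numBits;
    this.numHashes = numHashes;
  }

  // FNV-1a over the UTF-8 bytes; the seed varies the result to derive several hashes.
  private int hash(String term, int seed) {
    int h = 0x811C9DC5 ^ seed;
    for (byte b : term.getBytes(StandardCharsets.UTF_8)) {
      h ^= (b & 0xFF);
      h *= 16777619;
    }
    return Math.floorMod(h, numBits);
  }

  void add(String term) {
    for (int i = 0; i < numHashes; i++) {
      bits.set(hash(term, i * 0x9E3779B9));
    }
  }

  /** False means "definitely absent"; true means "possibly present". */
  boolean mightContain(String term) {
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(hash(term, i * 0x9E3779B9))) {
        return false;
      }
    }
    return true;
  }
}

/** Stand-in for one segment: the filter guards the (expensive) dictionary lookup. */
final class SegmentSketch {
  private final SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 3);
  private final Set<String> termDictionary = new HashSet<>(); // pretend this lives on disk
  int dictionaryLookups = 0;

  void index(String key) {
    filter.add(key);
    termDictionary.add(key);
  }

  boolean contains(String key) {
    if (!filter.mightContain(key)) {
      return false; // fast-fail: the dictionary is never touched
    }
    dictionaryLookups++;
    return termDictionary.contains(key);
  }
}

final class BloomSketchDemo {
  public static void main(String[] args) {
    SegmentSketch segment = new SegmentSketch();
    for (int i = 0; i < 100; i++) {
      segment.index("doc-" + i);
    }
    System.out.println("doc-42 found: " + segment.contains("doc-42"));
    System.out.println("missing-key found: " + segment.contains("missing-key"));
    System.out.println("dictionary lookups: " + segment.dictionaryLookups);
  }
}
```

A filter can report a false positive but never a false negative, so the dictionary check after a positive answer is still required. The saving comes from the segments that can be skipped outright, which is why primary-key lookups across many segments are the natural fit.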




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286882#comment-13286882
 ] 

Robert Muir commented on LUCENE-4069:
-

{quote}
For me this falls into one of the many faster-if-you-know-about-it 
optimisations like FieldSelectors or recycling certain objects. Basically a 
useful hint to Lucene to save some extra effort but one which you don't need to 
use.
{quote}

I agree with Simon: it's not going to be faster.

Worse, it creates a situation from the per-field perspective where multiple 
postings formats are sharing the same files for a segment.

This would make it harder to do things like refactorings of codec apis in the 
future.

So where is the benefit?




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286916#comment-13286916
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. why is this a speed improvement?

Sorry - misleading. Replace the word "faster" in my comment with "better" and 
that makes more sense - I mean better in terms of resource usage and reduced 
open file handles. This seemed relevant given the earlier comments about Solr's 
use of non-compound files:

bq. [Solr] create massive amounts of files if we did so (add to the fact it 
disables compound files by default and its a disaster...)

I can see there is a useful simplification being sought here if PerFieldPF 
can consider each of the unique top-level PFs presented to it as looking after 
an exclusive set of files. As the centralised allocator of file names it can 
then simply call each unique PF with a choice of segment suffix to name its 
various files without conflicting with other PFs. LUCENE-4093 is all about 
better determining which PF is unique using .equals(). Unfortunately I don't 
think this approach goes far enough. In order to avoid allocating any 
unnecessary file names PerFieldPF would have to further understand the nuances 
of which PFs were being wrapped by other PFs and which wrapped PFs would be 
reusable outside of their wrapping PF (as is the case with BloomPF's wrapped 
PF). That seems a more complex task than implementing equals(). 

So it seems we have 3 options:
1) Ignore the problems of creating too many files in the case of BloomPF and 
any other examples of wrapping PFs
2) Create a PerFieldPF implementation that reuses wrapped PFs using some 
generic means of discovering recyclable wrapped PFs (i.e. go further than what 
LUCENE-4093 currently proposes in adding .equals support)
3) Retain my BloomPF-specific solution to the problem for those prepared to use 
lower-level APIs.

Am I missing any other options and which one do you want to go for?
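
For illustration only, the equals()-based part of this discussion (a central allocator that hands one segment suffix to each distinct PF) might look roughly like the sketch below. FormatConfig and SuffixAllocator are hypothetical stand-ins, not Lucene classes, and the "name_N" suffix scheme is an assumption. The sketch deliberately ignores the harder question raised above of discovering reusable PFs that are wrapped inside other PFs.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical stand-in for a configured postings format: a name plus one setting. */
final class FormatConfig {
  final String name;
  final int setting;

  FormatConfig(String name, int setting) {
    this.name = name;
    this.setting = setting;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof FormatConfig)) {
      return false;
    }
    FormatConfig other = (FormatConfig) o;
    return name.equals(other.name) && setting == other.setting;
  }

  @Override
  public int hashCode() {
    return Objects.hash(name, setting);
  }
}

/** Hands out one file-name suffix per distinct (by equals) format. */
final class SuffixAllocator {
  private final Map<FormatConfig, String> suffixes = new LinkedHashMap<>();

  String suffixFor(FormatConfig format) {
    String suffix = suffixes.get(format);
    if (suffix == null) {
      // Assumed naming scheme: format name plus a running number.
      suffix = format.name + "_" + suffixes.size();
      suffixes.put(format, suffix);
    }
    return suffix;
  }
}

final class SuffixAllocatorDemo {
  public static void main(String[] args) {
    SuffixAllocator allocator = new SuffixAllocator();
    System.out.println(allocator.suffixFor(new FormatConfig("Foo", 1)));
    System.out.println(allocator.suffixFor(new FormatConfig("Foo", 2)));
    System.out.println(allocator.suffixFor(new FormatConfig("Foo", 1))); // equal config, same suffix
  }
}
```

Equal configurations share files while differently configured instances of the same class get separate ones. A wrapped delegate that is also used directly by another field would still receive its own suffix here, which is the gap option 2 is pointing at.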






[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286926#comment-13286926
 ] 

Simon Willnauer commented on LUCENE-4069:
-

bq. Create a PerFieldPF implementation that reuses wrapped PFs using some 
generic means of discovering recyclable wrapped PFs (i.e. go further than what 
4093 currently proposes in adding .equals support)

I think we should investigate this further. Let's keep this issue simple, 
remove the field handling, and fix this at a higher level!





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-30 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285518#comment-13285518
 ] 

Mark Harwood commented on LUCENE-4069:
--

Thanks for the comment, Rob.
While the choice of Codec can be an anonymous inner class, resolving the choice 
of PostingsFormat is trickier.
BloomFilterPostingsFormat is now intended to wrap any other choice of 
PostingsFormat and Simon has suggested leaving the Bloom support purely 
abstract.
However, as an end user if I want to use Bloom support on the standard Lucene 
codec I would then have to write one of these:
{code:title=MyBloomFilteredLucene40Postings.java}
public class MyBloomFilteredLucene40Postings extends BloomFilteringPostingsFormatBase {

  public MyBloomFilteredLucene40Postings() {
    // Declare my choice of PostingsFormat to be wrapped and provide a unique
    // name for this combo of Bloom-plus-delegate.
    super("myBL40", new Lucene40PostingsFormat());
  }
}
{code}
The resulting index files are then named [segname]_myBL40.[filetypesuffix].
At read-time the "myBL40" bit of the filename is used to look up the decoding 
class via Service Provider registrations, so 
com.xx.MyBloomFilteredLucene40Postings would need adding to a 
o.a.l.codecs.PostingsFormat file for the registration to work.

I imagine Bloom-plus-Lucene40Postings would be a common combo and if both are 
in core it would be annoying to have to code support for this in each app or 
for things like Luke to have to have classpaths redefined to access the 
app-specific class that was created purely to bind this combo of core 
components.

I think a better option might be to change the Bloom filtering base class to 
record the choice of delegate PostingsFormat in its own blm file at 
write-time and instantiate the appropriate delegate instance at read-time using 
the recorded name. The Bloom filtering base class would need changing to a final 
class rather than an abstract one so that core Lucene could load it as the 
handler for [segname]_BloomPosting.xxx files. It would then have to examine the 
[segname].blm file to discover and instantiate the choice of delegate 
PostingsFormat using the standard service registration mechanism. At write-time 
clients would need to instantiate the BloomFilterPostingsFormat, passing a 
choice of PostingsFormat delegate to the constructor. At read-time Lucene core 
would invoke a zero-arg constructor.
I'll look into this as an approach.
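
As a rough, self-contained sketch of that approach: the wrapper records its delegate's name at write-time and resolves it through a name registry at read-time. Everything here is a hypothetical stand-in rather than a Lucene API. NamedFormat and FormatRegistry play the role of PostingsFormat and its service registration, and a byte array plays the role of the blm file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical stand-in for a postings format that can be looked up by name. */
interface NamedFormat {
  String name();
}

final class Lucene40Stub implements NamedFormat {
  @Override
  public String name() {
    return "Lucene40";
  }
}

/** Stand-in for the service registry: maps a name to a zero-arg factory. */
final class FormatRegistry {
  private static final Map<String, Supplier<NamedFormat>> FACTORIES = new HashMap<>();

  static void register(String name, Supplier<NamedFormat> factory) {
    FACTORIES.put(name, factory);
  }

  static NamedFormat forName(String name) {
    Supplier<NamedFormat> factory = FACTORIES.get(name);
    if (factory == null) {
      throw new IllegalArgumentException("No format registered as " + name);
    }
    return factory.get();
  }
}

/** Wrapper that is configured at write-time and self-describing at read-time. */
final class BloomWrapperSketch {
  private NamedFormat delegate;

  BloomWrapperSketch(NamedFormat delegate) { // write-time: caller picks the delegate
    this.delegate = delegate;
  }

  BloomWrapperSketch() { // read-time: zero-arg, delegate comes from the file
  }

  byte[] writeHeader() {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeUTF(delegate.name()); // record only the delegate's registered name
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  void readHeader(byte[] header) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(header))) {
      delegate = FormatRegistry.forName(in.readUTF());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  NamedFormat delegate() {
    return delegate;
  }
}

final class BloomWrapperDemo {
  public static void main(String[] args) {
    FormatRegistry.register("Lucene40", Lucene40Stub::new);
    byte[] header = new BloomWrapperSketch(new Lucene40Stub()).writeHeader();
    BloomWrapperSketch reader = new BloomWrapperSketch();
    reader.readHeader(header);
    System.out.println("Resolved delegate: " + reader.delegate().name());
  }
}
```

With this shape only the wrapper and the delegate need to be registered; no application-specific subclass is required just to bind the two together.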






 

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterCodec40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip


 An addition to each segment which stores a Bloom filter for selected fields 
 in order to give fast-fail to term searches, helping avoid wasted disk access.
 Best suited for low-frequency fields e.g. primary keys on big indexes with 
 many segments but also speeds up general searching in my tests.
 Overview slideshow here: 
 http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
 Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
 Patch based on 3.6 codebase attached.
 There are no 3.6 API changes currently - to play just add a field with _blm 
 on the end of the name to invoke special indexing/querying capability. 
 Clearly a new Field or schema declaration(!) would need adding to APIs to 
 configure the service properly.
 Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-30 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285706#comment-13285706
 ] 

Mark Harwood commented on LUCENE-4069:
--

Aaaargh. Unless I've missed something, I have concerns with the fundamental 
design of the current Codec loading mechanism.

It seems too tied to the concept of a ServiceProvider class-loading mechanism, 
forcing users to write new SPI-registered classes in order to simply declare 
what amount to index schema configuration choices.

Example: If I take Rob's sample Codec above and choose to use a subtly 
different configuration of the same PostingsFormat class for different fields 
it breaks:

{code:title=ThisBreaks.java}
  Codec fooCodec = new Lucene40Codec() {
    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
      if ("text".equals(field)) {
        return new FooPostingsFormat(1);
      }
      if ("title".equals(field)) {
        // same impl as text field, different constructor settings
        return new FooPostingsFormat(2);
      }
      return super.getPostingsFormatForField(field);
    }
  };
{code}
This causes a file overwrite error as PerFieldPostingsFormat uses the same name 
from FooPostingsFormat(1) and FooPostingsFormat(2) to create files.
In order to safely make use of differently configured choices of the same 
PostingsFormat we are forced to declare a brand new subclass with a unique new 
service name and entry in the service provider registration. This is 
essentially where I have got to in trying to integrate this Bloom filtering 
logic.
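
The failure mode can be reproduced without Lucene in a few lines. This is a sketch under stated assumptions: WriteOnceDirectory and NameOnlyNaming are hypothetical stand-ins, and the [segment]_[formatName].[ext] naming pattern is inferred from the file names reported in this thread rather than taken from the PerFieldPostingsFormat source.

```java
import java.util.HashSet;
import java.util.Set;

/** Stand-in for a directory that refuses to overwrite an existing file. */
final class WriteOnceDirectory {
  private final Set<String> files = new HashSet<>();

  void createOutput(String fileName) {
    if (!files.add(fileName)) {
      throw new IllegalStateException("Cannot overwrite: " + fileName);
    }
  }
}

/** Names files from the segment and the format's name only, ignoring its configuration. */
final class NameOnlyNaming {
  static String fileName(String segment, String formatName, String extension) {
    return segment + "_" + formatName + "." + extension;
  }
}

final class OverwriteDemo {
  public static void main(String[] args) {
    WriteOnceDirectory dir = new WriteOnceDirectory();
    // Two differently configured instances of "Foo" still map to the same file name.
    dir.createOutput(NameOnlyNaming.fileName("_0", "Foo", "pst")); // field "text", setting 1
    try {
      dir.createOutput(NameOnlyNaming.fileName("_0", "Foo", "pst")); // field "title", setting 2
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Because the configuration never reaches the file name, the only way to get distinct files under this scheme is a distinct format name, which is what forces the extra subclass and service registration.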

This dependency on writing custom classes seems to make everything a bit 
fragile, no? What hope has Luke got in opening the average index without 
careful assembly of classpaths etc?
If I contrast this with the world of database schemas it seems absurd to have a 
reliance on writing custom classes with no behaviour simply in order to 
preserve a configuration of an application's schema settings. Even an IOC 
container with XML declarations would offer a more agile means of assembling 
pre-configured *beans* rather than relying on a Service Provider mechanism that 
is only serving as a registry of *classes*.

Anyone else see this as a major pain?














[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285726#comment-13285726
 ] 

Robert Muir commented on LUCENE-4069:
-

I'm not sure this is true: e.g. if your postings format requires parameters to 
decode the segment, then this enforces that it records said parameters;
e.g. Pulsing records these parameters.

Codec parameters apply at index-time; at read-time it's your responsibility to 
be able to decode them solely from the index (this ensures that there doesn't 
need to be a crazy matching of user configuration at write and read time).





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285746#comment-13285746
 ] 

Michael McCandless commented on LUCENE-4069:


When I run all tests with the bloom 4.0 postings format ({{ant test-core 
-Dtests.postingsformat=BloomFilteredLucene40PostingsFormat}}), I see a lot of 
failures, e.g. lots of CheckIndex failures like:

{noformat}
   [junit4]   1> java.lang.RuntimeException: vector term=[e4 b8 80 74 77 6f] field=textField2Utf8 does not exist in postings; doc=0
   [junit4]   1>     at org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:1440)
   [junit4]   1>     at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:575)
   [junit4]   1>     at org.apache.lucene.util._TestUtil.checkIndex(_TestUtil.java:186)
   [junit4]   1>     at org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:587)
   [junit4]   1>     at org.apache.lucene.index.TestFieldsReader.afterClass(TestFieldsReader.java:72)
   [junit4]   1>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   [junit4]   1>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
{noformat}

But also other spooky failures.  Are these known issues?




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-30 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285744#comment-13285744
 ] 

Mark Harwood commented on LUCENE-4069:
--

This fails if you add docs with title and text fields:
{code:title=ThisCrashes.java}
  Codec fooCodec = new Lucene40Codec() {
    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
      if ("text".equals(field)) {
        return new MemoryPostingsFormat(false);
      }
      if ("title".equals(field)) {
        return new MemoryPostingsFormat(true);
      }
      else {
        return super.getPostingsFormatForField(field);
      }
    }
  };
{code}

  Exception in thread "main" java.io.IOException: Cannot overwrite: C:\temp\luceneCodecs\_2_Memory.ram

This also fails:

{code:title=ThisToo.java}
  Codec fooCodec = new Lucene40Codec() {
    SimpleTextPostingsFormat theSimple = new SimpleTextPostingsFormat();
    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
      if ("text".equals(field)) {
        return new SimpleTextPostingsFormat();
      }
      if ("title".equals(field)) {
        return new SimpleTextPostingsFormat();
      }
      else {
        return super.getPostingsFormatForField(field);
      }
    }
  };
{code}
with 
  Exception in thread "main" java.io.IOException: Cannot overwrite: C:\temp\luceneCodecs\_1_SimpleText.pst

Whereas sharing the same instance of a PostingsFormat class across fields works:

{code:title=ThisWorks.java}
  Codec fooCodec = new Lucene40Codec() {
    SimpleTextPostingsFormat theSimple = new SimpleTextPostingsFormat();
    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
      if ("text".equals(field) || "title".equals(field)) {
        return theSimple;
      }
      else {
        return super.getPostingsFormatForField(field);
      }
    }
  };
{code}





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285745#comment-13285745
 ] 

Robert Muir commented on LUCENE-4069:
-

As for your problem with parameters, this is actually just a limitation of 
PerFieldPostingsFormat that I created (by accident) on LUCENE-4055. Because we 
use the postingsformat name as the segment suffix, it's not possible to have 
e.g. an id field with Pulsing(1) and a date field with Pulsing(2).

I'll open an issue and fix this; again, it's just an issue with 
PerFieldPostingsFormat. You should be able to use the same name here
with different configs, as long as your PostingsFormat writes what it needs to 
read the files. Thanks for bringing this up.




Re: [jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-30 Thread Andrzej Bialecki

On 30/05/2012 17:09, Robert Muir (JIRA) wrote:

> I'm not sure this is true: e.g. if your postings format requires parameters to 
> decode the segment, then this enforces that it records said parameters,
> e.g. Pulsing records these parameters.
>
> Codec parameters are at index-time, at read-time its your responsibility to be 
> able to decode them solely from the index (this enforces that there doesnt need
> to be a crazy matching of user configuration at write and read time).


I think what Mark is missing (and what I saw as a limiting factor when 
developing other codecs) is an easier way to customize Codecs by composing 
reusable blocks, without necessarily needing a separate Codec class 
implementation.

This could be worked around by having a configurable codec that stores 
its configuration and instantiates the necessary reusable blocks, which are 
available through the SPI mechanism. On writing you could specify this 
configuration as Codec attributes, which could be written out e.g. to 
SegmentInfos; on read they would become available from SegmentInfos.attributes.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285757#comment-13285757
 ] 

Robert Muir commented on LUCENE-4069:
-

Mark: I opened LUCENE-4090.. I broke it, so I'll fix it :)



 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterCodec40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip


 An addition to each segment which stores a Bloom filter for selected fields 
 in order to give fast-fail to term searches, helping avoid wasted disk access.
 Best suited for low-frequency fields e.g. primary keys on big indexes with 
 many segments but also speeds up general searching in my tests.
 Overview slideshow here: 
 http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
 Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
 Patch based on 3.6 codebase attached.
 There are no 3.6 API changes currently - to play just add a field with _blm 
 on the end of the name to invoke special indexing/querying capability. 
 Clearly a new Field or schema declaration(!) would need adding to APIs to 
 configure the service properly.
 Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat




Re: [jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-30 Thread Robert Muir
On Wed, May 30, 2012 at 11:43 AM, Andrzej Bialecki a...@getopt.org wrote:
 On 30/05/2012 17:09, Robert Muir (JIRA) wrote:

 I'm not sure this is true: e.g. if your postings format requires
 parameters to decode the segment, then this enforces that it records said
 parameters,
 e.g. Pulsing records these parameters.

 Codec parameters are set at index-time; at read-time it's your responsibility
 to be able to decode them solely from the index (this enforces that there
 doesn't need to be a crazy matching of user configuration at write and read
 time).


 I think what Mark is missing (and I saw as a limiting factor in developing
 other codecs) is to make it easier to customize Codec-s based on composition
 of reusable blocks, without necessarily needing a separate Codec class
 implementation.

 This could be worked around by having a configurable codec that stores its
 configuration and instantiates necessary reusable blocks, available using
 the SPI mechanism. On writing you could specify this configuration as Codec
 attributes, and they could be written out e.g. to SegmentInfos, and on read
 they would become available from SegmentInfos.attributes.


Well I think honestly here a bug in PerFieldPostingsFormat is
definitely confusing the situation (LUCENE-4090).

You should be able to set Pulsing(1) on the id field and Pulsing(2) on the
date field and have everything just work, but I broke that. I think that's
what's causing the most grief.

-- 
lucidimagination.com




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-30 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285765#comment-13285765
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. its just an issue with PerFieldPostingsFormat

OK, thanks. My guess is you'll effectively be having to supplement 
postingsformat.getName() with object-instanceID in file names.





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285768#comment-13285768
 ] 

Robert Muir commented on LUCENE-4069:
-

Right, it's not bad. Separately, I'm not happy that it's a little trappy in 
situations like your ThisToo.java.

It's certainly easy and safe to continue to use IdentityHashMap, but I think we 
will need to require equals/hashCode on PostingsFormats.

Otherwise it's going to make the situation trappy in the future, e.g. we will 
want to add mechanisms to Solr so that you can specify these parameters
in the schema.xml (today you cannot), and this would create massive amounts of 
files if we did so (add to that the fact that it disables compound files by default
and it's a disaster...)
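
To make the trap concrete, here is a minimal, self-contained sketch. It is not
Lucene code: FakePulsingFormat and FormatGrouping are hypothetical names, and
the "group" simply stands in for a set of per-format files. With an
IdentityHashMap every new instance gets its own group even when the parameters
are identical; with equals/hashCode, equal configurations share one group.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical stand-in for a parameterized postings format; not a Lucene class.
final class FakePulsingFormat {
  final int cutoff;

  FakePulsingFormat(int cutoff) {
    this.cutoff = cutoff;
  }

  // Two formats are equal when their configuration is equal.
  @Override
  public boolean equals(Object o) {
    return o instanceof FakePulsingFormat && ((FakePulsingFormat) o).cutoff == cutoff;
  }

  @Override
  public int hashCode() {
    return cutoff;
  }
}

final class FormatGrouping {
  // Assigns each distinct key of the map a group number and returns the
  // number of groups (think: number of separate file sets per segment).
  static int countGroups(Map<FakePulsingFormat, Integer> groups,
                         FakePulsingFormat... formats) {
    for (FakePulsingFormat f : formats) {
      if (!groups.containsKey(f)) {
        groups.put(f, groups.size());
      }
    }
    return groups.size();
  }

  public static void main(String[] args) {
    FakePulsingFormat a = new FakePulsingFormat(1);
    FakePulsingFormat b = new FakePulsingFormat(1); // same config, new instance
    System.out.println("identity groups: "
        + countGroups(new IdentityHashMap<FakePulsingFormat, Integer>(), a, b));
    System.out.println("equals groups:   "
        + countGroups(new HashMap<FakePulsingFormat, Integer>(), a, b));
  }
}
```

If a schema created a fresh format instance per field, the identity-based
grouping would grow with the number of fields, which is the file explosion
described above.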





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-30 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285773#comment-13285773
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. When I run all tests with the bloom 4.0 postings format (ant test-core 
-Dtests.postingsformat=BloomFilteredLucene40PostingsFormat), 

Thanks for the pointer on targeting codec testing. I have another patch to 
upload with various tweaks e.g. configurable choice of hash functions, 
RandomCodec additions so I will concentrate testing on that before uploading.




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-29 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285054#comment-13285054
 ] 

Simon Willnauer commented on LUCENE-4069:
-

hey mark, 

I think you should really not provide any field handling at all. This should be 
done in user code, i.e. a general codec/postings format per field. What would be 
very useful here is integration into RandomCodec so we can randomly swap it in. 
The same is true for the hash algos and memory settings: I think they should be 
part of the abstract class ctor, and if users want to use their own algo they 
should write their own codec, i.e. a subclass. This stuff is very expert and apps 
like Solr can handle this on top. 

I am still worried about the TermsEnum reuse code, are you planning to look 
into this?






[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-29 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285327#comment-13285327
 ] 

Robert Muir commented on LUCENE-4069:
-

{quote}
I was trying to limit the extensions apps would have to write to use this 
service (1 custom postings format subclass, 1 custom Codec subclass and 1 
custom SPI config file)
{quote}

As far as Codec goes, we don't need to provide one for specialized per-field 
PostingsFormats.

Users can just use an anonymous Lucene40Codec, which already has a hook for 
doing stuff like this (a protected method you can override to change up the 
postings format per field).

example that sets SimpleText on just the id field:
{code}
indexWriterConfig.setCodec(new Lucene40Codec() {
  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    // Use SimpleText for the "id" field only; defer to the default otherwise.
    if ("id".equals(field)) {
      return new SimpleTextPostingsFormat();
    } else {
      return super.getPostingsFormatForField(field);
    }
  }
});
{code}

This is only necessary to set on the IndexWriterConfig for writing: 
PerFieldPostingsFormat records that the id field was encoded with SimpleText.

When the segment is opened, assuming SimpleText is on the classpath and 
registered, it will just work.

From the Solr side, they don't need to set any codec either, as its schema 
already uses this hook, so there it's just:

{code}
<fieldType name="string_simpletext" class="solr.StrField" 
           postingsFormat="SimpleText"/>
{code}






[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-28 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13284520#comment-13284520
 ] 

Simon Willnauer commented on LUCENE-4069:
-

hey mark, this looks very interesting. I wonder if that also helps indexing in 
terms of applying deletes. Did you test that at all, or is that something you are 
looking into as well? For deletes this could be very handy since bloom filters 
should help a lot when a given key is likely to be in only one segment.

Regarding the code, I think BloomFilter should be like Pulsing and only be a 
PostingsFormat. You should make the BloomFilterPostingsFormat abstract, pass 
in the wrapped postings format, and delegate independent of the field. 
PostingsFormat is per field and is resolved by the codec. An actual 
implementation should subclass the abstract BloomFilterPostingsFormat; that way 
you have a fixed PostingsFormat per delegate and don't need to deal with 
which field got which delegate bla bla. 

We should not have a Bloomfilter codec but rather swap in the bloomfilter 
postings format in tests via RandomCodec. I also wonder if we can extract a 
bloomfilter class into utils that builds a BloomFilter; this could be useful 
elsewhere. 
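
As a rough idea of what such a standalone utility could look like, here is a
minimal sketch. It is not the class from the patch: SimpleBloomFilter is a
hypothetical name, it sets a single bit per term, and it uses FNV-1a where the
patch references MurmurHash, purely to keep the example short and free of
dependencies.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Illustrative only: a tiny one-hash Bloom-style set, not the patch's class.
final class SimpleBloomFilter {
  private final BitSet bits;
  private final int numBits;

  SimpleBloomFilter(int numBits) {
    this.numBits = numBits;
    this.bits = new BitSet(numBits);
  }

  // FNV-1a over the UTF-8 bytes of the term (an assumption for this sketch).
  private static int hash(String term) {
    int h = 0x811c9dc5;
    for (byte b : term.getBytes(StandardCharsets.UTF_8)) {
      h ^= (b & 0xff);
      h *= 16777619;
    }
    return h;
  }

  private int slot(String term) {
    return (hash(term) & 0x7fffffff) % numBits;
  }

  void add(String term) {
    bits.set(slot(term));
  }

  // false means "definitely absent"; true only means "maybe present".
  boolean mightContain(String term) {
    return bits.get(slot(term));
  }

  // Fraction of bits set; higher saturation means more false positives.
  double saturation() {
    return (double) bits.cardinality() / numBits;
  }
}
```

A caller would check mightContain() first and only fall through to the wrapped
postings format when it returns true.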



 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6
Reporter: Mark Harwood
Priority: Minor
 Fix For: 3.6.1

 Attachments: BloomFilterCodec40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip


 An addition to each segment which stores a Bloom filter for selected fields 
 in order to give fast-fail to term searches, helping avoid wasted disk access.
 Best suited for low-frequency fields e.g. primary keys on big indexes with 
 many segments but also speeds up general searching in my tests.
 Overview slideshow here: 
 http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
 Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
 Patch based on 3.6 codebase attached.
 There are no API changes currently - to play just add a field with _blm on 
 the end of the name to invoke special indexing/querying capability. Clearly a 
 new Field or schema declaration(!) would need adding to APIs to configure the 
 service properly.




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-28 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13284536#comment-13284536
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq.  I wonder if that also helps indexing in terms of applying deletes. did you 
test that

I've not looked into that particularly but I imagine this may be relevant. 

Thanks for the tips for making the patch more generic. I'll get on it tomorrow 
and change to FixedBitSet while I'm at it.

bq. I also wonder if we can extract a bloomfilter class into utils

There's some reusable stuff in this patch for downsizing the Bitset (according 
to desired saturation levels) having accumulated a stream of values.
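
One way such downsizing can work is sketched below, as an illustration rather
than the patch's actual code. The assumption is that a term's bit position is
a non-negative hash modulo the bitset size and that the size is a power of two;
then folding the upper half onto the lower half keeps every lookup valid,
because (h mod n) mod (n/2) equals h mod (n/2). BitSetDownsizer and its method
names are hypothetical.

```java
import java.util.BitSet;

// Illustrative sketch of shrinking a Bloom bitset after all values are seen.
final class BitSetDownsizer {
  // ORs the upper half of the set onto the lower half, halving its size.
  static BitSet foldInHalf(BitSet bits, int size) {
    int half = size / 2;
    BitSet folded = new BitSet(half);
    for (int i = bits.nextSetBit(0); i >= 0 && i < size; i = bits.nextSetBit(i + 1)) {
      folded.set(i % half);
    }
    return folded;
  }

  // Keeps halving while the folded set's saturation stays at or below
  // maxSaturation, and returns the size reached.
  static int chooseSize(BitSet bits, int size, double maxSaturation) {
    BitSet current = bits;
    while (size >= 2 && size % 2 == 0) {
      BitSet folded = foldInHalf(current, size);
      double saturation = (double) folded.cardinality() / (size / 2);
      if (saturation > maxSaturation) {
        break; // one more fold would make the filter too full
      }
      current = folded;
      size /= 2;
    }
    return size;
  }
}
```

Saturation can only stay the same or grow with each fold, so stopping at the
first fold that exceeds the target gives the smallest size reachable by halving.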




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-28 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13284545#comment-13284545
 ] 

Simon Willnauer commented on LUCENE-4069:
-

bq. Thanks for the tips for making the patch more generic. I'll get on it 
tomorrow and changed to FixedBitSet while I'm at it.
YW! A couple of things regarding your patch. I think you should write a header 
for the bloomfilter file using CodecUtil.writeHeader() to be consistent. Once 
you are ready you should also look at how the other *PostingsFormat impls 
document their file formats. We try to have versioned file formats instead 
of the monolithic one on the website. I saw some minor things like 

{code}
finally {
  if (output != null) {
    output.close();
  }
}
{code}

You can use IOUtils.close*() for that, which handles some exception cases etc.

I briefly looked at your reuse code for TermsEnum and it seems it has the same 
issues that Pulsing had (now solved). When you get a TermsEnum that is in fact a 
Bloomfilter wrapper but wraps another codec underneath, you keep on switching 
back and forth, creating new delegated TermsEnum instances. You can look at 
PulsingPostingsReader and in particular at PulsingEnumAttributeImpl to see how we 
solved that in Pulsing. 




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-28 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13284562#comment-13284562
 ] 

Michael McCandless commented on LUCENE-4069:


It looks like MurmurHash.java is missing from the patch?

Also, with LUCENE-4055 landing it looks like the patch no longer compiles 
(sorry!)...




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-25 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13283574#comment-13283574
 ] 

Michael McCandless commented on LUCENE-4069:


bq. I see a 35% improvement over standard Codecs on random lookups on a warmed 
index.

Impressive!  This is for primary key lookups?

It looks like the primary keys are GUID-like right?  (Ie randomly generated).  
I wonder if they had some structure instead (eg '%09d' % (id++)) how the 
results would look...
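
For anyone wanting to try that comparison, the two key styles could be
generated along these lines. This is a small sketch with hypothetical names
(PrimaryKeys is not part of the attached test source); the sequential variant
is the Java analogue of '%09d' % (id++).

```java
import java.util.Locale;
import java.util.UUID;

// Illustrative key generators for a primary-key lookup benchmark.
final class PrimaryKeys {
  private long next = 0;

  // Structured key: zero-padded counter, so consecutive keys share long prefixes.
  String nextSequential() {
    return String.format(Locale.ROOT, "%09d", next++);
  }

  // GUID-like key with no structure shared between consecutive keys.
  static String randomGuid() {
    return UUID.randomUUID().toString();
  }
}
```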

bq. I also notice that the PulsingCodec is no longer faster than standard Codec 
- is this news to people as I thought it was supposed to be the way forward?

That's baffling to me: it should only save seeks vs Lucene40 codec, so on a 
cold index you should see substantial gains, and on a warm index I'd still 
expect some gains.  Not sure what's up...

bq. I can open a separate JIRA issue for this 4.0 version of the code if that 
makes more sense.

I think it's fine to do it here?  Really 3.6.x is only for bug fixes now ... so 
I think we should commit this to trunk.

I wonder if you can wrap any other PostingsFormat (ie instead of hard-coding to 
Lucene40PostingsFormat)?  This way users can wrap any PF they have w/ the bloom 
filter...

Can you use FixedBitSet instead of OpenBitSet?  Or is there a reason to use 
OpenBitSet here...?




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-25 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13283583#comment-13283583
 ] 

Mark Harwood commented on LUCENE-4069:
--

My current focus is speeding up primary key lookups but this may have 
applications outside of that (Zipf tells us there is a lot of low frequency 
stuff in free text).

Following the principle of the best IO is no IO the Bloom Filter helps us 
quickly understand which segments to even bother looking in. That has to be a 
win overall.
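
A toy model of that idea, as a sketch only: ToySegment and ToyIndex are
hypothetical classes, the HashSet stands in for the on-disk terms dictionary,
and String.hashCode() replaces the real hash function. The counter shows how
many segments actually pay for the expensive lookup.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of a segment: its keys plus a one-hash Bloom-style bitset.
final class ToySegment {
  private static final int NUM_BITS = 1 << 16;
  private final Set<String> terms = new HashSet<String>(); // stands in for disk
  private final BitSet filter = new BitSet(NUM_BITS);
  int expensiveLookups = 0;

  private static int slot(String key) {
    return (key.hashCode() & 0x7fffffff) % NUM_BITS;
  }

  void add(String key) {
    terms.add(key);
    filter.set(slot(key));
  }

  boolean contains(String key) {
    if (!filter.get(slot(key))) {
      return false; // fast-fail: the filter says "definitely not here"
    }
    expensiveLookups++; // only now touch the (simulated) terms dictionary
    return terms.contains(key);
  }
}

final class ToyIndex {
  final List<ToySegment> segments = new ArrayList<ToySegment>();

  // Returns the position of the first segment holding the key, or -1.
  int findSegment(String key) {
    for (int i = 0; i < segments.size(); i++) {
      if (segments.get(i).contains(key)) {
        return i;
      }
    }
    return -1;
  }
}
```

With many segments and a key that lives in only one of them, most segments are
rejected by the filter alone.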

I started trying to write this Codec as a wrapper for any other Codec (it 
simply listens to a stream of terms and stores a bitset of recorded hashes in a 
.blm file). However that was trickier than I expected - I'd need to encode a 
special entry in my blm files just to know the name of the delegated codec I 
needed to instantiate at read-time because Lucene's normal Codec-instantiation 
logic would be looking for BloomCodec and I'd have to discover the delegate 
that was used to write all of the non-blm files.
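
The "special entry" could be as simple as a name written ahead of the filter
bits. The sketch below uses plain java.io streams and a made-up layout (UTF
name, word count, then the words); BloomFileHeader is a hypothetical class and
the real .blm format in the patch may well differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative layout only: [delegate name][word count][filter words...].
final class BloomFileHeader {
  static byte[] write(String delegateName, long[] filterWords) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeUTF(delegateName);        // which format wrote the other files
      out.writeInt(filterWords.length);
      for (long word : filterWords) {
        out.writeLong(word);
      }
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new IllegalStateException(e); // in-memory streams should not fail
    }
  }

  // At read time, the name tells the reader which delegate to instantiate.
  static String readDelegateName(byte[] file) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
      return in.readUTF();
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
  }
}
```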

Not looked at FixedBitSet but I imagine that could be used instead.

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6
Reporter: Mark Harwood
Priority: Minor
 Fix For: 3.6.1

 Attachments: BloomFilterCodec40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip


 An addition to each segment which stores a Bloom filter for selected fields 
 in order to give fast-fail to term searches, helping avoid wasted disk access.
 Best suited for low-frequency fields e.g. primary keys on big indexes with 
 many segments but also speeds up general searching in my tests.
 Overview slideshow here: 
 http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
 Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
 Patch based on 3.6 codebase attached.
 There are no API changes currently - to play just add a field with _blm on 
 the end of the name to invoke special indexing/querying capability. Clearly a 
 new Field or schema declaration(!) would need adding to APIs to configure the 
 service properly.




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-25 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13283615#comment-13283615
 ] 

Mark Harwood commented on LUCENE-4069:
--

Update: I've discovered this Bloom Filter Codec currently has a bug where it 
doesn't handle indexes with 1 field.
It's probably all tangled up in the PerField... codec logic, so I need to do 
some more digging.

