[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13428839#comment-13428839 ] Michael McCandless commented on LUCENE-4069: Hmm in testing Bloom + Lucene40 PF on luceneutil's id field I don't see any gains with what was committed (base=Lucene40 for id field, comp=Bloom + Lucene40 for id field; all other fields are just Lucene40): {noformat} TaskQPS base StdDev base QPS bloomStdDev bloom Pct diff PKLookup 196.719.29 192.945.40 -8% - 5% HighSloppyPhrase0.750.040.750.04 -10% - 9% HighPhrase1.780.021.760.01 -2% - 0% MedPhrase5.000.094.950.15 -5% - 3% AndHighMed 74.663.52 74.024.50 -11% - 10% AndHighHigh 16.940.82 16.830.78 -9% - 9% MedSloppyPhrase4.100.114.080.12 -5% - 4% LowPhrase 13.350.19 13.310.24 -3% - 2% LowSloppyPhrase 31.910.52 31.900.58 -3% - 3% AndHighLow 1114.88 43.26 1121.52 33.72 -6% - 7% LowSpanNear9.670.229.760.26 -4% - 5% LowTerm 421.55 14.96 426.555.23 -3% - 6% MedTerm 176.457.66 179.402.84 -4% - 7% HighSpanNear1.430.031.450.05 -3% - 7% MedSpanNear2.510.052.560.09 -3% - 7% HighTerm 15.520.95 15.900.24 -4% - 10% Respell 69.773.20 71.653.61 -6% - 13% Fuzzy1 88.192.49 91.043.06 -2% - 9% Fuzzy2 36.181.26 37.501.44 -3% - 11% Wildcard 45.331.52 47.041.10 -1% - 9% Prefix3 65.652.53 68.261.99 -2% - 11% OrHighLow 37.450.88 39.872.48 -2% - 15% OrHighHigh9.580.10 10.230.640% - 14% OrHighMed 14.890.14 15.961.010% - 15% IntNRQ 10.241.04 11.071.33 -13% - 34% {noformat} I just used the DefaultBloomFilterFactory, which should be good here since it targets 10% fill based on doc count. Maybe even slight loss (but could easily be noise). I confirmed that I'm getting .blm files and that at search time the filter does in fact work (quickly returns false from seekExact). The differences vs. the dedicated benchmark on this issue is 1) luceneutil is only testing lookups, 2) the id field is just one field (ie we also have eg body/title/date for searching), 3) luceneutil is testing a bunch of other queries in other threads in addition to PKLookup. Out of curiosity I also tried Memory vs Bloom + Memory, and Bloom was ~15% slower, I guess because computing the hash doing the lookup is non-trivial cost vs just walking the FST to find out the term doesn't exist. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Core Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Fix For: 4.0, 5.0 Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13427317#comment-13427317 ] Robert Muir commented on LUCENE-4069: - Hi Mark: I noticed this was committed only to the 4.x branch. can you also merge the change to trunk? Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Core Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Fix For: 4.0 Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13427318#comment-13427318 ] Adrien Grand commented on LUCENE-4069: -- Mark, is there a reason why this patch hasn't been committed to trunk too? Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Core Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Fix For: 4.0 Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13427322#comment-13427322 ] Mark Harwood commented on LUCENE-4069: -- Will do. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Core Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Fix For: 4.0 Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418314#comment-13418314 ] Mark Harwood commented on LUCENE-4069: -- One more remaining issue before I commit which has appeared sporadically and looks to be consistently raised by this test: ant test -Dtestcase=TestIndexWriterCommit -Dtests.method=testCommitThreadSafety -Dtests.seed=EA320250471B75AE -Dtests.slow=true -Dtests.postingsformat=TestBloomFilteredLucene40Postings -Dtests.locale=no -Dtests.timezone=Europe/Belfast -Dtests.file.encoding=ISO-8859-1 The error it produces is this: [junit4:junit4] Caused by: java.lang.IllegalStateException: Missing file:_9_TestBloomFilteredLucene40Postings_0.blm [junit4:junit4]at org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat$BloomFilteredFieldsProducer.init(BloomFilteringPostingsFormat.java:175) [junit4:junit4]at org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat.fieldsProducer(BloomFilteringPostingsFormat.java:156) MockDirectoryWrapper looks to be randomly deleting files (probably my blm file shown above) to simulate the effects of crashes. Presumably I am doing the right thing in always throwing an exception if the .blm file is missing? The alternative would be to silently ignore the missing file which seems undesirable. IF MDW is intended to only delete uncommitted files I'm not sure how we end up in a scenario where BloomPF is being asked to open the uncommitted segment? Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Priority: Minor Fix For: 4.0 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418393#comment-13418393 ] Michael McCandless commented on LUCENE-4069: bq. Presumably I am doing the right thing in always throwing an exception if the .blm file is missing? Definitely! This should be considered a corrupt index. bq. IF MDW is intended to only delete uncommitted files I'm not sure how we end up in a scenario where BloomPF is being asked to open the uncommitted segment? Very strange... IW will sync all files written by the codec, and only once all files for a given segment are sync'd will it write a segments_N referencing that segment. Not sure offhand why this file isn't being sync'd... I wonder if it has to do w/ only opening the file in the close() method (most other PFs open files well before then). It should be fine to do that ... but it's something different. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Priority: Minor Fix For: 4.0 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418411#comment-13418411 ] Mark Harwood commented on LUCENE-4069: -- bq. I wonder if it has to do w/ only opening the file in the close() method ( Just tried opening the file earlier (in BloomFilteredConsumer constructor) and that didn't fix it. I previously also added an extra Directory.fileExists() sanity check immediately after closing the IndexOutput and all was well so I think it's something happening after that. Will need to dig deeper. I'm running on WinXP 64bit if that is of any significance to MDW's behaviour. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Priority: Minor Fix For: 4.0 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13416007#comment-13416007 ] Mark Harwood commented on LUCENE-4069: -- bq. At a minimum I think before committing we should make the SegmentWriteState accessible. OK. Will that be the subject of a new Jira? bq. Hmm why is anonymity at search time important? It would seem to be an established design principle - see https://issues.apache.org/jira/browse/LUCENE-4069#comment-13285726 It would be a pain if user config settings require a custom SPI-registered class around just to decode the index contents. There's the resource/classpath hell, the chance for misconfiguration and running Luke suddenly gets more complex. The line to be drawn is between what are just config settings (field names, memory limits) and what are fundamentally different file formats (e.g. codec choices). The design principle that looks to be adopted is that the former ought to be accommodated without the need for custom SPI-registered classes and the latter would need to locate an implementation via SPI to decode stored content. Seems reasonable. The choice of hash algo does not fundamentally alter the on-disk format (they all produce an int) so I would suggest we treat this as a config setting rather than a fundamentally different choice of file format. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Priority: Minor Fix For: 4.0 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13416014#comment-13416014 ] Uwe Schindler commented on LUCENE-4069: --- {quote} It would be a pain if user config settings require a custom SPI-registered class around just to decode the index contents. There's the resource/classpath hell, the chance for misconfiguration and running Luke suddenly gets more complex. The line to be drawn is between what are just config settings (field names, memory limits) and what are fundamentally different file formats (e.g. codec choices). The design principle that looks to be adopted is that the former ought to be accommodated without the need for custom SPI-registered classes and the latter would need to locate an implementation via SPI to decode stored content. Seems reasonable. The choice of hash algo does not fundamentally alter the on-disk format (they all produce an int) so I would suggest we treat this as a config setting rather than a fundamentally different choice of file format. {quote} The design principle here is very easy: We must follow the SPI pattern, if you write an index that could otherwise not be read with default settings and produces e.g. CorruptIndexException. If you have a codec that writes some special things for specific fields, it is required to write this information about the fields to the index. If you want to open this index using IndexReader again, there must not be any requirement for configuration settings on the reader itsself - a simple DirectoryReader.open() must be possible and a query must be able to execute. The IndexReader must be able to get all this information from the index files. If a special decoder for foobar is needed, it must be loadable by SPI. This is similar to postings. A new postings format needs a new SPI, otherwise you cannot read the index. And it is not true that Luke is more complex to configure. Just put the JAR file into classpath that contains the SPI and you are fine. Setting up a build environment is more complicated but thats more the problem of shitty Eclipse resource handling. ANT/MAVEN/IDEA is easy. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Priority: Minor Fix For: 4.0 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13416037#comment-13416037 ] Mark Harwood commented on LUCENE-4069: -- bq. If a special decoder for foobar is needed, it must be loadable by SPI. I think we are in agreement on the broad principles. The fundamental question here though is do you want to treat an index's choice of Hash algo as something that would require a new SPI-registered PostingsFormat to decode or can that be handled as I have done here with a general purpose SPI framework for hashing algos? Actually, re-thinking this, I suspect rather than creating our own, I can use Java's existing SPI framework for hashing in the form of MessageDigest. I'll take a closer look into that... Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Priority: Minor Fix For: 4.0 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13416039#comment-13416039 ] Uwe Schindler commented on LUCENE-4069: --- bq. Actually, re-thinking this, I suspect rather than creating our own, I can use Java's existing SPI framework for hashing in the form of MessageDigest. I'll take a closer look into that... That's a good idea. MessageDigest.getInstance(name) should be the way to go. And this name string is encoded to the index. If the MessageDigest API is enough then one can e.g. use bouncycastle by just plugging into the calsspath. If on opening index, this is not available, he will get Exception just like when codec/postingsformat is missing. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Priority: Minor Fix For: 4.0 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13416084#comment-13416084 ] Mark Harwood commented on LUCENE-4069: -- bq. MessageDigest.getInstance(name) should be the way to go I'm less keen now - a quick scan of the docs around MessageDigest throws up some issues: 1) SPI registration of MessageDigest providers looks to get into permissions hell as it is closely related to security - see http://docs.oracle.com/javase/1.4.2/docs/guide/security/CryptoSpec.html#ProviderInstalling which talks about the steps required to approve a trusted provider. 2) MessageDigest as an interface is designed to stream content in potentially many method calls past the hashing algo. MurmurHash2.java is not currently written to process content this way and suits our needs in hashing small blocks of content in one hit. For these 2 reasons it looks like MessageDigest may be a pain to adopt and the existing approach proposed in this patch may be preferable. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Priority: Minor Fix For: 4.0 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13416290#comment-13416290 ] Michael McCandless commented on LUCENE-4069: {quote} bq. At a minimum I think before committing we should make the SegmentWriteState accessible. OK. Will that be the subject of a new Jira? {quote} No, I mean we shouldn't commit this patch until SegmentWriteState is accessible when creating the FuzzySet. I think we can just pass it to BloomFilterFactory.getSetForField? This way if the app knows it's a PK field then it can use maxDoc to always size an appropriate bit set up front. bq. I think we are in agreement on the broad principles. The fundamental question here though is do you want to treat an index's choice of Hash algo as something that would require a new SPI-registered PostingsFormat to decode or can that be handled as I have done here with a general purpose SPI framework for hashing algos? +1, that's exactly the question. Ie, where to draw the line between config of an existing PF and different PF. But I guess swapping in different hash impl should be seen as simple config change, so I think using SPI to find it at read time is OK. I still don't like how trappy this approach is: the default hardwired (8 MB) can be way too big (silently slows down your NRT reopens, especially if you bloom all fields) or way too small (silently turns off bloom filter for fields that have too many unique terms). I also don't think this PF should be per-field: we have PerFieldPostingsFormat for that, and if there are limitations in PFPF, we should address them rather than having to make all future PFs handle per-field-ness themselves. This PF should really handle one field. But I don't think these issues need to hold up commit (except for making SegmentWriteState accessible)... we can improve over time. I think we may simply want to fold this into the terms dict somehow. Can you add @lucene.experimental to all the new APIs? Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Priority: Minor Fix For: 4.0 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13416383#comment-13416383 ] Mark Harwood commented on LUCENE-4069: -- A quick benchmark looks like the new right-sized bitset as opposed to the old worst-case-scenario-sized bitset is buying us a small performance improvement. bq. I also don't think this PF should be per-field There was a lengthy discussion earlier on this topic. The approach presented here seems reasonable. For the average user you have the DefaultBloomFilterFactory default which now has reasonable sizing for all fields passed its way (assuming a heuristic based on numDocs=numKeys to anticipate). For expert users you can provide a BloomFilterFactory with a custom choice of sizing heuristic per-field and can also simply return null for non-bloomed fields. Having a single, carefully configured BloomPF wrapper is preferable because you can channel appropriately configured bloom settings to a common PF delegate and avoid creating multiple .tii, .tis files etc because the PerFieldPF isn't smart enough to figure out that these Bloom-ing choices do not require different physical files for all the delegated tii etc structures. You don't *have* to use the Per-field stuff in BloomPF but there are benefits to be had in doing so which can't otherwise be achieved. bq. Can you add @lucene.experimental to all the new APIs? Done. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Priority: Minor Fix For: 4.0 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13416417#comment-13416417 ] Michael McCandless commented on LUCENE-4069: Thanks Mark. +1 to commit. I still think we could/should somehow fold this into the terms dict but we can do that another day... this is a good step forward. bq. because the PerFieldPF isn't smart enough to figure out that these Bloom-ing choices do not require different physical files for all the delegated tii etc structures. We need to fix that, so that every PF wrapper that comes along doesn't feel like it must handle per-field-ness itself. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Priority: Minor Fix For: 4.0 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13415219#comment-13415219 ] Michael McCandless commented on LUCENE-4069: bq. Estimating key volumes in context 1 is probably hard without some additional hints from the end user. Actually, the indexer knows exactly how many unique terms it's about to flush. It's the unique term count (for this one segment) that you need right? We just have to plumb it somehow (add to SegmentWriteState?). bq. 2) Merged segments e.g. guessing how many unique keys survive a merge operation Seems like LUCENE-4198 needs to solve this same problem. bq. Not sure how we get the OneMerge instance fed through the call stack - could that be held somewhere on a ThreadLocal as generally useful context? Yeah I'm not sure either ... but I agree an approx heuristic based on merging segments unique term counts is better than nothing. I don't like how you silently just get bad perf now if your up-front guess was too small ... I also don't like that the default guessing is static (8 MB): flushing tiny NRT segs are gonna pay an absurd price. I think we need a solution for this... Maybe another option is to simply make a 2nd pass after the segment (flush or merge) is written, to build up the bit set; this way we know exactly how many unique terms we have. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Priority: Minor Fix For: 4.0 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13415221#comment-13415221 ] Michael McCandless commented on LUCENE-4069: I don't think we should use a separate factory class to provide the bloom filters; I think we should simplify. EG add overridable methods to BloomFilteringPF; this way the default impl (8 MB, isSaturated) would be in BloomFilteringPF. Also, why do we need to use SPI to find the HashFunction? Seems like overkill... we don't (yet) have a bunch of hash functions that are vying here right? We can make this pluggable at a later time... can't the postings format impl pass in an instance of HashFunction when making the FuzzySet? Default would just be MurmurHash. Can you move the imports under the copyright header? (eg in HashFunction.java but maybe others). Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Priority: Minor Fix For: 4.0 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13415362#comment-13415362 ] Mark Harwood commented on LUCENE-4069: -- bq. It's the unique term count (for this one segment) that you need right? Yes, I need it before I start processing the stream of terms being flushed. bq. Seems like LUCENE-4198 needs to solve this same problem. Another possibly related point on more access to merge context - custom codecs have a great opportunity at merge time to piggy-back some analysis on the data being streamed e.g. to spot trending terms whose term frequencies differ drastically between the merging source segments. This would require access to source segment as term postings are streamed to observe the change in counts. bq. Also, why do we need to use SPI to find the HashFunction? Seems like overkill... we don't (yet) have a bunch of hash functions that are vying here right? There's already a MurmurHash3 algo - we're currently using v2 and so could anticipate an upgrade at some stage. This patch provides that future proofing. bq. can't the postings format impl pass in an instance of HashFunction when making the FuzzySet I don't think that is going to work. Currently all PostingFormat impls that extend BloomFilterPostingsFormat can be anonymous (i.e. unregistered via SPI). All their settings (fields, hash algo, thresholds) etc are recorded at write time by the base class in the segment. At read-time it is the BloomFilterPostingsFormat base class that is instantiated, not the write-time subclass and so we need to store the hash algo choice. We can't rely on the original subclass being around and configured appropriately with the original write-time choice of hashing function. I think the current way feels safer over all and also allows other Lucene functions to safely record hashes along with a hashname string that can be used to reconstitute results. bq. Can you move the imports under the copyright header? Will do Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Priority: Minor Fix For: 4.0 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13415604#comment-13415604 ] Michael McCandless commented on LUCENE-4069: {quote} bq. It's the unique term count (for this one segment) that you need right? Yes, I need it before I start processing the stream of terms being flushed. {quote} At a minimum I think before committing we should make the SegmentWriteState accessible. EG then at least I can use numDocs for my primary key field. I really don't like how easy it is to silently mis-configure this PF: the default 8 MB is way to high for an NRT setting and way too low for a large index. bq. Currently all PostingFormat impls that extend BloomFilterPostingsFormat can be anonymous (i.e. unregistered via SPI). Hmm why is anonymity at search time important? I think that's a non-feature, and we shouldn't make our core code more complex for it? Ie, it's fine to require the app to have to make a named PF (that is accessible via SPI), implementing all their custom bloom logic (which fields are bloom'd, what hash to use, etc.). When an app makes a custom Codec/PostingsFormat, it's expected that that class is accessible via SPI at both index time and search time. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Priority: Minor Fix For: 4.0 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13410145#comment-13410145 ] Mark Harwood commented on LUCENE-4069: -- bq. So now we are close to 1M lookups/sec for a single thread! Cool! bq. I wonder if somehow we can do a better job picking the right sized bit vector up front? bq. You basically need to know up front how many unique terms will be in the given field for this segment right? Yes - the job of anticipating the number of unique keys probably has 2 different contexts: 1) Net new segments e.g. guessing up front how many docs/keys a user is likely to generate in a new segment before the flush settings kick in. 2) Merged segments e.g. guessing how many unique keys survive a merge operation Estimating key volumes in context 1 is probably hard without some additional hints from the end user. Arguably the BloomFilterFactory.getSetForField() method already represents where this setting can be controlled. In context 2 where potentially large merges occur we could look at adding an extra method to BloomFilterFactory to handle this different context e.g. something like FuzzySet getSetForMergeOpOnField(FieldInfo fi, OneMerge mergeContext) Based on the size of the segments being merged and volumes of deletes a more appropriate size of Bloom bitset could be allocated based on a worst-case estimate. Not sure how we get the OneMerge instance fed through the call stack - could that be held somewhere on a ThreadLocal as generally useful context? Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409676#comment-13409676 ] Michael McCandless commented on LUCENE-4069: bq. That's tightened performance but that lookss a scary amount of code for the optimal solution of this basic incrementing operation Well the code now looks horribly complex since we have a ton of booleans to do the same thing in different ways :) From my limited poking-around testing the fastest way seems to be lookupByReader=false, shareEnums=true, doSortKeys=true, useDocValues=true, useBase36=true, doDirectDelete=true. Once you specialize the code for those booleans then it looks much more sane. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409679#comment-13409679 ] Michael McCandless commented on LUCENE-4069: Also: in digging into this I found and fixed a silly performance bug in the nightly PKLookup perf test: http://people.apache.org/~mikemccand/lucenebench/PKLookup.html So now we are close to 1M lookups/sec for a single thread! Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409687#comment-13409687 ] Michael McCandless commented on LUCENE-4069: bq. I added a 100mb start size for the BloomFilter for large-scale tests because without this it gets saturated and there were occasional big spikes in batch times. I wonder if somehow we can do a better job picking the right sized bit vector up front? You basically need to know up front how many unique terms will be in the given field for this segment right? This seems similar to LUCENE-4198, where we want to improve Codec API so it has access to collection stats (like number of unique terms in this field) up front. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409693#comment-13409693 ] Michael McCandless commented on LUCENE-4069: I'll open a separate issue for IndexWriter.tryDeleteDocument; I think it's worth exploring... Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409746#comment-13409746 ] Michael McCandless commented on LUCENE-4069: I ran with with the fastest configuration (lookupByReader=false, shareEnums=true, doSortKeys=true, useDocValues=true, doDirectDelete=true, useBase36=true). I used MMapDir, 10M numDocs and keySpaceSize, Java 1.7.0_04, 2 GB starting/max heap. With doWorstCaseFlushing=false: {noformat} Lucene40 55978 ms Bloom 49880 ms Memory 55965 ms Pulsing 56059 ms {noformat} With doWorstCaseFlushing=true: {noformat} Lucene40 61097 ms Bloom 53698 ms Memory 61066 ms Pulsing 61355 ms {noformat} So Bloom is ~11-12% faster. Very curious that Lucene40, Pulsing and Memory are basically the same! Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409762#comment-13409762 ] Michael McCandless commented on LUCENE-4069: I opened LUCENE-4203 for IndexWriter.tryDeleteDocument. Urgh, patch is on the wrong issue ... Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13408097#comment-13408097 ] Mark Harwood commented on LUCENE-4069: -- Thanks for the extra tests, Mike. That's tightened performance but that lookss a scary amount of code for the optimal solution of this basic incrementing operation :) I've done some more benchmarks with the updated test and the performance characteristics are becoming clearer as shown in these results: http://goo.gl/dtWSb Bloom performance is better than Pulsing but the gap narrows with the volumes of deletes lying around in old segments, caused by updates. In these cases the BloomFilter gives a false positive and falls back to the equivalent operations of Pulsing. I added a 100mb start size for the BloomFilter for large-scale tests because without this it gets saturated and there were occasional big spikes in batch times. So overall there still looks to be a benefit and especially in low-frequency update scenarios. I'll wait for the dust to settle on Lucene-4190 (given this Codec introduces a new file) before thinking about committing. Cheers Mark Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397425#comment-13397425 ] Michael McCandless commented on LUCENE-4069: Hmm somehow I messed up when testing the bloom postings format: all of my on-disk bloom bits sets are 8 MB, regardless of the size of the segment... so something is wrong (I would expect segments w/ more terms to have a larger bit set...). This is what I did for the codec: {noformat} final Codec codec = new Lucene40Codec() { @Override public PostingsFormat getPostingsFormatForField(String field) { final String pfName = field.equals(id) ? idFieldPostingsFormat : defaultPostingsFormat; if (pfName.equals(Bloom)) { return new BloomFilteringPostingsFormat(PostingsFormat.forName(Lucene40), new BloomFilterFactory() { @Override public FuzzySet getSetForField(FieldInfo info) { return FuzzySet.createSetBasedOnMaxMemory(100*1024*1024, new MurmurHash2()); } }); } else { return PostingsFormat.forName(pfName); } } }; {noformat} I thought that would use plenty of RAM (100 MB), and then downsize to the appropriately sized bit set (~10% saturation) when writing ... but somehow it didn't. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397445#comment-13397445 ] Michael McCandless commented on LUCENE-4069: OK I think I found the problem w/ the patch: in BloomFilteringPostingsFormat.saveAppropriatelySizedBloomFilter, it should be {{rightSizedSet.serialize(bloomOutput);}} not {{bloomFilter.serialize(bloomOutput);}} (it was saving the un-downsized version). Now I see my .blm files varying in size according to how large the segment is... Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397457#comment-13397457 ] Michael McCandless commented on LUCENE-4069: Results after fixing the not downsizing bug: {noformat} TaskQPS base StdDev base QPS bloomStdDev bloom Pct diff Fuzzy1 97.021.43 88.673.08 -13% - -4% Fuzzy2 42.501.15 40.321.75 -11% - 1% AndHighHigh 17.990.29 17.740.47 -5% - 2% Respell 82.762.96 81.663.54 -8% - 6% AndHighMed 55.771.02 55.061.51 -5% - 3% TermGroup1M 42.350.79 41.980.85 -4% - 3% TermBGroup1M 49.550.58 49.140.60 -3% - 1% SpanNear 12.530.15 12.490.09 -2% - 1% TermBGroup1M1P 64.360.83 64.340.83 -2% - 2% Prefix3 34.181.25 34.451.38 -6% - 8% Wildcard 34.421.12 34.751.35 -6% - 8% SloppyPhrase 11.480.17 11.610.22 -2% - 4% Phrase8.470.308.570.25 -5% - 7% Term 94.064.50 95.452.34 -5% - 9% IntNRQ9.840.63 10.070.71 -10% - 16% OrHighMed 34.572.07 35.782.24 -8% - 16% OrHighHigh 11.660.64 12.110.73 -7% - 16% PKLookup 129.752.37 153.802.31 14% - 22% {noformat} Basically same as before ... which is good (it means target 10% saturation doesn't lose any perf). Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13396934#comment-13396934 ] Mark Harwood commented on LUCENE-4069: -- Mike, currently having various issues getting this benchmark framework up and running on my Windows platform here - is it easy for you to kick off another run with the latest patch on your setup? The latest change to the patch shouldn't require an index rebuild from your last run. No worries if this is too much hassle for you - I'll probably just try switch to testing on OSX at home. Cheers, Mark Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397010#comment-13397010 ] Michael McCandless commented on LUCENE-4069: bq. Mike, currently having various issues getting this benchmark framework up and running on my Windows platform here Alas it's not easy ... please report back on how to make it easier to set up! bq. is it easy for you to kick off another run with the latest patch on your setup? No problem: I'll run perf test again. It's easy... Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397054#comment-13397054 ] Mark Harwood commented on LUCENE-4069: -- bq. problem: I'll run perf test again. It's easy... Great, thanks. bq. Alas it's not easy ... please report back on how to make it easier to set up! My Windows-based woes were: 1) Had to install python (used 2.7) 2) Figure out python proxy settings for Wikipedia download 3) PySVN missing - downloaded install exe but it claimed Python 2.7 wasn't installed/available so gave up and did svn checkout manually 4) Ran first python test and it aborted with complaint about GnuPlot missing I imagine most of what is needed here comes out of the box on typical OSX/Linux setup. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397082#comment-13397082 ] Michael McCandless commented on LUCENE-4069: Results from last patch: {noformat} TaskQPS base StdDev base QPS bloomStdDev bloom Pct diff IntNRQ 11.351.27 10.140.62 -24% - 6% Fuzzy1 108.523.34 101.822.90 -11% - 0% Prefix3 64.872.17 61.551.61 -10% - 0% Wildcard 43.181.74 41.331.17 -10% - 2% Fuzzy2 41.761.40 40.051.00 -9% - 1% Term 151.714.38 147.244.42 -8% - 2% SpanNear5.230.095.110.12 -6% - 1% OrHighMed 12.600.88 12.340.48 -11% - 9% SloppyPhrase8.250.208.090.07 -5% - 1% TermBGroup1M 69.980.68 68.801.13 -4% - 0% OrHighHigh 10.060.669.930.39 -11% - 9% Phrase 12.730.30 12.570.35 -6% - 3% TermGroup1M 35.440.42 35.080.67 -4% - 2% AndHighMed 63.402.27 62.901.11 -5% - 4% Respell 93.113.70 92.812.33 -6% - 6% TermBGroup1M1P 50.931.53 50.961.75 -6% - 6% AndHighHigh 15.860.71 15.930.27 -5% - 6% PKLookup 127.442.15 134.858.68 -2% - 14% {noformat} Looks like FuzzyN/Respell is good again ... PKLookup is a bit faster ... the rest is likely noise. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397174#comment-13397174 ] Michael McCandless commented on LUCENE-4069: {quote} My Windows-based woes were: 1) Had to install python (used 2.7) {quote} OK. bq. 2) Figure out python proxy settings for Wikipedia download Hmmm ... you can also manually download separately. bq. 3) PySVN missing - downloaded install exe but it claimed Python 2.7 wasn't installed/available so gave up and did svn checkout manually Oh I see: setup.py uses PySVN; hmm. Just checking out manually is better... bq. 4) Ran first python test and it aborted with complaint about GnuPlot missing Hmm we should default printIndexCharts to False (I'll fix). bq. I imagine most of what is needed here comes out of the box on typical OSX/Linux setup. Hmm I don't think pysvn does, nor gnuplot ... Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13395773#comment-13395773 ] Mark Harwood commented on LUCENE-4069: -- Interesting results, Mike - thanks for taking the time to run them. bq. BloomFilteredFieldsProducer should just pass through intersect to the delegate? I have tried to make the BloomFilteredFieldsProducer get out of the way of the client app and the delegate PostingsFormat as soon as it is safe to do so i.e. when the user is safely focused on a non-filtered field. While there is a chance the client may end up making a call to TermsEnum.seekExact(..) on a filtered field then I need to have a wrapper object in place which is in a position to intercept this call. In all other method invocations I just end up delegating calls so I wonder if all these extra method calls are the cause of the slowdown you see e.g. when Fuzzy is enumerating over many terms. The only other alternatives to endlessly wrapping in this way are: a) API change - e.g. allow TermsEnum.seekExact to have a pluggable call-out for just this one method. b) Mess around with byte-code manipulation techniques to weave in Bloom filtering(the sort of thing I recall Hibernate resorts to) Neither of these seem particularly appealing options so I think we may have to live with fuzzy+bloom not being as fast as straight fuzzy. For completeness sake - I don't have access to your benchmarking code but I would hope that PostingsFormat.fieldsProducer() isn't called more than once for the same segment as that's where the Bloom filters get loaded from disk so there's inherent cost there too. I can't imagine this is the case. BTW I've just finished a long-running set of tests which mixes up reads and writes here: http://goo.gl/KJmGv This benchmark represents how graph databases such as Neo4j use Lucene for an index when loading (I typically use the Wikipedia links as a test set). I look to get a 3.5 x speed up in Lucene 4 and Lucene 3.6 gets nearly 9 x speedup over the comparatively slower 3.6 codebase. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13395830#comment-13395830 ] Michael McCandless commented on LUCENE-4069: bq. Interesting results, Mike - thanks for taking the time to run them. You're welcome! {quote} bq. BloomFilteredFieldsProducer should just pass through intersect to the delegate? I have tried to make the BloomFilteredFieldsProducer get out of the way of the client app and the delegate PostingsFormat as soon as it is safe to do so i.e. when the user is safely focused on a non-filtered field. While there is a chance the client may end up making a call to TermsEnum.seekExact(..) on a filtered field then I need to have a wrapper object in place which is in a position to intercept this call. In all other method invocations I just end up delegating calls so I wonder if all these extra method calls are the cause of the slowdown you see e.g. when Fuzzy is enumerating over many terms. The only other alternatives to endlessly wrapping in this way are: a) API change - e.g. allow TermsEnum.seekExact to have a pluggable call-out for just this one method. b) Mess around with byte-code manipulation techniques to weave in Bloom filtering(the sort of thing I recall Hibernate resorts to) Neither of these seem particularly appealing options so I think we may have to live with fuzzy+bloom not being as fast as straight fuzzy. {quote} I think the fix is simple: you are not overriding Terms.intersect now, in BloomFilteredTerms. I think you should override it and immediately delegate and then FuzzyN/Respell performance should be just as good as Lucene40 codec. bq. For completeness sake - I don't have access to your benchmarking code All the benchmarking code is here: http://code.google.com/a/apache-extras.org/p/luceneutil/ I run it nightly (trunk) and publish the results here: http://people.apache.org/~mikemccand/lucenebench/ bq. but I would hope that PostingsFormat.fieldsProducer() isn't called more than once for the same segment as that's where the Bloom filters get loaded from disk so there's inherent cost there too. I can't imagine this is the case. It's only called once on init'ing the SegmentReader (or at least it better be!). {quote} BTW I've just finished a long-running set of tests which mixes up reads and writes here: http://goo.gl/KJmGv This benchmark represents how graph databases such as Neo4j use Lucene for an index when loading (I typically use the Wikipedia links as a test set). I look to get a 3.5 x speed up in Lucene 4 and Lucene 3.6 gets nearly 9 x speedup over the comparatively slower 3.6 codebase. {quote} Nice results! It looks like bloom(3.6) is faster than bloom(4.0)? Why is that... Also I wonder why you see such sizable (3.5X speedup) gains on PK lookup but in my benchmark I see only ~13% - 24%. My index has 5 segments per level... Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13395857#comment-13395857 ] Mark Harwood commented on LUCENE-4069: -- bq. I think the fix is simple: you are not overriding Terms.intersect now, in BloomFilteredTerms Good catch - a quick test indeed shows a speed up on fuzzy queries. I'll prepare a new patch. I'm not sure on why 3.6+Bloom is faster than 4+Bloom in my tests. I'll take a closer look at your benchmark. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13395980#comment-13395980 ] Michael McCandless commented on LUCENE-4069: bq. Didn't get too far with running the Wikipedia perf tests due to missing data file (see http://code.google.com/a/apache-extras.org/p/luceneutil/issues/detail?id=7 ) Woops sorry: I posted a comment there (new, more recent wikipedia export). Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13295748#comment-13295748 ] Michael McCandless commented on LUCENE-4069: I ran a benchmark on 10 M Wikipedia index; for the factory I used createSetBasedOnMemory and passed it 100 MB; I think that's enough to ensure we get the 10% saturation on save ...: {noformat} TaskQPS base StdDev base QPS bloomStdDev bloom Pct diff Fuzzy1 102.473.67 41.950.78 -61% - -56% Fuzzy2 38.361.76 18.680.37 -54% - -47% Respell 89.894.38 44.090.52 -53% - -47% Wildcard 40.482.82 36.200.64 -17% - -2% SloppyPhrase7.960.288.070.07 -3% - 5% Prefix3 61.945.34 63.350.37 -6% - 12% TermBGroup1M 71.376.79 73.731.55 -7% - 16% AndHighMed 64.095.51 66.731.75 -6% - 16% TermBGroup1M1P 49.553.78 51.752.67 -7% - 18% AndHighHigh 16.051.12 16.770.53 -5% - 15% TermGroup1M 35.873.07 37.560.74 -5% - 16% OrHighHigh9.601.38 10.150.65 -13% - 31% OrHighMed 11.931.91 12.630.93 -15% - 35% IntNRQ9.121.259.680.11 -7% - 24% Term 154.55 19.60 165.320.97 -5% - 23% Phrase 11.400.33 12.210.182% - 11% SpanNear4.310.074.730.037% - 12% PKLookup 122.781.42 145.955.22 13% - 24% {noformat} Baseline is Lucene40 PostingsFormat even for the id field ... so PKLookup gets a good improvement. This is on an index w/ 5 segments at each level. Other queries seem to speed up as well (eg Term, Or*). The queries that rely on Terms.intersect got much worse: is the BloomFilteredFieldsProducer should just pass through intersect to the delegate? Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13287258#comment-13287258 ] Mark Harwood commented on LUCENE-4069: -- I've thought some more about option 2 (PerFieldPF reusing wrapped PFs) and it looks to get very ugly very quickly. There's only so much PerFieldPF can do to rationalize a random jumble of PF instances presented to it by clients. I think the right place to draw the line is Lucene-4093 i.e. a simple .equals() comparison on top-level PFs to eliminate any duplicates. Any other approach that also tries to de-dup nested PFs looks to be adding a lot of complexity, especially when you consider what that does to the model of read-time object instantiation. This would be significant added complexity to solve a problem you have already suggested is insignificant (i.e. too many files doesn't really matter when using CFS). I can remove the per-field stuff from BloomPF if you want but I imagine I will routinely subclass it to add this optimisation back in to my apps. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286555#comment-13286555 ] Robert Muir commented on LUCENE-4069: - I don't think the abstract class should be registered in the SPI. Instead i think the concrete Bloom+Lucene40 that you have in tests should be moved into src/java and registered there, just call it Bloom40 or something. The abstract api is still available for someone that wants to do something more specialized. This is just like how pulsing (another wrapper) is implemented. As far as disabling this for certain tests, import o.a.l.util.LuceneTestCase.SuppressCodecs and put something like this at class level: {code} @SuppressCodecs(Bloom40) public class TestFoo... @SuppressCodecs({Bloom40, Memory}) public class TestBar... {code} The strings in here can be codecs or postings formats Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286562#comment-13286562 ] Robert Muir commented on LUCENE-4069: - Seeing the tests in question though, i dont think you want to disable this for these entire test classes. We dont have a way to disable this on a per-method basis: and I think its generally not possible because many classes create indexes in @BeforeClass etc. An alternative would be to just pick this less often in RandomCodec: see the SimpleText hack :) Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286598#comment-13286598 ] Mark Harwood commented on LUCENE-4069: -- bq. Instead i think the concrete Bloom+Lucene40 that you have in tests should be moved into src/java and registered there What problem would that be trying to solve? Registration (or creation) of any BloomFilteringPostingsFormat subclasses is not necessary to decode index contents. Offering a Bloom40 would only buy users a pairing of Lucene40Postings and Bloom filtering but they would still have to declare which fields they want Bloom filtering on at write time. This isn't too hard using the code in the existing patch: {code:title=ThisWorks.java} final SetStringbloomFilteredFields=new HashSetString(); bloomFilteredFields.add(PRIMARY_KEY_FIELD_NAME); iwc.setCodec(new Lucene40Codec(){ BloomFilteringPostingsFormat postingOptions=new BloomFilteringPostingsFormat(new Lucene40PostingsFormat(), bloomFilteredFields); @Override public PostingsFormat getPostingsFormatForField(String field) { return postingOptions; } }); {code} No extra subclasses/registration required here to read the index built with the above setup. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286599#comment-13286599 ] Robert Muir commented on LUCENE-4069: - I dont understand why this handles fields. Someone should just pick that with perfieldpostingsformat. So you have the abstract wrapper(takes the wrapped postings format, and a String name), not registered. And you have a concrete impl registered that is just abstractWrapper(lucene40, Bloom40): done. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286600#comment-13286600 ] Mark Harwood commented on LUCENE-4069: -- bq. An alternative would be to just pick this less often in RandomCodec: see the SimpleText hack Another option might be to make the TestBloomFilteredLucene40Postings pick a ludicrously small Bitset sizing option for each field so that we can accommodate tests that create silly numbers of fields. The bitsets being so small will just quickly reach saturation and force all reads to hit the underlying FieldsProducer. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286616#comment-13286616 ] Mark Harwood commented on LUCENE-4069: -- bq. I dont understand why this handles fields. Someone should just pick that with perfieldpostingsformat. That would be inefficient because your PFPF will see BloomFilteringPostingsFormat(field1 + Lucene40) and BloomFilteringPostingsFormat(field2 + Lucene40) as fundamentally different PostingsFormat instances and consequently create multiple files named differently because it assumes these instances may be capable of using radically different file structures. In reality, the choice of BloomFilter with field 1 or BloomFilter with field 2 or indeed no BloomFilter does not fundamentally alter the underlying delegate PostingFormat's file format - it only adds a supplementary blm file on the side with the field summaries. For this reason it is a mistake to configure seperate BloomFilterPostingsFormat instances on a per-field basis if they can share a common delegate. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286619#comment-13286619 ] Robert Muir commented on LUCENE-4069: - {quote} That would be inefficient because your PFPF will see BloomFilteringPostingsFormat(field1 + Lucene40) and BloomFilteringPostingsFormat(field2 + Lucene40) as fundamentally different PostingsFormat instances and consequently create multiple files named differently because it assumes these instances may be capable of using radically different file structures. {quote} But adding per-field handling here is not the way to solve this: its messy. Per-Field handling should all be handled at a level above in PerFieldPostingsFormat. To solve what you speak of we just need to resolve LUCENE-4093. Then multiple postings format instances that are 'the same' will be deduplicated correctly. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286707#comment-13286707 ] Mark Harwood commented on LUCENE-4069: -- bq. To solve what you speak of we just need to resolve LUCENE-4093. Presumably the main objective here is that in order to cut down on the number of files we store, content consumers of various types should aim to consolidate multiple fields' contents into a single file (if they share common config choices). bq. Then multiple postings format instances that are 'the same' will be deduplicated correctly. The complication in this case is that we essentially have 2 consumers (Bloom and Lucene40), one wrapped in the other with different but overlapping choices of fields e.g we want a single Lucene40 to process all fields but we want Bloom to handle only a subset of these fields. This will be a tough one for PFPF to untangle while we are stuck with a delegating model for composing consumers. This may be made easier if instead of delegating a single stream we have a *stream-splitting* capability via a multicast subscription e.g. Bloom filtering consumer registers interest in content streams for fields A and B while Lucene40 is consolidating content from fields A, B, C and D. A broadcast mechanism feeds each consumer a copy of the relevant stream and each consumer is responsible for inventing their own file-naming convention that avoids muddling files. While that may help for writing streams it doesn't solve the re-assembly of producer streams at read-time where BloomFilter absolutely has to position itself in front of the standard Lucene40 producer in order to offer fast-fail lookups. In the absence of a fancy optimised routing mechanism (this all may be overkill) my current solution was to put BloomFilter in the delegate chain armed with a subset of fieldnames to observe as a larger array of fields flow past to a common delegate. I added some Javadocs to describe the need to do it this way for an efficient configuration. You are right that this is messy (ie open to bad configuration) but operating this deep down in Lucene that's always a possibility regardless of what we put in place. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286712#comment-13286712 ] Robert Muir commented on LUCENE-4069: - {quote} but overlapping choices of fields e.g we want a single Lucene40 to process all fields but we want Bloom to handle only a subset of these fields. {quote} Thats not true: I disagree. Its an implementation detail that Bloom as a postingsformat wraps another one (thats just the abstract implementation), and the file formats should not expose this in general for any format. This is true for a number of reasons: e.g. in the pulsing case the wrapped writer only gets a subset of the postings: therefore the wrapped writer's files are incomplete and an implementation detail. its enough here that if you have 5 fields: 2 bloom and 3 not, that we detect there are only two postings formats in use, regardless of whether you have 2 or 5 actual object instances. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286717#comment-13286717 ] Robert Muir commented on LUCENE-4069: - And separately, you can always contain the number of files even today by: * using only unique instances yourself when writing (rather than waiting on LUCENE-4093) * using the compound file format. The purpose of LUCENE-4093 is just to make this simpler, but I opened it as a separate issue because its really solely an optimization, and only for a pretty rare case where people are customizing the index format for different fields. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286754#comment-13286754 ] Mark Harwood commented on LUCENE-4069: -- Its true to say that Bloom is a different case to Pulsing - Bloom does not interfere in any with the normal recording of content in the wrapped delegate whereas Pulsing does. It may prove useful for us to mark a formal distinction between these mutating/non mutating types so we can treat them differently and provide optimisations? bq. And separately, you can always contain the number of files even today by using only unique instances yourself when writing Contained but not optimal - roughly double the number of required files if I want the common case of a primary key indexed with Bloom. I can't see a way of indexing with Bloom-plus-Lucene40 on field A and indexing with just Lucene40 on fields B,C and D and winding up with only one Lucene40 set of files with a common segment suffix. The way I did find of achieving this was to add a bloomFilteredFields set into my single Bloom+Lucene40 instance used for all fields. Is there any other option here currently? Looking to the future, 4093 may have more capabilities at optimising if it understands the distinction between mutating wrappers and non-mutating ones and how they are composed? Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286756#comment-13286756 ] Robert Muir commented on LUCENE-4069: - {quote} Contained but not optimal - roughly double the number of required files if I want the common case of a primary key indexed with Bloom. {quote} Then use CFS, its optimal always (1). I really dont think we should make this complex to save 2 or 3 files total (even in a complex config with many fields). Its not worth the complexity. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286815#comment-13286815 ] Mark Harwood commented on LUCENE-4069: -- bq. Its not worth the complexity There's no real added complexity in BloomFilterPostingsFormat - it has to be capable of storing blooms for 1 field anyway and using the fieldname set is roughly 2 extra lines of code to see if a TermsConsumer needs wrapping or not. From a client side you don't have to use this feature - the fieldname set can be null in which case it will wrap all fields sent its way. If you do chose to supply a set the wrapped PostingsFormat will have the advantage of being shared for bloomed and non-bloomed fields. We could add a constructor that removes the set and mark the others expert. For me this falls into one of the many faster-if-you-know-about-it optimisations like FieldSelectors or recycling certain objects. Basically a useful hint to Lucene to save some extra effort but one which you dont *need* to use. Lucene-4093 may in future resolve the multi-file issue but I'm not sure it will do so without significant complication. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286868#comment-13286868 ] Simon Willnauer commented on LUCENE-4069: - bq. I really dont think we should make this complex to save 2 or 3 files total (even in a complex config with many fields). Its not worth the complexity. I agree. I think those postings formats should only deal with encoding and not with handling certain fields different. A user / app should handle this in the codec. Ideally you don't have any conditions in the relevant methods like termsConsumer etc. bq. For me this falls into one of the many faster-if-you-know-about-it optimisations like FieldSelectors or recycling certain objects. Basically a useful hint to Lucene to save some extra effort but one which you dont need to use. why is this a speed improvement? reading from one file vs. multiple is not really faster though. Anyway, I think we should make this patch as simple as possible and don't handle fields in the PF. We can still open another issue or wait until LUCENE-4093 is in to discuss this issue? Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286882#comment-13286882 ] Robert Muir commented on LUCENE-4069: - {quote} For me this falls into one of the many faster-if-you-know-about-it optimisations like FieldSelectors or recycling certain objects. Basically a useful hint to Lucene to save some extra effort but one which you dont need to use. {quote} I agree with Simon, its not going to be faster. Worse, it creates a situation from the per-field perspective where multiple postings formats are sharing the same files for a segment. This would make it harder to do things like refactorings of codec apis in the future. So where is the benefit? Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286916#comment-13286916 ] Mark Harwood commented on LUCENE-4069: -- bq. why is this a speed improvement? Sorry - misleading. Replace the word faster in my comment with better and that makes more sense - I mean better in terms of resource usage and reduced open file handles. This seemed relevant given the earlier comments about Solr's use of non-compound files: bq. [Solr] create massive amounts of files if we did so (add to the fact it disables compound files by default and its a disaster...) I can see there is a useful simplification being sought for here if PerFieldPF can consider each of the unique top-level PFs presented to it as looking after an exclusive set of files. As the centralised allocator of file names it can then simply call each unique PF with a choice of segment suffix to name its various files without conflicting with other PFs. Lucene 4093 is all about better determining which PF is unique using .equals(). Unfortunately I don't think this approach is sufficiently complex. In order to avoid allocating all unnecessary file names PerFieldPF would have to further understand the nuances of which PFs were being wrapped by other PFs and which wrapped PFs would be reusable outside of their wrapped PF (as is the case with BloomPF's wrapped PF). That seems a more complex task than implementing equals(). So it seems we have 3 options: 1) Ignore the problems of creating too many files in the case of BloomPF and any other examples of wrapping PFs 2) Create a PerFieldPF implementation that reuses wrapped PFs using some generic means of discovering recyclable wrapped PFs (i.e go further than what 4093 currently proposes in adding .equals support) 3) Retain my BloomPF-specific solution to the problem for those prepared to use lower-level APIs. Am I missing any other options and which one do you want to go for? Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286926#comment-13286926 ] Simon Willnauer commented on LUCENE-4069: - bq. Create a PerFieldPF implementation that reuses wrapped PFs using some generic means of discovering recyclable wrapped PFs (i.e go further than what 4093 currently proposes in adding .equals support) I think we should investigate this further. Lets keep this issue simple and remove the field handling and fix this on a higher level! Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285518#comment-13285518 ] Mark Harwood commented on LUCENE-4069: -- Thanks for the comment, Rob. While the choice of Codec can be an anonymous inner class, resolving the choice of PostingsFormat is trickier. BloomFilterPostingsFormat is now intended to wrap any another choice of PostingsFormat and Simon has suggested leaving the Bloom support purely abstract. However, as an end user if I want to use Bloom support on the standard Lucene codec I would then have to write one of these: {code:title=MyBloomFilteredLucene40Postings.java} public class MyBloomFilteredLucene40Postingsextends BloomFilteringPostingsFormatBase { public MyBloomFilteredLucene40Postings() { //declare my choice of PostingsFormat to be wrapped and provide a unique name for this combo of Bloom-plus-delegate super(myBL40, new Lucene40PostingsFormat()); } } {code} The resulting index files are then named [segname]_myBL40.[filetypesuffix]. At read-time the myBL40 bit of the filename is used to lookup via Service Provider registrations the decoding class so com.xx.MyBloomFilteredLucene40Postings would need adding to a o.a.l.codecs.PostingsFormat file for the registration to work. I imagine Bloom-plus-Lucene40Postings would be a common combo and if both are in core it would be annoying to have to code support for this in each app or for things like Luke to have to have classpaths redefined to access the app-specific class that was created purely to bind this combo of core components. I think a better option might be to change the Bloom filtering base class to record the choice of delegate PostingsFormat in it's own blm file at write-time and instantiate the appropriate delegate instance at read-time using the recorded name. The BloomFilteringBaseClass would need changing to a final class rather than an abstract so that core Lucene could load it as the handler for [segname]_BloomPosting.xxx files and it would then have to examine the [segname].blm file to discover and instantiate the choice of delegate PostingsFormat using the standard service registration mechanism. At write-time clients would need to instantiate the BloomFilterPostingsFormat, passing a choice of PostingsFormat delegate to the constructor. At read-time Lucene core would invoke a zero-arg constructor. I'll look into this as an approach. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285706#comment-13285706 ] Mark Harwood commented on LUCENE-4069: -- Aaaargh. Unless I've missed something, I have concerns with the fundamental design of the current Codec loading mechanism. It seems too tied to the concept of a ServiceProvider class-loading mechanism, forcing users to write new SPI-registered classes in order to simply declare what amount to index schema configuration choices. Example: If I take Rob's sample Codec above and choose to use a subtly different configuration of the same PostingsFormat class for different fields it breaks: {code:title=ThisBreaks.java} Codec fooCodec=new Lucene40Codec() { @Override public PostingsFormat getPostingsFormatForField(String field) { if (text.equals(field)) { return new FooPostingsFormat(1); } if (title.equals(field)) { //same impl as text field, different constructor settings return new FooPostingsFormat(2); } return super.getPostingsFormatForField(field); } }; {code} This causes a file overwrite error as PerFieldPostingsFormat uses the same name from FooPostingsFormat(1) and FooPostingsFormat(2) to create files. In order to safely make use of differently configured choices of the same PostingsFormat we are forced to declare a brand new subclass with a unique new service name and entry in the service provider registration. This is essentially where I have got to in trying to integrate this Bloom filtering logic. This dependency on writing custom classes seems to make everything a bit fragile, no? What hope has Luke got in opening the average index without careful assembly of classpaths etc? If I contrast this with the world of database schemas it seems absurd to have a reliance on writing custom classes with no behaviour simply in order to preserve a configuration of an application's schema settings. Even an IOC container with XML declarations would offer a more agile means of assembling pre-configured *beans* rather than relying on a Service Provider mechanism that is only serving as a registry of *classes*. Anyone else see this as a major pain? Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285726#comment-13285726 ] Robert Muir commented on LUCENE-4069: - I'm not sure this is true: e.g. if your postings format requires parameters to decode the segment, then this enforces that it records said parameters, e.g. Pulsing records these parameters. Codec parameters are at index-time, at read-time its your responsibility to be able to decode them solely from the index (this enforces that there doesnt need to be a crazy matching of user configuration at write and read time). Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285746#comment-13285746 ] Michael McCandless commented on LUCENE-4069: When I run all tests with the bloom 4.0 postings format ({{ant test-core -Dtests.postingsformat=BloomFilteredLucene40PostingsFormat}}), I see a lot of failures..., eg lots of CheckIndex failures like: {noformat} [junit4] 1 java.lang.RuntimeException: vector term=[e4 b8 80 74 77 6f] field=textField2Utf8 does not exist in postings; doc=0 [junit4] 1at org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:1440) [junit4] 1at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:575) [junit4] 1at org.apache.lucene.util._TestUtil.checkIndex(_TestUtil.java:186) [junit4] 1at org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:587) [junit4] 1at org.apache.lucene.index.TestFieldsReader.afterClass(TestFieldsReader.java:72) [junit4] 1at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [junit4] 1at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) {noformat} But also other spooky failures. Are these known issues? Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285744#comment-13285744 ] Mark Harwood commented on LUCENE-4069: -- This fails if you add docs with title and text fields: {code:title=ThisCrashes.java} Codec fooCodec=new Lucene40Codec() { @Override public PostingsFormat getPostingsFormatForField(String field) { if (text.equals(field)) { return new MemoryPostingsFormat(false); } if (title.equals(field)) { return new MemoryPostingsFormat(true); } else { return super.getPostingsFormatForField(field); } } }; {code} Exception in thread main java.io.IOException: Cannot overwrite: C:\temp\luceneCodecs\_2_Memory.ram This also fails: {code:title=ThisToo.java} Codec fooCodec=new Lucene40Codec() { SimpleTextPostingsFormat theSimple = new SimpleTextPostingsFormat(); @Override public PostingsFormat getPostingsFormatForField(String field) { if (text.equals(field)) { return new SimpleTextPostingsFormat(); } if (title.equals(field)) { return new SimpleTextPostingsFormat(); } else { return super.getPostingsFormatForField(field); } } }; {code} with Exception in thread main java.io.IOException: Cannot overwrite: C:\temp\luceneCodecs\_1_SimpleText.pst Whereas sharing the same instance of a PostingsFormat class across fields works: {code:title=ThisWorks.java} Codec fooCodec=new Lucene40Codec() { SimpleTextPostingsFormat theSimple = new SimpleTextPostingsFormat(); @Override public PostingsFormat getPostingsFormatForField(String field) { if ((text.equals(field))|| (title.equals(field))) { return theSimple; } else { return super.getPostingsFormatForField(field); } } }; {code} Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285745#comment-13285745 ] Robert Muir commented on LUCENE-4069: - As far as your problem with parameters, this is actually just a limitation I created (on accident) of PerFieldPostingsFormat on LUCENE-4055. Because we use the postingsformat name as the segment suffix, its not possible to have e.g. id field with Pulsing(1) and date field with Pulsing(2). I'll open an issue and fix this, again its just an issue with PerFieldPostingsFormat. you should be able to use the same name here with different configs, as long as your PostingsFormat writes what it needs to read the files. Thanks for bringing this up. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
On 30/05/2012 17:09, Robert Muir (JIRA) wrote: I'm not sure this is true: e.g. if your postings format requires parameters to decode the segment, then this enforces that it records said parameters, e.g. Pulsing records these parameters. Codec parameters are at index-time, at read-time its your responsibility to be able to decode them solely from the index (this enforces that there doesnt need to be a crazy matching of user configuration at write and read time). I think what Mark is missing (and I saw as a limiting factor in developing other codecs) is to make it easier to customize Codec-s based on composition of reusable blocks, without necessarily needing a separate Codec class implementation. This could be worked around by having a configurable codec that stores its configuration and instantiates necessary reusable blocks, available using the SPI mechanism. On writing you could specify this configuration as Codec attributes, and they could be written out e.g. to SegmentInfos, and on read they would become available from SegmentInfos.attributes. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285757#comment-13285757 ] Robert Muir commented on LUCENE-4069: - Mark: I opened LUCENE-4090.. I broke it, so I'll fix it :) Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
On Wed, May 30, 2012 at 11:43 AM, Andrzej Bialecki a...@getopt.org wrote: On 30/05/2012 17:09, Robert Muir (JIRA) wrote: I'm not sure this is true: e.g. if your postings format requires parameters to decode the segment, then this enforces that it records said parameters, e.g. Pulsing records these parameters. Codec parameters are at index-time, at read-time its your responsibility to be able to decode them solely from the index (this enforces that there doesnt need to be a crazy matching of user configuration at write and read time). I think what Mark is missing (and I saw as a limiting factor in developing other codecs) is to make it easier to customize Codec-s based on composition of reusable blocks, without necessarily needing a separate Codec class implementation. This could be worked around by having a configurable codec that stores its configuration and instantiates necessary reusable blocks, available using the SPI mechanism. On writing you could specify this configuration as Codec attributes, and they could be written out e.g. to SegmentInfos, and on read they would become available from SegmentInfos.attributes. Well I think honestly here a bug in PerFieldPostingsFormat is definitely confusing the situation (LUCENE-4090). You should be able to set Pulsing(1) on id field and Pulsing(2) on date field and everything just work: but I broke that. I think thats whats causing the most grief. -- lucidimagination.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285765#comment-13285765 ] Mark Harwood commented on LUCENE-4069: -- bq. its just an issue with PerFieldPostingsFormat OK, thanks. My guess is you'll effectively be having to supplement postingsformat.getName() with object-instanceID in file names. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285768#comment-13285768 ] Robert Muir commented on LUCENE-4069: - Right, its not bad. Separately I'm not happy that its a little trappy in situations like your ThisToo.java, its certainly easy and safe to continue to use identityhashmap but I think we will need to require equals/hashcode on postingsformats. Otherwise its going to make the situation trappy in the future, e.g. we will want to add mechanisms to solr so that you can specify these parameters in the schema.xml (today you cannot), and this would create massive amounts of files if we did so (add to the fact it disables compound files by default and its a disaster...) Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285773#comment-13285773 ] Mark Harwood commented on LUCENE-4069: -- bq. When I run all tests with the bloom 4.0 postings format (ant test-core -Dtests.postingsformat=BloomFilteredLucene40PostingsFormat), Thanks for the pointer on targeting codec testing. I have another patch to upload with various tweaks e.g. configurable choice of hash functions, RandomCodec additions so I will concentrate testing on that before uploading. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285054#comment-13285054 ] Simon Willnauer commented on LUCENE-4069: - hey mark, I think you should really not provide any field handling at all. this should be done in user code ie. a general codes/postings format per field. What would be very useful here is integration into RandomCodec so we can randomly swap it in. Same is true for the hash algos memory settings, I think they should be part of the abstract class ctor and if a user wants to use its own algo they should write their own codec ie. subclass. this stuff is very expert and apps like solr can handle this on top. I am still worried about the TermsEnum reuse code, are you planning to look into this? Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285327#comment-13285327 ] Robert Muir commented on LUCENE-4069: - {quote} I was trying to limit the extensions apps would have to write to use this service (1 custom postings format subclass, 1 custom Codec subclass and 1 custom SPI config file) {quote} As far as Codec goes, we don't need to provide one for specialized per-field PostingsFormats. Users can just use an anonymous Lucene40Codec, which already has a hook for doing stuff like this (a protected method you can override to changeup postings format per-field). example that sets SimpleText on just the id field: {code} indexWriterConfig.setCodec(new Lucene40Codec() { @Override public PostingsFormat getPostingsFormatForField(String field) { if (id.equals(field)) { return new SimpleTextPostingsFormat(); } else { return super.getPostingsFormatForField(field); } } }); {code} this is only necessary to set on the indexwriterconfig for writing: PerFieldPostingsFormat records for that id field that it was encoded with SimpleText when the segment is opened, assuming SimpleText is in the classpath and registered, it will just work. from the solr side, they dont need to set any codec either, as its schema already uses this hook, so there its just: {code} fieldType name=string_simpletext class=solr.StrField postingsFormat=SimpleText/ {code} Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13284520#comment-13284520 ] Simon Willnauer commented on LUCENE-4069: - hey mark, this looks very interesting. I wonder if that also helps indexing in terms of applying deletes. did you test that at all or is that something are looking into as well? For deletes this could be very handy since bloom filters should help a lot when a given key is likely to only in one segment. Regarding the code, I think BloomFilter should be like Pulsing and only be a PostingsFormat. You should make the BloomFilterPostingsFormat abstract and pass in the wrapped postings format and delegate independent of the field. PostingsFormat is per field and is resolved by the codec. An actual implementation should subclass the abstract BloomFilterPostingsFormat that way you have a fixed PostingsFormat per delegate and don't need to deal with which field got which delegate bla bla. We should not have a Bloomfilter codec but rather swap in the bloomfilter postings format in test via random codec. I also wonder if we can extract a bloomfilter class into utils that builds an BloomFilter this could be useful elsewhere. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6 Reporter: Mark Harwood Priority: Minor Fix For: 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13284536#comment-13284536 ] Mark Harwood commented on LUCENE-4069: -- bq. I wonder if that also helps indexing in terms of applying deletes. did you test that I've not looked into that particularly but I imagine this may be relevant. Thanks for the tips for making the patch more generic. I'll get on it tomorrow and changed to FixedBitSet while I'm at it. bq. I also wonder if we can extract a bloomfilter class into utils There's some reusable stuff in this patch for downsizing the Bitset (according to desired saturation levels) having accumulated a stream of values. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6 Reporter: Mark Harwood Priority: Minor Fix For: 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13284545#comment-13284545 ] Simon Willnauer commented on LUCENE-4069: - bq. Thanks for the tips for making the patch more generic. I'll get on it tomorrow and changed to FixedBitSet while I'm at it. YW! A couple of things regarding your patch. I think you should write a header for the bloomfilter file using CodecUtils.writeHeader() to be consistent. Once you are ready you should also look at the other *PostingsFormat impls how we document the file formats there. We try to have versioned file formats instead of the monolithic one on the website. I saw some minor things like {code} finally { if (output != null) { output.close(); } } {code} you can use IOUtils.close*() for that which handles some exception cases etc. I briefly looked at your reuse code for TermsEnum and it seems it has the same issues as pulsing has (sovled). When you get a TermsEnum that is infact a Bloomfilter wrapper but wraps another codec underneath you keep on switching back an forth creating new delegated TermsEnum instances. You can look at PulsingPostingsReader and in particular at PulsingEnumAttributeImpl how we solved that in Pulsing. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6 Reporter: Mark Harwood Priority: Minor Fix For: 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13284562#comment-13284562 ] Michael McCandless commented on LUCENE-4069: It looks like MurmurHash.java is missing from the patch? Also, with LUCENE-4055 landing it looks like the patch no longer compiles (sorry!)... Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6 Reporter: Mark Harwood Priority: Minor Fix For: 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13283574#comment-13283574 ] Michael McCandless commented on LUCENE-4069: bq. I see a 35% improvement over standard Codecs on random lookups on a warmed index. Impressive! This is for primary key lookups? It looks like the primary keys are GUID-like right? (Ie randomly generated). I wonder if they had some structure instead (eg '%09d' % (id++)) how the results would look... bq. I also notice that the PulsingCodec is no longer faster than standard Codec - is this news to people as I thought it was supposed to be the way forward? That's baffling to me: it should only save seeks vs Lucene40 codec, so on a cold index you should see substantial gains, and on a warm index I'd still expect some gains. Not sure what's up... bq. I can open a seperate JIRA issue for this 4.0 version of the code if that makes more sense. I think it's fine to do it here? Really 3.6.x is only for bug fixes now ... so I think we should commit this to trunk. I wonder if you can wrap any other PostingsFormat (ie instead of hard-coding to Lucene40PostingsFormat)? This way users can wrap any PF they have w/ the bloom filter... Can you use FixedBitSet instead of OpenBitSet? Or is there a reason to use OpenBitSet here...? Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6 Reporter: Mark Harwood Priority: Minor Fix For: 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13283583#comment-13283583 ] Mark Harwood commented on LUCENE-4069: -- My current focus is speeding up primary key lookups but this may have applications outside of that (Zipf tells us there is a lot of low frequency stuff in free text). Following the principle of the best IO is no IO the Bloom Filter helps us quickly understand which segments to even bother looking in. That has to be a win overall. I started trying to write this Codec as a wrapper for any other Codec (it simply listens to a stream of terms and stores a bitset of recorded hashes in a .blm file). However that was trickier than I expected - I'd need to encode a special entry in my blm files just to know the name of the delegated codec I needed to instantiate at read-time because Lucene's normal Codec-instantiation logic would be looking for BloomCodec and I'd have to discover the delegate that was used to write all of the non-blm files. Not looked at FixedBitSet but I imagine that could be used instead. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6 Reporter: Mark Harwood Priority: Minor Fix For: 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13283615#comment-13283615 ] Mark Harwood commented on LUCENE-4069: -- Update- I've discovered this Bloom Filter Codec currently has a bug where it doesn't handle indexes with 1 field. It's probably all tangled up in the PerField... codec logic so I need to do some more digging. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6 Reporter: Mark Harwood Priority: Minor Fix For: 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org