[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

[
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416007#comment-13416007
]

Mark Harwood commented on LUCENE-4069:
--

bq. At a minimum I think before committing we should make the SegmentWriteState
accessible.

OK. Will that be the subject of a new Jira?

bq. Hmm why is anonymity at search time important?

It would seem to be an established design principle - see
https://issues.apache.org/jira/browse/LUCENE-4069#comment-13285726

It would be a pain if user config settings require a custom SPI-registered
class around just to decode the index contents. There's the resource/classpath
hell, the chance for misconfiguration and running Luke suddenly gets more
complex.
The line to be drawn is between what are just config settings (field names,
memory limits) and what are fundamentally different file formats (e.g. codec
choices).
The design principle that looks to be adopted is that the former ought to be
accommodated without the need for custom SPI-registered classes and the latter
would need to locate an implementation via SPI to decode stored content. Seems
reasonable.
The choice of hash algo does not fundamentally alter the on-disk format (they
all produce an int) so I would suggest we treat this as a config setting rather
than a fundamentally different choice of file format.

Segment-level Bloom filters for a 2 x speed up on rare term searches

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Uwe Schindler (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416014#comment-13416014
]

Uwe Schindler commented on LUCENE-4069:
---

{quote}
It would be a pain if user config settings require a custom SPI-registered
class around just to decode the index contents. There's the resource/classpath
hell, the chance for misconfiguration and running Luke suddenly gets more
complex.
The line to be drawn is between what are just config settings (field names,
memory limits) and what are fundamentally different file formats (e.g. codec
choices).
The design principle that looks to be adopted is that the former ought to be
accommodated without the need for custom SPI-registered classes and the latter
would need to locate an implementation via SPI to decode stored content. Seems
reasonable.
The choice of hash algo does not fundamentally alter the on-disk format (they
all produce an int) so I would suggest we treat this as a config setting rather
than a fundamentally different choice of file format.
{quote}

The design principle here is very simple: we must follow the SPI pattern
whenever you write an index that otherwise could not be read with default
settings and would produce e.g. a CorruptIndexException. If you have a codec
that writes special data for specific fields, it must write that per-field
information to the index. If you want to open this index with an IndexReader
again, there must not be any requirement for configuration settings on the
reader itself - a simple DirectoryReader.open() must be possible and a query
must be able to execute. The IndexReader must be able to get all this
information from the index files. If a special decoder for foobar is needed, it
must be loadable by SPI. This is similar to postings: a new postings format
needs a new SPI registration, otherwise you cannot read the index.
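
As a rough illustration of the name-based lookup described above, here is a minimal sketch built on the JDK's java.util.ServiceLoader. NamedDecoder and DecoderRegistry are hypothetical names invented for this example; they are not Lucene classes, and Lucene's own loader differs in detail.

```java
import java.util.ServiceLoader;

// Hypothetical service interface: each implementation reports the name
// that was written into the index when the segment was created.
interface NamedDecoder {
    String name();
}

// Hypothetical registry: resolves a stored name to an implementation
// discovered on the classpath through META-INF/services entries.
final class DecoderRegistry {
    static NamedDecoder forName(String storedName) {
        for (NamedDecoder d : ServiceLoader.load(NamedDecoder.class)) {
            if (d.name().equals(storedName)) {
                return d;
            }
        }
        // No provider on the classpath: fail loudly rather than misread the index.
        throw new IllegalArgumentException(
            "No SPI-registered decoder named '" + storedName + "' on the classpath");
    }
}
```

The reader needs only the name stored in the index; adding a JAR that registers a matching provider is enough to make the lookup succeed.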

And it is not true that Luke is more complex to configure. Just put the JAR
file that contains the SPI implementation on the classpath and you are fine.
Setting up a build environment is more complicated, but that's more a problem
of shitty Eclipse resource handling. ANT/MAVEN/IDEA is easy.


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416037#comment-13416037
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq.  If a special decoder for foobar is needed, it must be loadable by SPI. 

I think we are in agreement on the broad principles. The fundamental question 
here though is do you want to treat an index's choice of Hash algo as something 
that would require a new SPI-registered PostingsFormat to decode or can that be 
handled as I have done here with a general purpose SPI framework for hashing 
algos? 

Actually, re-thinking this, I suspect rather than creating our own, I can use 
Java's existing SPI framework for hashing in the form of MessageDigest. I'll 
take a closer look into that...



 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0

 Attachments: BloomFilterPostingsBranch4x.patch, 
 LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
 MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java


 An addition to each segment which stores a Bloom filter for selected fields 
 in order to give fast-fail to term searches, helping avoid wasted disk access.
 Best suited for low-frequency fields e.g. primary keys on big indexes with 
 many segments but also speeds up general searching in my tests.
 Overview slideshow here: 
 http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
 Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
 Patch based on 3.6 codebase attached.
 There are no 3.6 API changes currently - to play just add a field with _blm 
 on the end of the name to invoke special indexing/querying capability. 
 Clearly a new Field or schema declaration(!) would need adding to APIs to 
 configure the service properly.
 Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Uwe Schindler (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416039#comment-13416039
]

Uwe Schindler commented on LUCENE-4069:
---

bq. Actually, re-thinking this, I suspect rather than creating our own, I can
use Java's existing SPI framework for hashing in the form of MessageDigest.
I'll take a closer look into that...

That's a good idea. MessageDigest.getInstance(name) should be the way to go,
with the name string encoded in the index. If the MessageDigest API is
sufficient, then one can e.g. use Bouncy Castle just by putting it on the
classpath. If the named digest is not available when the index is opened, the
user will get an Exception, just as when a codec/postingsformat is missing.
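
A minimal sketch of what that could look like, using only the standard java.security.MessageDigest API. DigestHashSketch and hashTerm are hypothetical names, and folding the first four digest bytes into an int is an assumption made for this example rather than anything taken from the patch.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class DigestHashSketch {
    // Resolves the algorithm by the name stored in the index and folds the
    // first four digest bytes into an int that can address a bit set.
    static int hashTerm(String algorithmName, String term) throws NoSuchAlgorithmException {
        // Throws NoSuchAlgorithmException if no installed provider knows the name.
        MessageDigest md = MessageDigest.getInstance(algorithmName);
        byte[] digest = md.digest(term.getBytes(StandardCharsets.UTF_8));
        return ((digest[0] & 0xFF) << 24)
             | ((digest[1] & 0xFF) << 16)
             | ((digest[2] & 0xFF) << 8)
             |  (digest[3] & 0xFF);
    }
}
```

Calling hashTerm with a name that no provider supplies fails at lookup time, which mirrors the missing-codec behaviour described above.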


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416084#comment-13416084
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. MessageDigest.getInstance(name) should be the way to go

I'm less keen now - a quick scan of the docs around MessageDigest throws up 
some issues:
1) SPI registration of MessageDigest providers looks to get into permissions 
hell as it is closely related to security - see 
http://docs.oracle.com/javase/1.4.2/docs/guide/security/CryptoSpec.html#ProviderInstalling
 which talks about the steps required to approve a trusted provider.
2) MessageDigest as an interface is designed to stream content in potentially 
many method calls past the hashing algo. MurmurHash2.java is not currently 
written to process content this way and suits our needs in hashing small blocks 
of content in one hit. 

For these 2 reasons it looks like MessageDigest may be a pain to adopt and the 
existing approach proposed in this patch may be preferable.


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416290#comment-13416290
]

Michael McCandless commented on LUCENE-4069:

{quote}
bq. At a minimum I think before committing we should make the SegmentWriteState
accessible.

OK. Will that be the subject of a new Jira?
{quote}

No, I mean we shouldn't commit this patch until SegmentWriteState is
accessible when creating the FuzzySet. I think we can just pass it to
BloomFilterFactory.getSetForField? This way if the app knows it's a
PK field then it can use maxDoc to always size an appropriate
bit set up front.
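
A sketch of how a bit set could be sized from maxDoc for a primary-key field. BloomSizing and bitsFor are hypothetical names; the formula assumes a filter with a single hash function per key, where the false-positive rate after n insertions into m bits is 1 - e^(-n/m), so m = -n / ln(1 - p).

```java
final class BloomSizing {
    // Returns a power-of-two bit count large enough to keep the expected
    // false-positive rate at or below the target, assuming ONE hash per key.
    static int bitsFor(int expectedUniqueKeys, double targetFalsePositiveRate) {
        if (expectedUniqueKeys < 1
                || targetFalsePositiveRate <= 0.0
                || targetFalsePositiveRate >= 1.0) {
            throw new IllegalArgumentException("need keys >= 1 and 0 < rate < 1");
        }
        // m = -n / ln(1 - p); log1p keeps precision for small p.
        double exact = -expectedUniqueKeys / Math.log1p(-targetFalsePositiveRate);
        // Cap at 2^30 bits so the result always fits in an int.
        long needed = (long) Math.min(Math.ceil(exact), (double) (1L << 30));
        // Round up to a power of two so a bit can be selected with a mask.
        long pow2 = Long.highestOneBit(Math.max(1L, needed - 1)) << 1;
        return (int) pow2;
    }
}
```

For a primary-key field the app would pass maxDoc as expectedUniqueKeys, since each document contributes at most one key.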

bq. I think we are in agreement on the broad principles. The fundamental
question here though is do you want to treat an index's choice of Hash algo as
something that would require a new SPI-registered PostingsFormat to decode or
can that be handled as I have done here with a general purpose SPI framework
for hashing algos?

+1, that's exactly the question.

Ie, where to draw the line between config of an existing PF and
different PF.

But I guess swapping in different hash impl should be seen as simple
config change, so I think using SPI to find it at read time is OK.

I still don't like how trappy this approach is: the default hardwired
(8 MB) can be way too big (silently slows down your NRT reopens,
especially if you bloom all fields) or way too small (silently turns
off bloom filter for fields that have too many unique terms).
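
The two failure modes can be put into numbers with a small sketch. It assumes the 8 MB default is a bit set of 8 * 1024 * 1024 * 8 bits and that each term sets one bit; SaturationSketch is a hypothetical helper, not part of the patch.

```java
final class SaturationSketch {
    // 8 MB expressed in bits (assumed meaning of the hardwired default).
    static final long DEFAULT_BITS = 8L * 1024 * 1024 * 8;

    // Expected fraction of bits set after inserting uniqueTerms keys,
    // one hash per key: 1 - e^(-n/m).
    static double expectedFill(long uniqueTerms, long numBits) {
        return 1.0 - Math.exp(-(double) uniqueTerms / numBits);
    }
}
```

Under these assumptions a tiny NRT segment with a thousand unique terms leaves the bit set almost entirely empty, so nearly all 8 MB is overhead, while a field with hundreds of millions of unique terms sets nearly every bit and the filter stops rejecting anything.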

I also don't think this PF should be per-field: we have
PerFieldPostingsFormat for that, and if there are limitations in PFPF,
we should address them rather than having to make all future PFs
handle per-field-ness themselves. This PF should really handle one
field.

But I don't think these issues need to hold up commit (except for
making SegmentWriteState accessible)... we can improve over time. I
think we may simply want to fold this into the terms dict somehow.

Can you add @lucene.experimental to all the new APIs?


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

[
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416383#comment-13416383
]

Mark Harwood commented on LUCENE-4069:
--

A quick benchmark suggests that the new right-sized bitset, as opposed to the
old worst-case-scenario-sized bitset, is buying us a small performance improvement.

bq. I also don't think this PF should be per-field

There was a lengthy discussion earlier on this topic. The approach presented
here seems reasonable.
For the average user there is the DefaultBloomFilterFactory, which now has
reasonable sizing for all fields passed its way (assuming a numDocs=numKeys
heuristic to anticipate key volumes). Expert users can provide a
BloomFilterFactory with a custom per-field choice of sizing heuristic and can
also simply return null for non-bloomed fields.
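
The expert-user case might look roughly like the sketch below. FieldFilterFactory and PrimaryKeyOnlyFactory are hypothetical stand-ins for the factory discussed in this thread, the field name "id" is invented, and the budget of about 10 bits per key is an arbitrary assumption.

```java
import java.util.BitSet;

// Hypothetical stand-in for the per-field factory discussed in this thread.
interface FieldFilterFactory {
    // Returns a bit set sized for the field, or null to leave the field un-bloomed.
    BitSet filterForField(String fieldName, int maxDoc);
}

final class PrimaryKeyOnlyFactory implements FieldFilterFactory {
    @Override
    public BitSet filterForField(String fieldName, int maxDoc) {
        if (!"id".equals(fieldName)) {
            return null; // not the primary-key field: no Bloom filter
        }
        // Heuristic numKeys = numDocs, with an assumed budget of ~10 bits per key.
        long wanted = Math.max(64L, 10L * maxDoc);
        long pow2 = Long.highestOneBit(wanted - 1) << 1; // round up to a power of two
        return new BitSet((int) Math.min(pow2, 1L << 30));
    }
}
```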

Having a single, carefully configured BloomPF wrapper is preferable because you
can channel appropriately configured bloom settings to a common PF delegate and
avoid creating multiple .tii, .tis files etc because the PerFieldPF isn't smart
enough to figure out that these Bloom-ing choices do not require different
physical files for all the delegated tii etc structures.

You don't *have* to use the Per-field stuff in BloomPF but there are benefits
to be had in doing so which can't otherwise be achieved.

bq. Can you add @lucene.experimental to all the new APIs?

Done.


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416417#comment-13416417
]

Michael McCandless commented on LUCENE-4069:

Thanks Mark. +1 to commit.

I still think we could/should somehow fold this into the terms dict but we can
do that another day... this is a good step forward.

bq. because the PerFieldPF isn't smart enough to figure out that these
Bloom-ing choices do not require different physical files for all the delegated
tii etc structures.

We need to fix that, so that every PF wrapper that comes along doesn't feel
like it must handle per-field-ness itself.


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-16 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415219#comment-13415219
]

Michael McCandless commented on LUCENE-4069:

bq. Estimating key volumes in context 1 is probably hard without some
additional hints from the end user.

Actually, the indexer knows exactly how many unique terms it's about to
flush. It's the unique term count (for this one segment) that you
need right? We just have to plumb it somehow (add to
SegmentWriteState?).

bq. 2) Merged segments e.g. guessing how many unique keys survive a merge
operation

Seems like LUCENE-4198 needs to solve this same problem.

bq. Not sure how we get the OneMerge instance fed through the call stack -
could that be held somewhere on a ThreadLocal as generally useful context?

Yeah I'm not sure either ... but I agree an approx heuristic based on
merging segments unique term counts is better than nothing.

I don't like how you silently just get bad perf now if your up-front
guess was too small ... I also don't like that the default guessing is
static (8 MB): flushing tiny NRT segs is gonna pay an absurd price.
I think we need a solution for this...

Maybe another option is to simply make a 2nd pass after the segment
(flush or merge) is written, to build up the bit set; this way we know
exactly how many unique terms we have.


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-16 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415221#comment-13415221
 ] 

Michael McCandless commented on LUCENE-4069:


I don't think we should use a separate factory class to provide the
bloom filters; I think we should simplify.  EG add overridable methods
to BloomFilteringPF; this way the default impl (8 MB, isSaturated)
would be in BloomFilteringPF.

Also, why do we need to use SPI to find the HashFunction?  Seems like
overkill... we don't (yet) have a bunch of hash functions that are
vying here right?  We can make this pluggable at a later time... can't
the postings format impl pass in an instance of HashFunction when
making the FuzzySet?  Default would just be MurmurHash.

Can you move the imports under the copyright header? (eg in
HashFunction.java but maybe others).



[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-16 Thread Mark Harwood (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415362#comment-13415362
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. It's the unique term count (for this one segment) that you need right? 
Yes, I need it before I start processing the stream of terms being flushed.
 
bq. Seems like LUCENE-4198 needs to solve this same problem.

Another possibly related point on more access to merge context - custom 
codecs have a great opportunity at merge time to piggy-back some analysis on 
the data being streamed e.g. to spot trending terms whose term frequencies 
differ drastically between the merging source segments. This would require 
access to source segment as term postings are streamed to observe the change 
in counts. 

bq. Also, why do we need to use SPI to find the HashFunction? Seems like 
overkill... we don't (yet) have a bunch of hash functions that are vying here 
right?

There's already a MurmurHash3 algo - we're currently using v2 and so could 
anticipate an upgrade at some stage. This patch provides that future proofing.

bq. can't the postings format impl pass in an instance of HashFunction when 
making the FuzzySet

I don't think that is going to work. Currently all PostingsFormat impls that
extend BloomFilterPostingsFormat can be anonymous (i.e. unregistered via SPI).
All their settings (fields, hash algo, thresholds) etc are recorded at write 
time by the base class in the segment. At read-time it is the 
BloomFilterPostingsFormat base class that is instantiated, not the write-time 
subclass and so we need to store the hash algo choice. We can't rely on the 
original subclass being around and configured appropriately with the original 
write-time choice of hashing function.

I think the current way feels safer over all and also allows other Lucene 
functions to safely record hashes along with a hashname string that can be used 
to reconstitute results. 
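
The write-time recording described above can be sketched with plain java.io streams. FilterHeaderSketch is a hypothetical helper and the header layout (a UTF string followed by an int) is an assumption for illustration; it is not the on-disk format used by the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

final class FilterHeaderSketch {
    // Write side: persist the hash name and bit count ahead of the filter bits.
    static byte[] writeHeader(String hashName, int numBits) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(hashName);
            out.writeInt(numBits);
        }
        return buf.toByteArray();
    }

    // Read side: only the stored bytes are needed, not the write-time
    // subclass or its configuration.
    static String readHashName(byte[] header) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(header))) {
            return in.readUTF();
        }
    }
}
```

At read time the recovered name would then be resolved to a hash implementation, which is where the SPI lookup comes in.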

bq. Can you move the imports under the copyright header?

Will do




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-16 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415604#comment-13415604
]

Michael McCandless commented on LUCENE-4069:

{quote}
bq. It's the unique term count (for this one segment) that you need right?

Yes, I need it before I start processing the stream of terms being flushed.
{quote}

At a minimum I think before committing we should make the
SegmentWriteState accessible. EG then at least I can use numDocs for
my primary key field.

I really don't like how easy it is to silently mis-configure this PF:
the default 8 MB is way too high for an NRT setting and way too low for
a large index.

bq. Currently all PostingsFormat impls that extend BloomFilterPostingsFormat can
be anonymous (i.e. unregistered via SPI).

Hmm why is anonymity at search time important? I think that's a
non-feature, and we shouldn't make our core code more complex for it?

Ie, it's fine to require the app to have to make a named PF (that is
accessible via SPI), implementing all their custom bloom logic (which
fields are bloom'd, what hash to use, etc.).

When an app makes a custom Codec/PostingsFormat, it's expected that
that class is accessible via SPI at both index time and search time.


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-10 Thread Mark Harwood (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410145#comment-13410145
]

Mark Harwood commented on LUCENE-4069:
--

bq. So now we are close to 1M lookups/sec for a single thread!

Cool!

bq. I wonder if somehow we can do a better job picking the right sized bit
vector up front?
bq. You basically need to know up front how many unique terms will be in the
given field for this segment right?

Yes - the job of anticipating the number of unique keys probably has 2
different contexts:
1) Net new segments e.g. guessing up front how many docs/keys a user is likely
to generate in a new segment before the flush settings kick in.
2) Merged segments e.g. guessing how many unique keys survive a merge operation

Estimating key volumes in context 1 is probably hard without some additional
hints from the end user. Arguably the BloomFilterFactory.getSetForField()
method already represents where this setting can be controlled.
In context 2 where potentially large merges occur we could look at adding an
extra method to BloomFilterFactory to handle this different context e.g.
something like
FuzzySet getSetForMergeOpOnField(FieldInfo fi, OneMerge mergeContext)
Based on the size of the segments being merged and volumes of deletes a more
appropriate size of Bloom bitset could be allocated based on a worst-case
estimate.
Not sure how we get the OneMerge instance fed through the call stack - could
that be held somewhere on a ThreadLocal as generally useful context?
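
A worst-case estimate of that kind could be as simple as the sketch below. MergeEstimate and its parameters are hypothetical; the real per-segment figures would have to come from whatever merge context ends up being plumbed through.

```java
final class MergeEstimate {
    // Worst case: every live (non-deleted) document in every source segment
    // carries a distinct key, so the merged segment holds their sum.
    static long worstCaseUniqueKeys(int[] maxDocs, int[] deletedDocs) {
        if (maxDocs.length != deletedDocs.length) {
            throw new IllegalArgumentException("one delete count is required per segment");
        }
        long total = 0;
        for (int i = 0; i < maxDocs.length; i++) {
            total += maxDocs[i] - deletedDocs[i];
        }
        return total;
    }
}
```

The result is an upper bound, so a bit set sized from it may be larger than needed when keys repeat across the merging segments.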

Segment-level Bloom filters for a 2 x speed up on rare term searches

Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
Fix For: 4.0, 3.6.1


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409676#comment-13409676
 ] 

Michael McCandless commented on LUCENE-4069:


bq. That's tightened performance but that looks a scary amount of code for the
optimal solution of this basic incrementing operation

Well the code now looks horribly complex since we have a ton of booleans to do 
the same thing in different ways :)

From my limited poking-around testing the fastest way seems to be 
lookupByReader=false, shareEnums=true, doSortKeys=true, useDocValues=true, 
useBase36=true, doDirectDelete=true.  Once you specialize the code for those 
booleans then it looks much more sane.


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409679#comment-13409679
 ] 

Michael McCandless commented on LUCENE-4069:


Also: in digging into this I found and fixed a silly performance bug in the 
nightly PKLookup perf test:

http://people.apache.org/~mikemccand/lucenebench/PKLookup.html

So now we are close to 1M lookups/sec for a single thread!

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostingsBranch4x.patch, 
 LUCENE-4069-tryDeleteDocument.patch, MHBloomFilterOn3.6Branch.patch, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java


 An addition to each segment which stores a Bloom filter for selected fields 
 in order to give fast-fail to term searches, helping avoid wasted disk access.
 Best suited for low-frequency fields e.g. primary keys on big indexes with 
 many segments but also speeds up general searching in my tests.
 Overview slideshow here: 
 http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
 Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
 Patch based on 3.6 codebase attached.
 There are no 3.6 API changes currently - to play just add a field with _blm 
 on the end of the name to invoke special indexing/querying capability. 
 Clearly a new Field or schema declaration(!) would need adding to APIs to 
 configure the service properly.
 Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409687#comment-13409687
 ] 

Michael McCandless commented on LUCENE-4069:


bq. I added a 100mb start size for the BloomFilter for large-scale tests 
because without this it gets saturated and there were occasional big spikes in 
batch times.

I wonder if somehow we can do a better job picking the right sized bit vector 
up front?  You basically need to know up front how many unique terms will be in 
the given field for this segment right?  This seems similar to LUCENE-4198, 
where we want to improve Codec API so it has access to collection stats (like 
number of unique terms in this field) up front.
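
To make the sizing question concrete, here is a rough sketch of how a bit vector could be sized if the unique term count were known up front. It assumes a single hash function per term and a power-of-two bit count; the class BloomSizing and its methods are hypothetical names for illustration and are not part of the patch or of any Lucene API.

```java
/** Hypothetical helper (not from the patch): sizes a single-hash Bloom bit set up front. */
final class BloomSizing {
    private BloomSizing() {}

    /**
     * Returns the smallest power-of-two bit count (at least 64) whose expected
     * saturation stays at or below targetSaturation after inserting uniqueTerms
     * values with one hash each. Expected saturation for m bits and n values
     * is 1 - exp(-n / m).
     */
    static long bitsFor(long uniqueTerms, double targetSaturation) {
        if (uniqueTerms < 0 || targetSaturation <= 0.0 || targetSaturation >= 1.0) {
            throw new IllegalArgumentException(
                "need uniqueTerms >= 0 and 0 < targetSaturation < 1");
        }
        // Solve 1 - exp(-n / m) <= s for m, which gives m >= -n / ln(1 - s).
        double minBits = -uniqueTerms / Math.log(1.0 - targetSaturation);
        long bits = 64; // arbitrary floor: one long word
        while (bits < minBits) {
            bits <<= 1;
        }
        return bits;
    }

    /** Expected fraction of set bits after inserting uniqueTerms values into the given bit count. */
    static double expectedSaturation(long uniqueTerms, long bits) {
        return 1.0 - Math.exp(-(double) uniqueTerms / bits);
    }
}
```

For example, BloomSizing.bitsFor(10_000_000L, 0.10) comes out at 2^27 bits (16 MB), which is far below a fixed 100 MB starting size.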


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409693#comment-13409693
 ] 

Michael McCandless commented on LUCENE-4069:


I'll open a separate issue for IndexWriter.tryDeleteDocument; I think it's 
worth exploring...


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409746#comment-13409746
 ] 

Michael McCandless commented on LUCENE-4069:


I ran with the fastest configuration (lookupByReader=false,
shareEnums=true, doSortKeys=true, useDocValues=true,
doDirectDelete=true, useBase36=true).

I used MMapDir, 10M numDocs and keySpaceSize, Java 1.7.0_04, 2 GB
starting/max heap.

With doWorstCaseFlushing=false:

{noformat}
  Lucene40
55978 ms

  Bloom
49880 ms

  Memory
55965 ms

  Pulsing
56059 ms
{noformat}

With doWorstCaseFlushing=true:

{noformat}
  Lucene40
61097 ms

  Bloom
53698 ms

  Memory
61066 ms

  Pulsing
61355 ms
{noformat}

So Bloom is ~11-12% faster.  Very curious that Lucene40, Pulsing and
Memory are basically the same!



[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409762#comment-13409762
 ] 

Michael McCandless commented on LUCENE-4069:


I opened LUCENE-4203 for IndexWriter.tryDeleteDocument.

Urgh, patch is on the wrong issue ...


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-06 Thread Mark Harwood (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13408097#comment-13408097
 ] 

Mark Harwood commented on LUCENE-4069:
--

Thanks for the extra tests, Mike. That's tightened performance but that looks 
like a scary amount of code for the optimal solution of this basic incrementing 
operation :)

I've done some more benchmarks with the updated test and the performance 
characteristics are becoming clearer as shown in these results: 
http://goo.gl/dtWSb
Bloom performance is better than Pulsing but the gap narrows with the volumes 
of deletes lying around in old segments, caused by updates. In these cases the 
BloomFilter gives a false positive and falls back to the equivalent operations 
of Pulsing. I added a 100mb start size for the BloomFilter for large-scale 
tests because without this it gets saturated and there were occasional big 
spikes in batch times.
So overall there still looks to be a benefit and especially in low-frequency 
update scenarios.
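
As an illustration of the fast-fail and false-positive behaviour described above, here is a minimal sketch. It is not the patch code: FastFailLookup is a hypothetical class, String.hashCode() stands in for the MurmurHash2 hash, and a HashSet stands in for the segment's terms dictionary.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

/** Illustrative stand-in for a per-segment filter; all names here are hypothetical. */
final class FastFailLookup {
    private final BitSet bits;
    private final int mask;
    // Stands in for the expensive on-disk terms dictionary of one segment.
    private final Set<String> termsOnDisk = new HashSet<>();
    // Counts how often the filter could not rule the term out.
    int expensiveLookups = 0;

    FastFailLookup(int numBitsPowerOfTwo) {
        if (numBitsPowerOfTwo <= 0 || Integer.bitCount(numBitsPowerOfTwo) != 1) {
            throw new IllegalArgumentException("bit count must be a power of two");
        }
        bits = new BitSet(numBitsPowerOfTwo);
        mask = numBitsPowerOfTwo - 1;
    }

    void add(String term) {
        termsOnDisk.add(term);
        bits.set(term.hashCode() & mask); // one hash, one bit per term
    }

    boolean contains(String term) {
        if (!bits.get(term.hashCode() & mask)) {
            return false; // definite miss: the segment is never consulted
        }
        // "Maybe": either a real hit or a false positive, so pay for the real lookup.
        expensiveLookups++;
        return termsOnDisk.contains(term);
    }

    /** Fraction of set bits; as this grows, more misses turn into false positives. */
    double saturation() {
        return bits.cardinality() / (double) (mask + 1);
    }
}
```

With a 1024-bit filter holding only "Aa", a lookup of "C" is rejected by the filter alone, while "BB" (which has the same String.hashCode() as "Aa") passes the filter and falls back to the full lookup before returning false.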

I'll wait for the dust to settle on LUCENE-4190 (given this Codec introduces a 
new file) before thinking about committing.

Cheers
Mark




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-20 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397425#comment-13397425
 ] 

Michael McCandless commented on LUCENE-4069:


Hmm somehow I messed up when testing the bloom postings format: all of my 
on-disk bloom bit sets are 8 MB, regardless of the size of the segment... so 
something is wrong (I would expect segments w/ more terms to have a larger bit 
set...).  This is what I did for the codec:
{noformat}
final Codec codec = new Lucene40Codec() {
  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    // Use the dedicated postings format for the primary-key field only.
    final String pfName = field.equals("id") ? idFieldPostingsFormat : defaultPostingsFormat;
    if (pfName.equals("Bloom")) {
      // Wrap the default Lucene40 format with a Bloom filter capped at 100 MB.
      return new BloomFilteringPostingsFormat(PostingsFormat.forName("Lucene40"),
          new BloomFilterFactory() {
            @Override
            public FuzzySet getSetForField(FieldInfo info) {
              return FuzzySet.createSetBasedOnMaxMemory(100*1024*1024, new MurmurHash2());
            }
          });
    } else {
      return PostingsFormat.forName(pfName);
    }
  }
};
{noformat}

I thought that would use plenty of RAM (100 MB), and then downsize to the 
appropriately sized bit set (~10% saturation) when writing ... but somehow it 
didn't.
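
For readers unfamiliar with the downsizing step, here is one plausible way it can work, sketched under assumptions: bits are addressed as hash & (size - 1), sizes are powers of two, and the set is folded onto itself until the target saturation would be exceeded. FoldingDownsize and its methods are hypothetical names; this is not claimed to be what FuzzySet actually does.

```java
import java.util.BitSet;

/** Hypothetical sketch of downsizing a power-of-two bit set by folding; not the patch code. */
final class FoldingDownsize {
    private FoldingDownsize() {}

    /**
     * Picks the smallest power-of-two size (at least 64) whose saturation would
     * stay at or below targetSaturation. The set-bit count of the source is used
     * as an upper bound, since folding can only merge bits, never create them.
     */
    static int smallestSizeUnder(BitSet source, int sourceBits, double targetSaturation) {
        int cardinality = source.cardinality();
        int size = sourceBits;
        while (size > 64 && cardinality / (double) (size / 2) <= targetSaturation) {
            size /= 2;
        }
        return size;
    }

    /**
     * Folds the source onto targetBits bits. A term stored at (hash & sourceMask)
     * lands on (hash & targetMask), so lookups against the folded set never
     * produce a false negative.
     */
    static BitSet fold(BitSet source, int targetBits) {
        if (targetBits <= 0 || Integer.bitCount(targetBits) != 1) {
            throw new IllegalArgumentException("targetBits must be a power of two");
        }
        BitSet folded = new BitSet(targetBits);
        int mask = targetBits - 1;
        for (int i = source.nextSetBit(0); i >= 0; i = source.nextSetBit(i + 1)) {
            folded.set(i & mask);
        }
        return folded;
    }
}
```

The writer then has to serialize the folded set rather than the original in-memory one, otherwise every segment ends up with the same on-disk size.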


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-20 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397445#comment-13397445
 ] 

Michael McCandless commented on LUCENE-4069:


OK I think I found the problem w/ the patch: in 
BloomFilteringPostingsFormat.saveAppropriatelySizedBloomFilter, it should be 
{{rightSizedSet.serialize(bloomOutput);}} not 
{{bloomFilter.serialize(bloomOutput);}} (it was saving the un-downsized 
version).  Now I see my .blm files varying in size according to how large the 
segment is...


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-20 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397457#comment-13397457
 ] 

Michael McCandless commented on LUCENE-4069:


Results after fixing the bug where the bit set was not downsized:

{noformat}
            Task   QPS base  StdDev base   QPS bloom  StdDev bloom       Pct diff
          Fuzzy1      97.02         1.43       88.67          3.08   -13% -   -4%
          Fuzzy2      42.50         1.15       40.32          1.75   -11% -    1%
     AndHighHigh      17.99         0.29       17.74          0.47    -5% -    2%
         Respell      82.76         2.96       81.66          3.54    -8% -    6%
      AndHighMed      55.77         1.02       55.06          1.51    -5% -    3%
     TermGroup1M      42.35         0.79       41.98          0.85    -4% -    3%
    TermBGroup1M      49.55         0.58       49.14          0.60    -3% -    1%
        SpanNear      12.53         0.15       12.49          0.09    -2% -    1%
  TermBGroup1M1P      64.36         0.83       64.34          0.83    -2% -    2%
         Prefix3      34.18         1.25       34.45          1.38    -6% -    8%
        Wildcard      34.42         1.12       34.75          1.35    -6% -    8%
    SloppyPhrase      11.48         0.17       11.61          0.22    -2% -    4%
          Phrase       8.47         0.30        8.57          0.25    -5% -    7%
            Term      94.06         4.50       95.45          2.34    -5% -    9%
          IntNRQ       9.84         0.63       10.07          0.71   -10% -   16%
       OrHighMed      34.57         2.07       35.78          2.24    -8% -   16%
      OrHighHigh      11.66         0.64       12.11          0.73    -7% -   16%
        PKLookup     129.75         2.37      153.80          2.31    14% -   22%
{noformat}

Basically same as before ... which is good (it means target 10% saturation 
doesn't lose any perf).


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-19 Thread Mark Harwood (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13396934#comment-13396934
 ] 

Mark Harwood commented on LUCENE-4069:
--

Mike, currently having various issues getting this benchmark framework up and 
running on my Windows platform here - is it easy for you to kick off another 
run with the latest patch on your setup? The latest change to the patch 
shouldn't require an index rebuild from your last run.

No worries if this is too much hassle for you - I'll probably just try switching 
to testing on OSX at home.

Cheers,
Mark


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-19 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397010#comment-13397010
]

Michael McCandless commented on LUCENE-4069:

bq. Mike, currently having various issues getting this benchmark framework up
and running on my Windows platform here

Alas it's not easy ... please report back on how to make it easier to set up!

bq. is it easy for you to kick off another run with the latest patch on your
setup?

No problem: I'll run perf test again. It's easy...


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-19 Thread Mark Harwood (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397054#comment-13397054
]

Mark Harwood commented on LUCENE-4069:
--

bq. No problem: I'll run perf test again. It's easy...

Great, thanks.

bq. Alas it's not easy ... please report back on how to make it easier to set
up!

My Windows-based woes were:
1) Had to install python (used 2.7)
2) Figure out python proxy settings for Wikipedia download
3) PySVN missing - downloaded install exe but it claimed Python 2.7 wasn't
installed/available so gave up and did svn checkout manually
4) Ran first python test and it aborted with complaint about GnuPlot missing

I imagine most of what is needed here comes out of the box on typical OSX/Linux
setup.


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-19 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397082#comment-13397082
 ] 

Michael McCandless commented on LUCENE-4069:


Results from last patch:
{noformat}
            Task   QPS base  StdDev base   QPS bloom  StdDev bloom       Pct diff
          IntNRQ      11.35         1.27       10.14          0.62   -24% -    6%
          Fuzzy1     108.52         3.34      101.82          2.90   -11% -    0%
         Prefix3      64.87         2.17       61.55          1.61   -10% -    0%
        Wildcard      43.18         1.74       41.33          1.17   -10% -    2%
          Fuzzy2      41.76         1.40       40.05          1.00    -9% -    1%
            Term     151.71         4.38      147.24          4.42    -8% -    2%
        SpanNear       5.23         0.09        5.11          0.12    -6% -    1%
       OrHighMed      12.60         0.88       12.34          0.48   -11% -    9%
    SloppyPhrase       8.25         0.20        8.09          0.07    -5% -    1%
    TermBGroup1M      69.98         0.68       68.80          1.13    -4% -    0%
      OrHighHigh      10.06         0.66        9.93          0.39   -11% -    9%
          Phrase      12.73         0.30       12.57          0.35    -6% -    3%
     TermGroup1M      35.44         0.42       35.08          0.67    -4% -    2%
      AndHighMed      63.40         2.27       62.90          1.11    -5% -    4%
         Respell      93.11         3.70       92.81          2.33    -6% -    6%
  TermBGroup1M1P      50.93         1.53       50.96          1.75    -6% -    6%
     AndHighHigh      15.86         0.71       15.93          0.27    -5% -    6%
        PKLookup     127.44         2.15      134.85          8.68    -2% -   14%
{noformat}

Looks like FuzzyN/Respell is good again ... PKLookup is a bit faster ... the 
rest is likely noise.


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-19 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397174#comment-13397174
]

Michael McCandless commented on LUCENE-4069:

{quote}
My Windows-based woes were:
1) Had to install python (used 2.7)
{quote}

OK.

bq. 2) Figure out python proxy settings for Wikipedia download

Hmmm ... you can also manually download separately.

bq. 3) PySVN missing - downloaded install exe but it claimed Python 2.7 wasn't
installed/available so gave up and did svn checkout manually

Oh I see: setup.py uses PySVN; hmm. Just checking out manually is better...

bq. 4) Ran first python test and it aborted with complaint about GnuPlot missing

Hmm we should default printIndexCharts to False (I'll fix).

bq. I imagine most of what is needed here comes out of the box on typical
OSX/Linux setup.

Hmm I don't think pysvn does, nor gnuplot ...

Segment-level Bloom filters for a 2 x speed up on rare term searches

Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
Fix For: 4.0, 3.6.1

Attachments: BloomFilterPostingsBranch4x.patch,
MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-18 Thread Mark Harwood (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13395773#comment-13395773
]

Mark Harwood commented on LUCENE-4069:
--

Interesting results, Mike - thanks for taking the time to run them.

bq. BloomFilteredFieldsProducer should just pass through intersect to the
delegate?

I have tried to make the BloomFilteredFieldsProducer get out of the way of the
client app and the delegate PostingsFormat as soon as it is safe to do so i.e.
when the user is safely focused on a non-filtered field. While there is a
chance the client may end up making a call to TermsEnum.seekExact(..) on a
filtered field then I need to have a wrapper object in place which is in a
position to intercept this call. In all other method invocations I just end up
delegating calls so I wonder if all these extra method calls are the cause of
the slowdown you see e.g. when Fuzzy is enumerating over many terms.
The only other alternatives to endlessly wrapping in this way are:
a) API change - e.g. allow TermsEnum.seekExact to have a pluggable call-out for
just this one method.
b) Mess around with byte-code manipulation techniques to weave in Bloom
filtering(the sort of thing I recall Hibernate resorts to)

Neither of these seem particularly appealing options so I think we may have to
live with fuzzy+bloom not being as fast as straight fuzzy.

For completeness sake - I don't have access to your benchmarking code but I
would hope that PostingsFormat.fieldsProducer() isn't called more than once for
the same segment as that's where the Bloom filters get loaded from disk so
there's inherent cost there too. I can't imagine this is the case.

BTW I've just finished a long-running set of tests which mixes up reads and
writes here: http://goo.gl/KJmGv
This benchmark represents how graph databases such as Neo4j use Lucene as an
index when loading (I typically use the Wikipedia links as a test set). Bloom
filtering gives roughly a 3.5 x speed up on Lucene 4, and nearly a 9 x speed up
on the comparatively slower 3.6 codebase.

Segment-level Bloom filters for a 2 x speed up on rare term searches

Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
Fix For: 4.0, 3.6.1

Attachments: BloomFilterPostingsBranch4x.patch,
MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-18 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13395830#comment-13395830
]

Michael McCandless commented on LUCENE-4069:

bq. Interesting results, Mike - thanks for taking the time to run them.

You're welcome!

{quote}
bq. BloomFilteredFieldsProducer should just pass through intersect to the
delegate?

I have tried to make the BloomFilteredFieldsProducer get out of the way of the
client app and the delegate PostingsFormat as soon as it is safe to do so i.e.
when the user is safely focused on a non-filtered field. While there is a
chance the client may end up making a call to TermsEnum.seekExact(..) on a
filtered field then I need to have a wrapper object in place which is in a
position to intercept this call. In all other method invocations I just end up
delegating calls so I wonder if all these extra method calls are the cause of
the slowdown you see e.g. when Fuzzy is enumerating over many terms.
The only other alternatives to endlessly wrapping in this way are:
a) API change - e.g. allow TermsEnum.seekExact to have a pluggable call-out
for just this one method.
b) Mess around with byte-code manipulation techniques to weave in Bloom
filtering(the sort of thing I recall Hibernate resorts to)

Neither of these seem particularly appealing options so I think we may have to
live with fuzzy+bloom not being as fast as straight fuzzy.
{quote}

I think the fix is simple: you are not overriding Terms.intersect now,
in BloomFilteredTerms. I think you should override it and immediately
delegate and then FuzzyN/Respell performance should be just as good as
Lucene40 codec.
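
A minimal sketch of the fix suggested above, using simplified stand-in types rather than the real Lucene classes. SimpleTerms, SortedSetTerms and FilteredTerms are hypothetical names for illustration, and a plain set of hash codes stands in for the Bloom filter. Only the idea comes from this thread: intercept the exact lookup, and hand intersect straight to the delegate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a per-field terms API (not the real Lucene Terms class).
interface SimpleTerms {
  boolean seekExact(String term);         // exact lookup: the call a Bloom filter can short-circuit
  List<String> intersect(String prefix);  // bulk enumeration: gains nothing from the filter
}

// In-memory terms dictionary playing the role of the delegate postings format.
class SortedSetTerms implements SimpleTerms {
  private final TreeSet<String> terms = new TreeSet<String>();
  int seekCalls = 0;  // counts how often the delegate is actually consulted

  SortedSetTerms(String... values) {
    for (String v : values) {
      terms.add(v);
    }
  }

  public boolean seekExact(String term) {
    seekCalls++;
    return terms.contains(term);
  }

  public List<String> intersect(String prefix) {
    List<String> out = new ArrayList<String>();
    // Terms sharing a prefix are contiguous in sorted order.
    for (String t : terms.tailSet(prefix)) {
      if (!t.startsWith(prefix)) {
        break;
      }
      out.add(t);
    }
    return out;
  }
}

// Wrapper that filters exact lookups but passes intersect() through untouched.
class FilteredTerms implements SimpleTerms {
  private final SimpleTerms delegate;
  private final Set<Integer> hashes;  // crude stand-in for a Bloom filter

  FilteredTerms(SimpleTerms delegate, Set<Integer> hashes) {
    this.delegate = delegate;
    this.hashes = hashes;
  }

  public boolean seekExact(String term) {
    if (!hashes.contains(term.hashCode())) {
      return false;  // fast-fail: the delegate is never touched
    }
    return delegate.seekExact(term);
  }

  public List<String> intersect(String prefix) {
    return delegate.intersect(prefix);  // immediate pass-through, no per-term wrapping
  }
}
```

With the pass-through in place, enumeration-heavy callers pay one extra method call per intersect rather than one per term visited.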

bq. For completeness sake - I don't have access to your benchmarking code

All the benchmarking code is here:
http://code.google.com/a/apache-extras.org/p/luceneutil/

I run it nightly (trunk) and publish the results here:
http://people.apache.org/~mikemccand/lucenebench/

bq. but I would hope that PostingsFormat.fieldsProducer() isn't called more
than once for the same segment as that's where the Bloom filters get loaded
from disk so there's inherent cost there too. I can't imagine this is the case.

It's only called once on init'ing the SegmentReader (or at least it
better be!).

{quote}
BTW I've just finished a long-running set of tests which mixes up reads and
writes here: http://goo.gl/KJmGv
This benchmark represents how graph databases such as Neo4j use Lucene for an
index when loading (I typically use the Wikipedia links as a test set). I look
to get a 3.5 x speed up in Lucene 4 and Lucene 3.6 gets nearly 9 x speedup over
the comparatively slower 3.6 codebase.
{quote}

Nice results! It looks like bloom(3.6) is faster than bloom(4.0)?
Why is that...

Also I wonder why you see such sizable (3.5X speedup) gains on PK
lookup but in my benchmark I see only ~13% - 24%. My index has 5
segments per level...

Segment-level Bloom filters for a 2 x speed up on rare term searches

Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
Fix For: 4.0, 3.6.1

Attachments: BloomFilterPostingsBranch4x.patch,
MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-18 Thread Mark Harwood (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13395857#comment-13395857
]

Mark Harwood commented on LUCENE-4069:
--

bq. I think the fix is simple: you are not overriding Terms.intersect now, in
BloomFilteredTerms

Good catch - a quick test indeed shows a speed up on fuzzy queries.
I'll prepare a new patch.

I'm not sure on why 3.6+Bloom is faster than 4+Bloom in my tests. I'll take a
closer look at your benchmark.

Segment-level Bloom filters for a 2 x speed up on rare term searches

Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
Fix For: 4.0, 3.6.1

Attachments: BloomFilterPostingsBranch4x.patch,
MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-18 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13395980#comment-13395980
 ] 

Michael McCandless commented on LUCENE-4069:


bq. Didn't get too far with running the Wikipedia perf tests due to missing 
data file (see 
http://code.google.com/a/apache-extras.org/p/luceneutil/issues/detail?id=7 )

Woops sorry: I posted a comment there (new, more recent wikipedia export).

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostingsBranch4x.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java


 An addition to each segment which stores a Bloom filter for selected fields 
 in order to give fast-fail to term searches, helping avoid wasted disk access.
 Best suited for low-frequency fields e.g. primary keys on big indexes with 
 many segments but also speeds up general searching in my tests.
 Overview slideshow here: 
 http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
 Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
 Patch based on 3.6 codebase attached.
 There are no 3.6 API changes currently - to play just add a field with _blm 
 on the end of the name to invoke special indexing/querying capability. 
 Clearly a new Field or schema declaration(!) would need adding to APIs to 
 configure the service properly.
 Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat
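
The fast-fail idea in the description can be sketched as follows. This is an illustration only: TinyBloomFilter and FilteredSegment are hypothetical names, and the sizing and hash mixing are assumptions, not what the attached patches do.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Illustrative Bloom filter; sizing and hashing are assumptions for this sketch.
class TinyBloomFilter {
  private final BitSet bits;
  private final int size;
  private final int numHashes;

  TinyBloomFilter(int size, int numHashes) {
    this.bits = new BitSet(size);
    this.size = size;
    this.numHashes = numHashes;
  }

  // Derive the i-th probe position from one base hash (double hashing).
  // The odd step keeps successive probes distinct for power-of-two sizes.
  private int position(String term, int i) {
    int h1 = term.hashCode();
    int h2 = (h1 >>> 16) | 1;
    return Math.floorMod(h1 + i * h2, size);
  }

  void add(String term) {
    for (int i = 0; i < numHashes; i++) {
      bits.set(position(term, i));
    }
  }

  // false: the term is definitely absent; true: the term may be present.
  boolean mightContain(String term) {
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(position(term, i))) {
        return false;
      }
    }
    return true;
  }
}

// Hypothetical segment that consults the filter before the (simulated) costly lookup.
class FilteredSegment {
  private final Set<String> termsOnDisk = new HashSet<String>();
  private final TinyBloomFilter filter = new TinyBloomFilter(1 << 16, 3);
  int diskLookups = 0;

  void index(String term) {
    termsOnDisk.add(term);
    filter.add(term);
  }

  boolean contains(String term) {
    if (!filter.mightContain(term)) {
      return false;  // fast-fail: no disk access for this segment
    }
    diskLookups++;
    return termsOnDisk.contains(term);
  }
}
```

A filter never reports a stored term as absent, so a negative answer lets a primary-key lookup skip the segment entirely; a positive answer still has to be confirmed against the real terms dictionary.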




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-15 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13295748#comment-13295748
 ] 

Michael McCandless commented on LUCENE-4069:


I ran a benchmark on 10 M Wikipedia index; for the factory I used 
createSetBasedOnMemory and passed it 100 MB; I think that's enough to ensure we 
get the 10% saturation on save ...:

{noformat}
Task            QPS base  StdDev base  QPS bloom  StdDev bloom     Pct diff
Fuzzy1            102.47         3.67      41.95          0.78  -61% - -56%
Fuzzy2             38.36         1.76      18.68          0.37  -54% - -47%
Respell            89.89         4.38      44.09          0.52  -53% - -47%
Wildcard           40.48         2.82      36.20          0.64  -17% -  -2%
SloppyPhrase        7.96         0.28       8.07          0.07   -3% -   5%
Prefix3            61.94         5.34      63.35          0.37   -6% -  12%
TermBGroup1M       71.37         6.79      73.73          1.55   -7% -  16%
AndHighMed         64.09         5.51      66.73          1.75   -6% -  16%
TermBGroup1M1P     49.55         3.78      51.75          2.67   -7% -  18%
AndHighHigh        16.05         1.12      16.77          0.53   -5% -  15%
TermGroup1M        35.87         3.07      37.56          0.74   -5% -  16%
OrHighHigh          9.60         1.38      10.15          0.65  -13% -  31%
OrHighMed          11.93         1.91      12.63          0.93  -15% -  35%
IntNRQ              9.12         1.25       9.68          0.11   -7% -  24%
Term              154.55        19.60     165.32          0.97   -5% -  23%
Phrase             11.40         0.33      12.21          0.18    2% -  11%
SpanNear            4.31         0.07       4.73          0.03    7% -  12%
PKLookup          122.78         1.42     145.95          5.22   13% -  24%
{noformat}

Baseline is Lucene40 PostingsFormat even for the id field ... so PKLookup gets 
a good improvement.  This is on an index w/ 5 segments at each level.
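
The "Pct diff" range in the table appears consistent with comparing each QPS figure plus or minus one standard deviation and truncating to a whole percent. That reading is inferred from the numbers above, not stated in this thread, and the PctDiff helper below is a hypothetical name used only to show the arithmetic.

```java
// Illustrative only: reproduces the "Pct diff" range under the assumption that it
// compares (QPS +/- StdDev) of the bloom run against (QPS -/+ StdDev) of the baseline.
class PctDiff {
  // Worst case: slowest plausible bloom run against the fastest plausible baseline.
  static int worst(double baseQps, double baseStd, double bloomQps, double bloomStd) {
    return (int) (100.0 * ((bloomQps - bloomStd) / (baseQps + baseStd) - 1.0));
  }

  // Best case: fastest plausible bloom run against the slowest plausible baseline.
  static int best(double baseQps, double baseStd, double bloomQps, double bloomStd) {
    return (int) (100.0 * ((bloomQps + bloomStd) / (baseQps - baseStd) - 1.0));
  }
}
```

For the PKLookup row (122.78, 1.42, 145.95, 5.22) this gives 13 and 24, and for Fuzzy1 it gives -61 and -56, matching the table. A range that straddles zero, as for Term or Prefix3, is therefore within the noise of the run.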

Other queries seem to speed up as well (eg Term, Or*).

The queries that rely on Terms.intersect got much worse: perhaps the 
BloomFilteredFieldsProducer should just pass through intersect to the delegate?

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostingsBranch4x.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java





-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-01 Thread Mark Harwood (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13287258#comment-13287258
]

Mark Harwood commented on LUCENE-4069:
--

I've thought some more about option 2 (PerFieldPF reusing wrapped PFs) and it
looks to get very ugly very quickly.
There's only so much PerFieldPF can do to rationalize a random jumble of PF
instances presented to it by clients. I think the right place to draw the line
is LUCENE-4093 i.e. a simple .equals() comparison on top-level PFs to eliminate
any duplicates. Any other approach that also tries to de-dup nested PFs looks
to be adding a lot of complexity, especially when you consider what that does
to the model of read-time object instantiation. This would be significant added
complexity to solve a problem you have already suggested is insignificant (i.e.
too many files doesn't really matter when using CFS).

I can remove the per-field stuff from BloomPF if you want but I imagine I will
routinely subclass it to add this optimisation back in to my apps.

Segment-level Bloom filters for a 2 x speed up on rare term searches

Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
Fix For: 4.0, 3.6.1

Attachments: BloomFilterPostings40.patch,
MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286555#comment-13286555
 ] 

Robert Muir commented on LUCENE-4069:
-

I don't think the abstract class should be registered in the SPI.

Instead i think the concrete Bloom+Lucene40 that you have in tests should be 
moved into src/java and registered there, just call it Bloom40 or something. 
The abstract api is still available for someone that wants to do something more 
specialized.

This is just like how pulsing (another wrapper) is implemented.

As far as disabling this for certain tests, import 
o.a.l.util.LuceneTestCase.SuppressCodecs and put something like this at class 
level:

{code}
@SuppressCodecs("Bloom40")
public class TestFoo...

@SuppressCodecs({"Bloom40", "Memory"})
public class TestBar...
{code}

The strings in here can be codecs or postings formats

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostings40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip





-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286562#comment-13286562
 ] 

Robert Muir commented on LUCENE-4069:
-

Seeing the tests in question though, I don't think you want to disable this for 
these entire test classes.

We don't have a way to disable this on a per-method basis, and I think it's 
generally not possible because
many classes create indexes in @BeforeClass etc.

An alternative would be to just pick this less often in RandomCodec: see the 
SimpleText hack :)

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostings40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip





-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286598#comment-13286598
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. Instead i think the concrete Bloom+Lucene40 that you have in tests should 
be moved into src/java and registered there

What problem would that be trying to solve? Registration (or creation) of any 
BloomFilteringPostingsFormat subclasses is not necessary to decode index 
contents. Offering a Bloom40 would only buy users a pairing of 
Lucene40Postings and Bloom filtering but they would still have to declare which 
fields they want Bloom filtering on at write time. This isn't too hard using 
the code in the existing patch:

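{code:title=ThisWorks.java}
import java.util.HashSet;
import java.util.Set;

// Fields that should get a Bloom filter; all other fields are written unfiltered.
final Set<String> bloomFilteredFields = new HashSet<String>();
bloomFilteredFields.add(PRIMARY_KEY_FIELD_NAME);

iwc.setCodec(new Lucene40Codec() {
  // One shared wrapper around a single Lucene40 delegate, returned for every field.
  BloomFilteringPostingsFormat postingOptions =
      new BloomFilteringPostingsFormat(new Lucene40PostingsFormat(), bloomFilteredFields);

  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    return postingOptions;
  }
});
{code}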
No extra subclasses/registration required here to read the index built with the 
above setup.


 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostings40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip





-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

[
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286599#comment-13286599
]

Robert Muir commented on LUCENE-4069:
-

I dont understand why this handles fields. Someone should just pick that with
perfieldpostingsformat.

So you have the abstract wrapper(takes the wrapped postings format, and a
String name), not registered.
And you have a concrete impl registered that is just abstractWrapper(lucene40,
Bloom40): done.

Segment-level Bloom filters for a 2 x speed up on rare term searches

Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
Fix For: 4.0, 3.6.1

Attachments: BloomFilterPostings40.patch,
MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286600#comment-13286600
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. An alternative would be to just pick this less often in RandomCodec: see 
the SimpleText hack 

Another option might be to make the TestBloomFilteredLucene40Postings pick a 
ludicrously small Bitset sizing option for each field so that we can 
accommodate tests that create silly numbers of fields. The bitsets being so 
small will just quickly reach saturation and force all reads to hit the 
underlying FieldsProducer.
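
A rough sketch of that saturation behaviour, under stated assumptions: SaturatingFilter is a hypothetical class, it uses a single hash per term, and the cut-off value is a parameter chosen for illustration, not a figure taken from the patch.

```java
import java.util.BitSet;

// Illustrative filter that stops filtering once too many of its bits are set.
class SaturatingFilter {
  private final BitSet bits;
  private final int size;
  private final double maxSaturation;  // assumed cut-off, chosen by the caller

  SaturatingFilter(int size, double maxSaturation) {
    this.bits = new BitSet(size);
    this.size = size;
    this.maxSaturation = maxSaturation;
  }

  void add(String term) {
    bits.set(Math.floorMod(term.hashCode(), size));
  }

  // Fraction of bits currently set, between 0.0 and 1.0.
  double saturation() {
    return (double) bits.cardinality() / size;
  }

  // A saturated filter cannot usefully rule anything out, so every lookup
  // is sent on to the underlying delegate.
  boolean shouldConsultDelegate(String term) {
    if (saturation() > maxSaturation) {
      return true;
    }
    return bits.get(Math.floorMod(term.hashCode(), size));
  }
}
```

With a tiny bitset the filter costs almost no memory per field and, once saturated, behaves as if it were not there, which is what a test creating very many fields would want.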

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostings40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip





-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286616#comment-13286616
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. I dont understand why this handles fields. Someone should just pick that 
with perfieldpostingsformat.

That would be inefficient because your PFPF will see 
BloomFilteringPostingsFormat(field1 + Lucene40) and 
BloomFilteringPostingsFormat(field2 + Lucene40) as fundamentally different 
PostingsFormat instances and consequently create multiple files named 
differently because it assumes these instances may be capable of using 
radically different file structures.
In reality, the choice of BloomFilter with field 1 or BloomFilter with field 2 
or indeed no BloomFilter does not fundamentally alter the underlying delegate 
PostingFormat's file format - it only adds a supplementary blm file on the 
side with the field summaries. For this reason it is a mistake to configure 
separate BloomFilteringPostingsFormat instances on a per-field basis if they can 
share a common delegate.
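
The arrangement argued for here, one wrapper and one shared delegate with Bloom summaries kept only for chosen fields, might look like the sketch below. TermConsumer, RecordingConsumer and BloomingConsumer are hypothetical names, and a set of hash codes stands in for the Bloom filter that would go into the side file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical consumer of (field, term) pairs; stands in for a postings writer.
interface TermConsumer {
  void accept(String field, String term);
}

// Plays the role of the single shared delegate that writes every field.
class RecordingConsumer implements TermConsumer {
  final List<String> written = new ArrayList<String>();

  public void accept(String field, String term) {
    written.add(field + ":" + term);
  }
}

// One wrapper for all fields: everything is forwarded to the delegate,
// and only the selected fields are also summarized on the side.
class BloomingConsumer implements TermConsumer {
  private final TermConsumer delegate;
  private final Set<String> bloomFields;
  final Map<String, Set<Integer>> summaries = new HashMap<String, Set<Integer>>();

  BloomingConsumer(TermConsumer delegate, Set<String> bloomFields) {
    this.delegate = delegate;
    this.bloomFields = bloomFields;
  }

  public void accept(String field, String term) {
    if (bloomFields.contains(field)) {
      Set<Integer> summary = summaries.get(field);
      if (summary == null) {
        summary = new HashSet<Integer>();
        summaries.put(field, summary);
      }
      summary.add(term.hashCode());  // stand-in for setting Bloom filter bits
    }
    delegate.accept(field, term);    // the delegate's own output is unchanged
  }
}
```

Because the delegate sees exactly the same stream whether or not a field is filtered, its file format is unaffected; the filter data is purely supplementary.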



 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostings40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip





-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286619#comment-13286619
 ] 

Robert Muir commented on LUCENE-4069:
-

{quote}
That would be inefficient because your PFPF will see 
BloomFilteringPostingsFormat(field1 + Lucene40) and 
BloomFilteringPostingsFormat(field2 + Lucene40) as fundamentally different 
PostingsFormat instances and consequently create multiple files named 
differently because it assumes these instances may be capable of using 
radically different file structures.
{quote}

But adding per-field handling here is not the way to solve this: it's messy. 

Per-Field handling should all be handled at a level above in 
PerFieldPostingsFormat.

To solve what you speak of we just need to resolve LUCENE-4093. Then multiple 
postings format instances that are 'the same' will be deduplicated correctly.
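
A minimal sketch of equality-based de-duplication, assuming formats can be compared by value. FormatConfig and SuffixAssigner are hypothetical names; how LUCENE-4093 actually implements the comparison is not shown in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical value-like description of a postings format configuration.
final class FormatConfig {
  final String name;
  final String delegateName;

  FormatConfig(String name, String delegateName) {
    this.name = name;
    this.delegateName = delegateName;
  }

  @Override
  public boolean equals(Object other) {
    if (!(other instanceof FormatConfig)) {
      return false;
    }
    FormatConfig that = (FormatConfig) other;
    return name.equals(that.name) && delegateName.equals(that.delegateName);
  }

  @Override
  public int hashCode() {
    return 31 * name.hashCode() + delegateName.hashCode();
  }
}

// Hands out one file suffix per distinct format, however many equal instances arrive.
class SuffixAssigner {
  private final Map<FormatConfig, Integer> suffixes = new HashMap<FormatConfig, Integer>();

  int suffixFor(FormatConfig format) {
    Integer suffix = suffixes.get(format);
    if (suffix == null) {
      suffix = suffixes.size();
      suffixes.put(format, suffix);
    }
    return suffix;
  }

  int distinctFormats() {
    return suffixes.size();
  }
}
```

Two instances that compare equal share a suffix and therefore a set of files; anything the equals() method does not look at, such as a nested delegate's settings, is invisible to this kind of de-duplication.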


 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostings40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip


 An addition to each segment which stores a Bloom filter for selected fields 
 in order to give fast-fail to term searches, helping avoid wasted disk access.
 Best suited for low-frequency fields e.g. primary keys on big indexes with 
 many segments but also speeds up general searching in my tests.
 Overview slideshow here: 
 http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
 Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
 Patch based on 3.6 codebase attached.
 There are no 3.6 API changes currently - to play just add a field with _blm 
 on the end of the name to invoke special indexing/querying capability. 
 Clearly a new Field or schema declaration(!) would need adding to APIs to 
 configure the service properly.
 Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat
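
Editorial illustration of the fast-fail idea described above. This is a minimal sketch, not the code from the attached patches: the class name SegmentBloomSketch, the two hash slots and the use of String.hashCode() are all assumptions made for the example. A negative answer means the term is definitely absent, so the disk lookup can be skipped; a positive answer only means the term may be present.

```java
import java.util.BitSet;

/** Hypothetical per-segment filter; not the class from the attached patches. */
final class SegmentBloomSketch {
    private final BitSet bits;
    private final int size;

    SegmentBloomSketch(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two cheap, illustrative slot functions derived from String.hashCode().
    private int slotA(String term) {
        return Math.floorMod(term.hashCode(), size);
    }

    private int slotB(String term) {
        return Math.floorMod(term.hashCode() * 31 + 17, size);
    }

    /** Records a term at index time. */
    void add(String term) {
        bits.set(slotA(term));
        bits.set(slotB(term));
    }

    /** false means "definitely absent"; true means "possibly present". */
    boolean mightContain(String term) {
        return bits.get(slotA(term)) && bits.get(slotB(term));
    }
}
```

A searcher would consult mightContain before touching the segment's term dictionary and skip the segment on a false result.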

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

[
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286707#comment-13286707
]

Mark Harwood commented on LUCENE-4069:
--

bq. To solve what you speak of we just need to resolve LUCENE-4093.

Presumably the main objective here is that in order to cut down on the number
of files we store, content consumers of various types should aim to consolidate
multiple fields' contents into a single file (if they share common config
choices).

bq. Then multiple postings format instances that are 'the same' will be
deduplicated correctly.

The complication in this case is that we essentially have 2 consumers (Bloom
and Lucene40), one wrapped in the other with different but overlapping choices
of fields e.g. we want a single Lucene40 to process all fields but we want Bloom
to handle only a subset of these fields. This will be a tough one for PFPF to
untangle while we are stuck with a delegating model for composing consumers.

This may be made easier if instead of delegating a single stream we have a
*stream-splitting* capability via a multicast subscription e.g. Bloom filtering
consumer registers interest in content streams for fields A and B while
Lucene40 is consolidating content from fields A, B, C and D. A broadcast
mechanism feeds each consumer a copy of the relevant stream and each consumer
is responsible for inventing their own file-naming convention that avoids
muddling files.

While that may help for writing streams it doesn't solve the re-assembly of
producer streams at read-time where BloomFilter absolutely has to position
itself in front of the standard Lucene40 producer in order to offer fast-fail
lookups.

In the absence of a fancy optimised routing mechanism (this all may be
overkill) my current solution was to put BloomFilter in the delegate chain
armed with a subset of fieldnames to observe as a larger array of fields flow
past to a common delegate. I added some Javadocs to describe the need to do it
this way for an efficient configuration.
You are right that this is messy (i.e. open to bad configuration) but operating
this deep down in Lucene that's always a possibility regardless of what we put
in place.


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286712#comment-13286712
 ] 

Robert Muir commented on LUCENE-4069:
-

{quote}
but overlapping choices of fields e.g we want a single Lucene40 to process all 
fields but we want Bloom to handle only a subset of these fields.
{quote}

That's not true: I disagree. It's an implementation detail that Bloom as a 
PostingsFormat wraps another one (that's just the abstract implementation), and 
the file formats should not expose this in general for any format.

This is true for a number of reasons: e.g. in the pulsing case the wrapped 
writer only gets a subset of the postings: therefore the wrapped writer's files 
are incomplete and an implementation detail.

It's enough here that if you have 5 fields (2 bloom and 3 not), we detect 
there are only two postings formats in use, regardless of whether you have 2 or 
5 actual object instances.
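
Editorial illustration of the deduplication described above. This is a minimal sketch under stated assumptions: FormatSketch and FormatDedupSketch are hypothetical stand-ins, not Lucene classes, and the equality rule (format name plus wrapped delegate name) is only one plausible definition of 'the same'. It shows five fields with five object instances collapsing to two distinct formats.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical stand-in for a postings format; not a Lucene class. */
final class FormatSketch {
    final String name;
    final String delegateName; // null when nothing is wrapped

    FormatSketch(String name, String delegateName) {
        this.name = name;
        this.delegateName = delegateName;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FormatSketch)) {
            return false;
        }
        FormatSketch other = (FormatSketch) o;
        return name.equals(other.name) && Objects.equals(delegateName, other.delegateName);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, delegateName);
    }
}

final class FormatDedupSketch {
    /** Assigns one suffix per distinct (by equals) format, in first-seen order. */
    static Map<FormatSketch, Integer> assignSuffixes(Map<String, FormatSketch> formatByField) {
        Map<FormatSketch, Integer> suffixes = new LinkedHashMap<>();
        for (FormatSketch format : formatByField.values()) {
            suffixes.putIfAbsent(format, suffixes.size());
        }
        return suffixes;
    }
}
```

With equals and hashCode in place, the number of suffixes (and hence file sets) depends on how many distinct configurations exist, not on how many instances the caller happened to create.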



[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286717#comment-13286717
 ] 

Robert Muir commented on LUCENE-4069:
-

And separately, you can always contain the number of files even today by:
* using only unique instances yourself when writing (rather than waiting on 
LUCENE-4093)
* using the compound file format.

The purpose of LUCENE-4093 is just to make this simpler, but I opened it as a 
separate issue because it's really solely an optimization, and only for a pretty 
rare case where people are customizing the index format for different fields.


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286754#comment-13286754
 ] 

Mark Harwood commented on LUCENE-4069:
--

It's true to say that Bloom is a different case to Pulsing - Bloom does not 
interfere in any way with the normal recording of content in the wrapped 
delegate whereas Pulsing does.
It may prove useful for us to mark a formal distinction between these 
mutating/non-mutating types so we can treat them differently and provide 
optimisations?


bq. And separately, you can always contain the number of files even today by 
using only unique instances yourself when writing

Contained but not optimal - roughly double the number of required files if I 
want the common case of a primary key indexed with Bloom. I can't see a way of 
indexing with Bloom-plus-Lucene40 on field A and indexing with just Lucene40 
on fields B,C and D and winding up with only one Lucene40 set of files with a 
common segment suffix. The way I did find of achieving this was to add a 
bloomFilteredFields set into my single Bloom+Lucene40 instance used for all 
fields. Is there any other option here currently? 

Looking to the future, 4093 may have more capabilities at optimising if it 
understands the distinction between mutating wrappers and non-mutating ones and 
how they are composed?




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286756#comment-13286756
 ] 

Robert Muir commented on LUCENE-4069:
-

{quote}
Contained but not optimal - roughly double the number of required files if I 
want the common case of a primary key indexed with Bloom.
{quote}

Then use CFS, it's optimal always (1). 

I really don't think we should make this complex to save 2 or 3 files total 
(even in a complex config with many fields). It's not worth the complexity.


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

[
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286815#comment-13286815
]

Mark Harwood commented on LUCENE-4069:
--

bq. Its not worth the complexity

There's no real added complexity in BloomFilterPostingsFormat - it has to be
capable of storing blooms for 1 field anyway and using the fieldname set is
roughly 2 extra lines of code to see if a TermsConsumer needs wrapping or not.

From a client side you don't have to use this feature - the fieldname set can
be null in which case it will wrap all fields sent its way. If you do choose to
supply a set the wrapped PostingsFormat will have the advantage of being
shared for bloomed and non-bloomed fields. We could add a constructor that
removes the set and mark the others expert.
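
Editorial illustration of the per-field check described above. This is a minimal sketch, not the code from the patch: BloomFieldSelectorSketch and shouldWrap are hypothetical names, and only the null-means-all-fields rule is taken from the comment.

```java
import java.util.Set;

/** Hypothetical helper; the real decision lives inside the patch's PostingsFormat. */
final class BloomFieldSelectorSketch {
    private final Set<String> bloomFields; // null means "wrap every field"

    BloomFieldSelectorSketch(Set<String> bloomFields) {
        this.bloomFields = bloomFields;
    }

    /** True when the consumer for this field should get a Bloom wrapper. */
    boolean shouldWrap(String fieldName) {
        return bloomFields == null || bloomFields.contains(fieldName);
    }
}
```

Fields for which shouldWrap returns false would pass straight through to the shared delegate, which is how a single delegate can serve both bloomed and non-bloomed fields.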

For me this falls into one of the many faster-if-you-know-about-it
optimisations like FieldSelectors or recycling certain objects. Basically a
useful hint to Lucene to save some extra effort but one which you don't *need*
to use.

LUCENE-4093 may in future resolve the multi-file issue but I'm not sure it will
do so without significant complication.


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Simon Willnauer (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286868#comment-13286868
]

Simon Willnauer commented on LUCENE-4069:
-

bq. I really don't think we should make this complex to save 2 or 3 files total
(even in a complex config with many fields). It's not worth the complexity.

I agree. I think those postings formats should only deal with encoding and not
with handling certain fields differently. A user / app should handle this in the
codec. Ideally you don't have any conditions in the relevant methods like
termsConsumer etc.

bq. For me this falls into one of the many faster-if-you-know-about-it
optimisations like FieldSelectors or recycling certain objects. Basically a
useful hint to Lucene to save some extra effort but one which you don't need to
use.

Why is this a speed improvement? Reading from one file vs. multiple is not
really faster.

Anyway, I think we should make this patch as simple as possible and not
handle fields in the PF. We can still open another issue or wait until
LUCENE-4093 is in to discuss this issue?


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286882#comment-13286882
 ] 

Robert Muir commented on LUCENE-4069:
-

{quote}
For me this falls into one of the many faster-if-you-know-about-it 
optimisations like FieldSelectors or recycling certain objects. Basically a 
useful hint to Lucene to save some extra effort but one which you don't need to 
use.
{quote}

I agree with Simon, it's not going to be faster.

Worse, it creates a situation from the per-field perspective where multiple 
postings formats are sharing the same files for a segment.

This would make it harder to do things like refactorings of codec APIs in the 
future.

So where is the benefit?


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286916#comment-13286916
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. why is this a speed improvement?

Sorry - misleading. Replace the word "faster" in my comment with "better" and 
that makes more sense - I mean better in terms of resource usage and reduced 
open file handles. This seemed relevant given the earlier comments about Solr's 
use of non-compound files:

bq. [Solr] create massive amounts of files if we did so (add to the fact it 
disables compound files by default and its a disaster...)

I can see there is a useful simplification being sought here if PerFieldPF 
can consider each of the unique top-level PFs presented to it as looking after 
an exclusive set of files. As the centralised allocator of file names it can 
then simply call each unique PF with a choice of segment suffix to name its 
various files without conflicting with other PFs. LUCENE-4093 is all about 
better determining which PF is unique using .equals(). Unfortunately I don't 
think this approach goes far enough. In order to avoid allocating 
unnecessary file names PerFieldPF would have to further understand the nuances 
of which PFs were being wrapped by other PFs and which wrapped PFs would be 
reusable outside of their wrapped PF (as is the case with BloomPF's wrapped 
PF). That seems a more complex task than implementing equals(). 

So it seems we have 3 options:
1) Ignore the problems of creating too many files in the case of BloomPF and 
any other examples of wrapping PFs
2) Create a PerFieldPF implementation that reuses wrapped PFs using some 
generic means of discovering recyclable wrapped PFs (i.e. go further than what 
4093 currently proposes in adding .equals support)
3) Retain my BloomPF-specific solution to the problem for those prepared to use 
lower-level APIs.

Am I missing any other options and which one do you want to go for?




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Simon Willnauer (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286926#comment-13286926
 ] 

Simon Willnauer commented on LUCENE-4069:
-

bq. Create a PerFieldPF implementation that reuses wrapped PFs using some 
generic means of discovering recyclable wrapped PFs (i.e go further than what 
4093 currently proposes in adding .equals support)

I think we should investigate this further. Let's keep this issue simple and 
remove the field handling and fix this on a higher level!



[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285518#comment-13285518
 ] 

Mark Harwood commented on LUCENE-4069:
--

Thanks for the comment, Rob.
While the choice of Codec can be an anonymous inner class, resolving the choice 
of PostingsFormat is trickier.
BloomFilterPostingsFormat is now intended to wrap any other choice of 
PostingsFormat and Simon has suggested leaving the Bloom support purely 
abstract.
However, as an end user if I want to use Bloom support on the standard Lucene 
codec I would then have to write one of these:
{code:title=MyBloomFilteredLucene40Postings.java}
import org.apache.lucene.codecs.lucene40.Lucene40PostingsFormat;

// BloomFilteringPostingsFormatBase is the abstract base class from the patch.
public class MyBloomFilteredLucene40Postings extends BloomFilteringPostingsFormatBase {

  public MyBloomFilteredLucene40Postings() {
    // Declare my choice of PostingsFormat to be wrapped and provide a unique
    // name for this combo of Bloom-plus-delegate.
    super("myBL40", new Lucene40PostingsFormat());
  }
}
{code}
The resulting index files are then named [segname]_myBL40.[filetypesuffix].
At read-time the myBL40 bit of the filename is used to lookup via Service 
Provider registrations the decoding class so 
com.xx.MyBloomFilteredLucene40Postings would need adding to a 
o.a.l.codecs.PostingsFormat file for the registration to work.

I imagine Bloom-plus-Lucene40Postings would be a common combo and if both are 
in core it would be annoying to have to code support for this in each app or 
for things like Luke to have to have classpaths redefined to access the 
app-specific class that was created purely to bind this combo of core 
components.

I think a better option might be to change the Bloom filtering base class to 
record the choice of delegate PostingsFormat in its own blm file at 
write-time and instantiate the appropriate delegate instance at read-time using 
the recorded name. The BloomFilteringBaseClass would need changing to a final 
class rather than an abstract one so that core Lucene could load it as the 
handler for [segname]_BloomPosting.xxx files and it would then have to examine 
the [segname].blm file to discover and instantiate the choice of delegate 
PostingsFormat using the standard service registration mechanism. At write-time 
clients would need to instantiate the BloomFilterPostingsFormat, passing a 
choice of PostingsFormat delegate to the constructor. At read-time Lucene core 
would invoke a zero-arg constructor.
I'll look into this as an approach.
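
Editorial illustration of the record-the-delegate-name approach described above. This is a minimal sketch under stated assumptions: DelegateNameSketch is a hypothetical class, an in-memory byte array stands in for the blm file, and a plain Map of suppliers stands in for the service registration lookup. None of this is the patch's actual file layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical helper; the byte array stands in for the side file. */
final class DelegateNameSketch {
    /** Write-time: store the delegate's registered name in the side file. */
    static byte[] writeHeader(String delegateName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(delegateName);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Read-time: recover the name and resolve it through a registry. */
    static <T> T readDelegate(byte[] header, Map<String, Supplier<T>> registry) {
        String name;
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(header))) {
            name = in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Supplier<T> factory = registry.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("No registered format named " + name);
        }
        return factory.get();
    }
}
```

Because the delegate is resolved by its recorded name, the reader only needs a zero-arg entry point and no application-specific subclass on the classpath.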






 

 Attachments: BloomFilterCodec40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip



[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285706#comment-13285706
 ] 

Mark Harwood commented on LUCENE-4069:
--

Aaaargh. Unless I've missed something, I have concerns with the fundamental 
design of the current Codec loading mechanism.

It seems too tied to the concept of a ServiceProvider class-loading mechanism, 
forcing users to write new SPI-registered classes in order to simply declare 
what amount to index schema configuration choices.

Example: If I take Rob's sample Codec above and choose to use a subtly 
different configuration of the same PostingsFormat class for different fields 
it breaks:

{code:title=ThisBreaks.java}
Codec fooCodec = new Lucene40Codec() {
  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    if ("text".equals(field)) {
      return new FooPostingsFormat(1);
    }
    if ("title".equals(field)) {
      // Same impl as the text field, different constructor settings.
      return new FooPostingsFormat(2);
    }
    return super.getPostingsFormatForField(field);
  }
};
{code}
This causes a file overwrite error as PerFieldPostingsFormat uses the same name 
from FooPostingsFormat(1) and FooPostingsFormat(2) to create files.
In order to safely make use of differently configured choices of the same 
PostingsFormat we are forced to declare a brand new subclass with a unique new 
service name and entry in the service provider registration. This is 
essentially where I have got to in trying to integrate this Bloom filtering 
logic.

This dependency on writing custom classes seems to make everything a bit 
fragile, no? What hope has Luke got in opening the average index without 
careful assembly of classpaths etc?
If I contrast this with the world of database schemas it seems absurd to have a 
reliance on writing custom classes with no behaviour simply in order to 
preserve a configuration of an application's schema settings. Even an IOC 
container with XML declarations would offer a more agile means of assembling 
pre-configured *beans* rather than relying on a Service Provider mechanism that 
is only serving as a registry of *classes*.

Anyone else see this as a major pain?

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterCodec40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip


 An addition to each segment which stores a Bloom filter for selected fields 
 in order to give fast-fail to term searches, helping avoid wasted disk access.
 Best suited for low-frequency fields e.g. primary keys on big indexes with 
 many segments but also speeds up general searching in my tests.
 Overview slideshow here: 
 http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
 Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
 Patch based on 3.6 codebase attached.
 There are no 3.6 API changes currently - to play just add a field with _blm 
 on the end of the name to invoke special indexing/querying capability. 
 Clearly a new Field or schema declaration(!) would need adding to APIs to 
 configure the service properly.
 Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285726#comment-13285726
 ] 

Robert Muir commented on LUCENE-4069:
-

I'm not sure this is true: e.g. if your postings format requires parameters to 
decode the segment, then this enforces that it records said parameters;
Pulsing, for example, records these parameters.

Codec parameters are set at index time; at read time it's your responsibility to be 
able to decode them solely from the index (this enforces that there doesn't need
to be a crazy matching of user configuration at write and read time).
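
A minimal sketch of the "record your parameters in the index" idea, in plain Java rather than Lucene code. The class name `SelfDescribingFormat`, the `cutoff` setting, and the byte layout are all assumptions made for illustration; they are not Pulsing's actual on-disk format.

```java
import java.nio.ByteBuffer;

/** Hypothetical format that writes its own setting ahead of the data. */
final class SelfDescribingFormat {
    /** Index-time setting (stand-in for something like a Pulsing cutoff). */
    private final int cutoff;

    SelfDescribingFormat(int cutoff) {
        this.cutoff = cutoff;
    }

    /** Write side: header = [cutoff, count], followed by the doc ids. */
    byte[] write(int[] docIds) {
        ByteBuffer buf = ByteBuffer.allocate(4 * (2 + docIds.length));
        buf.putInt(cutoff);
        buf.putInt(docIds.length);
        for (int docId : docIds) {
            buf.putInt(docId);
        }
        return buf.array();
    }

    /** Read side: the setting is recovered from the bytes, not from user config. */
    static int readCutoff(byte[] segmentFile) {
        return ByteBuffer.wrap(segmentFile).getInt();
    }

    /** Read side: skips the recorded setting, then decodes the doc ids. */
    static int[] readDocIds(byte[] segmentFile) {
        ByteBuffer buf = ByteBuffer.wrap(segmentFile);
        buf.getInt(); // cutoff, already available via readCutoff
        int count = buf.getInt();
        int[] docIds = new int[count];
        for (int i = 0; i < count; i++) {
            docIds[i] = buf.getInt();
        }
        return docIds;
    }
}
```

The reader never receives a constructor argument, so nothing has to be matched between write-time and read-time configuration.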



[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-30 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285746#comment-13285746
 ] 

Michael McCandless commented on LUCENE-4069:


When I run all tests with the bloom 4.0 postings format
({{ant test-core -Dtests.postingsformat=BloomFilteredLucene40PostingsFormat}}),
I see a lot of failures, e.g. lots of CheckIndex failures like:

{noformat}
   [junit4]   1> java.lang.RuntimeException: vector term=[e4 b8 80 74 77 6f] field=textField2Utf8 does not exist in postings; doc=0
   [junit4]   1>    at org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:1440)
   [junit4]   1>    at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:575)
   [junit4]   1>    at org.apache.lucene.util._TestUtil.checkIndex(_TestUtil.java:186)
   [junit4]   1>    at org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:587)
   [junit4]   1>    at org.apache.lucene.index.TestFieldsReader.afterClass(TestFieldsReader.java:72)
   [junit4]   1>    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   [junit4]   1>    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
{noformat}

But also other spooky failures.  Are these known issues?


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285744#comment-13285744
 ] 

Mark Harwood commented on LUCENE-4069:
--

This fails if you add docs with title and text fields:
{code:title=ThisCrashes.java}
Codec fooCodec = new Lucene40Codec() {
  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    if ("text".equals(field)) {
      return new MemoryPostingsFormat(false);
    }
    if ("title".equals(field)) {
      return new MemoryPostingsFormat(true);
    } else {
      return super.getPostingsFormatForField(field);
    }
  }
};
{code}

  Exception in thread "main" java.io.IOException: Cannot overwrite: C:\temp\luceneCodecs\_2_Memory.ram

This also fails:

{code:title=ThisToo.java}
Codec fooCodec = new Lucene40Codec() {
  // declared but not used below: each field gets its own new instance
  SimpleTextPostingsFormat theSimple = new SimpleTextPostingsFormat();
  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    if ("text".equals(field)) {
      return new SimpleTextPostingsFormat();
    }
    if ("title".equals(field)) {
      return new SimpleTextPostingsFormat();
    } else {
      return super.getPostingsFormatForField(field);
    }
  }
};
{code}
with:
  Exception in thread "main" java.io.IOException: Cannot overwrite: C:\temp\luceneCodecs\_1_SimpleText.pst

Whereas sharing the same instance of a PostingsFormat class across fields works:

{code:title=ThisWorks.java}
Codec fooCodec = new Lucene40Codec() {
  // one shared instance for both fields
  SimpleTextPostingsFormat theSimple = new SimpleTextPostingsFormat();
  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    if ("text".equals(field) || "title".equals(field)) {
      return theSimple;
    } else {
      return super.getPostingsFormatForField(field);
    }
  }
};
{code}



[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

[
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285745#comment-13285745
]

Robert Muir commented on LUCENE-4069:
-

As for your problem with parameters, this is actually just a limitation of
PerFieldPostingsFormat that I introduced (by accident) in LUCENE-4055.
Because we use the postings format name as the segment suffix,
it's not possible to have e.g. an id field with Pulsing(1)
and a date field with Pulsing(2).

I'll open an issue and fix this; again, it's just an issue with
PerFieldPostingsFormat. You should be able to use the same name here
with different configs, as long as your PostingsFormat writes what it needs to
read the files. Thanks for bringing this up.
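
The collision described here can be sketched in plain Java, without any Lucene classes. The file-name pattern below is an assumption inferred from the `_2_Memory.ram` path in the exception reported earlier in this thread; `SuffixCollisionSketch` and its methods are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrates why using only the format name as the segment suffix collides. */
final class SuffixCollisionSketch {

    /** Assumed naming pattern: _<segment>_<formatName>.<extension>. */
    static String fileName(String segment, String formatName, String extension) {
        return "_" + segment + "_" + formatName + "." + extension;
    }

    /**
     * Simulates creating one file per field-level format instance and
     * reports whether a second create would hit an existing file name.
     */
    static boolean collides(String segment, String[] formatNamePerField, String extension) {
        Set<String> created = new HashSet<>();
        for (String formatName : formatNamePerField) {
            if (!created.add(fileName(segment, formatName, extension))) {
                return true; // same name produced twice: "Cannot overwrite"
            }
        }
        return false;
    }
}
```

Two differently configured instances of the same format report the same name, so they map to the same file even though their contents differ.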


Re: [jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-30 Thread Andrzej Bialecki


On 30/05/2012 17:09, Robert Muir (JIRA) wrote:

> I'm not sure this is true: e.g. if your postings format requires parameters to 
> decode the segment, then this enforces that it records said parameters,
> e.g. Pulsing records these parameters.
>
> Codec parameters are at index-time, at read-time it's your responsibility to be 
> able to decode them solely from the index (this enforces that there doesn't need
> to be a crazy matching of user configuration at write and read time).


I think what Mark is missing (and what I saw as a limiting factor in 
developing other codecs) is an easier way to customize Codecs by 
composing reusable blocks, without necessarily needing a 
separate Codec class implementation.


This could be worked around by having a configurable codec that stores 
its configuration and instantiates necessary reusable blocks, available 
using the SPI mechanism. On writing you could specify this configuration 
as Codec attributes, and they could be written out e.g. to SegmentInfos, 
and on read they would become available from SegmentInfos.attributes.
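
A rough sketch of this workaround in plain Java. Everything here is hypothetical: `REGISTRY` stands in for the SPI lookup of reusable blocks, the `attributes` map stands in for per-segment attributes, the attribute key names are invented, and a block is represented only by a descriptive string.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

/** One generic codec whose per-field setup is stored as attributes. */
final class ConfigurableCodecSketch {

    /** Stand-in for SPI: block name -> factory taking a single int parameter. */
    static final Map<String, IntFunction<String>> REGISTRY = new HashMap<>();
    static {
        REGISTRY.put("Pulsing", cutoff -> "Pulsing(" + cutoff + ")");
        REGISTRY.put("Bloom", bits -> "Bloom(" + bits + ")");
    }

    /** Write side: record which block and parameter a field uses. */
    static void configure(Map<String, String> attributes, String field, String block, int param) {
        attributes.put(field + ".block", block);
        attributes.put(field + ".param", Integer.toString(param));
    }

    /** Read side: rebuild the block from the stored attributes alone. */
    static String open(Map<String, String> attributes, String field) {
        String block = attributes.get(field + ".block");
        int param = Integer.parseInt(attributes.get(field + ".param"));
        return REGISTRY.get(block).apply(param);
    }
}
```

Only the reusable blocks need registering; a new combination of block and parameter needs no new class, which is the point being made above.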


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285757#comment-13285757
 ] 

Robert Muir commented on LUCENE-4069:
-

Mark: I opened LUCENE-4090.. I broke it, so I'll fix it :)




Re: [jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-30 Thread Robert Muir

On Wed, May 30, 2012 at 11:43 AM, Andrzej Bialecki a...@getopt.org wrote:
> On 30/05/2012 17:09, Robert Muir (JIRA) wrote:
>
>> I'm not sure this is true: e.g. if your postings format requires
>> parameters to decode the segment, then this enforces that it records said
>> parameters, e.g. Pulsing records these parameters.
>>
>> Codec parameters are at index-time, at read-time it's your responsibility
>> to be able to decode them solely from the index (this enforces that there
>> doesn't need to be a crazy matching of user configuration at write and
>> read time).
>
> I think what Mark is missing (and I saw as a limiting factor in developing
> other codecs) is to make it easier to customize Codec-s based on composition
> of reusable blocks, without necessarily needing a separate Codec class
> implementation.
>
> This could be worked around by having a configurable codec that stores its
> configuration and instantiates necessary reusable blocks, available using
> the SPI mechanism. On writing you could specify this configuration as Codec
> attributes, and they could be written out e.g. to SegmentInfos, and on read
> they would become available from SegmentInfos.attributes.



Well, honestly, I think a bug in PerFieldPostingsFormat (LUCENE-4090) is
definitely confusing the situation here.

You should be able to set Pulsing(1) on the id field and Pulsing(2) on the
date field and have everything just work, but I broke that. I think that's
what's causing the most grief.

-- 
lucidimagination.com


[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285765#comment-13285765
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. its just an issue with PerFieldPostingsFormat

OK, thanks. My guess is you'll effectively be having to supplement 
postingsformat.getName() with object-instanceID in file names.



[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches


[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13285768#comment-13285768
 ] 

Robert Muir commented on LUCENE-4069:
-

Right, it's not bad. Separately, I'm not happy that it's a little trappy in 
situations like your ThisToo.java.

It's certainly easy and safe to continue to use IdentityHashMap, but I think we 
will need to require equals/hashCode on PostingsFormats.

Otherwise it's going to make the situation trappy in the future: e.g. we will 
want to add mechanisms to Solr so that you can specify these parameters
in the schema.xml (today you cannot), and this would create massive numbers of 
files if we did so (add to that the fact that it disables compound files by default
and it's a disaster...).
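
The trade-off between identity-based and equals-based grouping can be sketched in plain Java. The `Format` class and `countFileGroups` method are hypothetical stand-ins, not Lucene's PostingsFormat or PerFieldPostingsFormat; the sketch assumes each distinct map key gets its own numbered suffix and therefore its own set of files.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Counts how many separate file groups a per-field mapping would produce. */
final class FormatGroupingSketch {

    /** Hypothetical format with one setting and value-based equality. */
    static final class Format {
        final String name;
        final int param;

        Format(String name, int param) {
            this.name = name;
            this.param = param;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Format)) {
                return false;
            }
            Format that = (Format) other;
            return name.equals(that.name) && param == that.param;
        }

        @Override
        public int hashCode() {
            return Objects.hash(name, param);
        }
    }

    /**
     * Assigns the next free suffix number to every format the given map
     * considers new, and returns the number of distinct groups. Pass an
     * IdentityHashMap to group by instance, or a HashMap to group by equals.
     */
    static int countFileGroups(List<Format> formatPerField, Map<Format, Integer> suffixes) {
        for (Format format : formatPerField) {
            if (!suffixes.containsKey(format)) {
                suffixes.put(format, suffixes.size());
            }
        }
        return suffixes.size();
    }
}
```

With an IdentityHashMap, every `new` instance becomes its own group even when the configuration is identical, as in ThisToo.java; with equals/hashCode and a HashMap, identically configured instances share one group while differently configured ones stay apart.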



[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches