MemoryCachedRangeFilter to boost performance of Range queries
-------------------------------------------------------------
Key: LUCENE-855
URL: https://issues.apache.org/jira/browse/LUCENE-855
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 2.1
Reporter: Andy Liu
Currently RangeFilter uses TermEnum and TermDocs to find documents that fall
within the specified range. This requires iterating through every single term
in the index and can get rather slow for large document sets.
MemoryCachedRangeFilter reads all <docId, value> pairs of a given field, sorts
by value, and stores in a SortedFieldCache. During bits(), binary searches are
used to find the start and end indices of the lower and upper bound values.
The BitSet is populated by all the docId values that fall in between the start
and end indices.
TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed index
with random date values within a 5 year range. Executing bits() 1000 times on
standard RangeQuery using random date intervals took 63904ms. Using
MemoryCachedRangeFilter, it took 876ms. Performance increase is less dramatic
when you have less unique terms in a field or using less number of documents.
Currently MemoryCachedRangeFilter only works with numeric values (values are
stored in a long[] array) but it can be easily changed to support Strings. A
side "benefit" of storing the values are stored as longs, is that there's no
longer the need to make the values lexographically comparable, i.e. padding
numeric values with zeros.
The downside of using MemoryCachedRangeFilter is there's a fairly significant
memory requirement. So it's designed to be used in situations where range
filter performance is critical and memory consumption is not an issue. The
memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.
MemoryCachedRangeFilter also requires a warmup step which can take a while to
run in large datasets (it took 40s to run on a 3M document corpus). Warmup can
be called explicitly or is automatically called the first time
MemoryCachedRangeFilter is applied using a given field.
So in summery, MemoryCachedRangeFilter can be useful when:
- Performance is critical
- Memory is not an issue
- Field contains many unique numeric values
- Index contains large amount of documents
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]