[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Hoss Man (JIRA) Tue, 10 Apr 2007 17:29:53 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487962
 ]


Hoss Man commented on LUCENE-855:
---------------------------------

On Mon, 9 Apr 2007, Otis Gospodnetic (JIRA) wrote:

: I'd love to know what Hoss and other big Filter users think about this.
: Solr makes a lot of use of (Range?)Filters, I believe.

This is one of those Jira issues that i didn't really have time to follow when 
it was first opened, and so the Jira emails have just been piling up waiting 
for me to read.

Here's the raw notes i took as i read through the patches...

----------------
FieldCacheRangeFilter.patch  from 10/Apr/07 01:52 PM

 * javadoc cut/paste errors (FieldCache)
 * FieldCacheRangeFilter should work with simple strings
   (using FieldCache.getStrings or FieldCache.getStringIndex)
   just like regular RangeFilter
 * it feels like the various parser versions should be in
   separate subclasses (common abstract base class?)
 * why does clone need to construct a raw BitSet?  what exactly didn't
   work about ChainedFilter without this?
   (could cause other BitSet usage problems)
 * or/and/andNot/xor can all be implemented using convertToBitSet
 * need FieldCacheBitSet methods: cardinality, get(int,int)
 * need equals and hashCode methods in all new classes
 * FieldCacheBitSet.clear should throw UnsupportedOperationException
 * convertToBitSet can be cached.
 * FieldCacheBitSet should be abstract, requiring get(int) be implemented


MemoryCachedRangeFilter_1.4.patch from 06/Apr/07 06:14 AM

 * "tuples" should be initialized to fieldCache.length ... serious
   ArrayList resizing going on there
   (why is it an ArrayList, why not just Tuple[] ?)
 * doesn't "cache" need synchronization? ... seems like the same
   CreationPlaceholder pattern used in FieldCache might make sense here.
 * this looks wrong...
     } else if ( (!includeLower) && (lowerIndex >= 0) ) {
   ...consider case where lower==5, includeLower==false, and all values
   in index are 5, binary search could leave us in the middle of the index,
   so we still need to move forward to the end?
 * ditto above concern for finding upperIndex
 * what is pathological worst case for rewind/forward when *lots* of
   duplicate values in index?  should another binarySearch be used?
 * a lot of code in MemoryCachedRangeFilter.bits for finding
   lowerIndex/upperIndex would probably make more sense as methods in
   SortedFieldCache
 * only seems to handle longs, at a minimum should deal with arbitrary
   strings, with optional add ons for longs/ints/etc...
 * I can't help but wonder how MemoryCachedRangeFilter would compare if it
   used Solr's OpenBitSet (facaded to implement the BitSet API)
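
To make the duplicate-value concern above concrete, here is a minimal sketch of finding both range boundaries with two bounded binary searches instead of a binarySearch followed by rewind/forward. The class name RangeBounds and its methods are hypothetical, not taken from the patch; a plain sorted long[] stands in for the values held by SortedFieldCache.

```java
// Hypothetical helper, not part of the patch: boundary search on a sorted
// long[] that stays O(log n) even when the array is full of duplicates.
final class RangeBounds {
    private RangeBounds() {}

    /** Index of the first element >= key, or sorted.length if there is none. */
    static int lowerBound(long[] sorted, long key) {
        int lo = 0, hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /** Index of the first element > key, or sorted.length if there is none. */
    static int upperBound(long[] sorted, long key) {
        int lo = 0, hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] <= key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /** Half-open slice {start, end} of the entries that fall inside the range. */
    static int[] range(long[] sorted, long lower, long upper,
                       boolean includeLower, boolean includeUpper) {
        // An exclusive lower bound must skip *every* copy of lower.
        int start = includeLower ? lowerBound(sorted, lower) : upperBound(sorted, lower);
        // An exclusive upper bound must stop before the *first* copy of upper.
        int end = includeUpper ? upperBound(sorted, upper) : lowerBound(sorted, upper);
        return new int[] { start, Math.max(start, end) };
    }
}
```

With lower==5, includeLower==false and an array holding nothing but 5s, range() returns an empty slice at the end of the array, and it does so without walking over the duplicates.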

TestRangeFilterPerformanceComparison.java   from 10/Apr/07

 * I can't help but wonder how RangeFilter would compare if it used Solr's
   OpenBitSet (facaded to implement the BitSet API)
 * no test of includeLower==false or includeUpper==false
 * i don't think the ranges being compared are the same for RangeFilter as they
   are for the other Filters ... note the use of DateTools when building the
   index, vs straight string usage in RangeFilter, vs Long.parseLong in
   MemoryCachedRangeFilter and FieldCacheRangeFilter
 * is it really a fair comparison to call MemoryCachedRangeFilter.warmup
   or FieldCacheRangeFilter.bits outside of the timing code?
   for indexes where the IndexReader is reopened periodically this may
   be a significant number to be aware of.
----------------

Questions about the legitimacy of the testing aside...

In general, I like the approach of FieldCacheBitSet -- but it should be 
generalized into an "AbstractReadOnlyBitSet" where all methods are implemented 
via get(int) in subclasses -- we should make sure that every method in the 
BitSet API works as advertised in Java1.4.  
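
A rough sketch of what such a base class could look like, assuming modern Java syntax rather than 1.4. Only the name AbstractReadOnlyBitSet comes from the suggestion above; LongRangeBitSet, toBitSet and the plain long[] standing in for FieldCache values are hypothetical. Only a handful of methods are covered here.

```java
import java.util.BitSet;

// Sketch only: a read-only BitSet facade whose answers all derive from get(int).
abstract class AbstractReadOnlyBitSet extends BitSet {
    private final int numBits;

    protected AbstractReadOnlyBitSet(int numBits) {
        super(0); // the inherited backing storage is never used
        this.numBits = numBits;
    }

    /** Subclasses decide membership; everything else is built on this. */
    @Override
    public abstract boolean get(int index);

    @Override
    public int cardinality() {
        int count = 0;
        for (int i = 0; i < numBits; i++) if (get(i)) count++;
        return count;
    }

    @Override
    public int nextSetBit(int fromIndex) {
        if (fromIndex < 0) throw new IndexOutOfBoundsException("fromIndex < 0: " + fromIndex);
        for (int i = fromIndex; i < numBits; i++) if (get(i)) return i;
        return -1;
    }

    @Override
    public int length() {
        for (int i = numBits - 1; i >= 0; i--) if (get(i)) return i + 1;
        return 0;
    }

    // Mutators are rejected instead of silently touching the unused storage.
    @Override public void set(int index) { throw new UnsupportedOperationException("read-only"); }
    @Override public void clear(int index) { throw new UnsupportedOperationException("read-only"); }
    @Override public void clear() { throw new UnsupportedOperationException("read-only"); }

    /** Materializes a real, mutable BitSet for callers that need one. */
    public BitSet toBitSet() {
        BitSet copy = new BitSet(numBits);
        for (int i = nextSetBit(0); i >= 0; i = nextSetBit(i + 1)) copy.set(i);
        return copy;
    }
}

// Hypothetical subclass: a doc is "set" when its cached value lies in [lower, upper].
final class LongRangeBitSet extends AbstractReadOnlyBitSet {
    private final long[] values; // stand-in for per-document FieldCache values
    private final long lower, upper;

    LongRangeBitSet(long[] values, long lower, long upper) {
        super(values.length);
        this.values = values;
        this.lower = lower;
        this.upper = upper;
    }

    @Override
    public boolean get(int docId) {
        if (docId < 0) throw new IndexOutOfBoundsException("docId < 0: " + docId);
        return docId < values.length && values[docId] >= lower && values[docId] <= upper;
    }
}
```

A complete version would also have to override the remaining BitSet methods (and/or/xor/andNot, get(int,int), equals, hashCode, toString, clone, and so on), since the inherited implementations read the unused internal storage and would otherwise return wrong answers.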

I don't really like the various hoops FieldCacheRangeFilter has to jump through 
to support int/float/long ... I think at its core it should support simple 
Strings, with alternate/sub classes for dealing with other FieldCache formats 
... i just really dislike all the crazy nested ifs to deal with the different 
Parser types, if there's going to be separate constructors for 
longs/floats/ints, they might as well be separate sub-classes.

the really nice thing this has over RangeFilter is that people can index raw 
numeric values without needing to massage them into lexicographically ordered 
Strings (since the FieldCache will take care of parsing them appropriately) 
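
For anyone who hasn't run into the lexicographic ordering problem, a small illustration in plain Java, no Lucene involved. The class LexVsNumeric and the pad helper are made up for this example and only handle non-negative values.

```java
// Illustration only: why raw numeric strings don't sort like numbers.
final class LexVsNumeric {
    private LexVsNumeric() {}

    /** Left-pads a non-negative value with zeros to a fixed width. */
    static String pad(long value, int width) {
        return String.format("%0" + width + "d", value);
    }

    /** True if a sorts before b as plain strings. */
    static boolean stringLess(String a, String b) {
        return a.compareTo(b) < 0;
    }
}
```

As raw strings, "9" sorts after "10", so a string-based range of 9 to 10 is inverted. pad(9, 4) and pad(10, 4) give "0009" and "0010", which sort the same way the numbers do. Comparing parsed long values skips the padding step entirely.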

My gut tells me that the MemoryCachedRangeFilter approach will never ever be 
able to compete with the FieldCacheRangeFilter facading BitSet approach since 
it needs to build the FieldCache, then the SortedFieldCache, then a BitSet 
...it seems like any optimization into that pipeline can always be beaten by 
using the same logic, but then facading the BitSet




> MemoryCachedRangeFilter to boost performance of Range queries
> -------------------------------------------------------------
>
>                 Key: LUCENE-855
>                 URL: https://issues.apache.org/jira/browse/LUCENE-855
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.1
>            Reporter: Andy Liu
>         Assigned To: Otis Gospodnetic
>         Attachments: FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, 
> TestRangeFilterPerformanceComparison.java, 
> TestRangeFilterPerformanceComparison.java
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
> within the specified range.  This requires iterating through every single 
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all <docId, value> pairs of a given field, 
> sorts by value, and stores in a SortedFieldCache.  During bits(), binary 
> searches are used to find the start and end indices of the lower and upper 
> bound values.  The BitSet is populated by all the docId values that fall in 
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
> index with random date values within a 5 year range.  Executing bits() 1000 
> times on standard RangeQuery using random date intervals took 63904ms.  Using 
> MemoryCachedRangeFilter, it took 876ms.  The performance increase is less 
> dramatic when a field has fewer unique terms or the index has fewer 
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are 
> stored in a long[] array) but it can be easily changed to support Strings.  A 
> side "benefit" of storing the values as longs is that there's no longer the 
> need to make the values lexicographically comparable, i.e. padding numeric 
> values with zeros.
> The downside of using MemoryCachedRangeFilter is there's a fairly significant 
> memory requirement.  So it's designed to be used in situations where range 
> filter performance is critical and memory consumption is not an issue.  The 
> memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.  
> MemoryCachedRangeFilter also requires a warmup step which can take a while to 
> run in large datasets (it took 40s to run on a 3M document corpus).  Warmup 
> can be called explicitly or is automatically called the first time 
> MemoryCachedRangeFilter is applied using a given field.
> So in summary, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - Field contains many unique numeric values
> - Index contains large amount of documents

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.