Re: Issue upgrading from lucene 2.3.2 to 2.4 (moving from bitset to docidset)

Tim Sturge Wed, 10 Dec 2008 13:34:31 -0800

It's LUCENE-1487. 

Tim



On 12/10/08 1:13 PM, "Tim Sturge" <[EMAIL PROTECTED]> wrote:

> Yes (mostly). It turns those terms into an OpenBitSet on the term array.
> Then it does a fastGet() in the next() and skipTo() loops to see if the term
> for that document is in the set.
> 
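For anyone following along, the check described above can be sketched in a few lines. This is only an illustration, not the actual FieldCacheTermsFilter code: java.util.BitSet stands in for OpenBitSet (and get() for fastGet()), and the class, method, and array names (TermSetSketch, docToOrd, sortedTerms) are made up for the example.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical sketch of a field-cache-backed term-set check (not Lucene code). */
final class TermSetSketch {
    /** Sets one bit per wanted term, indexed by the term's position in the sorted term array. */
    static BitSet wantedOrdinals(String[] sortedTerms, String... wanted) {
        BitSet bits = new BitSet(sortedTerms.length);
        for (String term : wanted) {
            int ord = Arrays.binarySearch(sortedTerms, term); // done once per term, at setup
            if (ord >= 0) {
                bits.set(ord); // terms missing from the array are simply ignored
            }
        }
        return bits;
    }

    /** Per-document test: one array read plus one bit lookup. */
    static boolean matches(int[] docToOrd, BitSet wanted, int doc) {
        return wanted.get(docToOrd[doc]);
    }

    /** Mimics a next()/skipTo() loop: first matching doc at or after 'from', or -1 if none. */
    static int nextMatch(int[] docToOrd, BitSet wanted, int from) {
        for (int doc = from; doc < docToOrd.length; doc++) {
            if (wanted.get(docToOrd[doc])) {
                return doc;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] terms = {"blue", "green", "red"}; // assumed sorted term array
        int[] docToOrd = {2, 0, 1, 2};             // assumed per-document term ordinal
        BitSet wanted = wantedOrdinals(terms, "red");
        System.out.println(nextMatch(docToOrd, wanted, 1));
    }
}
```

The point of the layout is that the per-document cost does not depend on how many terms are in the set.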
> The issue is that fastGet() is not as fast as the two inequalities in FCRF.
> I didn't directly benchmark FCTF against FCRF because I had a different
> application in mind for FCTF (location boxes). However it wasn't as
> efficient in that case as directly realizing the bit sets. This was mostly
> because in the application I had in mind there were a lot (>100K) of terms
> with relatively low frequency and queries that needed only a few hundred
> terms in the set.
> 
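For comparison, the "two inequalities" style of check can be sketched as below. Again this is an assumption-laden illustration rather than the FieldCacheRangeFilter source: the names (RangeCheckSketch, docToOrd, lowerOrd, upperOrd) are hypothetical, and it assumes the range bounds have already been resolved to ordinals in the sorted term array.

```java
/** Hypothetical sketch of a range check over cached term ordinals (not Lucene code). */
final class RangeCheckSketch {
    /** Per-document test: one array read and two integer comparisons, no bit set involved. */
    static boolean inRange(int[] docToOrd, int lowerOrd, int upperOrd, int doc) {
        int ord = docToOrd[doc];
        return ord >= lowerOrd && ord <= upperOrd;
    }

    /** Counts the documents whose ordinal falls inside the inclusive range. */
    static int countMatches(int[] docToOrd, int lowerOrd, int upperOrd) {
        int count = 0;
        for (int doc = 0; doc < docToOrd.length; doc++) {
            if (inRange(docToOrd, lowerOrd, upperOrd, doc)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] docToOrd = {5, 1, 3, 9}; // assumed per-document term ordinal
        System.out.println(countMatches(docToOrd, 2, 5));
    }
}
```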
> I tried a sorted list of terms and Arrays.binarySearch() but that is way
> slower as is Set<Integer> (no surprise there). I was thinking about a custom
> hash table implementation but I'm not hopeful; it increases cycle cost and
> means 
> 
> So it is efficient, but for a more limited set of cases than FCRF. My gut
> feeling is that FCRF is a better solution for "most" range filters, whereas
> FCTF is a better solution for "some" term set filters (versus creating
> TermsFilter objects on the fly each time). It all depends on how common the
> terms are and how large the sets of terms are. With lots of terms (or a few
> very common terms) it wins. With a few less common terms it loses.
> 
> I'll open a JIRA issue for it.
> 
> Tim
> 
> On 12/10/08 12:45 PM, "Michael McCandless" <[EMAIL PROTECTED]>
> wrote:
> 
>> 
>> It'd be great to get this into Lucene.
>> 
>> Does FieldCacheTermsFilter let you specify a set of arbitrary terms to
>> filter for, like TermsFilter in contrib/queries?  And it's space/time
>> efficient once FieldCache is populated?
>> 
>> Mike
>> 
>> Tim Sturge wrote:
>> 
>>> Mike, Mike,
>>> 
>>> I have an implementation of FieldCacheTermsFilter (which uses the field
>>> cache to filter for a predefined set of terms) around if either of you
>>> are interested. It is faster than materializing the filter roughly when
>>> the filter matches more than 1% of the documents.
>>> 
>>> So it's not better for a large set of small filters (which you can
>>> materialize on the spot) but it is better for a small set (but more
>>> than 32) of large filters.
>>> 
>>> Let me know if you're interested and I'll send it in.
>>> 
>>> Tim
>>> 
>>> On 12/10/08 3:34 AM, "Michael McCandless"
>>> <[EMAIL PROTECTED]> wrote:
>>> 
>>>> 
>>>> In your approach, roughly how many filters do you have cached?  It
>>>> seems like it could be quite a few (one for each color, one for each
>>>> type, etc)?
>>>> 
>>>> You might be able to modify the new (on Lucene trunk)
>>>> FieldCacheRangeFilter to achieve this same filtering without actually
>>>> having to materialize the full bitset for each.
>>>> 
>>>> Mike
>>>> 
>>>> Michael Stoppelman wrote:
>>>> 
>>>>> Yeah, looks similar to what we've implemented for ourselves (although I
>>>>> haven't looked at the implementation). We've got quite a custom version
>>>>> of Lucene at this point. Using Solr at this point really isn't a viable
>>>>> option, but thanks for pointing this out.
>>>>> 
>>>>> M
>>>>> 
>>>>> On Tue, Dec 9, 2008 at 1:47 AM, Michael McCandless <
>>>>> [EMAIL PROTECTED]> wrote:
>>>>> 
>>>>>> 
>>>>>> This use case sounds a lot like faceted navigation, which Solr
>>>>>> provides.
>>>>>> 
>>>>>> Mike
>>>>>> 
>>>>>> 
>>>>>> Michael Stoppelman wrote:
>>>>>> 
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> I'm working on upgrading to Lucene 2.4.0 from 2.3.2 and was trying to
>>>>>>> integrate the new DocIdSet changes, since the
>>>>>>> o.a.l.search.Filter#bits() method is now deprecated. For our app we
>>>>>>> actually rely heavily on bits from the Filter to do post-query
>>>>>>> filtering (I explain why below).
>>>>>>> 
>>>>>>> For example, someone searches for product: "ipod" and then filters by
>>>>>>> type: "nano" (e.g. mini/nano/regular) AND color: "red" (e.g.
>>>>>>> red/yellow/blue). In our current model the results are gathered in the
>>>>>>> following way:
>>>>>>> 
>>>>>>> 1) "ipod" w/o attributes is run and the results are stored in a
>>>>>>> HitCollector
>>>>>>> 2) "ipod" results are now filtered for color="red" AND type="nano"
>>>>>>> using the Lucene Filters
>>>>>>> 3) The filtered results are returned to the user.
>>>>>>> 
>>>>>>> The reason that the attributes are filtered post-query is so that we
>>>>>>> can return the other types and colors the user can filter by in the
>>>>>>> future, meaning the UI would be able to show "blue", "green", "pink",
>>>>>>> etc. If we pre-filtered results by color and type beforehand, we
>>>>>>> wouldn't know what the other filter options would be for the broader
>>>>>>> result set.
>>>>>>> 
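A rough sketch of the post-query flow described above, for readers who want to see it concretely. It is plain Java with no Lucene calls, and everything in it (PostQueryFacetSketch, the hits array, the per-document colorByDoc/typeByDoc arrays) is a hypothetical stand-in for the collected hits and the stored attributes, not the poster's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of post-query attribute filtering with facet counts (not Lucene code). */
final class PostQueryFacetSketch {
    /** Counts each attribute value over the unfiltered hits, so the UI can list every option. */
    static Map<String, Integer> facetCounts(int[] hits, String[] valueByDoc) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int doc : hits) {
            counts.merge(valueByDoc[doc], 1, Integer::sum);
        }
        return counts;
    }

    /** Keeps only the hits whose color and type both equal the selected values. */
    static List<Integer> filter(int[] hits, String[] colorByDoc, String color,
                                String[] typeByDoc, String type) {
        List<Integer> kept = new ArrayList<>();
        for (int doc : hits) {
            if (color.equals(colorByDoc[doc]) && type.equals(typeByDoc[doc])) {
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] colorByDoc = {"red", "blue", "red", "pink"};
        String[] typeByDoc = {"nano", "nano", "mini", "nano"};
        int[] hits = {0, 1, 2, 3}; // step 1: unfiltered hits for the base query
        System.out.println(facetCounts(hits, colorByDoc));                      // options for the UI
        System.out.println(filter(hits, colorByDoc, "red", typeByDoc, "nano")); // steps 2 and 3
    }
}
```

Counting over the unfiltered hits first is what keeps the other colors and types visible after a filter has been chosen.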
>>>>>>> Does anyone else have this use case? I'd imagine other folks are
>>>>>>> probably
>>>>>>> doing similar things to accomplish this.
>>>>>>> 
>>>>>>> M
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>>>>>> For additional commands, e-mail: [EMAIL PROTECTED]
>>>>>> 
>>>>>> 