Kevin,
On Thursday 05 August 2004 23:32, Kevin A. Burton wrote:
> I'm trying to compute a filter to match documents in our index by a set
> of terms.
>
> For example some documents have a given field 'category' so I need to
> compute a filter with multiple categories.
>
> The problem is that our category list is > 200 items so it takes about
> 80 seconds to compute. We cache it of course but this seems WAY too slow.
>
> Is there anything I could do to speed it up? Maybe run the queries
> myself and then combine the bitsets?

That would be a first step.

> We're using a BooleanQuery with nested TermQueries to build up the
> filter...

I suppose that is a BooleanQuery with all terms optional? Depending on the
number of docs in the index and the distribution of the categories over the
documents, that might lead to a lot of disk head movement.

Recently some code was posted to compute a filter for date ranges. For each
date (i.e. Term) in the range it walks all documents containing that term and
sets the corresponding bit in a bitset. You can use the same approach here.
See IndexReader.termDocs(Term) for starters, and preferably iterate over the
categories (Terms) in sorted order.

A BooleanQuery would do much the same thing, but it has to work in document
order over all Terms at the same time, which can cause extra disk seeks
between the TermDocs. You can avoid those seeks by iterating over each term's
TermDocs yourself and collecting the results in the bitset. If you do this
with sorted terms, ideally the disk head moves in a single direction for the
whole process.

For maximum performance you might want to avoid running other Queries or
similar TermDocs iterators at the same time. Also avoid retrieving documents
while this is going on; just keep that disk head moving only where you want
it to.

For a further CPU speedup you can fetch the TermDocs in batches using the
read() method. Lucene's TermScorer does this, see
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/search/TermScorer.java
and use 'view' on the latest revision. A batch size larger than the 32 used
there would seem appropriate for your case.

Could you perhaps report the speedup? I guess you should be able to bring it
down to at most twenty seconds or so. After that, replication over multiple
disks might help, giving each of them an interval of the sorted categories
to search.

Good luck,
Paul
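P.S. In case it helps, here is a rough, untested sketch of the loop I have
in mind, against the current API. The field name "category" and the batch
size of 128 are just examples, to be adapted to your setup:

import java.io.IOException;
import java.util.Arrays;
import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

/** Sets a bit for each document that has any of the given categories. */
public class CategoryFilter extends Filter {
  private final String[] categories;

  public CategoryFilter(String[] categories) {
    this.categories = (String[]) categories.clone();
    Arrays.sort(this.categories); // visit the terms in sorted order
  }

  public BitSet bits(IndexReader reader) throws IOException {
    BitSet bits = new BitSet(reader.maxDoc());
    int[] docs = new int[128];  // batch buffers; TermScorer uses size 32
    int[] freqs = new int[128];
    TermDocs termDocs = reader.termDocs();
    try {
      for (int i = 0; i < categories.length; i++) {
        termDocs.seek(new Term("category", categories[i]));
        int count;
        while ((count = termDocs.read(docs, freqs)) != 0) {
          for (int j = 0; j < count; j++) {
            bits.set(docs[j]);
          }
        }
      }
    } finally {
      termDocs.close();
    }
    return bits;
  }
}

Sorting the category strings is enough to get the terms in sorted order
here, since they all belong to the same field.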
