Dear Lucene developers, I'm about to scale up a set of Lucene databases from a single user to some tens of users. They will normally use an in-house classification system to limit the results of their queries.
Currently I support this using BitFilters: I first query for the class(es), then create BitFilters and apply them to the user queries with IndexSearcher.search(query, filter). The filters are created from a sorted list of secondary keys, using IndexReader.termDocs() for each key, and are kept in an LRU cache for reuse. (A sketch of how the filters are built is in the P.S. below.)

Currently the total number of documents is about 100k, and a typical limitation is to some hundreds of documents, i.e. up to one percent of the total. Total index size is about 1 GB with all text stored, and about 250 MB without stored text. The databases are a mixture of titles, abstracts and full text. (Scoring these quite differently sized documents works nicely, btw.)

For a single user, performance is quite satisfactory using the cached BitFilters. However, I'm concerned that performance might drop for larger document sets. I expect the total number of documents to grow to 4,000k (i.e. 40 times as many), while the typical limitation stays the same size. I am also expecting that the PhrasePrefixQuery will be used a lot. (Btw, your timing for Lucene 1.3 is just about perfect, i.e. I'll wait for it to be ready before scaling up.)

I looked at the IndexSearcher.search(query, filter, results) implementation and found that scoring for the query is done for every hit _before_ the filter is consulted. I suppose this also means the index is traversed without taking the filter into account. As said, for a single user there is no performance problem, but for the larger document sets I am concerned that the BitFilter is ignored while the query itself is evaluated. Also, storing one bit per database document for each limitation needed does not scale well when more limitations are needed.

So I tried to find a way to define a new type of filter that could be used more like a boolean query and that uses memory in proportion to the number of docs in the limitation instead of the number of docs in the database. I also tried to understand the source code of BooleanScorer.java. It uses a hash table, but that's where my understanding stops.

My questions:
- Is this a case of 'premature optimisation is the root of all evil', i.e. is the concern misplaced?
- Did anyone else encounter performance problems because the scoring ignores the BitFilters?
- Could someone explain how the BooleanScorer works?
- Still a bit vague: during query evaluation, I'd like to be able to use something like skipTo(docNr), where the docNr is taken from the (new type of) limitation. Does this make sense? (There is a second sketch of what I mean in the P.P.S.)

Regards,

Ype
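P.S. To make the question more concrete, this is roughly how the cached BitFilters are built. It is only a sketch of my current code, against the Filter/IndexReader API as I understand it; the field name "classKey" is made up for the example.

import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

/**
 * Sets one bit per database document, for every document whose
 * classification field matches one of the given secondary keys.
 * Instances of this are what I keep in the LRU cache.
 */
public class ClassificationFilter extends Filter {
  private final String[] keys;   // sorted list of secondary keys

  public ClassificationFilter(String[] keys) {
    this.keys = keys;
  }

  public BitSet bits(IndexReader reader) throws IOException {
    BitSet bits = new BitSet(reader.maxDoc());
    for (int i = 0; i < keys.length; i++) {
      // walk the postings for this key and mark the matching documents
      TermDocs termDocs = reader.termDocs(new Term("classKey", keys[i]));
      try {
        while (termDocs.next()) {
          bits.set(termDocs.doc());
        }
      } finally {
        termDocs.close();
      }
    }
    return bits;
  }
}

The filter is then passed along with the user query, roughly as searcher.search(query, new ClassificationFilter(keys)), with the LRU cache in between of course.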

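P.P.S. And this is the sort of limitation structure I had in mind for the last question: just the sorted document numbers, so memory grows with the size of the limitation rather than with maxDoc(). The skipTo() method is purely hypothetical; nothing in the current Filter/Scorer API would call it, it only illustrates what I would like a scorer to be able to ask for.

/**
 * Compact limitation: the matching document numbers in ascending order,
 * so memory is proportional to the number of docs in the limitation
 * instead of the number of docs in the database.
 */
public class DocNrLimitation {
  private final int[] docNrs;   // matching doc numbers, sorted ascending

  public DocNrLimitation(int[] sortedDocNrs) {
    this.docNrs = sortedDocNrs;
  }

  /** Number of documents in the limitation. */
  public int size() {
    return docNrs.length;
  }

  /**
   * Returns the smallest document number in the limitation that is
   * >= docNr, or -1 if there is none (binary search).  A scorer could
   * use this to jump over documents the limitation excludes, instead
   * of scoring them first and masking them out afterwards.
   */
  public int skipTo(int docNr) {
    int low = 0;
    int high = docNrs.length - 1;
    while (low <= high) {
      int mid = (low + high) / 2;
      if (docNrs[mid] < docNr) {
        low = mid + 1;
      } else {
        high = mid - 1;
      }
    }
    return (low < docNrs.length) ? docNrs[low] : -1;
  }
}

The idea would be that query evaluation alternates between its own next() and the limitation's skipTo(), so only documents inside the limitation are ever scored.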