Dear Lucene developers, I'm about to scale up a set of Lucene databases from a single user to some tens of users. They will normally use an in-house classification system to limit the results of their queries.
Currently I support this using BitFilters: I first query for the class(es), then create BitFilters and apply them to the user queries with IndexSearcher.search(query, filter). The filters are created from a sorted list of secondary keys, using IndexReader.termDocs() for each key, and are kept in an LRU cache for reuse. (A sketch of how the filters are built is in the P.S. below.)

Currently the total number of documents is about 100k, and a typical limitation is to some hundreds of documents, i.e. up to one percent of the total. Total index size is about 1 GB with all text stored, and about 250 MB without stored text. The databases are a mixture of titles, abstracts and full text. (Scoring these quite differently sized documents works nicely, btw.)

For a single user, performance is quite satisfactory using the cached BitFilters. However, I'm concerned that performance might drop for larger document sets. I expect the total number of documents to grow to 4,000k (i.e. 40 times as many), while the typical limitation stays the same size. I am also expecting that the PhrasePrefixQuery will be used a lot. (Btw, your timing for Lucene 1.3 is just about perfect, i.e. I'll wait for it to be ready before scaling up.)

I looked at the IndexSearcher.search(query, filter, results) implementation and found that scoring for the query is done for every hit _before_ the filter is consulted. I suppose this also means the index is traversed without taking the filter into account. As said, for a single user there is no performance problem, but for the larger document sets I am concerned that the BitFilter is ignored while the query itself is evaluated. Also, storing one bit per database document for each limitation needed does not scale well when more limitations are needed.

So I tried to find a way to define a new type of filter that could be used more like a boolean query and that uses memory in proportion to the number of docs in the limitation instead of the number of docs in the database. I also tried to understand the source code of BooleanScorer.java. It uses a hash table, but that's where my understanding stops.

My questions:
- Is this a case of 'premature optimisation is the root of all evil', i.e. is the concern misplaced?
- Did anyone else encounter performance problems because the scoring ignores the BitFilters?
- Could someone explain how the BooleanScorer works?
- Still a bit vague: during query evaluation, I'd like to be able to use something like skipTo(docNr), where the docNr is taken from the (new type of) limitation. Does this make sense? (There is a second sketch of what I mean in the P.P.S.)

Regards,

Ype
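P.S. To make the question more concrete, this is roughly how the cached BitFilters are built. It is only a sketch of my current code, against the Filter/IndexReader API as I understand it; the field name "classKey" is made up for the example.

import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

/**
 * Sets one bit per database document, for every document whose
 * classification field matches one of the given secondary keys.
 * Instances of this are what I keep in the LRU cache.
 */
public class ClassificationFilter extends Filter {
  private final String[] keys;   // sorted list of secondary keys

  public ClassificationFilter(String[] keys) {
    this.keys = keys;
  }

  public BitSet bits(IndexReader reader) throws IOException {
    BitSet bits = new BitSet(reader.maxDoc());
    for (int i = 0; i < keys.length; i++) {
      // walk the postings for this key and mark the matching documents
      TermDocs termDocs = reader.termDocs(new Term("classKey", keys[i]));
      try {
        while (termDocs.next()) {
          bits.set(termDocs.doc());
        }
      } finally {
        termDocs.close();
      }
    }
    return bits;
  }
}

The filter is then passed along with the user query, roughly as searcher.search(query, new ClassificationFilter(keys)), with the LRU cache in between of course.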

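P.P.S. And this is the sort of limitation structure I had in mind for the last question: just the sorted document numbers, so memory grows with the size of the limitation rather than with maxDoc(). The skipTo() method is purely hypothetical; nothing in the current Filter/Scorer API would call it, it only illustrates what I would like a scorer to be able to ask for.

/**
 * Compact limitation: the matching document numbers in ascending order,
 * so memory is proportional to the number of docs in the limitation
 * instead of the number of docs in the database.
 */
public class DocNrLimitation {
  private final int[] docNrs;   // matching doc numbers, sorted ascending

  public DocNrLimitation(int[] sortedDocNrs) {
    this.docNrs = sortedDocNrs;
  }

  /** Number of documents in the limitation. */
  public int size() {
    return docNrs.length;
  }

  /**
   * Returns the smallest document number in the limitation that is
   * >= docNr, or -1 if there is none (binary search).  A scorer could
   * use this to jump over documents the limitation excludes, instead
   * of scoring them first and masking them out afterwards.
   */
  public int skipTo(int docNr) {
    int low = 0;
    int high = docNrs.length - 1;
    while (low <= high) {
      int mid = (low + high) / 2;
      if (docNrs[mid] < docNr) {
        low = mid + 1;
      } else {
        high = mid - 1;
      }
    }
    return (low < docNrs.length) ? docNrs[low] : -1;
  }
}

The idea would be that query evaluation alternates between its own next() and the limitation's skipTo(), so only documents inside the limitation are ever scored.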