The trick is to wrap the TermQuery as new ConstantScoreQuery(new QueryWrapperFilter(new TermQuery(...))), because a TermQuery that stands in for a filter should not contribute to the score. This pattern is used quite often inside Lucene itself, e.g. in MultiTermQuery, so don't worry about the strange-looking code.
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

From: Khash Sajadi [mailto:[email protected]]
Sent: Sunday, October 24, 2010 12:50 PM
To: [email protected]
Subject: Re: Using filters to speed up queries

Terribly sorry. I meant Mike:

> Also note that static index partitioning like this does not result in the same scoring as you'd get if each user had their own index, since the term stats (IDF) is aggregated across all users. So for queries with more than one term, users can see docs sorted differently, and this is actually a known security risk in that users can glean some details about the documents they aren't allowed to see due to the shared terms stats... there is a paper somewhere (Robert?) that delves into it.

On 24 October 2010 11:46, Uwe Schindler <[email protected]> wrote:

Security risk? I did not say anything about that!

From: Khash Sajadi [mailto:[email protected]]
Sent: Sunday, October 24, 2010 12:34 PM
To: [email protected]
Subject: Re: Using filters to speed up queries

Here is what I've found so far:

I have three main sets to use in a query:

- Account: MUST be xxx
- User query
- DateRange on the query: MUST be in (a,b); it is a NumericField

I tried the following combinations (all using a BooleanQuery with the user query added to it):

1. One:
   - Add ACCOUNT as a TermQuery
   - Add DATE RANGE as a Filter
2. Two:
   - Add ACCOUNT as a Filter
   - Add DATE RANGE as a NumericRangeQuery

I tried caching the filters in both scenarios. I also tried both scenarios with the query passed as a ConstantScoreQuery.

I got the best result (about 4x faster) by using a cached filter for the DATE RANGE and leaving the ACCOUNT as a TermQuery. I think I'm happy with this approach. However, the security risk Uwe mentioned when using ACCOUNT as a Query makes me nervous. Any suggestions?

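To make the quoted point about shared term statistics concrete, here is a minimal sketch of how a term's IDF can differ between one shared index and a per-user index. The formula is, as far as I recall, the one in Lucene's DefaultSimilarity of that era; treat that as an assumption. The class name and the document counts in the usage note are hypothetical and only for illustration.

```java
// Hypothetical illustration class; not part of Lucene.
class SharedIdfSketch {

    // Assumed to mirror DefaultSimilarity.idf(docFreq, numDocs):
    // log(numDocs / (docFreq + 1)) + 1
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: a term that is rare in one user's documents
        // but common across the whole shared index.
        double perUser = idf(5, 6000);          // index holding only this user's docs
        double shared = idf(30000, 3000000);    // one index shared by all users
        System.out.println("per-user idf: " + perUser);
        System.out.println("shared idf:   " + shared);
    }
}
```

With these made-up counts the term weighs more in the per-user index than in the shared one, so a multi-term query can order the same documents differently. The difference comes entirely from documents the user is not allowed to see, which is the leak being described.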
As for document distribution, the ACCOUNTS have a similar distribution of documents.

Also, I would still like to try the multi-index approach, but I'm not sure about the memory and file handle burden of it (potentially having thousands of readers/writers/searchers open at the same time).

I use two processes, one as indexer and one for search, with the same underlying FSDirectory. As for search, I use writer.getReader().reopen within a SearchManager as suggested by Lucene in Action.

On 24 October 2010 10:27, Paul Elschot <[email protected]> wrote:

On Sunday 24 October 2010 00:18:48, Khash Sajadi wrote:
> My index contains documents for different users. Each document has the user
> id as a field on it.
>
> There are about 500 different users with 3 million documents.
>
> Currently I'm calling Search with the query (parsed from user)
> and FieldCacheTermsFilter for the user id.
>
> It works but the performance is not great.
>
> Ideally, I would like to perform the search only on the documents that are
> relevant, this should make it much faster. However, it seems Search(Query,
> Filter) runs the query first and then applies the filter.
>
> Is there a way to improve this? (i.e. run the query only on a subset of
> documents)
>
> Thanks
>

When running the query with the filter, the query is run at the same time as the filter. Initially and after each matching document, the filter is assumed to be cheaper to execute and its first or next matching document is determined. Then the query and the filter are repeatedly advanced to each other's next matching document until they are at the same document (i.e. there is a match), similar to a boolean query with two required clauses. The Java code doing this is in the private method IndexSearcher.searchWithFilter().

It could be that filling the field cache is the performance problem. How is the performance when this search call with the FieldCacheTermsFilter is repeated?

Also, for a single indexed term to be used as a filter (the user id in this case), there may be no need for a cache; a QueryWrapperFilter around the TermQuery might suffice.

Regards,
Paul Elschot
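The way Paul describes the query and the filter advancing to each other's next matching document can be sketched in plain Java. This is not the code of IndexSearcher.searchWithFilter(); it is a simplified model over two sorted arrays of document ids, and the class name, method names and the linear-scan advance are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
class LeapfrogSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Returns the first doc id >= target, or NO_MORE_DOCS if there is none.
    // A linear scan keeps the sketch short; real iterators skip more cleverly.
    static int advance(int[] sortedDocs, int target) {
        for (int doc : sortedDocs) {
            if (doc >= target) {
                return doc;
            }
        }
        return NO_MORE_DOCS;
    }

    // Collects the doc ids matched by both the filter and the query.
    static List<Integer> intersect(int[] filterDocs, int[] queryDocs) {
        List<Integer> hits = new ArrayList<>();
        // The filter is assumed to be cheaper, so it moves first.
        int filterDoc = advance(filterDocs, 0);
        int queryDoc = advance(queryDocs, filterDoc);
        while (filterDoc != NO_MORE_DOCS && queryDoc != NO_MORE_DOCS) {
            if (queryDoc == filterDoc) {
                hits.add(queryDoc);                              // both agree: a match
                filterDoc = advance(filterDocs, filterDoc + 1);  // filter moves first again
                queryDoc = advance(queryDocs, filterDoc);
            } else if (queryDoc > filterDoc) {
                filterDoc = advance(filterDocs, queryDoc);       // filter catches up
            } else {
                queryDoc = advance(queryDocs, filterDoc);        // query catches up
            }
        }
        return hits;
    }
}
```

Neither side is evaluated in full before the other: each jumps straight to the other's current candidate, so a selective filter such as a single user id lets the query skip most of its own matches.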
