Unfortunately, Lucene's performance with filters isn't great. This is because we currently always apply filters "up high", using a leapfrog approach: we alternate asking the filter and then the scorer to skip to each other's docIDs.
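To make the leapfrog concrete, here's a toy, self-contained sketch (plain Java, not Lucene's actual DocIdSetIterator/Scorer API -- the `Iter` class and method names are made up for illustration, and the iterators are just sorted int arrays):

```java
import java.util.*;

public class Leapfrog {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Minimal stand-in for an iterator over docIDs in increasing order.
    static class Iter {
        final int[] docs;
        int pos = -1;
        Iter(int[] docs) { this.docs = docs; }
        int nextDoc() {
            return ++pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }
        // Skip forward to the first doc >= target (simplified: always moves).
        int advance(int target) {
            while (++pos < docs.length) {
                if (docs[pos] >= target) return docs[pos];
            }
            return NO_MORE_DOCS;
        }
    }

    // Leapfrog: filter and scorer skip to each other's docIDs; a doc is
    // collected only when both land on the same docID.
    static List<Integer> intersect(Iter filter, Iter scorer) {
        List<Integer> hits = new ArrayList<>();
        int f = filter.nextDoc();
        int s = scorer.advance(f);
        while (f != NO_MORE_DOCS && s != NO_MORE_DOCS) {
            if (f == s) {
                hits.add(f);                  // both accept this doc
                f = filter.nextDoc();
                s = scorer.advance(f);
            } else if (f < s) {
                f = filter.advance(s);        // filter leaps over scorer
            } else {
                s = scorer.advance(f);        // scorer leaps over filter
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Iter filter = new Iter(new int[]{2, 5, 7, 11, 14});
        Iter scorer = new Iter(new int[]{3, 5, 8, 11, 14, 20});
        System.out.println(intersect(filter, scorer));  // prints [5, 11, 14]
    }
}
```

The cost here is dominated by the skipping itself: each advance may touch skip-list levels in the real postings, which is why this loses to a cheap per-doc bit check once the filter matches a large enough fraction of the index.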
But if the filter accepts "enough" of the documents in the index (~1% in my testing), it's often better to apply the filter "down low", the same way we apply deleted docs (which are really just another filter), i.e. quickly eliminating docs as we enumerate them from the postings. I did a blog post about this too: http://chbits.blogspot.com/2010/09/fast-search-filters-using-flex.html That post shows some of the perf gains we could get by switching filters to apply down low, though that was for a filter that randomly accepts 50% of the index. It uses the flex APIs (for 4.0); you may be able to do something similar pre-4.0 using FilterIndexReader.

Of course you shouldn't have to do such tricks -- https://issues.apache.org/jira/browse/LUCENE-1536 is open for Lucene to do this itself when you pass a filter.

You should test, but I suspect a MUST clause on an AND query may not perform that much better in general for filters that accept a biggish part of the index, since it's still using skipping, especially if your query wasn't already a BooleanQuery. For restrictive filters it should be a decent gain, but those queries are already fast to begin with.

Do you have some perf numbers to share? What kind of queries are you running with the filters? Are there certain users that have a highish percentage of the documents, with a long tail of other users? If so you could consider making dedicated indices for those high-doc-count users...

Also note that static index partitioning like this does not result in the same scoring as you'd get if each user had their own index, since the term stats (IDF) are aggregated across all users. So for queries with more than one term, users can see docs sorted differently, and this is actually a known security risk: users can glean some details about the documents they aren't allowed to see through the shared term stats... there is a paper somewhere (Robert?) that delves into it.
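The down-low approach is much simpler in shape -- a random-access bit check per enumerated doc, exactly like the deleted-docs check. Here's a toy, self-contained sketch (plain Java with a BitSet standing in for the filter's bits; the `applyDownLow` name and the sample docIDs are made up for illustration, not Lucene API):

```java
import java.util.*;

public class DownLowFilter {

    // "Down low": as docs are enumerated from the postings, a cheap
    // random-access bit test drops filtered-out docs immediately,
    // with no leapfrogging between two iterators.
    static List<Integer> applyDownLow(int[] postings, BitSet filterBits) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : postings) {
            if (filterBits.get(doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Postings for some term, in docID order.
        int[] postings = {1, 3, 5, 8, 11, 14, 20};

        // Filter bits: a doc is visible iff its bit is set.
        BitSet filterBits = new BitSet();
        for (int doc : new int[]{3, 8, 9, 14}) filterBits.set(doc);

        System.out.println(applyDownLow(postings, filterBits));  // prints [3, 8, 14]
    }
}
```

Note this only pays off when the filter is available as (or cheaply convertible to) a random-access structure, and when it accepts a large enough fraction of the index that per-doc bit tests beat skipping.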
Mike

On Sat, Oct 23, 2010 at 6:18 PM, Khash Sajadi <[email protected]> wrote:
> My index contains documents for different users. Each document has the user
> id as a field on it.
> There are about 500 different users with 3 million documents.
> Currently I'm calling Search with the query (parsed from user)
> and FieldCacheTermsFilter for the user id.
> It works but the performance is not great.
> Ideally, I would like to perform the search only on the documents that are
> relevant, this should make it much faster. However, it seems Search(Query,
> Filter) runs the query first and then applies the filter.
> Is there a way to improve this? (i.e. run the query only on a subset of
> documents)
> Thanks

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
