If auto-filters can provide an effective implementation for RangeQuery
that avoids rewriting, and we can give up MultiTermQuery and
PrefixQuery in the distributed environment, then how about something
like this refinement:

  1. No rewriting is done.
  2. The central node maintains a cache of aggregate docFreq data that
is incrementally built on demand, and flushed after any remote node
opens a new Searcher.
  3. The central node computes the Weights by accessing the docFreq for
each query term. The docFreq is looked up in the cache; on a miss, it
is queried from each remote node, summed, and the sum is cached.
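Steps 2 and 3 could be sketched roughly like this. This is only an
illustration of the proposal, not Lucene API: the names here
(AggregateDocFreqCache, RemoteNode) are hypothetical, and the per-node
IPC call is stubbed out as an interface method.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the proposed central-node docFreq cache (hypothetical
 *  names, not Lucene API). */
class AggregateDocFreqCache {

    /** Stand-in for the remote docFreq IPC call to one node. */
    interface RemoteNode {
        int docFreq(String field, String text);
    }

    private final List<RemoteNode> nodes;
    private final Map<String, Integer> cache =
        new HashMap<String, Integer>();

    AggregateDocFreqCache(List<RemoteNode> nodes) {
        this.nodes = nodes;
    }

    /** Return the aggregate docFreq for a term: a cache hit costs no
     *  IPC; a miss costs one call per remote node, and the sum is
     *  cached for reuse by later queries. */
    synchronized int docFreq(String field, String text) {
        String key = field + ":" + text;
        Integer cached = cache.get(key);
        if (cached != null) {
            return cached.intValue();
        }
        int sum = 0;
        for (RemoteNode node : nodes) {
            sum += node.docFreq(field, text);
        }
        cache.put(key, Integer.valueOf(sum));
        return sum;
    }

    /** Called whenever any remote node opens a new Searcher, since the
     *  cached sums may then be stale. */
    synchronized void flush() {
        cache.clear();
    }
}
```

In the common case where popular query terms recur, most lookups hit
the cache and cost no IPC at all.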
This seems simple and avoids a great deal of IPC traffic, especially in
the common case where popular query terms are frequently reused. I
presume the auto-filters get pushed out to each remote node as part of
the query?

Chuck

> -----Original Message-----
> From: Doug Cutting [mailto:[EMAIL PROTECTED]
> Sent: Thursday, January 13, 2005 10:29 AM
> To: Lucene Developers List
> Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems
> with Similarity.docFreq() ?
>
> Chuck Williams wrote:
> > It just seems like a lot of IPC activity for each query. As things
> > stand now, I think you are proposing this?
> >   1. MultiSearcher calls the remote node to rewrite the query,
> > requiring serialization of the query.
> >   2. The remote node returns the rewritten query to the dispatcher
> > node, which requires serialization of the (potentially much larger)
> > rewritten query.
> >   3. The dispatcher node computes the weights. This requires a call
> > to each remote node for each term in the query to compute the
> > docFreqs; this can be an extremely large number of IPC calls (e.g.,
> > 1,000 terms in a rewritten query times 10 remote nodes = 10,000 IPC
> > calls).
> >   4. The weights are serialized (including the serialized
> > Similarity objects) and passed back to the remote nodes.
> >   5. The remote nodes execute the queries and pass results back to
> > the dispatcher node for collation.
> >
> > Is that right? This seems pretty expensive to me.
>
> I think that's right. For simple queries with a couple of terms it
> should not be too expensive. For queries that expand into thousands
> of terms, yes, it is expensive, but these are slow queries anyway.
> It's not clear how much worse this would make them. Yes, we might
> optimize it some, but first let's get things working correctly!
>
> An easy way to "optimize" this is to avoid queries that expand into
> large numbers of terms.
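One way to avoid such expansion, e.g. for date ranges, is a filter:
instead of rewriting the range into one term clause per matching value,
walk the sorted term dictionary once and OR each term's postings into a
BitSet. This self-contained sketch uses a TreeMap as a hypothetical
stand-in for an index's sorted term dictionary and postings; it is an
illustration of the technique, not Lucene code.

```java
import java.util.BitSet;
import java.util.SortedMap;
import java.util.TreeMap;

/** Sketch of a range filter: mark every document containing some term
 *  in [lower, upper] on one field, with no query rewriting. The
 *  TreeMap (term -> doc ids) is a hypothetical mini-index standing in
 *  for a real term dictionary and posting lists. */
class RangeFilterSketch {

    private final TreeMap<String, int[]> postings;

    RangeFilterSketch(TreeMap<String, int[]> postings) {
        this.postings = postings;
    }

    BitSet bits(String lower, String upper, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        // One pass over the terms in [lower, upper]; "\0" makes the
        // upper bound inclusive, since subMap excludes its toKey.
        SortedMap<String, int[]> range =
            postings.subMap(lower, upper + "\0");
        for (int[] docs : range.values()) {
            for (int doc : docs) {
                bits.set(doc);
            }
        }
        return bits;
    }
}
```

However many terms the range covers, nothing is serialized per term:
the range endpoints travel with the query, and the BitSet is built
locally on the node that holds the index.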
> I've never permitted wildcard, fuzzy or range queries in any system
> that I've deployed: they're simply too slow. When I need, e.g., date
> ranges, I use a Filter instead. The auto-filter proposal I've made
> could make this a lot easier. So I'd like to see that implemented
> before I worry about optimizing remote range or wildcard queries.
>
> > If we had a central docFreq table on the dispatcher node, the query
> > processing could be much simpler:
>
> Perhaps, but that's a big "if".
>
> >   1. MultiSearcher rewrites the query and computes the weights, all
> > locally on the central node.
>
> This could require a substantial change to rewrite implementations.
> Rewriting is currently passed a full IndexReader: a central docFreq
> table is not a full IndexReader. So we could add a search API for
> term enumeration independent of an IndexReader, then change all
> rewrite implementations to use this, and hope that none require other
> aspects of the IndexReader.
>
> >   2. The rewritten query with weights is passed to each remote node
> > (= 10 IPC calls in the example case above).
>
> This still serializes a huge query. A central docFreq table only
> provides a constant factor improvement. The rewritten query only
> needs to travel one way rather than round-trip.
>
> > If the aggregate docFreq table was replicated to each remote node,
> > then only the raw query would need to be passed, as the remote
> > nodes could each do the rewriting and weighting. However, this
> > would be offset by the extra complexity to manage the distribution
> > of the aggregate tables, which is probably not worth it.
> >
> > The methods required to keep an accurate central docFreq table
> > could be:
> >   1. Compute it initially by having the central node obtain and sum
> > the contributions from each remote node.
> >   2. On each incremental index on a remote node, send the central
> > node a set of deltas for each term whose docFreq was changed by the
> > incremental index.
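The two quoted steps for maintaining the central table could be
sketched like this. All names are hypothetical (not Lucene API), and
the transport of the delta maps from remote nodes is assumed, not
shown.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the quoted proposal: the central node seeds an aggregate
 *  docFreq table from each node's full contribution (step 1), then
 *  applies per-term deltas sent after each incremental index (step 2).
 *  Hypothetical names, not Lucene API. */
class CentralDocFreqTable {

    private final Map<String, Integer> aggregate =
        new HashMap<String, Integer>();

    /** Step 1: add one remote node's full term -> docFreq map. */
    synchronized void addContribution(Map<String, Integer> nodeDocFreqs) {
        applyDeltas(nodeDocFreqs);
    }

    /** Step 2: apply deltas (positive or negative) for the terms whose
     *  docFreq changed on a remote node. */
    synchronized void applyDeltas(Map<String, Integer> deltas) {
        for (Map.Entry<String, Integer> e : deltas.entrySet()) {
            Integer old = aggregate.get(e.getKey());
            int updated = (old == null ? 0 : old.intValue())
                + e.getValue().intValue();
            aggregate.put(e.getKey(), Integer.valueOf(updated));
        }
    }

    /** Aggregate docFreq across all nodes; 0 for unseen terms. */
    synchronized int docFreq(String term) {
        Integer n = aggregate.get(term);
        return n == null ? 0 : n.intValue();
    }
}
```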
>
> This sounds very hairy to me.
>
> The delta approach is problematic. At present a Searchable instance,
> like an IndexReader, does not change the set of documents it
> searches. At present, when you want to search an updated collection
> you construct a new Searcher. So this is (again) a substantive
> change. It means, e.g., that if folks cache things based on the
> Searcher, those caches might become invalid.
>
> > I think the question is how frequent and how expensive those two
> > steps would be in comparison to the difference in the query
> > processing.
>
> I think the first question is: can we get RemoteSearchables to work
> correctly and reasonably efficiently for simple queries?
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]