It just seems like a lot of IPC activity for each query. As things
stand now, I think this is what you are proposing:

  1. MultiSearcher calls the remote node to rewrite the query, requiring
     serialization of the query.
  2. The remote node returns the rewritten query to the dispatcher node,
     which requires serialization of the (potentially much larger)
     rewritten query.
  3. The dispatcher node computes the weights. This requires a call to
     each remote node for each term in the query to compute the
     docFreqs; this can be an extremely large number of IPC calls
     (e.g., 1,000 terms in a rewritten query times 10 remote nodes =
     10,000 IPC calls).
  4. The weights are serialized (including the serialized Similarity
     instances) and passed back to each remote node.
  5. The remote nodes execute the queries and pass results back to the
     dispatcher node for collation.
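
To put rough numbers on the steps above, here is a minimal sketch of the
call count. The class and method names are hypothetical, and it assumes
one rewrite call and one search call per remote node plus one docFreq
call per term per node; the 1,000 terms and 10 nodes are the example
figures from step 3.

```java
// Back-of-the-envelope IPC count for the per-query protocol listed above.
// IpcCallEstimate and its methods are illustrative names, not Lucene API.
public class IpcCallEstimate {

    // One docFreq(Term) call per term of the rewritten query, per remote node.
    static long docFreqCalls(int termsInRewrittenQuery, int remoteNodes) {
        return (long) termsInRewrittenQuery * remoteNodes;
    }

    // Assumption: one rewrite call and one search call per remote node.
    static long totalCalls(int termsInRewrittenQuery, int remoteNodes) {
        long rewriteCalls = remoteNodes;
        long searchCalls = remoteNodes;
        return rewriteCalls
                + docFreqCalls(termsInRewrittenQuery, remoteNodes)
                + searchCalls;
    }

    public static void main(String[] args) {
        int terms = 1000; // terms in the rewritten query (example from step 3)
        int nodes = 10;   // remote nodes (example from step 3)
        System.out.println("docFreq calls: " + docFreqCalls(terms, nodes));
        System.out.println("total calls:   " + totalCalls(terms, nodes));
    }
}
```

Under these assumptions the docFreq calls dominate everything else in the
exchange.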
Is that right? This seems pretty expensive to me.

If we had a central docFreq table on the dispatcher node, the query
processing could be much simpler:

  1. MultiSearcher rewrites the query and computes the weights, all
     locally on the central node.
  2. The rewritten query with weights is passed to each remote node
     (= 10 IPC calls in the example case above).
  3. Each remote node processes the rewritten query. Here, the remote
     node could rewrite the query again to eliminate term expansions for
     terms it doesn't have, as Paul suggests, or it could omit this step
     (I believe the only difference in the result is scoring, and it's
     not clear to me what the best way to score this case is).
  4. The results are passed back and collated.

If the aggregate docFreq table were replicated to each remote node, then
only the raw query would need to be passed, as the remote nodes could
each do the rewriting and weighting. However, this would be offset by
the extra complexity of managing the distribution of the aggregate
tables, which is probably not worth it.

The methods required to keep an accurate central docFreq table could be:

  1. Compute it initially by having the central node obtain and sum the
     contributions from each remote node.
  2. On each incremental index update on a remote node, send the central
     node a set of deltas for each term whose docFreq was changed by
     that update.

I think the question is how frequent and how expensive those two steps
would be in comparison to the difference in the query processing.

Chuck

> -----Original Message-----
> From: Doug Cutting [mailto:[EMAIL PROTECTED]
> Sent: Thursday, January 13, 2005 9:14 AM
> To: Lucene Developers List
> Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems with
> Similarity.docFreq() ?
>
> Chuck Williams wrote:
> > I think there is another problem here. It is currently the Weight
> > implementations that do rewrite(), which requires access to the index,
> > not just to the idf's.
> > E.g., RangeQuery.rewrite() must find the terms
> > in the index within the range. So, the Weight cannot be computed in the
> > MultiSearcher, as it does not have direct access to the remote index.
>
> rewrite() is actually called before the weight is constructed. In the
> remote case, rewrite() is another IPC. So, when a query is executed on
> a MultiSearcher of RemoteSearchables, the following remote calls are
> made:
>
> 1. RemoteSearchable.rewrite(Query) is called
> 2. RemoteSearchable.docFreq(Term) is called for each term in the
> rewritten query while constructing a Weight.
> 3. RemoteSearchable.search(Weight, ...) is called.
>
> So I don't think this is a problem.
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]