It just seems like a lot of IPC activity for each query. As things stand now, I think you are proposing this:

1. MultiSearcher calls the remote node to rewrite the query, requiring serialization of the query.

2. The remote node returns the rewritten query to the dispatcher node, which requires serialization of the (potentially much larger) rewritten query.

3. The dispatcher node computes the weights. This requires a call to each remote node for each term in the query to compute the docFreqs; this can be an extremely large number of IPC calls (e.g., 1,000 terms in a rewritten query times 10 remote nodes = 10,000 IPC calls).

4. The weights are serialized (including the serialized Similarity instances) and passed back to each remote node.

5. The remote nodes execute the queries and pass results back to the dispatcher node for collation.
Is that right? This seems pretty expensive to me.
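To make the arithmetic in step 3 concrete, here is a toy sketch (the method names are hypothetical, not part of any real API) contrasting one docFreq call per (term, node) pair with a single batched call per node:

```java
// Illustrative arithmetic only; perTermCalls/batchedCalls are hypothetical.
class IpcCost {
    // One IPC call per (term, node) pair, as in step 3 above.
    static int perTermCalls(int terms, int nodes) {
        return terms * nodes;
    }
    // One batched call per node, if docFreqs for many terms
    // could be fetched in a single request.
    static int batchedCalls(int terms, int nodes) {
        return nodes;
    }
    public static void main(String[] args) {
        // 1,000 rewritten terms across 10 remote nodes.
        System.out.println(perTermCalls(1000, 10)); // 10000
        System.out.println(batchedCalls(1000, 10)); // 10
    }
}
```

Batching would reduce the count from terms × nodes to just nodes, though the payload per call grows correspondingly.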
I think that's right. For simple queries with a couple of terms it should not be too expensive. For queries that expand into thousands of terms, yes, it is expensive, but these are slow queries anyway. It's not clear how much worse this would make them. Yes, we might optimize it some, but first let's get things working correctly!
An easy way to "optimize" this is to avoid queries that expand into large numbers of terms. I've never permitted wildcard, fuzzy or range queries in any system that I've deployed: they're simply too slow. When I need, e.g., date ranges, I use a Filter instead. The auto-filter proposal I've made could make this a lot easier. So I'd like to see that implemented before I worry about optimizing remote range or wildcard queries.
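As a rough illustration of why a filter sidesteps term expansion entirely, here is a toy date-range filter over a precomputed per-document date column (a sketch only, not Lucene's Filter API; the sortable yyyymmdd layout is an assumption):

```java
import java.util.BitSet;

// Toy sketch: a filter marks the matching documents with one bit each,
// instead of expanding a range into thousands of term queries.
class DateRangeFilter {
    // dates[i] holds a sortable yyyymmdd value for document i (assumed layout).
    static BitSet bits(int[] dates, int min, int max) {
        BitSet bs = new BitSet(dates.length);
        for (int i = 0; i < dates.length; i++) {
            if (dates[i] >= min && dates[i] <= max) {
                bs.set(i);
            }
        }
        return bs;
    }
}
```

The bit set can be computed once and cached, so repeated queries over the same range pay nothing for rewriting.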
If we had a central docFreq table on the dispatcher node, the query processing could be much simpler:
Perhaps, but that's a big "if".
1. MultiSearcher rewrites the query and computes the weights, all locally on the central node.
This could require a substantial change to rewrite implementations. Rewriting is currently passed a full IndexReader: a central docFreq table is not a full IndexReader. So we could add a search API for term enumeration independent of an IndexReader, then change all rewrite implementations to use this, and hope that none require other aspects of the IndexReader.
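To illustrate what such a term-enumeration API might look like, here is a minimal sketch (all names here are hypothetical, not existing Lucene interfaces) of a docFreq/term-enumeration view that a central table could implement without being a full IndexReader:

```java
import java.util.Iterator;
import java.util.TreeMap;

// Hypothetical narrow interface: just what rewriting needs,
// independent of a full IndexReader.
interface TermStats {
    int docFreq(String field, String text);
    // Terms at or after the given start text, in order (for enumeration).
    Iterator<String> terms(String field, String startText);
}

// A central docFreq table built by summing contributions from remote nodes.
class CentralDocFreqTable implements TermStats {
    private final TreeMap<String, Integer> freqs = new TreeMap<>();

    // Add (or sum in) a node's docFreq contribution for one term.
    public void add(String field, String text, int count) {
        freqs.merge(field + ":" + text, count, Integer::sum);
    }

    public int docFreq(String field, String text) {
        return freqs.getOrDefault(field + ":" + text, 0);
    }

    public Iterator<String> terms(String field, String startText) {
        return freqs.tailMap(field + ":" + startText).keySet().iterator();
    }
}
```

The open question raised above remains: rewrite implementations that depend on other aspects of the IndexReader could not be ported to such an interface.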
2. The rewritten query with weights is passed to each remote node (= 10 IPC calls in the example case above).
This still serializes a huge query. A central docFreq table provides only a constant-factor improvement: the rewritten query needs to travel only one way rather than round-trip.
If the aggregate docFreq table were replicated to each remote node, then only the raw query would need to be passed, since the remote nodes could each do the rewriting and weighting themselves. However, the savings would be offset by the extra complexity of managing the distribution of the aggregate tables, which is probably not worth it.
The methods required to keep an accurate central docFreq table could be:

1. Compute it initially by having the central node obtain and sum the contributions from each remote node.

2. On each incremental index on a remote node, send the central node a set of deltas for each term whose docFreq was changed by the incremental index.
This sounds very hairy to me.
The delta approach is problematic. At present a Searchable instance, like an IndexReader, does not change the set of documents it searches: when you want to search an updated collection, you construct a new Searcher. So this is (again) a substantive change. It means, e.g., that if folks cache things keyed on the Searcher, those caches might become invalid.
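For what it's worth, the delta computation itself (step 2 of the proposal above) is mechanically simple, as this hedged sketch shows; the hard part is everything around it, per the objections above. Names here are illustrative, not an existing API:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the per-term deltas a remote node would ship to the
// central node after an incremental index.
class DocFreqDelta {
    static Map<String, Integer> deltas(Map<String, Integer> before,
                                       Map<String, Integer> after) {
        Map<String, Integer> out = new HashMap<>();
        Set<String> all = new HashSet<>(before.keySet());
        all.addAll(after.keySet());
        for (String term : all) {
            int d = after.getOrDefault(term, 0) - before.getOrDefault(term, 0);
            if (d != 0) {
                out.put(term, d); // only changed terms travel over the wire
            }
        }
        return out;
    }
}
```

Note that even this simple version requires each remote node to snapshot its docFreqs before and after every incremental index, which is exactly the kind of bookkeeping the objection above calls hairy.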
I think the question is how frequent and how expensive those two steps would be in comparison to the savings in query processing.
I think the first question is: can we get RemoteSearchables to work correctly and reasonably efficiently for simple queries?
Doug