Doug Cutting wrote:
> It would indeed be nice to be able to short-circuit rewriting for
> queries where it is a no-op.  Do you have a proposal for how this could
> be done?
First, this gets into the other part of Bug 31841. I don't believe MultiSearcher.rewrite() is ever called. Rewriting is done in the Weights, which invoke the rewrite() method of the Searcher, and that is always the Searcher invoked by the MultiSearcher, not the MultiSearcher itself.

In fact, MultiSearcher.rewrite() is broken. It requires Query.combine(), which is unsupported except for the derived queries (i.e., those for which rewriting is not a no-op). When I added topmostSearcher to get the Weights to call MultiSearcher.docFreq(), that also caused them to call MultiSearcher.rewrite(), which blows up on, for example, a simple TermQuery, because there is no TermQuery.combine(). That's why my patch contains a new default implementation for Query.combine() (which, as noted in the bug report, is probably not a good idea in general). So I don't believe there is any valid rewrite() implementation for MultiSearcher to start from, unless I've completely misunderstood something.

To address the question above, RemoteSearchable.rewrite() should be a no-op, i.e. always return the query it was given. For good error handling, it should verify that the query does not require rewriting. This requires some mechanism to determine whether or not a query requires rewriting. The challenge here is that some query types have a non-trivial rewrite() method not because they require rewriting themselves, but because they might have subqueries that require rewriting (e.g., BooleanQuery). Other query types (e.g., MultiTermQuery) always require rewriting, while those that implement Weights never require it.

I think an upward incompatibility is required in the API to address this. If that is acceptable, then this could work:

1. Add a new interface called Rewritable that specifies a boolean rewriteRequired() method.
2. Have Query implement Rewritable but NOT provide an implementation for rewriteRequired(). This will force all applications to add support for this in order to upgrade.
3. Change all the Weights to call Query.maybeRewrite() instead of Query.rewrite().
4. Have Query.maybeRewrite() only call Query.rewrite() if Query.rewriteRequired() is true.
5. Have RemoteSearchable.maybeRewrite() throw an Exception if Query.rewriteRequired() is true.
6. Implement rewriteRequired() for all the built-in Query types (true for derived queries, false for primitive queries, and for queries with subqueries the logical OR of rewriteRequired() over all the subqueries).

Maybe there's a better way, but this should work. It does require an extra pass over the query.

There is a potential hole if there are applications that implement new primitive queries, i.e. have Weights that directly call Query.rewrite(). This hole could be (mostly) plugged by renaming rewrite(), but that would introduce another upward incompatibility.

An optimization could omit the call to rewriteRequired() in Query.maybeRewrite(), as this mechanism is really only needed in RemoteSearchable (and could be beneficial in MultiSearcher).

There is still the need to properly implement Query.combine() for all query types (which is greatly simplified by a good default implementation).

Chuck

> -----Original Message-----
> From: Doug Cutting [mailto:[EMAIL PROTECTED]
> Sent: Thursday, January 13, 2005 11:41 AM
> To: Lucene Developers List
> Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems with
> Similarity.docFreq() ?
>
> Chuck Williams wrote:
> > If auto-filters can provide an effective implementation for RangeQuery's
> > that avoids rewriting, and we can give up MultiTermQuery and PrefixQuery
> > in the distributed environment, then how about something like this
> > refinement:
> > 1. No rewriting is done.
>
> It would indeed be nice to be able to short-circuit rewriting for
> queries where it is a no-op.  Do you have a proposal for how this could
> be done?
>
> > 2. 
> >    The central node maintains a cache of aggregate docFreq data that
> >    is incrementally built on demand, and flushed after any remote node
> >    opens a new Searcher.
> > 3. The central node computes the Weights by accessing the docFreq for
> >    each query term.  This looks the value up in the cache, or queries it
> >    from each remote node, sums the results, and caches the result.
> >
> > This seems simple and avoids a great deal of IPC traffic, especially in
> > the common case where popular query terms are frequently reused.
>
> I think this sort of a docFreq cache would be easy to build into either
> MultiSearcher or RemoteSearchable.
>
> > I presume the auto-filters get pushed out to each remote node as part of
> > the query?
>
> They're not yet implemented, so we don't know.  One implementation would
> be that Scorers would automatically use filters for amenable query
> clauses.  If that's the way things are done then yes, the filters would
> essentially be a part of the query.  No matter how they're implemented,
> we should take care to consider remote performance.
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
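
P.S. For concreteness, here is a rough sketch of the rewriteRequired() / maybeRewrite() idea proposed above. Everything in it is a simplified, hypothetical stand-in and not the real org.apache.lucene.search classes: the nested Query, TermQuery, PrefixQuery, BooleanQuery and RemoteSearchable types only mimic the names, the prefix expansion runs against a fixed list of terms instead of an index, and IllegalStateException is just an assumed choice of exception.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for the types discussed above.
// None of this is the real Lucene API.
final class RewriteSketch {

    /** The proposed interface. */
    interface Rewritable {
        boolean rewriteRequired();
    }

    /** Query leaves rewriteRequired() abstract, so every subclass must decide. */
    static abstract class Query implements Rewritable {
        /** Default rewrite is a no-op, as for primitive queries. */
        Query rewrite() {
            return this;
        }

        /** What the Weights would call instead of rewrite(). */
        final Query maybeRewrite() {
            return rewriteRequired() ? rewrite() : this;
        }
    }

    /** Primitive query: never requires rewriting. */
    static final class TermQuery extends Query {
        final String term;
        TermQuery(String term) { this.term = term; }
        public boolean rewriteRequired() { return false; }
    }

    /** Derived query: always requires rewriting (toy expansion over a fixed term list). */
    static final class PrefixQuery extends Query {
        final String prefix;
        final List<String> knownTerms; // assumption: stands in for the index's terms
        PrefixQuery(String prefix, List<String> knownTerms) {
            this.prefix = prefix;
            this.knownTerms = knownTerms;
        }
        public boolean rewriteRequired() { return true; }
        @Override Query rewrite() {
            BooleanQuery expanded = new BooleanQuery();
            for (String t : knownTerms) {
                if (t.startsWith(prefix)) expanded.add(new TermQuery(t));
            }
            return expanded;
        }
    }

    /** Compound query: requires rewriting only if some subquery does. */
    static final class BooleanQuery extends Query {
        final List<Query> clauses = new ArrayList<>();
        BooleanQuery add(Query q) { clauses.add(q); return this; }
        public boolean rewriteRequired() {
            for (Query q : clauses) {
                if (q.rewriteRequired()) return true;
            }
            return false;
        }
        @Override Query rewrite() {
            BooleanQuery out = new BooleanQuery();
            for (Query q : clauses) out.add(q.maybeRewrite());
            return out;
        }
    }

    /** Remote side: a no-op that rejects queries which still need rewriting. */
    static final class RemoteSearchable {
        Query maybeRewrite(Query q) {
            if (q.rewriteRequired()) {
                throw new IllegalStateException("query must be rewritten before it is sent");
            }
            return q;
        }
    }
}
```

With this shape a plain TermQuery passes through maybeRewrite() untouched, a BooleanQuery is only rebuilt when one of its clauses is a derived query, and the remote side fails fast instead of silently attempting a rewrite.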