On Thursday 13 January 2005 19:29, Doug Cutting wrote:
> Chuck Williams wrote:
> > It just seems like a lot of IPC activity for each query.  As things
> > stand now, I think you are proposing this?
> >   1.  MultiSearcher calls the remote node to rewrite the query,
> > requiring serialization of the query.
> >   2.  The remote node returns the rewritten query to the dispatcher
> > node, which requires serialization of the (potentially much larger)
> > rewritten query.
> >   3.  The dispatcher node computes the weights.  This requires a call to
> > each remote node for each term in the query to compute the docFreq's;
> > this can be an extremely large number of IPC calls (e.g., 1,000 terms in
> > a rewritten query times 10 remote nodes = 10,000 IPC calls).
> >   4.  The weights are serialized (including the serialized Similarity
> > objects) and passed back to the remote nodes.
> >   5.  The remote nodes execute the queries and pass results back to the
> > dispatcher node for collation.
> > 
> > Is that right?  This seems pretty expensive to me.
> 
> I think that's right.  For simple queries with a couple of terms it 
> should not be too expensive.  For queries that expand into thousands of 
> terms, yes, it is expensive, but these are slow queries anyway.  It's 
> not clear how much worse this would make them.  Yes, we might optimize 
> it some, but first let's get things working correctly!
> 
> An easy way to "optimize" this is to avoid queries that expand into 
> large numbers of terms.  I've never permitted wildcard, fuzzy or range 
> queries in any system that I've deployed: they're simply too slow.  When 
> I need, e.g., date ranges, I use a Filter instead.  The auto-filter 
> proposal I've made could make this a lot easier.  So I'd like to see 
> that implemented before I worry about optimizing remote range or 
> wildcard queries.
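The Filter idea above, sketched in Python for brevity (Lucene's actual
classes here are Filter/DateFilter, in Java): precompute a bitset of the
documents inside the range once, then intersect it with the query's hits,
so the range never expands into query terms at all.

```python
# Illustrative sketch of filtering by date range instead of using a
# range query. The bitset can be cached and reused across queries.

def build_date_filter(doc_dates, lo, hi):
    """One bit per document: True if the doc's date falls in [lo, hi]."""
    return [lo <= d <= hi for d in doc_dates]

def filtered_hits(scored_doc_ids, date_filter):
    """Drop hits outside the range; no extra query terms are needed."""
    return [doc for doc in scored_doc_ids if date_filter[doc]]

# Dates stored as yyyymmdd ints, one per document (made-up data).
doc_dates = [20050110, 20050111, 20050112, 20050113, 20050114]
f = build_date_filter(doc_dates, 20050111, 20050113)
assert filtered_hits([0, 2, 4], f) == [2]
```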
> 
> > If we had a central docFreq table on the dispatcher node, the query
> > processing could be much simpler:
> 
> Perhaps, but that's a big "if".
> 
> >   1.  MultiSearcher rewrites the query and computes the weights, all
> > locally on the central node.
> 
> This could require a substantial change to rewrite implementations. 
> Rewriting is currently passed a full IndexReader: a central docFreq 
> table is not a full IndexReader.  So we could add a search API for term 
> enumeration independent of an IndexReader, then change all rewrite 
> implementations to use this, and hope that none require other aspects of 
> the IndexReader.
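The term-enumeration API suggested above might look something like this
(names invented for illustration, in Python rather than Java): a docFreq
table that is not a full IndexReader but still supports the two operations
rewrite needs, enumerating terms in order and reading a docFreq.

```python
# Hypothetical shape of a docFreq table that rewrite could use in
# place of a full IndexReader. Not Lucene API; names are made up.

class DocFreqTable:
    def __init__(self, freqs):
        self._freqs = dict(freqs)      # term -> global docFreq

    def terms(self, prefix=""):
        """Enumerate matching terms in order, as wildcard rewrite does."""
        return sorted(t for t in self._freqs if t.startswith(prefix))

    def doc_freq(self, term):
        return self._freqs.get(term, 0)

def rewrite_prefix_query(table, prefix):
    """Expand a prefix query into its matching terms, centrally."""
    return [(t, table.doc_freq(t)) for t in table.terms(prefix)]

table = DocFreqTable({"luce": 3, "lucene": 7, "lucid": 2, "apache": 5})
assert rewrite_prefix_query(table, "luc") == [
    ("luce", 3), ("lucene", 7), ("lucid", 2)]
```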
> 
> >   2.  The rewritten query with weights is passed to each remote node (=
> > 10 IPC calls in the example case above).
> 
> This still serializes a huge query.  A central docFreq table only 
> provides a constant factor improvement.  The rewritten query only needs 
> to travel one-way rather than round-trip.

One could pass the original query with only its query weights changed
to take the global aspects of the idf into account.
Term expansion would then have to be done both centrally, to determine the
idf weight factors, and locally, to do the actual searching and scoring
without further idf computations.
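Numerically, the weight adjustment amounts to computing the idf from the
global docFreq and document count instead of a single node's local figures.
A small sketch using Lucene's classic idf = 1 + ln(numDocs / (docFreq + 1)),
with made-up per-node numbers:

```python
# Local vs. global idf for one query term spread over three remote
# nodes. The (numDocs, docFreq) pairs are invented for illustration.

import math

def idf(num_docs, doc_freq):
    """Lucene's classic idf formula."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

nodes = [(1000, 9), (2000, 39), (500, 0)]

global_n = sum(n for n, _ in nodes)        # 3500 docs in total
global_df = sum(df for _, df in nodes)     # 48 matching docs in total

local_idf = idf(*nodes[0])                 # what node 0 would use alone
global_idf = idf(global_n, global_df)      # what the dispatcher sends

assert round(local_idf, 3) == 5.605
assert round(global_idf, 3) == 5.269       # the weights really do differ
```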

Perhaps an easy way to send the document frequencies to the central node
is by sending the field info and term dictionary of each local index segment.
It's not ideal, because the FreqDelta, ProxDelta, and SkipDelta entries are
superfluous, but it would be a relatively easy start.
Even the existing segment merging code could be partially reused for
centrally summing the document frequencies.
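The central summing step is essentially the same k-way merge that segment
merging already does: the term dictionaries arrive sorted by term, so equal
terms line up and their docFreqs can be added. A Python sketch (heapq.merge
standing in for the merge machinery; data is made up):

```python
# Merge sorted per-segment (term, docFreq) lists and sum the docFreqs
# of equal terms, as a central node could do with shipped dictionaries.

import heapq
from itertools import groupby

def sum_doc_freqs(term_dicts):
    """term_dicts: per-segment lists of (term, docFreq), sorted by term."""
    merged = heapq.merge(*term_dicts)          # still sorted by term
    return [(term, sum(df for _, df in group))
            for term, group in groupby(merged, key=lambda p: p[0])]

seg_a = [("apache", 3), ("lucene", 5)]
seg_b = [("apache", 2), ("index", 1), ("lucene", 4)]
assert sum_doc_freqs([seg_a, seg_b]) == [
    ("apache", 5), ("index", 1), ("lucene", 9)]
```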

Regards,
Paul Elschot

