It just seems like a lot of IPC activity for each query.  As things
stand now, I think this is what you are proposing:
  1.  MultiSearcher calls the remote node to rewrite the query,
requiring serialization of the query.
  2.  The remote node returns the rewritten query to the dispatcher
node, which requires serialization of the (potentially much larger)
rewritten query.
  3.  The dispatcher node computes the weights.  This requires a call to
each remote node for each term in the query to compute the docFreq's;
this can be an extremely large number of IPC calls (e.g., 1,000 terms in
a rewritten query times 10 remote nodes = 10,000 IPC calls).
  4.  The weights are serialized (including the serialized Similarity's)
and passed back to each remote node.
  5.  The remote nodes execute the queries and pass results back to the
dispatcher node for collation.

Is that right?  This seems pretty expensive to me.
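
To put a rough number on the flow above, here is a minimal sketch that
counts remote calls per query.  It assumes one rewrite() call and one
search() call per remote node, plus one docFreq() call per rewritten
term per remote node; serialization cost is ignored.  The class and
method names are made up for illustration and are not part of Lucene.

```java
// Hypothetical helper, not a Lucene class: counts remote calls per query.
class IpcCallCount {

    // Assumed per-query calls: rewrite() once per node, docFreq() once per
    // rewritten term per node, search() once per node.
    static long currentFlow(int rewrittenTerms, int remoteNodes) {
        long rewriteCalls = remoteNodes;
        long docFreqCalls = (long) rewrittenTerms * remoteNodes;
        long searchCalls = remoteNodes;
        return rewriteCalls + docFreqCalls + searchCalls;
    }
}
```

For the example above, currentFlow(1000, 10) gives 10,020 calls, of
which 10,000 are docFreq() lookups.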

If we had a central docFreq table on the dispatcher node, the query
processing could be much simpler:
  1.  MultiSearcher rewrites the query and computes the weights, all
locally on the central node.
  2.  The rewritten query with weights is passed to each remote node (=
10 IPC calls in the example case above).
  3.  Each remote node processes the rewritten query.  Here, the remote
node could rewrite the query again to eliminate term expansions for
terms it doesn't have as Paul suggests, or it could omit this step (I
believe the only difference in result is scoring, and it's not clear to
me the best way to score this case).
  4.  The results are passed back and collated.
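
As a sketch of step 1, the dispatcher could look up each rewritten term
in the aggregate table and compute its idf with no remote calls at all.
The names below are hypothetical, the table is modeled as a plain Map,
and the idf formula mirrors what I understand DefaultSimilarity to use
(log(numDocs / (docFreq + 1)) + 1); treat that as an assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical dispatcher-side helper, not a Lucene class.
class LocalWeights {

    // Assumed idf formula: log(numDocs / (docFreq + 1)) + 1.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // One idf per rewritten term, read from the aggregate table held on
    // the dispatcher; a term missing from the table gets docFreq 0.
    static Map<String, Float> idfs(List<String> rewrittenTerms,
                                   Map<String, Integer> aggregateDocFreq,
                                   int totalDocs) {
        Map<String, Float> result = new LinkedHashMap<>();
        for (String term : rewrittenTerms) {
            int df = aggregateDocFreq.getOrDefault(term, 0);
            result.put(term, idf(df, totalDocs));
        }
        return result;
    }
}
```

With the weights computed this way, the only IPC left per query is the
one search call to each remote node.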

If the aggregate docFreq table was replicated to each remote node, then
only the raw query would need to be passed as the remote nodes could
each do the rewriting and weighting.  However, this would be offset by
the extra complexity to manage the distribution of the aggregate tables,
which is probably not worth it.

The steps required to keep an accurate central docFreq table could be:
  1.  Compute it initially by having the central node obtain and sum the
contributions from each remote node.
  2.  After each incremental indexing run on a remote node, send the
central node a set of deltas, one for each term whose docFreq was
changed by that run.
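
The two maintenance steps could be sketched roughly as follows.  This is
only an illustration: the class is hypothetical, terms are plain
strings, and deltas are signed changes in docFreq.  A real table would
also have to track the total document count, which is left out here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical central table kept on the dispatcher node.
class AggregateDocFreq {
    private final Map<String, Integer> table = new HashMap<>();

    // Step 1: build the table by summing each remote node's docFreq map.
    void build(List<Map<String, Integer>> perNodeDocFreqs) {
        table.clear();
        for (Map<String, Integer> node : perNodeDocFreqs) {
            applyDeltas(node);
        }
    }

    // Step 2: apply signed per-term deltas sent by a remote node after an
    // incremental indexing run; terms that drop to zero are removed.
    void applyDeltas(Map<String, Integer> deltas) {
        for (Map.Entry<String, Integer> e : deltas.entrySet()) {
            int updated = table.getOrDefault(e.getKey(), 0) + e.getValue();
            if (updated == 0) {
                table.remove(e.getKey());
            } else {
                table.put(e.getKey(), updated);
            }
        }
    }

    int docFreq(String term) {
        return table.getOrDefault(term, 0);
    }
}
```

The size of each delta message is bounded by the number of distinct
terms touched by the incremental run, not by the size of the index.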

I think the question is how frequent and how expensive those two steps
would be compared to the savings in query processing.

Chuck

  > -----Original Message-----
  > From: Doug Cutting [mailto:[EMAIL PROTECTED]]
  > Sent: Thursday, January 13, 2005 9:14 AM
  > To: Lucene Developers List
  > Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems
  > with Similarity.docFreq() ?
  > 
  > Chuck Williams wrote:
  > > I think there is another problem here.  It is currently the Weight
  > > implementations that do rewrite(), which requires access to the
  > > index, not just to the idf's.  E.g., RangeQuery.rewrite() must find
  > > the terms in the index within the range.  So, the Weight cannot be
  > > computed in the MultiSearcher, as it does not have direct access to
  > > the remote index.
  > 
  > rewrite() is actually called before the weight is constructed.  In
  > the remote case, rewrite() is another IPC.  So, when a query is
  > executed on a MultiSearcher of RemoteSearchables, the following
  > remote calls are made:
  > 
  > 1. RemoteSearchable.rewrite(Query) is called
  > 2. RemoteSearchable.docFreq(Term) is called for each term in the
  > rewritten query while constructing a Weight.
  > 3. RemoteSearchable.search(Weight, ...) is called.
  > 
  > So I don't think this is a problem.
  > 
  > Doug
  > 
  > ---------------------------------------------------------------------
  > To unsubscribe, e-mail: [EMAIL PROTECTED]
  > For additional commands, e-mail: [EMAIL PROTECTED]

