If auto-filters can provide an effective implementation for RangeQuery
that avoids rewriting, and we can give up MultiTermQuery and
PrefixQuery in the distributed environment, then how about something
like this refinement:

  1. No rewriting is done.
  2. The central node maintains a cache of aggregate docFreq data that
is incrementally built on demand, and flushed after any remote node
opens a new Searcher.
  3. The central node computes the Weights by accessing the docFreq for
each query term. The docFreq is looked up in the cache; on a miss, it
is queried from each remote node, summed, and the sum is cached.
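Steps 2 and 3 could be sketched roughly like this. This is only an
illustration of the proposal, not Lucene API: the names here
(AggregateDocFreqCache, RemoteNode) are hypothetical, and the per-node
IPC call is stubbed out as an interface method.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the proposed central-node docFreq cache (hypothetical
 *  names, not Lucene API). */
class AggregateDocFreqCache {

    /** Stand-in for the remote docFreq IPC call to one node. */
    interface RemoteNode {
        int docFreq(String field, String text);
    }

    private final List<RemoteNode> nodes;
    private final Map<String, Integer> cache =
        new HashMap<String, Integer>();

    AggregateDocFreqCache(List<RemoteNode> nodes) {
        this.nodes = nodes;
    }

    /** Return the aggregate docFreq for a term: a cache hit costs no
     *  IPC; a miss costs one call per remote node, and the sum is
     *  cached for reuse by later queries. */
    synchronized int docFreq(String field, String text) {
        String key = field + ":" + text;
        Integer cached = cache.get(key);
        if (cached != null) {
            return cached.intValue();
        }
        int sum = 0;
        for (RemoteNode node : nodes) {
            sum += node.docFreq(field, text);
        }
        cache.put(key, Integer.valueOf(sum));
        return sum;
    }

    /** Called whenever any remote node opens a new Searcher, since the
     *  cached sums may then be stale. */
    synchronized void flush() {
        cache.clear();
    }
}
```

In the common case where popular query terms recur, most lookups hit
the cache and cost no IPC at all.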
This seems simple and avoids a great deal of IPC traffic, especially in
the common case where popular query terms are frequently reused. I
presume the auto-filters get pushed out to each remote node as part of
the query?

Chuck

> -----Original Message-----
> From: Doug Cutting [mailto:[EMAIL PROTECTED]
> Sent: Thursday, January 13, 2005 10:29 AM
> To: Lucene Developers List
> Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems
> with Similarity.docFreq() ?
>
> Chuck Williams wrote:
> > It just seems like a lot of IPC activity for each query. As things
> > stand now, I think you are proposing this?
> >   1. MultiSearcher calls the remote node to rewrite the query,
> > requiring serialization of the query.
> >   2. The remote node returns the rewritten query to the dispatcher
> > node, which requires serialization of the (potentially much larger)
> > rewritten query.
> >   3. The dispatcher node computes the weights. This requires a call
> > to each remote node for each term in the query to compute the
> > docFreqs; this can be an extremely large number of IPC calls (e.g.,
> > 1,000 terms in a rewritten query times 10 remote nodes = 10,000 IPC
> > calls).
> >   4. The weights are serialized (including the serialized
> > Similarity objects) and passed back to the remote nodes.
> >   5. The remote nodes execute the queries and pass results back to
> > the dispatcher node for collation.
> >
> > Is that right? This seems pretty expensive to me.
>
> I think that's right. For simple queries with a couple of terms it
> should not be too expensive. For queries that expand into thousands
> of terms, yes, it is expensive, but these are slow queries anyway.
> It's not clear how much worse this would make them. Yes, we might
> optimize it some, but first let's get things working correctly!
>
> An easy way to "optimize" this is to avoid queries that expand into
> large numbers of terms.
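One way to avoid such expansion, e.g. for date ranges, is a filter:
instead of rewriting the range into one term clause per matching value,
walk the sorted term dictionary once and OR each term's postings into a
BitSet. This self-contained sketch uses a TreeMap as a hypothetical
stand-in for an index's sorted term dictionary and postings; it is an
illustration of the technique, not Lucene code.

```java
import java.util.BitSet;
import java.util.SortedMap;
import java.util.TreeMap;

/** Sketch of a range filter: mark every document containing some term
 *  in [lower, upper] on one field, with no query rewriting. The
 *  TreeMap (term -> doc ids) is a hypothetical mini-index standing in
 *  for a real term dictionary and posting lists. */
class RangeFilterSketch {

    private final TreeMap<String, int[]> postings;

    RangeFilterSketch(TreeMap<String, int[]> postings) {
        this.postings = postings;
    }

    BitSet bits(String lower, String upper, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        // One pass over the terms in [lower, upper]; "\0" makes the
        // upper bound inclusive, since subMap excludes its toKey.
        SortedMap<String, int[]> range =
            postings.subMap(lower, upper + "\0");
        for (int[] docs : range.values()) {
            for (int doc : docs) {
                bits.set(doc);
            }
        }
        return bits;
    }
}
```

However many terms the range covers, nothing is serialized per term:
the range endpoints travel with the query, and the BitSet is built
locally on the node that holds the index.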
> I've never permitted wildcard, fuzzy or range queries in any system
> that I've deployed: they're simply too slow. When I need, e.g., date
> ranges, I use a Filter instead. The auto-filter proposal I've made
> could make this a lot easier. So I'd like to see that implemented
> before I worry about optimizing remote range or wildcard queries.
>
> > If we had a central docFreq table on the dispatcher node, the query
> > processing could be much simpler:
>
> Perhaps, but that's a big "if".
>
> >   1. MultiSearcher rewrites the query and computes the weights, all
> > locally on the central node.
>
> This could require a substantial change to rewrite implementations.
> Rewriting is currently passed a full IndexReader: a central docFreq
> table is not a full IndexReader. So we could add a search API for
> term enumeration independent of an IndexReader, then change all
> rewrite implementations to use this, and hope that none require other
> aspects of the IndexReader.
>
> >   2. The rewritten query with weights is passed to each remote node
> > (= 10 IPC calls in the example case above).
>
> This still serializes a huge query. A central docFreq table only
> provides a constant factor improvement. The rewritten query only
> needs to travel one way rather than round-trip.
>
> > If the aggregate docFreq table was replicated to each remote node,
> > then only the raw query would need to be passed, as the remote
> > nodes could each do the rewriting and weighting. However, this
> > would be offset by the extra complexity to manage the distribution
> > of the aggregate tables, which is probably not worth it.
> >
> > The methods required to keep an accurate central docFreq table
> > could be:
> >   1. Compute it initially by having the central node obtain and sum
> > the contributions from each remote node.
> >   2. On each incremental index on a remote node, send the central
> > node a set of deltas for each term whose docFreq was changed by the
> > incremental index.
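The two quoted steps for maintaining the central table could be
sketched like this. All names are hypothetical (not Lucene API), and
the transport of the delta maps from remote nodes is assumed, not
shown.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the quoted proposal: the central node seeds an aggregate
 *  docFreq table from each node's full contribution (step 1), then
 *  applies per-term deltas sent after each incremental index (step 2).
 *  Hypothetical names, not Lucene API. */
class CentralDocFreqTable {

    private final Map<String, Integer> aggregate =
        new HashMap<String, Integer>();

    /** Step 1: add one remote node's full term -> docFreq map. */
    synchronized void addContribution(Map<String, Integer> nodeDocFreqs) {
        applyDeltas(nodeDocFreqs);
    }

    /** Step 2: apply deltas (positive or negative) for the terms whose
     *  docFreq changed on a remote node. */
    synchronized void applyDeltas(Map<String, Integer> deltas) {
        for (Map.Entry<String, Integer> e : deltas.entrySet()) {
            Integer old = aggregate.get(e.getKey());
            int updated = (old == null ? 0 : old.intValue())
                + e.getValue().intValue();
            aggregate.put(e.getKey(), Integer.valueOf(updated));
        }
    }

    /** Aggregate docFreq across all nodes; 0 for unseen terms. */
    synchronized int docFreq(String term) {
        Integer n = aggregate.get(term);
        return n == null ? 0 : n.intValue();
    }
}
```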
>
> This sounds very hairy to me.
>
> The delta approach is problematic. At present a Searchable instance,
> like an IndexReader, does not change the set of documents it
> searches. At present, when you want to search an updated collection
> you construct a new Searcher. So this is (again) a substantive
> change. It means, e.g., that if folks cache things based on the
> Searcher, those caches might become invalid.
>
> > I think the question is how frequent and how expensive those two
> > steps would be in comparison to the difference in the query
> > processing.
>
> I think the first question is: can we get RemoteSearchables to work
> correctly and reasonably efficiently for simple queries?
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]