Ahhh, I didn't understand the part about caching the results in the
central dispatch node.  I thought you were accessing the remote nodes on
every query to sum the docFreq's in each remote index for each query
term.  I was trying to avoid a large number of round-trips to the remote
nodes by allowing them to have the aggregate docFreq's to use when
processing their queries.  It would seem to make sense to build the
aggregate docFreq table in the central dispatch node, and so I agree it
therefore makes more sense to weight the query terms on the central node
rather than doing it separately on each remote node.
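
To check that we mean the same thing, here is a rough sketch of
weighting a term on the central node from summed per-node docFreq's.
The class and method names are made up for illustration and are not
Lucene API.  The idf formula is the one I believe DefaultSimilarity
uses, log(numDocs / (docFreq + 1)) + 1, so treat it as an assumption.

```java
// Hypothetical helper for illustration only; not part of Lucene.
class CentralTermWeights {

    // Sum one term's docFreq over all remote nodes.
    static int aggregateDocFreq(int[] perNodeDocFreq) {
        int total = 0;
        for (int df : perNodeDocFreq) {
            total += df;
        }
        return total;
    }

    // Assumed idf formula: log(numDocs / (docFreq + 1)) + 1,
    // computed once on the central node from collection-wide counts.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }
}
```

The point is only that the weight is computed once, centrally, from
collection-wide numbers, so every remote node scores against the same
value.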

There needs to be a way to create the aggregate docFreq table and keep
it current under incremental changes to the indices on the various
remote nodes.  One approach might be to always maintain the complete
aggregate docFreq table on the central dispatch node and have any remote
node that performs an indexing operation issue a delta-docFreq table to
the central dispatch node (i.e. a table of the changes in its docFreq
values).  If you build the cache incrementally on the central dispatch
node (i.e., on demand as terms are used in Query's), this process would
seem to be more difficult, unless the central dispatch node keeps a
separate cache for each remote node.  It could invalidate a remote
node's entire cache after an index operation, but this would lead to
slow subsequent query processing (reacquiring all the docFreq values)
and therefore could lead to poor performance in, for example, a
"realtime" indexing environment.
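
A minimal sketch of the delta-docFreq idea, assuming each remote node
can report a map from term to the change in its docFreq after an
indexing operation.  AggregateDocFreqTable and applyDelta are
hypothetical names, not existing Lucene classes or methods.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class for illustration only; not part of Lucene.
// Holds the complete aggregate docFreq table on the central dispatch node.
class AggregateDocFreqTable {
    private final Map<String, Integer> docFreqs = new HashMap<>();

    // Merge a delta table sent by a remote node after it indexes or deletes
    // documents. Values are changes in that node's docFreq and may be negative.
    synchronized void applyDelta(Map<String, Integer> delta) {
        for (Map.Entry<String, Integer> e : delta.entrySet()) {
            int updated = docFreqs.getOrDefault(e.getKey(), 0) + e.getValue();
            if (updated > 0) {
                docFreqs.put(e.getKey(), updated);
            } else {
                docFreqs.remove(e.getKey()); // term no longer occurs anywhere
            }
        }
    }

    // Collection-wide docFreq for a term; 0 if the term is unknown.
    synchronized int docFreq(String term) {
        return docFreqs.getOrDefault(term, 0);
    }
}
```

With this shape the central node never needs to invalidate anything:
each indexing operation costs one message carrying only the terms that
changed.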

So, it seems to me that keeping a complete aggregate docFreq table on
the central dispatch node, updated after each remote indexing operation,
would be a good way to go.  This table shouldn't be much larger than any
single remote node's docFreq table, assuming the terms are substantially
the same in each index.  Perhaps that isn't true, though, especially if
highly infrequent terms dominate the tables, as is probably the case.  I
think you suggested dropping such infrequent terms from the aggregate
table and assuming a docFreq of 1 for them, which would address this.
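
As I understand the pruning suggestion, it might look something like
the sketch below.  The class name and the minDocFreq threshold are my
own assumptions for illustration, not anything in Lucene.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class for illustration only; not part of Lucene.
// Keeps only terms whose aggregate docFreq reaches a threshold and
// answers 1 for everything else.
class PrunedDocFreqTable {
    private final Map<String, Integer> kept = new HashMap<>();

    PrunedDocFreqTable(Map<String, Integer> fullTable, int minDocFreq) {
        for (Map.Entry<String, Integer> e : fullTable.entrySet()) {
            if (e.getValue() >= minDocFreq) {
                kept.put(e.getKey(), e.getValue());
            }
        }
    }

    // Dropped terms (and terms never seen at all) are assumed to have docFreq 1.
    int docFreq(String term) {
        return kept.getOrDefault(term, 1);
    }

    int size() {
        return kept.size();
    }
}
```

One caveat with this sketch: once a term has been dropped, later deltas
for it can no longer be applied exactly, so pruning and incremental
updates would need to be reconciled somehow.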

Is there a better way, or perhaps I'm missing something?

Chuck

  > -----Original Message-----
  > From: Doug Cutting [mailto:[EMAIL PROTECTED]
  > Sent: Wednesday, January 12, 2005 8:58 AM
  > To: Lucene Developers List
  > Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems
  > with Similarity.docFreq() ?
  >
  > Chuck Williams wrote:
  > > I was thinking of the aggressive version with an index-time
  > > solution, although I don't know the Lucene architecture for
  > > distributed indexing and searching well enough to formulate the
  > > idea precisely.  Conceptually, I'd like each server that owns a
  > > slice of the index in a distributed environment to have the
  > > complete docFreq data, i.e. to have docFreq's that represent the
  > > collection as a whole, not just its index slice.  If this was
  > > achieved at index-time, then the current implementation would work
  > > at query time.  I.e., MultiSearch could send the queries out to
  > > the remote Searcher's and these Searcher's could consult their
  > > local indexes for the correct docFreq's to use.
  >
  > This is different than what I described.  I described keeping a
  > docFreq cache at the central dispatch node, while you describe
  > replicating that cache on every search node.  I don't see the
  > advantage in this replication.  It is both more efficient to
  > maintain a single cache, and faster to search, since fewer
  > dictionary lookups are involved.
  >
  > Doug
  > 
  > ---------------------------------------------------------------------
  > To unsubscribe, e-mail: [EMAIL PROTECTED]
  > For additional commands, e-mail: [EMAIL PROTECTED]

