Ahhh, I didn't understand the part about caching the results in the central dispatch node. I thought you were accessing the remote nodes on every query to sum the docFreq's in each remote index for each query term. I was trying to avoid a large number of round-trips to the remote nodes by allowing them to have the aggregate docFreq's to use when processing their queries. It does seem to make sense to build the aggregate docFreq table in the central dispatch node, and so I agree it makes more sense to weight the query terms on the central node rather than doing it separately on each remote node.
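To make sure I have the idea right, here is a rough sketch of what I imagine the aggregate table on the central dispatch node would look like. Everything in it is hypothetical: AggregateDocFreqTable, addNode, docFreq and idf are names I made up for illustration, not existing Lucene classes or methods, and the idf formula is only my assumption about the kind of weighting the central node would apply.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not a Lucene class: the aggregate docFreq table
// held on the central dispatch node.
class AggregateDocFreqTable {
    private final Map<String, Integer> docFreqs = new HashMap<>();
    private int numDocs = 0;

    // Fold one remote node's docFreq table and document count into the totals.
    void addNode(Map<String, Integer> nodeDocFreqs, int nodeNumDocs) {
        for (Map.Entry<String, Integer> e : nodeDocFreqs.entrySet()) {
            docFreqs.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        numDocs += nodeNumDocs;
    }

    // Collection-wide docFreq; a term that no node reported counts as 0.
    int docFreq(String term) {
        return docFreqs.getOrDefault(term, 0);
    }

    int numDocs() {
        return numDocs;
    }

    // Assumed weighting: log(numDocs / (docFreq + 1)) + 1, computed once
    // on the central node from the aggregate counts.
    float idf(String term) {
        return (float) (Math.log(numDocs / (double) (docFreq(term) + 1)) + 1.0);
    }

    public static void main(String[] args) {
        Map<String, Integer> nodeA = new HashMap<>();
        nodeA.put("lucene", 3);
        nodeA.put("index", 10);
        Map<String, Integer> nodeB = new HashMap<>();
        nodeB.put("lucene", 4);
        nodeB.put("search", 2);

        AggregateDocFreqTable table = new AggregateDocFreqTable();
        table.addNode(nodeA, 100);
        table.addNode(nodeB, 50);

        System.out.println(table.docFreq("lucene")); // prints 7
        System.out.println(table.numDocs());         // prints 150
        System.out.println(table.idf("search") > table.idf("index"));
    }
}
```

With something like this, a query term is weighted once on the central node from the collection-wide counts, and the remote nodes never need to see the aggregate docFreq's at all.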
There needs to be a way to create the aggregate docFreq table and keep it current under incremental changes to the indices on the various remote nodes. One approach might be to always maintain the complete aggregate docFreq table on the central dispatch node and have any remote node that performs an indexing operation issue a delta-docFreq table to the central dispatch node (i.e., a table of the changes in its docFreq values). If you build the cache incrementally on the central dispatch node (i.e., on demand as terms are used in queries), this process would seem to be more difficult, unless the central dispatch node keeps a separate cache for each remote node. It could invalidate a remote node's entire cache after an index operation, but this would slow subsequent query processing (all the docFreq values would have to be reacquired) and could therefore lead to poor performance in, for example, a "realtime" indexing environment.

So, it seems to me that keeping a complete aggregate docFreq table on the central dispatch node, updated after each remote index operation, would be a good way to go. This table shouldn't be much larger than any single remote node's docFreq table, assuming the terms are substantially the same in each index (although perhaps this isn't true, especially if highly infrequent terms dominate the tables, as is probably the case; I think you suggested dropping such infrequent terms from the aggregate table and assuming a docFreq of 1 to address this issue).

Is there a better way, or perhaps I'm missing something?

Chuck

> -----Original Message-----
> From: Doug Cutting [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, January 12, 2005 8:58 AM
> To: Lucene Developers List
> Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems with
> Similarity.docFreq() ?
>
> Chuck Williams wrote:
> > I was thinking of the aggressive version with an index-time solution,
> > although I don't know the Lucene architecture for distributed indexing
> > and searching well enough to formulate the idea precisely.
> > Conceptually, I'd like each server that owns a slice of the index in a
> > distributed environment to have the complete docFreq data, i.e. to have
> > docFreq's that represent the collection as a whole, not just its index
> > slice. If this was achieved at index-time, then the current
> > implementation would work at query time. I.e., MultiSearch could send
> > the queries out to the remote Searcher's and these Searcher's could
> > consult their local indexes for the correct docFreq's to use.
>
> This is different than what I described. I described keeping a docFreq
> cache at the central dispatch node, while you describe replicating that
> cache on every search node. I don't see the advantage in this
> replication. It is both more efficient to maintain a single cache, and
> faster to search, since fewer dictionary lookups are involved.
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]