Doug Cutting wrote:
> I'm not sure exactly what you mean by "distribute the idf information
> out to the RemoteSearchable".  I think one might profitably implement a
> docFreq() cache in RemoteSearchable.  This could be a simple cache, or
> it could be fairly aggressive, pre-fetching all the docFreqs.  (As an
> optimization, it could only pre-fetch those greater than 1, and, when a
> term is not in the cache, assume its docFreq is 1.  As a lossy
> optimization, it could only pre-fetch those greater than N, and somehow
> estimate those not in the cache.)  Is that what you meant?
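
To make the quoted caching idea concrete, here is a rough sketch in plain Java. It is only an illustration under stated assumptions, not Lucene code: the class and method names (DocFreqCacheSketch, globalDocFreqs, prefetch, docFreq) are hypothetical, each index slice is modelled as a simple term-to-count map, and summing per-slice counts assumes every document lives in exactly one slice.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Lucene.
final class DocFreqCacheSketch {

    // Collection-wide docFreq per term: the sum of the per-slice docFreqs.
    // Assumes the slices are disjoint (each document is in exactly one slice).
    static Map<String, Integer> globalDocFreqs(List<Map<String, Integer>> slices) {
        Map<String, Integer> global = new HashMap<>();
        for (Map<String, Integer> slice : slices) {
            for (Map.Entry<String, Integer> e : slice.entrySet()) {
                global.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return global;
    }

    // "Aggressive" pre-fetch that keeps only the terms whose docFreq is > 1.
    static Map<String, Integer> prefetch(Map<String, Integer> global) {
        Map<String, Integer> cache = new HashMap<>();
        for (Map.Entry<String, Integer> e : global.entrySet()) {
            if (e.getValue() > 1) {
                cache.put(e.getKey(), e.getValue());
            }
        }
        return cache;
    }

    // Cache lookup: a term missing from the cache is assumed to have docFreq 1.
    // Note that a term absent from every slice is also reported as 1 here.
    static int docFreq(Map<String, Integer> cache, String term) {
        return cache.getOrDefault(term, 1);
    }

    public static void main(String[] args) {
        Map<String, Integer> sliceA = new HashMap<>();
        sliceA.put("lucene", 3);
        sliceA.put("rare", 1);
        Map<String, Integer> sliceB = new HashMap<>();
        sliceB.put("lucene", 2);

        Map<String, Integer> cache = prefetch(globalDocFreqs(Arrays.asList(sliceA, sliceB)));
        System.out.println("lucene -> " + docFreq(cache, "lucene"));
        System.out.println("rare -> " + docFreq(cache, "rare"));
    }
}
```

The same merged map is also what each server would need to hold locally if the collection-wide counts were distributed ahead of time instead of being fetched per query.
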

I was thinking of the aggressive version with an index-time solution,
although I don't know the Lucene architecture for distributed indexing
and searching well enough to formulate the idea precisely.
Conceptually, I'd like each server that owns a slice of the index in a
distributed environment to have the complete docFreq data, i.e., to have
docFreqs that represent the collection as a whole, not just its own
index slice.  If this were achieved at index time, then the current
implementation would work at query time: MultiSearcher could send the
queries out to the remote Searchers, and these Searchers could consult
their local indexes for the correct docFreqs to use.

Chuck

> -----Original Message-----
> From: Doug Cutting [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, January 11, 2005 3:46 PM
> To: Lucene Developers List
> Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems with
> Similarity.docFreq() ?
>
> Chuck Williams wrote:
> > This is a nice solution!  By having MultiSearcher create the Weight, it
> > can pass itself in as the searcher, thereby allowing the correct
> > docFreq() method to be called.
>
> Glad to hear it at least makes sense...  Now I hope it works!
>
> > I'm still left wondering if having MultiSearcher query all the
> > RemoteSearchable's on every call to docFreq() within each TermQuery,
> > PhraseQuery, SpanQuery and PhrasePrefixQuery is the way to go long term,
> > although it seems like the best thing to do right now.  The calls only
> > happen when the Weight's are created, so maybe it's not too bad.  Longer
> > term, it might be better to distribute the idf information out to the
> > RemoteSearchable's to minimize the required number of remote accesses
> > for each Query.
>
> I'm not sure exactly what you mean by "distribute the idf information
> out to the RemoteSearchable".  I think one might profitably implement a
> docFreq() cache in RemoteSearchable.  This could be a simple cache, or
> it could be fairly aggressive, pre-fetching all the docFreqs.  (As an
> optimization, it could only pre-fetch those greater than 1, and, when a
> term is not in the cache, assume its docFreq is 1.  As a lossy
> optimization, it could only pre-fetch those greater than N, and somehow
> estimate those not in the cache.)  Is that what you meant?
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]