RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

Chuck Williams Tue, 11 Jan 2005 16:01:05 -0800

Doug Cutting wrote:
  > I'm not sure exactly what you mean by "distribute the idf information
  > out to the RemoteSearchable".  I think one might profitably implement a
  > docFreq() cache in RemoteSearchable.  This could be a simple cache, or
  > it could be fairly aggressive, pre-fetching all the docFreqs.  (As an
  > optimization, it could only pre-fetch those greater than 1, and, when a
  > term is not in the cache, assume its docFreq is 1.  As a lossy
  > optimization, it could only pre-fetch those greater than N, and somehow
  > estimate those not in the cache.)  Is that what you meant?
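
To make the pre-fetching variant concrete, here is a rough sketch of how such a cache might behave. It is only an illustration: DocFreqCache is a hypothetical class, not anything in Lucene, and it keys on plain strings instead of Term objects. It keeps only docFreqs greater than 1 and answers 1 for anything it has not seen.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not Lucene API: terms are plain strings here.
class DocFreqCache {
    private final Map<String, Integer> cache = new HashMap<>();

    // Pre-fetch only the entries with docFreq > 1; the rest are implied.
    DocFreqCache(Map<String, Integer> remoteDocFreqs) {
        for (Map.Entry<String, Integer> e : remoteDocFreqs.entrySet()) {
            if (e.getValue() > 1) {
                cache.put(e.getKey(), e.getValue());
            }
        }
    }

    // A cache miss is assumed to mean docFreq == 1.
    int docFreq(String term) {
        return cache.getOrDefault(term, 1);
    }

    // Number of entries actually stored.
    int size() {
        return cache.size();
    }
}
```

One consequence of this shortcut is that a term absent from the index entirely also comes back as 1 rather than 0.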


I was thinking of the aggressive version with an index-time solution,
although I don't know the Lucene architecture for distributed indexing
and searching well enough to formulate the idea precisely.
Conceptually, I'd like each server that owns a slice of the index in a
distributed environment to have the complete docFreq data, i.e. to have
docFreq's that represent the collection as a whole, not just its own
index slice.  If this were achieved at index time, then the current
implementation would work at query time.  I.e., MultiSearcher could send
the queries out to the remote Searcher's and these Searcher's could
consult their local indexes for the correct docFreq's to use.
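
As a rough illustration of what I mean, and not a proposal for the actual implementation: if the slices hold disjoint documents, the collection-wide docFreq of a term is the sum of its per-slice docFreqs. The class and method names below are hypothetical, terms are plain strings, and the idf formula is written in the style of DefaultSimilarity, log(numDocs / (docFreq + 1)) + 1.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not Lucene API.
class GlobalDocFreq {
    // Sum per-slice docFreqs into collection-wide docFreqs.
    // Assumes each document lives in exactly one slice.
    static Map<String, Integer> merge(List<Map<String, Integer>> slices) {
        Map<String, Integer> global = new HashMap<>();
        for (Map<String, Integer> slice : slices) {
            for (Map.Entry<String, Integer> e : slice.entrySet()) {
                global.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return global;
    }

    // idf in the style of DefaultSimilarity.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }
}
```

Each server would then store the merged table next to its slice, so a term's idf comes out the same no matter which slice scores it.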

Chuck

  > -----Original Message-----
  > From: Doug Cutting [mailto:[EMAIL PROTECTED]
  > Sent: Tuesday, January 11, 2005 3:46 PM
  > To: Lucene Developers List
  > Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems with
  > Similarity.docFreq() ?
  > 
  > Chuck Williams wrote:
  > > This is a nice solution!  By having MultiSearcher create the Weight, it
  > > can pass itself in as the searcher, thereby allowing the correct
  > > docFreq() method to be called.
  > 
  > Glad to hear it at least makes sense... Now I hope it works!
  > 
  > > I'm still left wondering if having MultiSearcher query all the
  > > RemoteSearchable's on every call to docFreq() within each TermQuery,
  > > PhraseQuery, SpanQuery and PhrasePrefixQuery is the way to go long term,
  > > although it seems like the best thing to do right now.  The calls only
  > > happen when the Weight's are created, so maybe it's not too bad.  Longer
  > > term, it might be better to distribute the idf information out to the
  > > RemoteSearchable's to minimize the required number of remote accesses
  > > for each Query.
  > 
  > I'm not sure exactly what you mean by "distribute the idf information
  > out to the RemoteSearchable".  I think one might profitably implement a
  > docFreq() cache in RemoteSearchable.  This could be a simple cache, or
  > it could be fairly aggressive, pre-fetching all the docFreqs.  (As an
  > optimization, it could only pre-fetch those greater than 1, and, when a
  > term is not in the cache, assume its docFreq is 1.  As a lossy
  > optimization, it could only pre-fetch those greater than N, and somehow
  > estimate those not in the cache.)  Is that what you meant?
  > 
  > Doug
  > 
  > ---------------------------------------------------------------------
  > To unsubscribe, e-mail: [EMAIL PROTECTED]
  > For additional commands, e-mail: [EMAIL PROTECTED]


