RE: how do I get my own TopDocHitCollector?

Beard, Brian Fri, 11 Jan 2008 12:50:52 -0800

Thanks for all this. We're doing warmup searching also, but just for
some common date searches. The warmup would be a good place to add some
pre-caching capability. I'll plan for this eventually and start with the
partial cache for now.


Thanks,

Brian Beard

-----Original Message-----
From: Antony Bowesman [mailto:[EMAIL PROTECTED] 
Sent: Thursday, January 10, 2008 3:19 PM
To: java-user@lucene.apache.org
Subject: Re: how do I get my own TopDocHitCollector?

Beard, Brian wrote:
> Ok, I've been thinking about this some more. Is the cache mechanism
> pulling from the cache if the external id already exists there and then
> hitting the searcher if it's not already in the cache (maybe using a
> FieldSelector for just retrieving the external id)?

I am warming searchers in the background, and each search has one or more
query-related caches.  The external Id cache is normally preloaded by simply
iterating terms, e.g.

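         // Intern the name so the identity comparison below is valid
         // (Lucene interns Term field names).
         String field = fieldName.intern();
         // One slot per document number; docs without the field stay null.
         final String[] retArray = new String[reader.maxDoc()];
         TermDocs termDocs = reader.termDocs();
         // Position the enumeration at the first term of this field.
         TermEnum termEnum = reader.terms (new Term (field, ""));
         try
         {
             do
             {
                 Term term = termEnum.term();
                 // Stop once the enumeration runs past this field's terms.
                 if (term == null || term.field() != field)
                     break;
                 String termval = term.text();
                 termDocs.seek(termEnum);
                 // Record the term value for every document containing it.
                 while (termDocs.next())
                 {
                     retArray[termDocs.doc()] = termval;
                 }
             }
             while (termEnum.next());
         }
         finally
         {
             termDocs.close();
             termEnum.close();
         }
         return retArray;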

I do allow for a partial cache, in which case, as you suggest, the searcher
uses a FieldSelector to get the external Id from the document, which is then
stored in the cache.
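
A minimal sketch of that partial-cache fallback, in plain Java so it runs
without Lucene. The class name PartialIdCache and the loader function are
hypothetical; the loader stands in for the FieldSelector-based document
lookup described above, and none of this is taken from the actual code.

```java
import java.util.function.IntFunction;

/** Hypothetical partial external-id cache keyed by document number. */
class PartialIdCache {
    private final String[] ids;                // null slot = not cached yet
    private final IntFunction<String> loader;  // stand-in for the stored-field lookup
    private int misses;

    PartialIdCache(int maxDoc, IntFunction<String> loader) {
        this.ids = new String[maxDoc];
        this.loader = loader;
    }

    /** Returns the cached id, loading and storing it on a miss. */
    String get(int doc) {
        String id = ids[doc];
        if (id == null) {
            // Cache miss: fall back to the (slower) document lookup.
            // A loader that returns null would be retried on every call.
            id = loader.apply(doc);
            ids[doc] = id;
            misses++;
        }
        return id;
    }

    /** Number of lookups that had to go to the loader. */
    int misses() {
        return misses;
    }
}
```

The array is sized like the preloaded one (one slot per document number), so
a warmup pass could fill some slots up front and leave the rest to be loaded
on demand.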

Antony



> 
> -----Original Message-----
> From: Beard, Brian [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, January 10, 2008 10:08 AM
> To: java-user@lucene.apache.org
> Subject: RE: how do I get my own TopDocHitCollector?
> 
> Thanks for the post. So you're using the doc id as the key into the
> cache to retrieve the external id. Then what mechanism fetches the
> external id's from the searcher and places them in the cache?
> 
> 
> -----Original Message-----
> From: Antony Bowesman [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, January 09, 2008 7:19 PM
> To: java-user@lucene.apache.org
> Subject: Re: how do I get my own TopDocHitCollector?
> 
> Beard, Brian wrote:
>> Question:
>>
>> The documents that I index have two id's - a unique document id and a
>> record_id that can link multiple documents together that belong to a
>> common record.
>>
>> I'd like to use something like TopDocs to return the first 1024 results
>> that have unique record_id's, but I will want to skip some of the
>> returned documents that have the same record_id. We're using the
>> ParallelMultiSearcher. 
>>
>> I read that I could use a HitCollector and throw an exception to get it
>> to stop, but is there a cleaner way?
> 
> I'm doing a similar thing.  I have external Ids (equivalent to your
> record_id), which have one or more Lucene Documents associated with them.
> I wrote a custom HitCollector that uses a Map to hold the external ids
> collected so far along with the collected document.
> 
> I had to write my own priority queue to know when an object was dropped
> off the bottom of the score-sorted queue, but the latest PriorityQueue on
> the trunk now has insertWithOverflow(), which does the same thing.
> 
> Note that ResultDoc extends ScoreDoc, so that the external Id of the item
> dropped off the queue can be used to remove it from my Map.
> 
> The code snippet is roughly as below (I am caching my external Ids, hence
> the cache usage):
> 
>     protected Map<OfficeId, ResultDoc> results;
> 
>     public void collect(int doc, float score)
>      {
>          if (score > 0.0f)
>          {
>              totalHits++;
>              if (pq.size() < numHits || score > minScore)
>              {
>                  OfficeId id = cache.get(doc);
>                  ResultDoc rd = results.get(id);
>                  //  No current result for this ID yet found
>                  if (rd == null)
>                  {
>                      rd = new ResultDoc(id, doc, score);
>                      ResultDoc added = pq.insert(rd);
>                      if (added == null)
>                      {
>                          //  Nothing dropped off the bottom
>                          results.put(id, rd);
>                      }
>                      else
>                      {
>                          //  Return value dropped off the bottom
>                          results.remove(added.id);
>                          results.put(id, rd);
>                          remaining++;
>                      }
>                  }
>                  //  Already found this ID, so replace high score if necessary
>                  else
>                  {
>                      if (score > rd.score)
>                      {
>                          pq.remove(rd);
>                          rd.score = score;
>                          pq.insert(rd);
>                      }
>                  }
>                  //  Readjust our minimum score again from the top entry
>                  minScore = pq.peek().score;
>              }
>              else
>                  remaining++;
>          }
>      }
> 
> HTH
> Antony
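
For anyone wanting to try the idea outside Lucene, here is a rough,
self-contained sketch of the same approach: a Map from external id to the
current best hit, plus a bounded min-queue ordered by score. DedupTopN, Hit
and topHits() are hypothetical names, java.util.PriorityQueue stands in for
the custom queue mentioned above, and unlike the snippet above this version
also updates the doc number when a better-scoring document arrives for an id.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical collector: best-scoring doc per external id, top N ids only. */
class DedupTopN {
    static final class Hit {
        final String id;
        int doc;
        float score;
        Hit(String id, int doc, float score) {
            this.id = id;
            this.doc = doc;
            this.score = score;
        }
    }

    private final int numHits;
    private final Map<String, Hit> byId = new HashMap<String, Hit>();
    // Min-heap on score, so peek() is the weakest hit currently kept.
    private final PriorityQueue<Hit> pq =
        new PriorityQueue<Hit>((a, b) -> Float.compare(a.score, b.score));

    DedupTopN(int numHits) {
        if (numHits < 1) throw new IllegalArgumentException("numHits must be >= 1");
        this.numHits = numHits;
    }

    void collect(String id, int doc, float score) {
        Hit existing = byId.get(id);
        if (existing != null) {
            // Id already collected: keep only its best-scoring document.
            if (score > existing.score) {
                pq.remove(existing);   // linear scan, acceptable for a sketch
                existing.doc = doc;
                existing.score = score;
                pq.add(existing);
            }
        } else if (pq.size() < numHits) {
            Hit h = new Hit(id, doc, score);
            pq.add(h);
            byId.put(id, h);
        } else if (score > pq.peek().score) {
            // Queue is full: evict the weakest hit and forget its id.
            Hit dropped = pq.poll();
            byId.remove(dropped.id);
            Hit h = new Hit(id, doc, score);
            pq.add(h);
            byId.put(id, h);
        }
    }

    /** Collected hits, highest score first. */
    List<Hit> topHits() {
        List<Hit> out = new ArrayList<Hit>(pq);
        out.sort((a, b) -> Float.compare(b.score, a.score));
        return out;
    }
}
```

The Map is what makes the eviction step cheap to keep consistent: when a hit
falls off the bottom of the queue, its id is removed so that a later document
for the same id is treated as new.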
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]