Re: how do I get my own TopDocHitCollector?

Antony Bowesman Thu, 10 Jan 2008 12:20:19 -0800

Beard, Brian wrote:

Ok, I've been thinking about this some more. Is the cache mechanism
pulling from the cache if the external id already exists there and then
hitting the searcher if it's not already in the cache (maybe using a
FieldSelector for just retrieving the external id)?

I am warming searchers in the background, and each search has one or more query-related caches. The external Id cache is normally preloaded by simply iterating terms, e.g.


        String field = fieldName.intern();
        final String[] retArray = new String[reader.maxDoc()];
        TermDocs termDocs = reader.termDocs();
        TermEnum termEnum = reader.terms (new Term (field, ""));
        try
        {
            do
            {
                Term term = termEnum.term();
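                // Field names are interned, so an identity comparison is enough
                // to detect that we have run past the last term of this field.
                if (term == null || term.field() != field)
                    break;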
                String termval = term.text();
                termDocs.seek(termEnum);
                while (termDocs.next())
                {
                    retArray[termDocs.doc()] = termval;
                }
            }
            while (termEnum.next());
        }
        finally
        {
            termDocs.close();
            termEnum.close();
        }
        return retArray;

I do allow for a partial cache, in which case, as you suggest, the searcher uses a FieldSelector to get the external Id from the document, which is then stored in the cache.
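
The partial-cache fallback described above could be sketched roughly as follows. This is only an illustration, not the code actually in use: PartialIdCache and its loader are hypothetical names, and the loader stands in for whatever stored-field lookup (e.g. a FieldSelector-based document fetch) supplies the external Id on a miss.

```java
import java.util.function.IntFunction;

/** Hypothetical partial cache: Lucene doc number -> external Id. */
class PartialIdCache {
    // A null slot means "not cached yet". The array may come from a
    // term-iteration preload that only covered part of the index.
    private final String[] ids;
    // Assumed stand-in for the stored-field lookup done on a cache miss.
    private final IntFunction<String> loader;
    private int misses;

    PartialIdCache(String[] preloaded, IntFunction<String> loader) {
        this.ids = preloaded;
        this.loader = loader;
    }

    /** Returns the cached Id, loading and storing it on a miss. */
    String get(int doc) {
        String id = ids[doc];
        if (id == null) {
            id = loader.apply(doc);
            ids[doc] = id;
            misses++;
        }
        return id;
    }

    /** Number of lookups that had to go to the loader. */
    int misses() {
        return misses;
    }
}
```

With this shape, a collector only ever calls get(doc); whether the value came from the preload or from the fallback is invisible to it.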


Antony


-----Original Message-----

From: Beard, Brian [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 10, 2008 10:08 AM
To: java-user@lucene.apache.org
Subject: RE: how do I get my own TopDocHitCollector?

Thanks for the post. So you're using the doc id as the key into the
cache to retrieve the external id. Then what mechanism fetches the
external id's from the searcher and places them in the cache?

-----Original Message-----

From: Antony Bowesman [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 09, 2008 7:19 PM
To: java-user@lucene.apache.org
Subject: Re: how do I get my own TopDocHitCollector?

Beard, Brian wrote:

Question:

The documents that I index have two id's - a unique document id and a
record_id that can link multiple documents together that belong to a
common record.

I'd like to use something like TopDocs to return the first 1024 results
that have unique record_id's, but I will want to skip some of the
returned documents that have the same record_id. We're using the
ParallelMultiSearcher.

I read that I could use a HitCollector and throw an exception to get it
to stop, but is there a cleaner way?


I'm doing a similar thing. I have external Ids (equivalent to your record_id), which have one or more Lucene Documents associated with them. I wrote a custom HitCollector that uses a Map to hold the external ids collected so far, along with the collected document.


I had to write my own priority queue to know when an object was dropped off the bottom of the score-sorted queue, but the latest PriorityQueue on the trunk now has insertWithOverflow(), which does the same thing.
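
For anyone who wants that behaviour without the trunk class, a minimal sketch of a bounded queue that reports what fell off the bottom is below. It is built on java.util.PriorityQueue rather than Lucene's own class, BoundedQueue is a made-up name, and the return convention (null when nothing was dropped, otherwise the element that did not survive, which may be the new one) is my reading of insertWithOverflow(), so check it against the trunk source.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical fixed-size queue that keeps the largest maxSize elements. */
class BoundedQueue<T> {
    private final PriorityQueue<T> heap;   // smallest element at the head
    private final Comparator<? super T> order;
    private final int maxSize;

    BoundedQueue(int maxSize, Comparator<? super T> order) {
        this.maxSize = maxSize;
        this.order = order;
        // java.util.PriorityQueue requires an initial capacity of at least 1
        this.heap = new PriorityQueue<>(Math.max(1, maxSize), order);
    }

    /**
     * Adds item if there is room or if it beats the current smallest.
     * Returns null when nothing was dropped, otherwise the element that
     * is no longer (or never was) in the queue.
     */
    T insertWithOverflow(T item) {
        if (heap.size() < maxSize) {
            heap.add(item);
            return null;
        }
        T least = heap.peek();
        if (least != null && order.compare(item, least) > 0) {
            heap.poll();
            heap.add(item);
            return least;      // the old minimum fell off the bottom
        }
        return item;           // the new item was not good enough
    }

    /** Current smallest element, or null if the queue is empty. */
    T least() {
        return heap.peek();
    }

    int size() {
        return heap.size();
    }
}
```

The returned element is what lets a collector keep a side Map in step with the queue: whatever comes back is removed from the Map.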


Note that ResultDoc extends ScoreDoc, so that the external Id of the item dropped off the queue can be used to remove it from my Map.


Code snippet is somewhat as below (I am caching my external Ids, hence the cache usage)


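    // Values are ResultDoc (a ScoreDoc subclass carrying the external id),
    // since collect() reads them back as ResultDoc.
    protected Map<OfficeId, ResultDoc> results;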

    public void collect(int doc, float score)
     {
         if (score > 0.0f)
         {
             totalHits++;
             if (pq.size() < numHits || score > minScore)
             {
                 OfficeId id = cache.get(doc);
                 ResultDoc rd = results.get(id);
                 //  No current result for this ID yet found
                 if (rd == null)
                 {
                     rd = new ResultDoc(id, doc, score);
                     ResultDoc added = pq.insert(rd);
                     if (added == null)
                     {
                         //  Nothing dropped off the bottom
                         results.put(id, rd);
                     }
                     else
                     {
                         //  Return value dropped off the bottom
                         results.remove(added.id);
                         results.put(id, rd);
                         remaining++;
                     }
                 }
                 //  Already found this ID, so replace high score if necessary
                 else
                 {
                     if (score > rd.score)
                     {
                         pq.remove(rd);
                         rd.score = score;
                         pq.insert(rd);
                     }
                 }
                 //  Readjust our minimum score again from the top entry
                 minScore = pq.peek().score;
             }
             else
                 remaining++;
         }
     }
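
For reference, the same de-duplication idea can be shown as a self-contained sketch that runs without Lucene. Everything here is illustrative: DedupCollector and Hit are hypothetical names, a TreeSet stands in for the priority queue, the external Id is passed in directly instead of coming from a cache, and unlike the snippet above it also updates the stored doc number when a better-scoring document for the same Id arrives.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical collector keeping the best hit per external Id. */
class DedupCollector {
    static final class Hit {
        final String externalId;
        int doc;
        float score;

        Hit(String externalId, int doc, float score) {
            this.externalId = externalId;
            this.doc = doc;
            this.score = score;
        }
    }

    // Ascending by score; ties broken by Id so equal scores are not merged.
    private final TreeSet<Hit> queue = new TreeSet<>(
            Comparator.comparingDouble((Hit h) -> h.score)
                      .thenComparing(h -> h.externalId));
    private final Map<String, Hit> byId = new HashMap<>();
    private final int numHits;

    DedupCollector(int numHits) {
        this.numHits = numHits;
    }

    void collect(String externalId, int doc, float score) {
        if (score <= 0.0f) {
            return;
        }
        Hit existing = byId.get(externalId);
        if (existing != null) {
            if (score > existing.score) {
                // Remove before changing the sort key, then re-insert.
                queue.remove(existing);
                existing.doc = doc;
                existing.score = score;
                queue.add(existing);
            }
            return;
        }
        Hit hit = new Hit(externalId, doc, score);
        if (queue.size() < numHits) {
            queue.add(hit);
            byId.put(externalId, hit);
        } else if (!queue.isEmpty() && score > queue.first().score) {
            // Lowest-scoring Id falls off the bottom; forget it in the Map too.
            Hit dropped = queue.pollFirst();
            byId.remove(dropped.externalId);
            queue.add(hit);
            byId.put(externalId, hit);
        }
    }

    /** Best hit per external Id, highest score first. */
    List<Hit> topHits() {
        return new ArrayList<>(queue.descendingSet());
    }
}
```

Feeding it hits for Ids A, B, A again with a higher score, and then C leaves at most one entry per Id, with the lowest-scoring Id evicted once numHits is reached.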

HTH
Antony



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



