Re: Time of processing hits.doc()

Mark Miller Sun, 18 Nov 2007 15:17:07 -0800

Correction: that issue to watch out for is in regards to the TopDocsHitCollector. If you were to go with your own HitCollector rather than TopDocs, you might not necessarily have this problem (or at the least you can code around it).
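
As a rough illustration of "coding around it", a custom collector can append to a growable list instead of pre-sizing a structure from a requested hit count. This is only a sketch: HitCollectorStandIn is a hypothetical stand-in for Lucene's HitCollector (declared here so the example compiles without Lucene on the classpath), and AllDocsCollector is a made-up name. Note that ids arrive in collection order, not ranked by score.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Lucene's HitCollector, which is called back
// once per matching document with its id and raw score.
abstract class HitCollectorStandIn {
    abstract void collect(int doc, float score);
}

// Collects every matching doc id into a list that grows on demand,
// so no "how many hits do you want" number is needed up front.
class AllDocsCollector extends HitCollectorStandIn {
    private final List<Integer> docIds = new ArrayList<Integer>();

    @Override
    void collect(int doc, float score) {
        docIds.add(doc);
    }

    List<Integer> getDocIds() {
        return docIds;
    }
}

class AllDocsCollectorDemo {
    public static void main(String[] args) {
        AllDocsCollector collector = new AllDocsCollector();
        // In real code the searcher would drive these callbacks.
        collector.collect(3, 1.0f);
        collector.collect(7, 0.5f);
        System.out.println(collector.getDocIds().size());
    }
}
```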


Mark Miller wrote:

Hey Haroldo.
First thing you need to do is *stop* using Hits in your searches. Hits is optimized for some pretty specific use cases and you will get along much better by using a HitCollector.
Hits has three main functions:
It caches documents, normalizes scores, and stores ids associated with scores (a HitDoc). If you attempt to retrieve a HitDoc past the first 100 from Hits, a new search will be issued to grab double the required HitDocs needed to satisfy your HitDoc retrieval attempt. This will be repeated every time you ask for a HitDoc beyond the current cache (which began at 100). This means that if you need to get a HitDoc beyond 100, Hits is not a great choice for you. You will want to use the HitCollector instead...but remember that you are losing the normalized scores (simple to copy code if you still want it) and the document caching (I rarely want that anyway).
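
To make the cost of that doubling concrete, here is a rough model of the caching behavior described above. It is a sketch based only on this description, not Lucene's actual code; the names HitsCacheModel and countReSearches are made up for illustration.

```java
// Simplified model: the cache starts at initialCache entries, and asking for
// an index beyond the cache re-runs the search for double what is needed.
class HitsCacheModel {

    // Counts the extra searches issued while walking hits 0..totalHits-1 in order.
    static int countReSearches(int totalHits, int initialCache) {
        int cached = Math.min(initialCache, totalHits);
        int reSearches = 0;
        for (int i = 0; i < totalHits; i++) {
            if (i >= cached) {
                // Cache miss: fetch twice the needed amount, capped at the total.
                long wanted = 2L * i;
                cached = (int) Math.min(wanted, (long) totalHits);
                reSearches++;
            }
        }
        return reSearches;
    }

    public static void main(String[] args) {
        System.out.println(countReSearches(25000, 100));
    }
}
```

Under this model, walking all 25,000 hits in order triggers 8 extra searches on top of the original one, and each of them collects more documents than the one before.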
An issue to watch out for: with Hits, you do not have to ask for how many docs to get back, but with a HitCollector solution you will need to. This is a minor dilemma if you want to go over all of the hits no matter what. You can pass a huge number to ensure you get everything, but you will be creating large data structures if you do this, as structure sizes may be initialized by the number you pass. Also, passing the maximum integer will cause an error (negative init size) as Lucene initializes a data structure to hold the hits as n+1.
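
The negative init size comes from plain int overflow: Integer.MAX_VALUE + 1 wraps around to a negative number. The sketch below reproduces just that arithmetic with an ordinary array; allocateFor and overflows are hypothetical helpers, not Lucene methods.

```java
// Demonstrates why sizing a structure as n + 1 fails when n is the maximum int.
class InitSizeOverflow {

    // Mirrors the "n + 1" sizing described above (not Lucene's actual code).
    static Object[] allocateFor(int n) {
        return new Object[n + 1];
    }

    // Returns true if sizing for n hits wraps around to a negative length.
    static boolean overflows(int n) {
        try {
            allocateFor(n);
            return false;
        } catch (NegativeArraySizeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(overflows(10));
        System.out.println(overflows(Integer.MAX_VALUE));
    }
}
```

Passing Integer.MAX_VALUE - 1 would avoid the wraparound, but it still asks for an enormous allocation, so it is not a practical fix.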
- Mark

Haroldo Nascimento wrote:
I have a performance problem when I need to group the results of a search.

I have the code below:

   for (int i = 0; i < hits.length(); i++) {
       doc = hits.doc(i);

       obj1 = doc.get(Constants.STATE_DESC_FIELD_LABEL);
       obj2 = doc.get(xxx);
       ...
   }

  I work with a very large volume of data. The search is processed in
0.300 seconds, but when the hits object has many results, the time to
get all the objects is very long. The command hits.doc(i) is processed
in 2 seconds.

  For example, for hits.length() equal to 25,000 results, the
"post-search" time is 7 seconds.

  I get all the results because I need to group them (remove the
duplicate results).

  Is there any way in Lucene to group the results? I need something
like the "group by" command in SQL.

  Thanks.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

