Hi Andrzej,

Could you please help with this subject?
Thanks

---------- Forwarded message ----------
From: Emmanuel <[EMAIL PROTECTED]>
Date: 17 Feb 2008 07:04
Subject: Query Searc
To: nutch-user <[email protected]>

Hi guys,

I've been trying to understand how the search results are produced from the parameters passed in. My objectives are the following:
- sort my results by score
- limit the number of duplicates
- build paging in my HTML page according to the number of results

So I looked at the code in the NutchBean class, and more precisely at the method:

    public Hits search(Query query, int numHits, int maxHitsPerDup, String dedupField, String sortField, boolean reverse)

I tried to understand the code, but it leads to the following questions:

1. We run an initial search to get a list of hits and go through this list in a loop. However, inside the loop we run the search again with some terms prohibited when needed. I don't really understand why we do this inside the loop, as hits.getTotal() might be different for the new results and the loop could miss some items. Is there any reason?

2. We optimize the request by calling optQuery.addProhibitedTerm with the dedupValue found. However, it looks like we are limited to 20 dedup values. Why do we have this limitation? What happens if we have more dedup values to exclude?

3. numHitsRaw = (int)(numHitsRaw * rawHitsFactor); is executed inside the loop each time we search again with a new prohibited term. It means that numHitsRaw is multiplied by rawHitsFactor on every new search, so after 20 of them we can end up requesting numHitsRaw * rawHitsFactor^20 hits. It's a huge request, which causes an OutOfMemoryError on my web server. Is this normal? Why do we need such a large number of results?

4. hits.getTotal() is obviously not the exact number of results that we have. This number corresponds to the number of results without our filter on the dedup value. So how can I know how many results are available once the dedup filter is applied, in order to build paging in my HTML page? Maybe we need to add another field to the Hits object to hold the filtered total, don't we?

Thanks in advance for any clarification.
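
To make the arithmetic in question 3 concrete, here is a minimal standalone sketch. It is not the NutchBean code: the class RawHitsGrowth, the method rawHitsAfter and the sample values (20 initial raw hits, a factor of 2.0) are hypothetical, and it assumes the request size changes only through the statement quoted above, once per new search.

```java
// Hypothetical illustration only; none of these names exist in Nutch.
public class RawHitsGrowth {

    /**
     * Size of the raw-hit request after a given number of repeated searches,
     * assuming each one applies: numHitsRaw = (int)(numHitsRaw * rawHitsFactor)
     */
    static int rawHitsAfter(int initialNumHitsRaw, float rawHitsFactor, int reSearches) {
        int numHitsRaw = initialNumHitsRaw;
        for (int i = 0; i < reSearches; i++) {
            // Same update as the statement quoted in question 3.
            // Java's float-to-int cast saturates at Integer.MAX_VALUE.
            numHitsRaw = (int) (numHitsRaw * rawHitsFactor);
        }
        return numHitsRaw;
    }

    public static void main(String[] args) {
        // Sample values are assumptions chosen for illustration.
        for (int k : new int[] {0, 1, 5, 10, 20}) {
            System.out.println(k + " repeated searches -> "
                    + rawHitsAfter(20, 2.0f, k) + " raw hits requested");
        }
    }
}
```

Under these assumptions the growth is geometric: the request size is the initial numHitsRaw times rawHitsFactor raised to the number of repeated searches, which would be consistent with the memory pressure described in question 3.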
