Emmanuel wrote:
> My objectives are the following:
> - sort my results by score
> - limit the number of duplicates
I'm assuming here that you mean "results from the same site" and not
exact duplicates? De-duplication of exact duplicates is an offline
process in Nutch, handled by the DeleteDuplicates tool. Site-level
deduplication, on the other hand, is handled on the fly in NutchBean.
> - build paging within my HTML page according to the number of results
> So I looked at the code in the NutchBean class, and more precisely at
> the method named:
>
> public Hits search(Query query, int numHits, int maxHitsPerDup, String
>     dedupField, String sortField, boolean reverse)
> I tried to understand the code, but it leads to the following questions:
>
> 1. We initiate a search to get a list of hits. We go through this list
> in a loop; however, we regenerate the list with some terms prohibited
> when needed. I don't really understand why we do this inside the loop:
> since hits.getTotal() might then be different, couldn't the loop miss
> some items? Is there any reason?
The purpose of this code is to eliminate excessive numbers of hits
coming from the same site - otherwise you would get pages and pages of
hits from the same site. The way we do this is to first search for raw
results. Then we go through this list, retrieving hits.getLength()
results at a time, and collect at most maxHitsPerDup results from each
site.

When we reach hits.getLength() we know that we have to search again
anyway, so it makes sense to add the sites that we have already
processed to the query as prohibited clauses. Since we are only adding
prohibited terms to the query, the total can only become smaller than
the original value.

Then we go through the results again, skipping those that we have
already collected. This process repeats as many times as needed to
collect enough results to fill the current page.
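The collection step described above can be sketched roughly like this. This is illustrative Java only, not the actual NutchBean source; the class name, the `collect` method, and the `{site, url}` representation of a hit are all my own simplifications:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rough sketch of the per-site collection loop described above.
// NOT the actual NutchBean code; names and types are illustrative.
public class DedupSketch {

  // Keep at most maxHitsPerDup results per site, up to numHits in total.
  // Each raw hit is modeled here as a {site, url} pair.
  public static List<String> collect(List<String[]> rawHits,
                                     int numHits, int maxHitsPerDup) {
    Map<String, Integer> perSite = new HashMap<>();
    List<String> page = new ArrayList<>();
    for (String[] hit : rawHits) {
      String site = hit[0];
      int seen = perSite.getOrDefault(site, 0);
      if (seen >= maxHitsPerDup) continue;   // too many hits from this site
      perSite.put(site, seen + 1);
      page.add(hit[1]);
      if (page.size() >= numHits) break;     // current page is full
    }
    return page;
  }
}
```

If the raw batch runs out before the page is full, the real code re-searches with the already-seen sites prohibited, as described above.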
> 2. We optimize the request by calling optQuery.addProhibitedTerm with
> each dedupValue found. However, it looks like we are limited to 20
> dedupValues. Why do we have this limitation? What happens if we have
> more dedupValues to exclude?
This is a performance-related optimization. Beyond that limit the raw
Lucene query becomes so long that it would take a lot of time to
process. Lucene itself limits the number of clauses in a BooleanQuery
(to 1024 by default), for much the same reason.
> 3. numHitsRaw = (int)(numHitsRaw * rawHitsFactor); is executed inside
> the loop before each new search with the extra prohibited terms. It
> means we can end up doing a search where numHitsRaw is multiplied by 20
> on every pass. This is a huge request, which causes an OutOfMemoryError
> on my webserver. Is this normal? Why do we need such a large number of
> results?
This is not exponential, it's multiplicative: numHitsRaw is multiplied
by rawHitsFactor on each pass. Why do you use rawHitsFactor == 20? The
default value is 2.0.
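To see the difference concretely, here is a sketch of how numHitsRaw grows across extra passes, assuming the quoted line runs once per pass (the class and method names are mine):

```java
// Sketch: numHitsRaw is multiplied by rawHitsFactor on each extra pass,
// so after k passes it is roughly numHitsRaw * factor^k - modest growth
// for the default 2.0, explosive growth for 20.
public class RawHitsGrowth {
  public static int grow(int numHitsRaw, double rawHitsFactor, int passes) {
    for (int i = 0; i < passes; i++) {
      numHitsRaw = (int) (numHitsRaw * rawHitsFactor);
    }
    return numHitsRaw;
  }
}
```

With the default factor of 2.0, three extra passes turn 10 raw hits into a request for 80; with a factor of 20, the same three passes already request 80,000, which would explain the memory pressure you are seeing.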
> 4. getTotalHits is obviously not the exact number of results that we
> have. This number corresponds to the number of results without our
> filter on the dedup value. But how can I know how many results are
> available (after filtering on the dedup value), to build paging on my
> HTML page? Maybe we need to add another variable to the Hits object to
> hold the filtered total, don't we?
Well, try the following experiment on Google: search for "blurfl", which
should yield "about 5,150 results", and then try to retrieve the 60th
page:

http://www.google.com/search?q=blurfl&hl=en&start=600&sa=N&filter=0

You will notice that the result actually comes from the 57th page.

I guess my point is that the number of pages is not an exact value,
because for performance reasons you want to avoid counting exact
numbers. So I think that providing an approximate value should be good
enough.
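So for paging, a ceiling division over the approximate total is usually all you need; a minimal sketch (the class and method names are mine, and the "5,150" figure is just the example above):

```java
// Sketch: derive an approximate page count from the (approximate) raw
// total. Since the total is an estimate, later pages may come up short,
// exactly as in the Google example above.
public class Paging {
  public static int pageCount(long approxTotal, int hitsPerPage) {
    // Ceiling division: e.g. 5150 hits at 10 per page -> 515 pages.
    return (int) ((approxTotal + hitsPerPage - 1) / hitsPerPage);
  }
}
```

When a user requests a page past the point where deduped results run out, just clamp to the last page that actually has results.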
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com