Emmanuel wrote:
> My objectives are the following:
> - sort my results by score
> - limit the number of duplicates
I'm assuming here that you mean "results from the same site" and not
exact duplicates? De-duplication of exact duplicates is an offline
process in Nutch, handled by the DeleteDuplicates tool. Site-level
deduplication, on the other hand, is handled on the fly in NutchBean.
> - build paging within my HTML page according to the number of results
> So I looked at the code in the NutchBean class, and more precisely at
> the method named:
>
> public Hits search(Query query, int numHits, int maxHitsPerDup, String
>     dedupField, String sortField, boolean reverse)
> I tried to understand the code, but it leads to the following questions:
>
> 1. We initiate a search to get a list of hits. We go through this list
> in a loop; however, we regenerate the list with some terms prohibited
> when needed. I don't really understand why we do this inside the loop:
> since hits.getTotal() might then be different, couldn't the loop miss
> some items? Is there any reason?
The purpose of this code is to eliminate excessive numbers of hits
coming from the same site - otherwise you would get pages and pages of
hits from the same site. The way we do this is to first search for raw
results. Then we go through this list, retrieving hits.getLength()
results at a time, and collect at most maxHitsPerDup results from each
site.

When we reach hits.getLength() we know that we have to search again
anyway, so it makes sense to add the sites that we have already
processed to the query as prohibited clauses. Since we are only adding
prohibited terms to the query, the total can only become smaller than
the original value.

Then we go through the results again, skipping those that we have
already collected. This process repeats as many times as needed to
collect enough results to fill the current page.
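The collection step described above can be sketched roughly like this. This is illustrative Java only, not the actual NutchBean source; the class name, the `collect` method, and the `{site, url}` representation of a hit are all my own simplifications:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rough sketch of the per-site collection loop described above.
// NOT the actual NutchBean code; names and types are illustrative.
public class DedupSketch {

  // Keep at most maxHitsPerDup results per site, up to numHits in total.
  // Each raw hit is modeled here as a {site, url} pair.
  public static List<String> collect(List<String[]> rawHits,
                                     int numHits, int maxHitsPerDup) {
    Map<String, Integer> perSite = new HashMap<>();
    List<String> page = new ArrayList<>();
    for (String[] hit : rawHits) {
      String site = hit[0];
      int seen = perSite.getOrDefault(site, 0);
      if (seen >= maxHitsPerDup) continue;   // too many hits from this site
      perSite.put(site, seen + 1);
      page.add(hit[1]);
      if (page.size() >= numHits) break;     // current page is full
    }
    return page;
  }
}
```

If the raw batch runs out before the page is full, the real code re-searches with the already-seen sites prohibited, as described above.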
> 2. We optimize the request by calling optQuery.addProhibitedTerm with
> each dedupValue found. However, it looks like we are limited to 20
> dedupValues. Why do we have this limitation? What happens if we have
> more dedupValues to exclude?
This is a performance-related optimization. Beyond that limit the raw
Lucene query becomes so long that it would take a lot of time to
process. Lucene itself limits the number of clauses in a BooleanQuery
(to 1024 by default), for much the same reason.
> 3. numHitsRaw = (int)(numHitsRaw * rawHitsFactor); is executed inside
> the loop before each new search with the extra prohibited terms. It
> means we can end up doing a search where numHitsRaw is multiplied by 20
> on every pass. This is a huge request, which causes an OutOfMemoryError
> on my webserver. Is this normal? Why do we need such a large number of
> results?
This is not exponential, it's multiplicative: numHitsRaw is multiplied
by rawHitsFactor on each pass. Why do you use rawHitsFactor == 20? The
default value is 2.0.
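To see the difference concretely, here is a sketch of how numHitsRaw grows across extra passes, assuming the quoted line runs once per pass (the class and method names are mine):

```java
// Sketch: numHitsRaw is multiplied by rawHitsFactor on each extra pass,
// so after k passes it is roughly numHitsRaw * factor^k - modest growth
// for the default 2.0, explosive growth for 20.
public class RawHitsGrowth {
  public static int grow(int numHitsRaw, double rawHitsFactor, int passes) {
    for (int i = 0; i < passes; i++) {
      numHitsRaw = (int) (numHitsRaw * rawHitsFactor);
    }
    return numHitsRaw;
  }
}
```

With the default factor of 2.0, three extra passes turn 10 raw hits into a request for 80; with a factor of 20, the same three passes already request 80,000, which would explain the memory pressure you are seeing.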
> 4. getTotalHits is obviously not the exact number of results that we
> have. This number corresponds to the number of results without our
> filter on the dedup value. But how can I know how many results are
> available (after filtering on the dedup value), to build paging on my
> HTML page? Maybe we need to add another variable to the Hits object to
> hold the filtered total, don't we?
Well, try the following experiment on Google: search for "blurfl", which
should yield "about 5,150 results", and then try to retrieve the 60th
page:

http://www.google.com/search?q=blurfl&hl=en&start=600&sa=N&filter=0

You will notice that the result actually comes from the 57th page.

I guess my point is that the number of pages is not an exact value,
because for performance reasons you want to avoid counting exact
numbers. So I think that providing an approximate value should be good
enough.
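So for paging, a ceiling division over the approximate total is usually all you need; a minimal sketch (the class and method names are mine, and the "5,150" figure is just the example above):

```java
// Sketch: derive an approximate page count from the (approximate) raw
// total. Since the total is an estimate, later pages may come up short,
// exactly as in the Google example above.
public class Paging {
  public static int pageCount(long approxTotal, int hitsPerPage) {
    // Ceiling division: e.g. 5150 hits at 10 per page -> 515 pages.
    return (int) ((approxTotal + hitsPerPage - 1) / hitsPerPage);
  }
}
```

When a user requests a page past the point where deduped results run out, just clamp to the last page that actually has results.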
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com