Hi Andrzej,

1. Even with a pretty broad area of interest, you wind up focusing on a subset of all domains, which means that the max-threads-per-host limit (for polite crawling) starts killing your efficiency.

The "policies" approach that I described is able to follow and distribute the scores along any links, not necessarily within the domain, so I think we could avoid this.

The issue here is that we score pages statically - so a page doesn't start with a score of 1.0, but rather with a score in the range 0.0...1.0.

We then divide this score among the (valid) outlinks, and sum these shares into the OPIC scores of the referenced pages.
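A rough sketch of that division step (class and method names here are illustrative, not the actual Nutch code):

```java
import java.util.*;

class OpicScorer {
    // Hypothetical sketch: split a page's static score evenly among its
    // valid outlinks, accumulating each share into the target URL's
    // running OPIC total.
    static void distribute(double pageScore, List<String> outlinks,
                           Map<String, Double> opicScores) {
        if (outlinks.isEmpty()) return;
        double share = pageScore / outlinks.size();
        for (String url : outlinks) {
            // Sum the share into whatever score the URL already has.
            opicScores.merge(url, share, Double::sum);
        }
    }
}
```

Since the shares are summed across all referring pages, a URL with many inlinks accumulates a high score, which is what drives the ranking described below.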

When we use these scores to rank URLs to fetch, and we constrain (via topN) the number of URLs fetched each round (to help focus the search), we wind up with what looks like an exponential curve for URLs/domain - the top few domains wind up with most of the URLs.

We modified Nutch 0.7 to restrict the number of URLs coming from any one domain, as a percentage of the total URLs being fetched. I see that 0.8 has something similar, but it's a max-URLs-per-domain count, whereas I think a percentage makes more sense.
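The percentage-based cap could look something like this (a sketch, assuming the fetch list is already ranked by score and URLs are of the form http://host/...; not our actual patch):

```java
import java.util.*;

class DomainQuota {
    // Hypothetical sketch: cap the URLs taken from any one domain at a
    // fixed fraction of the total fetch list, rather than a flat count.
    static List<String> applyQuota(List<String> rankedUrls, double maxFraction) {
        int cap = Math.max(1, (int) (rankedUrls.size() * maxFraction));
        Map<String, Integer> perDomain = new HashMap<>();
        List<String> result = new ArrayList<>();
        for (String url : rankedUrls) {
            // Crude host extraction: "http://host/path" -> "host".
            String domain = url.split("/")[2];
            int count = perDomain.getOrDefault(domain, 0);
            if (count < cap) {
                perDomain.put(domain, count + 1);
                result.add(url);
            }
        }
        return result;
    }
}
```

Because the cap scales with the fetch list size, the same config value behaves sensibly whether topN is 10K or 10M, which a flat max-URLs-per-domain number doesn't.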

We also had to add "kill the fetch" support. We monitor the fetch threads, and when the ratio of active (fetching) threads to inactive (blocked) threads drops below a threshold, we terminate the fetch. This compensates for cases where a popular site is also a low-performing one.
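The kill decision itself is just a ratio check, something like (names illustrative, not the actual patch):

```java
class FetchMonitor {
    // Hypothetical sketch: abort the fetch once the ratio of
    // actively-fetching threads to blocked threads falls below a
    // configured minimum.
    static boolean shouldKill(int activeThreads, int blockedThreads,
                              double minRatio) {
        if (blockedThreads == 0) return false; // nothing blocked, keep going
        return (double) activeThreads / blockedThreads < minRatio;
    }
}
```

A monitor thread would poll the fetcher's thread counts periodically and terminate the round when this returns true, so a few slow-but-popular hosts can't stall the whole fetch.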

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
