Hi all,

I was wondering if you could help me understand the behaviour of the Nutch 
crawler a little better.

Let's say I start 2 separate Nutch crawlers on the exact same day at the exact 
same time and carry out a whole-web crawl of the same 125 hosts/seed URLs to a 
depth of 5, where db.max.outlinks.per.page is set to its default value of 100 
on both machines. Will the output be the same/similar?
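
For reference, this is the sort of setup I mean: the property override in 
nutch-site.xml plus a 1.x-style crawl command (the directory names "urls" and 
"crawl" are just placeholders):

  <!-- nutch-site.xml: same value as the default in nutch-default.xml -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>100</value>
  </property>

  # seed list in urls/, crawl to depth 5, run identically on each machine
  bin/nutch crawl urls -dir crawl -depth 5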

Aside from the fact that some pages may not be accessible due to HTTP errors, 
will anything built into Nutch affect the output? For example, will the 
db.max.outlinks.per.page property affect anything? As far as I'm aware, this 
property means that at most 100 outlinks will be processed per page, 
regardless of how many outlinks were originally extracted from the page. As 
long as these outlinks are processed in the same order, the output will be the 
same; but are they processed in a random fashion? And what about the fact that 
Nutch randomizes its fetchlists? Could that cause differences between crawls?
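
For what it's worth, my mental model of the outlink cap is something like the 
Java sketch below. This is a hypothetical simplification, not Nutch's actual 
code; whether the real implementation keeps the first 100 outlinks in document 
order (deterministic) or some random subset is exactly what I'm unsure about:

  import java.util.Arrays;

  public class OutlinkCap {
      // Keep at most maxOutlinks links, in extraction order. If Nutch
      // truncates deterministically like this, two identical crawls should
      // see the same outlinks per page; if the subset were randomized,
      // the two crawl databases could diverge.
      static String[] capOutlinks(String[] extracted, int maxOutlinks) {
          if (extracted.length <= maxOutlinks) {
              return extracted;
          }
          return Arrays.copyOf(extracted, maxOutlinks);
      }
  }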

Thanks,

Karen
