Stefan Groschupf wrote:
If you set up one thread per host, you have at most as many
connections to one host as you have boxes. In my case that is not
that many.
Anything more than one is not generally considered polite.
Also, it is a reproducible bug that the segment is always about half
the size you specify or expect based on your crawldb.
See my mailing list post.
I cannot reproduce this. I just now ran a crawl with depth=5, topN=100
and mapred.map.tasks=2, starting from a single URL. Segments (after the
first two) contain over 80 pages with a total of more than 300 pages
fetched.
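
A run with these settings might be launched roughly as follows. This is
only a sketch: the `urls` seed directory and the `crawl` output
directory are assumed names, and mapred.map.tasks=2 is assumed to be set
in the site configuration file rather than on the command line.

```shell
# Assumed layout: urls/ holds a text file with the single seed URL,
# and crawl/ is a fresh output directory (both names are placeholders).
# mapred.map.tasks=2 is assumed to be configured separately.
bin/nutch crawl urls -dir crawl -depth 5 -topN 100
```

With depth=5 and topN=100, the upper bound is 500 fetched pages; the
first rounds fall short of topN simply because a single seed URL does
not yet yield 100 candidate links.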
Doug