Stefan Groschupf wrote:
If you set up one thread per host, you have at most as many
connections to one host as you have boxes. In my case, that is not
that many.
Anything more than one is not generally considered polite.
Also, it is a reproducible bug that the segment is every time
When doing a one-pass crawl, I noticed that when I inject more than
~16,000 URLs, the fetcher fetches only a subset of the URLs initially
injected.
I use 1 master and 3 slaves with the following properties:
mapred.map.tasks = 30
mapred.reduce.tasks = 6
generate.max.per.host = -1
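For reference, the properties above would typically be set in the crawl configuration file. A minimal sketch, assuming a Nutch 0.8-era layout where overrides live in conf/nutch-site.xml (the filename and surrounding element are assumptions, not confirmed by this thread):

```xml
<!-- Sketch of conf/nutch-site.xml overrides (layout assumed) -->
<configuration>
  <property>
    <name>mapred.map.tasks</name>
    <value>30</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>6</value>
  </property>
  <property>
    <!-- -1 disables the per-host URL cap during generation -->
    <name>generate.max.per.host</name>
    <value>-1</value>
  </property>
</configuration>
```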
I tried to inject
- job.setPartitionerClass(PartitionUrlByHost.class); in the generate
method
Yes, this line is the one you need to change. The other stuff can
stay as it is for now.
Do I only need to change the last line to use HashPartitioner.class,
or do I need to modify the other 2 references as well?
AWESOME !! =:)
Stefan Groschupf wrote:
So, with your patch, did you see 100% of urls *attempting* a fetch ?
100% ! :-)
Stefan Groschupf wrote:
- job.setPartitionerClass(PartitionUrlByHost.class); in the generate
method
Yes, this line is the one you need to change. The other stuff can stay
as it is for now.
I don't recommend this change. It makes your crawler impolite, since
multiple tasks may reference the same host at the same time.
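To see why partitioning by host matters for politeness, here is a small self-contained sketch. It simulates the two strategies: keying the partition on the URL's host (PartitionUrlByHost-like behavior) versus on the full URL string (HashPartitioner-like behavior). The class name, the host-extraction helper, and the hashing scheme are illustrative assumptions, not Nutch's actual implementation:

```java
import java.util.HashSet;
import java.util.Set;

public class PartitionDemo {
    // Hypothetical HashPartitioner-like behavior: partition by the full URL.
    static int byUrl(String url, int numTasks) {
        return (url.hashCode() & Integer.MAX_VALUE) % numTasks;
    }

    // Hypothetical PartitionUrlByHost-like behavior: partition by host only.
    static int byHost(String url, int numTasks) {
        // Crude host extraction for illustration; real code would parse the URL.
        String host = url.replaceFirst("^https?://", "").split("/")[0];
        return (host.hashCode() & Integer.MAX_VALUE) % numTasks;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/a",
            "http://example.com/b",
            "http://example.com/c"
        };
        Set<Integer> hostParts = new HashSet<>();
        Set<Integer> urlParts = new HashSet<>();
        for (String u : urls) {
            hostParts.add(byHost(u, 30));
            urlParts.add(byUrl(u, 30));
        }
        // Keyed by host, all same-host URLs land in one partition, so a
        // single fetch task owns that host and can throttle it politely.
        System.out.println("by host: " + hostParts.size() + " partition(s)");
        // Keyed by full URL, same-host URLs may scatter across partitions,
        // letting several tasks hit the host concurrently.
        System.out.println("by url:  " + urlParts.size() + " partition(s)");
    }
}
```

With host-based partitioning the crawler's per-host thread limit actually bounds the connections a host sees; with URL-based partitioning that guarantee disappears, which is the impoliteness being warned about above.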