Re: mapreduce fetcher doesn't fetch all urls

2005-12-15 Thread Doug Cutting
Stefan Groschupf wrote:
> In case you set up one thread per host, you have at most as many connections to one host as you have boxes. In my case that is not that many.
Anything more than one is not generally considered polite.
> Also it is a reproducible bug that the segment is every time

mapreduce fetcher doesn't fetch all urls

2005-12-14 Thread Florent Gluck
When doing a one-pass crawl, I noticed that when I inject more than ~16000 urls, the fetcher only fetches a subset of the set initially injected. I use 1 master and 3 slaves with the following properties:
mapred.map.tasks = 30
mapred.reduce.tasks = 6
generate.max.per.host = -1
I tried to inject

Re: mapreduce fetcher doesn't fetch all urls

2005-12-14 Thread Stefan Groschupf
> - job.setPartitionerClass(PartitionUrlByHost.class); in the generate method
Yes, this line is the one you need to change. The other stuff can stay as it is for now.
> Do I only need to change the last line to use HashPartitioner.class, or do I need to modify the other 2 references as well?

Re: mapreduce fetcher doesn't fetch all urls

2005-12-14 Thread Florent Gluck
AWESOME !! =:) Stefan Groschupf wrote: So, with your patch, did you see 100% of urls *attempting* a fetch? 100% ! :-)

Re: mapreduce fetcher doesn't fetch all urls

2005-12-14 Thread Doug Cutting
Stefan Groschupf wrote:
> - job.setPartitionerClass(PartitionUrlByHost.class); in the generate method
> yes, this line is the one you need to change. The other stuff can be as it is for now.
I don't recommend this change. It makes your crawler impolite, since multiple tasks may reference
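
The politeness concern raised here can be illustrated with a small standalone sketch. It is not the Nutch or Hadoop code: the class PartitionSketch and its methods byHost, byUrl and partitionsPerHost are hypothetical names made up for this example, and the partition count of 6 is arbitrary. It only assumes that a host-based partitioner keys on the host name while a hash partitioner keys on the whole URL, and shows how many partitions (and so, potentially, how many fetch tasks) end up holding URLs of a single host in each case.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; not taken from the Nutch source.
public class PartitionSketch {

    // Host-based partitioning: the partition depends only on the host name,
    // so every URL of a given host lands in the same partition.
    static int byHost(String url, int numPartitions) {
        String host = URI.create(url).getHost();
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Hash partitioning on the whole URL: URLs of one host can be
    // spread over several partitions.
    static int byUrl(String url, int numPartitions) {
        return (url.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // For each host, collect the set of partitions that hold at least one of its URLs.
    static Map<String, Set<Integer>> partitionsPerHost(
            List<String> urls, int numPartitions, boolean hostOnly) {
        Map<String, Set<Integer>> result = new TreeMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            int partition = hostOnly ? byHost(url, numPartitions) : byUrl(url, numPartitions);
            result.computeIfAbsent(host, h -> new TreeSet<>()).add(partition);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            urls.add("http://example.org/page" + i);
        }
        // One partition per host: at most one task talks to example.org.
        System.out.println("by host: " + partitionsPerHost(urls, 6, true));
        // Several partitions per host: several tasks may hit example.org at once.
        System.out.println("by url:  " + partitionsPerHost(urls, 6, false));
    }
}
```

Under these assumptions, the host-keyed variant keeps each host inside a single partition, which is what lets a per-host delay be enforced by one task, while the URL-keyed variant spreads the load more evenly but allows several tasks to contact the same host concurrently.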