Flo,
I had the same problem!
Change the fetcher to use the hash partitioner; see the job setup, where the URL host partitioner is actually used. Then also apply the case-insensitive content properties patch to 0.8. You may need to change 3 other classes (e.g. the fetcher), since the patch is for 0.7. After that I was able to get at least an 80-90% success rate running a 2 million page fetch. Now the only problem I still have is that the reduce tasks hang somehow, as discussed on the user list.
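
To illustrate the difference between the two partitioning strategies, here is a minimal standalone sketch. It is not Nutch code: the function names, the CRC32 hash, and the sample URLs are all assumptions made for the example. It only shows how partitioning by host sends every URL of one host to the same partition, while hashing the full URL spreads them out.

```python
from collections import Counter
from urllib.parse import urlparse
from zlib import crc32

# Hypothetical helpers for illustration only; these are not Nutch classes.

def partition_by_host(url, num_partitions):
    """Pick a partition from the host name, so one host maps to one partition."""
    host = urlparse(url).netloc
    return crc32(host.encode("utf-8")) % num_partitions

def partition_by_hash(url, num_partitions):
    """Pick a partition from the full URL, ignoring which host it belongs to."""
    return crc32(url.encode("utf-8")) % num_partitions

def partition_sizes(urls, partitioner, num_partitions):
    """Count how many URLs land in each partition."""
    counts = Counter(partitioner(u, num_partitions) for u in urls)
    return [counts.get(p, 0) for p in range(num_partitions)]

# Made-up input: one large host plus a handful of small ones.
urls = [f"http://example.com/page{i}" for i in range(1000)]
urls += [f"http://site{i}.example.org/" for i in range(20)]

by_host = partition_sizes(urls, partition_by_host, 6)
by_hash = partition_sizes(urls, partition_by_hash, 6)
print("by host:", by_host)
print("by hash:", by_hash)
```

With 6 partitions (matching mapred.reduce.tasks = 6 from the original message), the by-host run puts all 1000 pages of the large host into a single partition, whereas the by-hash run distributes them across all six.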

Stefan


On 14.12.2005 at 20:39, Florent Gluck wrote:

When doing a one-pass crawl, I noticed that when I inject more than
~16000 urls, the fetcher only fetches a subset of the set initially
injected.
I use 1 master and 3 slaves with the following properties:
mapred.map.tasks = 30
mapred.reduce.tasks = 6
generate.max.per.host = -1

I tried injecting different numbers of urls to see around what threshold I start to see some missing ones. Here are the results of my tests so far:

#urls
15000 and below: 100% fetched
16000: 15998 fetched (~100%)
25000: 21379 fetched (86%)
50000: 26565 fetched (53%)
100000: 22088 fetched (22%)

After having seen bug NUTCH-136 "mapreduce segment generator generates
50 % less than excepted urls", I thought it may fix my problem. I only
applied the 2nd change mentioned in the description (the change in
Generator.java, line 48) since I didn't know how to set the partition to
use a normal hashPartitioner.  The fix didn't make any difference.

Then I started debugging the generator to see if all the urls were
generated. I confirmed they were all generated (did a check w/ 50k), so
the problem lies further down the pipeline.  I assume it's somewhere in
the fetcher, but I'm not sure where yet. I'm gonna keep investigating.

Has anyone encountered a similar issue?
I read messages of people crawling millions of pages and I wonder why it
seems I'm the only one to have this issue.  I'm apparently unable to
fetch more than ~30k pages even though I inject 1 million urls.

Any help would be greatly appreciated.

Thanks,
--Flo


---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net

