Stephan,

Thanks for your input, I'm glad to see I'm not the only one :)

> Change the fetcher to HashPartitioner; see the job setup, where the
> URL host partitioner is actually used.

There are several references to PartitionUrlByHost in Generator.java:
- "private Partitioner hostPartitioner = new PartitionUrlByHost();" in
the members declaration
- in the "getPartition" method
- "job.setPartitionerClass(PartitionUrlByHost.class);" in the generate
method
Do I only need to change the last line to use HashPartitioner.class,
or do I need to modify the other two references as well?
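
For what it's worth, here is a minimal stand-alone sketch of the contract a hash-based partitioner follows (this mirrors the formula Hadoop's HashPartitioner uses; the class name and main() harness below are illustrative, not actual Nutch or Hadoop code):

```java
// Minimal stand-in for a hash-based partitioner: the partition is
// derived purely from the key's hash, ignoring the URL's host.
public class HashPartitionSketch {

    // Same formula Hadoop's HashPartitioner uses: mask off the sign
    // bit so the result is non-negative, then take the modulus.
    static int getPartition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://a.example/page1",
            "http://a.example/page2",
            "http://b.example/index",
        };
        // With 6 reduce tasks (as in the setup below), each url lands
        // in some partition 0..5, independent of its host.
        for (String url : urls) {
            System.out.println(url + " -> partition " + getPartition(url, 6));
        }
    }
}
```
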

> Then also assign the case-insensitive content properties patch to
> 0.8. You may need to change 3 other classes (e.g. the fetcher) since
> the patch is for 0.7.

I'm not sure I understand what I need to do... Do I need to modify 3
other classes?
Was 0.7 prone to this bug as well, and has it been fixed there? If so,
would I need to port the fix to 0.8?

> After that I was able to get at least an 80-90% success rate running
> a 2 million page fetch. Actually, I only have the problem that the
> reduce tasks hang somehow, as discussed in the user list.

That's much better than what I'm getting right now.  However, it's
still not 100%, and fetching all the urls would mean implementing some
sort of iterative process until they are all finally fetched.
Do you have any idea why we are still missing 10 to 20%?

Thanks,
--Flo

>
> Stefan
>
>
> Am 14.12.2005 um 20:39 schrieb Florent Gluck:
>
>> When doing a one-pass crawl, I noticed that when I inject more than
>> ~16000 urls, the fetcher only fetches a subset of the set initially
>> injected.
>> I use 1 master and 3 slaves with the following properties:
>> mapred.map.tasks = 30
>> mapred.reduce.tasks = 6
>> generate.max.per.host = -1
>>
>> I tried injecting different amounts of urls to see around what
>> threshold I start to see some missing ones.  Here are the results of
>> my tests so far:
>>
>> #urls
>> 15000 and below: 100% fetched
>> 16000: 15998 fetched (~100%)
>> 25000: 21379 fetched (86%)
>> 50000: 26565 fetched (53%)
>> 100000: 22088 fetched (22%)
>>
>> After seeing bug NUTCH-136 ("mapreduce segment generator generates
>> 50 % less than excepted urls"), I thought it might fix my problem.  I
>> only applied the 2nd change mentioned in the description (the change
>> in Generator.java, line 48), since I didn't know how to set the
>> partitioner to use a normal HashPartitioner.  The fix didn't make any
>> difference.
>>
>> Then I started debugging the generator to see if all the urls were
>> generated.  I confirmed they all were (did a check w/ 50k), so the
>> problem lies further down the pipeline.  I assume it's somewhere in
>> the fetcher, but I'm not sure where yet.  I'm gonna keep investigating.
>>
>> Has anyone encountered a similar issue?
>> I've read messages from people crawling millions of pages and I wonder
>> why I seem to be the only one having this issue.  I'm apparently unable
>> to fetch more than ~30k pages even though I inject 1 million urls.
>>
>> Any help would be greatly appreciated.
>>
>> Thanks,
>> --Flo
>>
>
> ---------------------------------------------------------------
> company:        http://www.media-style.com
> forum:        http://www.text-mining.org
> blog:            http://www.find23.net
>
>
>
