[ http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363587 ]
Mike Smith commented on NUTCH-136:
----------------------------------

I have had the same problem. Florent suggested using "protocol-http" instead of "protocol-httpclient"; this fixed the problem on a single machine, but I still see the same problem when I have multiple datanodes using NDFS. Commenting out line 211 didn't help. Here are my results (in the multi-datanode runs, all machines participated in the fetching, judging by the task tracker logs on the three machines; "-" means the value was not recorded for that run):

Injected URLs   Datanodes   Map tasks   Reduce tasks   Threads   Fetched pages
80000           1           3           3              250       70000
80000           3           12          6              250       20000
5000            3           12          6              250       1200
1000            3           -           -              -         240
1000            1           3           3              250       800

Thanks, Mike

> mapreduce segment generator generates 50% fewer URLs than expected
> -------------------------------------------------------------------
>
>          Key: NUTCH-136
>          URL: http://issues.apache.org/jira/browse/NUTCH-136
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>     Priority: Critical
>
> We noticed that segments generated with the map-reduce segment generator
> contain only 50% of the expected URLs. We had a crawldb with 40,000 URLs,
> and the generate command created only a 20,000-page segment. This also
> happened with the topN parameter; we got around 50% of the expected URLs
> every time.
> I tested PartitionUrlByHost and it looks like it does its work. However,
> we fixed the problem by changing two things:
> First, we set the partitioner to a normal HashPartitioner.
> Second, we changed Generator.java line 48 from:
>   limit = job.getLong("crawl.topN", Long.MAX_VALUE) / job.getNumReduceTasks();
> to:
>   limit = job.getLong("crawl.topN", Long.MAX_VALUE);
> Now it works as expected.
> Does anyone have an idea what the real source of this problem might be?
> In general, this bug means that all map-reduce users fetch only 50% of
> their URLs per iteration.
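To see why dividing topN by the number of reduce tasks can undercount when URLs are partitioned by host, here is a minimal, self-contained sketch. This is not Nutch code; the class name, the reducer count, and the per-reducer candidate counts are invented for illustration only:

    // Sketch of the per-reducer limit arithmetic described above. Assumption:
    // URLs are partitioned by host, so reducers can receive very uneven
    // candidate counts; the counts below are hypothetical.
    public class TopNSplitDemo {
        public static void main(String[] args) {
            long topN = 1000;         // global budget, as in crawl.topN
            int numReduceTasks = 4;   // as returned by job.getNumReduceTasks()
            long perReducerLimit = topN / numReduceTasks;  // = 250, the Generator.java line 48 behaviour

            // Hypothetical candidate URLs arriving at each of the 4 reducers
            // after host-based partitioning (one big host dominates):
            long[] candidates = {900, 50, 30, 20};

            long generated = 0;
            for (long c : candidates) {
                generated += Math.min(c, perReducerLimit);  // each reducer caps at its own share
            }
            // Prints: requested topN = 1000, generated = 350
            System.out.println("requested topN = " + topN + ", generated = " + generated);
        }
    }

On these invented inputs, either of the quoted changes (an even HashPartitioner, or dropping the division by the number of reduce tasks) would let all 1000 candidates through, which is consistent with the behaviour the report describes.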
