[ http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363587 ]

Mike Smith commented on NUTCH-136:
----------------------------------

I have had the same problem. Florent suggested using "protocol-http" instead 
of "protocol-httpclient". This fixed the problem on a single machine, but I 
still have the same problem when I have multiple datanodes using NDFS. 
Commenting out line 211 didn't help. Here are my results:

Injected URLs: 80000
Only one machine is a datanode: 70000 fetched pages
map tasks: 3
reduce tasks: 3
threads: 250

Injected URLs: 80000
3 machines are datanodes. All machines participated in the fetching, according 
to the task tracker logs on the three machines: 20000 fetched pages
map tasks: 12
reduce tasks: 6
threads: 250

Injected URLs: 5000
3 machines are datanodes. All machines participated in the fetching, according 
to the task tracker logs on the three machines: 1200 fetched pages
map tasks: 12
reduce tasks: 6
threads: 250

Injected URLs: 1000
3 machines are datanodes. All machines participated in the fetching, according 
to the task tracker logs on the three machines: 240 fetched pages

Injected URLs: 1000
Only one machine is a datanode: 800 fetched pages
map tasks: 3
reduce tasks: 3
threads: 250
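
To make the runs above easier to compare, here is a minimal sketch that computes the fetched/injected ratio for each one. The class and method names (FetchRatios, percent) are made up for illustration; the numbers are the ones reported above, and nothing here touches Nutch itself.

```java
public class FetchRatios {
    /** Percentage of injected URLs that ended up fetched. */
    public static double percent(long fetched, long injected) {
        return 100.0 * fetched / injected;
    }

    public static void main(String[] args) {
        // Each row is {injected, fetched, datanodes}, copied from the runs above.
        long[][] runs = {
            {80000, 70000, 1},
            {80000, 20000, 3},
            {5000, 1200, 3},
            {1000, 240, 3},
            {1000, 800, 1},
        };
        for (long[] r : runs) {
            System.out.printf("datanodes=%d injected=%d fetched=%d ratio=%.1f%%%n",
                    r[2], r[0], r[1], percent(r[1], r[0]));
        }
    }
}
```

The single-datanode runs come out at 87.5 % and 80 %, while the three-datanode runs come out at 25 %, 24 % and 24 %.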

Thanks, Mike

> mapreduce segment generator generates  50 % less  than excepted urls
> --------------------------------------------------------------------
>
>          Key: NUTCH-136
>          URL: http://issues.apache.org/jira/browse/NUTCH-136
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>     Priority: Critical

>
> We noticed that segments generated with the map reduce segment generator 
> contain only 50 % of the expected urls. We had a crawldb with 40 000 urls, 
> and the generate command only created a 20 000 page segment. This also 
> happened with the topN parameter: every time we got around 50 % of the 
> expected urls.
> I tested the PartitionUrlByHost and it looks like it does its work. However, 
> we fixed the problem by changing two things:
> First, we set the partitioner to a normal hashPartitioner.
> Second, we changed Generator.java line 48:
> limit = job.getLong("crawl.topN",Long.MAX_VALUE)/job.getNumReduceTasks();
> to:
> limit = job.getLong("crawl.topN",Long.MAX_VALUE);
> Now it works as expected. 
> Does anyone have an idea what the real source of this problem could be?
> In general, this bug has the effect that all map reduce users fetch only 50 
> % of their urls per iteration.
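
The arithmetic behind the two variants of the limit quoted above can be sketched as follows. This is only an illustration, not a diagnosis: it assumes the limit is applied independently in each reduce partition (which the division by getNumReduceTasks() suggests, but the report does not confirm), and the partition sizes are hypothetical numbers chosen to show a skewed split. The class and method names are made up.

```java
public class TopNLimitSketch {
    /** Total URLs selected when every partition is capped at perPartitionLimit. */
    public static long selected(long[] partitionSizes, long perPartitionLimit) {
        long total = 0;
        for (long size : partitionSizes) {
            total += Math.min(size, perPartitionLimit);
        }
        return total;
    }

    public static void main(String[] args) {
        long topN = 40000;
        // Hypothetical: 40000 urls split unevenly over 3 reduce partitions.
        long[] partitions = {30000, 8000, 2000};
        int numReduceTasks = partitions.length;

        // Original form: each partition may emit at most topN / numReduceTasks.
        long dividedLimit = topN / numReduceTasks;
        System.out.println("divided limit:   " + selected(partitions, dividedLimit));

        // Changed form: each partition may emit up to topN.
        System.out.println("undivided limit: " + selected(partitions, topN));
    }
}
```

Under these assumptions the divided limit yields 13333 + 8000 + 2000 = 23333 urls, while the undivided limit yields all 40000. Note that with the undivided limit each partition is capped at topN on its own, so on a crawldb larger than topN the total could exceed topN.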



