[ 
http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363194 ] 

Dominik Friedrich commented on NUTCH-136:
-----------------------------------------

It took me some hours, but I finally solved the mystery. The problem is this line
177    numLists = job.getNumMapTasks();            // a partition per fetch task
in combination with this
211    job.setNumReduceTasks(numLists);
and the fact that nutch-site.xml overrides job.xml settings. 
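
Roughly, the override behaves like this (a hypothetical sketch using plain
java.util.Properties, not the actual Nutch configuration code; the property
name mirrors the reduce.tasks setting discussed below):

    // Hypothetical sketch of the override order, not Nutch code:
    // a value set in nutch-site.xml wins over the one shipped in job.xml.
    import java.util.Properties;

    public class OverrideSketch {
        public static void main(String[] args) {
            Properties jobXml = new Properties();          // settings shipped in job.xml
            jobXml.setProperty("reduce.tasks", "12");

            Properties effective = new Properties(jobXml); // job.xml acts as defaults
            effective.setProperty("reduce.tasks", "4");    // local nutch-site.xml wins

            System.out.println(effective.getProperty("reduce.tasks")); // prints 4
        }
    }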

In my case, the box that runs the jobtracker and from which I start the job has
map.tasks=12 and reduce.tasks=4 defined in nutch-site.xml. The other three
boxes have no map.tasks or reduce.tasks in their nutch-site.xml. When the
second job of the generator tool is started, the jobtracker creates only 4
reduce tasks, because reduce.tasks=4 in nutch-site.xml overrides job.xml on
that box. But the map tasks on the other 3 boxes read 12 reduce tasks from
job.xml, so they each create 12 partitions. When the 4 reduce tasks run, they
only read the data from partitions 0-3 on those 3 boxes, so 3*8 partitions
get lost.
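
The arithmetic of the loss, as a small self-contained sketch (my numbers from
above; this is an illustration, not Nutch code):

    // Hypothetical sketch of the partition loss, not Nutch code: map tasks
    // on 3 boxes each write 12 partitions, but only reducers 0-3 are started,
    // so partitions 4-11 on those boxes are never read.
    public class PartitionLoss {
        public static void main(String[] args) {
            int boxesWithoutOverride = 3;  // boxes whose maps read job.xml
            int partitionsPerMap = 12;     // reduce.tasks as seen by the maps
            int reducersStarted = 4;       // reduce.tasks as seen by the jobtracker
            int lost = boxesWithoutOverride * (partitionsPerMap - reducersStarted);
            System.out.println("partitions lost: " + lost);  // 3 * 8 = 24
        }
    }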

I solved this problem by removing line 211.
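
For clarity, the relevant lines after the change (a sketch based only on the
0.8-dev lines quoted above; the surrounding code is elided):

    // Generator.java, 0.8-dev (sketch; only the quoted lines are shown):
    numLists = job.getNumMapTasks();      // line 177: a partition per fetch task
    // ...
    // job.setNumReduceTasks(numLists);   // line 211: removed, so the job keeps
    //                                    // the cluster-wide reduce task count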

> mapreduce segment generator generates 50% fewer urls than expected
> ------------------------------------------------------------------
>
>          Key: NUTCH-136
>          URL: http://issues.apache.org/jira/browse/NUTCH-136
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>     Priority: Critical

>
> We noticed that segments generated with the MapReduce segment generator 
> contain only 50% of the expected urls. We had a crawldb with 40,000 urls, 
> and the generate command only created a segment of 20,000 pages. This also 
> happened with the topN parameter; we always got around 50% of the 
> expected urls.
> I tested the PartitionUrlByHost and it looks like it does its job. However, 
> we fixed the problem by changing two things:
> First, we set the partitioner to a normal HashPartitioner.
> Second, we changed Generator.java line 48:
> limit = job.getLong("crawl.topN",Long.MAX_VALUE)/job.getNumReduceTasks();
> to:
> limit = job.getLong("crawl.topN",Long.MAX_VALUE);
> Now it works as expected. 
> Does anyone have an idea what the real source of this problem could be?
> In general, this bug means that all MapReduce users fetch only 50% 
> of their urls per iteration.
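
A quick sanity check of the reported numbers (hypothetical arithmetic, not
Nutch code; it assumes the config mismatch drops about half of the
partitioned data):

    // Hypothetical arithmetic for the reported 50% shortfall, not Nutch code.
    public class TopNArithmetic {
        public static void main(String[] args) {
            long topN = 40000L;                // crawl.topN from the report
            int reduceTasks = 4;               // example value, site-specific
            long perReducerLimit = topN / reduceTasks;  // Generator.java line 48
            double survivingFraction = 0.5;    // assumed loss from the mismatch
            long generated =
                (long) (reduceTasks * perReducerLimit * survivingFraction);
            System.out.println("urls generated: " + generated);  // ~20,000
        }
    }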

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


