[ http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363194 ]
Dominik Friedrich commented on NUTCH-136:
-----------------------------------------

It took me some hours, but I finally solved the mystery. The problem is this line

  177   numLists = job.getNumMapTasks();   // a partition per fetch task

in combination with this one

  211   job.setNumReduceTasks(numLists);

and the fact that nutch-site.xml overrides job.xml settings.

In my case, the box that runs the jobtracker, and from which I start the job, has map.tasks=12 and reduce.tasks=4 defined in its nutch-site.xml. On the other three boxes, nutch-site.xml defines neither map.tasks nor reduce.tasks.

When the second job of the generator tool is started, the jobtracker creates only 4 reduce tasks, because reduce.tasks=4 in nutch-site.xml overrides the job.xml on this box. But the map tasks on the other three boxes read 12 reduce tasks from the job.xml, so they create 12 partitions. When the 4 reduce tasks are started, they read only the data from partitions 0-3 on those three boxes, so 3*8 partitions get lost.

I solved this problem by removing line 211.

> mapreduce segment generator generates 50 % less than excepted urls
> --------------------------------------------------------------------
>
>          Key: NUTCH-136
>          URL: http://issues.apache.org/jira/browse/NUTCH-136
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>     Priority: Critical
>
> We noticed that segments generated with the map reduce segment generator
> contain only 50 % of the expected urls. We had a crawldb with 40 000 urls
> and the generate command created only a 20 000 pages segment. This also
> happened with the topN parameter; every time we got around 50 % of the
> expected urls.
> I tested the PartitionUrlByHost and it looks like it does its work. However,
> we fixed the problem by changing two things:
> First we set the partitioner to a normal hashPartitioner.
> Second we changed Generator.java line 48:
>   limit = job.getLong("crawl.topN",Long.MAX_VALUE)/job.getNumReduceTasks();
> to:
>   limit = job.getLong("crawl.topN",Long.MAX_VALUE);
> Now it works as expected.
> Has anyone an idea what the real source of this problem can be?
> In general this bug has the effect that all map reduce users fetch only 50
> % of their urls per iteration.

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
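
The partition loss described in the comment can be illustrated with a small, self-contained sketch. It is not Nutch or Hadoop code: the class PartitionLossSketch, its methods, and the use of key % partitions as a stand-in for a hash partitioner are assumptions made purely for illustration. It only models the arithmetic of map tasks writing 12 partitions while 4 reduce tasks read partitions 0-3.

```java
// Hypothetical illustration only; none of these names exist in Nutch or Hadoop.
final class PartitionLossSketch {

    // Partition files that no reduce task ever reads, summed over all map-side
    // nodes that wrote more partitions than there are reduce tasks.
    static int lostPartitions(int mapNodes, int partitionsPerMapNode, int reduceTasks) {
        int unreadPerNode = Math.max(0, partitionsPerMapNode - reduceTasks);
        return mapNodes * unreadPerNode;
    }

    // Counts how many of `records` keys land in a partition that some reduce
    // task actually reads. key % mapSidePartitions stands in for hashing.
    static long recordsRead(int records, int mapSidePartitions, int reduceTasks) {
        long read = 0;
        for (int key = 0; key < records; key++) {
            int partition = key % mapSidePartitions;
            if (partition < reduceTasks) {
                read++; // reduce task number `partition` exists and fetches this record
            }
        }
        return read;
    }

    public static void main(String[] args) {
        // Three boxes see 12 reduce tasks in job.xml; the jobtracker starts only 4.
        System.out.println("lost partitions: " + lostPartitions(3, 12, 4));
        System.out.println("records read of 12000: " + recordsRead(12000, 12, 4));
    }
}
```

With the numbers from the comment (3 boxes, 12 partitions, 4 reduce tasks) the sketch reports 24 unread partitions, matching the 3*8 above. The exact share of urls lost in a real run also depends on how the partitioner distributes keys and on the data written by the jobtracker box itself, which this model ignores.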
