[ http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363886 ]
Florent Gluck commented on NUTCH-136:
-------------------------------------

On my setup of 5 boxes (4 slaves, 1 master), I confirm that what Dominik Friedrich suggested fixes the missing urls I've been encountering for a while. I simply moved the following properties from nutch-site.xml to mapred-default.xml:

<property>
  <name>mapred.map.tasks</name>
  <value>100</value>
  <description>The default number of map tasks per job. Typically set
  to a prime several times greater than the number of available hosts.
  Ignored when mapred.job.tracker is "local".
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>40</value>
  <description>The default number of reduce tasks per job. Typically set
  to a prime close to the number of available hosts.
  Ignored when mapred.job.tracker is "local".
  </description>
</property>

After injecting 100'000 urls and doing a single-pass crawl, I grepped the logs on my 4 slaves and confirmed that the sum of all the fetching attempts adds up to exactly 100'000. Therefore, there is no need to modify Generator.java. I also ran some tests with protocol-http and protocol-httpclient and verified that they give similar results: no missing urls in either case.

--Florent

> mapreduce segment generator generates 50 % fewer urls than expected
> --------------------------------------------------------------------
>
>          Key: NUTCH-136
>          URL: http://issues.apache.org/jira/browse/NUTCH-136
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>     Priority: Critical
>
> We noticed that segments generated with the map-reduce segment generator
> contain only 50 % of the expected urls. We had a crawldb with 40,000 urls
> and the generate command created only a 20,000-page segment. This also
> happened with the topN parameter; we always got around 50 % of the
> expected urls.
> I tested PartitionUrlByHost and it looks like it does its work. However,
> we fixed the problem by changing two things:
> First, we set the partitioner to a normal HashPartitioner.
> Second, we changed Generator.java line 48:
>     limit = job.getLong("crawl.topN", Long.MAX_VALUE) / job.getNumReduceTasks();
> to:
>     limit = job.getLong("crawl.topN", Long.MAX_VALUE);
> Now it works as expected.
> Does anyone have an idea what the real source of this problem could be?
> In general, this bug has the effect that all map-reduce users fetch only
> 50 % of their urls per iteration.
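For reference, here is a minimal, self-contained sketch of why the per-reducer limit in Generator.java can drop urls when PartitionUrlByHost hands the reducers uneven shares. The url distribution and task count below are invented for illustration; this is not Nutch code:

    // Toy model of the per-reducer limit computed in Generator.java line 48.
    // NOTE: the url distribution and task count are hypothetical.
    public class LimitSketch {
        public static void main(String[] args) {
            long topN = 40000;                  // crawl.topN
            int numReduceTasks = 4;             // mapred.reduce.tasks
            long limit = topN / numReduceTasks; // 10000 urls allowed per reducer

            // Host-based partitioning can be skewed: a few big hosts
            // concentrate most of the urls in one reducer's input.
            long[] urlsPerReducer = {25000, 9000, 4000, 2000}; // 40000 total

            long generated = 0;
            for (long n : urlsPerReducer) {
                generated += Math.min(n, limit); // each reducer stops at 'limit'
            }
            // Prints "generated = 25000 of 40000": the overloaded reducer is
            // cut off at 10000, and the spare capacity of the others is wasted.
            System.out.println("generated = " + generated + " of " + topN);
        }
    }

This would also make both observations consistent: a plain HashPartitioner spreads urls near-uniformly, so the division by getNumReduceTasks() becomes harmless, and moving mapred.reduce.tasks into mapred-default.xml presumably ensures getNumReduceTasks() sees the real task count rather than a mismatched default.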
