[ http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363587 ]
Mike Smith commented on NUTCH-136:
----------------------------------

I have had the same problem. Florent suggested using "protocol-http" instead of "protocol-httpclient"; that fixed the problem on a single machine, but I still see the same problem with multiple datanodes using NDFS. Commenting out line 211 didn't help. Here are my results:

Injected URLs: 80000; one machine as datanode; map tasks: 3, reduce tasks: 3, threads: 250 -> 70000 fetched pages

Injected URLs: 80000; 3 machines as datanodes, all participating in the fetch per the task tracker logs; map tasks: 12, reduce tasks: 6, threads: 250 -> 20000 fetched pages

Injected URLs: 5000; 3 machines as datanodes, all participating; map tasks: 12, reduce tasks: 6, threads: 250 -> 1200 fetched pages

Injected URLs: 1000; 3 machines as datanodes, all participating -> 240 fetched pages

Injected URLs: 1000; one machine as datanode; map tasks: 3, reduce tasks: 3, threads: 250 -> 800 fetched pages

Thanks,
Mike

> mapreduce segment generator generates 50% fewer URLs than expected
> -------------------------------------------------------------------
>
>          Key: NUTCH-136
>          URL: http://issues.apache.org/jira/browse/NUTCH-136
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>     Priority: Critical
>
> We noticed that segments generated with the map-reduce segment generator
> contain only 50% of the expected URLs. We had a crawldb with 40000 URLs,
> and the generate command created only a 20000-page segment. The same
> thing happened with the topN parameter; we got around 50% of the
> expected URLs every time.
> I tested PartitionUrlByHost and it appears to do its job. However, we
> fixed the problem by changing two things:
> First, we set the partitioner to a normal HashPartitioner.
> Second, we changed Generator.java line 48 from:
>   limit = job.getLong("crawl.topN", Long.MAX_VALUE) / job.getNumReduceTasks();
> to:
>   limit = job.getLong("crawl.topN", Long.MAX_VALUE);
> Now it works as expected.
> Does anyone have an idea what the real source of this problem might be?
> In general, this bug means that all map-reduce users fetch only 50% of
> their URLs per iteration.
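For what it's worth, the quoted workaround is consistent with one plausible mechanism: each reducer caps its output at topN / numReduceTasks, and a reducer whose partition holds fewer URLs than that quota cannot hand its unused share to a more heavily loaded reducer, so the total generated falls short of topN whenever host-based partitioning is skewed. Below is a minimal, self-contained Java simulation of that reading; the reduce-task count and per-partition URL counts are invented for illustration and are not taken from Nutch.

import java.util.Arrays;

// Hypothetical illustration (not Nutch source) of one plausible mechanism
// behind the workaround quoted above: a per-reducer cap of
// topN / numReduceTasks under-generates when PartitionUrlByHost spreads
// URLs unevenly, because lightly loaded reducers cannot redistribute
// their unused quota.
public class TopNSkewDemo {
    public static void main(String[] args) {
        long topN = 40000;          // crawldb size from the NUTCH-136 report
        int numReduceTasks = 4;     // invented for illustration
        long perReducerLimit = topN / numReduceTasks;   // 10000 per reducer

        // Invented skew: URL counts per partition after grouping by host.
        long[] urlsPerPartition = {30000, 6000, 3000, 1000};

        // Each reducer emits at most its quota, regardless of demand elsewhere.
        long generated = Arrays.stream(urlsPerPartition)
                .map(n -> Math.min(n, perReducerLimit))
                .sum();

        // With these numbers: 10000 + 6000 + 3000 + 1000 = 20000,
        // i.e. 50% of topN -- the shortfall described in the report.
        System.out.println("generated " + generated + " of topN=" + topN);
    }
}

Under this reading, either half of the workaround would help on its own: HashPartitioner evens out the partition sizes, and dropping the division removes the per-reducer quota entirely.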
