[ http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363886 ]
Florent Gluck commented on NUTCH-136:
-------------------------------------

On my setup of 5 boxes (4 slaves, 1 master), I confirm that what Dominik Friedrich suggested fixes the missing URLs I had been encountering for a while. I simply moved the following properties from nutch-site.xml to mapred-default.xml:

<property>
  <name>mapred.map.tasks</name>
  <value>100</value>
  <description>The default number of map tasks per job. Typically set to a prime
  several times greater than the number of available hosts. Ignored when
  mapred.job.tracker is "local".
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>40</value>
  <description>The default number of reduce tasks per job. Typically set to a prime
  close to the number of available hosts. Ignored when mapred.job.tracker is
  "local".
  </description>
</property>

After injecting 100,000 URLs and doing a single-pass crawl, I grepped the logs on my 4 slaves and confirmed that the sum of all fetch attempts adds up to exactly 100,000. Therefore, there is no need to modify Generator.java.

I also ran some tests with protocol-http and protocol-httpclient and verified that they give similar results: no missing URLs in either case.

--Florent

> mapreduce segment generator generates 50% fewer urls than expected
> -------------------------------------------------------------------
>
>          Key: NUTCH-136
>          URL: http://issues.apache.org/jira/browse/NUTCH-136
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>     Priority: Critical
>
> We noticed that segments generated with the MapReduce segment generator
> contain only 50% of the expected URLs. We had a crawldb with 40,000 URLs,
> and the generate command only created a segment of 20,000 pages. This also
> happened with the topN parameter; we got around 50% of the expected URLs
> every time.
> I tested PartitionUrlByHost and it looks like it does its job. However, we
> fixed the problem by changing two things:
> First, we set the partitioner to a plain HashPartitioner.
> Second, we changed Generator.java line 48 from:
>   limit = job.getLong("crawl.topN", Long.MAX_VALUE) / job.getNumReduceTasks();
> to:
>   limit = job.getLong("crawl.topN", Long.MAX_VALUE);
> Now it works as expected.
> Does anyone have an idea what the real source of this problem might be?
> In general, this bug has the effect that all MapReduce users fetch only 50%
> of their URLs per iteration.
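
As a footnote to the quoted report: below is a minimal, self-contained Java sketch, not actual Nutch code, illustrating why dividing topN by getNumReduceTasks() can under-generate URLs when input is partitioned by host (as PartitionUrlByHost does). The reducer count, host names, and URL counts are invented for the example.

import java.util.HashMap;
import java.util.Map;

public class GeneratorLimitDemo {
    public static void main(String[] args) {
        int numReduceTasks = 4;
        long topN = 1000;
        // Per-reducer cap, as computed on Generator.java line 48.
        long perReducerLimit = topN / numReduceTasks; // 250

        // Simulate 1,000 URLs, heavily skewed toward one host, grouped into
        // partitions by host hash the way a host-based partitioner would.
        String[] hosts = {"big-site.example", "a.example", "b.example", "c.example"};
        long[] urlCounts = {700, 100, 100, 100};
        Map<Integer, Long> urlsPerPartition = new HashMap<>();
        for (int i = 0; i < hosts.length; i++) {
            int partition = (hosts[i].hashCode() & Integer.MAX_VALUE) % numReduceTasks;
            urlsPerPartition.merge(partition, urlCounts[i], Long::sum);
        }

        // Each reducer emits at most perReducerLimit URLs; spare quota in a
        // lightly loaded reducer cannot absorb another reducer's overflow.
        long emitted = 0;
        for (long count : urlsPerPartition.values()) {
            emitted += Math.min(count, perReducerLimit);
        }
        System.out.println("topN requested: " + topN + ", generated: " + emitted);
    }
}

With this skew the job generates well under the requested topN even though the crawldb holds plenty of URLs, which would also be consistent with the reporter's observation that switching to a plain HashPartitioner (which spreads URLs nearly evenly across reducers) made the shortfall disappear.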
