[EMAIL PROTECTED] wrote:
Also, on the subject of tuning for speed, I am confused about the relevance of
the "-numFetchers n" flag in the "generate" command. I understand that it
causes that "n" segments to be created, but, when using mapred, does the
"fetch" command then understand that it should allocate one fetcher per
segment?

In 0.8 this determines the number of input directories that will be generated in each segment and, consequently, the number of map tasks when fetching. URLs are hashed into these directories by host, so the directories are hostwise disjoint: no host appears in more than one of them.
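To illustrate the idea of hostwise-disjoint fetchlists, here is a minimal Python sketch. It is not Nutch's actual partitioning code: the function name `partition_by_host`, the use of CRC32 as the hash, and the example URLs are all assumptions made for illustration.

```python
from urllib.parse import urlparse
import zlib


def partition_by_host(urls, num_fetchers):
    """Split urls into num_fetchers lists so each host lands in exactly one list.

    Hypothetical helper for illustration only; Nutch's own hash may differ.
    """
    fetchlists = [[] for _ in range(num_fetchers)]
    for url in urls:
        host = urlparse(url).hostname or ""
        # CRC32 is deterministic across runs, so the same host always
        # maps to the same fetchlist index.
        index = zlib.crc32(host.encode("utf-8")) % num_fetchers
        fetchlists[index].append(url)
    return fetchlists


if __name__ == "__main__":
    example_urls = [
        "http://a.example/1",
        "http://a.example/2",
        "http://b.example/1",
        "http://c.example/x",
    ]
    for i, fetchlist in enumerate(partition_by_host(example_urls, 3)):
        print(i, fetchlist)
```

Because the list index depends only on the host, all URLs for one host end up in the same fetchlist, which is what lets each fetch task enforce politeness on its own.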

If so, is the benefit -

- resilience so that failed fetches can be re-started individually

- performance; or

Both of these. Multiple fetchlists can be fetched in parallel, and any that crash can be restarted. But if you use too many fetchlists and don't have very many unique hosts, each one will be limited by politeness: if a task has URLs from only 2 hosts and waits a second between accesses to the same host, it can fetch at most 2 pages/second.
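The politeness ceiling above is simple arithmetic, sketched here in Python. The function name `max_pages_per_second` is hypothetical, and the model is a simplification: it assumes one request at a time per host and ignores download time, so it gives an upper bound rather than a prediction.

```python
def max_pages_per_second(unique_hosts, delay_seconds):
    """Upper bound on one fetch task's rate under a per-host politeness delay.

    Assumes each host is hit at most once every delay_seconds and that
    download time is negligible (a simplifying assumption).
    """
    if delay_seconds <= 0:
        raise ValueError("delay_seconds must be positive")
    return unique_hosts / delay_seconds


if __name__ == "__main__":
    # The case from the reply: 2 hosts, 1 second between accesses.
    print(max_pages_per_second(2, 1.0))  # 2.0
```

Under this model, splitting a fixed set of hosts across more fetchlists lowers the ceiling for each individual task, since each task then sees fewer unique hosts.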

PS.  For completeness, the following is my nutch-site.xml.  mapred-site.xml is
an exact copy of it.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<!-- Do not modify this file directly.  Instead, copy entries that you -->
<!-- wish to modify from this file into nutch-site.xml and change them -->
<!-- there.  If nutch-site.xml does not already exist, create it.      -->

This comment is confusing in a nutch-site.xml...

<property>
  <name>mapred.map.tasks</name>
  <value>51</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is "local".
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>5</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is "local".
  </description>
</property>

These values should not be placed in nutch-site.xml, since that causes them to override job-specified values. They should instead be in mapred-default.xml, so that jobs can override them when needed. For example, the generate task manipulates the number of reduce tasks in order to generate the appropriate number of input directories for fetching (as described above).

<property>
  <name>searcher.max.hits</name>
  <value>200</value>
  <description>If positive, search stops after this many hits are
  found.  Setting this to small, positive values (e.g., 1000) can make
  searches much faster.  With a sorted index, the quality of the hits
  suffers little.</description>
</property>

Make sure you're using a sorted index if you set this. Otherwise the quality of your results could suffer greatly.

Doug


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
