[EMAIL PROTECTED] wrote:
Also, on the subject of tuning for speed, I am confused about the relevance of the "-numFetchers n" flag in the "generate" command. I understand that it causes "n" segments to be created, but, when using mapred, does the "fetch" command then understand that it should allocate one fetcher per segment?
In 0.8 this determines the number of input directories that will be generated in each segment, and, consequently, the number of map tasks when fetching. URLs are hashed into these directories by host, so that the directories are hostwise disjoint.
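To illustrate what "hostwise disjoint" means here, the following is a minimal sketch in Python. It is not Nutch's actual partitioning code: the function names, the use of MD5, and the sample URLs are assumptions made for the example. The only point is that the fetchlist index depends on the host alone, so no two fetchlists share a host.

```python
import hashlib
from urllib.parse import urlsplit


def assign_fetchlist(url: str, num_fetchers: int) -> int:
    """Pick a fetchlist index from the URL's host only (hypothetical helper)."""
    host = urlsplit(url).hostname or ""
    # A stable hash of the host, so every URL on that host gets the same index.
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_fetchers


def partition(urls, num_fetchers):
    """Split URLs into num_fetchers lists that never share a host."""
    fetchlists = [[] for _ in range(num_fetchers)]
    for url in urls:
        fetchlists[assign_fetchlist(url, num_fetchers)].append(url)
    return fetchlists


if __name__ == "__main__":
    sample = [
        "http://a.example/1",
        "http://a.example/2",
        "http://b.example/x",
        "http://c.example/y",
    ]
    for index, fetchlist in enumerate(partition(sample, 3)):
        print(index, fetchlist)
```

Because each host lives in exactly one fetchlist, a single task can enforce the per-host politeness delay without coordinating with the other tasks.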
If so, is the benefit
- resilience so that failed fetches can be re-started individually
- performance; or
Both of these. Multiple fetchlists can be fetched in parallel, and, if they crash, can be restarted. But if you use too many and don't have very many unique hosts, then each will be performance limited by politeness (if a task has URLs from only 2 hosts, and it waits a second between accesses to the same host, then it can fetch at most 2 pages/second).
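The politeness limit can be worked out with simple arithmetic. The sketch below is only an illustration: the function names are made up, it assumes hosts are spread as evenly as possible across tasks, and it ignores download time, so the figures are upper bounds rather than expected throughput.

```python
def max_fetch_rate(unique_hosts: int, delay_seconds: float) -> float:
    """Upper bound on pages/second for one task, limited only by politeness."""
    if delay_seconds <= 0:
        raise ValueError("delay_seconds must be positive")
    # Each host can be hit at most once per delay period.
    return unique_hosts / delay_seconds


def per_task_rates(total_hosts: int, num_tasks: int, delay_seconds: float):
    """Upper bound per task, assuming hosts are spread as evenly as possible."""
    base, extra = divmod(total_hosts, num_tasks)
    return [
        max_fetch_rate(base + (1 if i < extra else 0), delay_seconds)
        for i in range(num_tasks)
    ]


if __name__ == "__main__":
    # The case from the text: 2 hosts, 1 second between accesses.
    print(max_fetch_rate(2, 1.0))      # 2.0
    # 10 hosts split over 4 tasks: some tasks get 3 hosts, some get 2.
    print(per_task_rates(10, 4, 1.0))  # [3.0, 3.0, 2.0, 2.0]
```

Under these assumptions the total ceiling stays the same however the hosts are split (10 pages/second here), so adding fetchlists beyond the number of unique hosts cannot raise it.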
PS. For completeness, the following is my nutch-site.xml. mapred-site.xml is an exact copy of it.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<!-- Do not modify this file directly. Instead, copy entries that you -->
<!-- wish to modify from this file into nutch-site.xml and change them -->
<!-- there. If nutch-site.xml does not already exist, create it. -->
This comment is confusing in a nutch-site.xml...
<property>
  <name>mapred.map.tasks</name>
  <value>51</value>
  <description>The default number of map tasks per job. Typically set to a prime several times greater than number of available hosts. Ignored when mapred.job.tracker is "local".</description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>5</value>
  <description>The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local".</description>
</property>
These values should not be placed in nutch-site.xml, since that causes them to override job-specified values. They should instead be in mapred-default.xml, so that jobs can override them when they need to. For example, the generate task manipulates the number of reduce tasks in order to generate the appropriate number of input directories for fetching (as described above).
<property>
  <name>searcher.max.hits</name>
  <value>200</value>
  <description>If positive, search stops after this many hits are found. Setting this to small, positive values (e.g., 1000) can make searches much faster. With a sorted index, the quality of the hits suffers little.</description>
</property>
Make sure you're using a sorted indexer if you're using this. Otherwise your results could suffer greatly.
Doug
