Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "FAQ" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/FAQ?action=diff&rev1=135&rev2=136

Comment:
a

  URLs which are already in the database won't be injected.

  === Fetching ===
+
+ ==== Can I parse during the fetching process? ====
+ In short, yes; however, this is disabled by default (justification follows shortly). To enable it, configure the following in nutch-site.xml before starting the fetch process.
+ {{{
+ <property>
+   <name>fetcher.parse</name>
+   <value>true</value>
+   <description>If true, fetcher will parse content. Default is false, which means
+   that a separate parsing step is required after fetching is finished.</description>
+ </property>
+ }}}
+
+ '''N.B.''' In a parsing fetcher, outlinks are processed in the mapper (at least when outlinks are followed). If a fetcher's reducer stalls, you may run out of memory or disk space, usually after a very long reduce job. Behaviour similar to [[http://www.mail-archive.com/[email protected]/msg05031.html|this]] is usually observed in this situation.
+
+ In summary, users are advised '''not''' to use a parsing fetcher where they can avoid it, as it is heavy on IO and often leads to the above outcome.
+
  ==== Is it possible to fetch only pages from some specific domains? ====
  Please have a look at PrefixURLFilter. Adding some regular expressions to the regex-urlfilter.txt file might work, but a list with thousands of regular expressions would slow down your system excessively.
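
As a rough illustration of the PrefixURLFilter approach (not part of the change above): assuming the urlfilter-prefix plugin is enabled through plugin.includes and reads its default file, prefix-urlfilter.txt (named by the urlfilter.prefix.file property), a prefix list might look like the sketch below. The hosts are hypothetical placeholders.
{{{
# prefix-urlfilter.txt (sketch; example hosts are placeholders)
# One URL prefix per line. A URL is kept only if it starts with one of these.
http://www.example.org/
http://docs.example.org/
https://www.example.com/wiki/
}}}
Each URL is compared against plain string prefixes rather than being run through every regular expression in turn, which is why this tends to scale better to thousands of entries.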

