[Nutch Wiki] Trivial Update of "FAQ" by LewisJohnMcgibbney

Apache Wiki Wed, 06 Feb 2013 18:50:25 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "FAQ" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/FAQ?action=diff&rev1=135&rev2=136

Comment:
a

  URLs which are already in the database won't be injected.
  
  === Fetching ===
+ 
+ ==== Can I parse during the fetching process? ====
+ In short, yes; however, this is disabled by default (justification follows 
shortly). To enable it, configure the following in nutch-site.xml 
before initiating the fetch process.
+ {{{
+ <property>
+   <name>fetcher.parse</name>
+   <value>true</value>
+   <description>If true, fetcher will parse content. Default is false, which 
means
+   that a separate parsing step is required after fetching is 
finished.</description>
+ </property>
+ }}} 
+ 
+ '''N.B.''' In a parsing fetcher, outlinks are processed in the mapper (at 
least when outlinks are followed). If a fetcher's reducer stalls, you may run 
out of memory or disk space, usually after a very long reduce job. Behaviour 
similar to 
[[http://www.mail-archive.com/[email protected]/msg05031.html|this]] is 
usually observed in this situation. 
+ 
+ In summary, where possible, users are advised '''not''' to use a parsing 
fetcher, as it is heavy on I/O and often leads to the above outcome.
+  
  ==== Is it possible to fetch only pages from some specific domains? ====
  Please have a look at PrefixURLFilter. Adding some regular expressions to the 
regex-urlfilter.txt file might work, but adding a list with thousands of 
regular expressions would slow down your system excessively.
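  
  As a rough sketch of the PrefixURLFilter approach, the fragment below points the filter at a file of allowed URL prefixes. It assumes the urlfilter-prefix plugin is enabled through plugin.includes and that the property name and default file name match your Nutch release; the two example prefixes are purely hypothetical.
  {{{
<!-- Assumption: urlfilter-prefix is listed in plugin.includes for this crawl. -->
<property>
  <name>urlfilter.prefix.file</name>
  <value>prefix-urlfilter.txt</value>
  <description>Name of the file in the conf directory that lists the URL
  prefixes to accept, one prefix per line.</description>
</property>
<!-- Hypothetical contents of conf/prefix-urlfilter.txt:
     http://example.com/
     http://www.example.org/
-->
  }}}
  Each URL is checked against the list of plain prefixes rather than against every regular expression in turn, which is why this approach is suggested for large domain lists.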
