[Nutch Wiki] Trivial Update of "NonDefaultIntranetCrawlingOptions" by LewisJohnMcgibbney

Apache Wiki Wed, 27 Jul 2011 14:44:09 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "NonDefaultIntranetCrawlingOptions" page has been changed by 
LewisJohnMcgibbney:
http://wiki.apache.org/nutch/NonDefaultIntranetCrawlingOptions?action=diff&rev1=3&rev2=4

  ##language:en
  == Options for intranet crawling that are not enabled by default ==
- Here are some options you might want to add to your conf/nutch-site.xml configuration file if you plan on crawling your local network intranet that are not enabled by default.
+ Here are some options you might want to add to your conf/nutch-site.xml configuration file if you plan on crawling your local network intranet. Some plugins are not enabled by default even though they handle the types of data typically found on an enterprise intranet.
  
  === Enable additional parser plugins ===
  {{{
-       <property>
+ <property>
-               <name>plugin.includes</name>
+ <name>plugin.includes</name>
-               <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msexcel|mspowerpoint|msword|pdf|rss|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
+ <value>protocol-http|urlfilter-regex|parse-(html|tika|zip|js|swf|feed)|index-(basic|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
-       </property>
+ </property>
  }}} 
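To check which plugin names a given `plugin.includes` value would select, the following minimal sketch may help. It assumes the value is treated as a regular expression that must match a plugin directory name in full; the helper `enabled_plugins` and the `candidates` list are hypothetical names used only for illustration and are not part of Nutch.

```python
import re

# The new plugin.includes value taken from the diff above.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(html|tika|zip|js|swf|feed)|"
    "index-(basic|more)|scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def enabled_plugins(pattern, plugin_ids):
    """Return the plugin ids that the pattern matches in full (assumed semantics)."""
    compiled = re.compile(pattern)
    return [pid for pid in plugin_ids if compiled.fullmatch(pid)]


# Hypothetical sample of plugin directory names, for illustration only.
candidates = [
    "parse-html",
    "parse-tika",
    "parse-pdf",
    "parse-msword",
    "index-more",
    "query-basic",
]

print(enabled_plugins(PLUGIN_INCLUDES, candidates))
```

Under that assumption, names such as `parse-pdf` and `parse-msword` are no longer selected directly, which is consistent with those formats being handled through `parse-tika` in the new value.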
  
- This will enable the parser plugins for text, html, javascript, pdf, excel, powerpoint, word, pdf, rss and zip. There are additional parsers you can enable which are listed in conf/parse-plugins.xml. If you have additional document types you wish to parse and they are listed in the parse-plugins file, just add them to the list.
+ This will enable the parser plugins for HTML, ZIP, JavaScript, SWF and RSS/Atom feeds. Text, PDF, Excel, PowerPoint, Word and various other document formats are also parsed, by the Tika plugin (parse-tika). Additional parsers can be specified in conf/parse-plugins.xml. If you have additional document types you wish to parse and they are listed in the parse-plugins file, just add them to the list. For more information please see
