[Nutch Wiki] Update of "NonDefaultIntranetCrawlingOptions" by JasonKull

Apache Wiki Thu, 10 Jan 2008 17:05:52 -0800

The following page has been changed by JasonKull:
http://wiki.apache.org/nutch/NonDefaultIntranetCrawlingOptions

New page:
##language:en
== Options for intranet crawling that are not enabled by default ==
The following options are not enabled by default, but you may want to add them 
to your conf/nutch-site.xml configuration file if you plan to crawl your local 
intranet.

=== Enable additional parser plugins ===
{{{
        <property>
                <name>plugin.includes</name>
                <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msexcel|mspowerpoint|msword|rss|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
        </property>
}}} 

This enables the parser plugins for text, HTML, JavaScript, PDF, Excel, 
PowerPoint, Word, RSS and zip. Additional parsers you can enable are listed in 
conf/parse-plugins.xml. If you have other document types you wish to parse and 
a parser for them is listed in the parse-plugins file, add its name to the 
parse-(...) group in the value above.
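
As a rough sketch of adding one more parser, the property below appends swf to 
the parse-(...) group. It assumes a parse-swf plugin is present in your plugins 
directory and listed in conf/parse-plugins.xml; check your own installation 
before relying on that name. The rest of the value is unchanged.

{{{
        <property>
                <name>plugin.includes</name>
                <!-- "swf" is an assumed example; use a parser that exists in your install -->
                <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msexcel|mspowerpoint|msword|rss|zip|swf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
        </property>
}}}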



=== Increase the file size fetch limit ===
{{{
        <property>
                <name>http.content.limit</name>
                <value>2097152</value>
        </property>
}}} 

This raises the file size fetch limit to 2 megabytes (2097152 bytes, that is 
2 x 1024 x 1024). If your documents are larger, as PDFs often are, increase the 
value accordingly.
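
For illustration only, a limit of 10 megabytes works out to 10 x 1024 x 1024 = 
10485760 bytes. The 10 megabyte figure is an arbitrary example, not a 
recommendation; choose a value that fits your largest documents.

{{{
        <property>
                <name>http.content.limit</name>
                <!-- 10 x 1024 x 1024 bytes; example value only -->
                <value>10485760</value>
        </property>
}}}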
