Re: [Nutch-general] [Q] Crawling non-html contents ONLY

Doug Cutting Mon, 25 Oct 2004 10:42:57 -0700

Fooz wrote:

I'd like to crawl (and index) NON-html/htm content ONLY. However, if I put "html" and "htm" in the skip list in conf/crawl-urlfilter.txt, I won't get anything, since the crawler then has no link information to start from. Basically, my task is to crawl and search documents (pdf/doc/ps/etc.) but NOT web pages.

Can anyone give me a hand? Any input is appreciated!


You can use the plugin.excludes config parameter. Put something like the following in your nutch-site.xml:

<property>
  <name>plugin.excludes</name>
  <value>parse-html</value>
</property>

Doug
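
To confirm that an override like the one above is actually present in a config file, a small check along these lines may help. It is only a sketch in Python, not part of Nutch: the helper name get_property, the sample file contents, and the <nutch-conf> root element are assumptions made for illustration, so compare against the root element used in your own nutch-default.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical nutch-site.xml contents. The <nutch-conf> root element is an
# assumption; check the root element used by your own nutch-default.xml.
SAMPLE_SITE_XML = """\
<nutch-conf>
  <property>
    <name>plugin.excludes</name>
    <value>parse-html</value>
  </property>
</nutch-conf>
"""


def get_property(xml_text, wanted):
    """Return the <value> text of the named <property>, or None if absent."""
    root = ET.fromstring(xml_text)
    # iter() walks all descendants, so the root element name does not matter.
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == wanted:
            return prop.findtext("value", default="").strip()
    return None


if __name__ == "__main__":
    print(get_property(SAMPLE_SITE_XML, "plugin.excludes"))
```

To check a real file, read its text and pass it to get_property in place of the sample string.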

