Fooz wrote:
I'd like to crawl (and index) non-HTML content only. However, if I put "html" and "htm" in the skip list in conf/crawl-urlfilter.txt, I get nothing at all, since the crawler then has no pages to take link information from. Basically my task is to crawl and search documents (pdf/doc/ps/etc.) but not the web pages themselves.

Can anyone give me a hand? Any input is appreciated!

You can use the plugin.excludes config parameter:

Put something like the following in your nutch-site.xml:

<property>
  <name>plugin.excludes</name>
  <value>parse-html</value>
</property>

Doug
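
For context, here is a minimal sketch of how the property above might sit inside a complete nutch-site.xml. The root element name is an assumption: copy whatever root element your own conf/nutch-default.xml uses (older releases use nutch-conf, later ones use configuration). The comment about the value being a regular expression over plugin directory names is based on the property's description in nutch-default.xml, so check it against your version before relying on it.

```xml
<?xml version="1.0"?>
<!-- Assumption: root element copied from conf/nutch-default.xml.
     Replace nutch-conf with configuration if that is what your release uses. -->
<nutch-conf>

  <!-- Site-specific override: do not load the HTML parser plugin.
       The value is matched against plugin directory names. -->
  <property>
    <name>plugin.excludes</name>
    <value>parse-html</value>
  </property>

</nutch-conf>
```

Settings in nutch-site.xml override the defaults, so nothing else needs to be copied over from nutch-default.xml.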



