crawl-urlfilter.txt.template

Piotr Kosiorowski Mon, 08 Aug 2005 13:11:31 -0700

No problem for me. I have just run the test crawl on http://lucene.apache.org/nutch as described in the new tutorial, and a lot of pdf and png files were causing big exceptions and stack traces in the log. I thought that people (usually using nutch for the first time) might think that they did something wrong when they see such things in the log, and as these files are not parsed I simply switched them off. I will skip only png files if such a convention exists. But I think we should handle unsupported parse types with a friendlier message in future.


Regards,
Piotr



Doug Cutting wrote:

[EMAIL PROTECTED] wrote:
Skipping png and pdf files.
I think the undocumented convention has been to, by default, still fetch content types that are not parsed by default, but which may be parsed by simply enabling a plugin. That way folks need to only change one place in order to, e.g., start indexing pdfs. By this convention, the default prohibited formats should thus be those which have no parser plugin implementation.
A problem with this convention is that, by default, folks will fetch pdfs that they won't otherwise use, slowing fetching and consuming disk space. Ideally enabling a plugin would alter the url filters, but that might be tricky to implement well.
Doug
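
To make the convention discussed above concrete, here is a minimal sketch in Python of how a "+pattern / -pattern" rule list could behave. It is not Nutch's implementation: the rule contents, the function names (compile_rules, accepts), and the first-match-wins behaviour with rejection when nothing matches are all assumptions made for illustration. Under the convention, png (assumed here to have no parser plugin) is excluded, while pdf is still fetched because a plugin could be enabled later.

```python
import re

# Hypothetical rules in a "+regex / -regex" style. These are NOT the actual
# contents of crawl-urlfilter.txt.template; they only illustrate the convention.
RULES = [
    r"-\.(png|PNG)$",  # assumed: no parser plugin exists, so skip by default
    r"+.",             # everything else, including pdf, is still fetched
]


def compile_rules(lines):
    """Turn rule lines into (allow, compiled_regex) pairs, skipping blanks and comments."""
    compiled = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        compiled.append((sign == "+", re.compile(pattern)))
    return compiled


def accepts(url, compiled):
    """Assumed semantics: the first matching rule decides; no match means reject."""
    for allow, regex in compiled:
        if regex.search(url):
            return allow
    return False


rules = compile_rules(RULES)
for url in ("http://example.org/logo.png", "http://example.org/manual.pdf"):
    print(url, "fetch" if accepts(url, rules) else "skip")
```

Adding pdf to the first rule would give the stricter behaviour from the commit under discussion, at the cost of needing a second change when a pdf parser is enabled.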

Re: svn commit: r230867 - /lucene/nutch/trunk/conf/crawl-urlfilter.txt.template
