No problem for me. I just ran the test crawl on http://lucene.apache.org/nutch as described in the new tutorial, and a lot of pdf and png files caused big exceptions and stack traces in the log. I thought that people (usually using nutch for the first time) might think they had done something wrong when they saw such things in the log, and since these files are not parsed anyway, I simply switched them off. I will skip only png files if such a convention exists. But I think we should handle unsupported parse types with a friendlier message in the future.

Regards,
Piotr


Doug Cutting wrote:
[EMAIL PROTECTED] wrote:

Skipping png and pdf files.


I think the undocumented convention has been to, by default, still fetch content types that are not parsed by default, but which may be parsed by simply enabling a plugin. That way folks need to only change one place in order to, e.g., start indexing pdfs. By this convention, the default prohibited formats should thus be those which have no parser plugin implementation.

A problem with this convention is that, by default, folks will fetch pdfs that they won't otherwise use, slowing fetching and consuming disk space. Ideally enabling a plugin would alter the url filters, but that might be tricky to implement well.

Doug
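
For illustration only, the convention described above could be sketched as a suffix check: skip a URL only when its extension belongs to a format with no parser plugin at all, and keep fetching formats such as pdf that become usable once a plugin is enabled. This is not Nutch's actual URL filter code; the class name, the method name, and the contents of the suffix set are assumptions made up for this sketch.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

public class SuffixFilterSketch {
    // Assumption: formats for which no parser plugin exists at all.
    // pdf is deliberately absent, since enabling a plugin makes it parseable.
    static final Set<String> NO_PARSER_AVAILABLE = Set.of("png", "gif", "jpg");

    // Returns false only for URLs whose path ends in a suffix from the set above.
    static boolean shouldFetch(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return true; // opaque URI without a path: nothing to match on
        }
        int dot = path.lastIndexOf('.');
        int slash = path.lastIndexOf('/');
        if (dot < 0 || dot < slash) {
            return true; // last path segment has no extension
        }
        String ext = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return !NO_PARSER_AVAILABLE.contains(ext);
    }

    public static void main(String[] args) {
        System.out.println(shouldFetch("http://lucene.apache.org/nutch/logo.png"));
        System.out.println(shouldFetch("http://lucene.apache.org/nutch/tutorial.pdf"));
    }
}
```

With this split, starting to index pdfs would still need a change in only one place (enabling the plugin), at the cost Doug mentions: pdfs are fetched by default even when nothing parses them.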

