No problem for me. I have just run the test crawl on
http://lucene.apache.org/nutch as described in the new tutorial, and a lot
of pdf and png files caused large exceptions and stack traces in the
log. I thought that people (usually using nutch for the first time)
might think they had done something wrong when they saw such things in
the log, and since these files are not parsed anyway I simply switched
them off. I will skip only png files if such a convention exists.
But I think we should handle unsupported parse types with a
friendlier message in the future.
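
As a rough illustration of what a friendlier message could look like, here
is a minimal Python sketch. It is not Nutch's API: the PARSERS registry and
the parse_or_skip function are hypothetical names made up for this example.
The idea is only that a missing parser produces a one-line warning instead
of an exception and a stack trace.

```python
import logging

logger = logging.getLogger("parse_sketch")

# Hypothetical registry: content type -> parser function (not Nutch's API).
PARSERS = {
    "text/html": lambda content: content.strip(),
}


def parse_or_skip(url, content_type, content):
    """Return parsed text, or None with a one-line warning if no parser exists."""
    parser = PARSERS.get(content_type)
    if parser is None:
        # One short, readable line instead of a stack trace.
        logger.warning(
            "Skipping %s: no parser enabled for content type %s",
            url,
            content_type,
        )
        return None
    return parser(content)
```

With something like this, a first-time user would see a single "Skipping
..." line per unparsed file and could tell that nothing is broken.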
Regards,
Piotr
Doug Cutting wrote:
[EMAIL PROTECTED] wrote:
Skipping png and pdf files.
I think the undocumented convention has been to, by default, still fetch
content types that are not parsed by default, but which may be parsed by
simply enabling a plugin. That way folks need to change only one place
in order to, e.g., start indexing pdfs. By this convention, the default
prohibited formats should thus be those which have no parser plugin
implementation.
A problem with this convention is that, by default, folks will fetch
pdfs that they won't otherwise use, slowing fetching and consuming disk
space. Ideally enabling a plugin would alter the url filters, but that
might be tricky to implement well.
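
To make the two options concrete, here is a minimal Python sketch of
building the suffix filter either way. Everything in it is an assumption
for illustration: the suffix-to-plugin mapping, the list of unparsable
suffixes, and the function names are hypothetical and do not describe how
Nutch's url filters are actually implemented.

```python
import re

# Hypothetical mapping of file suffix -> parser plugin name (illustration only).
SUFFIX_TO_PLUGIN = {
    "html": "parse-html",
    "txt": "parse-text",
    "pdf": "parse-pdf",
}

# Suffixes assumed to have no parser plugin implementation at all.
UNPARSABLE_SUFFIXES = {"png", "gif", "jpg"}


def build_skip_pattern(enabled_plugins, follow_convention=True):
    """Build a regex matching URLs whose suffix should not be fetched.

    follow_convention=True: skip only suffixes with no parser plugin at all.
    follow_convention=False: also skip suffixes whose plugin is not enabled.
    """
    skipped = set(UNPARSABLE_SUFFIXES)
    if not follow_convention:
        skipped |= {
            suffix
            for suffix, plugin in SUFFIX_TO_PLUGIN.items()
            if plugin not in enabled_plugins
        }
    alternatives = "|".join(sorted(skipped))
    return re.compile(r"\.(" + alternatives + r")$", re.IGNORECASE)


def accepts(pattern, url):
    """Return True if the URL passes the filter (its suffix is not skipped)."""
    return pattern.search(url) is None
```

Under the convention, a pdf URL is still fetched even when no pdf parser is
enabled. With follow_convention=False the filter follows the enabled
plugins, so adding the pdf plugin to enabled_plugins is the only change
needed to start fetching pdfs.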
Doug