Hello, I'm crawling sites that serve MIME types I don't want to fetch, but the URLs themselves have no distinguishing pattern, so I can't use the regex URL filter to skip them. As far as I know, there is presently no way to filter fetched content by MIME type.
For example, how can I avoid fetching URLs like this one?

Error parsing: http://www2.tellus.no/tellus/db.dll?pi_35_2107_1: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www2.tellus.no/tellus/db.dll?pi_35_2107_1
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:337)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)

The same applies where video content is served without a file extension. It would be nice to avoid both these errors and the unnecessary downloads. (Using http.content.limit is not really an option for me, as there are large files I do want to fetch, e.g. PDFs.)

Is there a way to do this? If no solution is presently available, I'm thinking about extending the fetch process so that it filters fetch requests on MIME type or other metadata (length, charset, language, etc.), as well as on the content data itself. I can offer three solutions (though I'm new to Nutch internals, so I may be off the mark):

1. Write a new plugin type that provides metadata and content filtering. It would be called from the protocol plugins, which forward details about the content as they become available. The content filter inspects those details and decides whether the content should be fetched, or alters fetch characteristics such as the fetched length.

2. A short-circuit form of the above: the protocol plugins retrieve the MIME type and check whether a parser is registered for that type. If no parser is found, the content is not fetched.

3. Use an external proxy configured with rules about which files to fetch. A proxy would run on each slave machine, so the Nutch config would just point to the local proxy.

The first solution is more coding work, but would give users more control over what content is retrieved and what is ignored.
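To make the first proposal a bit more concrete, here is a rough sketch of what such a plugin extension point might look like. This is not existing Nutch code; the interface name, the Decision enum, and the metadata map are all made up for illustration:

```java
import java.util.Map;

// Hypothetical extension point for proposal 1: a content filter that
// protocol plugins consult as response metadata becomes available,
// before (or instead of) downloading the body.
interface ContentFilter {
    enum Decision { FETCH, SKIP, TRUNCATE }

    // Called once the response headers are known. The metadata map
    // would carry things like Content-Type, Content-Length, charset.
    Decision filter(String url, Map<String, String> metadata);
}

// Example implementation: skip anything advertised as an image.
class SkipImages implements ContentFilter {
    public Decision filter(String url, Map<String, String> metadata) {
        String type = metadata.getOrDefault("Content-Type", "");
        return type.startsWith("image/") ? Decision.SKIP : Decision.FETCH;
    }
}
```

A TRUNCATE decision could carry a length limit, which would address the http.content.limit problem by applying limits per content type rather than globally.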
As with URL filtering, content filtering plugins would allow a choice of schemes for describing what content is fetched and what is not. The second proposal is quick and dirty and meets the need here, but it may not evolve well. The third solution is an afterthought, and I'm not sure how appropriate it is, although bundling a proxy with Nutch might be generally useful and would leverage an existing solution that works reliably across protocols.

I'm a Nutch-guts newbie, so I hope someone can point me in the right direction as to the best approach. I'd be glad to implement any changes needed.

Thanks!

Matthew McGowan
Lead Engineer
Nynodata AS, http://www.nynodata.no
tlf: 35 06 15 80
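P.S. For the second proposal, the core decision is just "do we have a parser for this Content-Type?". A minimal standalone sketch of that check (the allow-list here is hard-coded; in Nutch it would come from the registered parser plugins, and the header value may carry a charset parameter that has to be stripped):

```java
import java.util.Arrays;
import java.util.List;

// Sketch of proposal 2: given the Content-Type advertised in the
// response headers (or from a HEAD request), decide whether to
// download the body at all. Not actual Nutch code.
public class ContentTypeCheck {
    // Assumed stand-in for the parser registry; "text/*" is a
    // wildcard meaning any text subtype.
    private final List<String> parseableTypes;

    public ContentTypeCheck(List<String> parseableTypes) {
        this.parseableTypes = parseableTypes;
    }

    public boolean shouldFetch(String contentType) {
        // Unknown type: fetch anyway and let the parser step decide.
        if (contentType == null) return true;
        // Strip parameters, e.g. "text/html; charset=utf-8" -> "text/html".
        String base = contentType.split(";")[0].trim().toLowerCase();
        for (String t : parseableTypes) {
            if (base.equals(t)) return true;
            if (t.endsWith("/*") && base.startsWith(t.substring(0, t.length() - 1))) return true;
        }
        return false;
    }
}
```

With {"text/*", "application/pdf"} registered, image/jpeg responses like the one in the stack trace above would be skipped before the body is downloaded.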