Hello,

I'm crawling sites that serve mime types I don't want to fetch, but the URLs
themselves have no distinguishing pattern, so I can't use the regex URL
filter to skip them. As far as I know, there is presently no way to filter
fetched content by mime type.

For example, how can I avoid fetching URLs like this one:

Error parsing: http://www2.tellus.no/tellus/db.dll?pi_35_2107_1:
org.apache.nutch.parse.ParseException: parser not found for
contentType=image/jpeg url=http://www2.tellus.no/tellus/db.dll?pi_35_2107_1
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:337)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)

The same applies where video content is served without a file extension. It
would be nice to avoid both the errors and the unnecessary downloads.
(Lowering http.content.limit is not really an option for me, as there are
large files I do want to fetch, e.g. PDFs.) Is there a way to do this?

If no solution is presently available, I'm thinking about extending the
fetch process so that it filters fetch requests on mime type or other
metadata (length, charset, language, etc.), as well as on the content data
itself.

I can offer three solutions (though I'm new to nutch internals, so I may be
off the mark):

1. Make a new plugin type that provides metadata and content filtering. The
protocol plugins would call it, forwarding details about the content as they
become available. The content filter inspects those details and decides
whether the content should be fetched, or alters fetch characteristics, such
as the fetched length.

2. A short-circuit form of the above, where the protocol plugins retrieve
the mime type and then check whether there is a registered parser for that
type. If no parser is found, the content is not fetched.

3. Use an external proxy and configure it with rules about which files to
fetch. A proxy would run on each slave machine, so the nutch config would
just point to the local proxy.

The first solution is more coding work, but would give users more control
over what content is retrieved and what is ignored. As with URL filtering,
content filtering plugins would allow a choice of schemes for describing
what content is fetched and what is not.
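
To make that concrete, here is a very rough sketch of the kind of extension
point I have in mind, modelled on the existing URLFilter interface. None of
these names exist in nutch today; they are purely illustrative:

    // Hypothetical extension point, analogous to org.apache.nutch.net.URLFilter.
    // Protocol plugins would call accept() once the response headers are
    // known, before the body is downloaded.
    public interface ContentFilter {
      /** Return false to abort the fetch for this URL. */
      boolean accept(String url, String contentType, long contentLength);
    }

    // An example implementation that skips images and video outright.
    public class MimeTypeContentFilter implements ContentFilter {
      public boolean accept(String url, String contentType, long contentLength) {
        if (contentType == null)
          return true;   // no type information yet; let the fetch proceed
        return !(contentType.startsWith("image/")
              || contentType.startsWith("video/"));
      }
    }

A richer version of accept() could also return adjusted fetch
characteristics (e.g. a per-type content limit) instead of a plain boolean.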

The second proposal is quick and dirty and meets the immediate need here,
but it may not evolve well.
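
The check itself could mirror what ParseUtil already does, just performed
before the body is downloaded rather than after. A minimal sketch (the exact
ParserFactory lookup API differs between nutch versions, so the method name
here is only indicative; the surrounding field and method are hypothetical):

    import org.apache.nutch.parse.ParserFactory;
    import org.apache.nutch.parse.ParserNotFound;

    // True if some parser plugin is registered for this mime type.
    // parserFactory would be built from the job Configuration, as
    // ParseUtil does today.
    private boolean hasParserFor(String contentType, String url) {
      try {
        parserFactory.getParsers(contentType, url);  // lookup only
        return true;
      } catch (ParserNotFound e) {
        return false;  // no parser registered: skip downloading the body
      }
    }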

The third solution is an afterthought, and I'm not sure how appropriate it
is. Then again, perhaps bundling a proxy with nutch would be generally
useful and would leverage an existing solution that works reliably across
protocols?
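
If we went that route, the nutch side would be trivial, since protocol-http
already honours the http.proxy.* properties; each slave's
conf/nutch-site.xml would just need something like the following (port 3128
assumes a local Squid, purely as an example; the actual filtering rules
would live in the proxy's own config):

    <!-- conf/nutch-site.xml on each slave -->
    <property>
      <name>http.proxy.host</name>
      <value>localhost</value>
    </property>
    <property>
      <name>http.proxy.port</name>
      <value>3128</value>
    </property>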

I'm a nutch guts newbie, so I hope someone can point me in the right
direction on the best approach! I'd be glad to implement any changes needed.

Thanks!



Matthew McGowan
Lead Engineer
Nynodata AS, http://www.nynodata.no
tlf:35 06 15 80