I am trying to use Nutch to crawl specific URLs like these:





(So I run the crawl with the "-depth 1" option.)

Some of these URLs return HTML pages, but others return files.

After running the crawl, I found this record in the log file:


060719 160036 fetch okay, but can't parse
http://server.domain/appname/get?id=36&view=content, reason:
failed(2,203): Content-Type not text/html: application/msword


It looks like the crawl infers a Content-Type from the URL and then
compares it with the received Content-Type.

Is it possible to use the received Content-Type without checking it
against the Content-Type implied by the URL?
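
To make clear what I am asking for, here is a minimal Python sketch of the
behaviour I would like. It is not Nutch code: the function names are
hypothetical, and it only illustrates preferring the Content-Type sent by the
server over a type guessed from the URL.

```python
import mimetypes


def url_implied_type(url):
    """Guess a MIME type from the URL alone (by extension); None if unknown."""
    guessed, _encoding = mimetypes.guess_type(url)
    return guessed


def choose_content_type(url, received_content_type):
    """Prefer the server's Content-Type header; fall back to the URL guess.

    Both function names are illustrative assumptions, not part of Nutch.
    """
    if received_content_type:
        # Drop parameters such as "; charset=utf-8" and normalise the case.
        return received_content_type.split(";", 1)[0].strip().lower()
    return url_implied_type(url)


if __name__ == "__main__":
    url = "http://server.domain/appname/get?id=36&view=content"
    # The URL has no file extension, so only the header says what the content is.
    print(choose_content_type(url, "application/msword"))
```

With this logic, the URL from my log would be handled as application/msword,
because the header is trusted and the URL is only used when no header is sent.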

How can I resolve this problem?


Thank you for your reply.

Milan Skuhra

