Have you enabled the msword parse plugin (parse-msword) in nutch-site.xml? You will have to enable that plugin for msword parsing to work.
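As a rough sketch, the override might look something like the property below, placed inside the root element of nutch-site.xml. The plugin list shown is only an assumed baseline, not necessarily what your installation ships with: copy the actual plugin.includes value from your own nutch-default.xml and add msword to the parse-(...) group.

```xml
<property>
  <name>plugin.includes</name>
  <!-- Assumed baseline list; take the real one from your nutch-default.xml.
       The only intended change is adding "msword" to the parse group. -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|msword)|index-basic|query-(basic|site|url)</value>
</property>
```

With the plugin enabled, documents served as application/msword should be handed to the msword parser instead of being rejected by the HTML parser.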
Cheers,
Jayant

On 7/20/06, SKUHRA, Milan <[EMAIL PROTECTED]> wrote:
Hello,

I am trying to use Nutch to fetch specific URLs like these:

http://server.domain/appname/get?id=34&view=content
http://server.domain/appname/get?id=35&view=content
http://server.domain/appname/get?id=36&view=content

(So I use the "-depth 1" option when running the crawl.)

Some of these URLs return HTML pages, but some return files. After running the crawl I found this record in the log file:

060719 160036 fetch okay, but can't parse http://server.domain/appname/get?id=36&view=content, reason: failed(2,203): Content-Type not text/html: application/msword

It looks like the crawl infers the Content-Type from the URL and then compares it with the received Content-Type. Is it possible to use the received Content-Type without checking it against the Content-Type implied by the URL? How can I resolve this problem?

Thank you for your reply.

Milan Skuhra
--
www.jkg.in | http://www.jkg.in/contact-me/
Jayant Kr. Gandhi
M.Tech. Computer Tech. Class of 2007, IIT Delhi