How to successfully crawl and index office 2007 documents in Nutch 1.0

Rupesh Mankar Fri, 04 Dec 2009 02:59:31 -0800

Hi,

I am new to Nutch. I want to crawl and search Office 2007 documents (.docx, 
.pptx, etc.) with Nutch, but when I try to crawl them, the crawler throws the 
following error:


fetching http://10.88.45.140:8081/tutorial/Office-2007-document.docx
Error parsing: http://10.88.45.140:8081/tutorial/Office-2007-document.docx: org.apache.nutch.parse.ParseException: parser not found for contentType=application/zip url=http://10.88.45.140:8081/tutorial/Office-2007-document.docx
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)

When I add the zip plugin to plugin.includes in nutch-site.xml, the crawl 
completes without errors, but searches return nothing from these documents.
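
As background on the error above: a .docx file is an Office Open XML package, 
which is a ZIP archive, and that is most likely why the fetcher reports 
contentType=application/zip. The following is only a minimal Java sketch of 
that point, not Nutch code and not a Nutch parse plugin. The class name 
DocxTextSketch and the method extractText are hypothetical, and the sketch 
assumes the body text sits in the conventional word/document.xml entry. It 
strips tags with a crude regular expression and does not decode XML entities.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, for illustration only: shows that a .docx is a ZIP
// archive whose text is stored as XML inside one of the entries.
public class DocxTextSketch {

    // Returns the text of word/document.xml with all tags removed,
    // or an empty string if the archive has no such entry.
    public static String extractText(InputStream docx) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Assumption: the main document part uses the conventional name.
                if ("word/document.xml".equals(entry.getName())) {
                    String xml = new String(readAll(zip), StandardCharsets.UTF_8);
                    // Crude extraction: a paragraph end becomes a space,
                    // then every remaining tag is dropped.
                    return xml.replaceAll("</w:p>", " ")
                              .replaceAll("<[^>]+>", "")
                              .trim();
                }
            }
        }
        return "";
    }

    // Reads the current ZIP entry to its end (read returns -1 at entry end).
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```

Calling DocxTextSketch.extractText on a FileInputStream for a .docx file 
should return its paragraph text separated by spaces. A parser that only 
unpacks the archive, without interpreting the XML inside it, would not produce 
this text for indexing.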

How can we successfully crawl and search the contents of Office 2007 documents?

Thanks,
Rupesh

