A .docx file needs a parser that understands its format. You can write a plugin to parse .docx files; the parse-html plugin and the other parse plugins are a good reference for how to do that.
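
To give an idea of what such a parser has to do: a .docx file is a zip package (which is presumably why the error reports contentType=application/zip), and the body text sits in the entry word/document.xml inside <w:t> elements. Below is a minimal sketch of the text extraction only, using just the JDK. It is not a Nutch plugin and does not use any Nutch API; the class name DocxTextSketch and the method extractText are made up for illustration, and the sketch ignores headers, footnotes, text boxes, and the other Office 2007 formats.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: pulls plain text out of a .docx stream.
public class DocxTextSketch {

    // Namespace of the WordprocessingML elements (w:p, w:t, ...).
    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the body text, one line per paragraph.
    public static String extractText(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return textOf(readAll(zip));
                }
            }
        }
        throw new IOException("word/document.xml not found; not a .docx package?");
    }

    // Copies the current zip entry into memory so the XML parser
    // cannot close the underlying zip stream.
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Concatenates the w:t text runs of every w:p paragraph.
    private static String textOf(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Crawled files are untrusted input, so refuse DOCTYPE declarations.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));

        StringBuilder text = new StringBuilder();
        NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            NodeList runs = ((Element) paragraphs.item(i)).getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < runs.getLength(); j++) {
                text.append(runs.item(j).getTextContent());
            }
            text.append('\n');
        }
        return text.toString();
    }
}
```

A real parse plugin would wrap something like extractText and hand the resulting text to Nutch in the same way parse-html does, so the plugin wiring is best copied from that plugin's source.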

2009/12/4 Rupesh Mankar:
> Hi,
>
> I am new to Nutch. I want to crawl and search office 2007 documents (.docx,
> .pptx etc) from Nutch. But when I try to crawl, crawler throws following
> error:
>
> fetching http://10.88.45.140:8081/tutorial/Office-2007-document.docx
> Error parsing: http://10.88.45.140:8081/tutorial/Office-2007-document.docx:
> org.apache.nutch.parse.ParseException: parser not found for
> contentType=application/zip url=
> http://10.88.45.140:8081/tutorial/Office-2007-document.docx
>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
>         at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
>         at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)
>
> When I add zip plugin in nutch-site.xml under plugin.includes, crawling
> becomes successful but nothing gets search.
>
> How can we successfully crawl and search contents of office 2007 documents?
>
> Thanks,
> Rupesh
