Re: How to successfully crawl and index office 2007 documents in Nutch 1.0

yangfeng Mon, 07 Dec 2009 03:06:03 -0800

docx files need a parser of their own. You can write a plugin to parse docx
files; the parse-html plugin and the other parse plugins are a good reference
for how to do that.
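
To give an idea of what such a plugin has to do: a .docx file is a ZIP
package, which is presumably why the fetcher reports
contentType=application/zip, and the body text sits in the entry
word/document.xml. Below is a minimal sketch of just the text extraction,
using only the JDK (Java 7 or later). It is not a Nutch plugin and does not
use the Nutch parser API; the class name DocxTextSketch and the method
extractText are made up for illustration, and the wiring into a parse plugin
is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: pulls plain text out of a .docx.
final class DocxTextSketch {

    // Namespace of the WordprocessingML elements (w:p, w:t, ...).
    private static final String WORD_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the document text, one line per paragraph (w:p). */
    static String extractText(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            // Walk the ZIP entries until the main document part is found.
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return textOf(readAll(zip));
                }
            }
        }
        throw new IOException("word/document.xml not found; not a .docx?");
    }

    // Copies the current ZIP entry into memory.
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Concatenates the w:t text runs of each w:p paragraph.
    // Simplification: paragraphs nested in text boxes are not special-cased.
    private static String textOf(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document doc = factory.newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml));

        StringBuilder text = new StringBuilder();
        NodeList paragraphs = doc.getElementsByTagNameNS(WORD_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element paragraph = (Element) paragraphs.item(i);
            NodeList runs = paragraph.getElementsByTagNameNS(WORD_NS, "t");
            for (int j = 0; j < runs.getLength(); j++) {
                text.append(runs.item(j).getTextContent());
            }
            text.append('\n');
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DocxTextSketch <file.docx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.print(extractText(in));
        }
    }
}
```

A real parse plugin would wrap something like extractText and hand the
resulting text to Nutch for indexing. The same idea should carry over to
.pptx, but the entry names and XML elements inside the package are different.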


2009/12/4 Rupesh Mankar <[email protected]>

> Hi,
>
> I am new to Nutch. I want to crawl and search Office 2007 documents (.docx,
> .pptx, etc.) with Nutch. But when I try to crawl, the crawler throws the
> following error:
>
> fetching http://10.88.45.140:8081/tutorial/Office-2007-document.docx
> Error parsing: http://10.88.45.140:8081/tutorial/Office-2007-document.docx:
> org.apache.nutch.parse.ParseException: parser not found for
> contentType=application/zip url=
> http://10.88.45.140:8081/tutorial/Office-2007-document.docx
>        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
>        at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
>        at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)
>
> When I add the zip plugin to plugin.includes in nutch-site.xml, the crawl
> succeeds, but searching returns no results.
>
> How can we successfully crawl and search contents of office 2007 documents?
>
> Thanks,
> Rupesh
>
