Hi,

There is a Tika plugin for Nutch in JIRA
(https://issues.apache.org/jira/browse/NUTCH-766). According to Tika's page,
support for Office 2007 was imminent in POI (which Tika uses internally).
The plan for Nutch is to progressively delegate parsing to Tika, and
NUTCH-766 has been implemented for this. I haven't checked whether Tika
currently supports Office 2007, but I suggest that you try parsing documents
in this format with Tika directly. If that works, you will get the same
support automatically via NUTCH-766.
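
Before testing with Tika, it may be worth ruling out that a failing file is
not really an Office 2007 document (for example, an old binary .doc saved
with a .docx extension). Office 2007 files are ZIP packages with a
[Content_Types].xml part at the root. A rough sketch of such a check, using
only the JDK and nothing from Nutch or Tika, could look like this. The class
name OoxmlSniffer is made up, and the part names it looks for are the ones
Office writes by default, which is an assumption rather than a guarantee:

```java
import java.io.File;
import java.io.IOException;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration; not part of Nutch, Tika or POI.
class OoxmlSniffer {

    // Returns "docx", "xlsx", "pptx", "ooxml-unknown" or "not-ooxml".
    static String detect(File file) {
        try (ZipFile zip = new ZipFile(file)) {
            // Every OOXML package carries this part at the root of the ZIP.
            if (zip.getEntry("[Content_Types].xml") == null) {
                return "not-ooxml";
            }
            // Assumption: the main part sits at the path Office uses by default.
            if (zip.getEntry("word/document.xml") != null) return "docx";
            if (zip.getEntry("xl/workbook.xml") != null) return "xlsx";
            if (zip.getEntry("ppt/presentation.xml") != null) return "pptx";
            return "ooxml-unknown";
        } catch (IOException e) {
            // Not a readable ZIP at all, e.g. a pre-2007 binary Office file.
            return "not-ooxml";
        }
    }

    public static void main(String[] args) {
        // Print the detected type for each file given on the command line.
        for (String name : args) {
            System.out.println(name + ": " + detect(new File(name)));
        }
    }
}
```

Files that pass this check are the ones to hand to Tika. With Tika on the
classpath, I would expect the actual test to be something along the lines of
new Tika().parseToString(file) using the org.apache.tika.Tika facade, but I
have not run that against Office 2007 files myself.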

Makes sense?

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

2009/12/14 Adilson Oliveira Cruz <[email protected]>

>  Hi all,
>
>  Has anyone successfully used Nutch to index Office 2007 documents? I know
> this question has already been asked, but considering the number of e-mails
> asking the same question, it looks like Nutch does not support Office 2007
> documents.
>
>  Best,
>
>  Adilson
>
> On Wed, Dec 9, 2009 at 2:27 PM, Joe Bell <[email protected]>
> wrote:
>
> > Hi,
> >
> >
> >
> > I'm also curious as to whether anyone has had success with Nutch and
> > parsing Office 2007 documents (.pptx, .xlsx, .docx) - I get the same
> > errors as seen here -
> > http://old.nabble.com/How-to-successfully-crawl-and-index-office-2007-documents-in-Nutch-1.0-td26640949.html#a26640949
> >
> >
> >
> >
> > Is a separate plugin required to parse these documents (i.e.,
> > parse-msexcel, parse-mspowerpoint, etc. will *not* work?)
> >
> >
> >
> > I noticed the comment on the above thread ("docx should be parsed; a
> > plugin can be used to parse docx files; you can get some help from the
> > parse-html plugin, and so on"), but didn't find it really helpful.
> >
> >
> >
> > Regards,
> >
> > Joe
> >
> >
> >
> >
>
