Hi,

 Thanks for the reply. I will try to use Tika with Nutch to parse the
documents. My current Nutch setup is working quite nicely and I don't want to
configure another Nutch instance.

 If I manage to get it working, I will write a mini how-to here.

 Best,

 Adilson

On Mon, Dec 14, 2009 at 10:00 AM, Julien Nioche <
[email protected]> wrote:

> Hi,
>
> There is a Tika plugin in JIRA (
> https://issues.apache.org/jira/browse/NUTCH-766). According to Tika's page,
> support for Office 2007 was imminent in POI (which Tika uses
> internally). The plan for Nutch is to progressively delegate parsing to
> Tika; NUTCH-766 has been implemented for this. I haven't checked whether
> Tika currently supports Office 2007, but I suggest that you try parsing docs
> in this format with Tika. If it works, then you'll get that automatically
> via NUTCH-766.
>
> Makes sense?
>
> Julien
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
> 2009/12/14 Adilson Oliveira Cruz <[email protected]>
>
> >  Hi all,
> >
> >  Has anyone successfully used Nutch to index Office 2007 documents? I know
> > that this question has already been asked, but considering the number of
> > e-mails asking the same question, it looks like Nutch does not support
> > Office 2007 documents.
> >
> >  Best,
> >
> >  Adilson
> >
> > On Wed, Dec 9, 2009 at 2:27 PM, Joe Bell <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > >
> > >
> > > I'm also curious as to whether anyone has had success with Nutch and
> > > parsing Office 2007 documents (.pptx, .xlsx, .docx) - I get the same
> > > errors as seen here -
> > >
> > > > http://old.nabble.com/How-to-successfully-crawl-and-index-office-2007-documents-in-Nutch-1.0-td26640949.html#a26640949
> > >
> > >
> > >
> > >
> > > Is a separate plugin required to parse these documents (i.e.,
> > > parse-msexcel, parse-mspowerpoint, etc. will *not* work?)
> > >
> > >
> > >
> > > > I noticed the comment on the above thread ("docx should be parsed, A
> > > > plugin can be used to Parsed docx file. you get some
> > > > help info from parse-html plugin and so on.") but didn't find it really
> > > > helpful.
> > >
> > >
> > >
> > > Regards,
> > >
> > > Joe
> > >
> > >
> > >
> > >
> > > This message is confidential to Prodea Systems, Inc unless otherwise
> > > indicated
> > > or apparent from its nature. This message is directed to the intended
> > > recipient
> > > only, who may be readily determined by the sender of this message and
> its
> > > contents. If the reader of this message is not the intended recipient,
> or
> > > an
> > > employee or agent responsible for delivering this message to the
> intended
> > > recipient:(a)any dissemination or copying of this message is strictly
> > > prohibited; and(b)immediately notify the sender by return message and
> > > destroy
> > > any copies of this message in any form(electronic, paper or otherwise)
> > that
> > > you
> > > have.The delivery of this message and its information is neither
> intended
> > > to be
> > > nor constitutes a disclosure or waiver of any trade secrets,
> intellectual
> > > property, attorney work product, or attorney-client communications. The
> > > authority of the individual sending this message to legally bind Prodea
> > > Systems
> > > is neither apparent nor implied,and must be independently verified.
> >
>
