Hi,

Thanks for the reply. I will try to use Tika with Nutch to parse the documents. My current Nutch setup is working quite nicely and I don't want to configure another Nutch instance.
If I manage to get it working, I will post a mini how-to here.

Best,

Adilson

On Mon, Dec 14, 2009 at 10:00 AM, Julien Nioche <[email protected]> wrote:

> Hi,
>
> There is a Tika plugin in JIRA
> (https://issues.apache.org/jira/browse/NUTCH-766). According to Tika's
> page, support for Office 2007 was imminent in POI (which Tika uses
> internally). The plan for Nutch is to progressively delegate parsing to
> Tika; NUTCH-766 has been implemented for this. I haven't checked whether
> Tika currently supports Office 2007, but I suggest that you try parsing
> docs in this format with Tika. If it works, then you'll get that
> automatically via NUTCH-766.
>
> Makes sense?
>
> Julien
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
> 2009/12/14 Adilson Oliveira Cruz <[email protected]>
>
> > Hi all,
> >
> > Has anyone successfully used Nutch to index Office 2007 documents? I
> > know that this question has already been asked, but considering the
> > number of e-mails asking the same question, it looks like Nutch does
> > not support Office 2007 documents.
> >
> > Best,
> >
> > Adilson
> >
> > On Wed, Dec 9, 2009 at 2:27 PM, Joe Bell <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I'm also curious as to whether anyone has had success with Nutch and
> > > parsing Office 2007 documents (.pptx, .xlsx, .docx) - I get the same
> > > errors as seen here:
> > >
> > > http://old.nabble.com/How-to-successfully-crawl-and-index-office-2007-documents-in-Nutch-1.0-td26640949.html#a26640949
> > >
> > > Is a separate plugin required to parse these documents (i.e.,
> > > parse-msexcel, parse-mspowerpoint, etc. will *not* work)?
> > >
> > > I noticed the comment on the above thread - "docx should be parsed,A
> > > plugin can be used to Parsed docx file. you get some help info from
> > > parse-html plugin and so on." - but didn't find it really helpful.
> > >
> > > Regards,
> > >
> > > Joe
