Hi,

There is a Tika plugin in JIRA (https://issues.apache.org/jira/browse/NUTCH-766). According to Tika's page, support for Office 2007 was imminent in POI (which Tika uses internally). The plan for Nutch is to progressively delegate parsing to Tika; NUTCH-766 has been implemented for this. I haven't checked whether Tika currently supports Office 2007, but I suggest that you try parsing documents in this format with Tika. If that works, you'll get the same support automatically via NUTCH-766.
Makes sense?

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com

2009/12/14 Adilson Oliveira Cruz <[email protected]>

> Hi all,
>
> Has anyone successfully used Nutch to index Office 2007 documents? I know
> that this question has already been asked, but considering the number of
> e-mails asking the same question, it looks like Nutch does not support
> Office 2007 documents.
>
> Best,
>
> Adilson
>
> On Wed, Dec 9, 2009 at 2:27 PM, Joe Bell <[email protected]> wrote:
>
> > Hi,
> >
> > I'm also curious as to whether anyone has had success with Nutch and
> > parsing Office 2007 documents (.pptx, .xlsx, .docx) - I get the same
> > errors as seen here -
> > http://old.nabble.com/How-to-successfully-crawl-and-index-office-2007-documents-in-Nutch-1.0-td26640949.html#a26640949
> >
> > Is a separate plugin required to parse these documents (i.e.,
> > parse-msexcel, parse-mspowerpoint, etc. will *not* work)?
> >
> > I noticed the comment on the above thread - "docx should be parsed,A
> > plugin can be used to Parsed docx file. you get some help info from
> > parse-html plugin and so on." - but didn't find it really helpful.
> >
> > Regards,
> >
> > Joe
