Hi all,

Has anyone successfully used Nutch to index Office 2007 documents? I know this question has already been asked, but considering the number of e-mails asking the same thing, it looks like Nutch does not support Office 2007 documents.
Best,
Adilson

On Wed, Dec 9, 2009 at 2:27 PM, Joe Bell <[email protected]> wrote:
> Hi,
>
> I'm also curious as to whether anyone has had success with Nutch and
> parsing Office 2007 documents (.pptx, .xlsx, .docx) - I get the same
> errors as seen here -
> http://old.nabble.com/How-to-successfully-crawl-and-index-office-2007-documents-in-Nutch-1.0-td26640949.html#a26640949
>
> Is a separate plugin required to parse these documents (i.e.,
> parse-msexcel, parse-mspowerpoint, etc. will *not* work?)
>
> I noticed the comment on the above thread - "docx should be parsed,A
> plugin can be used to Parsed docx file. you get some help info from
> parse-html plugin and so on." - but didn't find it really helpful.
>
> Regards,
>
> Joe
