Hi,

 

I'm also curious as to whether anyone has had success with Nutch and
parsing Office 2007 documents (.pptx, .xlsx, .docx) - I get the same
errors as seen here:
http://old.nabble.com/How-to-successfully-crawl-and-index-office-2007-documents-in-Nutch-1.0-td26640949.html#a26640949

 

Is a separate plugin required to parse these documents (i.e., will
parse-msexcel, parse-mspowerpoint, etc. *not* work)?

 

I noticed the comment on the above thread ("docx should be parsed, A
plugin can be used to Parsed docx file. you get some help info from
parse-html plugin and so on.") but didn't find it really helpful.

 

Regards,

Joe




