I have created a page at http://wiki.apache.org/nutch/TikaPlugin; feel free to use it for your how-to.

J.

2009/12/14 Julien Nioche <lists.digitalpeb...@gmail.com>

>> If I manage to put it to work I will write here a mini how-to.
>
> The Nutch Wiki would be the right place for doing that. It would be nice
> to have a page there listing the differences between the capabilities of
> the Tika plugin and the existing Nutch parsing plugins as there might be
> differences between them (support for Office 2007 being potentially one
> of them)
>
> Note that the Tika plugin is VERY beta
>
> Julien
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
> 2009/12/14 Adilson Oliveira Cruz <adilsonoc...@gmail.com>
>
>> Hi,
>>
>> Thanks for the reply. I will try to use Tika with Nutch to parse the
>> documents. My current Nutch setup is working quite nice and I don't
>> want to configure another Nutch instance.
>>
>> If I manage to put it to work I will write here a mini how-to.
>>
>> Best,
>>
>> Adilson
>>
>> On Mon, Dec 14, 2009 at 10:00 AM, Julien Nioche
>> <lists.digitalpeb...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> There is a Tika plugin in JIRA
>>> (https://issues.apache.org/jira/browse/NUTCH-766). According to Tika's
>>> page the support for the Office 2007 was imminent in POI (which Tika
>>> uses internally). The plan for Nutch is to progressively delegate the
>>> parsing to Tika; Nutch-766 has been implemented for this. I haven't
>>> checked whether Tika currently supports Office 2007 but I suggest that
>>> you try parsing docs at this format with Tika, if it does work then
>>> you'll get that automatically via Nutch-766
>>>
>>> Makes sense?
>>>
>>> Julien
>>>
>>> --
>>> DigitalPebble Ltd
>>> http://www.digitalpebble.com
>>>
>>> 2009/12/14 Adilson Oliveira Cruz <adilsonoc...@gmail.com>
>>>
>>>> Hi all,
>>>>
>>>> Anyone successfully used nutch to index Office 2007 documents? I know
>>>> that this question has already been asked, but considering the number
>>>> of e-mails asking the same question, looks like that Nutch does not
>>>> support Office 2007 documents.
>>>>
>>>> Best,
>>>>
>>>> Adilson
>>>>
>>>> On Wed, Dec 9, 2009 at 2:27 PM, Joe Bell <joe.b...@prodeasystems.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm also curious as to whether anyone has had success with Nutch and
>>>>> parsing Office 2007 documents (.pptx, .xlsx, .docx) - I get the same
>>>>> errors as seen here -
>>>>> http://old.nabble.com/How-to-successfully-crawl-and-index-office-2007-documents-in-Nutch-1.0-td26640949.html#a26640949
>>>>>
>>>>> Is a separate plugin required to parse these documents (i.e.,
>>>>> parse-msexcel, parse-mspowerpoint, etc. will *not* work?)
>>>>>
>>>>> I noticed the comment on the above thread - "docx should be parsed,A
>>>>> plugin can be used to Parsed docx file. you get some help info from
>>>>> parse-html plugin and so on." - but didn't find it really helpful.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Joe

--
DigitalPebble Ltd
http://www.digitalpebble.com
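
Note on triaging files before trying the Tika plugin: Office 2007 documents (.docx, .xlsx, .pptx) are ZIP packages rather than the older binary Office format, so it can help to confirm which kind a given file really is, whatever its extension says. The following is a minimal sketch in Python using only the standard library. The function name `ooxml_kind` and the `MAIN_PARTS` table are illustrative assumptions, not part of Nutch or Tika, and the part names are the conventional ones written by Office rather than a guarantee. The check says nothing about whether Tika or NUTCH-766 can actually parse the file.

```python
import io
import zipfile

# Conventional main-part names written by Office 2007 applications.
# Assumption: the file follows these conventions; the package's own
# relationship parts are the authoritative source.
MAIN_PARTS = {
    "word/document.xml": "docx",
    "xl/workbook.xml": "xlsx",
    "ppt/presentation.xml": "pptx",
}


def ooxml_kind(source):
    """Return 'docx', 'xlsx', 'pptx', 'ooxml' (kind unknown) or None.

    `source` is a file path or a seekable binary file-like object.
    """
    # Older binary Office files are not ZIP archives, so they stop here.
    if not zipfile.is_zipfile(source):
        return None
    with zipfile.ZipFile(source) as package:
        names = set(package.namelist())
    # Every Office 2007 package carries this content-type manifest.
    if "[Content_Types].xml" not in names:
        return None  # a ZIP archive, but not an Office 2007 package
    for part, kind in MAIN_PARTS.items():
        if part in names:
            return kind
    return "ooxml"
```

Calling `ooxml_kind("some-file.docx")` on a crawled file (the path here is hypothetical) gives a quick indication of whether the file is an Office 2007 package at all, which is worth knowing before blaming a parse failure on the plugin.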