I have created a page at http://wiki.apache.org/nutch/TikaPlugin; feel free
to use it for your how-to.

J.

2009/12/14 Julien Nioche <lists.digitalpeb...@gmail.com>

>> If I manage to get it working, I will write a mini how-to here.
>
> The Nutch Wiki would be the right place for that. It would be nice to
> have a page there listing the differences in capabilities between the
> Tika plugin and the existing Nutch parsing plugins (support for Office
> 2007 being potentially one of them).
>
> Note that the Tika plugin is VERY beta.
>
> Julien
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
> 2009/12/14 Adilson Oliveira Cruz <adilsonoc...@gmail.com>
>
>>  Hi,
>>
>>  Thanks for the reply. I will try to use Tika with Nutch to parse the
>> documents. My current Nutch setup is working quite nicely and I don't
>> want to configure another Nutch instance.
>>
>>  If I manage to get it working, I will write a mini how-to here.
>>
>>  Best,
>>
>>  Adilson
>>
>> On Mon, Dec 14, 2009 at 10:00 AM, Julien Nioche <
>> lists.digitalpeb...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > There is a Tika plugin in JIRA
>> > (https://issues.apache.org/jira/browse/NUTCH-766). According to Tika's
>> > page, support for Office 2007 was imminent in POI (which Tika uses
>> > internally). The plan for Nutch is to progressively delegate parsing to
>> > Tika; NUTCH-766 has been implemented for this. I haven't checked whether
>> > Tika currently supports Office 2007, but I suggest that you try parsing
>> > docs in this format with Tika. If it works, you'll get that
>> > automatically via NUTCH-766.
>> >
>> > Makes sense?
>> >
>> > Julien
>> >
>> > --
>> > DigitalPebble Ltd
>> > http://www.digitalpebble.com
>> >
>> > 2009/12/14 Adilson Oliveira Cruz <adilsonoc...@gmail.com>
>> >
>> > >  Hi all,
>> > >
>> > >  Has anyone successfully used Nutch to index Office 2007 documents? I
>> > > know this question has already been asked, but considering the number
>> > > of e-mails asking the same question, it looks like Nutch does not
>> > > support Office 2007 documents.
>> > >
>> > >  Best,
>> > >
>> > >  Adilson
>> > >
>> > > On Wed, Dec 9, 2009 at 2:27 PM, Joe Bell <joe.b...@prodeasystems.com>
>> > > wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > >
>> > > >
>> > > > I'm also curious as to whether anyone has had success with Nutch and
>> > > > parsing Office 2007 documents (.pptx, .xlsx, .docx). I get the same
>> > > > errors as seen here:
>> > > >
>> > > > http://old.nabble.com/How-to-successfully-crawl-and-index-office-2007-documents-in-Nutch-1.0-td26640949.html#a26640949
>> > > >
>> > > > Is a separate plugin required to parse these documents (i.e., will
>> > > > parse-msexcel, parse-mspowerpoint, etc. *not* work)?
>> > > >
>> > > >
>> > > >
>> > > > I noticed the comment on the above thread ("docx should be parsed,
>> > > > A plugin can be used to Parsed docx file. you get some help info
>> > > > from parse-html plugin and so on.") but didn't find it really
>> > > > helpful.
>> > > >
>> > > >
>> > > >
>> > > > Regards,
>> > > >
>> > > > Joe
>> > >
>> >
>>
>
>
>
>
>
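
For anyone wondering why the older parse-ms* plugins are unlikely to cope
with Office 2007 files: .docx, .xlsx and .pptx are ZIP packages containing
a [Content_Types].xml entry, whereas the older .doc/.xls/.ppt files use
the binary OLE2 container. Below is a minimal sketch in plain Python (it
uses neither Nutch nor Tika) that tells the two apart from the leading
bytes. The function name sniff_office_container and its return labels are
made up for this illustration.

```python
import io
import sys
import zipfile

# Leading bytes of the legacy binary Office container (OLE2 compound file).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Leading bytes of a ZIP local file header; Office 2007 files are ZIP packages.
ZIP_MAGIC = b"PK\x03\x04"


def sniff_office_container(data: bytes) -> str:
    """Return 'ole2', 'ooxml', 'zip' or 'unknown' (hypothetical labels)."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if data.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                # Office 2007 packages carry a content-types manifest.
                if "[Content_Types].xml" in zf.namelist():
                    return "ooxml"
        except zipfile.BadZipFile:
            return "unknown"
        return "zip"
    return "unknown"


if __name__ == "__main__":
    # Report the container type of each file named on the command line.
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            print(path, sniff_office_container(fh.read()))
```

A file reported as "ooxml" is one you would want to try with Tika rather
than with the legacy plugins; whether Tika actually parses it still has to
be tested, as suggested in the quoted message above.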


-- 
DigitalPebble Ltd
http://www.digitalpebble.com
