Hi,

I ran into the classloader issue that you mentioned, but got everything to
work by copying the class TikaConfiguration into the package used by my
plugin. The tika-core jar goes into the main /lib dir of Nutch, while the
tika-parsers jar goes into the lib dir of the plugin. I now have a first
version of the Tika plugin which does some very basic text and metadata
extraction.
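
For anyone following along, the visibility problem behind this can be
reproduced with plain JDK classes. The sketch below is only an illustration
and is not Nutch's plugin classloader: the class name, the temp directory
standing in for a plugin lib dir, and the resource name are all made up. It
shows that a resource reachable only through a child loader is invisible to
code that asks the parent loader.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

public class ClassLoaderVisibilitySketch {

    /**
     * Returns {visibleFromParent, visibleFromChild} for a resource that
     * only the child loader has on its search path.
     */
    public static boolean[] check() throws Exception {
        // Hypothetical stand-in for a plugin's lib dir (assumption, not Nutch's layout).
        Path pluginDir = Files.createTempDirectory("plugin-lib");
        String resource = "tika-plugin-demo-extension.properties";
        Files.write(pluginDir.resolve(resource), "parser=demo".getBytes("UTF-8"));

        // The parent plays the role of the loader that serves the main /lib dir.
        ClassLoader parent = ClassLoaderVisibilitySketch.class.getClassLoader();

        // The child delegates to the parent first, then searches pluginDir.
        try (URLClassLoader child =
                 new URLClassLoader(new URL[] { pluginDir.toUri().toURL() }, parent)) {
            boolean fromParent = parent.getResource(resource) != null;
            boolean fromChild = child.getResource(resource) != null;
            return new boolean[] { fromParent, fromChild };
        }
    }

    public static void main(String[] args) throws Exception {
        boolean[] r = check();
        System.out.println("visible from parent loader: " + r[0]);
        System.out.println("visible from child loader:  " + r[1]);
    }
}
```

Delegation only goes from child to parent, which is why code loaded from the
main lib dir cannot discover anything that lives only in a plugin's lib dir.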

What shall we do about the HTMLParseFilters? Get the generic TikaParser to
create a DOM representation and pass it to the HTMLParseFilters as it is
done now? Modify the HTMLParseFilters so that they use SAX events so that we
can forward them from Tika? Any other suggestions?
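
To make the first option concrete, here is a minimal sketch of turning a
stream of SAX events into a DOM using only the JDK. It is not the plugin
code: the class and method names are made up, and the JDK's own SAX parser
stands in for Tika as the event source. Since Tika parsers write to a SAX
ContentHandler, the same handler could in principle be handed to Tika
instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SaxToDomSketch {

    /** Collects the SAX events produced for the given markup into a DOM Document. */
    public static Document saxToDom(String xhtml) throws Exception {
        // Empty document that will receive the nodes.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();

        // An identity TransformerHandler is a ContentHandler that copies
        // every SAX event it receives into the configured Result.
        SAXTransformerFactory tf =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = tf.newTransformerHandler();
        handler.setResult(new DOMResult(doc));

        // Stand-in event source (assumption): the JDK SAX parser.
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = saxToDom(
                "<html><head><title>t</title></head><body><p>hello</p></body></html>");
        System.out.println(doc.getDocumentElement().getNodeName());
        System.out.println(doc.getElementsByTagName("p").item(0).getTextContent());
    }
}
```

With something like this the existing filters would keep receiving a DOM,
at the cost of building the whole tree in memory for every document.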

J.


2009/11/12 Kirby Bohling <kirby.bohl...@gmail.com>

> You'll need to be careful of the classloader issues if you do that...
>
> The core Nutch code needs just the mime type stuff, but if you access
> Tika from the lib directory rather than from the plugins/lib
> directory, it won't be able to find any extensions.  I've used Tika to
> implement a docx plugin, and came across all these problems.
>
> Kirby
>
>
> On Thu, Nov 12, 2009 at 8:41 AM, Julien Nioche
> <lists.digitalpeb...@gmail.com> wrote:
> > Speaking of which, I'm planning to do some work on the Tika integration
> > within the next week or so. Basically, I'll create a new plugin which will
> > be used for the mime types that Tika can already handle while keeping some
> > of the existing plugins for the more complex cases. This should allow us to
> > already have a first version of the Tika integration without losing any of
> > the functionality. Will update the list as soon as I have something working
> > and will create a JIRA.
> >
> > J.
> > --
> > DigitalPebble Ltd
> > http://www.digitalpebble.com
> >
> > 2009/11/10 Andrzej Bialecki <a...@getopt.org>
> >>
> >> BrunoWL wrote:
> >>>
> >>> Hi. I'm a beginner in Nutch. Can anybody tell me how to make Nutch use
> >>> parsers from Tika?
> >>> I did all kinds of searches and didn't find an answer.
> >>
> >> Tika parsers are not integrated yet with Nutch - we use our own parsers,
> >> and in most cases they are of similar quality as those in Tika (since most
> >> Tika parsers originated in Nutch). Tight Tika integration is on the roadmap.
> >>
> >> --
> >> Best regards,
> >> Andrzej Bialecki     <><
> >>  ___. ___ ___ ___ _ _   __________________________________
> >> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> >> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> >> http://www.sigram.com  Contact: info at sigram dot com
> >
>



-- 
DigitalPebble Ltd
http://www.digitalpebble.com
