Hi,

I have some questions about the dependencies of the Parser interface,
especially from the perspective of generalizing it to the potential
Tika project. The current dependencies are:

   * Configurable - depends on the Hadoop configuration system
   * Pluggable - depends on the Nutch plugin system
   * Content - depends on the Nutch protocol model
   * Parse - depends on the Nutch index content model

I notice that Nutch uses a custom plugin and configuration system. Is
there a technical reason for having your own instead of using some of
the existing IoC and other component frameworks? If we are to make the
Parser components easily usable outside Nutch, we'd need to either
remove those dependencies or include (either directly or by reference)
the plugin/configuration system in Tika. I'd personally prefer to
remove those dependencies in favor of more IoC-friendly JavaBean
conventions, but I'm not familiar with the background of the Parser
components.
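
To illustrate what I mean by JavaBean conventions, here is a minimal
sketch. The class name and its two properties are hypothetical and not
taken from Nutch; the point is only that a component configured through
plain setters can be wired by any IoC container, or by hand, without a
dedicated plugin or configuration system.

```java
import java.nio.charset.Charset;

// Hypothetical parser component (not an existing Nutch or Tika class).
// All configuration goes through plain JavaBean getters and setters.
class PlainTextParserBean {

    private String defaultEncoding = "UTF-8";
    private int maxLength = -1; // negative means "no limit"

    public String getDefaultEncoding() {
        return defaultEncoding;
    }

    public void setDefaultEncoding(String defaultEncoding) {
        this.defaultEncoding = defaultEncoding;
    }

    public int getMaxLength() {
        return maxLength;
    }

    public void setMaxLength(int maxLength) {
        this.maxLength = maxLength;
    }

    // Decodes the bytes with the configured encoding and truncates the
    // result to maxLength characters when a limit has been set.
    public String parse(byte[] data) {
        String text = new String(data, Charset.forName(defaultEncoding));
        if (maxLength >= 0 && text.length() > maxLength) {
            return text.substring(0, maxLength);
        }
        return text;
    }
}
```

A container (or a Nutch-specific wrapper) would call the setters from
whatever configuration source it uses, so the component itself never
needs to know where the values come from.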

The Parser interface is also bound to the ideas of fetching content
from the network and indexing it using a standard content model
through the Content and Parse dependencies. For the Tika project I'd
like to look for ways to generalize this, as neither of these ideas
applies, for example, to the needs of the Apache Jackrabbit project. My
TextExtractor proposal avoids these dependencies by using just a
binary stream, a content type and an optional character encoding to
produce a single text stream, but that approach fails to support more
structured index content models. I'm trying to find a solution that
combines the best parts of both approaches.
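
As a rough sketch of the shape described above (binary stream, content
type and optional encoding in, a single text stream out), the interface
could look something like the following. The method signature and the
PlainTextExtractor class are illustrative assumptions, not a final
proposal.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Assumed shape of the extractor: no dependency on a protocol model
// or an index content model, only on java.io types.
interface TextExtractor {
    // encoding may be null when the caller does not know it
    Reader extractText(InputStream stream, String contentType, String encoding)
            throws IOException;
}

// Trivial illustrative implementation for text/plain content.
class PlainTextExtractor implements TextExtractor {
    public Reader extractText(InputStream stream, String contentType, String encoding)
            throws IOException {
        if (!"text/plain".equals(contentType)) {
            throw new IOException("Unsupported content type: " + contentType);
        }
        // Fall back to the platform default when no encoding is given.
        Charset charset = (encoding != null)
                ? Charset.forName(encoding)
                : Charset.defaultCharset();
        return new InputStreamReader(stream, charset);
    }
}
```

Note that the return value is just a character stream, which is exactly
why this shape cannot carry titles, outlinks or other structured
metadata.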

Ideally I'd like to see a parser implementation in Tika that avoids
the Nutch dependencies but can still be used in Nutch without changing
any of the existing code or configuration files. Something like a
TikaParser adapter class might be needed to achieve that.
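
A minimal sketch of how such an adapter might work is below. To keep it
self-contained, FetchedContent, ParsedText and CrawlerParser are
simplified stand-ins that I made up for the Nutch-side types; they are
not the real Nutch Content, Parse and Parser classes, and the
TextExtractor interface is likewise an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.UncheckedIOException;

// Assumed dependency-free extractor interface (hypothetical).
interface TextExtractor {
    Reader extractText(InputStream stream, String contentType, String encoding)
            throws IOException;
}

// Stand-in for the crawler's fetched content (NOT the real Nutch class).
class FetchedContent {
    final byte[] data;
    final String contentType;

    FetchedContent(byte[] data, String contentType) {
        this.data = data;
        this.contentType = contentType;
    }
}

// Stand-in for the crawler's parse result (NOT the real Nutch class).
class ParsedText {
    final String text;

    ParsedText(String text) {
        this.text = text;
    }
}

// Stand-in for the crawler-side parser interface.
interface CrawlerParser {
    ParsedText getParse(FetchedContent content);
}

// The adapter: implements the crawler-side interface and delegates the
// actual work to an extractor that knows nothing about the crawler.
class TikaParserAdapter implements CrawlerParser {
    private final TextExtractor extractor;

    TikaParserAdapter(TextExtractor extractor) {
        this.extractor = extractor;
    }

    public ParsedText getParse(FetchedContent content) {
        try (Reader reader = extractor.extractText(
                new ByteArrayInputStream(content.data), content.contentType, null)) {
            // Drain the character stream into a single string.
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
            return new ParsedText(text.toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this arrangement only the adapter would need to live on the Nutch
side; the extractor itself stays free of the plugin, configuration,
protocol and index dependencies.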

BR,

Jukka Zitting

-- 
Yukatan - http://yukatan.fi/ - [EMAIL PROTECTED]
Software craftsmanship, JCR consulting, and Java development

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
