Jukka Zitting wrote:
> The Parser interface is also bound to the ideas of fetching content
> from the network and indexing it using a standard content model
> through the Content and Parse dependencies. For the Tika project I'd
> like to look for ways to generalize this, as neither of these ideas
> apply for example to the needs of the Apache Jackrabbit project. My
> TextExtractor proposal avoids these dependencies by using just a
> binary stream, a content type and an optional character encoding to
> produce a single text stream, but that approach fails to support more
> structured index content models. I'm trying to find a solution that
> combines the best parts of both approaches.
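
For readers following along, the TextExtractor idea described in the quote above (a binary stream, a content type and an optional character encoding in, a single text stream out) can be sketched roughly as below. This is a hypothetical illustration only, not the code of the actual proposal: the names TextExtractorSketch and PlainTextExtractorSketch, the method signature, and the UTF-8 fallback are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical sketch of a TextExtractor-style contract: no fetching,
 * no Content or Parse objects, just bytes in and one text stream out.
 */
interface TextExtractorSketch {
    Reader extractText(InputStream stream, String contentType, String encoding)
            throws IOException;
}

/** Illustrative implementation that only understands text/plain. */
class PlainTextExtractorSketch implements TextExtractorSketch {
    @Override
    public Reader extractText(InputStream stream, String contentType, String encoding)
            throws IOException {
        if (!"text/plain".equals(contentType)) {
            // Unsupported type: return an empty text stream.
            return new StringReader("");
        }
        // The encoding is optional; falling back to UTF-8 is an assumption.
        Charset charset = (encoding != null)
                ? Charset.forName(encoding)
                : StandardCharsets.UTF_8;
        return new InputStreamReader(stream, charset);
    }
}
```

Note that the only output is a Reader, so there is nowhere to put document properties or multiple index fields. That is the limitation the quote refers to when it says the approach fails to support more structured content models.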

A very important aspect of the Parser interface (or actually, of the Parse and Content classes) is that each may contain arbitrary metadata. This is required for discovering and passing around both the original metadata (such as protocol headers, document properties, etc.) and other secondary content (such as data from external sources, or derived metadata). Simply returning a String doesn't cut it. Returning a java.util.Map may be an option if you use standard Metadata constants as keys; still, Nutch would have to repackage this into a Writable anyway. And we would lose a nice property of the current Metadata class, which is the ability to tolerate minor syntax variations and to store multiple values per key.

> Ideally I'd like to see a parser implementation in Tika that avoids
> the Nutch dependencies but can still be used in Nutch without changing
> any of the existing code or configuration files. Something like a
> TikaParser adapter class might be needed to achieve that.

It seems to me that such an adapter is unavoidable. Most probably similar adapters would have to be used for all other dependencies (Configurable etc.). The big question is how to minimize the intermediate object creation, and to come up with interfaces that are robust enough to support all current use cases in Nutch, but at the same time don't introduce too many layers of delegation...

-- 
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
