Hey Andrzej,

We're having the same sorts of discussions in Tika-ville right now. Check out this page on the Tika wiki:
http://wiki.apache.org/tika/MetadataDiscussion

Comments and thoughts welcome. Depending on what comes out of Tika, we may be able to leverage it here...

Cheers,
Chris


On 7/21/10 5:47 PM, "Andrzej Bialecki" <[email protected]> wrote:

Hi,

I noticed that nutchbase doesn't use the multi-valued ParseResult; instead, all parse plugins return a simple Parse. As a consequence, it's not possible to return multiple values from parsing a single WebPage, something that parsers for compound documents absolutely require (archives, RSS, mbox, etc.) - see the first sketch below for what the multi-valued usage looks like. Dogacan - was there a particular reason for this change?

A broader issue, however, is how to treat compound documents and the links to/from them:

a) Record all URLs of child documents (e.g. with the !/ notation, or the # notation), and create as many WebPage-s as there are archive members. This needs some hacks to prevent such URLs from being scheduled for fetching.

b) Extend WebPage to allow for multiple content sections and their names (and metadata, and ... yuck).

c) Like a), except put a special "synthetic" mark on the page to prevent it from being selected for generation and fetching. This mark would also help us update / remove obsolete sub-documents when their container changes. (The second sketch below shows one way such a mark could work.)

I'm leaning towards c).

Now, when it comes to ParseResult ... it's not an ideal solution either, because it means we have to keep all sub-document results in memory. We could avoid that by implementing something that Aperture uses: a SubCrawler, i.e. a parser plugin for compound formats. The main plugin would return a special result code that basically says "this is a compound format of type X", and the caller (ParseUtil?) would then use SubCrawlerFactory.create(typeX, containerDataStream) to create a parser for the container. That parser would simply extract the sections of the compound document (as streams) and pass each stream to the regular parsing chain. The caller then iterates over the results returned from the SubCrawler. (The third sketch below spells out this flow.)
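First, for concreteness, here's roughly how a compound-format parser uses the multi-valued ParseResult in trunk - the ParseResult / ParseData / ParseText calls follow the 1.x API as I remember it, while parseArchive and the memberTexts map are made up for illustration:

import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.ParseText;

public class CompoundParseSketch {

  /** One entry for the container itself, plus one per extracted member. */
  public static ParseResult parseArchive(String containerUrl,
      Parse containerParse, Map<String, String> memberTexts) {
    ParseResult result =
        ParseResult.createParseResult(containerUrl, containerParse);
    for (Map.Entry<String, String> m : memberTexts.entrySet()) {
      // Child URL in the !/ notation from option a)
      String subUrl = containerUrl + "!/" + m.getKey();
      ParseData data = new ParseData(ParseStatus.STATUS_SUCCESS,
          m.getKey(), new Outlink[0], new Metadata());
      result.put(new Text(subUrl), new ParseText(m.getValue()), data);
    }
    return result; // downstream code iterates the (Text url, Parse) pairs
  }
}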
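Second, a minimal sketch of the "synthetic" mark from option c), assuming the generated WebPage class exposes the usual putToMetadata / getFromMetadata accessors for its metadata map - the key name and both helpers are made up:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import org.apache.avro.util.Utf8;
import org.apache.nutch.storage.WebPage;

public class SyntheticMarkSketch {

  // Made-up metadata key; any reserved name would do.
  private static final Utf8 SYNTHETIC = new Utf8("__synthetic__");

  /** Tag a child page with the URL of its container. */
  public static void markSynthetic(WebPage page, String containerUrl) {
    page.putToMetadata(SYNTHETIC,
        ByteBuffer.wrap(containerUrl.getBytes(StandardCharsets.UTF_8)));
  }

  /** The generator would skip any page for which this returns true. */
  public static boolean isSynthetic(WebPage page) {
    return page.getFromMetadata(SYNTHETIC) != null;
  }
}

Storing the container URL as the mark's value is what enables the cleanup in c): when a container is re-fetched, a scan for pages whose synthetic mark equals that URL finds all of its (possibly stale) members.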
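Third, the SubCrawler flow spelled out. All of these names are hypothetical (modeled on Aperture's SubCrawler concept, none of them exist in Nutch); the point is that members reach the regular parsing chain one stream at a time, so nothing accumulates in memory:

import java.io.InputStream;

public class SubCrawlerSketch {

  /** Called once per section (member) of the compound document. */
  public interface SectionHandler {
    void handleSection(String name, String mimeType, InputStream section)
        throws Exception;
  }

  /** Streams through a container, handing members out one at a time. */
  public interface SubCrawler {
    void crawl(SectionHandler handler) throws Exception;
  }

  /** Stand-in for SubCrawlerFactory.create(typeX, containerDataStream). */
  public static SubCrawler create(String containerType, InputStream data) {
    throw new UnsupportedOperationException("lookup impl for " + containerType);
  }

  /** Caller side (ParseUtil?), after the main plugin signals
   *  "this is a compound format of type X". */
  public static void parseCompound(final String containerUrl, String typeX,
      InputStream containerDataStream) throws Exception {
    SubCrawler sc = create(typeX, containerDataStream);
    sc.crawl(new SectionHandler() {
      public void handleSection(String name, String mime, InputStream in)
          throws Exception {
        // Re-enter the regular parsing chain here; the !/ notation gives
        // each member its own child URL (and its own WebPage, per option c).
        // A real implementation would call into ParseUtil instead of this:
        System.out.println("would parse " + containerUrl + "!/" + name
            + " as " + mime);
      }
    });
  }
}

The win over a materialized ParseResult is that iteration is incremental: the caller sees one member at a time and can update storage as it goes.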
What do you think?

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory
Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++