Hey,

On Thu, Jul 22, 2010 at 00:47, Andrzej Bialecki <[email protected]> wrote:
> Hi,
>
> I noticed that nutchbase doesn't use the multi-valued ParseResult, instead
> all parse plugins return a simple Parse. As a consequence, it's not possible
> to return multiple values from parsing a single WebPage, something that
> parsers for compound documents absolutely require (archives, rss, mbox,
> etc). Dogacan - was there a particular reason for this change?
>

No. Even though I wrote most of the original ParseResult code, I couldn't
wrap my head around how to update the WebPage (or old TableRow) API to use
ParseResult.

> However, a broader issue here is how to treat compound documents, and links
> to/from them:
> a) record all URLs of child documents (e.g. with the !/ notation, or #
> notation), and create as many WebPage-s as there were archive members. This
> needs some hacks to prevent such urls from being scheduled for fetching.
> b) extend WebPage to allow for multiple content sections and their names
> (and metadata, and ... yuck)
> c) like a) except put a special "synthetic" mark on the page to prevent
> selection of this page for generation and fetching. This mark would also
> help us to update / remove obsolete sub-documents when their
> container changes.
>
> I'm leaning towards c).
>

I was initially leaning towards (a), but I think (c) sounds good too. The nice
thing about (c) is that these documents will correctly get inlinks (assuming
the URL given to them makes sense; for an RSS feed, I am thinking this would
be the <link> element), etc. Though this can also be a problem: in some
instances you may want to refetch a URL that happens to be a link in a feed.

> Now, when it comes to the ParseResult ... it's not an ideal solution
> either, because it means we have to keep all sub-document results in memory.
> We could avoid it by implementing something that Aperture uses, which is a
> sub-crawler - a concept of a parser plugin for compound formats. The main
> plugin would return a special result code, which basically says "this is a
> compound format of type X", and then the caller (ParseUtil?) would use
> SubCrawlerFactory.create(typeX, containerDataStream) to create a parser for
> the container. This parser in turn would simply extract sections of the
> compound document (as streams) and it would pass each stream to the regular
> parsing chain. The caller then needs to iterate over results returned from
> the SubCrawler. What do you think?
>

This is excellent :) +1. (See the P.S. below for a rough sketch of how this
could look.)

> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com

--
Doğacan Güney
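P.S. To make the sub-crawler idea a bit more concrete, here is a rough Java
sketch of how it could look. Only SubCrawlerFactory.create(type,
containerDataStream) is taken from your mail; everything else (SubDocument,
the member iteration, the caller loop) is invented here for discussion and is
not existing nutchbase code:

// Sketch only: hypothetical types, not current nutchbase API.
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;

/** One member of a compound document: a (synthetic) URL plus its raw content. */
class SubDocument {
  final String url;           // e.g. the feed item's <link>, or container-url!/member
  final String contentType;   // detected type of the member
  final InputStream content;  // member bytes, streamed out of the container

  SubDocument(String url, String contentType, InputStream content) {
    this.url = url;
    this.contentType = contentType;
    this.content = content;
  }
}

/**
 * A parser plugin for compound formats. It never parses members itself; it
 * only extracts them as streams so the regular parsing chain can handle each.
 */
interface SubCrawler {
  /** Lazy iteration, so we never hold all sub-document results in memory. */
  Iterator<SubDocument> members() throws IOException;
}

class SubCrawlerFactory {
  /** Look up a SubCrawler for a compound type ("application/zip",
   *  "application/rss+xml", "message/rfc822", ...); null if none registered. */
  static SubCrawler create(String compoundType, InputStream containerData) {
    return null; // plugin-repository lookup would go here
  }
}

class CompoundParseSketch {
  /** Roughly what the caller (ParseUtil?) would do after a parse plugin returns
   *  the special "this is a compound format of type X" result code. */
  static void parseCompound(String compoundType, InputStream containerData)
      throws IOException {
    SubCrawler sub = SubCrawlerFactory.create(compoundType, containerData);
    if (sub == null) {
      return; // no sub-crawler registered for this container type
    }
    Iterator<SubDocument> members = sub.members();
    while (members.hasNext()) {
      SubDocument doc = members.next();
      // 1. run doc.content through the regular parsing chain for doc.contentType;
      // 2. store the result as its own WebPage row keyed by doc.url, carrying the
      //    "synthetic" mark from option (c) so Generator/Fetcher skip it.
    }
  }
}

The nice part is that each member gets handled by whatever parser plugin
already exists for its content type, and memory use stays bounded by one
member at a time instead of the whole multi-valued result.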

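P.P.S. And for the synthetic mark in (c), something along these lines (I'm
treating the page metadata as a plain map here; the key name is made up, and
the real WebPage accessors may end up looking different):

// Hypothetical sketch of option (c)'s "synthetic" mark; not real nutchbase code.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;

class SyntheticMarkSketch {
  /** Made-up key: present => this row is a container member; the value is the
   *  container URL, which gives us a handle for updating/removing obsolete
   *  sub-documents when the container changes. */
  static final CharSequence SYNTHETIC_MARK = "_syn_";

  static void markSynthetic(Map<CharSequence, ByteBuffer> pageMetadata,
      String containerUrl) {
    pageMetadata.put(SYNTHETIC_MARK,
        ByteBuffer.wrap(containerUrl.getBytes(StandardCharsets.UTF_8)));
  }

  /** Generator would check this and never select marked rows for fetching. */
  static boolean shouldGenerate(Map<CharSequence, ByteBuffer> pageMetadata) {
    return !pageMetadata.containsKey(SYNTHETIC_MARK);
  }
}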
