Hey,

On Thu, Jul 22, 2010 at 00:47, Andrzej Bialecki <[email protected]> wrote:
> Hi,
>
> I noticed that nutchbase doesn't use the multi-valued ParseResult, instead
> all parse plugins return a simple Parse. As a consequence, it's not possible
> to return multiple values from parsing a single WebPage, something that
> parsers for compound documents absolutely require (archives, rss, mbox,
> etc). Dogacan - was there a particular reason for this change?
>

No. Even though I wrote most of the original ParseResult code, I couldn't
wrap my head around how to update the WebPage (or old TableRow) API to use
ParseResult.

> However, a broader issue here is how to treat compound documents, and links
> to/from them:
> a) record all URLs of child documents (e.g. with the !/ notation, or #
> notation), and create as many WebPage-s as there were archive members. This
> needs some hacks to prevent such urls from being scheduled for fetching.
> b) extend WebPage to allow for multiple content sections and their names
> (and metadata, and ... yuck)
> c) like a) except put a special "synthetic" mark on the page to prevent
> selection of this page for generation and fetching. This mark would also
> help us to update / remove obsolete sub-documents when their
> container changes.
>
> I'm leaning towards c).
>

I was initially leaning towards (a), but I think (c) sounds good too. The nice
thing about (c) is that these documents will correctly get inlinks (assuming
the URL given to them makes sense; for an RSS feed, I am thinking this would
be the <link> element), etc. Though this can also be a problem: in some
instances you may want to refetch a URL that happens to be a link in a feed.

> Now, when it comes to the ParseResult ... it's not an ideal solution
> either, because it means we have to keep all sub-document results in memory.
> We could avoid it by implementing something that Aperture uses, which is a
> sub-crawler - a concept of a parser plugin for compound formats. The main
> plugin would return a special result code, which basically says "this is a
> compound format of type X", and then the caller (ParseUtil?) would use
> SubCrawlerFactory.create(typeX, containerDataStream) to create a parser for
> the container. This parser in turn would simply extract sections of the
> compound document (as streams) and it would pass each stream to the regular
> parsing chain. The caller then needs to iterate over results returned from
> the SubCrawler. What do you think?
>

This is excellent :) +1. (See the P.S. below for a rough sketch of how this
could look.)

> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com

--
Doğacan Güney
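P.S. To make the sub-crawler idea a bit more concrete, here is a rough Java
sketch of how it could look. Only SubCrawlerFactory.create(type,
containerDataStream) is taken from your mail; everything else (SubDocument,
the member iteration, the caller loop) is invented here for discussion and is
not existing nutchbase code:

// Sketch only: hypothetical types, not current nutchbase API.
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;

/** One member of a compound document: a (synthetic) URL plus its raw content. */
class SubDocument {
  final String url;           // e.g. the feed item's <link>, or container-url!/member
  final String contentType;   // detected type of the member
  final InputStream content;  // member bytes, streamed out of the container

  SubDocument(String url, String contentType, InputStream content) {
    this.url = url;
    this.contentType = contentType;
    this.content = content;
  }
}

/**
 * A parser plugin for compound formats. It never parses members itself; it
 * only extracts them as streams so the regular parsing chain can handle each.
 */
interface SubCrawler {
  /** Lazy iteration, so we never hold all sub-document results in memory. */
  Iterator<SubDocument> members() throws IOException;
}

class SubCrawlerFactory {
  /** Look up a SubCrawler for a compound type ("application/zip",
   *  "application/rss+xml", "message/rfc822", ...); null if none registered. */
  static SubCrawler create(String compoundType, InputStream containerData) {
    return null; // plugin-repository lookup would go here
  }
}

class CompoundParseSketch {
  /** Roughly what the caller (ParseUtil?) would do after a parse plugin returns
   *  the special "this is a compound format of type X" result code. */
  static void parseCompound(String compoundType, InputStream containerData)
      throws IOException {
    SubCrawler sub = SubCrawlerFactory.create(compoundType, containerData);
    if (sub == null) {
      return; // no sub-crawler registered for this container type
    }
    Iterator<SubDocument> members = sub.members();
    while (members.hasNext()) {
      SubDocument doc = members.next();
      // 1. run doc.content through the regular parsing chain for doc.contentType;
      // 2. store the result as its own WebPage row keyed by doc.url, carrying the
      //    "synthetic" mark from option (c) so Generator/Fetcher skip it.
    }
  }
}

The nice part is that each member gets handled by whatever parser plugin
already exists for its content type, and memory use stays bounded by one
member at a time instead of the whole multi-valued result.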

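P.P.S. And for the synthetic mark in (c), something along these lines (I'm
treating the page metadata as a plain map here; the key name is made up, and
the real WebPage accessors may end up looking different):

// Hypothetical sketch of option (c)'s "synthetic" mark; not real nutchbase code.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;

class SyntheticMarkSketch {
  /** Made-up key: present => this row is a container member; the value is the
   *  container URL, which gives us a handle for updating/removing obsolete
   *  sub-documents when the container changes. */
  static final CharSequence SYNTHETIC_MARK = "_syn_";

  static void markSynthetic(Map<CharSequence, ByteBuffer> pageMetadata,
      String containerUrl) {
    pageMetadata.put(SYNTHETIC_MARK,
        ByteBuffer.wrap(containerUrl.getBytes(StandardCharsets.UTF_8)));
  }

  /** Generator would check this and never select marked rows for fetching. */
  static boolean shouldGenerate(Map<CharSequence, ByteBuffer> pageMetadata) {
    return !pageMetadata.containsKey(SYNTHETIC_MARK);
  }
}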
