Re: [Nutchbase] Multi-value ParseResult missing

2010-07-22 Thread Doğacan Güney
Hey,

On Thu, Jul 22, 2010 at 00:47, Andrzej Bialecki a...@getopt.org wrote:

> Hi,
>
> I noticed that nutchbase doesn't use the multi-valued ParseResult, instead
> all parse plugins return a simple Parse. As a consequence, it's not possible
> to return multiple values from parsing a single WebPage, something that
> parsers for compound documents absolutely require (archives, rss, mbox,
> etc). Dogacan - was there a particular reason for this change?


No. Even though I wrote most of the original ParseResult code, I couldn't
wrap my head around how to update the WebPage (or old TableRow) API to use
ParseResult.
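
For reference, the multi-valued shape is roughly this - a simplified,
self-contained sketch rather than the exact trunk API (Parse stands in for
Nutch's parse output type):

    import java.util.Iterator;
    import java.util.LinkedHashMap;
    import java.util.Map;

    interface Parse { /* stand-in for Nutch's org.apache.nutch.parse.Parse */ }

    // A multi-valued parse result: parsing one fetched container can emit
    // a Parse per sub-document, keyed by the sub-document's URL.
    public class ParseResultSketch implements Iterable<Map.Entry<String, Parse>> {

      private final Map<String, Parse> parses = new LinkedHashMap<String, Parse>();

      // convenience for the common single-document case
      public static ParseResultSketch of(String url, Parse parse) {
        ParseResultSketch result = new ParseResultSketch();
        result.put(url, parse);
        return result;
      }

      public void put(String url, Parse parse) {
        parses.put(url, parse);
      }

      public Parse get(String url) {
        return parses.get(url);
      }

      public Iterator<Map.Entry<String, Parse>> iterator() {
        return parses.entrySet().iterator();
      }
    }

The mapping from this to WebPage is exactly the part that was unclear to me:
a single WebPage row has nowhere to hang multiple Parse values.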


> However, a broader issue here is how to treat compound documents, and links
> to/from them:
>  a) record all URLs of child documents (e.g. with the "!/" notation, or "#"
> notation), and create as many WebPage-s as there were archive members. This
> needs some hacks to prevent such URLs from being scheduled for fetching.
>  b) extend WebPage to allow for multiple content sections and their names
> (and metadata, and ... yuck)
>  c) like a) except put a special synthetic mark on the page to prevent
> selection of this page for generation and fetching. This mark would also
> help us to update / remove obsolete sub-documents when their
> container changes.
>
> I'm leaning towards c).


I was initially leaning towards (a), but I think (c) sounds good too. The
nice thing about (c) is that these documents will correctly get inlinks
(assuming the URL given to them makes sense - for an RSS feed, I am
thinking this would be the link element), etc. Though this can be a
problem too, since in some instances you may want to refetch a URL that
happens to be a link in a feed.
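
To make the feed case concrete, a rough sketch of what (c) might look like
at parse time - every name here is invented, and a real version would go
through nutchbase's WebPage class rather than plain maps:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of option (c) for a feed parser (invented names; "_csm_" is an
    // assumed metadata key, and the store stands in for the real data store).
    public class FeedSubPageSketch {

      static final String SYNTHETIC_MARK = "_csm_";

      // Each feed entry becomes its own row. The row key is the entry's
      // <link> URL, so inlinks resolve to it naturally; the mark carries a
      // back-pointer to the containing feed.
      static void addSubPage(Map<String, Map<String, String>> store,
                             String entryLinkUrl, String feedUrl) {
        Map<String, String> metadata = new HashMap<String, String>();
        metadata.put(SYNTHETIC_MARK, feedUrl);
        // note: if this URL should also be fetched in its own right, the
        // mark would have to be cleared or overridden somehow
        store.put(entryLinkUrl, metadata);
      }
    }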


> Now, when it comes to the ParseResult ... it's not an ideal solution
> either, because it means we have to keep all sub-document results in memory.
> We could avoid it by implementing something that Aperture uses, which is a
> "sub-crawler" - a concept of a parser plugin for compound formats. The main
> plugin would return a special result code, which basically says "this is a
> compound format of type X", and then the caller (ParseUtil?) would use
> SubCrawlerFactory.create(typeX, containerDataStream) to create a parser for
> the container. This parser in turn would simply extract sections of the
> compound document (as streams) and it would pass each stream to the regular
> parsing chain. The caller then needs to iterate over results returned from
> the SubCrawler. What do you think?


This is excellent :) +1.
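
A rough sketch of the plumbing, to make sure I follow - all names invented
(this mirrors the idea, not Aperture's actual API):

    import java.io.InputStream;

    // One member of a compound document: its URL plus its content stream.
    interface SubDocument {
      String getUrl();            // e.g. container URL plus member name
      InputStream getContent();   // member content, streamed
    }

    // Walks a compound container, yielding members one at a time so the
    // caller never has to hold all sub-document results in memory.
    interface SubCrawler extends Iterable<SubDocument> {
    }

    final class SubCrawlerFactory {
      private SubCrawlerFactory() {}

      // Matching the call from the proposal; a real factory would dispatch
      // to a registered plugin for the given compound type.
      static SubCrawler create(String typeX, InputStream containerDataStream) {
        throw new UnsupportedOperationException("no plugin for " + typeX);
      }
    }

    // Caller side (ParseUtil?), once the main plugin reports a compound type:
    //
    //   SubCrawler sub = SubCrawlerFactory.create(typeX, containerDataStream);
    //   for (SubDocument member : sub) {
    //     // hand member.getContent() to the normal parsing chain and
    //     // consume each resulting Parse as it is produced
    //   }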


> --
> Best regards,
> Andrzej Bialecki
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com




-- 
Doğacan Güney


[Nutchbase] Multi-value ParseResult missing

2010-07-21 Thread Andrzej Bialecki

Hi,

I noticed that nutchbase doesn't use the multi-valued ParseResult, 
instead all parse plugins return a simple Parse. As a consequence, it's 
not possible to return multiple values from parsing a single WebPage, 
something that parsers for compound documents absolutely require 
(archives, rss, mbox, etc). Dogacan - was there a particular reason for 
this change?


However, a broader issue here is how to treat compound documents, and 
links to/from them:
 a) record all URLs of child documents (e.g. with the "!/" notation, or "#"
notation), and create as many WebPage-s as there were archive members.
This needs some hacks to prevent such URLs from being scheduled for
fetching.
 b) extend WebPage to allow for multiple content sections and their 
names (and metadata, and ... yuck)
 c) like a) except put a special synthetic mark on the page to
prevent selection of this page for generation and fetching. This mark
would also help us to update / remove obsolete sub-documents when their
container changes.

I'm leaning towards c).
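
Roughly what I have in mind for the mark - a sketch only, with invented
names ("_csm_" is an assumed metadata key; real generator/updater code
would differ):

    import java.util.Map;
    import java.util.Set;

    public class SyntheticMarkSketch {

      static final String SYNTHETIC_MARK = "_csm_";

      // Generator side: never select synthetic sub-documents for fetching.
      static boolean eligibleForGenerate(Map<String, String> metadata) {
        return !metadata.containsKey(SYNTHETIC_MARK);
      }

      // Updater side: when a container is re-parsed, a sub-document whose
      // back-pointer names that container, but which no longer appears in
      // the fresh parse, is obsolete and can be deleted.
      static boolean isObsolete(String pageUrl, Map<String, String> metadata,
                                String containerUrl, Set<String> freshSubUrls) {
        return containerUrl.equals(metadata.get(SYNTHETIC_MARK))
            && !freshSubUrls.contains(pageUrl);
      }
    }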

Now, when it comes to the ParseResult ... it's not an ideal solution 
either, because it means we have to keep all sub-document results in 
memory. We could avoid it by implementing something that Aperture uses, 
which is a "sub-crawler" - a concept of a parser plugin for compound
formats. The main plugin would return a special result code, which
basically says "this is a compound format of type X", and then the
caller (ParseUtil?) would use SubCrawlerFactory.create(typeX, 
containerDataStream) to create a parser for the container. This parser 
in turn would simply extract sections of the compound document (as 
streams) and it would pass each stream to the regular parsing chain. The 
caller then needs to iterate over results returned from the SubCrawler. 
What do you think?


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [Nutchbase] Multi-value ParseResult missing

2010-07-21 Thread Mattmann, Chris A (388J)
Hey Andrzej,

We're having the same sorts of discussions in Tika-ville right now. Check out 
this page on the Tika wiki:

http://wiki.apache.org/tika/MetadataDiscussion

Comments and thoughts welcome. Depending on what comes out of Tika, we may be
able to leverage it...

Cheers,
Chris





++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++