Hi,

On 8/18/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> A very important aspect of the Parser interface (or actually, the Parse
> and Content classes) is that they each may contain arbitrary metadata.
> This is required for discovering and passing around both the original
> metadata (such as protocol headers, document properties, etc), and other
> secondary content (such as data from external sources, or derived metadata).

Is there a list of all the different metadata items that get passed in
or out of the parser components? My hunch is that the list of items is
relatively short and that even though different parsers might input or
output different metadata, it still might make sense to come up with a
general content model that serves the needs of everyone.

> Simply returning a String doesn't cut it. Returning a java.util.Map may
> be an option, if you use standard Metadata constants as keys - still,
> Nutch would have to repackage this anyway into a Writable. And we would
> lose a nice property of the current Metadata class, which is the ability
> to tolerate minor syntax variations and to store multiple values per key.

The problem I see with a Map or a similar keyed solution is that you
only get to specify the metadata contract as documented (if ever)
keys instead of as a compile-time interface. Using a Map is fine if
the set of managed information truly varies at runtime, but not when
the set is fixed or at least well bounded.
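
To illustrate the difference, here is a rough sketch in Java. All the
names in it (DocumentMetadata, getTitle, the "title" and "content-type"
keys) are hypothetical and made up for this example; they are not
taken from Nutch or any other existing API.

```java
import java.util.HashMap;
import java.util.Map;

class MetadataContractSketch {

    /** Hypothetical typed contract; the getters are illustrative only. */
    interface DocumentMetadata {
        String getTitle();
        String getContentType();
    }

    /** Trivial immutable implementation of the hypothetical contract. */
    static final class SimpleMetadata implements DocumentMetadata {
        private final String title;
        private final String contentType;

        SimpleMetadata(String title, String contentType) {
            this.title = title;
            this.contentType = contentType;
        }

        public String getTitle() { return title; }
        public String getContentType() { return contentType; }
    }

    /** Keyed access: the contract lives only in the key strings. */
    static String describeFromMap(Map<String, String> metadata) {
        // If a producer misspells a key, everything still compiles and
        // the lookup below simply returns null at runtime.
        return metadata.get("title") + " (" + metadata.get("content-type") + ")";
    }

    /** Typed access: a misspelled getter name would not compile. */
    static String describeTyped(DocumentMetadata metadata) {
        return metadata.getTitle() + " (" + metadata.getContentType() + ")";
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("titel", "Hello");             // typo in the key goes unnoticed
        map.put("content-type", "text/html");
        System.out.println(describeFromMap(map));
        System.out.println(describeTyped(new SimpleMetadata("Hello", "text/html")));
    }
}
```

With the misspelled key the Map variant quietly produces a null title,
while the same mistake against the interface would be a compile error.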

Another concern with both the Parse class in Nutch and my
TextExtractor interface is that the body content is returned as a
single text stream (a String and a Reader respectively). This doesn't
allow the parser to pass along extra information like the emphasis of
certain parts (think of headings or links in HTML) or the language of
the text (e.g. xml:lang). I'm not familiar enough with Lucene to know if
it could use such information, so this might be a YAGNI, but inversion
of control with a Builder interface would be a pretty powerful
solution for passing such information.
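
A minimal sketch of what I mean, in Java. The ContentBuilder interface
and its methods are hypothetical names invented for this example, not
an existing Nutch or Lucene API, and parseSample() is just a stand-in
for a real parser. The parser drives the builder instead of returning
a finished String, so each consumer decides what to keep.

```java
import java.util.ArrayList;
import java.util.List;

class ContentBuilderSketch {

    /** Hypothetical callback interface that the parser calls into. */
    interface ContentBuilder {
        void startHeading(int level);
        void endHeading();
        void language(String lang);   // e.g. the value of xml:lang
        void text(String chars);
    }

    /** Consumer that ignores all structure and collects plain text. */
    static final class PlainTextBuilder implements ContentBuilder {
        private final StringBuilder out = new StringBuilder();

        public void startHeading(int level) { }
        public void endHeading() { }
        public void language(String lang) { }
        public void text(String chars) { out.append(chars); }

        String result() { return out.toString(); }
    }

    /** Consumer that keeps only heading text, e.g. to weight it higher. */
    static final class HeadingCollector implements ContentBuilder {
        private final List<String> headings = new ArrayList<>();
        private StringBuilder current;   // non-null only inside a heading

        public void startHeading(int level) { current = new StringBuilder(); }
        public void endHeading() {
            if (current != null) {
                headings.add(current.toString());
                current = null;
            }
        }
        public void language(String lang) { }
        public void text(String chars) {
            if (current != null) {
                current.append(chars);
            }
        }

        List<String> headings() { return headings; }
    }

    /** Stand-in for a parser: pushes a fixed sample document. */
    static void parseSample(ContentBuilder builder) {
        builder.language("en");
        builder.startHeading(1);
        builder.text("Title");
        builder.endHeading();
        builder.text(" Body text.");
    }

    public static void main(String[] args) {
        PlainTextBuilder plain = new PlainTextBuilder();
        parseSample(plain);
        System.out.println(plain.result());

        HeadingCollector collector = new HeadingCollector();
        parseSample(collector);
        System.out.println(collector.headings());
    }
}
```

A consumer that only wants a flat text stream loses nothing with this
design, while one that cares about headings or language gets them
without any change to the parser.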

BR,

Jukka Zitting

-- 
Yukatan - http://yukatan.fi/ - [EMAIL PROTECTED]
Software craftsmanship, JCR consulting, and Java development

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers