Hi,

On 8/18/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> A very important aspect of the Parser interface (or actually, the Parse
> and Content classes) is that they each may contain arbitrary metadata.
> This is required for discovering and passing around both the original
> metadata (such as protocol headers, document properties, etc), and other
> secondary content (such as data from external sources, or derived metadata).

Is there a list of all the different metadata items that get passed in
or out of the parser components? My hunch is that the list of items is
relatively short and that even though different parsers might input or
output different metadata, it still might make sense to come up with a
general content model that serves the needs of everyone.

> Simply returning a String doesn't cut it. Returning a java.util.Map may
> be an option, if you use standard Metadata constants as keys - still,
> Nutch would have to repackage this anyway into a Writable. And we would
> lose a nice property of the current Metadata class, which is the ability
> to tolerate minor syntax variations and to store multiple values per key.

The problem I see with a Map or a similar keyed solution is that you
only get to specify the metadata contract as documented (if at all) keys
instead of as a compile-time interface. Using a Map is fine if the set
of managed information truly varies at runtime, but not when the set is
fixed or at least well bounded.

Another concern with both the Parse class in Nutch and my TextExtractor
interface is that the body content is returned as a single text stream
(a String and a Reader respectively). This doesn't allow the parser to
pass along extra information like the emphasis of certain parts (think
of headings or links in HTML) or the language of the text (e.g.
xml:lang). I'm not familiar enough with Lucene to know if it could use
such information, so this might be a YAGNI, but inversion of control
with a Builder interface would be a pretty powerful solution for passing
such information.

BR,

Jukka Zitting

-- 
Yukatan - http://yukatan.fi/ - [EMAIL PROTECTED]
Software craftsmanship, JCR consulting, and Java development
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
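
To illustrate the Builder idea raised in the message above, here is a minimal Java sketch of inversion of control between a parser and its consumer. Everything in it is hypothetical: ContentBuilder, ToyParser, IndexingBuilder and the line-based input format are made up for this example and are not part of Nutch, Lucene, or the TextExtractor interface. The point is only that the parser pushes typed events (title, language, heading, text) into an interface, so the metadata contract is checked at compile time instead of living in Map keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical callback interface (an assumption, not an existing API).
// The parser calls these methods while walking the document.
interface ContentBuilder {
    void title(String title);
    void language(String languageCode); // e.g. taken from xml:lang
    void heading(String text);          // emphasized text
    void text(String text);             // ordinary body text
}

// Toy stand-in for a real parser; it reads a made-up line-based format.
class ToyParser {
    static void parse(List<String> lines, ContentBuilder builder) {
        for (String line : lines) {
            if (line.startsWith("title:")) {
                builder.title(line.substring("title:".length()).trim());
            } else if (line.startsWith("lang:")) {
                builder.language(line.substring("lang:".length()).trim());
            } else if (line.startsWith("# ")) {
                builder.heading(line.substring(2).trim());
            } else {
                builder.text(line);
            }
        }
    }
}

// One possible consumer: it keeps headings apart from the body so that
// an indexer could weight them differently if it wanted to.
class IndexingBuilder implements ContentBuilder {
    String title;
    String language;
    final List<String> headings = new ArrayList<>();
    final StringBuilder body = new StringBuilder();

    public void title(String title) { this.title = title; }
    public void language(String languageCode) { this.language = languageCode; }
    public void heading(String text) { headings.add(text); appendBody(text); }
    public void text(String text) { appendBody(text); }

    // Headings and plain text both end up in the body, one per line.
    private void appendBody(String text) {
        if (body.length() > 0) {
            body.append('\n');
        }
        body.append(text);
    }
}

class BuilderDemo {
    public static void main(String[] args) {
        IndexingBuilder builder = new IndexingBuilder();
        ToyParser.parse(
            Arrays.asList("title: Example", "lang: fi", "# Intro", "Some body text."),
            builder);
        System.out.println(builder.title + " [" + builder.language + "] " + builder.headings);
    }
}
```

A consumer that only wants plain text can implement the same interface and ignore the distinction between heading() and text(), so the single-stream behaviour is still available as a special case.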
