Jukka Zitting wrote:
Hi,
On 8/18/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
A very important aspect of the Parser interface (or actually, the Parse
and Content classes) is that they each may contain arbitrary metadata.
This is required for discovering and passing around both the original
metadata (such as protocol headers, document properties, etc.), and other
secondary content (such as data from external sources, or derived
metadata).
Is there a list of all the different metadata items that get passed in
or out of the parser components? My hunch is that the list of items is
relatively short and that even though different parsers might input or
output different metadata, it still might make sense to come up with a
general content model that serves the needs of everyone.
Simply returning a String doesn't cut it. Returning a java.util.Map may
be an option, if you use standard Metadata constants as keys - still,
Nutch would have to repackage this anyway into a Writable. And we would
lose a nice property of the current Metadata class, which is the ability
to tolerate minor syntax variations and to store multiple values per key.
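To make the two properties mentioned above concrete (tolerance of minor key-syntax variations, and multiple values per key), here is a minimal sketch in Java. It is not the Nutch Metadata class: the name SimpleMetadata, its methods, and the normalization rule (lower-case, drop non-alphanumeric characters) are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of a multi-valued, syntax-tolerant metadata holder. */
class SimpleMetadata {
    // Normalized key -> all values added under any spelling of that key.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Assumed normalization: "Content-Type", "content_type" and
    // "CONTENTTYPE" all map to the same key "contenttype".
    private static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    /** Appends a value; earlier values for the same key are kept. */
    void add(String name, String value) {
        values.computeIfAbsent(normalize(name), k -> new ArrayList<>()).add(value);
    }

    /** Returns all values for the key, or an empty list if none exist. */
    List<String> getValues(String name) {
        List<String> found = values.get(normalize(name));
        return found == null ? Collections.<String>emptyList() : found;
    }

    /** Returns the first value for the key, or null if none exist. */
    String get(String name) {
        List<String> found = getValues(name);
        return found.isEmpty() ? null : found.get(0);
    }
}
```

A plain java.util.Map<String, String> offers neither behavior by itself: differently spelled keys are distinct entries, and a second put() replaces the first value.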
The problem I see with a Map or a similar keyed solution is that you
only get to specify the metadata contract as documented (if at all)
keys instead of as a compile-time interface. Using a Map is fine if
the set of managed information truly varies at runtime, but not when
the set is fixed or at least well bounded.
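The difference between a keyed contract and a compile-time contract could be sketched as follows. The interface DocumentProperties, the class MapBackedProperties, and the two key names are hypothetical and chosen only for this example; nothing here is taken from the Nutch code base.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical compile-time contract: the fixed set of items is visible in the type. */
interface DocumentProperties {
    String getTitle();
    String getContentType();
}

/** Hypothetical adapter that exposes a keyed map through the typed interface. */
class MapBackedProperties implements DocumentProperties {
    // Assumed key names; with a bare Map these exist only as documentation.
    static final String TITLE = "title";
    static final String CONTENT_TYPE = "content-type";

    private final Map<String, String> raw;

    MapBackedProperties(Map<String, String> raw) {
        // Defensive copy so later changes to the caller's map are not seen here.
        this.raw = new HashMap<>(raw);
    }

    @Override
    public String getTitle() {
        return raw.get(TITLE);
    }

    @Override
    public String getContentType() {
        return raw.get(CONTENT_TYPE);
    }
}
```

With the bare map, a misspelled key such as raw.get("titel") compiles and silently returns null. With the interface, a misspelled call such as getTitel() is rejected by the compiler. The cost is that every new item requires a change to the interface.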
So far Nutch has been built to deal mainly with text documents.
There is, however, also a need to deal with non-textual objects, e.g.
images, movies, and sound, which will provide content only in the form
of metadata (ok, perhaps also some text about the context of the object,
if applicable), so the metadata names we have today are only a subset of
what might be needed.
I really would not want to restrict the metadata the interface can carry
to a fixed set.
--
Sami Siren