Jukka Zitting wrote:
> Hi,
> 
> On 8/18/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>> A very important aspect of the Parser interface (or actually, the Parse
>> and Content classes) is that they each may contain arbitrary metadata.
>> This is required for discovering and passing around both the original
>> metadata (such as protocol headers, document properties, etc), and other
>> secondary content (such as data from external sources, or derived 
>> metadata).
> 
> Is there a list of all the different metadata items that get passed in
> or out of the parser components? My hunch is that the list of items is
> relatively short and that even though different parsers might input or
> output different metadata, it still might make sense to come up with a
> general content model that serves the needs of everyone.
>
>> Simply returning a String doesn't cut it. Returning a java.util.Map may
>> be an option, if you use standard Metadata constants as keys - still,
>> Nutch would have to repackage this anyway into a Writable. And we would
>> lose a nice property of the current Metadata class, which is the ability
>> to tolerate minor syntax variations and to store multiple values per key.
> 
> The problem I see with a Map or a similar keyed solution is that you
> only get to specify the metadata contract as documented (if ever)
> keys instead of as a compile-time interface. Using a Map is fine if
> the set of managed information truly varies at runtime, but not when
> the set is fixed or at least well bounded.

So far Nutch has been built to deal mainly with text documents.
There is, however, also a need to deal with non-textual objects, e.g. images,
movies, and sound, which provide content only in the form of metadata (OK,
perhaps also some text about the context of the object, if applicable), so
the metadata names we have today are only a subset of what might be needed.

I really would not want to restrict the metadata the interface can carry 
to a fixed set.
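
A minimal sketch of what an open-ended container with the properties
discussed above (tolerant key matching, multiple values per key) could
look like. OpenMetadata, its normalization rule, and the sample key
names are all hypothetical and chosen for illustration; this is not the
existing Nutch Metadata class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only, not Nutch's Metadata class.
// Keys are not restricted to a fixed set, and each key may hold several values.
class OpenMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Assumed normalization rule: lower-case the name and drop every character
    // that is not a letter or digit, so "Content-Type" and "content_type" match.
    static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    // Appends a value under the normalized key; earlier values are kept.
    public void add(String name, String value) {
        values.computeIfAbsent(normalize(name), k -> new ArrayList<>()).add(value);
    }

    // Returns all values for the key in insertion order, or an empty list.
    public List<String> getValues(String name) {
        List<String> found = values.get(normalize(name));
        if (found == null) {
            return Collections.<String>emptyList();
        }
        return Collections.unmodifiableList(found);
    }

    // Returns the first value for the key, or null if the key is absent.
    public String get(String name) {
        List<String> found = getValues(name);
        return found.isEmpty() ? null : found.get(0);
    }

    public static void main(String[] args) {
        OpenMetadata meta = new OpenMetadata();
        meta.add("Content-Type", "image/jpeg");
        meta.add("ImageWidth", "1024");   // made-up key for a non-textual object
        meta.add("keyword", "sunset");
        meta.add("Keyword", "beach");     // same key after normalization

        System.out.println(meta.get("content_type"));   // prints image/jpeg
        System.out.println(meta.getValues("KEYWORD"));  // prints [sunset, beach]
    }
}
```

A parser for images or sound could add whatever names it discovers
without any change to the interface. The trade-off is the one raised in
the quoted text: the contract lives in documented key names rather than
in compile-time types.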

--
  Sami Siren


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers