Jukka Zitting wrote:
> Hi,
>
> On 8/19/06, Sami Siren <[EMAIL PROTECTED]> wrote:
>> So far Nutch has been built to deal mainly with text-type documents.
>> However, there is also a need to deal with non-textual objects, e.g. images,
>> movies, sound, which will provide content only in the form of metadata (ok,
>> perhaps also some text about the context of the object, if applicable), so
>> the metadata names we have today are only a subset of what might be.
>>
>> I really would not want to restrict the metadata the interface can carry
>> to a fixed set.
>
> But if it's an open Map, how do you index and search using that, i.e.
> what is the mapping between the Map keys used by a parser component
> and the field names in the resulting Lucene index? How do we enforce
> that an MPEG parser uses the same Map keys as a JPEG parser when
> encountering metadata with the same semantics?
>
> I'm not opposed to using a Map for truly variable metadata, like HTML
> <meta/> tags with unknown names, but if we want common handling for
> example for Dublin Core metadata, it would be better to enforce that
> on the interface level.

Well, Nutch already does this in a way, but it's a "soft" endorsement 
rather than a hard enforcement .. ;) We define keys for all common 
metadata sets (DC, Office, HttpHeaders), and plugin writers are supposed 
to use them, unless they can't find any metadata key with matching 
semantics.

Then, other indexing plugins expect certain metadata to be available 
under these keys, and create appropriate Lucene fields, again using 
predefined field names.
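
To make the convention concrete, here is a rough sketch of the idea. It is not actual Nutch code: the class name, the key constants, the field names and the two toy "parsers" are all made up for illustration. It only shows two parsers sharing predefined keys, and an indexing step that turns those keys into predefined field names while ignoring keys it does not know.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; none of these names come from the Nutch code base.
public class MetadataKeysSketch {

    // Hypothetical shared key constants that every parser is expected to reuse.
    static final String DC_TITLE = "title";
    static final String DC_CREATOR = "creator";

    // Hypothetical mapping from shared metadata key to index field name.
    static final Map<String, String> FIELD_FOR_KEY = new LinkedHashMap<>();
    static {
        FIELD_FOR_KEY.put(DC_TITLE, "title");
        FIELD_FOR_KEY.put(DC_CREATOR, "author");
    }

    // Toy JPEG parser: uses the shared keys, plus one key with no shared equivalent.
    static Map<String, String> parseJpeg() {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(DC_TITLE, "Sunset");
        metadata.put(DC_CREATOR, "A. Photographer");
        metadata.put("x-camera-model", "ExampleCam 100"); // free-form, parser-specific
        return metadata;
    }

    // Toy MPEG parser: same semantics, therefore the same shared keys.
    static Map<String, String> parseMpeg() {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(DC_TITLE, "Holiday clip");
        metadata.put(DC_CREATOR, "A. Filmmaker");
        return metadata;
    }

    // Toy indexing step: only keys with a predefined field name become index fields.
    static Map<String, String> toIndexFields(Map<String, String> metadata) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : FIELD_FOR_KEY.entrySet()) {
            String value = metadata.get(entry.getKey());
            if (value != null) {
                fields.put(entry.getValue(), value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(toIndexFields(parseJpeg()));
        System.out.println(toIndexFields(parseMpeg()));
    }
}
```

Nothing here stops a parser from inventing its own key for the title; the free-form key is simply carried along in the Map and skipped by this particular indexing step. That is the "soft" part of the arrangement.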

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
