Andrzej Bialecki wrote:
> Jukka Zitting wrote:
> 
>> The Parser interface is also bound to the ideas of fetching content
>> from the network and indexing it using a standard content model
>> through the Content and Parse dependencies. For the Tika project I'd
>> like to look for ways to generalize this, as neither of these ideas
>> apply for example to the needs of the Apache Jackrabbit project. My
>> TextExtractor proposal avoids these dependencies by using just a
>> binary stream, a content type and an optional character encoding to
>> produce a single text stream, but that approach fails to support more
>> structured index content models. I'm trying to find a solution that
>> combines the best parts of both approaches.
> 
> A very important aspect of the Parser interface (or actually, the Parse 
> and Content classes) is that they each may contain arbitrary metadata. 
> This is required for discovering and passing around both the original 
> metadata (such as protocol headers, document properties, etc), and other 
> secondary content (such as data from external sources, or derived 
> metadata).
> 
> Simply returning a String doesn't cut it. Returning a java.util.Map may 
> be an option, if you use standard Metadata constants as keys - still, 
> Nutch would have to repackage this anyway into a Writable. And we would 
> lose a nice property of the current Metadata class, which is the ability 
> to tolerate minor syntax variations and to store multiple values per key.
> 
The tolerance for syntax variations should not be written into the 
metadata object itself; it belongs in a separate class, perhaps 
implemented as a decorator around the actual metadata. In fact, the 
places where Nutch needs this functionality (actually only in the case 
of HTTP headers?) are fewer than those where we know the exact names of 
the metadata keys (because we put them there).
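
A rough sketch of the decorator idea, just to illustrate it. The class 
name TolerantMetadata and the normalization rule (ignore case and 
punctuation in key names) are assumptions made up for this example, not 
the behaviour of the existing Nutch Metadata class:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical read-only decorator: the wrapped map keeps its exact keys,
// and only lookups made through this class are tolerant of key spelling.
class TolerantMetadata {
    private final Map<String, String> delegate;

    TolerantMetadata(Map<String, String> delegate) {
        this.delegate = delegate;
    }

    // Assumed rule: lower-case the key and drop everything except
    // ASCII letters and digits, so "Content-Type" and "content_type"
    // both normalize to "contenttype".
    static String normalize(String key) {
        return key.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    // Linear scan over the wrapped map; returns the first value whose
    // normalized key matches, or null if none does.
    String get(String key) {
        String wanted = normalize(key);
        for (Map.Entry<String, String> e : delegate.entrySet()) {
            if (normalize(e.getKey()).equals(wanted)) {
                return e.getValue();
            }
        }
        return null;
    }
}
```

Code that knows its exact key names would keep using the plain Map 
directly; only the HTTP header handling would wrap it in the decorator.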

I'd +1 going for a Map as the interface to metadata and, at the same 
time, perhaps changing the Crawldb's metadata to the same metadata 
implementation or a subclass of it.

Perhaps we could even go for Map<String,String>, or is there actually 
a use case for having multiple values for a single key?

--
  Sami Siren

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
