Sami Siren wrote:
>> The original motivation for this was HTTP headers and meta tags, which 
>> can have multiple values. Another case is language 
>> identification, where the same key may have multiple values coming 
>> from different sources. Additionally, MapWritable supports any 
>> Writable, which is quite handy for storing non-string data and avoiding 
>> conversion to/from strings.
>
> I am not knocking MapWritable; in fact, I think it's quite an efficient 
> piece of code :) IMO the support for Writable in values is valuable, 
> but for keys, hmm... perhaps TextWritable is enough.

Yes, that's a sensible assumption - at least I've never had any need for 
any other types of keys so far ...
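
To make the point concrete, here is a minimal sketch of a container with 
text-only keys and multiple, arbitrarily typed values per key. It is a plain 
Java stand-in, not MapWritable itself: the class name MultiValueMetadata and 
its methods are hypothetical, and Object stands in for Writable so that the 
example needs no Hadoop classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a metadata container; NOT Nutch's MapWritable.
class MultiValueMetadata {
    // Keys are plain strings; each key maps to a list of values of any type.
    private final Map<String, List<Object>> entries = new LinkedHashMap<>();

    // Append a value under the key, keeping any earlier values for that key.
    void add(String key, Object value) {
        entries.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // All values stored under the key, in insertion order; empty if none.
    List<Object> getAll(String key) {
        return Collections.unmodifiableList(
                entries.getOrDefault(key, Collections.emptyList()));
    }

    public static void main(String[] args) {
        MultiValueMetadata meta = new MultiValueMetadata();
        meta.add("lang", "en");              // e.g. from an HTTP header
        meta.add("lang", "pl");              // e.g. from a meta tag
        meta.add("fetchInterval", 2592000L); // non-string value, no conversion
        System.out.println(meta.getAll("lang")); // prints [en, pl]
    }
}
```

Restricting keys to strings keeps lookups simple, while the values stay free 
to be any type.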

>
> I did play around with this earlier by implementing something that 
> you could call external metadata. Concretely, I created a separate 
> sequence file (in this particular case of DMOZ data), keyed it by 
> URL, and used it as a sort of metadata during the indexing phase 
> (mapped together with the rest of the Nutch data).
>
> The benefit of this kind of approach compared to the current one (static 
> metadata inside crawldb) is that I can manage the metadata completely 
> separately from crawldb operations, and crawldb operations run faster 
> because there is less data to move around.

Yes, that's true - I sometimes use this approach too. However, the 
downside of this method is its relative complexity: instead of just 
adding a key/value pair and having it automagically appear wherever you 
have a CrawlDatum, you now have to manage a separate data file, modify 
it using custom tools, and then make sure that all parts of Nutch can 
optionally include this file in the input (and in the output, if you were 
to modify it) ... It can be done, of course; it's just much more complex.
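
As a rough illustration of the extra step the external approach needs, the 
sketch below merges a separate URL-keyed metadata source into the crawl data 
at indexing time. Everything here is an assumption made for the example: 
in-memory maps stand in for the crawldb and the external sequence file, and 
the class ExternalMetadataJoin and its join method are hypothetical names, 
not part of Nutch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: plain maps stand in for the crawldb and the
// separately managed, URL-keyed metadata file.
class ExternalMetadataJoin {
    // For each URL in the crawl data, merge in the metadata stored under the
    // same URL in the external source. URLs without external metadata pass
    // through unchanged; the input maps are not modified.
    static Map<String, Map<String, String>> join(
            Map<String, Map<String, String>> crawlData,
            Map<String, Map<String, String>> external) {
        Map<String, Map<String, String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> e : crawlData.entrySet()) {
            Map<String, String> merged = new LinkedHashMap<>(e.getValue());
            Map<String, String> extra = external.get(e.getKey());
            if (extra != null) {
                merged.putAll(extra); // external values win on key clashes
            }
            out.put(e.getKey(), merged);
        }
        return out;
    }
}
```

Every tool that consumes the crawl data would need a step like this one, 
which is where the added complexity comes from.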

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
