Under this proposal, FetcherContent has the following fields:
  byte[] getContent();
  String getContentType();   // new. required to select parser.
  URL getBaseUrl();          // new. required to select parser.
  Properties getMetaData()   // new. permits extensions.
A couple of questions:
* Did you plan for getContentType() to return the full Content-Type header, i.e. including parameters such as the character encoding? That would be really useful.
* Do you plan to store metadata in inverted lists as well (which currently would translate into adding new arbitrary fields to Lucene's Document)? This would be very useful in my scenario: I'm doing language detection and key-phrase extraction to enhance the index, and I'd love to store this information in the index itself to avoid the need for separate storage.
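To make the first question concrete, here is a minimal sketch of what a caller could do if getContentType() returned the full header value. The class and method names (ContentTypeSketch, mediaType, charset) are my own assumptions for illustration and are not part of the proposal; the parsing is deliberately naive and does not handle a ';' inside a quoted parameter value.

```java
import java.util.Locale;

// Hypothetical helper, not part of the proposed FetcherContent API.
public class ContentTypeSketch {

    /** Returns the media type part in lower case, e.g. "text/html". */
    static String mediaType(String contentType) {
        int semi = contentType.indexOf(';');
        String type = semi < 0 ? contentType : contentType.substring(0, semi);
        return type.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns the value of the charset parameter, or null if there is none. */
    static String charset(String contentType) {
        // Naive split: assumes no ';' appears inside a quoted parameter value.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                String value = param.substring(eq + 1).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String header = "text/html; charset=ISO-8859-2";
        System.out.println(mediaType(header)); // prints "text/html"
        System.out.println(charset(header));   // prints "ISO-8859-2"
    }
}
```

With something like this, the media type could drive parser selection while the charset is handed to the chosen parser, which is why having the full header available would help.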
--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
