Andrzej Bialecki wrote:
A couple of questions:
* Did you plan for getContentType() to return the full Content-Type header? I.e. including stuff like character encoding... That would be really useful.
It would have the full content-type header, as specified by http://www.iana.org/assignments/media-types/.
Excellent.
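To illustrate why the full header is useful: a consumer can split the value returned by getContentType() into the media type and the character encoding. The following is only a minimal sketch, assuming the value looks like "text/html; charset=ISO-8859-2". The ContentTypeParser class and its method names are hypothetical and not part of Nutch, and the naive split on ';' does not handle semicolons inside quoted parameter values.

```java
import java.util.Locale;

// Hypothetical helper (not a Nutch class): splits a full Content-Type value
// into its media type and its optional charset parameter.
final class ContentTypeParser {
    private ContentTypeParser() {}

    /** Returns the lower-cased media type, e.g. "text/html". */
    static String mediaType(String contentType) {
        int semi = contentType.indexOf(';');
        String type = semi < 0 ? contentType : contentType.substring(0, semi);
        return type.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns the charset parameter value, or null if none is present. */
    static String charset(String contentType) {
        String[] parts = contentType.split(";");
        // parts[0] is the media type; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue; // not a name=value parameter, skip it
            }
            String name = param.substring(0, eq).trim();
            if (name.equalsIgnoreCase("charset")) {
                String value = param.substring(eq + 1).trim();
                // Strip optional surrounding quotes, as in charset="utf-8".
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value;
            }
        }
        return null;
    }
}
```

With such a helper, a caller could choose a decoder from charset(...) and fall back to a default encoding when it returns null.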
* Do you plan to store metadata in inverted lists as well (which currently would translate into adding new arbitrary fields to Lucene's Document)?
The idea is to make this pluggable. All that you have to do is supply an implementation of the DocumentFactory interface. Typically you'd subclass the default.
For example, you might do something like:
  public class MyDocumentFactory extends DocumentFactoryImpl {
    public Document getDocument(String seg, long doc, FetcherOutput fo,
                                ParseText t, ParseData d) {
      Document result = super.getDocument(seg, doc, fo, t, d);
      // add my field
      result.add(Field.Keyword("myMetaField", d.get("myMetaField")));
      return result;
    }
  }
Does that make sense?
A lot. That's exactly what I'd need. Thank you!
--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
