Re: [Nutch-dev] plugin proposal

Andrzej Bialecki Fri, 21 May 2004 03:36:59 -0700

Doug Cutting wrote:

Andrzej Bialecki wrote:
A couple of questions:
* Did you plan for getContentType() to return the full Content-Type header? I.e. including stuff like character encoding... That would be really useful.
It would have the full content-type header, as specified by http://www.iana.org/assignments/media-types/.


Excellent.
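
As a rough illustration of why the full header is useful, here is a minimal sketch that splits a Content-Type value into its media type and charset parameter. The class and method names (ContentTypeSketch, mediaType, charset) are made up for this example and are not part of Nutch or of the proposal. The parsing is deliberately naive: it assumes no parameter value contains a quoted semicolon.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch: pulls the media type and the
// charset parameter out of a full Content-Type header value.
public class ContentTypeSketch {

  /** Returns the bare media type, lowercased, e.g. "text/html". */
  public static String mediaType(String header) {
    int semi = header.indexOf(';');
    String type = semi < 0 ? header : header.substring(0, semi);
    return type.trim().toLowerCase(Locale.ROOT);
  }

  /** Returns the value of the charset parameter, or null if there is none. */
  public static String charset(String header) {
    // Naive split: assumes parameter values contain no quoted ';'.
    String[] parts = header.split(";");
    for (int i = 1; i < parts.length; i++) {
      String param = parts[i].trim();
      int eq = param.indexOf('=');
      if (eq < 0) {
        continue; // not a name=value pair
      }
      String name = param.substring(0, eq).trim();
      if (name.equalsIgnoreCase("charset")) {
        String value = param.substring(eq + 1).trim();
        // Strip optional surrounding double quotes.
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
          value = value.substring(1, value.length() - 1);
        }
        return value;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    String header = "text/html; charset=ISO-8859-2";
    System.out.println(mediaType(header));
    System.out.println(charset(header));
  }
}
```

With only the bare media type, the character encoding would have to be guessed from the content itself. With the full header, it can be read directly whenever the server supplies it.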

* Do you plan to store metadata in inverted lists as well (which currently would translate into adding new arbitrary fields to Lucene's Document)?
The idea is to make this pluggable. All that you have to do is supply an implementation of the DocumentFactory interface. Typically you'd subclass the default.
For example, you might do something like:
public class MyDocumentFactory extends DocumentFactoryImpl {
  public Document getDocument
    (String seg, long doc, FetcherOutput fo, ParseText t, ParseData d) {
    // let the default factory build the standard fields first
    Document result = super.getDocument(seg, doc, fo, t, d);
    // add my field: stored and indexed, but not tokenized
    result.add(Field.Keyword("myMetaField", d.get("myMetaField")));
    return result;
  }
}
Does that make sense?


A lot. That's exactly what I'd need. Thank you!

--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
