I will use Stefan's plugin code for all extension points, but not his extractor code.
I would be very happy if the code became part of Nutch; I'm not worried about the extractor code.

1. Fetch

A fetcher takes a sequence of FetchListEntry instances, and produces parallel sequences of FetcherContent and FetcherOutput instances.

Under this proposal, FetcherContent has the following fields:

  byte[] getContent();
  String getContentType();  // new.  required to select parser.
  URL getBaseUrl();         // new.  required to select parser.
  Properties getMetaData(); // new.  permits extensions.

I'm confused, since we would then have two getMetaData methods. I guess this method gives information like the encoding?
I would find it good if we could have a getFetchMetaData and a getContentMetaData instead.
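
To make the suggestion concrete, here is a minimal sketch of how the split could look. It is only an assumption on my side: the two getter names are my suggestion from above, SimpleFetcherContent is a made-up holder class, and the example properties in the comments are guesses.

```java
import java.net.URL;
import java.util.Properties;

// Hypothetical variant of FetcherContent. getFetchMetaData and
// getContentMetaData are only the names suggested in this mail,
// not part of the proposal.
interface FetcherContent {
  byte[] getContent();
  String getContentType();
  URL getBaseUrl();
  Properties getFetchMetaData();    // assumed use: protocol-level data
  Properties getContentMetaData();  // assumed use: encoding and similar
}

// Made-up holder class, for illustration only.
class SimpleFetcherContent implements FetcherContent {
  private final byte[] content;
  private final String contentType;
  private final URL baseUrl;
  private final Properties fetchMetaData = new Properties();
  private final Properties contentMetaData = new Properties();

  SimpleFetcherContent(byte[] content, String contentType, URL baseUrl) {
    this.content = content;
    this.contentType = contentType;
    this.baseUrl = baseUrl;
  }

  public byte[] getContent() { return content; }
  public String getContentType() { return contentType; }
  public URL getBaseUrl() { return baseUrl; }
  public Properties getFetchMetaData() { return fetchMetaData; }
  public Properties getContentMetaData() { return contentMetaData; }
}
```

With two separate Properties objects, a parser could read the encoding from the content metadata without having to know which keys the protocol plugin wrote.
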
FetcherOutput is as before, except its 'title' and 'outLinks' fields are removed.

Protocols are the primary extension point for the fetcher. Plugins are provided for different URL protocols, e.g., http, ftp, etc. Each maps a URL to a FetcherContent. In particular, a fetcher protocol plugin must implement:

That would be great and would show off the power of the plugin system. ;)
public class FetcherException extends NutchException {}

An exception handling and delegation model is a good step!
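
A rough sketch of how I picture delegation by protocol together with FetcherException follows. Everything except the two exception class names is an assumption: ProtocolPlugin, ProtocolDispatcher, the message constructors, and returning a plain byte[] instead of a FetcherContent are all made up here to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for Nutch's base exception so that the sketch compiles alone.
class NutchException extends Exception {
  NutchException(String message) { super(message); }
}

// The message constructor is an assumption; the proposal shows an empty body.
class FetcherException extends NutchException {
  FetcherException(String message) { super(message); }
}

// Hypothetical plugin interface; the real signature is not given in this thread.
interface ProtocolPlugin {
  byte[] fetch(String url) throws FetcherException;
}

// Hypothetical dispatcher: selects a plugin by URL scheme and lets any
// FetcherException propagate to the caller.
class ProtocolDispatcher {
  private final Map<String, ProtocolPlugin> plugins = new HashMap<>();

  void register(String scheme, ProtocolPlugin plugin) {
    plugins.put(scheme, plugin);
  }

  byte[] fetch(String url) throws FetcherException {
    int colon = url.indexOf(':');
    if (colon < 0) {
      throw new FetcherException("no protocol in URL: " + url);
    }
    String scheme = url.substring(0, colon).toLowerCase();
    ProtocolPlugin plugin = plugins.get(scheme);
    if (plugin == null) {
      throw new FetcherException("no plugin for protocol: " + scheme);
    }
    return plugin.fetch(url);
  }
}
```

The fetcher would then only catch FetcherException in one place, no matter which protocol plugin failed.
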

How does this sound? If there are no violent objections, I'll start work on it.
Sounds like Music. ;-)

Just a question: would it make sense to use the fetch output for content archiving as well?
I'm highly interested in archiving fetched content with versioning!
This would be the killer feature for university libraries all over the world, since they will have to archive digital content as well in the future.
Imagine a peer-to-peer archive/search cluster run by universities! ;-O


Stefan


