All,

I just committed the first cut at the new APIs we've discussed over the past few weeks. In particular:

- I removed the http and ftp protocol implementations and html parsing from the Nutch core.
- I added plugins for the http protocol and for html and plain text parsing. Ftp has not yet been re-added.
- I removed the RequestScheduler fetcher implementation, as I only have time to maintain a single fetcher. If someone would like to, they can port RequestScheduler to the new APIs. Fetcher.java should now correctly support robots.txt and be polite.


I made a lot of changes, so things are probably still a bit unstable. I have also not yet performance tested anything, so some things may have gotten slower. Please send a note to the list if you notice any new problems, performance or otherwise.

Now we can re-add ftp support, and also easily add support for PDF, Word, etc. I'd love to see contributions in these areas.

Next up: I plan to make the indexer and query translator extensible so that one can easily index and query metadata.

Doug


Doug Cutting wrote:
Here's a proposal for how I'll move forward with Stefan & John's contributions and concerns.

I propose that we have a three-stage pipeline by which URLs are processed: fetch, parse, and index. Each stage may be run as a separate process, or several may be run at once. Initially I will code things so that the fetcher will parse too (as it does now) but so that fetch-time parsing can be disabled, and parsing can be done by a separate tool. Note that parsing must be completed before database update or indexing.

I will use Stefan's plugin code for all extension points, but not his extractor code. I will write parsers for only text/html and text/plain initially. Once I get the basics in place then we can start adding more plugins.

The stages are:

1. Fetch

A fetcher takes a sequence of FetchListEntry instances, and produces parallel sequences of FetcherContent and FetcherOutput instances.

Under this proposal, FetcherContent has the following fields:

  byte[] getContent();
  String getContentType();  // new.  required to select parser.
  URL getBaseUrl();         // new.  required to select parser.
  Properties getMetaData(); // new.  permits extensions.

FetcherOutput is as before, except its 'title' and 'outLinks' fields are removed.

Protocols are the primary extension point for the fetcher. Plugins are provided for different URL protocols, e.g., http, ftp, etc. Each maps a URL to a FetcherContent. In particular, a fetcher protocol plugin must implement:

public interface FetcherProtocol {
  FetcherContent getContent(URL url) throws FetcherException;
}

public class FetcherException extends NutchException {}

/** Thrown when a request should not be retried.
 * By default, requests are retried. */
public class ResourceGoneException extends FetcherException {}
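
As a rough illustration of the retry contract above, here is a minimal sketch in Java. The stand-in classes only mirror the shapes proposed in this message and are not the real Nutch classes. InMemoryProtocol, RetryingFetcher, the constructors, and the maxTries parameter are hypothetical names invented for the example.

```java
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Stand-ins that mirror the proposal; the real Nutch classes may differ.
class NutchException extends Exception {
  NutchException(String message) { super(message); }
}

class FetcherException extends NutchException {
  FetcherException(String message) { super(message); }
}

/** Thrown when a request should not be retried. */
class ResourceGoneException extends FetcherException {
  ResourceGoneException(String message) { super(message); }
}

class FetcherContent {
  private final byte[] content;
  private final String contentType;
  private final URL baseUrl;
  private final Properties metaData;

  FetcherContent(byte[] content, String contentType, URL baseUrl, Properties metaData) {
    this.content = content;
    this.contentType = contentType;
    this.baseUrl = baseUrl;
    this.metaData = metaData;
  }

  byte[] getContent() { return content; }
  String getContentType() { return contentType; }
  URL getBaseUrl() { return baseUrl; }
  Properties getMetaData() { return metaData; }
}

interface FetcherProtocol {
  FetcherContent getContent(URL url) throws FetcherException;
}

// Hypothetical protocol that serves pages from a map instead of the network.
class InMemoryProtocol implements FetcherProtocol {
  private final Map<String, String> pages = new HashMap<>();

  void addPage(String url, String body) { pages.put(url, body); }

  @Override
  public FetcherContent getContent(URL url) throws FetcherException {
    String body = pages.get(url.toString());
    if (body == null) {
      // A missing page will not appear on a retry, so signal "do not retry".
      throw new ResourceGoneException("no such page: " + url);
    }
    return new FetcherContent(body.getBytes(StandardCharsets.UTF_8),
                              "text/plain", url, new Properties());
  }
}

// Hypothetical caller: retries by default, stops at once on ResourceGoneException.
class RetryingFetcher {
  static FetcherContent fetch(FetcherProtocol protocol, URL url, int maxTries)
      throws FetcherException {
    if (maxTries < 1) {
      throw new IllegalArgumentException("maxTries must be at least 1");
    }
    FetcherException last = null;
    for (int attempt = 0; attempt < maxTries; attempt++) {
      try {
        return protocol.getContent(url);
      } catch (ResourceGoneException e) {
        throw e;   // permanent failure: give up immediately
      } catch (FetcherException e) {
        last = e;  // possibly transient: try again
      }
    }
    throw last;    // every attempt failed with a retryable error
  }
}
```

The point of the sketch is only the split between the two exception types: the caller does not need to know anything about a protocol to decide whether a retry is worthwhile.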

2. Parse

The content parser takes a sequence of FetcherContent instances and produces a parallel sequence of ParseText and ParseData instances.

ParseText replaces FetcherText.

ParseData is new, and holds:

  String getTitle();
  OutLink[] getOutLinks();
  Properties getMetaData();

The primary parser extension point is:

public interface ContentParseFactory {
  ContentParse getParse(FetcherOutput fo, FetcherContent c);
}

public interface ContentParse {
  String getText();
  ParseData getData();
}
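
To make the parser extension point more concrete, here is a minimal sketch of a text/plain parser plugged in behind ContentParseFactory. The stand-in classes carry only what the example needs. PlainTextParseFactory, ParserRegistry, the constructors, and the choice of UTF-8 are assumptions made for illustration, and the real plugin lookup will go through the plugin code rather than a simple map.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Stand-ins that mirror the proposal; the real Nutch classes may differ.
class FetcherOutput {}
class OutLink {}

class FetcherContent {
  private final byte[] content;
  private final String contentType;

  FetcherContent(byte[] content, String contentType) {
    this.content = content;
    this.contentType = contentType;
  }

  byte[] getContent() { return content; }
  String getContentType() { return contentType; }
}

class ParseData {
  private final String title;
  private final OutLink[] outLinks;
  private final Properties metaData;

  ParseData(String title, OutLink[] outLinks, Properties metaData) {
    this.title = title;
    this.outLinks = outLinks;
    this.metaData = metaData;
  }

  String getTitle() { return title; }
  OutLink[] getOutLinks() { return outLinks; }
  Properties getMetaData() { return metaData; }
}

interface ContentParse {
  String getText();
  ParseData getData();
}

interface ContentParseFactory {
  ContentParse getParse(FetcherOutput fo, FetcherContent c);
}

// Hypothetical text/plain parser: the text is the decoded content,
// and plain text is treated as having no title and no out-links.
class PlainTextParseFactory implements ContentParseFactory {
  @Override
  public ContentParse getParse(FetcherOutput fo, FetcherContent c) {
    // Assumes UTF-8; a real parser would need to determine the charset.
    final String text = new String(c.getContent(), StandardCharsets.UTF_8);
    final ParseData data = new ParseData("", new OutLink[0], new Properties());
    return new ContentParse() {
      @Override public String getText() { return text; }
      @Override public ParseData getData() { return data; }
    };
  }
}

// Hypothetical registry that selects a factory by content type.
class ParserRegistry {
  private final Map<String, ContentParseFactory> factories = new HashMap<>();

  void register(String contentType, ContentParseFactory factory) {
    factories.put(contentType, factory);
  }

  ContentParse parse(FetcherOutput fo, FetcherContent c) {
    ContentParseFactory factory = factories.get(c.getContentType());
    if (factory == null) {
      throw new IllegalArgumentException("no parser for " + c.getContentType());
    }
    return factory.getParse(fo, c);
  }
}
```

Adding PDF or Word support would then amount to registering another factory under its content type.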

3. Index

The indexer takes parallel sequences of FetcherOutput, ParseText and ParseData instances and produces a Lucene index.

The primary index extension point is:

public interface DocumentFactory {
  Document getDocument(String segment, long doc,
                       FetcherOutput fo, ParseText t, ParseData d);
}
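
A DocumentFactory implementation might look roughly like the sketch below. To keep it free of external dependencies, Document here is a simple stand-in made of named string fields, not Lucene's actual Document class. BasicDocumentFactory and the field names ("segment", "docNo", "title", "content") are hypothetical, as are the stand-in ParseText and ParseData constructors.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-ins that mirror the proposal; the real Nutch classes may differ.
class FetcherOutput {}

class ParseText {
  private final String text;
  ParseText(String text) { this.text = text; }
  String getText() { return text; }
}

class ParseData {
  private final String title;
  ParseData(String title) { this.title = title; }
  String getTitle() { return title; }
}

// Stand-in for a Lucene document: just named string fields.
class Document {
  private final Map<String, String> fields = new LinkedHashMap<>();
  void add(String name, String value) { fields.put(name, value); }
  String get(String name) { return fields.get(name); }
}

interface DocumentFactory {
  Document getDocument(String segment, long doc,
                       FetcherOutput fo, ParseText t, ParseData d);
}

// Hypothetical factory: records where the entry came from, plus title and text.
class BasicDocumentFactory implements DocumentFactory {
  @Override
  public Document getDocument(String segment, long doc,
                              FetcherOutput fo, ParseText t, ParseData d) {
    Document document = new Document();
    document.add("segment", segment);            // which segment holds the entry
    document.add("docNo", Long.toString(doc));   // position within that segment
    document.add("title", d.getTitle());
    document.add("content", t.getText());
    return document;
  }
}
```

A plugin that wanted to index extra metadata could wrap or replace such a factory and add its own fields.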

How does this sound? If there are no violent objections, I'll start work on it.

Doug

