Hi, Doug,
Just a few minor things:
(1) I see a need for an indicator (a field) showing whether fetched content
was truncated. One possible use is in search result presentation: warning
the user that the cached content is incomplete, or deciding whether the
cached content should be offered at all. I assume this can be done with
Properties getMetaData() in FetcherContent?
(2) What is the difference (in intended use) between
Properties getMetaData() in FetcherContent and
Properties getMetaData() in ParseData
?
(3) The cached page could be provided in either plain text or HTML format.
Google does that now. I assume ParseText will not prevent HTML text
(rather than plain text) from being saved if the parser produces it
(say, a pdf->html parser). Or is there a better approach?
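
To make point (1) concrete, here is a rough sketch of what I have in mind.
It only assumes that getMetaData() returns a java.util.Properties, as in the
proposal. The key name "content.truncated" and the class and method names
are made up for illustration and are not part of the proposal.

```java
import java.util.Properties;

class TruncationFlagSketch {
    // Hypothetical metadata key; the proposal does not define any keys.
    static final String TRUNCATED_KEY = "content.truncated";

    /** Fetcher side: record whether the content was cut off at a size limit. */
    static Properties markTruncated(Properties metaData, boolean truncated) {
        metaData.setProperty(TRUNCATED_KEY, Boolean.toString(truncated));
        return metaData;
    }

    /** Search side: a missing key is treated as "not truncated". */
    static boolean isTruncated(Properties metaData) {
        return Boolean.parseBoolean(metaData.getProperty(TRUNCATED_KEY, "false"));
    }

    public static void main(String[] args) {
        // Stand-in for the Properties that FetcherContent.getMetaData() would return.
        Properties meta = markTruncated(new Properties(), true);
        if (isTruncated(meta)) {
            System.out.println("Warning: cached copy is incomplete.");
        }
    }
}
```

Treating a missing key as "not truncated" would keep content from protocol
plugins that never set the flag working unchanged.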
Thanks a lot.
John
P.S. I have some time available if you need help with coding tasks.
On Wed, May 19, 2004 at 11:04:07AM -0700, Doug Cutting wrote:
> Here's a proposal for how I move forward with Stefan & John's
> contributions and concerns.
>
> I propose that we have a three stage pipeline by which URLs are
> processed: fetch, parse, and index. Each stage may be run as a separate
> process, or several may be run at once. Initially I will code things so
> that the fetcher will parse too (as it does now) but so that fetch-time
> parsing can be disabled, and parsing can be done by a separate tool.
> Note that parsing must be completed before database update or indexing.
>
> I will use Stefan's plugin code for all extension points, but not his
> extractor code. I will write parsers for only text/html and text/plain
> initially. Once I get the basics in place then we can start adding more
> plugins.
>
> The stages are:
>
> 1. Fetch
>
> A fetcher takes a sequence of FetchListEntry instances, and produces
> parallel sequences of FetcherContent and FetcherOutput instances.
>
> Under this proposal, FetcherContent has the following fields:
>
> byte[] getContent();
> String getContentType(); // new. required to select parser.
> URL getBaseUrl(); // new. required to select parser.
> Properties getMetaData(); // new. permits extensions.
>
> FetcherOutput is as before, except its 'title' and 'outLinks' fields are
> removed.
>
> Protocols are the primary extension point for the fetcher. Plugins are
> provided for different URL protocols, e.g., http, ftp, etc. Each maps a
> URL to a FetcherContent. In particular, a fetcher protocol plugin must
> implement:
>
> public interface FetcherProtocol {
> FetcherContent getContent(URL url) throws FetcherException;
> }
>
> public class FetcherException extends NutchException {}
>
> /** Thrown when a request should not be retried.
> * By default, requests are retried. */
> public class ResourceGoneException extends FetcherException {}
>
> 2. Parse
>
> The content parser takes a sequence of FetcherContent instances and
> produces a parallel sequence of ParseText and ParseData instances.
>
> ParseText replaces FetcherText.
>
> ParseData is new, and holds:
>
> String getTitle();
> OutLink[] getOutLinks();
> Properties getMetaData();
>
> The primary parser extension point is:
>
> public interface ContentParseFactory {
> ContentParse getParse(FetcherOutput fo, FetcherContent c);
> }
>
> public interface ContentParse {
> String getText();
> ParseData getData();
> }
>
> 3. Index
>
> The indexer takes parallel sequences of FetcherOutput, ParseText and
> ParseData instances and produces a Lucene index.
>
> The primary index extension point is:
>
> public interface DocumentFactory {
> Document getDocument(String segment, long doc,
> FetcherOutput fo, ParseText t, ParseData d);
> }
>
> How does this sound? If there are no violent objections, I'll start
> work on it.
>
> Doug
>
>
>
> _______________________________________________
> Nutch-developers mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
>