Hi, Doug,

Just a few minor things:

(1) I see a need for an indicator (a field) showing whether fetched
content was truncated. One possible use is in search output presentation:
warning the user that the cached content is incomplete, or deciding whether
the cached content should be shown at all. I assume this can be done with
Properties getMetaData() in FetcherContent?
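
To make this concrete, here is a rough sketch of how such a flag could
travel in the metadata. It assumes only java.util.Properties; the key name
"content.truncated" and the helper methods are hypothetical and are not
part of the proposal.

```java
import java.util.Properties;

public class TruncationFlagSketch {
    // Hypothetical metadata key; the proposal does not define one.
    static final String TRUNCATED_KEY = "content.truncated";

    /** A protocol plugin could call this after it stops reading early. */
    static void setTruncated(Properties metaData, boolean truncated) {
        metaData.setProperty(TRUNCATED_KEY, Boolean.toString(truncated));
    }

    /** A missing key is treated as "not truncated". */
    static boolean isTruncated(Properties metaData) {
        return Boolean.parseBoolean(metaData.getProperty(TRUNCATED_KEY, "false"));
    }

    /** Text the search front end might show next to a cached page link. */
    static String cachedPageNotice(Properties metaData) {
        return isTruncated(metaData)
            ? "Warning: the cached copy of this page is incomplete."
            : "";
    }

    public static void main(String[] args) {
        Properties metaData = new Properties();
        setTruncated(metaData, true);
        System.out.println(cachedPageNotice(metaData));
    }
}
```

The front end could equally use isTruncated() to hide the cached link
altogether instead of showing a warning.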

(2) What is the difference (in intended use) between
Properties getMetaData() in FetcherContent and
Properties getMetaData() in ParseData?

(3) The cached page may be provided in plain text or HTML format.
Google does that now. I assume ParseText will not prevent HTML text
from being saved (instead of plain text) if the parser produces it
(say, a pdf->html parser). Or is there a better approach?
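
One possible approach, sketched below, is to record the format of the
saved text in the parse metadata and let the cached-page view decide how
to render it. The key "parse.text.format" and the render() helper are
hypothetical and are not part of the proposal.

```java
import java.util.Properties;

public class CachedFormatSketch {
    // Hypothetical metadata key naming the format of the saved text.
    static final String FORMAT_KEY = "parse.text.format";

    /** Returns HTML for the cached-page view. */
    static String render(String text, Properties parseMetaData) {
        String format = parseMetaData.getProperty(FORMAT_KEY, "text/plain");
        if ("text/html".equals(format)) {
            return text; // the parser already produced HTML
        }
        // Plain text: escape it and preserve its line breaks.
        return "<pre>" + escapeHtml(text) + "</pre>";
    }

    /** Minimal escaping of the characters that would break the markup. */
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Properties plain = new Properties();
        System.out.println(render("a < b", plain));

        Properties html = new Properties();
        html.setProperty(FORMAT_KEY, "text/html");
        System.out.println(render("<p>converted from pdf</p>", html));
    }
}
```

With no format recorded, the text is treated as plain text, so parsers
that only emit plain text would not need to change.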

Thanks a lot.

John

P.S. I have some time available if you need help with coding tasks.

On Wed, May 19, 2004 at 11:04:07AM -0700, Doug Cutting wrote:
> Here's a proposal for how I move forward with Stefan & John's 
> contributions and concerns.
> 
> I propose that we have a three stage pipeline by which URLs are 
> processed: fetch, parse, and index.  Each stage may be run as a separate 
> process, or several may be run at once.  Initially I will code things so 
> that the fetcher will parse too (as it does now) but so that fetch-time 
> parsing can be disabled, and parsing can be done by a separate tool. 
> Note that parsing must be completed before database update or indexing.
> 
> I will use Stefan's plugin code for all extension points, but not his 
> extractor code.  I will write parsers for only text/html and text/plain 
> initially.  Once I get the basics in place then we can start adding more 
> plugins.
> 
> The stages are:
> 
> 1. Fetch
> 
> A fetcher takes a sequence of FetchListEntry instances, and produces 
> parallel sequences of FetcherContent and FetcherOutput instances.
> 
> Under this proposal, FetcherContent has the following fields:
> 
>   byte[] getContent();
>   String getContentType();  // new.  required to select parser.
>   URL getBaseUrl();         // new.  required to select parser.
>   Properties getMetaData(); // new.  permits extensions.
> 
> FetcherOutput is as before, except its 'title' and 'outLinks' fields are 
> removed.
> 
> Protocols are the primary extension point for the fetcher.  Plugins are 
> provided for different URL protocols, e.g., http, ftp, etc.  Each maps a 
> URL to a FetcherContent.  In particular, a fetcher protocol plugin must 
> implement:
> 
> public interface FetcherProtocol {
>   FetcherContent getContent(URL url) throws FetcherException;
> }
> 
> public class FetcherException extends NutchException {}
> 
> /** Thrown when a request should not be retried.
>  * By default, requests are retried. */
> public class ResourceGoneException extends FetcherException {}
> 
> 2. Parse
> 
> The content parser takes a sequence of FetcherContent instances and 
> produces a parallel sequence of ParseText and ParseData instances.
> 
> ParseText replaces FetcherText.
> 
> ParseData is new, and holds:
> 
>   String getTitle();
>   OutLink[] getOutLinks();
>   Properties getMetaData();
> 
> The primary parser extension point is:
> 
> public interface ContentParseFactory {
>   ContentParse getParse(FetcherOutput fo, FetcherContent c);
> }
> 
> public interface ContentParse {
>   String getText();
>   ParseData getData();
> }
> 
> 3. Index
> 
> The indexer takes parallel sequences of FetcherOutput, ParseText and 
> ParseData instances and produces a Lucene index.
> 
> The primary index extension point is:
> 
> public interface DocumentFactory {
>   Document getDocument(String segment, long doc,
>                        FetcherOutput fo, ParseText t, ParseData d);
> }
> 
> How does this sound?  If there are no violent objections, I'll start 
> work on it.
> 
> Doug
> 
> 
> 
> _______________________________________________
> Nutch-developers mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
> 
__________________________________________
http://www.neasys.com - A Good Place to Be
Come to visit us today!

