On Thu, May 20, 2004 at 10:43:40AM -0700, Doug Cutting wrote:
> [EMAIL PROTECTED] wrote:
> >(1) I see a need for an indicator (a field) of whether fetched content
> >is truncated or not. One possible use is at search output presentation:
> >giving the user a warning for incomplete cached content, or deciding
> >whether the cached content should be provided at all. I would assume
> >this can be done with Properties getMetaData() in FetcherContent?
> 
> This should be detectable by a difference between the length of the 
> content bytes and the contentLength meta data, if present.  Is that good 
> enough?  If it's not, then we could have the protocol implementation add 
> a meta field for just this purpose.

A meta field would be better.
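
To make the two options concrete, here is a rough sketch that prefers an explicit meta field and falls back to the length comparison. The class name and the "truncated" key are made up for illustration, and I am assuming the declared length is stored under "contentLength"; none of this is existing Nutch code.

```java
import java.util.Properties;

/** Hypothetical helper, for illustration only; not part of Nutch. */
final class TruncationCheck {
    // Assumed metadata key names.
    static final String CONTENT_LENGTH = "contentLength";
    static final String TRUNCATED = "truncated";

    /**
     * Returns true if the metadata flags the content as truncated, or,
     * when no flag is present, if fewer bytes were stored than the
     * declared content length.
     */
    static boolean isTruncated(byte[] content, Properties metaData) {
        String flag = metaData.getProperty(TRUNCATED);
        if (flag != null) {
            return Boolean.parseBoolean(flag.trim());
        }
        String declared = metaData.getProperty(CONTENT_LENGTH);
        if (declared == null) {
            return false; // nothing to compare against
        }
        try {
            return content.length < Long.parseLong(declared.trim());
        } catch (NumberFormatException e) {
            return false; // unusable value, so we cannot tell
        }
    }
}
```

The explicit field also covers servers that send no content length at all, which is the main reason I would prefer it.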

> 
> Should truncation lengths be content-type specific, i.e., we should have 
> a different config parameter to truncate html data than we do for pdf? 
> We could, in general, have our default http protocol handler look for 
> config properties named http.content.limit.text/html, 
> http.content.limit.application/pdf, etc.  That would be a good feature to add.

That would be good too.
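
A minimal sketch of how such a lookup could work, assuming the properties are available as a java.util.Properties object. The class and method names are hypothetical, and the fallback to a plain http.content.limit property is my own assumption, not something settled here.

```java
import java.util.Locale;
import java.util.Properties;

/** Hypothetical helper, for illustration only; not part of Nutch. */
final class ContentLimits {
    static final String PREFIX = "http.content.limit";

    /**
     * Looks up http.content.limit.<type>, then falls back to
     * http.content.limit (assumed), then to defaultLimit.
     */
    static int limitFor(Properties conf, String contentType, int defaultLimit) {
        String type = contentType == null ? "" : contentType;
        int semi = type.indexOf(';');
        if (semi >= 0) {
            type = type.substring(0, semi); // drop parameters such as "; charset=..."
        }
        type = type.trim().toLowerCase(Locale.ROOT);

        String value = type.isEmpty() ? null : conf.getProperty(PREFIX + "." + type);
        if (value == null) {
            value = conf.getProperty(PREFIX);
        }
        if (value == null) {
            return defaultLimit;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            return defaultLimit; // malformed setting, use the default
        }
    }
}
```

With this shape, a site only has to set the per-type property for the types it wants to treat differently, such as application/pdf.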

> 
> >(2) what is the difference (different use) between
> >Properties getMetaData() in FetcherContent and
> >Properties getMetaData() in ParseData
> >?
> 
> The former is for fetcher protocol generated meta data, the second is 
> for parse-generated metadata.  Since the parser has access to the 
> fetcher meta data, it may copy some of these into the parser output, so 
> that they're available for indexing.  For example, the default parser 
> implementation should probably copy content-type.

I like this kind of flexibility.
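
As I understand it, the copying step could be as small as the sketch below. The helper class and the "Content-Type" key name used in the note after it are assumptions for illustration, not the actual Nutch API.

```java
import java.util.Properties;

/** Hypothetical helper, for illustration only; not part of Nutch. */
final class MetaDataCopier {
    /**
     * Builds parse metadata containing only the selected keys from the
     * fetcher metadata. Keys missing from the fetcher side are skipped.
     */
    static Properties copySelected(Properties fetcherMeta, String... keys) {
        Properties parseMeta = new Properties();
        for (String key : keys) {
            String value = fetcherMeta.getProperty(key);
            if (value != null) {
                parseMeta.setProperty(key, value);
            }
        }
        return parseMeta;
    }
}
```

A default parser might call copySelected(fetcherMeta, "Content-Type") and then add its own parse-generated entries, so the indexer only needs to look at the parse metadata.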

> 
> >(3) The cached page may be provided in plain text or html format.
> >Google does that now. I would assume ParseText will not prevent
> >html text being saved (other than plain text) if the parser does it
> >(say, a pdf->html parser). Or is there a better approach?
> 
> We need a plain text version for indexing and snippets.  And we need the 
> raw version for the cache.  So, what you're saying is that, for some 
> types, we may also need a third version, a cached conversion to html. 
> That would indeed be a useful feature.  How should we implement it?
> 
> Perhaps the parser could have an additional method:
> 
>   String getHtml();
> 
> When this is non-null, the core Nutch code could store it in a 
> FetcherHtml output file.  The HtmlParser would return null for this, so 
> that we don't store two copies of Html pages.  Then, the cache code can 
> look here for pages whose raw content type is not text/html.

I was thinking a boolean field could be added to ParseText to indicate
whether it holds plain text or HTML. However, that would require an indexer
with HTML-parsing capability at the indexing stage. Probably not a good idea?
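
For comparison, the getHtml() approach keeps the plain text and the HTML conversion separate, so the indexer never has to parse HTML. A rough sketch of how the cache side might use it follows; the interface and class names are hypothetical, and only the getHtml() method itself comes from the suggestion above.

```java
import java.util.Locale;

/** Hypothetical parser contract, for illustration only. */
interface HtmlCapableParser {
    /** Plain text for indexing and snippets. */
    String getText();

    /** HTML conversion for the cache, or null if there is none. */
    String getHtml();
}

/** Hypothetical helper, for illustration only; not part of Nutch. */
final class CachedHtmlSelector {
    /**
     * Picks what the cache could serve as HTML: the raw content when it
     * is already text/html, otherwise the parser's conversion, which may
     * be null when the parser does not provide one.
     */
    static String htmlForCache(String rawContentType, String rawContent,
                               HtmlCapableParser parser) {
        if (rawContentType != null
                && rawContentType.trim().toLowerCase(Locale.ROOT).startsWith("text/html")) {
            return rawContent;
        }
        return parser.getHtml();
    }
}
```

An HTML parser would simply return null from getHtml(), so no second copy of an HTML page is stored.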

> 
> I think we should not add this feature until after we complete the 
> current round of changes.

I agree.

> 
> >P.S. I can have some time available if you need help on coding task.
> 
> Thanks, that'd be great!  I'm going to try to put together an initial 
> version of this pretty soon.  It might be incomplete, so maybe I'll 
> check it into a branch.  Then perhaps you can help flesh it out, e.g., 
> converting the ftp code to a plugin, adding a command line to run the 
> parser, etc.
> 
> My plan is to check this into HEAD as soon as it (a) works; and (b) 
> supports all the functionality of the current HEAD.  Then we can start 
> adding new features (e.g., more parsers) as patches.

Great!

John

