[EMAIL PROTECTED] wrote:
(1) I see a need for an indicator (a field) of whether fetched content
is truncated or not. One possible use is at search output presentation:
warning the user about incomplete cached content, or deciding whether the
cached content should be provided at all. I would assume this can be done with
Properties getMetaData() in FetcherContent?

This should be detectable by a difference between the length of the content bytes and the contentLength meta data, if present. Is that good enough? If it's not, then we could have the protocol implementation add a meta field for just this purpose.
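
A minimal sketch of that length comparison, for illustration only. The class name TruncationCheck and the metadata key "Content-Length" are assumptions, not names taken from the Nutch code; when the declared length is missing or unparsable the check cannot tell and reports false, which is the gap a dedicated meta field would close.

```java
import java.util.Properties;

// Hypothetical helper; not part of Nutch.
final class TruncationCheck {
    // Assumed metadata key; the key the protocol code actually uses may differ.
    static final String CONTENT_LENGTH_KEY = "Content-Length";

    /**
     * Returns true only when a declared length is present, parses as a number,
     * and is larger than the number of bytes actually fetched.
     */
    static boolean isTruncated(byte[] content, Properties metaData) {
        String declared = metaData.getProperty(CONTENT_LENGTH_KEY);
        if (declared == null) {
            return false; // no declared length: truncation cannot be detected this way
        }
        try {
            long expected = Long.parseLong(declared.trim());
            return content.length < expected;
        } catch (NumberFormatException e) {
            return false; // malformed header value: treat as unknown
        }
    }
}
```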


Should truncation lengths be content-type specific, i.e., should we have a different config parameter to truncate html data than we do for pdf? We could, in general, have our default http protocol handler look for config properties named http.content.limit.text/html, http.content.limit.application/pdf, etc. That would be a good feature to add.
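
One way such a lookup might work, sketched under assumptions: the class ContentLimits is hypothetical, and falling back to an untyped http.content.limit property (and then to a caller-supplied default) is a design choice made for this example rather than something decided above.

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical helper; not part of Nutch.
final class ContentLimits {
    static final String PREFIX = "http.content.limit";

    /**
     * Looks up "http.content.limit.<type>" first, then the untyped
     * "http.content.limit" (an assumed fallback), then the given default.
     */
    static int limitFor(String contentType, Properties conf, int fallback) {
        String value = null;
        if (contentType != null) {
            // Drop parameters such as "; charset=UTF-8" before building the key.
            int semi = contentType.indexOf(';');
            String base = (semi >= 0 ? contentType.substring(0, semi) : contentType)
                    .trim().toLowerCase(Locale.ROOT);
            value = conf.getProperty(PREFIX + "." + base);
        }
        if (value == null) {
            value = conf.getProperty(PREFIX);
        }
        if (value == null) {
            return fallback;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            return fallback; // unparsable setting: use the default
        }
    }
}
```

With http.content.limit.application/pdf set to 1048576 and http.content.limit set to 65536, a PDF would get the larger limit while text/html would fall through to 65536.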

(2) what is the difference (different use) between
Properties getMetaData() in FetcherContent and
Properties getMetaData() in ParseData
?

The former is for metadata generated by the fetcher protocol; the latter is for parse-generated metadata. Since the parser has access to the fetcher metadata, it may copy some of these into the parser output, so that they're available for indexing. For example, the default parser implementation should probably copy content-type.
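
The copying step could look roughly like this. MetaDataCopy is a hypothetical helper, the key "Content-Type" is an assumed name, and not overwriting values the parser already set is a choice made for this sketch.

```java
import java.util.Properties;

// Hypothetical helper; not part of Nutch.
final class MetaDataCopy {
    /**
     * Copies the named keys from the fetcher metadata into the parse metadata,
     * leaving any value the parser has already set untouched.
     */
    static Properties copySelected(Properties fetcherMeta, Properties parseMeta, String... keys) {
        for (String key : keys) {
            String value = fetcherMeta.getProperty(key);
            if (value != null && parseMeta.getProperty(key) == null) {
                parseMeta.setProperty(key, value);
            }
        }
        return parseMeta;
    }
}
```

A default parser might call copySelected(fetcherMeta, parseMeta, "Content-Type") before handing its output to the indexer.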


(3) The cached page may be provided in plain text or html format.
Google does that now. I would assume ParseText will not prevent
html text being saved (other than plain text) if the parser does it
(say, a pdf->html parser). Or is there a better approach?

We need a plain text version for indexing and snippets. And we need the raw version for the cache. So, what you're saying is that, for some types, we may also need a third version, a cached conversion to html. That would indeed be a useful feature. How should we implement it?


Perhaps the parser could have an additional method:

  String getHtml();

When this is non-null, the core Nutch code could store it in a FetcherHtml output file. The HtmlParser would return null for this, so that we don't store two copies of Html pages. Then, the cache code can look here for pages whose raw content type is not text/html.
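
To make the idea concrete, here is a rough sketch. The interface HtmlConvertingParser and the class CachedPageChooser are hypothetical names, and the sketch asks the parser directly instead of reading a stored FetcherHtml file, purely to keep the example self-contained.

```java
import java.util.Locale;

// Hypothetical interface carrying the proposed method; not an existing Nutch type.
interface HtmlConvertingParser {
    /** Returns an HTML rendering of the document, or null if none should be stored. */
    String getHtml();
}

// Hypothetical cache-side helper.
final class CachedPageChooser {
    /** Picks the HTML to serve from the cache, or null if no HTML version exists. */
    static String htmlForCache(String rawContentType, String rawContent,
                               HtmlConvertingParser parser) {
        if (isHtml(rawContentType)) {
            return rawContent; // raw page is already HTML; no second copy needed
        }
        return parser.getHtml(); // converted HTML, if the parser produced one
    }

    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT).equals("text/html");
    }
}
```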

I think we should not add this feature until after we complete the current round of changes.

P.S. I can make some time available if you need help with coding tasks.

Thanks, that'd be great! I'm going to try to put together an initial version of this pretty soon. It might be incomplete, so maybe I'll check it into a branch. Then perhaps you can help flesh it out, e.g., converting the ftp code to a plugin, adding a command line tool to run the parser, etc.


My plan is to check this into HEAD as soon as it (a) works; and (b) supports all the functionality of the current HEAD. Then we can start adding new features (e.g., more parsers) as patches.

Cheers,

Doug


_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
