On Thu, May 20, 2004 at 10:43:40AM -0700, Doug Cutting wrote:
> [EMAIL PROTECTED] wrote:
> >(1) I see a need for an indicator (a field) of whether fetched content
> >is truncated or not. One possible use is at search output presentation:
> >giving the user a warning for incomplete cached content, or deciding
> >whether the cached content should be provided at all. I would assume
> >this can be done with
> >Properties getMetaData() in FetcherContent?
>
> This should be detectable by a difference between the length of the
> content bytes and the contentLength meta data, if present. Is that good
> enough? If it's not, then we could have the protocol implementation add
> a meta field for just this purpose.

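To make the two options above concrete, here is a minimal sketch in Java. It is only an illustration, not Nutch code: the class name TruncationCheck and the key names "Content-Length" and "truncated" are assumptions made for this example. It prefers an explicit meta field when the protocol implementation has set one, and otherwise falls back to comparing the byte count with the declared length.

```java
import java.util.Properties;

// Hypothetical helper, not part of Nutch. Key names are assumptions.
final class TruncationCheck {
    // Assumed key under which the protocol stores the declared length.
    static final String CONTENT_LENGTH_KEY = "Content-Length";
    // Assumed key for an explicit flag set by the protocol implementation.
    static final String TRUNCATED_KEY = "truncated";

    static boolean isTruncated(byte[] content, Properties metaData) {
        // Option 2: an explicit meta field wins when it is present.
        String flag = metaData.getProperty(TRUNCATED_KEY);
        if (flag != null) {
            return Boolean.parseBoolean(flag.trim());
        }
        // Option 1: compare the fetched byte count with the declared length.
        String declared = metaData.getProperty(CONTENT_LENGTH_KEY);
        if (declared == null) {
            return false; // no declared length, so truncation cannot be detected
        }
        try {
            return content.length < Long.parseLong(declared.trim());
        } catch (NumberFormatException e) {
            return false; // unparsable length, treat as unknown
        }
    }
}
```

The fallback only works when the server declared a length, which is one reason an explicit field may be more reliable.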
A meta field would be better.

> Should truncation lengths be content-type specific, i.e., we should have
> a different config parameter to truncate html data than we do for pdf?
> We could, in general, have our default http protocol handler look for
> config properties named http.content.limit.text/html,
> http.content.limit.application/pdf, etc. That would be a good feature
> to add.

That would be good too.

> >(2) what is the difference (different use) between
> >Properties getMetaData() in FetcherContent and
> >Properties getMetaData() in ParseData
> >?
>
> The former is for fetcher protocol generated meta data, the second is
> for parse-generated metadata. Since the parser has access to the
> fetcher meta data, it may copy some of these into the parser output, so
> that they're available for indexing. For example, the default parser
> implementation should probably copy content-type.

I like this kind of flexibility.

> >(3) The cached page may be provided in plain text or html format.
> >Google does that now. I would assume ParseText will not prevent
> >html text being saved (other than plain text) if the parser does it
> >(say, a pdf->html parser). Or is there a better approach?
>
> We need a plain text version for indexing and snippets. And we need the
> raw version for the cache. So, what you're saying is that, for some
> types, we may also need a third version, a cached conversion to html.
> That would indeed be a useful feature. How should we implement it?
>
> Perhaps the parser could have an additional method:
>
>   String getHtml();
>
> When this is non-null, the core Nutch code could store it in a
> FetcherHtml output file. The HtmlParser would return null for this, so
> that we don't store two copies of Html pages. Then, the cache code can
> look here for pages whose raw content type is not text/html.

I was thinking maybe a boolean field could be added to ParseText
indicating whether it holds plain text or html text.
However, this would require an indexer with html-parsing capability at
the index stage. Probably not a good idea?

> I think we should not add this feature until after we complete the
> current round of changes.

I agree.

> >P.S. I can have some time available if you need help on coding task.
>
> Thanks, that'd be great! I'm going to try to put together an initial
> version of this pretty soon. It might be incomplete, so maybe I'll
> check it into a branch. Then perhaps you can help flesh it out, e.g.,
> converting the ftp code to a plugin, adding a command line to run the
> parser, etc.
>
> My plan is to check this into HEAD as soon as it (a) works; and (b)
> supports all the functionality of the current HEAD. Then we can start
> adding new features (e.g., more parsers) as patches.

Great!

John

_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
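For the content-type-specific truncation limits discussed in this thread, a lookup could be sketched roughly as below. This is an illustration only, not the Nutch implementation. The property names http.content.limit.text/html and http.content.limit.application/pdf come from the thread; the class ContentLimits, its method names, and the fallback to a plain http.content.limit property are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Properties;

// Hypothetical helper, not part of Nutch.
final class ContentLimits {
    // Prefix taken from the property names proposed in the thread.
    static final String PREFIX = "http.content.limit";

    // Returns the byte limit for a content type. Falls back to the generic
    // PREFIX property (an assumption), then to the supplied default.
    static int limitFor(Properties conf, String contentType, int defaultLimit) {
        String base = baseType(contentType);
        String value = (base == null) ? null : conf.getProperty(PREFIX + "." + base);
        if (value == null) {
            value = conf.getProperty(PREFIX);
        }
        if (value == null) {
            return defaultLimit;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            return defaultLimit; // unparsable setting, use the default
        }
    }

    // Strips parameters such as "; charset=UTF-8" and lower-cases the type.
    static String baseType(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semi = contentType.indexOf(';');
        String base = (semi >= 0) ? contentType.substring(0, semi) : contentType;
        base = base.trim().toLowerCase(Locale.ROOT);
        return base.isEmpty() ? null : base;
    }

    // Cuts content down to the limit; a negative limit means "no limit".
    static byte[] truncate(byte[] content, int limit) {
        if (limit < 0 || content.length <= limit) {
            return content;
        }
        return Arrays.copyOf(content, limit);
    }
}
```

With http.content.limit.text/html set to 65536, a call such as limitFor(conf, "text/html; charset=UTF-8", 1000) would use the html-specific value, while a type with no entry would get the default.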
