Doug Cutting wrote:
Should truncation lengths be content-type specific, i.e., should we have a different config parameter to truncate html data than we do for pdf? We could, in general, have our default http protocol handler look for config properties named http.content.limit.text/html, http.content.limit.application/pdf, etc. That would be a good feature to add.
I'm in favor of this approach. Partial content of a PDF or .doc file is useless most of the time, so it will have to be discarded anyway.
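To make the lookup concrete, here is a rough sketch of how a handler might resolve a per-type limit. Only the property names http.content.limit.text/html and http.content.limit.application/pdf come from the proposal above; the ContentLimits class, the limitFor method, the fallback to a plain http.content.limit key, and the use of -1 for "no limit configured" are my assumptions for illustration, not existing Nutch code.

```java
import java.util.Locale;
import java.util.Properties;

public class ContentLimits {
    // Assumed global fallback key; per-type keys append "." + content type.
    public static final String BASE_KEY = "http.content.limit";

    /**
     * Returns the truncation limit in bytes for the given content type.
     * Falls back to the global key, and to -1 if nothing is configured.
     */
    public static int limitFor(Properties conf, String contentType) {
        // Drop parameters such as "; charset=utf-8" and normalize case.
        String type = contentType == null
                ? ""
                : contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        String specific = conf.getProperty(BASE_KEY + "." + type);
        if (specific != null) {
            return Integer.parseInt(specific.trim());
        }
        return Integer.parseInt(conf.getProperty(BASE_KEY, "-1").trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("http.content.limit", "65536");
        conf.setProperty("http.content.limit.application/pdf", "1048576");
        System.out.println(limitFor(conf, "text/html; charset=utf-8"));
        System.out.println(limitFor(conf, "application/pdf"));
    }
}
```

With something like this, a fetcher could also decide to discard a PDF or .doc outright whenever the fetched length hits the limit, rather than keeping a truncated copy.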
(3) The cached page may be provided in plain text or html format. Google does that now. I would assume ParseText will not prevent html text being saved (other than plain text) if the parser does it (say, a pdf->html parser). Or is there a better approach?
Is it so that Google provides a cache for the _original_ format (other than HTML)? I don't think so, I always thought that they cache only HTML - either the original HTML page, or a result of conversion of other formats to HTML, but never the original binary formats... Except perhaps in the case of plain image formats, when they store thumbnails.
We need a plain text version for indexing and snippets. And we need the raw version for the cache. So, what you're saying is that, for some types, we may also need a third version, a cached conversion to html.
So far the "raw version" == "html version", because Nutch handled only HTML... :-) Now is the time to decide whether to give an option to keep a cache of "really raw" content, i.e. the original binary format like PDF, DOC, BLURFL, etc.
That would indeed be a useful feature. How should we implement it?
Perhaps the parser could have an additional method:
String getHtml();
When this is non-null, the core Nutch code could store it in a FetcherHtml output file. The HtmlParser would return null for this, so that we don't store two copies of Html pages. Then, the cache code can look here for pages whose raw content type is not text/html.
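A minimal sketch of that flow, just to check we mean the same thing. Only getHtml() and the idea of a FetcherHtml output come from the suggestion above; the ParsedDocument interface, the SimpleDocument class, and the in-memory map standing in for the FetcherHtml file are hypothetical names I made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

public class HtmlCacheSketch {
    /** Hypothetical parser result; getHtml() is the proposed extra method. */
    public interface ParsedDocument {
        String getText();
        String getHtml(); // null when the raw content is already HTML
    }

    /** Hypothetical value holder used by any parser in this sketch. */
    public static final class SimpleDocument implements ParsedDocument {
        private final String text;
        private final String html;

        public SimpleDocument(String text, String html) {
            this.text = text;
            this.html = html;
        }

        public String getText() { return text; }
        public String getHtml() { return html; }
    }

    // In-memory stand-in for the FetcherHtml output file, keyed by URL.
    private final Map<String, String> fetcherHtml = new HashMap<>();

    /** Stores the converted HTML only when the parser supplied one. */
    public void store(String url, ParsedDocument doc) {
        String html = doc.getHtml();
        if (html != null) {
            fetcherHtml.put(url, html);
        }
    }

    /** Cache lookup: raw content for text/html, converted HTML otherwise. */
    public String cachedHtml(String url, String rawContentType, String rawContent) {
        if ("text/html".equalsIgnoreCase(rawContentType)) {
            return rawContent;
        }
        return fetcherHtml.get(url); // may be null if no conversion exists
    }

    public int storedCount() {
        return fetcherHtml.size();
    }
}
```

An HtmlParser would build its result with a null html argument, so HTML pages are never stored twice; a pdf->html parser would pass its converted markup.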
Wait.. I think I'm confused here about what should be stored by default. My take on this is as follows:
* store the plain text for indexing and snippets, no matter what the original format was. This is indispensable in any installation.
* store the HTML version for Web preview, again - no matter what the original format was. These two formats - plain and html - would be pretty mandatory.
* and finally, as an option, store the original content, _IF_ its type is not text/html. This would be purely optional, because it requires a lot more resources.
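Expressed as code, the three rules above might look roughly like this. The StoragePolicy class, the Version enum, and the keepOriginal flag are names I invented for the sketch; none of them exist in Nutch.

```java
import java.util.EnumSet;
import java.util.Set;

public class StoragePolicy {
    /** The three versions of a fetched page discussed in this thread. */
    public enum Version { PLAIN_TEXT, HTML, ORIGINAL }

    /**
     * Decides which versions to store for a page.
     * keepOriginal is a hypothetical config switch for the optional raw copy.
     */
    public static Set<Version> versionsToStore(String contentType, boolean keepOriginal) {
        // Plain text and HTML are always stored, whatever the source format.
        EnumSet<Version> versions = EnumSet.of(Version.PLAIN_TEXT, Version.HTML);
        // The original is stored only on request, and only when it would
        // not duplicate the HTML version.
        if (keepOriginal && !"text/html".equalsIgnoreCase(contentType)) {
            versions.add(Version.ORIGINAL);
        }
        return versions;
    }
}
```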
What do you think?
-- Best regards, Andrzej Bialecki
-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
