> However, what I think is ultimately needed to match the features of other search engines is not the ability to return the cached non-HTML content (there might even be copyright issues with this function...), but an HTML rendering of non-HTML content, a la Google's "View as HTML" function.
Why are copyright issues different for HTML than for other formats?
I suspect that the original reason Google did things this way was not copyright or usability, but rather to take advantage of their HTML-related technology (e.g., boosting scores for headings) and to minimize storage requirements. If it were primarily a usability issue, they could convert to HTML on the fly. Rather, it appears that Google decided to convert everything to a common-denominator format early in their pipeline, before the cache is written. Nutch keeps a higher-fidelity cache, which permits it to show the original content as well as any lower-fidelity renderings.
If we someday index, e.g., headings and bolded text specially, then we may find it useful to have a common-denominator intermediate format, like HTML, that all content types are converted to. But until we do, I don't see much point in caching an HTML representation.
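
To make the distinction concrete, here is a rough sketch of the "cache the original, render on demand" approach. It is written in Python purely for illustration and is not Nutch code: CachedPage, store, and render_as_html are hypothetical names, and the only converter shown is a trivial one for plain text.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedPage:
    """One cache entry: the fetched bytes exactly as received (hypothetical type)."""
    url: str
    content_type: str
    raw: bytes


# Hypothetical in-memory stand-in for the page cache, keyed by URL.
cache = {}


def store(url, content_type, raw):
    """Cache the original content; no conversion happens at write time."""
    cache[url] = CachedPage(url, content_type, raw)


def render_as_html(page):
    """Derive a lower-fidelity HTML view on the fly; nothing extra is stored."""
    if page.content_type == "text/html":
        return page.raw.decode("utf-8", errors="replace")
    if page.content_type == "text/plain":
        text = page.raw.decode("utf-8", errors="replace")
        return "<html><body><pre>" + html.escape(text) + "</pre></body></html>"
    # Other formats would need their own converters; none are assumed here.
    raise ValueError("no HTML converter for " + page.content_type)


store("http://example.com/notes.txt", "text/plain", b"a < b & c")
page = cache["http://example.com/notes.txt"]
original = page.raw             # the original bytes are still available
as_html = render_as_html(page)  # the HTML view is computed only when asked for
```

A pipeline that converted to HTML before writing the cache would only have as_html to serve, and could not give back the original bytes.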
Doug
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
