Andrzej Bialecki wrote:
However, what I think is ultimately needed to match the features of other search engines is not the ability to return the cached non-html content (there might even be copyright issues with this function...), but an html rendering of non-html content, a la Google's "View as HTML" function.
Why are copyright issues different for HTML than for other formats?
Because it is much less common to encounter a restrictive license on HTML than on other formats.
I suspect that the original reason that Google did things this way was not for copyright or usability, but rather to take advantage of their HTML-related technology (e.g., boosting scores for headings, etc.) and to minimize storage requirements. If it were primarily a usability issue, then they could convert to HTML on the fly. Rather, it appears that Google decided to convert everything to a common-denominator format early in their pipeline, before the cache is written. Nutch keeps a higher-fidelity cache, which permits it to show the original content, as well as any lower-fidelity renderings.
This is technically true. However, my point was that someone could treat this high-fidelity caching as unauthorized re-distribution of content covered by more restrictive licenses than HTML. Think e.g. about mp3, avi, and high-quality images: technically they can be downloaded, but their re-distribution is legally encumbered. If Nutch keeps a lower-quality copy in its cache, then it's easy to defend against accusations of abuse. However, if you can download basically the same content from Nutch's cache as from the original site, you could run into problems.
Google steers nicely around this legal problem by always providing the lower-resolution content, and by clearly "stamping" the content so that it cannot be mistaken for content coming from the original site.
That said, I think this functionality is good to have anyway, even if individual Nutch operators may decide not to display such content on their public sites.
-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
