Doug Cutting wrote:
> Andrzej Bialecki wrote:
>
>> However, what I think is ultimately needed to match the features of other search engines is not the ability to return the cached non-HTML content (there might even be copyright issues with this function...), but an HTML rendering of non-HTML content, a la Google's "View as HTML" function.
>
> Why are copyright issues different for HTML than for other formats?

Because it is much less common to encounter a restrictive license on HTML than on other formats.

> I suspect that the original reason that Google did things this way was not for copyright or usability, but rather to take advantage of their HTML-related technology (e.g., boosting scores for headings, etc.) and to minimize storage requirements. If it were primarily a usability issue, they could convert to HTML on the fly. Rather, it appears that Google decided to convert everything to a common-denominator format early in their pipeline, before the cache is written. Nutch keeps a higher-fidelity cache, which permits it to show the original content, as well as any lower-fidelity renderings.

This is technically true. However, my point was that someone could treat this high-fidelity caching as unauthorized redistribution of content covered by more restrictive licenses than HTML. Think, e.g., of MP3, AVI, and high-quality images, which can technically be downloaded but whose redistribution is legally encumbered. If Nutch keeps a lower-quality copy in its cache, it is easy to defend against accusations of abuse. However, if you can download essentially the same content from Nutch's cache as from the original site, you could run into problems.


Google steers nicely around this legal problem by always providing lower-resolution content, and by clearly "stamping" the content so that it cannot be mistaken for content coming from the original site.

That said, I think this functionality is good to have anyway, even if individual Nutch operators may decide not to display such content on their public sites.

--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
