On 2010-02-20 22:45, reinhard schwab wrote:
the content of one page is stored even 7 times.
i believe this comes from

Recno:: 383
URL:: http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519

Duplicate content usually arises because the same content really does appear under different URLs. This is common enough that I don't necessarily see it as a bug in Nutch - we can't know that the content is identical until we have actually fetched it...
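
To illustrate why duplicates can only be detected after fetching, here is a minimal sketch in Python. It is not Nutch's actual code: the function name `group_by_digest`, the choice of MD5, and the sample pages (including the second g2_highlightId value) are assumptions made up for the example. It hashes each fetched page body and reports URLs that share a digest.

```python
import hashlib
from collections import defaultdict


def group_by_digest(pages):
    """Group URLs by a digest of their fetched content.

    `pages` is an iterable of (url, content_bytes) pairs. Returns a dict
    mapping digest -> list of URLs, keeping only digests seen more than once.
    """
    groups = defaultdict(list)
    for url, content in pages:
        # MD5 is used here only as an illustrative content signature.
        digest = hashlib.md5(content).hexdigest()
        groups[digest].append(url)
    return {d: urls for d, urls in groups.items() if len(urls) > 1}


# Hypothetical fetched pages: two URLs returning identical bytes.
base = "http://www.cinema-paradiso.at/gallery2/main.php"
pages = [
    (base + "?g2_highlightId=54519", b"<html>gallery</html>"),
    (base + "?g2_highlightId=11111", b"<html>gallery</html>"),
    (base + "?g2_itemId=7", b"<html>another page</html>"),
]
duplicates = group_by_digest(pages)
```

Note that an exact digest only catches byte-identical pages; pages that differ in a timestamp or a highlighted element would need a fuzzier signature.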

URLs may differ in certain systematic ways (e.g. by a set of URL parameters such as sessionId, print=yes, etc.) or in completely unrelated ways (human error, peculiarities of the content management system, or mirrors). In your case it seems that the same page is available under different values of g2_highlightId.
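
For the systematic case, one possible approach is to normalize URLs before they are fetched by dropping parameters that are known not to affect the content. The sketch below is only an illustration in Python, not a Nutch configuration: the `IGNORED_PARAMS` set and the `normalize` function are hypothetical, and it assumes that g2_highlightId really has no influence on the page body, which you would need to verify for your site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters do not change the page content.
IGNORED_PARAMS = {"g2_highlightId", "sessionId", "print"}


def normalize(url):
    """Return `url` with the ignored query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# The URL from the report collapses to the bare script path.
normalized = normalize(
    "http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519"
)
```

This only helps with systematic differences; unrelated URLs and mirrors can still only be recognized by comparing the fetched content.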

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
