On 2010-02-20 22:45, reinhard schwab wrote:
the content of one page is stored even 7 times.
http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
i believe this comes from
Recno:: 383
URL:: http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519
Duplicate content usually means that the same content really does
appear under different URLs. This is common enough that I don't see it
as necessarily a bug in Nutch: we can't know that the content is
identical until we actually fetch it...
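
To illustrate why duplicates can only be detected after fetching, here is a minimal sketch that groups already-fetched pages by a digest of their content. It is not Nutch code: the function name `group_by_digest`, the example.com URLs and the page bodies are all made up for illustration, and MD5 is used only as a convenient content fingerprint.

```python
import hashlib
from collections import defaultdict


def group_by_digest(pages):
    """Group URLs whose fetched content is byte-for-byte identical.

    `pages` is an iterable of (url, content_bytes) pairs, i.e. the
    content must already have been fetched before we can compare it.
    """
    groups = defaultdict(list)
    for url, content in pages:
        digest = hashlib.md5(content).hexdigest()
        groups[digest].append(url)
    # Keep only digests that were seen under more than one URL.
    return {d: urls for d, urls in groups.items() if len(urls) > 1}


# Hypothetical fetched pages: two URLs return the same bytes.
pages = [
    ("http://example.com/main.php?page=8", b"<html>gallery page 8</html>"),
    ("http://example.com/main.php?highlight=1", b"<html>gallery page 8</html>"),
    ("http://example.com/main.php?page=9", b"<html>gallery page 9</html>"),
]
duplicates = group_by_digest(pages)
```

Nothing in the URLs alone tells us that the first two entries are the same page; only the digests of the fetched content do.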
URLs may differ in certain systematic ways (e.g. by a set of URL params,
such as sessionId, print=yes, etc.) or in completely unrelated ways
(human errors, peculiarities of the content management system, or
mirrors). In your case it seems that the same page is available under
different values of g2_highlightId.
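
For the systematic case, one possible approach is to normalize URLs before fetching by dropping query parameters that are known not to change the content. The following is a rough sketch in plain Python, not Nutch's own normalizer; the name `normalize`, the `IGNORED_PARAMS` set and the example.com URLs are assumptions for illustration, and whether a given parameter (such as g2_highlightId) can be dropped safely depends on the site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters that do not affect the page content.
# This has to be chosen per site; it is not a universal rule.
IGNORED_PARAMS = {"sessionId", "print"}


def normalize(url, ignored=IGNORED_PARAMS):
    """Return `url` with the ignored query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# Hypothetical URLs that differ only in ignored parameters.
a = normalize("http://example.com/main.php?page=8&sessionId=abc123")
b = normalize("http://example.com/main.php?page=8&print=yes")
```

After normalization both URLs collapse to the same string, so the page would be fetched and stored only once. URLs that differ in unrelated ways (mirrors, human errors) are not caught by this and still need a content-based comparison after the fetch.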
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com