On 2010-02-20 23:32, reinhard schwab wrote:
Andrzej Bialecki wrote:
On 2010-02-20 22:45, reinhard schwab wrote:
the content of one page is stored as many as 7 times.
http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
i believe this comes from
Recno:: 383
URL:: http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519
Duplicate content usually arises because the same content really does
appear under different URLs. This is common enough that I don't see it
necessarily as a bug in Nutch: we can't know that the content is
identical until we actually fetch it.
URLs may differ in certain systematic ways (e.g. by a set of URL
params, such as sessionId, print=yes, etc.) or in completely unrelated
ways (human errors, peculiarities of the content management system, or
mirrors). In your case it seems that the same page is available under
different values of g2_highlightId.
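
To illustrate the "systematic" case, here is a minimal Python sketch that maps URL variants onto one canonical form by dropping parameters that do not change the content. It is not how Nutch does it internally; the function name canonical_url, the IGNORED_PARAMS set and the example.com URLs are assumptions made up for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of query parameters assumed not to affect page content.
IGNORED_PARAMS = {"sessionId", "print"}


def canonical_url(url, ignored=IGNORED_PARAMS):
    """Return the URL with the ignored query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


if __name__ == "__main__":
    # Both variants collapse to http://example.com/page.php?id=3
    print(canonical_url("http://example.com/page.php?id=3&sessionId=abc"))
    print(canonical_url("http://example.com/page.php?id=3&print=yes"))
```

This only covers variants that follow a known pattern. Unrelated URLs with identical content (mirrors, CMS quirks) can only be detected after fetching, by comparing the content itself.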
i know. i have implemented several url filters to filter out duplicate content.
this case is different: the same content is stored
under the same url several times.
it is stored under
http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
and not under
http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519
the content for the latter url is empty.
Content:
Ok, then the answer can be found in the protocol status or parse status.
You can get protocol status by doing a segment dump of only the
crawl_fetch part (disable all other parts, then the output is less
confusing). Similarly, parse status can be found in crawl_parse.
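
A rough sketch of how such a dump command could be assembled, written as a small Python helper. The -no* switches are the general options of "bin/nutch readseg"; the helper name readseg_dump_command and the segment and output paths are hypothetical, so substitute your own.

```python
# Each flag tells `bin/nutch readseg` to skip one part of the segment.
SKIP_FLAGS = {
    "content": "-nocontent",
    "crawl_fetch": "-nofetch",
    "crawl_generate": "-nogenerate",
    "crawl_parse": "-noparse",
    "parse_data": "-noparsedata",
    "parse_text": "-noparsetext",
}


def readseg_dump_command(segment_dir, output_dir, keep):
    """Build the argument list for a dump that contains only the part `keep`."""
    if keep not in SKIP_FLAGS:
        raise ValueError("unknown segment part: " + keep)
    # Disable every part except the one we want to inspect.
    skip = [flag for part, flag in SKIP_FLAGS.items() if part != keep]
    return ["bin/nutch", "readseg", "-dump", segment_dir, output_dir] + skip


if __name__ == "__main__":
    # Hypothetical segment path; use one of your own segments instead.
    segment = "crawl/segments/20100220223245"
    print(" ".join(readseg_dump_command(segment, "dump_fetch", "crawl_fetch")))
    print(" ".join(readseg_dump_command(segment, "dump_parse", "crawl_parse")))
```

Running the printed commands from the Nutch directory should give one dump with just the protocol status per URL and another with just the parse status.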
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com