On 2010-02-20 23:32, reinhard schwab wrote:
Andrzej Bialecki wrote:
On 2010-02-20 22:45, reinhard schwab wrote:
the content of one page is stored even 7 times.
http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
i believe this comes from

Recno:: 383
URL::
http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519

Duplicate content is usually related to the fact that indeed the same
content appears under different urls. This is common enough, so I
don't see this necessarily as a bug in Nutch - we won't know that the
content is identical until we actually fetch it...

Urls may differ in certain systematic ways (e.g. by a set of URL
params, such as sessionId, print=yes, etc) or completely unrelated
(human errors, peculiarities of the content management system, or
mirrors). In your case it seems that the same page is available under
different values of g2_highlightId.


i know. i have implemented several url filters to filter duplicate content.
the difference here is that the same content is stored
under the same url several times.
it is stored under
http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
and not under
http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519

the content for the latter url is empty.
Content:

Ok, then the answer can be found in the protocol status or parse
status. You can get the protocol status by doing a segment dump of only
the crawl_fetch part (disable all other parts so that the output is
less confusing). Similarly, the parse status can be found in
crawl_parse.
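
As a rough sketch of the segment dump described above: the small Python helper below only builds the `bin/nutch readseg -dump` command line with every segment part except one switched off. The `-no...` option names follow the readseg usage message. The helper name `readseg_dump_command`, the segment path and the output directory are made up for illustration, so adjust them to your own crawl.

```python
# Map each segment part to the readseg option that excludes it from the dump.
SKIP_FLAGS = {
    "content": "-nocontent",
    "crawl_fetch": "-nofetch",
    "crawl_generate": "-nogenerate",
    "crawl_parse": "-noparse",
    "parse_data": "-noparsedata",
    "parse_text": "-noparsetext",
}


def readseg_dump_command(segment_dir, output_dir, keep, nutch="bin/nutch"):
    """Return the argument list for a dump that contains only the `keep` part.

    Hypothetical helper: it does not run Nutch, it only assembles the command.
    """
    if keep not in SKIP_FLAGS:
        raise ValueError("unknown segment part: %s" % keep)
    # Disable every part except the one we want to inspect.
    skip = [flag for part, flag in SKIP_FLAGS.items() if part != keep]
    return [nutch, "readseg", "-dump", segment_dir, output_dir] + skip


if __name__ == "__main__":
    # Placeholder paths, assumed for the example only.
    segment = "crawl/segments/20100220223000"
    print(" ".join(readseg_dump_command(segment, "dump_fetch", "crawl_fetch")))
    print(" ".join(readseg_dump_command(segment, "dump_parse", "crawl_parse")))
```

The first command should leave only the crawl_fetch records (with the protocol status) in the dump, the second only the crawl_parse records. Pass the returned list to `subprocess.run` from the Nutch home directory if you want to execute it.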




--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
