Re: SegmentFilter

Andrzej Bialecki Sun, 21 Feb 2010 02:13:50 -0800

On 2010-02-20 23:32, reinhard schwab wrote:

Andrzej Bialecki wrote:

On 2010-02-20 22:45, reinhard schwab wrote:

the content of one page is stored even 7 times.
http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
i believe this comes from


Recno:: 383
URL::
http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519


Duplicate content is usually related to the fact that indeed the same
content appears under different urls. This is common enough, so I
don't see this necessarily as a bug in Nutch - we won't know that the
content is identical until we actually fetch it...

Urls may differ in certain systematic ways (e.g. by a set of URL
params, such as sessionId, print=yes, etc) or completely unrelated
(human errors, peculiarities of the content management system, or
mirrors). In your case it seems that the same page is available under
different values of g2_highlightId.

I know. I have implemented several URL filters to filter duplicate content.
This case is different: the same content is stored
under the same URL several times.
it is stored under
http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
and not under
http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519

the content for the latter url is empty.
Content:

Ok, then the answer can be found in the protocol status or parse
status. You can get protocol status by doing a segment dump of only the
crawl_fetch part (disable all other parts, then the output is less
confusing). Similarly, parse status can be found in crawl_parse.
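
As a rough illustration of "disable all other parts", here is a small Python sketch that builds a readseg dump command keeping only one segment part. The helper name, the segment path and the output directory are made up for the example. The -no* option names are the ones I remember from the SegmentReader usage message, so check them against what bin/nutch readseg prints on your version before relying on this.

```python
# Options that tell readseg to skip one part of a segment.
# Assumption: these names match the SegmentReader usage text of your Nutch version.
SKIP_FLAGS = {
    "content": "-nocontent",
    "crawl_fetch": "-nofetch",
    "crawl_generate": "-nogenerate",
    "crawl_parse": "-noparse",
    "parse_data": "-noparsedata",
    "parse_text": "-noparsetext",
}


def readseg_dump_command(segment_dir, output_dir, keep, nutch="bin/nutch"):
    """Return the argument list for a dump that contains only the part `keep`.

    Hypothetical helper: it only assembles the command line, it does not run it.
    """
    if keep not in SKIP_FLAGS:
        raise ValueError("unknown segment part: %r" % keep)
    # Disable every part except the one we want to look at.
    skip = [flag for part, flag in SKIP_FLAGS.items() if part != keep]
    return [nutch, "readseg", "-dump", segment_dir, output_dir] + skip


if __name__ == "__main__":
    # Placeholder paths; substitute a real segment directory.
    cmd = readseg_dump_command("crawl/segments/SEGMENT", "dump_fetch", "crawl_fetch")
    print(" ".join(cmd))
    # To execute it from inside a Nutch installation, pass `cmd` to subprocess.run().
```

Calling the helper with keep="crawl_parse" instead gives the matching command for looking at the parse status.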





--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
