Andrzej Bialecki wrote:
> On 2010-02-20 23:32, reinhard schwab wrote:
>> Andrzej Bialecki wrote:
>>> On 2010-02-20 22:45, reinhard schwab wrote:
>>>> the content of one page is stored even 7 times.
>>>> http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
>>>> i believe this comes from
>>>>
>>>> Recno:: 383
>>>> URL::
>>>> http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519
>>>
>>> Duplicate content is usually related to the fact that indeed the same
>>> content appears under different urls. This is common enough, so I
>>> don't see this necessarily as a bug in Nutch - we won't know that the
>>> content is identical until we actually fetch it...
>>>
>>> Urls may differ in certain systematic ways (e.g. by a set of URL
>>> params, such as sessionId, print=yes, etc) or completely unrelated
>>> (human errors, peculiarities of the content management system, or
>>> mirrors). In your case it seems that the same page is available under
>>> different values of g2_highlightId.
>>>
>> i know. i have implemented several url filters to filter duplicate
>> content.
>> there is a difference here.
>> the difference here is that in this case the same content is stored
>> under the same url several times.
>> it is stored under
>> http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
>> and not under
>> http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519
>>
>> the content for the latter url is empty.
>> Content:
>
> Ok, then the answer can be found in the protocol status or parse
> status. You can get protocol status by doing a segment dump of only
> the crawl_fetch part (disable all other parts, then the output is less
> confusing). Similarly, parse status can be found in crawl_parse.
>

this is the fetch status of one crawl datum

Recno:: 1741
URL:: http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519

CrawlDatum::
Version: 7
Status: 35 (fetch_redir_temp)
Fetch time: Fri Feb 19 01:38:04 CET 2010
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 5184000 seconds (60 days)
Score: 0.061658654
Signature: null
Metadata: _ngt_: 1266539117359_pst_: temp_moved(13), lastModified=0: http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8

if there is a temp redirect, is the content stored under the redirect url?
to avoid duplicate content being stored under the same url, it may be
better not to store the content under the redirect url and only add the
redirect url to the crawl db?

regards
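
To make the suggestion concrete, here is a rough sketch of the two behaviours as a toy model. It is not Nutch code: the function names, the example.org urls and the in-memory "crawl db" are all made up for illustration, and it assumes that a redirect target does not itself redirect again.

```python
from collections import Counter

# Hypothetical toy data: three urls that all temp-redirect to one page.
TARGET = "http://example.org/main.php?g2_page=8"
REDIRECTS = {
    "http://example.org/main.php?g2_highlightId=%d" % i: TARGET
    for i in (1, 2, 3)
}
FETCH_LIST = sorted(REDIRECTS)


def fetch_following_redirects(fetch_list, redirects):
    """Follow each temp redirect at once and store content under the target url."""
    stored = Counter()
    for url in fetch_list:
        target = redirects.get(url, url)
        stored[target] += 1  # one content record per fetch-list entry
    return stored


def fetch_deferring_redirects(fetch_list, redirects):
    """Only record redirect targets, then fetch each recorded target once."""
    stored = Counter()
    crawl_db = set()  # toy stand-in for the crawl db
    for url in fetch_list:
        if url in redirects:
            crawl_db.add(redirects[url])  # remember the target, store nothing
        else:
            stored[url] += 1
    # Next round: every recorded target is fetched a single time.
    for url in sorted(crawl_db):
        if url not in stored:
            stored[url] += 1
    return stored


immediate = fetch_following_redirects(FETCH_LIST, REDIRECTS)
deferred = fetch_deferring_redirects(FETCH_LIST, REDIRECTS)
print("copies stored when following redirects:", immediate[TARGET])
print("copies stored when deferring redirects:", deferred[TARGET])
```

In this model the first variant writes one copy of the page per redirecting url, while the second writes a single copy because the target goes through the crawl db and is deduplicated by url before it is fetched.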