Re: SegmentFilter

reinhard schwab Sun, 21 Feb 2010 03:28:35 -0800

Andrzej Bialecki wrote:
> On 2010-02-20 23:32, reinhard schwab wrote:
>> Andrzej Bialecki wrote:
>>> On 2010-02-20 22:45, reinhard schwab wrote:
>>>> the content of one page is stored as many as 7 times.
>>>> http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
>>>> i believe this comes from
>>>>
>>>> Recno:: 383
>>>> URL::
>>>> http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519
>>>
>>> Duplicate content is usually related to the fact that indeed the same
>>> content appears under different urls. This is common enough, so I
>>> don't see this necessarily as a bug in Nutch - we won't know that the
>>> content is identical until we actually fetch it...
>>>
>>> Urls may differ in certain systematic ways (e.g. by a set of URL
>>> params, such as sessionId, print=yes, etc) or completely unrelated
>>> (human errors, peculiarities of the content management system, or
>>> mirrors). In your case it seems that the same page is available under
>>> different values of g2_highlightId.
>>>
>>>
>> i know. i have implemented several url filters to filter out duplicate
>> content.
>> the difference in this case is that the same content is stored
>> under the same url several times.
>> it is stored under
>> http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
>> and not under
>> http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519
>>
>> the content for the latter url is empty.
>> Content:
>
> Ok, then the answer can be found in the protocol status or parse
> status. You can get protocol status by doing a segment dump of only
> the crawl_fetch part (disable all other parts, then the output is less
> confusing). Similarly, parse status can be found in crawl_parse.
>
this is the fetch status of one crawl datum:


Recno:: 1741
URL:: http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519

CrawlDatum::
Version: 7
Status: 35 (fetch_redir_temp)
Fetch time: Fri Feb 19 01:38:04 CET 2010
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 5184000 seconds (60 days)
Score: 0.061658654
Signature: null
Metadata: _ngt_: 1266539117359_pst_: temp_moved(13), lastModified=0:
http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
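
To check how many urls end up redirecting to the same target, the text dump can be scanned with a small script. The following is only a rough sketch in Python, not part of Nutch. It assumes that records in the dump start with a "Recno::" line and that the redirect target follows "lastModified=<n>:" in the _pst_ metadata, as in the record above. The names parse_record and sources_by_target are made up for this example.

```python
import re
from collections import defaultdict

# Text in the shape of the crawl_fetch record quoted above (shortened).
SAMPLE = """\
Recno:: 1741
URL:: http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519

CrawlDatum::
Version: 7
Status: 35 (fetch_redir_temp)
Metadata: _ngt_: 1266539117359_pst_: temp_moved(13), lastModified=0:
http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
"""

# Assumption: every record begins with "Recno::" at the start of a line.
RECORD = re.compile(r"^Recno::.*?(?=^Recno::|\Z)", re.MULTILINE | re.DOTALL)


def parse_record(text):
    """Return url, status name and redirect target (or None) of ONE record."""
    url = re.search(r"^URL::\s*(\S+)", text, re.MULTILINE)
    status = re.search(r"^Status:\s*\d+\s*\((\w+)\)", text, re.MULTILINE)
    # Assumption: the redirect target follows "lastModified=<n>:" in _pst_,
    # possibly on the next line because of mail wrapping.
    target = re.search(r"_pst_:.*?lastModified=\d+:\s*(\S+)", text, re.DOTALL)
    return {
        "url": url.group(1) if url else None,
        "status": status.group(1) if status else None,
        "redirect_target": target.group(1) if target else None,
    }


def sources_by_target(dump_text):
    """Group redirecting urls by the url they redirect to."""
    groups = defaultdict(list)
    for match in RECORD.finditer(dump_text):
        rec = parse_record(match.group(0))
        if rec["url"] and rec["redirect_target"]:
            groups[rec["redirect_target"]].append(rec["url"])
    return dict(groups)


if __name__ == "__main__":
    for target, sources in sources_by_target(SAMPLE).items():
        print(len(sources), "url(s) redirect to", target)
```

A target that shows up with several source urls would be a candidate for the repeated content described above, provided the content really is stored once per redirect.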

if there is a temp redirect, is the content stored under the redirect target url?
to avoid duplicate content being stored under the same url, it may be better
not to store the content under the redirect target url
and only add the redirect target url to the crawl db.
regards
