Andrzej Bialecki wrote:
> On 2010-02-21 12:36, reinhard schwab wrote:
>> Andrzej Bialecki wrote:
>>> On 2010-02-20 23:32, reinhard schwab wrote:
>>>> Andrzej Bialecki wrote:
>>>>> On 2010-02-20 22:45, reinhard schwab wrote:
>>>>>> The content of one page is stored as many as seven times:
>>>>>> http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
>>>>>> I believe this comes from
>>>>>>
>>>>>> Recno:: 383
>>>>>> URL::
>>>>>> http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519
>>>>>
>>>>> Duplicate content usually means that the same content really does
>>>>> appear under different URLs. This is common enough, so I don't
>>>>> necessarily see it as a bug in Nutch - we won't know that the
>>>>> content is identical until we actually fetch it...
>>>>>
>>>>> URLs may differ in certain systematic ways (e.g. by a set of URL
>>>>> params, such as sessionId, print=yes, etc.) or in completely
>>>>> unrelated ways (human error, peculiarities of the content management
>>>>> system, or mirrors). In your case it seems that the same page is
>>>>> available under different values of g2_highlightId.
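>>>>>
>>>>> If such a param never selects unique content, one common fix is a
>>>>> rule in regex-urlfilter.txt that skips those urls before they are
>>>>> fetched, e.g. (untested, param name taken from your urls):
>>>>>
>>>>>   -[?&]g2_highlightId=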
>>>>>
>>>>>
>>>> I know; I have implemented several URL filters to filter out
>>>> duplicate content.
>>>> The difference in this case is that the same content is stored
>>>> several times under the same URL.
>>>> It is stored under
>>>> http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
>>>> and not under
>>>> http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519
>>>>
>>>> The content for the latter URL is empty:
>>>> Content:
>>>
>>> Ok, then the answer can be found in the protocol status or parse
>>> status. You can get protocol status by doing a segment dump of only
>>> the crawl_fetch part (disable all other parts, then the output is less
>>> confusing). Similarly, parse status can be found in crawl_parse.
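>>>
>>> For example (paths are placeholders for your crawl dirs):
>>>
>>>   bin/nutch readseg -dump crawl/segments/<segment> fetch_dump \
>>>     -nocontent -nogenerate -noparse -noparsedata -noparsetext
>>>
>>> should leave only the crawl_fetch entries in the dump.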
>>>
>> This is the fetch status of one crawl datum:
>>
>> Recno:: 1741
>> URL::
>> http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519
>>
>> CrawlDatum::
>> Version: 7
>> Status: 35 (fetch_redir_temp)
>> Fetch time: Fri Feb 19 01:38:04 CET 2010
>> Modified time: Thu Jan 01 01:00:00 CET 1970
>> Retries since fetch: 0
>> Retry interval: 5184000 seconds (60 days)
>> Score: 0.061658654
>> Signature: null
>> Metadata: _ngt_: 1266539117359_pst_: temp_moved(13), lastModified=0:
>> http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
>>
>> So if there is a temporary redirect, the content is stored under the
>> redirect URL?
>> To avoid duplicate content being stored under the same URL, might it
>> be better not to store the content under the redirect URL and only to
>> add the redirect URL to the crawldb?
>
> Certainly - this is why it's usually best _not_ to follow redirects
> immediately, but instead record them in the db, and then follow them
> in the next cycle. You can achieve this effect by setting
> http.redirect.max=0.
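>
> In conf/nutch-site.xml that is, e.g.:
>
>   <property>
>     <name>http.redirect.max</name>
>     <value>0</value>
>   </property>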
>
> (Note: this is an option, because there are situations when it's still
> better to immediately follow redirects, e.g. when the original url
> sets authentication cookies).
>
>
I have tested it with http.redirect.max=0 and with http.redirect.max=10.
Is there no way to filter out the duplicate content for a page if
http.redirect.max > 0?
Is there no reduce step later in the pipeline that could filter this out?
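(If I understand correctly, the index-time dedup, e.g.
"bin/nutch dedup crawl/indexes", only removes documents with identical
signatures from the index; the duplicate content is still stored in the
segments.)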
To retrieve the cookies, only the headers are needed; there is no need
to store the content.
Could the contract for redirects be changed so that only the HTTP
headers are processed and the content is ignored?
The content can then be processed in the next cycle.

Regards
