[
https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089382#comment-13089382
]
Markus Jelsma commented on NUTCH-578:
-------------------------------------
I just confirmed this is still an issue. To run some quick tests I changed the
+1 day fetch time in AbstractFetchSchedule (a sketch of that change follows the
readdb output below). With retries at the default of three I see the following:
{code}
markus@midas:~/projects/apache/nutch/branches/branch-1.4/runtime/local$
bin/nutch readdb crawl/crawldb/ -url http://localhost/~markus/forbidden.php
URL: http://localhost/~markus/forbidden.php
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Aug 23 12:26:15 CEST 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 40 seconds (0 days)
Score: 1.0
Signature: null
Metadata:
markus@midas:~/projects/apache/nutch/branches/branch-1.4/runtime/local$
bin/nutch readdb crawl/crawldb/ -url http://localhost/~markus/forbidden.php
URL: http://localhost/~markus/forbidden.php
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Aug 23 12:28:31 CEST 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 1
Retry interval: 40 seconds (0 days)
Score: 1.0
Signature: null
Metadata: _pst_: exception(16), lastModified=0: Http code=403,
url=http://localhost/~markus/forbidden.php
markus@midas:~/projects/apache/nutch/branches/branch-1.4/runtime/local$
bin/nutch readdb crawl/crawldb/ -url http://localhost/~markus/forbidden.php
URL: http://localhost/~markus/forbidden.php
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Aug 23 12:30:42 CEST 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 2
Retry interval: 40 seconds (0 days)
Score: 1.0
Signature: null
Metadata: _pst_: exception(16), lastModified=0: Http code=403,
url=http://localhost/~markus/forbidden.php
markus@midas:~/projects/apache/nutch/branches/branch-1.4/runtime/local$
bin/nutch readdb crawl/crawldb/ -url http://localhost/~markus/forbidden.php
URL: http://localhost/~markus/forbidden.php
Version: 7
Status: 3 (db_gone)
Fetch time: Tue Aug 23 12:32:49 CEST 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 3
Retry interval: 40 seconds (0 days)
Score: 1.0
Signature: null
Metadata: _pst_: exception(16), lastModified=0: Http code=403,
url=http://localhost/~markus/forbidden.php
markus@midas:~/projects/apache/nutch/branches/branch-1.4/runtime/local$
bin/nutch readdb crawl/crawldb/ -url http://localhost/~markus/forbidden.php
URL: http://localhost/~markus/forbidden.php
Version: 7
Status: 3 (db_gone)
Fetch time: Tue Aug 23 12:34:55 CEST 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 4
Retry interval: 40 seconds (0 days)
Score: 1.0
Signature: null
Metadata: _pst_: exception(16), lastModified=0: Http code=403,
url=http://localhost/~markus/forbidden.php
{code}
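For reference, the quick test above was done by shortening the one-day delay
that setPageRetrySchedule() applies; a minimal sketch against branch-1.4 (the
120-second value is an arbitrary choice for testing, not a recommendation):
{code}
// In AbstractFetchSchedule: a temporarily failed page is normally
// re-scheduled one day ahead. Shortening the delay makes the retries
// show up within minutes, as in the readdb output above.
public CrawlDatum setPageRetrySchedule(Text url, CrawlDatum datum,
    long prevFetchTime, long prevModifiedTime, long fetchTime) {
  // original: datum.setFetchTime(fetchTime + (long) SECONDS_PER_DAY * 1000);
  datum.setFetchTime(fetchTime + 120L * 1000L); // retry after ~2 minutes
  datum.setRetriesSinceFetch(datum.getRetriesSinceFetch() + 1);
  return datum;
}
{code}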
Its status is marked as DB_GONE at the third fetch, not the third retry, which
looks like a small discrepancy. The record is then simply retried again and
again, every day or whatever manual increment is used, and its Retries since
fetch counter keeps incrementing. This behaviour is different from a true
DB_GONE, which does not get its Retries incremented and whose retry interval
is increased by 50%.
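The discrepancy seems to come from the reducer's retry path: lib-http maps a
403 to ProtocolStatus.EXCEPTION (visible as _pst_: exception(16) above), the
fetcher records that as FETCH_RETRY, and that path never goes through
setPageGoneSchedule(). Paraphrasing the relevant case from CrawlDbReducer in
branch-1.4 (trimmed):
{code}
// The status flips to DB_GONE once the counter reaches db.fetch.retry.max,
// but the datum is still scheduled via setPageRetrySchedule(), so the
// retry counter keeps climbing and the interval never gets the 50% bump
// that setPageGoneSchedule() would apply.
case CrawlDatum.STATUS_FETCH_RETRY:        // temporary failure
  if (oldSet) result.setSignature(old.getSignature()); // use old signature
  result = schedule.setPageRetrySchedule(key, result, prevFetchTime,
      prevModifiedTime, fetch.getFetchTime());
  if (result.getRetriesSinceFetch() < retryMax) {
    result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
  } else {
    result.setStatus(CrawlDatum.STATUS_DB_GONE);
  }
  break;
{code}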
What behaviour is desired? The issue only describes the problem but does not
propose a solution.
> URL fetched with 403 is generated over and over again
> -----------------------------------------------------
>
> Key: NUTCH-578
> URL: https://issues.apache.org/jira/browse/NUTCH-578
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 1.0.0
> Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I
> have checked out the most recent version of the trunk as of Nov 20, 2007
> Reporter: Nathaniel Powell
> Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-578.patch, NUTCH-578_v2.patch, NUTCH-578_v3.patch,
> NUTCH-578_v4.patch, crawl-urlfilter.txt, nutch-site.xml, regex-normalize.xml,
> urls.txt
>
>
> I have not changed the following parameter in the nutch-default.xml:
> <property>
> <name>db.fetch.retry.max</name>
> <value>3</value>
> <description>The maximum number of times a url that has encountered
> recoverable errors is generated for fetch.</description>
> </property>
> However, there is a URL on the site that I'm crawling,
> www.teachertube.com, which keeps being generated over and over again for
> almost every segment (many more times than 3):
> fetch of http://www.teachertube.com/images/ failed with: Http code=403,
> url=http://www.teachertube.com/images/
> This is a bug, right?
> Thanks.