Network error during robots.txt fetch causes file to be ignored
---------------------------------------------------------------

         Key: NUTCH-105
         URL: http://issues.apache.org/jira/browse/NUTCH-105
     Project: Nutch
        Type: Bug
    Versions: 0.8-dev    
    Reporter: Rod Taylor


Earlier we had a small network glitch which prevented us from retrieving
the robots.txt file for a site we were crawling at the time:

        nutch-root-tasktracker-sbider1.sitebuildit.com.log:051005 193021
        task_m_h02y5t  Couldn't get robots.txt for
        http://www.japanesetranslator.co.uk/portfolio/:
        org.apache.commons.httpclient.ConnectTimeoutException: The host
        did not accept the connection within timeout of 10000 ms
        nutch-root-tasktracker-sbider1.sitebuildit.com.log:051005 193031
        task_m_h02y5t  Couldn't get robots.txt for
        http://www.japanesetranslator.co.uk/translation/:
        org.apache.commons.httpclient.ConnectTimeoutException: The host
        did not accept the connection within timeout of 10000 ms

Nutch then assumed that, because we were unable to retrieve the file due
to network issues, it didn't exist and we could crawl the entire
website. Nutch then successfully fetched a few pages which are listed
in the robots.txt as disallowed.

I think Nutch should keep trying to retrieve the robots.txt file
until, at the very least, we are able to establish a connection to the
host. Otherwise, the host should be ignored until the next round of fetches.
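
To make the proposal concrete, here is a minimal sketch of the decision
logic, not Nutch's actual code. The class RobotsFetchPolicy, the
RobotsFetcher interface, the Decision enum and the maxAttempts parameter
are all hypothetical names invented for illustration. The point is only
that a network failure (an IOException) is treated differently from a
definite "no such file" answer (HTTP 404); deferring on any other status
code is a conservative assumption of this sketch, not something stated
above.

```java
import java.io.IOException;

public class RobotsFetchPolicy {

    /** What the fetcher should do with a host after trying to read its robots.txt. */
    public enum Decision { APPLY_RULES, ALLOW_ALL, DEFER_HOST }

    /** Hypothetical stand-in for whatever actually performs the HTTP request. */
    public interface RobotsFetcher {
        /** Returns the HTTP status code; throws IOException on a network failure. */
        int fetchStatus(String host) throws IOException;
    }

    /**
     * Tries up to maxAttempts times to reach the host. Only a real HTTP
     * response is allowed to settle whether robots.txt exists.
     */
    public static Decision decide(RobotsFetcher fetcher, String host, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                int status = fetcher.fetchStatus(host);
                if (status == 200) {
                    return Decision.APPLY_RULES;   // file exists: parse and obey it
                }
                if (status == 404) {
                    return Decision.ALLOW_ALL;     // host answered: there is no file
                }
                return Decision.DEFER_HOST;        // assumption: be conservative otherwise
            } catch (IOException e) {
                // Connect timeout or similar: we learned nothing about
                // robots.txt, so try again instead of assuming it is absent.
            }
        }
        // Never established a connection: skip the host until the next round.
        return Decision.DEFER_HOST;
    }

    public static void main(String[] args) {
        // Simulate the failure from the log: every attempt times out.
        RobotsFetcher alwaysTimesOut = host -> {
            throw new IOException("connect timed out");
        };
        System.out.println(decide(alwaysTimesOut, "www.example.com", 3));
    }
}
```

With a fetcher that always times out, as in the log above, the sketch
returns DEFER_HOST rather than treating the site as unrestricted.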

The webmaster of japanesetranslator.co.uk filed a complaint informing us
of the issue.




_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
