Hi,

Our DSpace (v1.4.1) has recently started logging a lot of Internal Server 
Errors that appear to be caused by Googlebot. They happen like clockwork 
every 14 minutes and come in blocks (sometimes lasting several hours).

They are all associated with the IP address 66.249.71.176, which 
reverse-resolves to "crawl-66-249-71-176.googlebot.com" (the lookup I used 
is shown just after the stack trace below). The errors all have the form:

============================
2010-02-11 11:34:07,739 WARN  org.dspace.app.webui.servlet.InternalErrorServlet 
@ :session_id=9E40BFD899A2AA5C23E81404AF5B97A5:internal_error:-- URL Was: 
https://dspace.stir.ac.uk/dspace/browse-title?bottom=1893/214
-- Method: GET
-- Parameters were:
-- bottom: "1893/214"

java.lang.ClassCastException
        at 
org.dspace.app.webui.servlet.BrowseServlet.doDSGet(BrowseServlet.java:282)
        at 
org.dspace.app.webui.servlet.DSpaceServlet.processRequest(DSpaceServlet.java:151)
        at 
org.dspace.app.webui.servlet.DSpaceServlet.doGet(DSpaceServlet.java:99)
==============================
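
(For reference, here is roughly how I checked the IP address: a reverse 
lookup followed by a forward lookup of the resulting name. The exact 
output below is from memory, so treat it as a sketch:

$ host 66.249.71.176
176.71.249.66.in-addr.arpa domain name pointer crawl-66-249-71-176.googlebot.com.
$ host crawl-66-249-71-176.googlebot.com
crawl-66-249-71-176.googlebot.com has address 66.249.71.176

Since the reverse lookup points at googlebot.com and the forward lookup of 
that name comes back to the same address, it does look like a genuine 
Googlebot rather than something merely claiming to be one.)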

I have checked our robots.txt file (from /usr/src/dspace-1.4.1-source/jsp), 
which contains:

--------------------------------
User-agent: *

Disallow: /browse-author
Disallow: /items-by-author
Disallow: /browse-date
Disallow: /browse-subject
--------------------------------
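
(One thing I have not been able to verify: whether this file is what 
crawlers actually see. As I understand it, robots.txt has to be served 
from the site root, so I assume the thing to check is something like:

curl https://dspace.stir.ac.uk/robots.txt

- and confirm it returns the rules above. The copy I quoted lives under 
the source tree, so I cannot be certain it is the same file being served 
at the root.)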

I'm not that familiar with robots.txt, but I surmise that adding:

Disallow: /browse-title

- might do the trick? (A sketch of the full amended file is below, after 
the log excerpts.) However, on further investigation, it appears that the 
Googlebot is not obeying any of the existing rules either: it is accessing 
the other "Disallow"ed browse interfaces as well. I see a lot of this kind 
of thing in the DSpace logs:

2010-02-11 02:09:16,746 INFO  org.dspace.app.webui.servlet.BrowseServlet @ 
anonymous:session_id=FBC689A1F89C3B962F0D9BFEC0B4D8ED:ip_addr=66.249.71.176:browse_author:starts_with=Farkas,
 Jozsef Z.,results=21

- and the matching request in the Tomcat access logs:

66.249.71.176 - - [11/Feb/2010:02:09:16 +0000] "GET 
/dspace/browse-author?starts_with=Farkas%2C+Jozsef+Z. HTTP/1.1" 200 16836
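
For what it's worth, here is the sort of amended robots.txt I was thinking 
of trying. This is only a sketch on my part: the /dspace-prefixed lines 
are a guess, since our pages actually live under /dspace/ and I don't know 
whether crawlers match the rules against the full path or just the part 
after the webapp name. I have also dropped the blank line after the 
User-agent line, in case some parsers treat a blank line as the end of a 
record:

--------------------------------
User-agent: *
Disallow: /browse-title
Disallow: /browse-author
Disallow: /items-by-author
Disallow: /browse-date
Disallow: /browse-subject
Disallow: /dspace/browse-title
Disallow: /dspace/browse-author
Disallow: /dspace/items-by-author
Disallow: /dspace/browse-date
Disallow: /dspace/browse-subject
--------------------------------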


So, two (related?) issues here: Googlebot is triggering Internal Server 
Errors when it crawls the site, and it also appears that it is not obeying 
the robots.txt file at all :-( - or am I misunderstanding something?

Given that this has only just started happening (we have had no trouble 
with bots or spiders in the past), I was wondering if anyone else has 
noticed anything like this with Googlebot, or is aware of anything that 
may have changed recently to cause it?

More importantly, rather than me randomly trying things: are any 
bot/robots.txt experts out there able to tell me how I can stop this while 
still allowing legitimate crawling of the site for indexing purposes? My 
current best guess at a crawler-specific record is sketched below, in case 
that helps frame the question.
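
My understanding is that a crawler-specific record would look something 
like the following, and that a bot which matches a specific record ignores 
the "*" record entirely, so the Disallow lines would need repeating there. 
Happy to be corrected on any of this:

--------------------------------
User-agent: Googlebot
Disallow: /dspace/browse-title
Disallow: /dspace/browse-author
Disallow: /dspace/items-by-author
Disallow: /dspace/browse-date
Disallow: /dspace/browse-subject
--------------------------------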

Cheers,

Mike

Michael White 
eLearning Developer
Centre for eLearning Development (CeLD) 
3V3a, Cottrell
University of Stirling 
Stirling SCOTLAND 
FK9 4LA 
Email: [email protected] 
Tel: +44 (0) 1786 466877 
Fax: +44 (0) 1786 466880 
http://www.is.stir.ac.uk/celd/



-- 
The Sunday Times Scottish University of the Year 2009/2010
The University of Stirling is a charity registered in Scotland, 
 number SC 011159.


