Hi,

Our DSpace (v1.4.1) has recently started logging a lot of Internal Server Errors that appear to be caused by Googlebot. They happen like clockwork every 14 minutes and come in blocks (sometimes lasting several hours).
They are all associated with the IP address 66.249.71.176, which, when looked up, appears to be "crawl-66-249-71-176.googlebot.com". The errors all have the form:

============================
2010-02-11 11:34:07,739 WARN org.dspace.app.webui.servlet.InternalErrorServlet @ :session_id=9E40BFD899A2AA5C23E81404AF5B97A5:internal_error:-- URL Was: https://dspace.stir.ac.uk/dspace/browse-title?bottom=1893/214
-- Method: GET
-- Parameters were:
-- bottom: "1893/214"

java.lang.ClassCastException
        at org.dspace.app.webui.servlet.BrowseServlet.doDSGet(BrowseServlet.java:282)
        at org.dspace.app.webui.servlet.DSpaceServlet.processRequest(DSpaceServlet.java:151)
        at org.dspace.app.webui.servlet.DSpaceServlet.doGet(DSpaceServlet.java:99)
==============================

I have checked our robots.txt file (from /usr/src/dspace-1.4.1-source/jsp), which contains:

--------------------------------
User-agent: *
Disallow: /browse-author
Disallow: /items-by-author
Disallow: /browse-date
Disallow: /browse-subject
--------------------------------

I'm not that familiar with robots.txt, but I surmise that adding:

Disallow: /browse-title

- might do the trick?

However, on further investigation, it appears that the googlebot is not obeying any of the rules, as it is accessing other "Disallow"ed browse interfaces. I see a lot of this kind of thing in the DSpace logs:

2010-02-11 02:09:16,746 INFO org.dspace.app.webui.servlet.BrowseServlet @ anonymous:session_id=FBC689A1F89C3B962F0D9BFEC0B4D8ED:ip_addr=66.249.71.176:browse_author:starts_with=Farkas, Jozsef Z.,results=21

- and mapping this to the Tomcat logs:

66.249.71.176 - - [11/Feb/2010:02:09:16 +0000] "GET /dspace/browse-author?starts_with=Farkas%2C+Jozsef+Z. HTTP/1.1" 200 16836

So, 2 (related?) issues here: googlebot is causing errors when it is crawling the site, and it also appears to me that the googlebot is not obeying the robots.txt file at all :-( - or am I misunderstanding anything?
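One way to check how the quoted rules line up with the URLs in the Tomcat log is to feed them to a robots.txt parser. The following is a minimal sketch using Python's standard-library urllib.robotparser; it shows how that parser interprets the rules, which is not necessarily identical to how Googlebot does. The helper names (build_parser, is_allowed) and the "Googlebot" agent string are illustrative assumptions, and the proposed browse-title line has been added to the rules for testing.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt quoted above, plus the proposed
# browse-title line (an addition for testing, not yet in the real file).
ROBOTS_TXT = """\
User-agent: *
Disallow: /browse-author
Disallow: /items-by-author
Disallow: /browse-date
Disallow: /browse-subject
Disallow: /browse-title
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Hypothetical helper: parse robots.txt text without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, path: str, agent: str = "Googlebot") -> bool:
    """Hypothetical helper: True if this parser would let `agent` fetch `path`."""
    return parser.can_fetch(agent, path)


parser = build_parser(ROBOTS_TXT)

# Path exactly as it appears in the Tomcat access log (with the /dspace prefix).
logged_path = "/dspace/browse-author?starts_with=Farkas%2C+Jozsef+Z."
# The same request without the /dspace prefix, for comparison.
unprefixed_path = "/browse-author?starts_with=Farkas%2C+Jozsef+Z."

for path in (logged_path, unprefixed_path):
    print(path, "->", "allowed" if is_allowed(parser, path) else "disallowed")
```

This parser compares each Disallow value as a prefix of the URL path, starting from the first "/", so a rule written as /browse-author does not cover a request for /dspace/browse-author. Whether that accounts for what Googlebot is doing here also depends on which robots.txt (if any) is actually served at the top level of the host, which this sketch does not check.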
Given that this has only just started happening (we have had no trouble with bots or spiders in the past), I was wondering if anyone else had noticed anything like this related to the googlebot, or if anyone was aware of anything that may have changed to cause this to start happening?

More importantly, rather than me randomly trying things, are any bot/robots.txt experts out there able to tell me how I can stop this but still allow legitimate crawling of the site for indexing purposes?

Cheers,

Mike

Michael White
eLearning Developer
Centre for eLearning Development (CeLD)
3V3a, Cottrell
University of Stirling
Stirling SCOTLAND FK9 4LA
Email: [email protected]
Tel: +44 (0) 1786 466877
Fax: +44 (0) 1786 466880
http://www.is.stir.ac.uk/celd/

_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech

