Are you sure you want to disallow Google from crawling your browse-titles or
your entire repository? I like being able to find our items on Google.

Move your robots.txt like this (sorry if this is duh-obvious but I had that
moment myself a few months back so no worries!):

# cp [tomcat]/webapps/dspace/robots.txt [tomcat]/webapps/ROOT/robots.txt
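The reason the root copy matters: crawlers only ever fetch /robots.txt from the
top of the site, and its Disallow rules match URL paths from the root. Here's a
quick sketch using Python's standard urllib.robotparser (the rules and the
/dspace context path are illustrative assumptions, not your exact file):

```python
# Sketch: how robots.txt Disallow rules match URL paths, using the
# standard-library parser. The rules and /dspace context path are
# illustrative assumptions, not the exact production file.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /browse-author
Disallow: /browse-date
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Rules match by path prefix from the site root, so a DSpace deployed
# under /dspace is NOT covered by "Disallow: /browse-author":
print(rp.can_fetch("Googlebot", "/browse-author"))         # False (blocked)
print(rp.can_fetch("Googlebot", "/dspace/browse-author"))  # True (not matched)
```

So if the live file only blocks /browse-author but the site lives under
/dspace/, a well-behaved crawler will treat everything as allowed - which
would explain the crawling you're seeing.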

I've been getting a similar one, related to collection pages' handles:

-- URL Was: http://irserver.ucd.ie/dspace/browse-title?top=10197/853
-- Method: GET
-- Parameters were:
-- top: "10197/853"

Bad robot indeed!

Joseph Greene
Institutional Repository Project Manager
325 James Joyce Library
University College Dublin
Belfield, Dublin 4

353 (0)1 716 7398
[email protected]
http://irserver.ucd.ie/dspace/

Message: 1
Date: Thu, 11 Feb 2010 12:30:04 +0000
From: Michael White <[email protected]>
Subject: [Dspace-tech] Bad robot! Googlebot and Internal Server Errors
To: "[email protected]"
        <[email protected]>
Message-ID:
        <7c43cb6f3460394f9b5236c0f68d7b6a5d6baa4...@exch2007.ad.stir.ac.uk>
Content-Type: text/plain; charset="us-ascii"

Hi,

Our DSpace (v1.4.1) has recently started logging a lot of Internal Server
Errors that appear to be caused by a Googlebot. They happen like clockwork
every 14 minutes and come in blocks (sometimes lasting several hours).

They are all associated with the IP Address 66.249.71.176, which, when
looked up, appears to be "crawl-66-249-71-176.googlebot.com". The errors all
have the form:

============================
2010-02-11 11:34:07,739 WARN
org.dspace.app.webui.servlet.InternalErrorServlet @
:session_id=9E40BFD899A2AA5C23E81404AF5B97A5:internal_error:-- URL Was:
https://dspace.stir.ac.uk/dspace/browse-title?bottom=1893/214
-- Method: GET
-- Parameters were:
-- bottom: "1893/214"

java.lang.ClassCastException
        at org.dspace.app.webui.servlet.BrowseServlet.doDSGet(BrowseServlet.java:282)
        at org.dspace.app.webui.servlet.DSpaceServlet.processRequest(DSpaceServlet.java:151)
        at org.dspace.app.webui.servlet.DSpaceServlet.doGet(DSpaceServlet.java:99)
==============================

I have checked our robots.txt file (from /usr/src/dspace-1.4.1-source/jsp),
which contains:

--------------------------------
User-agent: *

Disallow: /browse-author
Disallow: /items-by-author
Disallow: /browse-date
Disallow: /browse-subject
--------------------------------

I'm not that familiar with robots.txt, but I surmise that adding:

Disallow: /browse-title

- might do the trick? However, on further investigation it appears that the
googlebot is not obeying any of the existing rules either, as it is accessing
other "Disallow"ed browse interfaces as well. I see a lot of this kind of
thing in the DSpace logs:

2010-02-11 02:09:16,746 INFO  org.dspace.app.webui.servlet.BrowseServlet @
anonymous:session_id=FBC689A1F89C3B962F0D9BFEC0B4D8ED:ip_addr=66.249.71.176:
browse_author:starts_with=Farkas, Jozsef Z.,results=21

- and mapping this to the Tomcat logs:

66.249.71.176 - - [11/Feb/2010:02:09:16 +0000] "GET
/dspace/browse-author?starts_with=Farkas%2C+Jozsef+Z. HTTP/1.1" 200 16836


So, two (related?) issues here: googlebot is causing errors when it is
crawling the site, and it also appears that googlebot is not obeying the
robots.txt file at all :-( - or am I misunderstanding something?
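For what it's worth: since crawlers only fetch robots.txt from the site root
and match Disallow prefixes against the full URL path, a root-level file for a
DSpace served under /dspace would presumably need the context prefix on each
rule. A sketch (paths assumed from the URLs above, not a tested config):

```
User-agent: *
Disallow: /dspace/browse-author
Disallow: /dspace/items-by-author
Disallow: /dspace/browse-date
Disallow: /dspace/browse-subject
Disallow: /dspace/browse-title
```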

Given that this has only just started happening (we have had no trouble with
bots or spiders in the past), I was wondering if anyone else had noticed
anything like this related to the googlebot, or if anyone was aware of
anything that may have changed to cause this to start happening?

More importantly, rather than me randomly trying things, any bot/robots.txt
experts out there able to tell me how I can stop this but still allow
legitimate crawling of the site for indexing purposes?

Cheers,

Mike

Michael White
eLearning Developer
Centre for eLearning Development (CeLD)
3V3a, Cottrell, University of Stirling
Stirling, SCOTLAND
FK9 4LA
Email: [email protected]
Tel: +44 (0) 1786 466877
Fax: +44 (0) 1786 466880
http://www.is.stir.ac.uk/celd/


_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
