You found my favorite oldie bug! I'm guessing that item 1893/214 has
been withdrawn or deleted. 1.4.1 throws a fit when a crawler tries to
browse a page that should begin with a withdrawn or deleted item.

I've forgotten the fix (other than "upgrade to 1.4.2, in which the bug
was squashed"), but it may make you feel better to know that this bug
can *only* be caused by crawlers; a human being browsing your site
will never encounter it.

Dorothea

On Thu, Feb 11, 2010 at 6:30 AM, Michael White <michael.wh...@stir.ac.uk> wrote:
> Hi,
>
> Our DSpace (v1.4.1) has recently started logging a lot of Internal Server 
> Errors that appear to be caused by Googlebot. They happen like clockwork 
> every 14 minutes and come in blocks (sometimes lasting several hours).
>
> They are all associated with the IP Address 66.249.71.176, which, when looked 
> up, appears to be "crawl-66-249-71-176.googlebot.com". The errors all have 
> the form:
>
> ============================
> 2010-02-11 11:34:07,739 WARN  org.dspace.app.webui.servlet.InternalErrorServlet @ :session_id=9E40BFD899A2AA5C23E81404AF5B97A5:internal_error:-- URL Was: https://dspace.stir.ac.uk/dspace/browse-title?bottom=1893/214
> -- Method: GET
> -- Parameters were:
> -- bottom: "1893/214"
>
> java.lang.ClassCastException
>        at org.dspace.app.webui.servlet.BrowseServlet.doDSGet(BrowseServlet.java:282)
>        at org.dspace.app.webui.servlet.DSpaceServlet.processRequest(DSpaceServlet.java:151)
>        at org.dspace.app.webui.servlet.DSpaceServlet.doGet(DSpaceServlet.java:99)
> ==============================
>
> I have checked our robots.txt file (from /usr/src/dspace-1.4.1-source/jsp), 
> which contains:
>
> --------------------------------
> User-agent: *
>
> Disallow: /browse-author
> Disallow: /items-by-author
> Disallow: /browse-date
> Disallow: /browse-subject
> --------------------------------
>
> I'm not that familiar with robots.txt, but I surmise that adding:
>
> Disallow: /browse-title
>
> - might do the trick? However, on further investigation, it appears that 
> Googlebot is not obeying any of the rules, as it is accessing other 
> "Disallow"ed browse interfaces. I see a lot of this kind of thing in 
> the DSpace logs:
>
> 2010-02-11 02:09:16,746 INFO  org.dspace.app.webui.servlet.BrowseServlet @ anonymous:session_id=FBC689A1F89C3B962F0D9BFEC0B4D8ED:ip_addr=66.249.71.176:browse_author:starts_with=Farkas, Jozsef Z.,results=21
>
> - and mapping this to the Tomcat logs:
>
> 66.249.71.176 - - [11/Feb/2010:02:09:16 +0000] "GET /dspace/browse-author?starts_with=Farkas%2C+Jozsef+Z. HTTP/1.1" 200 16836
>
>
> So, 2 (related?) issues here - googlebot is causing errors when it is 
> crawling the site, and it also appears to me that the googlebot is not 
> obeying the robots.txt file at all :-( - or am I misunderstanding anything?
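
One possible explanation, offered as an assumption rather than a confirmed
cause: robots.txt rules are matched as prefixes of the URL path counted from
the host root, and crawlers only read the file at the host root (here that
would be https://dspace.stir.ac.uk/robots.txt). The Tomcat log above shows
requests under /dspace/, so a rule such as "Disallow: /browse-author" would
not match "/dspace/browse-author". The sketch below uses Python's standard
urllib.robotparser to compare the rules as shipped with a /dspace-prefixed
variant. The prefixed rule set, the build_parser helper, and the sample item
URL are illustrative assumptions, not taken from any DSpace documentation.

```python
from urllib.robotparser import RobotFileParser


def build_parser(paths):
    """Build a parser holding one 'User-agent: *' record that disallows paths."""
    lines = ["User-agent: *"] + [f"Disallow: {p}" for p in paths]
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


# Rules as quoted from the robots.txt shipped with the DSpace 1.4.1 source.
AS_SHIPPED = ["/browse-author", "/items-by-author", "/browse-date", "/browse-subject"]

# Hypothetical variant: same rules with the /dspace context path prepended,
# plus the browse-title rule proposed in the message.
PREFIXED = ["/dspace" + p for p in AS_SHIPPED] + ["/dspace/browse-title"]

shipped = build_parser(AS_SHIPPED)
prefixed = build_parser(PREFIXED)

# Path requested by the crawler in the Tomcat log (query string omitted).
print(shipped.can_fetch("Googlebot", "/dspace/browse-author"))    # True: rule does not match
print(prefixed.can_fetch("Googlebot", "/dspace/browse-author"))   # False: now blocked
print(prefixed.can_fetch("Googlebot", "/dspace/browse-title"))    # False: now blocked
# Illustrative item URL; item pages stay crawlable under the prefixed rules.
print(prefixed.can_fetch("Googlebot", "/dspace/handle/1893/214")) # True
```

If this assumption holds for your setup, the prefixed file would need to be
served from the host root rather than from within the /dspace webapp. Note
that build_parser writes the record without a blank line between User-agent
and the Disallow lines; as far as I can tell, Python's parser treats a blank
line as the end of a record, so it may read the shipped file differently
from how Googlebot does.
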
>
> Given that this has only just started happening (we have had no trouble with 
> bots or spiders in the past), I was wondering if anyone else had noticed 
> anything like this related to the googlebot, or if anyone was aware of 
> anything that may have changed to cause this to start happening?
>
> More importantly, rather than me randomly trying things, any bot/robots.txt 
> experts out there able to tell me how I can stop this but still allow 
> legitimate crawling of the site for indexing purposes?
>
> Cheers,
>
> Mike
>
> Michael White
> eLearning Developer
> Centre for eLearning Development (CeLD)
> 3V3a, Cottrell
> University of Stirling
> Stirling SCOTLAND
> FK9 4LA
> Email: michael.wh...@stir.ac.uk
> Tel: +44 (0) 1786 466877
> Fax: +44 (0) 1786 466880
> http://www.is.stir.ac.uk/celd/
>
>
>
> --
> The Sunday Times Scottish University of the Year 2009/2010
> The University of Stirling is a charity registered in Scotland,
>  number SC 011159.
>
>
> ------------------------------------------------------------------------------
> SOLARIS 10 is the OS for Data Centers - provides features such as DTrace,
> Predictive Self Healing and Award Winning ZFS. Get Solaris 10 NOW
> http://p.sf.net/sfu/solaris-dev2dev
> _______________________________________________
> DSpace-tech mailing list
> DSpace-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>



-- 
Dorothea Salo                ds...@library.wisc.edu
Digital Repository Librarian      AIM: mindsatuw
University of Wisconsin
Rm 218, Memorial Library
(608) 262-5493
