This would be a good opportunity to construct a reasonably good default 
robots.txt file and add it to the documentation set.

At http://ses.library.usyd.edu.au/robots.txt, I have the following:

 User-agent: *
 Crawl-Delay: 11
 Disallow: /browse
 Disallow: /browse?
 Disallow: /browse-title
 Disallow: /bitstream
 Disallow: /dspace/
 Disallow: /feed/
 Disallow: /feedback
 Disallow: /password-login
 #Disallow: /retrieve/
 #Disallow: /handle/
 #Disallow: /oai/

The /bitstream rule is intended to deter crawlers from triggering the Catalina 
error and the DSpace warning.

Which lines should I re-use from Shane's example, and why?

The lines I have are based on my best guess at what a crawler ought not to be 
interested in.
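
One way to sanity-check rules like these before deploying them is to run them through a parser. The sketch below uses Python's standard urllib.robotparser, which applies simple prefix matching to each Disallow line. The rules are copied from the file above; the sample paths (handle 2123/100, example.pdf, and so on) are hypothetical and only there for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt quoted above (commented-out lines omitted).
ROBOTS_TXT = """\
User-agent: *
Crawl-Delay: 11
Disallow: /browse
Disallow: /browse?
Disallow: /browse-title
Disallow: /bitstream
Disallow: /dspace/
Disallow: /feed/
Disallow: /feedback
Disallow: /password-login
"""

# Hypothetical paths, chosen only to exercise the rules.
SAMPLE_PATHS = [
    "/handle/2123/100",
    "/browse-title",
    "/browse?type=title",
    "/bitstream/2123/100/1/example.pdf",
    "/retrieve/123",
]


def check_paths(robots_text, paths, agent="*"):
    """Return {path: True if the agent may fetch it, else False}."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {path: parser.can_fetch(agent, path) for path in paths}


if __name__ == "__main__":
    for path, allowed in check_paths(ROBOTS_TXT, SAMPLE_PATHS).items():
        print(("allow   " if allowed else "disallow"), path)
```

Because matching is by prefix, a single "Disallow: /browse" line already covers /browse? and /browse-title under this parser, so the two extra lines are harmless but redundant. Individual crawlers may interpret rules differently, so this is a check of intent rather than a guarantee of behaviour.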

Thanks in advance.

--
Van Ly : University of Sydney Library


-----Original Message-----
From: Robert Tansley [mailto:roberttans...@google.com]
Sent: Thu 15/01/2009 7:52 AM
To: Shane Beers
Cc: dspace-tech Tech; Jeffrey Trimble
Subject: Re: [Dspace-tech] Google bots and web crawlers
 
As of DSpace 1.5, sitemaps are supported which allow search engines to
selectively crawl only new items, while massively reducing the server
load:

http://www.dspace.org/1_5_1Documentation/ch03.html#N10B44

Unfortunately, it seems that relatively few DSpace instances actually
use this feature.

I would strongly recommend against blocking /dspace/bitstream/* and
/dspace/html/*, as doing so prevents crawlers from accessing the full text
of items, which is vital for effective indexing. As of DSpace 1.4.2 (and
possibly earlier), these URLs support the If-Modified-Since header,
which means that crawlers don't re-retrieve files if they haven't been
changed since the last crawl.
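
To illustrate the conditional-request behaviour described above, here is a minimal sketch of what a well-behaved crawler does on a repeat visit, written with Python's standard library. It sends the standard HTTP If-Modified-Since header and treats a 304 Not Modified response as "nothing to download". The function names are hypothetical, and last_crawl is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def build_conditional_request(url, last_crawl):
    """Build a GET request asking for the body only if it changed after last_crawl."""
    # HTTP dates are expressed in GMT, e.g. "Wed, 14 Jan 2009 14:20:00 GMT".
    stamp = format_datetime(last_crawl.astimezone(timezone.utc), usegmt=True)
    return Request(url, headers={"If-Modified-Since": stamp})


def fetch_if_changed(url, last_crawl):
    """Return the body as bytes, or None if the server answers 304 Not Modified."""
    try:
        with urlopen(build_conditional_request(url, last_crawl)) as response:
            return response.read()
    except HTTPError as err:
        if err.code == 304:
            # Unchanged since the last crawl: no body is transferred.
            return None
        raise
```

With this pattern, an unchanged bitstream costs the server only a header exchange rather than a full file transfer, which is why leaving these URLs crawlable need not mean heavy load.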

Rob

On Wed, Jan 14, 2009 at 14:20, Shane Beers <sbe...@gmu.edu> wrote:
> Jeff:
> We had an issue with our local Google instance crawling our DSpace
> installation and causing huge issues. I re-wrote the robots.txt to disallow
> anything besides the item pages themselves - no browsing pages or search
> pages and whatnot. Here is a copy of ours:
> User-agent: *
> Disallow: /dspace/browse-author
> Disallow: /dspace/browse-author*
> Disallow: /dspace/items-by-author
> Disallow: /dspace/items-by-author*
> Disallow: /dspace/browse-date*
> Disallow: /dspace/browse-date
> Disallow: /dspace/browse-title*
> Disallow: /dspace/browse-title
> Disallow: /dspace/feedback
> Disallow: /dspace/feedback/*
> Disallow: /dspace/items-by-subject
> Disallow: /dspace/items-by-subject/*
> Disallow: /dspace/handle/1920/*/browse-title*
> Disallow: /dspace/handle/1920/*/browse-author*
> Disallow: /dspace/handle/1920/*/browse-subject*
> Disallow: /dspace/handle/1920/*/browse-date*
> Disallow: /dspace/handle/1920/*/items-by-subject*
> Disallow: /dspace/handle/1920/*/items-by-author*
> Disallow: /dspace/bitstream/*
> Disallow: /dspace/image/*
> Disallow: /dspace/html/*
> Disallow: /dspace/simple-search*
> This would likely live in your Tomcat directory.
> Shane Beers
> Digital Repository Services Librarian
> George Mason University
> sbe...@gmu.edu
> http://mars.gmu.edu
>
> On Jan 14, 2009, at 1:09 PM, Jeffrey Trimble wrote:
>
> Is there something simple I can place in the jsp that will prohibit the
> crawlers from
> using my server resources?
> TIA,
> Jeff
>
> Jeffrey Trimble
> Systems Librarian
> Maag Library
> Youngstown State University
> 330-941-2483 (Office)
> jtrim...@cc.ysu.edu
> http://www.maag.ysu.edu
>
>
> ------------------------------------------------------------------------------
> This SF.net email is sponsored by:
> SourceForge Community
> SourceForge wants to tell your story.
> http://p.sf.net/sfu/sf-spreadtheword
> _______________________________________________
> DSpace-tech mailing list
> DSpace-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>
>
