As of DSpace 1.5, sitemaps are supported, which allow search engines to selectively crawl only new or changed items while greatly reducing server load:
http://www.dspace.org/1_5_1Documentation/ch03.html#N10B44

Unfortunately, it seems that relatively few DSpace instances actually use this feature.

I would strongly recommend against blocking /dspace/bitstream/* and /dspace/html/*, as these rules prevent crawlers from accessing the full text of items, which is vital for effective indexing. As of DSpace 1.4.2 (and possibly earlier), these URLs support the If-Modified-Since header, which means that crawlers don't re-retrieve files that haven't changed since the last crawl.

Rob

On Wed, Jan 14, 2009 at 14:20, Shane Beers <[email protected]> wrote:
> Jeff:
>
> We had an issue with our local Google instance crawling our DSpace
> installation and causing huge issues. I rewrote the robots.txt to disallow
> anything besides the item pages themselves -- no browsing pages or search
> pages and whatnot. Here is a copy of ours:
>
> User-agent: *
> Disallow: /dspace/browse-author
> Disallow: /dspace/browse-author*
> Disallow: /dspace/items-by-author
> Disallow: /dspace/items-by-author*
> Disallow: /dspace/browse-date*
> Disallow: /dspace/browse-date
> Disallow: /dspace/browse-title*
> Disallow: /dspace/browse-title
> Disallow: /dspace/feedback
> Disallow: /dspace/feedback/*
> Disallow: /dspace/items-by-subject
> Disallow: /dspace/items-by-subject/*
> Disallow: /dspace/handle/1920/*/browse-title*
> Disallow: /dspace/handle/1920/*/browse-author*
> Disallow: /dspace/handle/1920/*/browse-subject*
> Disallow: /dspace/handle/1920/*/browse-date*
> Disallow: /dspace/handle/1920/*/items-by-subject*
> Disallow: /dspace/handle/1920/*/items-by-author*
> Disallow: /dspace/bitstream/*
> Disallow: /dspace/image/*
> Disallow: /dspace/html/*
> Disallow: /dspace/simple-search*
>
> This likely would live in your tomcat directory.
>
> Shane Beers
> Digital Repository Services Librarian
> George Mason University
> [email protected]
> http://mars.gmu.edu
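[Editor's note: for anyone adapting Shane's rules, Python's standard-library robots.txt parser can sanity-check a ruleset before deployment. This is an illustrative sketch, not part of DSpace: the handle prefix 1920 is GMU's and is used here only as an example path, and note that `RobotFileParser` does plain prefix matching only -- it does NOT interpret the `*` wildcard extension that Google honours, so the wildcard lines are omitted below.]

```python
# Sanity-check a robots.txt ruleset with Python's stdlib parser.
# Caveat: RobotFileParser matches rules as literal path prefixes,
# so Google-style "*" wildcard rules are left out of this sketch.
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /dspace/browse-author
Disallow: /dspace/browse-title
Disallow: /dspace/items-by-author
Disallow: /dspace/simple-search
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

# Item pages remain crawlable; browse/search pages do not.
print(rp.can_fetch("*", "/dspace/handle/1920/123"))  # True  (item page)
print(rp.can_fetch("*", "/dspace/browse-author"))    # False (blocked)
print(rp.can_fetch("*", "/dspace/simple-search"))    # False (blocked)
```

Because the stdlib parser ignores wildcards, a quick check like this also catches rules that only work for crawlers implementing the wildcard extension.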
> On Jan 14, 2009, at 1:09 PM, Jeffrey Trimble wrote:
>
> Is there something simple I can place in the jsp that will prohibit the
> crawlers from using my server resources?
>
> TIA,
> Jeff
>
> Jeffrey Trimble
> Systems Librarian
> Maag Library
> Youngstown State University
> 330-941-2483 (Office)
> [email protected]
> http://www.maag.ysu.edu

------------------------------------------------------------------------------
This SF.net email is sponsored by:
SourceForge Community
SourceForge wants to tell your story.
http://p.sf.net/sfu/sf-spreadtheword
_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
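[Editor's note: as a footnote on the If-Modified-Since behaviour Rob describes, the conditional-GET decision a server makes can be sketched as below. This is illustrative only -- it is not DSpace's code, and `status_for` is a hypothetical helper -- but it shows why a well-behaved crawler does not re-download an unchanged bitstream.]

```python
# Sketch of conditional-GET handling: compare the crawler's
# If-Modified-Since header with the resource's last-modified time
# and answer 304 Not Modified instead of re-sending the file.
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional

def status_for(last_modified: datetime, if_modified_since: Optional[str]) -> int:
    """Return 304 if the client's cached copy is still current, else 200."""
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # unparsable header: send the full response
        if last_modified <= since:
            return 304  # Not Modified: crawler keeps its cached copy
    return 200

bitstream_mtime = datetime(2009, 1, 1, tzinfo=timezone.utc)
crawl_date = format_datetime(datetime(2009, 1, 14, tzinfo=timezone.utc))

print(status_for(bitstream_mtime, crawl_date))  # 304 (unchanged since last crawl)
print(status_for(bitstream_mtime, None))        # 200 (first visit, full download)
```

The 304 path is what keeps repeat crawls cheap: only the headers cross the wire, not the bitstream itself.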

