Hi,

On 09/04/13 01:50, Mark H. Wood wrote:
> Well, one thing I'd do is to complain to the people who run the GSA.
> IMHO it shouldn't be hitting any site that hard. If there isn't a way
> to throttle it back...there should be.

There is -- using If-Modified-Since and a 200 vs 304 response code. The problem is that DSpace (XMLUI) could be considered broken when it comes to that: it behaves differently for requests it considers to come from bots vs other requests. For bot requests, it uses the item's last modified date; for other requests, it uses the date that the item page was last cached, IIRC. This is why adding gsa-crawler to the list of known bots in sitemap.xmap helps. This behaviour is specific to XMLUI, so the folks in this thread who use JSPUI need to check whether something similar exists in their case.

> DSpace can generate sitemaps. Can the bot in question be taught to
> calm down and just use these?

Telling gsa-crawler about the sitemap (which, just to be absolutely clear, is something completely different from the XMLUI sitemap.xmap despite the similarity in names) did help. Since we did both, I'm not sure which measure was ultimately most useful in taming gsa-crawler.

cheers,
Andrea

-- 
Dr Andrea Schweer
IRR Technical Specialist, ITS Information Systems
The University of Waikato, Hamilton, New Zealand
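
To illustrate the If-Modified-Since point above, here is a minimal sketch of the 200 vs 304 decision. It is not DSpace code: the function name `conditional_status` and all dates are made up for the example, and it only assumes standard HTTP conditional-request behaviour. It shows why the choice of reference date matters: compared against the item's last modified date the crawler gets a 304, but compared against a more recent page-cache date it gets a full 200 again.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_status(if_modified_since, reference_date):
    """Return 304 if the client's copy is still current, otherwise 200.

    if_modified_since: raw If-Modified-Since header value, or None.
    reference_date: timezone-aware datetime the server compares against
                    (hypothetical: item last-modified date or page-cache date).
    """
    if not if_modified_since:
        return 200  # unconditional request: always send the full page
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: treat the request as unconditional
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds first.
    unchanged = reference_date.replace(microsecond=0) <= since
    return 304 if unchanged else 200


# Made-up dates for the example.
item_modified = datetime(2013, 1, 1, tzinfo=timezone.utc)  # item last changed
page_cached = datetime(2013, 4, 9, tzinfo=timezone.utc)    # page last cached
crawler_copy = format_datetime(
    datetime(2013, 3, 1, tzinfo=timezone.utc), usegmt=True
)  # value the crawler sends in If-Modified-Since

status_for_bot = conditional_status(crawler_copy, item_modified)
status_for_other = conditional_status(crawler_copy, page_cached)
print(status_for_bot, status_for_other)
```

With these dates the first call yields 304 and the second 200, i.e. a crawler that is not recognised as a bot would re-download the page even though the item itself has not changed.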

