I'm currently working on a project at The Open University in the UK called CORE 
(http://core-project.kmi.open.ac.uk/) which harvests metadata and full-text 
from institutional repositories at UK Universities, and then does analysis on 
the text to calculate (and make available openly) a measure of 'semantic 
similarity' between papers. The idea is to enable discovery of similar items 
(or I guess, dissimilar items if you wanted).

One of the questions this raises is what we are/aren't allowed to do in terms 
of harvesting full-text. While I realise we could get into legal stuff here, at 
the moment we want to put that question to one side. Instead we want to 
consider what Google, and other search engines, do, the mechanisms available to 
control this, and what we do, and the equivalent mechanisms - our starting 
point is that we don't feel we should be at a disadvantage to a web search 
engine in our harvesting and use of repository records.

Of course, Google and other crawlers can crawl the parts of the repository 
that are on the open web, and 'good' crawlers will obey the contents of 
robots.txt.
We use OAI-PMH, and while we often see (usually general and sometimes 
contradictory) statements about what we can/can't do with the contents of a 
repository (or a specific record), it feels like there isn't a nice simple 
mechanism for a repository to say "don't harvest this bit".
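To make that concrete: the only machine-readable metadata OAI-PMH exposes per item is the record header (identifier, datestamp, setSpec), with no slot for a "please don't harvest" directive. Here's a quick sketch of parsing one page of a ListRecords response; the namespace and element names come from the OAI-PMH 2.0 spec, but the code itself is just an illustration, not our actual harvester:

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def parse_list_records(xml_text):
    """Parse one page of an OAI-PMH ListRecords response.

    Returns the record identifiers and the resumptionToken (None on
    the last page). Note there is nothing in the record header that
    says whether harvesting a given record is permitted.
    """
    root = ET.fromstring(xml_text)
    ids = [h.findtext(OAI + "identifier") for h in root.iter(OAI + "header")]
    token_el = root.find(".//" + OAI + "resumptionToken")
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token
```

A harvester just loops, re-requesting with verb=ListRecords&resumptionToken=... until no token comes back, and ends up with everything.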

To take an example (chosen at random-ish, and not to pick on anyone), the 
University of Southampton's repository has the following robots.txt:
User-agent: *
Sitemap: http://eprints.soton.ac.uk/sitemap.xml
Disallow: /cgi/
Disallow: /66183/
This essentially allows Google et al to crawl the whole repository, with the 
exception of a single paper.
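To illustrate how a 'good' crawler applies those rules, here's a quick sketch using Python's standard-library robots.txt parser. I'm parsing a local copy of the rules rather than fetching the live file, so the results don't depend on Southampton's robots.txt staying the same:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse a local copy of the rules (a real crawler would instead do
# rp.set_url("http://eprints.soton.ac.uk/robots.txt"); rp.read())
rp.parse("""\
User-agent: *
Disallow: /cgi/
Disallow: /66183/
""".splitlines())

# Any other paper is fair game...
print(rp.can_fetch("*", "http://eprints.soton.ac.uk/12345/"))  # True
# ...but /66183/ is off limits
print(rp.can_fetch("*", "http://eprints.soton.ac.uk/66183/"))  # False
```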
However, the OAI-PMH interface allows the whole repository to be harvested, and 
there seems to be nothing special about that particular paper, and nothing to 
say "please don't harvest".
Where there are statements via OAI-PMH about what is/isn't allowed, we find 
they are usually expressed as textual policies intended for human consumption, 
not designed to be (easily) machine readable.
We are left thinking it would be helpful if there were an equivalent of 
robots.txt for OAI-PMH interfaces. I've been asked to make a proposal for 
discussion, and I'd be interested in any ideas/comments code4lib people have. 
At the moment I'm wondering about:
a) a simple file like robots.txt which can allow/disallow harvesters 
(equivalent to User-agent), and allow/disallow records using a list of record 
ids (or set ids?)
b) use the X-Robots-Tag HTTP header
The latter has the advantage of being an existing way of doing it, but I wonder 
how fiddly it might be to implement, while a simple file in a known location 
might be easier. Anyway, thoughts appreciated - and alternatives to these of 
course. One obvious alternative that I keep coming back to is 'forget OAI-PMH, 
just crawl the web' ...
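For what it's worth, option (b) may be less fiddly on the harvester side than it sounds. The directives below ('noindex', 'none', and agent-scoped forms like 'googlebot: noindex') are the ones Google documents for X-Robots-Tag; the agent name 'core' is made up for illustration, and this is only a sketch of the check a harvester might do on each full-text URL, not something we've implemented:

```python
import urllib.request

def blocked_by_robots_tag(header_values, agent="core"):
    """Return True if any X-Robots-Tag value forbids indexing,
    either globally ("noindex, nofollow") or scoped to this
    agent ("core: noindex")."""
    for raw in header_values:
        value = raw.strip().lower()
        if ":" in value:
            scope, _, rest = value.partition(":")
            if scope.strip() != agent:
                continue  # directive aimed at some other crawler
            value = rest
        directives = {d.strip() for d in value.split(",")}
        if "noindex" in directives or "none" in directives:
            return True
    return False

def allows_harvest(url, agent="core"):
    """HEAD the full-text URL and check its X-Robots-Tag headers."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return not blocked_by_robots_tag(
            resp.headers.get_all("X-Robots-Tag") or [], agent)
```

The nice thing is the repository side needs nothing new beyond emitting one header on the relevant responses; the open question is whether repository platforms make that easy to configure per record.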
Thanks,
Owen
Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: [email protected]
Telephone: 0121 288 6936
