Re: [CODE4LIB] "Repositories", OAI-PMH and web crawling

Owen Stephens Fri, 24 Feb 2012 08:32:44 -0800

Thanks both...

Kyle said: "If someone goes to the trouble of making things available via a 
protocol that exists only to make things harvestable and
then doesn't want it harvested, you can dismiss them ..."

True - but that's essentially what Southampton's configuration seems to say.

Thomas said: "The M in PMH still stands for Metadata, right?  So opening an 
OAI-PMH server implicitly says you're willing to share metadata.  I can 
certainly sympathize with sites wanting to do that but not necessarily wanting 
to offer anything more than "normal" end-user access to full text."

This is a fair point - but I've yet to see an example of a robots.txt file that 
makes this distinction - that is, in general Google is not being told to not 
crawl and cache pdfs, while being granted explicit permission to crawl the 
metadata, no matter what the OAI-PMH situation.

Kyle said: "OAI-PMH runs on top of HTTP, so anything robots.txt already applies 
-- i.e. if they want you to crawl metadata only but not download the objects 
themselves because they don't want to deal with the load or bandwidth charges, 
this should be indicated for all crawlers."

OK - this suggests a way forward for me. Although I don't think we can regard 
robots.txt as applying across the board for OAI-PMH (in the Southampton 
example, the OAI-PMH endpoint itself is disallowed by robots.txt), it seems to 
make sense that, given a resource identifier in the metadata, we could use 
robots.txt (and I guess potentially X-Robots-Tag, assuming most of the 
resources are not simple HTML) to see whether a web crawler is permitted to 
crawl it, and so make the right decision about what we do.
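
As a rough illustration of that check, here is a minimal sketch using only the Python standard library. The host name, the paths, the sample robots.txt rules and the user agent string "example-harvester" are all made up for the example and are not taken from any real repository. The X-Robots-Tag handling is deliberately simplified: it ignores the optional per-user-agent prefix, and treating "noarchive" as a reason not to fetch is just one possible policy.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the OAI-PMH endpoint is disallowed,
# but the documents themselves are not.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi/oai2
Disallow: /private/
"""


def robots_url_for(resource_url):
    """Return the robots.txt URL for the host that serves resource_url."""
    parts = urlsplit(resource_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def allowed_by_robots(resource_url, robots_txt, user_agent="example-harvester"):
    """True if the given robots.txt text permits user_agent to fetch resource_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, resource_url)


def allowed_by_x_robots_tag(header_value):
    """True unless the X-Robots-Tag header value forbids indexing or caching.

    Simplification: directives scoped to a named user agent
    (e.g. "googlebot: noindex") are not interpreted here.
    """
    if not header_value:
        return True
    directives = {d.strip().lower() for d in header_value.split(",")}
    return not (directives & {"noindex", "noarchive", "none"})


if __name__ == "__main__":
    # Identifier as it might appear in a harvested metadata record (made up).
    pdf = "http://repo.example.org/123/1/paper.pdf"
    print(robots_url_for(pdf))
    print(allowed_by_robots(pdf, SAMPLE_ROBOTS_TXT))
```

In practice the harvester would fetch the robots.txt found at robots_url_for(...) once per host and cache it, then look at the X-Robots-Tag header of the response for each resource before deciding whether to keep the full text.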

That sounds vaguely sensible (although I'm still left thinking maybe we should 
just use a web crawler and ignore OAI-PMH, but I guess this way we maybe get the 
best of both worlds).

Thanks again (and of course further thoughts welcome)

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: [email protected]
Telephone: 0121 288 6936

On 24 Feb 2012, at 14:45, Thomas Dowling wrote:

> On 02/24/2012 09:25 AM, Kyle Banerjee wrote:
> 
>>> We use OAI-PMH, and while we often see (usually general and sometimes
>>> contradictory) statements about what we can/can't do with the contents of a
>>> repository (or a specific record), it feels like there isn't a nice simple
>>> mechanism for a repository to say "don't harvest this bit".
>>> 
>> 
>> I would argue there is -- the whole point of OAI-PMH is to make stuff
>> available for harvesting. If someone goes to the trouble of making things
>> available via a protocol that exists only to make things harvestable and
>> then doesn't want it harvested, you can dismiss them as being totally
>> mental.
> 
> The M in PMH still stands for Metadata, right?  So opening an OAI-PMH
> server implicitly says you're willing to share metadata.  I can certainly
> sympathize with sites wanting to do that but not necessarily wanting to
> offer anything more than "normal" end-user access to full text.
> 
> That said, in a world with unfriendly bots, the repository should still be
> making informed choices about controlling full text crawlers (robots.txt,
> meta tags, HTTP cache directives, etc etc.).
> 
> 
> -- 
> Thomas Dowling
> [email protected]
