Re: [CODE4LIB] "Repositories", OAI-PMH and web crawling

Jonathan Rochkind Thu, 01 Mar 2012 08:35:38 -0800

If your HTML includes embedded semantic data using HTML5 microdata or RDFa or something similar (using a standard vocabulary -- the standard for repositories seems to be DC-based, since that's often all you can get out of OAI-PMH anyway) -- then web crawling combined with site maps probably provides about as much functionality as OAI-PMH.
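
To make the "embedded semantic data" idea concrete, here is a minimal sketch of what a crawler could pull out of a splash page marked up with HTML5 microdata. The sample markup, the property names and the ItempropCollector class are all made up for illustration; it ignores nested items and is not a full microdata parser.

```python
from html.parser import HTMLParser

# Hypothetical splash-page fragment, invented for this example.
SAMPLE_PAGE = """
<div itemscope itemtype="http://schema.org/CreativeWork">
  <h1 itemprop="name">An Example Thesis</h1>
  <span itemprop="author">A. Author</span>
  <a itemprop="license" href="http://creativecommons.org/licenses/by/3.0/">CC BY</a>
</div>
"""


class ItempropCollector(HTMLParser):
    """Collect flat (itemprop, value) pairs; nesting and itemscope are ignored."""

    def __init__(self):
        super().__init__()
        self.properties = []
        self._pending = None  # itemprop name still waiting for its text
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("itemprop")
        if name is None:
            return
        if tag == "meta":
            # <meta itemprop=... content=...> carries its value in content
            self.properties.append((name, attrs.get("content") or ""))
        elif tag in ("a", "link"):
            # links carry their value in href
            self.properties.append((name, attrs.get("href") or ""))
        else:
            # anything else: the value is the element's text
            self._pending = name
            self._text = []

    def handle_data(self, data):
        if self._pending is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Simplification: the first end tag closes the pending property.
        if self._pending is not None:
            self.properties.append((self._pending, "".join(self._text).strip()))
            self._pending = None


def extract_itemprops(html):
    collector = ItempropCollector()
    collector.feed(html)
    return collector.properties


if __name__ == "__main__":
    for name, value in extract_itemprops(SAMPLE_PAGE):
        print(name, "=", value)
```

Whether the result is any use to a harvester still depends on repositories agreeing on a vocabulary, which is the real issue.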

But embedded semantic metadata is key. However, even in the current OAI-PMH-considered-standard-best-practice world, the document-level metadata from repositories is often _extremely_ basic, as well as often unreliable. This severely limits what harvesters can do with what they harvest.

So it's not necessarily really about OAI-PMH vs web crawling. It's about sufficient and sufficiently reliable metadata. And even in the OAI-PMH world, we rarely have it.

Note for instance that OAIster and similar harvesters are _unable to know_ whether a harvested document is open access full text or not. That seems like something you'd want to tell people in their search results, right? They might only want stuff that they can actually access. But it's not really possible, because most (all?) repos do not reveal any standard metadata in their OAI-PMH that would specify this.
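
As a rough illustration of the problem, here is a sketch of what a harvester sees in a typical oai_dc record. The record below is invented for the example, and dc_values and access_hint are hypothetical helper names; the only real things are the two namespace URIs.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical record, made up for illustration.
SAMPLE_RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An Example Thesis</dc:title>
  <dc:creator>Author, A.</dc:creator>
  <dc:identifier>http://repository.example.org/handle/123</dc:identifier>
</oai_dc:dc>"""


def dc_values(record_xml, element):
    """Return the non-empty text values of one Dublin Core element."""
    root = ET.fromstring(record_xml)
    return [
        el.text.strip()
        for el in root.findall(f"dc:{element}", NS)
        if el.text and el.text.strip()
    ]


def access_hint(record_xml):
    """Return the dc:rights strings, or None if the record has none.

    Even when something comes back it is free text, so this is only a
    hint, not a reliable open-access flag.
    """
    return dc_values(record_xml, "rights") or None


if __name__ == "__main__":
    print(dc_values(SAMPLE_RECORD, "identifier"))
    print(access_hint(SAMPLE_RECORD))
```

For the sample record access_hint returns None: the harvester gets an identifier pointing at a splash page and nothing standard that says whether the full text behind it is open.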


On 3/1/2012 9:38 AM, Ian Ibbotson wrote:

Owen...

Just wanted to say that, whilst I've been silent since my initial response,
I'm not sure I agree with all the viewpoints presented here. From the point
of view of (for example) CultureGrid, I'm not sure what has been done could
have been pragmatically achieved solely with web crawling as it's described
in this thread. Don't have a problem with anything that's been written here.
It certainly represents a great cross-section of viewpoints. However, from a
JISC discovery perspective, I don't want to contribute to any confirmation
bias that we could dispose of pesky old OAI. I'd be interested in providing
a counter-point to any "Best practice" document that suggested we could.

Ian.

On Thu, Mar 1, 2012 at 12:36 PM, Owen Stephens <[email protected]> wrote:

Thanks Jason and Ed,

I suspect within this project we'll keep using OAI-PMH because we've got
tight deadlines and the other project strands (which do stuff with the
harvested content) need time from the developer. At the moment it looks
like we will probably combine OAI-PMH with web crawling (using nutch) - so
use data from the

However, that said, one of the things we are meant to be doing is offering
recommendations or good practice guidelines back to the (repository)
community based on our experience. If we have time I would love to tackle
the questions (a)-(d) that you highlight here - perhaps especially (a) and
(c). Since this particular project is part of the wider JISC 'Discovery'
programme (http://discovery.ac.uk and tech principles at
http://technicalfoundations.ukoln.info/guidance/technical-principles-discovery-ecosystem)
- from which one of the main themes might be summarised as 'work with the
web' these questions are definitely relevant.

I need to look at Jason's stuff again as I think this definitely has
parallels with some of the Discovery work, as, of course, does some of the
recent discussion on here about the question of the indexing of library
catalogues by search engines.

Thanks again to all who have contributed to the discussion - very useful

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: [email protected]
Telephone: 0121 288 6936

On 1 Mar 2012, at 11:42, Ed Summers wrote:

On Mon, Feb 27, 2012 at 12:15 PM, Jason Ronallo <[email protected]> wrote:

I'd like to bring this back to your suggestion to just forget OAI-PMH
and crawl the web. I think that's probably the long-term way forward.

I definitely had the same thoughts while reading this thread. Owen,
are you forced to stay within the context of OAI-PMH because you are
working with existing institutional repositories? I don't know if it's
appropriate, or if it has been done before, but as part of your work
it would be interesting to determine:

a) how many IRs allow crawling (robots.txt or lack thereof)
b) how many IRs support crawling with a sitemap
c) how many IR HTML splashpages use the rel-license [1] pattern
d) how many IRs support syndication (RSS/Atom) to publish changes

If you could do this in a semi-automated way for the UK it would be
great if you could then apply it to IRs around the world. It would
also align really nicely with the sort of work that Jason has been
doing around CAPS [2].
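
A semi-automated pass over checks (a)-(d) might look something like the sketch below. It is only an outline under some assumptions: the function and class names are invented here, fetching is left out (each function takes text that has already been downloaded), sitemaps are only detected via "Sitemap:" lines in robots.txt, and syndication is only detected via feed autodiscovery links in the page head.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


def allows_crawling(robots_txt, url, agent="*"):
    """(a) True if this robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def sitemap_urls(robots_txt):
    """(b) Sitemap URLs advertised through 'Sitemap:' lines."""
    urls = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


class LinkRelScanner(HTMLParser):
    """(c) and (d): collect rel="license" links and feed autodiscovery links."""

    def __init__(self):
        super().__init__()
        self.licenses = []
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        rels = (attrs.get("rel") or "").lower().split()
        if "license" in rels:
            self.licenses.append(href)
        link_type = (attrs.get("type") or "").lower()
        if tag == "link" and "alternate" in rels and link_type in FEED_TYPES:
            self.feeds.append(href)


def audit_page(html):
    """Scan one splash page and report the license and feed links found."""
    scanner = LinkRelScanner()
    scanner.feed(html)
    return {"licenses": scanner.licenses, "feeds": scanner.feeds}
```

Run over a list of repository base URLs, the four results per repository would give the counts asked for in (a)-(d); the fetching, politeness and error handling are the part this leaves out.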

It seems to me that there might be an opportunity to educate digital
repository managers about better aligning their content w/ the Web ...
instead of trying to cook up new standards. I imagine this is way out
of scope for what you are currently doing--if so, maybe this can be
your next grant :-)

//Ed

[1] http://microformats.org/wiki/rel-license
[2] https://github.com/jronallo/capsys
