Thanks Ian,

I agree that it is clear from this discussion that there are differing viewpoints, and also very different requirements depending on the context and desired outcomes.
I think I said earlier in the thread - I'm not against niche solutions, they just make me want to double check that they are justified. For me I'd say the jury is still out on 'crawl' vs 'harvest' - but I think it definitely needs more investigation and thought - and of course different problems require different solutions.

It would be interesting to try to go through the case for OAI-PMH, especially specific examples where it has achieved something that would have been difficult/impossible to do with more general solutions. Not sure if that could be done here on list, or better/easier through other discussion - or both (possibly over that beer? :)

From the CORE project, any 'best practice' would be focussed on institutional research publication repositories, and it seems highly unlikely that we will make a recommendation on 'crawl' vs 'harvest' - we just won't have time to do enough work on this to understand the pros/cons of these even from our own singular perspective. I think any recommendations will be more along the lines of ensuring robots.txt is consistent with other policies; the impact of using splash pages as opposed to links to actual resources in the OAI-PMH feed; configuring access to embargoed papers (as per Raffaele's suggestion); how to deal with multi-part resources etc. Anything coming out of the project would, of course, be just one project's recommendations for JISC to consider, no more than that.

Cheers,

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: [email protected]
Telephone: 0121 288 6936

On 1 Mar 2012, at 14:38, Ian Ibbotson wrote:

> Owen...
>
> Just wanted to say that, whilst I've been silent since my initial response,
> I'm not sure I agree with all the viewpoints presented here. From the point
> of view of (for example) CultureGrid, I'm not sure what has been done could
> have been pragmatically achieved solely with web crawling as it's described
> in this thread.
> Don't have a problem with anything that's been written here. It
> certainly represents a great cross-section of viewpoints. However, from a
> JISC Discovery perspective, I don't want to contribute to any confirmation
> bias that we could dispose of pesky old OAI. I'd be interested in providing
> a counter-point to any "Best practice" document that suggested we could.
>
> Ian.
>
> On Thu, Mar 1, 2012 at 12:36 PM, Owen Stephens <[email protected]> wrote:
>
>> Thanks Jason and Ed,
>>
>> I suspect within this project we'll keep using OAI-PMH because we've got
>> tight deadlines and the other project strands (which do stuff with the
>> harvested content) need time from the developer. At the moment it looks
>> like we will probably combine OAI-PMH with web crawling (using nutch) - so
>> use data from the
>>
>> However, that said, one of the things we are meant to be doing is offering
>> recommendations or good practice guidelines back to the (repository)
>> community based on our experience. If we have time I would love to tackle
>> the questions (a)-(d) that you highlight here - perhaps especially (a) and
>> (c). Since this particular project is part of the wider JISC 'Discovery'
>> programme (http://discovery.ac.uk and tech principles at
>> http://technicalfoundations.ukoln.info/guidance/technical-principles-discovery-ecosystem)
>> - from which one of the main themes might be summarised as 'work with the
>> web' these questions are definitely relevant.
>>
>> I need to look at Jason's stuff again as I think this definitely has
>> parallels with some of the Discovery work, as, of course, does some of the
>> recent discussion on here about the question of the indexing of library
>> catalogues by search engines.
>>
>> Thanks again to all who have contributed to the discussion - very useful
>>
>> Owen
>>
>> Owen Stephens
>> Owen Stephens Consulting
>> Web: http://www.ostephens.com
>> Email: [email protected]
>> Telephone: 0121 288 6936
>>
>> On 1 Mar 2012, at 11:42, Ed Summers wrote:
>>
>>> On Mon, Feb 27, 2012 at 12:15 PM, Jason Ronallo <[email protected]> wrote:
>>>> I'd like to bring this back to your suggestion to just forget OAI-PMH
>>>> and crawl the web. I think that's probably the long-term way forward.
>>>
>>> I definitely had the same thoughts while reading this thread. Owen,
>>> are you forced to stay within the context of OAI-PMH because you are
>>> working with existing institutional repositories? I don't know if it's
>>> appropriate, or if it has been done before, but as part of your work
>>> it would be interesting to determine:
>>>
>>> a) how many IRs allow crawling (robots.txt or lack thereof)
>>> b) how many IRs support crawling with a sitemap
>>> c) how many IR HTML splashpages use the rel-license [1] pattern
>>> d) how many IRs support syndication (RSS/Atom) to publish changes
>>>
>>> If you could do this in a semi-automated way for the UK it would be
>>> great if you could then apply it to IRs around the world. It would
>>> also align really nicely with the sort of work that Jason has been
>>> doing around CAPS [2].
>>>
>>> It seems to me that there might be an opportunity to educate digital
>>> repository managers about better aligning their content w/ the Web ...
>>> instead of trying to cook up new standards. I imagine this is way out
>>> of scope for what you are currently doing--if so, maybe this can be
>>> your next grant :-)
>>>
>>> //Ed
>>>
>>> [1] http://microformats.org/wiki/rel-license
>>> [2] https://github.com/jronallo/capsys
>>
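For what it's worth, Ed's checks (a)-(d) quoted above look like they could be semi-automated with very little code. The following is only a rough sketch in Python (standard library only), not anything the CORE project has built. It assumes the robots.txt and one splash page have already been fetched for each repository. The function and class names, the "survey-bot" agent string and the repo.example.org URLs are all made up for illustration. It only spots sitemaps declared in robots.txt, and only spots feeds advertised through <link rel="alternate"> autodiscovery, so it would under-count both.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# MIME types commonly used for RSS/Atom feed autodiscovery.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


def allows_crawling(robots_txt, url, agent="survey-bot"):
    """(a) True if robots.txt (or an empty one) lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def sitemap_urls(robots_txt):
    """(b) Sitemap URLs declared in robots.txt (other locations not checked)."""
    urls = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


class LinkRelScanner(HTMLParser):
    """Collects (c) rel="license" links and (d) feed autodiscovery links."""

    def __init__(self):
        super().__init__()
        self.has_license = False
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "license" in rels:
            self.has_license = True
        if "alternate" in rels and (attr.get("type") or "").lower() in FEED_TYPES:
            self.feeds.append(attr.get("href"))


def survey_repository(robots_txt, splash_html, sample_url, agent="survey-bot"):
    """Run checks (a)-(d) on already-fetched text for one repository."""
    scanner = LinkRelScanner()
    scanner.feed(splash_html)
    return {
        "allows_crawling": allows_crawling(robots_txt, sample_url, agent),
        "sitemaps": sitemap_urls(robots_txt),
        "rel_license": scanner.has_license,
        "feeds": scanner.feeds,
    }


if __name__ == "__main__":
    # Hypothetical sample data standing in for a real repository.
    robots = (
        "User-agent: *\n"
        "Disallow: /private/\n"
        "Sitemap: http://repo.example.org/sitemap.xml\n"
    )
    splash = (
        '<html><head><link rel="alternate" type="application/atom+xml" '
        'href="/feed.atom"></head><body>'
        '<a rel="license" href="http://creativecommons.org/licenses/by/3.0/">'
        "CC BY</a></body></html>"
    )
    print(survey_repository(robots, splash, "http://repo.example.org/papers/1"))
```

Keeping the checks as functions over already-fetched text means the fetching (and any politeness/rate limiting) can be handled separately, and the same checks could be run over a list of repository base URLs to get the counts Ed asks about.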
