Re: [CODE4LIB] "Repositories", OAI-PMH and web crawling

Owen Stephens Thu, 01 Mar 2012 07:12:25 -0800

Thanks Ian,

Agree that it is clear from this discussion that there are differing viewpoints 
and also very different requirements depending on the context and desired 
outcomes.


I think I said earlier in the thread - I'm not against niche solutions, they 
just make me want to double check that they are justified. For me I'd say the 
jury is still out on 'crawl' vs 'harvest' - but I think it definitely needs 
more investigation and thought - and of course different problems require 
different solutions. It would be interesting to try to go through the case for 
OAI-PMH, especially specific examples where it has achieved something that 
would have been difficult/impossible to do with more general solutions. Not 
sure if that could be done here on list, or better/easier through other 
discussion - or both (possibly over that beer? :)

From the CORE project, any 'best practice' would be focussed on institutional
research publication repositories, and it seems highly unlikely that we will
make a recommendation on 'crawl' vs 'harvest' - we just won't have time to do
enough work on this to understand the pros/cons of these even from our own
singular perspective. I think any recommendations are more along the lines of
ensuring robots.txt is consistent with other policies; the impact of using
splash pages as opposed to links to actual resources in the OAI-PMH feed;
configuring access to embargoed papers (as per Raffaele's suggestion); how to
deal with multi-part resources etc. Anything coming out of the project would,
of course, be just one project's recommendations for JISC to consider, no more
than that.
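
As a rough illustration of the robots.txt point: a minimal sketch, in Python,
of checking whether the links a repository advertises in its OAI-PMH feed are
ones its own robots.txt allows a crawler to fetch. The robots.txt content, the
example.ac.uk URLs and the function name blocked_urls are all made up for
illustration; only the standard library urllib.robotparser module is used.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="*"):
    """Return the URLs that robots.txt forbids user_agent from fetching."""
    parser = RobotFileParser()
    # Parse robots.txt from a string so the check needs no network access.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt and feed links, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /bitstream/
"""

FEED_LINKS = [
    "http://repository.example.ac.uk/handle/123/456",
    "http://repository.example.ac.uk/bitstream/123/456/paper.pdf",
]

if __name__ == "__main__":
    for url in blocked_urls(ROBOTS_TXT, FEED_LINKS):
        print("advertised in feed but disallowed for crawlers:", url)
```

In this made-up case the splash page is crawlable but the PDF is not, which is
the kind of inconsistency a recommendation might ask repository managers to
look for.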

Cheers,

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: [email protected]
Telephone: 0121 288 6936

On 1 Mar 2012, at 14:38, Ian Ibbotson wrote:

> Owen...
> 
> Just wanted to say that, whilst I've been silent since my initial response,
> I'm not sure I agree with all the viewpoints presented here. From the point
> of view of, for example, CultureGrid, I'm not sure what has been done could
> have been pragmatically achieved solely with web crawling as it's described
> in this thread. Don't have a problem with anything that's been written here.
> It certainly represents a great cross-section of viewpoints. However, from a
> JISC Discovery perspective, I don't want to contribute to any confirmation
> bias that we could dispose of pesky old OAI. I'd be interested in providing
> a counter-point to any "Best practice" document that suggested we could.
> 
> Ian.
> 
> On Thu, Mar 1, 2012 at 12:36 PM, Owen Stephens <[email protected]> wrote:
> 
>> Thanks Jason and Ed,
>> 
>> I suspect within this project we'll keep using OAI-PMH because we've got
>> tight deadlines and the other project strands (which do stuff with the
>> harvested content) need time from the developer. At the moment it looks
>> like we will probably combine OAI-PMH with web crawling (using nutch) - so
>> use data from the
>> 
>> However, that said, one of the things we are meant to be doing is offering
>> recommendations or good practice guidelines back to the (repository)
>> community based on our experience. If we have time I would love to tackle
>> the questions (a)-(d) that you highlight here - perhaps especially (a) and
>> (c). Since this particular project is part of the wider JISC 'Discovery'
>> programme (http://discovery.ac.uk and tech principles at
>> http://technicalfoundations.ukoln.info/guidance/technical-principles-discovery-ecosystem)
>> - from which one of the main themes might be summarised as 'work with the
>> web' these questions are definitely relevant.
>> 
>> I need to look at Jason's stuff again as I think this definitely has
>> parallels with some of the Discovery work, as, of course, does some of the
>> recent discussion on here about the question of the indexing of library
>> catalogues by search engines.
>> 
>> Thanks again to all who have contributed to the discussion - very useful
>> 
>> Owen
>> 
>> Owen Stephens
>> Owen Stephens Consulting
>> Web: http://www.ostephens.com
>> Email: [email protected]
>> Telephone: 0121 288 6936
>> 
>> On 1 Mar 2012, at 11:42, Ed Summers wrote:
>> 
>>> On Mon, Feb 27, 2012 at 12:15 PM, Jason Ronallo <[email protected]> wrote:
>>>> I'd like to bring this back to your suggestion to just forget OAI-PMH
>>>> and crawl the web. I think that's probably the long-term way forward.
>>> 
>>> I definitely had the same thoughts while reading this thread. Owen,
>>> are you forced to stay within the context of OAI-PMH because you are
>>> working with existing institutional repositories? I don't know if it's
>>> appropriate, or if it has been done before, but as part of your work
>>> it would be interesting to determine:
>>> 
>>> a) how many IRs allow crawling (robots.txt or lack thereof)
>>> b) how many IRs support crawling with a sitemap
>>> c) how many IR HTML splash pages use the rel-license [1] pattern
>>> d) how many IRs support syndication (RSS/Atom) to publish changes
>>> 
>>> If you could do this in a semi-automated way for the UK it would be
>>> great if you could then apply it to IRs around the world. It would
>>> also align really nicely with the sort of work that Jason has been
>>> doing around CAPS [2].
>>> 
>>> It seems to me that there might be an opportunity to educate digital
>>> repository managers about better aligning their content w/ the Web ...
>>> instead of trying to cook up new standards. I imagine this is way out
>>> of scope for what you are currently doing--if so, maybe this can be
>>> your next grant :-)
>>> 
>>> //Ed
>>> 
>>> [1] http://microformats.org/wiki/rel-license
>>> [2] https://github.com/jronallo/capsys
>>
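
For what it's worth, Ed's checks (a)-(d) quoted above could probably be
semi-automated along these lines. This is only a sketch under stated
assumptions: it takes the text of a repository's robots.txt and the HTML of
one splash page as strings (fetching them is left out), it treats "crawling
allowed" as simply being allowed to fetch "/", and it looks for feeds only via
<link rel="alternate"> autodiscovery. The names survey and SplashPageScanner,
and the sample data, are hypothetical. RobotFileParser.site_maps() needs
Python 3.8 or later.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class SplashPageScanner(HTMLParser):
    """Record rel-license links and RSS/Atom autodiscovery links."""

    def __init__(self):
        super().__init__()
        self.has_rel_license = False
        self.has_feed = False

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        # (c) rel-license: an <a> or <link> with rel="license" and an href.
        if "license" in rels and attr.get("href"):
            self.has_rel_license = True
        # (d) feed autodiscovery: <link rel="alternate" type="...rss/atom...">.
        if (tag == "link" and "alternate" in rels
                and attr.get("type", "").lower() in FEED_TYPES):
            self.has_feed = True


def survey(robots_txt, splash_html, user_agent="*"):
    """Answer checks (a)-(d) for one repository from already-fetched text."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    scanner = SplashPageScanner()
    scanner.feed(splash_html)
    return {
        "a_crawl_allowed": robots.can_fetch(user_agent, "/"),
        "b_sitemap": bool(robots.site_maps()),  # Sitemap: lines in robots.txt
        "c_rel_license": scanner.has_rel_license,
        "d_feed": scanner.has_feed,
    }


# Hypothetical sample data, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Sitemap: http://repository.example.ac.uk/sitemap.xml
"""

SAMPLE_HTML = """\
<html><head>
<link rel="alternate" type="application/atom+xml" href="/feed/atom">
</head><body>
<a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC BY</a>
</body></html>
"""

if __name__ == "__main__":
    print(survey(SAMPLE_ROBOTS, SAMPLE_HTML))
```

Running survey() over a list of repositories and counting the True values per
key would give the "how many IRs" figures, with the caveat that a sitemap not
declared in robots.txt, or a feed not advertised in the page head, would be
missed by this approach.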
