Re: [CODE4LIB] "Repositories", OAI-PMH and web crawling

Owen Stephens Mon, 27 Feb 2012 02:25:55 -0800

On 26 Feb 2012, at 14:42, Godmar Back wrote:

> May I ask a side question and make a side observation regarding the
> harvesting of full text of the object to which an OAI-PMH record refers?
> 
> In general, is the idea to use the <dc:source>/text() element, treat it as
> a URL, and then expect to find the object there (provided that there was a
> suitable <dc:type> and <dc:format> element)?
> 
I think dc:identifier is usually used to provide a URL for the item being 
described. The examples at 
http://www.openarchives.org/OAI/openarchivesprotocol.html#dublincore follow 
this, and the UK E-Thesis schema 
(http://naca.central.cranfield.ac.uk/ethos-oai/2.0/oai-uketd.xml) does as well.


> Example: http://scholar.lib.vt.edu/theses/OAI/cgi-bin/index.pl allows the
> harvesting of ETD metadata.  Yet, its metadata reads:
> 
> <ListRecords>
>   ....
>   <metadata>
>     <dc>
>        <type>text</type>
>        <format>application/pdf</format>
>        <source>
> http://scholar.lib.vt.edu/theses/available/etd-3345131939761081/</source>
>    ....
> 
> 
> When one visits
> http://scholar.lib.vt.edu/theses/available/etd-3345131939761081/ however
> there is no 'text' document of type 'application/pdf' - rather, it's an
> HTML title page that embeds links to one or more PDF documents, such as
> http://scholar.lib.vt.edu/theses/available/etd-3345131939761081/unrestricted/Walker_1.pdf
> to Walker_5.pdf.
> 
> Is VT's ETD OAI implementation deficient, or is OAI-PMH simply not set up
> to allow the harvesting of full-text without what would basically amount to
> crawling the ETD title page, or other repository-specific mechanisms?

This issue is certainly not unique to VT - we've come across this as part of 
our project. While the OAI-PMH record may point at the PDF, it can also point 
to an intermediary page. This seems to be standard practice in some instances - 
I think because there is a desire, or even requirement, that a user should see 
the intermediary page (which may contain rights information etc.) before 
viewing the full-text item. There may also be an issue where multiple files 
exist for the same item - maybe several data files and a pdf of the thesis 
attached to the same metadata record - as the metadata via OAI-PMH may not 
describe each asset.
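
As a rough illustration of what "crawling the intermediary page" would involve, here is a minimal Python sketch that collects PDF links from a landing page. It only uses the standard library. The class and function names, and the sample markup, are hypothetical (the markup is loosely modelled on the VT example quoted above, not copied from the real page), and a real harvester would also need to fetch the page and respect any access restrictions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of <a href="..."> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            self.pdf_links.append(urljoin(self.base_url, href))


def pdf_links_in(html_text, base_url):
    """Return the PDF links found in html_text, resolved against base_url."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html_text)
    return collector.pdf_links


# Hypothetical landing-page markup, for illustration only.
SAMPLE_PAGE = """
<html><body>
<h1>Title page</h1>
<a href="unrestricted/Walker_1.pdf">Walker_1.pdf</a>
<a href="unrestricted/Walker_2.pdf">Walker_2.pdf</a>
<a href="/theses/">Back to index</a>
</body></html>
"""

BASE = "http://scholar.lib.vt.edu/theses/available/etd-3345131939761081/"
links = pdf_links_in(SAMPLE_PAGE, BASE)
```

This kind of scraping is repository-specific guesswork: it finds the files, but it cannot tell which PDF is the thesis and which is supplementary material.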

I suspect you'd see some specific approaches depending on the default settings 
in different packages. For example, this (highly truncated) record from 
Southampton (who use EPrints) differentiates the full-text link from the 
repository page by using dc:relation for the latter and dc:identifier for the 
former:

<record>
    <header>
      <identifier>oai:eprints.soton.ac.uk:66183</identifier>
    </header>
    <metadata>
      <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
          xmlns:dc="http://purl.org/dc/elements/1.1/"
          xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
        <dc:title>A methodology for developing high damping materials with application to noise reduction of railway track</dc:title>
        <dc:creator>Ahmad, Nazirah</dc:creator>
        <dc:format>application/pdf</dc:format>
        <dc:identifier>http://eprints.soton.ac.uk/66183/2451/P2503.pdf</dc:identifier>
        <dc:relation>http://eprints.soton.ac.uk/66183/</dc:relation>
      </oai_dc:dc>
    </metadata>
</record>
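
For a record shaped like the one above, a harvester could separate the two links with something like the following minimal Python sketch. It assumes the EPrints-style convention just described (dc:identifier for the file, dc:relation for the repository page), which other packages do not necessarily follow. The function names are hypothetical, and the embedded sample is a cut-down copy of the record with the xsi:schemaLocation attribute dropped so that it parses on its own.

```python
import xml.etree.ElementTree as ET

# Dublin Core element namespace, in ElementTree's {uri}tag notation.
DC = "{http://purl.org/dc/elements/1.1/}"

# Cut-down copy of the Southampton record shown above.
RECORD = """<record>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:format>application/pdf</dc:format>
      <dc:identifier>http://eprints.soton.ac.uk/66183/2451/P2503.pdf</dc:identifier>
      <dc:relation>http://eprints.soton.ac.uk/66183/</dc:relation>
    </oai_dc:dc>
  </metadata>
</record>"""


def dc_values(record_xml, element):
    """Return the text of every dc:<element> found anywhere in the record."""
    root = ET.fromstring(record_xml)
    return [el.text.strip() for el in root.iter(DC + element) if el.text]


def guess_links(record_xml):
    """Split HTTP links by element, assuming the EPrints-style convention."""
    identifiers = [v for v in dc_values(record_xml, "identifier")
                   if v.startswith("http")]
    relations = [v for v in dc_values(record_xml, "relation")
                 if v.startswith("http")]
    return {"full_text": identifiers, "landing_page": relations}


result = guess_links(RECORD)
```

The labels "full_text" and "landing_page" are only as good as the convention behind them, so the result is a guess rather than something the metadata guarantees.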

By contrast, this one from Cambridge (DSpace) uses a single 'handle' as the 
identifier, which just links to the repository page. Also note that this 'item' 
actually consists of two files - a video and a transcript in MS Word:

<record>
    <header>
      <identifier>oai:www.dspace.cam.ac.uk:1810/29</identifier>
    </header>
    <metadata>
      <oai_dc:dc xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
          xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
          xmlns:dc="http://purl.org/dc/elements/1.1/"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <dc:title>Interview with Professor Lucy Mair</dc:title>
        <dc:creator>Macfarlane, Alan</dc:creator>
        <dc:type>Video</dc:type>
        <dc:format>25088 bytes</dc:format>
        <dc:format>413196863 bytes</dc:format>
        <dc:format>application/msword</dc:format>
        <dc:format>application/octet-stream</dc:format>
        <dc:identifier>http://www.dspace.cam.ac.uk/handle/1810/29</dc:identifier>
        <dc:language>en_GB</dc:language>
      </oai_dc:dc>
    </metadata>
</record>
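
One way a harvester might notice that a record like this does not point directly at a file is sketched below in Python. Both checks are heuristics I am assuming for illustration, not anything defined by OAI-PMH or Dublin Core: a URL whose path has no file extension is treated as a probable landing page, and dc:format values are split into those that look like MIME types and everything else (here, the byte counts).

```python
import posixpath
import re
from urllib.parse import urlparse

# Loose pattern for "type/subtype"; an assumption, not a full MIME grammar.
MIME_RE = re.compile(r"^[\w.+-]+/[\w.+-]+$")


def split_formats(formats):
    """Separate dc:format values that look like MIME types from the rest."""
    mime = [f for f in formats if MIME_RE.match(f)]
    other = [f for f in formats if not MIME_RE.match(f)]
    return mime, other


def looks_like_file(url):
    """Heuristic: True if the URL path ends in something with an extension."""
    path = urlparse(url).path
    return posixpath.splitext(path)[1] != ""


# Values copied from the Cambridge record above.
formats = ["25088 bytes", "413196863 bytes",
           "application/msword", "application/octet-stream"]
mime_types, other_values = split_formats(formats)
is_file = looks_like_file("http://www.dspace.cam.ac.uk/handle/1810/29")
```

Two MIME types on one record with a single extensionless identifier is a reasonable hint that the item has several files behind a repository page, though nothing in the metadata says which size belongs to which file.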

> 
> On a related note, regarding rights. As a faculty member, I regularly sign
> ETD approval forms.  At Tech, students have three options to choose from:
> (a) open and immediate access, (b) restricted to VT for 1 year, (c)
> withhold access completely for 1 year for patent/security purposes.  The
> current form does not allow student authors to address whether the
> full-text of their dissertation may be harvested for the purposes of
> full-text indexing in such indexes as Google or Summon, nor does it allow
> them to restrict where copies are served from.  Similarly, the dc:rights
> section in the OAI-PMH records addresses copyright only.  In practice, Google
> crawls, indexes, and serves full-text copies of our dissertations.
> 


Of course, it is absolutely reasonable that some content either not be open or 
have an embargo period - in which case I'd expect it to either not be added to 
the repository, or added and protected by some security which prevents public 
access. I know that in some cases authors wish to delay release of the thesis 
in order to publish a book which may draw on the PhD research - and this can 
take several years, although different institutions set different limits on 
this. I also know of at least one case where a PhD contained information that 
was deemed so confidential, it was agreed never to release it (I wasn't allowed 
to know what the information was!)

In theory copyright could be seen as sufficient to cover the use of the 
full-text item by third parties - either Google is protected by fair use (in 
the US anyway) or not. Unfortunately (and this would certainly be true in the 
UK) - the only way of really discovering if you have a case against Google 
would be to take them to court. Google would say (as they did to the 
newspapers) "it's easy to request we don't index/cache your content - we obey 
robots.txt". Which sort of brings me back to the starting point of the project 
I'm working on - while two wrongs don't make a right, it seems to us that if 
repositories are not preventing Google (or others; notably, CiteSeerX is in 
the business of crawling repositories, see 
http://csxstatic.ist.psu.edu/about/crawler) from crawling/indexing/caching 
their content, then we hope that a non-profit, publicly funded service should 
feel able to do the same in the interests of making the content of 
repositories more discoverable and more widely disseminated.
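
For what it's worth, honouring robots.txt in a harvester is straightforward. Below is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt content, the host name and the user agent string are all made up for illustration, and the rules are parsed from a string so that the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary repository.
ROBOTS_TXT = """\
User-agent: *
Disallow: /restricted/
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleHarvester" and repository.example.org are placeholders.
open_ok = may_fetch(ROBOTS_TXT, "ExampleHarvester",
                    "http://repository.example.org/open/thesis.pdf")
restricted_ok = may_fetch(ROBOTS_TXT, "ExampleHarvester",
                          "http://repository.example.org/restricted/thesis.pdf")
```

In a live harvester you would point the parser at the repository's real robots.txt with set_url() and read() instead of parsing a string.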

Owen


Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: [email protected]
Telephone: 0121 288 6936
