dc-rda  

Re: datasets for testing rda at scale

Alistair Miles
Mon, 16 Feb 2009 01:13:44 -0800

Hi Karen,

On Fri, Feb 13, 2009 at 06:46:37AM -0800, Karen Coyle wrote:
> Alistair,
>
> I did start an analysis of RDA and MARC, but didn't get very far. I'll  
> take that up again. What I was mainly finding is that there are a lot of  
> RDA elements that are listed for more than one MARC element, e.g.
>
> $a Personal name* = 9.2.2 Preferred Name for the Person*
> $b Numeration = *9.2.2 Preferred Name for the Person

Yes, I expect there will be lots of issues like this, in both
directions. Please do continue your analysis; this type of insight is
very useful.

I should say that I don't hope to create either a complete or perfect
mapping from mods to RDF/RDA/FRBR. Rather I hope to map just enough to
capture a significant amount of useful information, to demonstrate the
potential for further work in this direction.
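
To decide which elements are worth mapping first, a simple usage tally
over one MODS XML split might be enough. Here is a minimal sketch,
assuming plain Python with only the standard library; the function name
tally_elements and the tiny sample document are made up for
illustration, and the real splits would be read from disk instead.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def tally_elements(source):
    """Count occurrences of each element path, streaming to limit memory use."""
    counts = Counter()
    path = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # Drop the "{namespace}" prefix so paths stay readable.
            path.append(elem.tag.rsplit("}", 1)[-1])
            counts["/".join(path)] += 1
        else:
            path.pop()
            # Once a whole record (a direct child of the root) has been
            # read, release its subtree.
            if len(path) == 1:
                elem.clear()
    return counts


# Hypothetical two-record sample standing in for a real 10,000 record split.
SAMPLE = (
    b'<modsCollection xmlns="http://www.loc.gov/mods/v3">'
    b"<mods><titleInfo><title>A</title></titleInfo></mods>"
    b"<mods><titleInfo><title>B</title></titleInfo><name/></mods>"
    b"</modsCollection>"
)
demo_counts = tally_elements(io.BytesIO(SAMPLE))
```

On a real split, passing the file name instead of the BytesIO object and
calling most_common() on the result would list the most heavily used
element paths first.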

Cheers,

Alistair

>
> There are ones that go the other way, as well, where RDA is more  
> specific than MARC. It made me wonder how it is that we use the specific  
> MARC elements: are they needed for display? do they help input? are they  
> arbitrary?
>
> I haven't looked at MODS, however, and there isn't a mapping provided  
> between MODS and RDA. I'll think about that, however.
>
> kc
>
> Alistair Miles wrote:
>> Hi all,
>>
>> This is just an update to say that I've converted the LOC/scriblio
>> data to marc xml and from there to mods xml. My next step is to do
>> some analysis of the loc data in mods xml to get an overview of the
>> elements used, then to try to design at least a partial mapping from
>> mods xml to RDF using the RDA and FRBR schemas.
>>
>> FYI the marc xml and mods xml versions of the LOC/scriblio data can be
>> downloaded from the links below...
>>
>> http://dcmi-rda.s3.amazonaws.com/locdata/part01-marcxml.tar.gz
>> http://dcmi-rda.s3.amazonaws.com/locdata/part01-modsxml.tar.gz
>> http://dcmi-rda.s3.amazonaws.com/locdata/part02-marcxml.tar.gz
>> http://dcmi-rda.s3.amazonaws.com/locdata/part02-modsxml.tar.gz
>> [...]
>> http://dcmi-rda.s3.amazonaws.com/locdata/part29-marcxml.tar.gz
>> http://dcmi-rda.s3.amazonaws.com/locdata/part29-modsxml.tar.gz
>>
>> Each download is a gzipped tar containing a *set* of up to 25 xml
>> files. Each of these files is a 10,000 record split of the data in the
>> corresponding part. I broke each part into 10,000 record splits so I
>> could process the transformations more easily.
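
For anyone wanting to reproduce the splitting step, it could be sketched
roughly as below. This is not the script that produced the downloads,
just one way to do it, assuming Python's standard library and the usual
MARCXML layout (a collection element containing record elements in the
http://www.loc.gov/MARC21/slim namespace). The names split_records and
wrap are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
# Serialize MARCXML elements with a default namespace instead of "ns0:".
ET.register_namespace("", MARC_NS)


def split_records(source, chunk_size=10000):
    """Yield lists of serialized <record> elements, chunk_size at a time."""
    chunk = []
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == f"{{{MARC_NS}}}record":
            chunk.append(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # release the record's subtree once serialized
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk  # final, possibly short, chunk


def wrap(chunk):
    """Wrap serialized records in a <collection> so each split is valid XML."""
    return f'<collection xmlns="{MARC_NS}">' + "".join(chunk) + "</collection>"


# Hypothetical five-record sample, split two records at a time.
_records = "".join(f"<record><leader>{i}</leader></record>" for i in range(5))
SAMPLE = f'<collection xmlns="{MARC_NS}">{_records}</collection>'.encode("utf-8")
chunks = list(split_records(io.BytesIO(SAMPLE), chunk_size=2))
sizes = [len(c) for c in chunks]
```

Because the input is streamed, memory use depends on the chunk size
rather than on the size of the whole part.
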
>>
>> N.B. there is a bug in part 13 split 25: for some reason the marc xml
>> output was incomplete, so up to 10,000 records could be missing.
>>
>> FWIW I initially tried the conversions without splitting each
>> part. I.e. I converted each original marc file into a single marc xml
>> file, then tried to transform that to a mods xml file via
>> xsltproc. However I found you need more than 7GB ram to do the marcxml
>> to modsxml transform on a whole part (I tried it on a large ec2
>> instance), so that's when I decided to split each part into smaller
>> chunks, which I figured would be faster to process and more amenable
>> to parallel processing (transforming all the splits from marcxml to
>> modsxml took a couple of hours on a c1.xlarge ec2 instance, running up
>> to 10 transformations in parallel; it can also be done on a laptop,
>> but takes ~10 times longer).
>>
>> Btw if anyone else has experience of the marcxml->modsxml transform on
>> a file of similar size do let me know, I don't do a lot of xslt-ing so
>> may be missing some tricks for making it work on smaller computers.
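
The parallel run described above could be driven by something like the
following. It is a sketch under assumptions, not the script actually
used: it assumes xsltproc is on the PATH, that the LC MARC21slim2MODS3.xsl
stylesheet has been downloaded locally, and that split files carry
"marcxml" in their names (the file names shown are hypothetical).

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(stylesheet, split_path, out_dir):
    """Build the xsltproc call for one split: xsltproc -o OUT STYLESHEET IN."""
    out_name = Path(split_path).name.replace("marcxml", "modsxml")
    out_path = Path(out_dir) / out_name
    return ["xsltproc", "-o", str(out_path), str(stylesheet), str(split_path)]


def transform_all(stylesheet, split_paths, out_dir, workers=10):
    """Run up to `workers` transforms at once; return the inputs that failed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    commands = [build_command(stylesheet, p, out_dir) for p in split_paths]
    # Threads suffice here: each one only waits on an external process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))
    return [cmd[-1] for cmd, code in zip(commands, codes) if code != 0]


# Hypothetical file names, for illustration only.
example = build_command(
    "MARC21slim2MODS3.xsl", "splits/part01-marcxml-001.xml", "out"
)
```

Calling transform_all with a list of split files returns the inputs whose
transform exited with a non-zero status, which makes a partial failure
easier to spot. On a smaller machine, lowering workers trades speed for
memory.
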
>>
>> Cheers,
>>
>> Alistair
>>
>>
>> On Mon, Dec 22, 2008 at 03:31:50PM -0500, Ed Summers wrote:
>>   
>>> Hey Alistair:
>>>
>>> On Mon, Dec 22, 2008 at 1:16 PM, Alistair Miles
>>> <alistair.mi...@zoo.ox.ac.uk> wrote:
>>>     
>>>> Any tips for how I could turn these data into RDF?
>>>>       
>>> If you want to work specifically with that dataset you could download
>>> the different parts Karen pointed you to, and convert to MARCXML using
>>> an efficient tool like yaz-marcdump [1]. yaz-marcdump is nice: it will
>>> convert from MARC-8 to UTF-8.
>>>
>>> Once you've got it in MARCXML you could then use a stylesheet like
>>> LC's [2] to convert to DublinCore flavored RDF. This might be kinda
>>> lossy for your RDA work though, so you might want MARCXML->MODS [3],
>>> and then use the MODS->RDF conversion that the Simile folks created
>>> (which Karen also pointed you to) [4].
>>>
>>> In fact Simile used that stylesheet on their own MIT Library Catalog
>>> MARC data (Barton) and still seem to have the result online [5]. So
>>> perhaps just using the Barton data is the quickest way to begin
>>> playing with what once was MARC data as RDF? To my knowledge Stefano
>>> Mazzocchi simply created an RDF vocabulary that mirrors the  MODS XML
>>> Schema, but I haven't looked at it in a while.
>>>
>>> Another thing worth checking out might be Rob Styles' work [6] with
>>> other people at Talis on converting MARC with full fidelity to RDF.
>>> Perhaps he has some tools (or data) at his disposal? Rob, you are on
>>> here, right?
>>>
>>> I'd be willing to lend a hand with some of this if necessary, so just
>>> let me know if you think I can help.
>>>
>>> //Ed
>>>
>>> [1] http://www.indexdata.com/yaz/doc/yaz-marcdump.tkl
>>> [2] http://www.loc.gov/standards/marcxml/xslt/MARC21slim2RDFDC.xsl
>>> [3] http://www.loc.gov/standards/mods/v3/MARC21slim2MODS3.xsl
>>> [4] http://simile.mit.edu/wiki/MARC/MODS_RDFizer
>>> [5] http://simile.mit.edu/wiki/Dataset:_Barton
>>> [6] http://events.linkeddata.org/ldow2008/papers/02-styles-ayers-semantic-marc.pdf
>>>     
>>
>>   
>
> -- 
> -----------------------------------
> Karen Coyle / Digital Library Consultant
> kco...@kcoyle.net http://www.kcoyle.net
> ph.: 510-540-7596   skype: kcoylenet
> fx.: 510-848-3913
> mo.: 510-435-8234
> ------------------------------------

-- 
Alistair Miles
Senior Computing Officer
Image Bioinformatics Research Group
Department of Zoology
The Tinbergen Building
University of Oxford
South Parks Road
Oxford
OX1 3PS
United Kingdom
Web: http://purl.org/net/aliman
Email: alistair.mi...@zoo.ox.ac.uk
Tel: +44 (0)1865 281993