dc-rda  

Re: datasets for testing rda at scale

Alistair Miles
Mon, 16 Feb 2009 00:46:54 -0800

Hi all,

I've generated some statistics on the MODS XML representation of the
LOC dataset, visible at:

http://dublincore.org/dcmirdataskgroup/DataConversion

I should have some statistics on the MARC XML representation as well later
today, although I'll probably still work from the MODS to design a
transform to RDF/RDA/FRBR.
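
For anyone who wants to run a similar element-usage count on one of the
MODS XML splits, here is a minimal sketch in Python. It is not the script
used to produce the statistics linked above; the function names
(`local_name`, `count_elements`) are made up for illustration, and it
assumes each record is a `mods` element in the MODS v3 namespace.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from io import BytesIO


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def count_elements(source):
    """Count element names in an XML file (path or file-like object).

    The document is streamed with iterparse, and each <mods> record is
    cleared once it has been counted, so a whole split never has to be
    held in memory as a fully populated tree.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        counts[name] += 1
        if name == "mods":
            # Assumption: records are <mods> elements; drop their children.
            elem.clear()
    return counts


if __name__ == "__main__":
    # Tiny hand-written sample, not taken from the LOC data.
    sample = BytesIO(
        b'<modsCollection xmlns="http://www.loc.gov/mods/v3">'
        b"<mods><titleInfo><title>A</title></titleInfo></mods>"
        b"<mods><titleInfo><title>B</title></titleInfo>"
        b"<name><namePart>X</namePart></name></mods>"
        b"</modsCollection>"
    )
    for name, n in count_elements(sample).most_common():
        print(name, n)
```

Counting by local name ignores where in the record an element occurs; a
count keyed on the full element path would be a straightforward extension
if that distinction matters for the mapping.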

Cheers,

Alistair

On Fri, Feb 13, 2009 at 05:47:12PM +0000, Alistair Miles wrote:
> Hi Rob,
> 
> On Fri, Feb 13, 2009 at 09:05:17AM +0000, Rob Styles wrote:
> > Hey Alistair, this is great work and very interesting. I'm keen to see  
> > where the analysis and mapping to RDA goes.
> 
> It would be great to get your input when I finally get to looking at
> the details of the mapping. The devil will be in the detail, I'm sure.
> 
> > What was the rationale for converting to MODS first rather than mapping 
> > straight from the MARC to RDA?
> 
> Mostly time constraints -- I need to get as far as I can with a small
> amount of effort. Having no prior experience with MARC it seemed much
> easier to get going with the MODS representation. I understood from
> Ed's comments that the transformation from MARC to MODS doesn't lose
> much (if any?) information.
> 
> Btw I have a script to generate stats on the usage of elements in the
> MODS files, and will hopefully be able to run it in the next couple of
> days for the whole LOC dataset. Having fun with Hadoop :)
> 
> Cheers,
> 
> Alistair
> 
> >
> > rob
> >
> >
> > On 12 Feb 2009, at 10:30, Alistair Miles wrote:
> >
> >> Hi all,
> >>
> >> This is just an update to say that I've converted the LOC/scriblio
> >> data to MARC XML and from there to MODS XML. My next step is to do
> >> some analysis of the LOC data in MODS XML to get an overview of the
> >> elements used, then to try to design at least a partial mapping from
> >> MODS XML to RDF using the RDA and FRBR schemas.
> >>
> >> FYI the MARC XML and MODS XML versions of the LOC/scriblio data can be
> >> downloaded from the links below...
> >>
> >> http://dcmi-rda.s3.amazonaws.com/locdata/part01-marcxml.tar.gz
> >> http://dcmi-rda.s3.amazonaws.com/locdata/part01-modsxml.tar.gz
> >> http://dcmi-rda.s3.amazonaws.com/locdata/part02-marcxml.tar.gz
> >> http://dcmi-rda.s3.amazonaws.com/locdata/part02-modsxml.tar.gz
> >> [...]
> >> http://dcmi-rda.s3.amazonaws.com/locdata/part29-marcxml.tar.gz
> >> http://dcmi-rda.s3.amazonaws.com/locdata/part29-modsxml.tar.gz
> >>
> >> Each download is a gzipped tar containing a *set* of up to 25 xml
> >> files. Each of these files is a 10,000 record split of the data in the
> >> corresponding part. I broke each part into 10,000 record splits so I
> >> could process the transformations more easily.
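
The 10,000-record splitting described above could be sketched roughly as
follows in Python. This is an illustration under stated assumptions, not
the tooling actually used for these downloads: the function names
(`iter_record_chunks`, `as_collection`) are hypothetical, and the input is
assumed to be a MARCXML `collection` whose `record` elements are in the
MARC21 slim namespace.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

MARC_NS = "http://www.loc.gov/MARC21/slim"
RECORD_TAG = "{%s}record" % MARC_NS

# Serialize with a readable "marc:" prefix instead of an auto-generated one.
ET.register_namespace("marc", MARC_NS)


def iter_record_chunks(source, chunk_size=10_000):
    """Yield lists of at most chunk_size serialized <record> elements."""
    chunk = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == RECORD_TAG:
            chunk.append(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # drop the record's children once serialized
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:
        # Final, possibly shorter, chunk.
        yield chunk


def as_collection(records):
    """Wrap serialized records in a MARCXML <collection> root element."""
    return (
        '<marc:collection xmlns:marc="%s">\n' % MARC_NS
        + "".join(records)
        + "</marc:collection>\n"
    )
```

Each chunk can then be written to its own file, e.g. by looping over
`enumerate(iter_record_chunks(path), start=1)` and writing
`as_collection(chunk)` to a numbered split file of your choosing.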
> >>
> >> N.B. there is a bug in part 13, split 25: for some reason the MARC XML
> >> output was incomplete, so up to 10,000 records could be missing.
> >>
> >> FWIW I initially tried the conversions without splitting each
> >> part, i.e. I converted each original MARC file into a single MARC XML
> >> file, then tried to transform that to a MODS XML file via
> >> xsltproc. However, I found you need more than 7 GB of RAM to do the
> >> MARC XML to MODS XML transform on a whole part (I tried it on a large
> >> EC2 instance). That's when I decided to split each part into smaller
> >> chunks, which I figured would be faster to process and more amenable
> >> to parallel processing. Transforming all the splits from MARC XML to
> >> MODS XML took a couple of hours on a c1.xlarge EC2 instance, running
> >> up to 10 transformations in parallel; it can also be done on a laptop,
> >> but takes ~10 times longer.
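
Running up to 10 xsltproc transformations in parallel, as described above,
might look something like the Python sketch below. It is an assumption
about how such a driver could be written, not the script actually used:
`build_command`, `transform_all` and the injectable `run` parameter are
hypothetical names, and the output file naming (same file name, different
directory) is invented for the example. Threads are sufficient here
because each worker simply waits on an external xsltproc process.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(stylesheet, marcxml_path, out_dir):
    """Build the xsltproc command line for one MARC XML split."""
    # Assumed naming: keep the split's file name, write it under out_dir.
    out_path = Path(out_dir) / Path(marcxml_path).name
    return [
        "xsltproc",
        "--output", str(out_path),
        str(stylesheet),
        str(marcxml_path),
    ]


def transform_all(stylesheet, marcxml_paths, out_dir,
                  max_workers=10, run=subprocess.run):
    """Transform every split, running at most max_workers jobs at once.

    `run` defaults to subprocess.run; passing a stand-in makes it possible
    to dry-run the driver without xsltproc installed.
    """
    commands = [build_command(stylesheet, p, out_dir) for p in marcxml_paths]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces all jobs to finish and re-raises the first failure.
        list(pool.map(lambda cmd: run(cmd, check=True), commands))
    return commands
```

With the LC stylesheet saved locally, a call such as
`transform_all("MARC21slim2MODS3.xsl", sorted(Path("marcxml").glob("*.xml")), "modsxml")`
would process a directory of splits; the directory names are placeholders.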
> >>
> >> Btw if anyone else has experience of the marcxml->modsxml transform on
> >> a file of similar size do let me know, I don't do a lot of xslt-ing so
> >> may be missing some tricks for making it work on smaller computers.
> >>
> >> Cheers,
> >>
> >> Alistair
> >>
> >>
> >> On Mon, Dec 22, 2008 at 03:31:50PM -0500, Ed Summers wrote:
> >>> Hey Alistair:
> >>>
> >>> On Mon, Dec 22, 2008 at 1:16 PM, Alistair Miles
> >>> <alistair.mi...@zoo.ox.ac.uk> wrote:
> >>>> Any tips for how I could turn these data into RDF?
> >>>
> >>> If you want to work specifically with that dataset you could download
> >>> the different parts Karen pointed you to, and convert to MARCXML using
> >>> an efficient tool like yaz-marcdump [1]. yaz-marcdump is nice because
> >>> it will convert from MARC-8 to UTF-8.
> >>>
> >>> Once you've got it in MARCXML you could then use a stylesheet like
> >>> LC's [2] to convert to DublinCore flavored RDF. This might be kinda
> >>> lossy for your RDA work though, so you might want MARCXML->MODS [3],
> >>> and then use the MODS->RDF conversion that the Simile folks created
> >>> (which Karen also pointed you to) [4].
> >>>
> >>> In fact Simile used that stylesheet on their own MIT Library Catalog
> >>> MARC data (Barton) and still seem to have the result online [5]. So
> >>> perhaps just using the Barton data is the quickest way to begin
> >>> playing with what once was MARC data as RDF? To my knowledge Stefano
> >>> Mazzocchi simply created an RDF vocabulary that mirrors the MODS XML
> >>> Schema, but I haven't looked at it in a while.
> >>>
> >>> Another thing worth checking out might be Rob Styles' work [6] with
> >>> other people at Talis on converting MARC with full fidelity to RDF.
> >>> Perhaps he has some tools (or data) at his disposal? Rob, you are on
> >>> here, right?
> >>>
> >>> I'd be willing to lend a hand with some of this if necessary, so just
> >>> let me know if you think I can help.
> >>>
> >>> //Ed
> >>>
> >>> [1] http://www.indexdata.com/yaz/doc/yaz-marcdump.tkl
> >>> [2] http://www.loc.gov/standards/marcxml/xslt/MARC21slim2RDFDC.xsl
> >>> [3] http://www.loc.gov/standards/mods/v3/MARC21slim2MODS3.xsl
> >>> [4] http://simile.mit.edu/wiki/MARC/MODS_RDFizer
> >>> [5] http://simile.mit.edu/wiki/Dataset:_Barton
> >>> [6] http://events.linkeddata.org/ldow2008/papers/02-styles-ayers-semantic-marc.pdf
> >>
> >> -- 
> >> Alistair Miles
> >> Senior Computing Officer
> >> Image Bioinformatics Research Group
> >> Department of Zoology
> >> The Tinbergen Building
> >> University of Oxford
> >> South Parks Road
> >> Oxford
> >> OX1 3PS
> >> United Kingdom
> >> Web: http://purl.org/net/aliman
> >> Email: alistair.mi...@zoo.ox.ac.uk
> >> Tel: +44 (0)1865 281993
> >
> > Rob Styles
> > tel: +44 (0)870 400 5000
> > fax: +44 (0)870 400 5001
> > mobile: +44 (0)7971 475 257
> > msn: mmmmm...@yahoo.com
> > irc: irc.freenode.net/mmmmmrob,isnick
> > web: http://www.talis.com/
> > blog: http://www.dynamicorange.com/blog/
> > blog: http://blogs.talis.com/panlibus/
> > blog: http://blogs.talis.com/nodalities/
> > blog: http://blogs.talis.com/n2/
> 
