Alistair Miles
Mon, 16 Feb 2009 00:46:54 -0800
Hi all,

I've generated some statistics on the MODS XML representation of the LOC
dataset, visible at:

http://dublincore.org/dcmirdataskgroup/DataConversion

I'll have some statistics on the MARC XML representation as well later
today, although I'll probably still work from the MODS to design a
transform to RDF/RDA/FRBR.

Cheers,

Alistair

On Fri, Feb 13, 2009 at 05:47:12PM +0000, Alistair Miles wrote:
> Hi Rob,
>
> On Fri, Feb 13, 2009 at 09:05:17AM +0000, Rob Styles wrote:
> > Hey Alistair, this is great work and very interesting. I'm keen to see
> > where the analysis and mapping to RDA goes.
>
> It would be great to get your input when I finally get to looking at
> the details of the mapping. The devil will be in the detail, I'm sure.
>
> > What was the rationale for converting to MODS first rather than mapping
> > straight from the MARC to RDA?
>
> Mostly time constraints -- I need to get as far as I can with a small
> amount of effort. Having no prior experience with MARC it seemed much
> easier to get going with the MODS representation. I understood from
> Ed's comments that the transformation from MARC to MODS doesn't lose
> much (if any?) information.
>
> Btw I have a script to generate stats on the usage of elements in the
> mods files, will hopefully be able to run that in the next couple of
> days for the whole loc dataset. Having fun with hadoop :)
>
> Cheers,
>
> Alistair
>
> >
> > rob
> >
> >
> > On 12 Feb 2009, at 10:30, Alistair Miles wrote:
> >
> >> Hi all,
> >>
> >> This is just an update to say that I've converted the LOC/scriblio
> >> data to marc xml and from there to mods xml. My next step is to do
> >> some analysis of the loc data in mods xml to get an overview of the
> >> elements used, then to try to design at least a partial mapping from
> >> mods xml to RDF using the RDA and FRBR schemas.
> >>
> >> FYI the marc xml and mods xml versions of the LOC/scriblio data can be
> >> downloaded from the links below...
> >>
> >> http://dcmi-rda.s3.amazonaws.com/locdata/part01-marcxml.tar.gz
> >> http://dcmi-rda.s3.amazonaws.com/locdata/part01-modsxml.tar.gz
> >> http://dcmi-rda.s3.amazonaws.com/locdata/part02-marcxml.tar.gz
> >> http://dcmi-rda.s3.amazonaws.com/locdata/part02-modsxml.tar.gz
> >> [...]
> >> http://dcmi-rda.s3.amazonaws.com/locdata/part29-marcxml.tar.gz
> >> http://dcmi-rda.s3.amazonaws.com/locdata/part29-modsxml.tar.gz
> >>
> >> Each download is a gzipped tar containing a *set* of up to 25 xml
> >> files. Each of these files is a 10,000 record split of the data in the
> >> corresponding part. I broke each part into 10,000 record splits so I
> >> could process the transformations more easily.
> >>
> >> N.B. there is a bug in part 13 split 25, for some reason the marc xml
> >> output was incomplete so up to 10,000 records could be missing.
> >>
> >> FWIW I initially tried the conversions without splitting each
> >> part. I.e. I converted each original marc file into a single marc xml
> >> file, then tried to transform that to a mods xml file via
> >> xsltproc. However I found you need more than 7GB ram to do the marcxml
> >> to modsxml transform on a whole part (I tried it on a large ec2
> >> instance), so that's when I decided to split each part into smaller
> >> chunks, which I figured would be faster to process and more amenable
> >> to parallel processing (transforming all the splits from marcxml to
> >> modsxml took a couple of hours on a c1.xlarge ec2 instance, running up
> >> to 10 transformations in parallel; it can also be done on a laptop,
> >> but takes ~10 times longer).
> >>
> >> Btw if anyone else has experience of the marcxml->modsxml transform on
> >> a file of similar size do let me know, I don't do a lot of xslt-ing so
> >> may be missing some tricks for making it work on smaller computers.
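
The split-and-parallelise approach described above can be sketched roughly as follows. This is only an illustration in Python, not the script actually used for the conversion: the stylesheet path, the output file naming and the helper names (`build_command`, `transform_all`) are all assumptions, and it presumes that xsltproc is on the PATH and that the MARC XML splits of one part have already been unpacked into a single directory.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumption: a local copy of LC's MARC21slim2MODS3.xsl stylesheet.
STYLESHEET = Path("MARC21slim2MODS3.xsl")


def mods_path(marcxml_file: Path, out_dir: Path) -> Path:
    """Name the MODS output after its MARC XML split (naming is hypothetical)."""
    return out_dir / (marcxml_file.stem + ".mods.xml")


def build_command(marcxml_file: Path, out_dir: Path) -> list[str]:
    """Build the xsltproc call for one split: xsltproc -o OUT STYLESHEET IN."""
    return [
        "xsltproc",
        "-o", str(mods_path(marcxml_file, out_dir)),
        str(STYLESHEET),
        str(marcxml_file),
    ]


def transform_all(in_dir: Path, out_dir: Path, workers: int = 10) -> list[int]:
    """Transform every *.xml split in in_dir, running up to `workers` at once.

    Returns the exit code of each xsltproc run, in sorted input order.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    splits = sorted(in_dir.glob("*.xml"))

    def run(split: Path) -> int:
        return subprocess.run(build_command(split, out_dir)).returncode

    # Threads suffice here: each worker only waits on an external process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, splits))
```

Calling `transform_all(Path("part01-marcxml"), Path("part01-modsxml"))` would then process one unpacked part (the directory names are hypothetical). A non-zero entry in the returned list flags a split whose transform failed, which is one way an incomplete output could be noticed early.
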
> >>
> >> Cheers,
> >>
> >> Alistair
> >>
> >>
> >> On Mon, Dec 22, 2008 at 03:31:50PM -0500, Ed Summers wrote:
> >>> Hey Alistair:
> >>>
> >>> On Mon, Dec 22, 2008 at 1:16 PM, Alistair Miles
> >>> <alistair.mi...@zoo.ox.ac.uk> wrote:
> >>>> Any tips for how I could turn these data into RDF?
> >>>
> >>> If you want to work specifically with that dataset you could download
> >>> the different parts Karen pointed you to, and convert to MARCXML using
> >>> an efficient tool like yaz-marcdump [1]. yaz-marcdump is nice because
> >>> it will convert from MARC-8 to UTF-8.
> >>>
> >>> Once you've got it in MARCXML you could then use a stylesheet like
> >>> LC's [2] to convert to DublinCore flavored RDF. This might be kinda
> >>> lossy for your RDA work though, so you might want MARCXML->MODS [3],
> >>> and then use the MODS->RDF conversion that the Simile folks created
> >>> (which Karen also pointed you to) [4].
> >>>
> >>> In fact Simile used that stylesheet on their own MIT Library Catalog
> >>> MARC data (Barton) and still seem to have the result online [5]. So
> >>> perhaps just using the Barton data is the quickest way to begin
> >>> playing with what once was MARC data as RDF? To my knowledge Stefano
> >>> Mazzocchi simply created an RDF vocabulary that mirrors the MODS XML
> >>> Schema, but I haven't looked at it in a while.
> >>>
> >>> Another thing worth checking out might be Rob Styles' work [6] with
> >>> other people at Talis at converting MARC with full fidelity to RDF.
> >>> Perhaps he has some tools (or data) at his disposal? Rob, you are on
> >>> here, right?
> >>>
> >>> I'd be willing to lend a hand with some of this if necessary, so just
> >>> let me know if you think I can help.
> >>>
> >>> //Ed
> >>>
> >>> [1] http://www.indexdata.com/yaz/doc/yaz-marcdump.tkl
> >>> [2] http://www.loc.gov/standards/marcxml/xslt/MARC21slim2RDFDC.xsl
> >>> [3] http://www.loc.gov/standards/mods/v3/MARC21slim2MODS3.xsl
> >>> [4] http://simile.mit.edu/wiki/MARC/MODS_RDFizer
> >>> [5] http://simile.mit.edu/wiki/Dataset:_Barton
> >>> [6] http://events.linkeddata.org/ldow2008/papers/02-styles-ayers-semantic-marc.pdf
> >>
> >> --
> >> Alistair Miles
> >> Senior Computing Officer
> >> Image Bioinformatics Research Group
> >> Department of Zoology
> >> The Tinbergen Building
> >> University of Oxford
> >> South Parks Road
> >> Oxford
> >> OX1 3PS
> >> United Kingdom
> >> Web: http://purl.org/net/aliman
> >> Email: alistair.mi...@zoo.ox.ac.uk
> >> Tel: +44 (0)1865 281993
> >
> > Rob Styles
> > tel: +44 (0)870 400 5000
> > fax: +44 (0)870 400 5001
> > mobile: +44 (0)7971 475 257
> > msn: mmmmm...@yahoo.com
> > irc: irc.freenode.net/mmmmmrob,isnick
> > web: http://www.talis.com/
> > blog: http://www.dynamicorange.com/blog/
> > blog: http://blogs.talis.com/panlibus/
> > blog: http://blogs.talis.com/nodalities/
> > blog: http://blogs.talis.com/n2/
--
Alistair Miles
Senior Computing Officer
Image Bioinformatics Research Group
Department of Zoology
The Tinbergen Building
University of Oxford
South Parks Road
Oxford
OX1 3PS
United Kingdom
Web: http://purl.org/net/aliman
Email: alistair.mi...@zoo.ox.ac.uk
Tel: +44 (0)1865 281993
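
For anyone wanting to reproduce element-usage statistics like those mentioned at the top of this message, here is a minimal single-machine sketch in Python. It is not the Hadoop-based script referred to in the thread; the function name `count_elements` and the tiny `SAMPLE` document are made up for illustration, and only the MODS v3 namespace URI is taken as given.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree reports namespaced tags as "{namespace-uri}localname".
MODS_NS = "{http://www.loc.gov/mods/v3}"


def count_elements(source) -> Counter:
    """Count how often each element name occurs in one XML file or file object."""
    counts = Counter()
    # iterparse streams the document instead of building the whole tree first.
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # discard children and text already counted
    return counts


# Hypothetical two-record sample standing in for one 10,000 record split.
SAMPLE = b"""<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods><titleInfo><title>A</title></titleInfo></mods>
  <mods><titleInfo><title>B</title></titleInfo><name/></mods>
</modsCollection>"""

counts = count_elements(io.BytesIO(SAMPLE))
for tag, n in sorted(counts.items()):
    print(tag.replace(MODS_NS, "mods:"), n)
```

Because `Counter` objects can be added together, per-split results can be summed into totals for a whole part, which is also the shape of computation that maps naturally onto a map/reduce job.
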