Karen Coyle
Mon, 16 Feb 2009 10:03:54 -0800
kc

Alistair Miles wrote:
Hi all,

I've generated some statistics on the MODS XML representation of the LOC dataset, visible at:

http://dublincore.org/dcmirdataskgroup/DataConversion

I'll have some statistics on the MARC XML representation as well later today, although I'll probably still work from the MODS to design a transform to RDF/RDA/FRBR.

Cheers,
Alistair

On Fri, Feb 13, 2009 at 05:47:12PM +0000, Alistair Miles wrote:

Hi Rob,

On Fri, Feb 13, 2009 at 09:05:17AM +0000, Rob Styles wrote:

Hey Alistair, this is great work and very interesting. I'm keen to see where the analysis and mapping to RDA goes.

It would be great to get your input when I finally get to looking at the details of the mapping. The devil will be in the detail, I'm sure.

What was the rationale for converting to MODS first rather than mapping straight from the MARC to RDA?

Mostly time constraints -- I need to get as far as I can with a small amount of effort. Having no prior experience with MARC it seemed much easier to get going with the MODS representation. I understood from Ed's comments that the transformation from MARC to MODS doesn't lose much (if any?) information.

Btw I have a script to generate stats on the usage of elements in the MODS files, and will hopefully be able to run that in the next couple of days for the whole LOC dataset. Having fun with hadoop :)

Cheers,
Alistair

rob

On 12 Feb 2009, at 10:30, Alistair Miles wrote:

Hi all,

This is just an update to say that I've converted the LOC/scriblio data to MARC XML and from there to MODS XML.

My next step is to do some analysis of the LOC data in MODS XML to get an overview of the elements used, then to try to design at least a partial mapping from MODS XML to RDF using the RDA and FRBR schemas.

FYI the MARC XML and MODS XML versions of the LOC/scriblio data can be downloaded from the links below...
http://dcmi-rda.s3.amazonaws.com/locdata/part01-marcxml.tar.gz
http://dcmi-rda.s3.amazonaws.com/locdata/part01-modsxml.tar.gz
http://dcmi-rda.s3.amazonaws.com/locdata/part02-marcxml.tar.gz
http://dcmi-rda.s3.amazonaws.com/locdata/part02-modsxml.tar.gz
[...]
http://dcmi-rda.s3.amazonaws.com/locdata/part29-marcxml.tar.gz
http://dcmi-rda.s3.amazonaws.com/locdata/part29-modsxml.tar.gz

Each download is a gzipped tar containing a *set* of up to 25 xml files. Each of these files is a 10,000 record split of the data in the corresponding part. I broke each part into 10,000 record splits so I could process the transformations more easily.

N.B. there is a bug in part 13 split 25, for some reason the marc xml output was incomplete so up to 10,000 records could be missing.

FWIW I initially tried the conversions without splitting each part. I.e. I converted each original marc file into a single marc xml file, then tried to transform that to a mods xml file via xsltproc. However I found you need more than 7GB ram to do the marcxml to modsxml transform on a whole part (I tried it on a large ec2 instance), so that's when I decided to split each part into smaller chunks, which I figured would be faster to process and more amenable to parallel processing (transforming all the splits from marcxml to modsxml took a couple of hours on a c1.xlarge ec2 instance, running up to 10 transformations in parallel; it can also be done on a laptop, but takes ~10 times longer).

Btw if anyone else has experience of the marcxml->modsxml transform on a file of similar size do let me know, I don't do a lot of xslt-ing so may be missing some tricks for making it work on smaller computers.
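For anyone who wants to try the splitting step described above, here is a minimal sketch in Python. It is not the script used for the downloads; it assumes the input is a MARCXML <collection> in the standard MARC21 slim namespace, and the function name split_marcxml, the "marc" prefix, and the splitNN.xml file naming are all hypothetical. It streams the input so that only one split's worth of records is held in memory at a time.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

MARC_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("marc", MARC_NS)  # prefix choice is an assumption


def split_marcxml(source, out_dir, records_per_split=10_000):
    """Stream a MARCXML collection and write it out in fixed-size splits.

    Returns the list of files written (hypothetical naming: split01.xml, ...).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    record_tag = f"{{{MARC_NS}}}record"
    written = []
    batch = []

    def flush():
        # Write the records collected so far as one <collection> file.
        if not batch:
            return
        collection = ET.Element(f"{{{MARC_NS}}}collection")
        collection.extend(batch)
        path = out_dir / f"split{len(written) + 1:02d}.xml"
        ET.ElementTree(collection).write(
            path, encoding="utf-8", xml_declaration=True
        )
        written.append(path)
        batch.clear()

    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            batch.append(elem)
            if len(batch) == records_per_split:
                flush()
                root.clear()  # release records that are already on disk
    flush()  # final, possibly short, split
    return written
```

Each resulting split is small enough to hand to xsltproc on its own, which is what makes the parallel runs possible.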
Cheers,
Alistair

On Mon, Dec 22, 2008 at 03:31:50PM -0500, Ed Summers wrote:

Hey Alistair:

On Mon, Dec 22, 2008 at 1:16 PM, Alistair Miles <alistair.mi...@zoo.ox.ac.uk> wrote:

Any tips for how I could turn these data into RDF?

If you want to work specifically with that dataset you could download the different parts Karen pointed you to, and convert to MARCXML using an efficient tool like yaz-marcdump [1]. yaz-marcdump is nice because it will convert from MARC-8 to UTF-8. Once you've got it in MARCXML you could then use a stylesheet like LC's [2] to convert to DublinCore flavored RDF.

This might be kinda lossy for your RDA work though, so you might want MARCXML->MODS [3], and then use the MODS->RDF conversion that the Simile folks created (which Karen also pointed you to) [4]. In fact Simile used that stylesheet on their own MIT Library Catalog MARC data (Barton) and still seem to have the result online [5]. So perhaps just using the Barton data is the quickest way to begin playing with what once was MARC data as RDF? To my knowledge Stefano Mazzocchi simply created an RDF vocabulary that mirrors the MODS XML Schema, but I haven't looked at it in a while.

Another thing worth checking out might be Rob Styles' work [6] with other people at Talis on converting MARC with full fidelity to RDF. Perhaps he has some tools (or data) at his disposal? Rob you are on here right?

I'd be willing to lend a hand with some of this if necessary, so just let me know if you think I can help.
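As a rough illustration of the MARC -> MARCXML -> MODS route suggested above, the two steps could be driven from Python along these lines. This is only a sketch: it assumes yaz-marcdump and xsltproc are installed and on the PATH, that the input really is MARC-8 encoded, and that the MARC21slim2MODS3.xsl stylesheet [3] has been downloaded locally. The function names and output file naming are hypothetical.

```python
import subprocess
from pathlib import Path


def marc_to_marcxml_cmd(marc_file):
    # -f / -t: source and target character encodings; -o: output format.
    # yaz-marcdump writes the converted records to standard output.
    return ["yaz-marcdump", "-f", "MARC-8", "-t", "UTF-8",
            "-o", "marcxml", str(marc_file)]


def marcxml_to_mods_cmd(stylesheet, marcxml_file, mods_file):
    # xsltproc -o OUTPUT STYLESHEET INPUT
    return ["xsltproc", "-o", str(mods_file), str(stylesheet), str(marcxml_file)]


def convert(marc_file, stylesheet, work_dir):
    """Run both steps for one MARC file and return the path of the MODS file."""
    work_dir = Path(work_dir)
    stem = Path(marc_file).stem
    marcxml_file = work_dir / f"{stem}-marcxml.xml"  # hypothetical naming
    mods_file = work_dir / f"{stem}-modsxml.xml"     # hypothetical naming
    with open(marcxml_file, "wb") as out:
        subprocess.run(marc_to_marcxml_cmd(marc_file), stdout=out, check=True)
    subprocess.run(
        marcxml_to_mods_cmd(stylesheet, marcxml_file, mods_file), check=True
    )
    return mods_file
```

Keeping the command construction in small functions makes it easy to swap in a different stylesheet, such as the MARCXML to RDF/DC one [2], without touching the rest.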
//Ed

[1] http://www.indexdata.com/yaz/doc/yaz-marcdump.tkl
[2] http://www.loc.gov/standards/marcxml/xslt/MARC21slim2RDFDC.xsl
[3] http://www.loc.gov/standards/mods/v3/MARC21slim2MODS3.xsl
[4] http://simile.mit.edu/wiki/MARC/MODS_RDFizer
[5] http://simile.mit.edu/wiki/Dataset:_Barton
[6] http://events.linkeddata.org/ldow2008/papers/02-styles-ayers-semantic-marc.pdf

--
Alistair Miles
Senior Computing Officer
Image Bioinformatics Research Group
Department of Zoology
The Tinbergen Building
University of Oxford
South Parks Road
Oxford OX1 3PS
United Kingdom
Web: http://purl.org/net/aliman
Email: alistair.mi...@zoo.ox.ac.uk
Tel: +44 (0)1865 281993

Rob Styles
tel: +44 (0)870 400 5000
fax: +44 (0)870 400 5001
mobile: +44 (0)7971 475 257
msn: mmmmm...@yahoo.com
irc: irc.freenode.net/mmmmmrob,isnick
web: http://www.talis.com/
blog: http://www.dynamicorange.com/blog/
blog: http://blogs.talis.com/panlibus/
blog: http://blogs.talis.com/nodalities/
blog: http://blogs.talis.com/n2/
--
-----------------------------------
Karen Coyle / Digital Library Consultant
kco...@kcoyle.net http://www.kcoyle.net
ph.: 510-540-7596
skype: kcoylenet
fx.: 510-848-3913
mo.: 510-435-8234
------------------------------------