Alistair Miles
Tue, 17 Feb 2009 04:25:56 -0800
Hi Corey,

Good to hear from you. Yes, I checked out the SIMILE work, although I
haven't studied it in detail.

If you scroll down the page at:

http://dublincore.org/dcmirdataskgroup/DataConversion

you'll see a sample record in MARC XML, MODS XML, and SIMILE MODS RDF
format for comparison.

Cheers,

Alistair

On Mon, Feb 16, 2009 at 10:26:09AM -0500, Corey A Harper wrote:
> Hi Alistair,
>
> I think I may have mentioned this to you before, but if not, have you
> seen the early MIT / SIMILE work on MODS->RDF? [1] While I think
> there's a few inaccuracies therein, and it certainly doesn't help at all
> with the RDA/FRBR bits of your analysis, it might still be worth looking
> at, even if only to inform or augment the work you've got going.
>
> I'm really excited to see some of this in action as you continue to make
> progress.
>
> Thanks,
> -Corey
>
> [1] http://simile.mit.edu/wiki/MARC/MODS_RDFizer
>
> Alistair Miles wrote:
>> Hi Karen,
>>
>> On Fri, Feb 13, 2009 at 06:46:37AM -0800, Karen Coyle wrote:
>>> Alistair,
>>>
>>> I did start an analysis of RDA and MARC, but didn't get very far.
>>> I'll take that up again. What I was mainly finding is that there are
>>> a lot of RDA elements that are listed for more than one MARC
>>> element, e.g.
>>>
>>> $a Personal name* = 9.2.2 Preferred Name for the Person*
>>> $b Numeration = *9.2.2 Preferred Name for the Person
>>
>> Yes, I expect there will be lots of issues like this, in both
>> directions. Please do continue your analysis, this type of insight is
>> very useful.
>>
>> I should say that I don't hope to create either a complete or perfect
>> mapping from mods to RDF/RDA/FRBR. Rather I hope to map just enough to
>> capture a significant amount of useful information, to demonstrate the
>> potential for further work in this direction.
>>
>> Cheers,
>>
>> Alistair
>>
>>> There are ones that go the other way, as well, where RDA is more
>>> specific than MARC.
>>> It made me wonder how it is that we use the
>>> specific MARC elements: are they needed for display? do they help
>>> input? are they arbitrary?
>>>
>>> I haven't looked at MODS, however, and there isn't a mapping provided
>>> between MODS and RDA. I'll think about that, however.
>>>
>>> kc
>>>
>>> *Alistair Miles wrote:
>>>> Hi all,
>>>>
>>>> This is just an update to say that I've converted the LOC/scriblio
>>>> data to marc xml and from there to mods xml. My next step is to do
>>>> some analysis of the loc data in mods xml to get an overview of the
>>>> elements used, then to try to design at least a partial mapping from
>>>> mods xml to RDF using the RDA and FRBR schemas.
>>>>
>>>> FYI the marc xml and mods xml versions of the LOC/scriblio data can be
>>>> downloaded from the links below...
>>>>
>>>> http://dcmi-rda.s3.amazonaws.com/locdata/part01-marcxml.tar.gz
>>>> http://dcmi-rda.s3.amazonaws.com/locdata/part01-modsxml.tar.gz
>>>> http://dcmi-rda.s3.amazonaws.com/locdata/part02-marcxml.tar.gz
>>>> http://dcmi-rda.s3.amazonaws.com/locdata/part02-modsxml.tar.gz
>>>> [...]
>>>> http://dcmi-rda.s3.amazonaws.com/locdata/part29-marcxml.tar.gz
>>>> http://dcmi-rda.s3.amazonaws.com/locdata/part29-modsxml.tar.gz
>>>>
>>>> Each download is a gzipped tar containing a *set* of up to 25 xml
>>>> files. Each of these files is a 10,000 record split of the data in the
>>>> corresponding part. I broke each part into 10,000 record splits so I
>>>> could process the transformations more easily.
>>>>
>>>> N.B. there is a bug in part 13 split 25, for some reason the marc xml
>>>> output was incomplete so up to 10,000 records could be missing.
>>>>
>>>> FWIW I initially tried the conversions without splitting each
>>>> part. I.e. I converted each original marc file into a single marc xml
>>>> file, then tried to transform that to a mods xml file via
>>>> xsltproc.
>>>> However I found you need more than 7GB ram to do the marcxml
>>>> to modsxml transform on a whole part (I tried it on a large ec2
>>>> instance), so that's when I decided to split each part into smaller
>>>> chunks, which I figured would be faster to process and more amenable
>>>> to parallel processing (transforming all the splits from marcxml to
>>>> modsxml took a couple of hours on a c1.xlarge ec2 instance, running up
>>>> to 10 transformations in parallel; it can also be done on a laptop,
>>>> but takes ~10 times longer).
>>>>
>>>> Btw if anyone else has experience of the marcxml->modsxml transform on
>>>> a file of similar size do let me know, I don't do a lot of xslt-ing so
>>>> may be missing some tricks for making it work on smaller computers.
>>>>
>>>> Cheers,
>>>>
>>>> Alistair
>>>>
>>>>
>>>> On Mon, Dec 22, 2008 at 03:31:50PM -0500, Ed Summers wrote:
>>>>
>>>>> Hey Alistair:
>>>>>
>>>>> On Mon, Dec 22, 2008 at 1:16 PM, Alistair Miles
>>>>> <alistair.mi...@zoo.ox.ac.uk> wrote:
>>>>>
>>>>>> Any tips for how I could turn these data into RDF?
>>>>>>
>>>>> If you want to work specifically with that dataset you could download
>>>>> the different parts Karen pointed you to, and convert to MARCXML using
>>>>> an efficient tool like yaz-marcdump [1]. yaz-marcdump is nice: it will
>>>>> convert from MARC-8 to UTF-8.
>>>>>
>>>>> Once you've got it in MARCXML you could then use a stylesheet like
>>>>> LC's [2] to convert to DublinCore flavored RDF. This might be kinda
>>>>> lossy for your RDA work though, so you might want MARCXML->MODS [3],
>>>>> and then use the MODS->RDF conversion that the Simile folks created
>>>>> (which Karen also pointed you to) [4].
>>>>>
>>>>> In fact Simile used that stylesheet on their own MIT Library Catalog
>>>>> MARC data (Barton) and still seem to have the result online [5]. So
>>>>> perhaps just using the Barton data is the quickest way to begin
>>>>> playing with what once was MARC data as RDF?
>>>>> To my knowledge Stefano
>>>>> Mazzocchi simply created an RDF vocabulary that mirrors the MODS XML
>>>>> Schema, but I haven't looked at it in a while.
>>>>>
>>>>> Another thing worth checking out might be Rob Styles' work [6] with
>>>>> other people at Talis at converting MARC with full fidelity to RDF.
>>>>> Perhaps he has some tools (or data) at his disposal? Rob you are on
>>>>> here right?
>>>>>
>>>>> I'd be willing to lend a hand with some of this if necessary, so just
>>>>> let me know if you think I can help.
>>>>>
>>>>> //Ed
>>>>>
>>>>> [1] http://www.indexdata.com/yaz/doc/yaz-marcdump.tkl
>>>>> [2] http://www.loc.gov/standards/marcxml/xslt/MARC21slim2RDFDC.xsl
>>>>> [3] http://www.loc.gov/standards/mods/v3/MARC21slim2MODS3.xsl
>>>>> [4] http://simile.mit.edu/wiki/MARC/MODS_RDFizer
>>>>> [5] http://simile.mit.edu/wiki/Dataset:_Barton
>>>>> [6] http://events.linkeddata.org/ldow2008/papers/02-styles-ayers-semantic-marc.pdf
>>>>>
>>>>
>>> --
>>> -----------------------------------
>>> Karen Coyle / Digital Library Consultant
>>> kco...@kcoyle.net http://www.kcoyle.net
>>> ph.: 510-540-7596 skype: kcoylenet
>>> fx.: 510-848-3913
>>> mo.: 510-435-8234
>>> ------------------------------------
>>
>
> --
> Corey A Harper
> Metadata Services Librarian
> Bobst Library, B42-LL1
> New York University
> 70 Washington Square South
> New York, NY 10012
> 212.998.2479
> corey.har...@nyu.edu

--
Alistair Miles
Senior Computing Officer
Image Bioinformatics Research Group
Department of Zoology
The Tinbergen Building
University of Oxford
South Parks Road
Oxford
OX1 3PS
United Kingdom
Web: http://purl.org/net/aliman
Email: alistair.mi...@zoo.ox.ac.uk
Tel: +44 (0)1865 281993
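
P.S. For anyone wanting to try the split-then-transform approach described
earlier in this thread, here is a minimal Python sketch. It is not the
script actually used for the LOC/scriblio data. The function names
(split_marcxml, xsltproc_command, transform_all) and the output file naming
are hypothetical, and it assumes xsltproc is on the PATH and that the
stylesheet is a copy of the MARCXML->MODS stylesheet referenced as [3]
above. Records are streamed so a whole part never has to sit in memory.

```python
import subprocess
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

MARC_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("", MARC_NS)  # write MARCXML without an ns0: prefix


def split_marcxml(src, out_dir, chunk_size=10_000):
    """Stream <record> elements from one MARCXML file into chunk files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    record_tag = f"{{{MARC_NS}}}record"
    paths, batch = [], []

    def flush():
        # Wrap the buffered records in a fresh <collection> and write it out.
        collection = ET.Element(f"{{{MARC_NS}}}collection")
        collection.extend(batch)
        # Hypothetical naming scheme: split01-marcxml.xml, split02-...
        path = out_dir / f"split{len(paths) + 1:02d}-marcxml.xml"
        ET.ElementTree(collection).write(
            path, encoding="utf-8", xml_declaration=True
        )
        paths.append(path)
        batch.clear()

    context = ET.iterparse(src, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            batch.append(elem)
            if len(batch) >= chunk_size:
                flush()
                root.clear()  # drop records already written to bound memory
    if batch:
        flush()
    return paths


def xsltproc_command(stylesheet, src, dst):
    """Build the xsltproc invocation for one split (assumes xsltproc on PATH)."""
    return ["xsltproc", "--output", str(dst), str(stylesheet), str(src)]


def transform_all(stylesheet, splits, workers=10):
    """Run up to `workers` xsltproc processes at once, one per split."""

    def run(src):
        dst = src.with_name(src.name.replace("marcxml", "modsxml"))
        subprocess.run(xsltproc_command(stylesheet, src, dst), check=True)
        return dst

    # Threads are enough here: the real work happens in the child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, splits))
```

Usage would be along the lines of
transform_all("MARC21slim2MODS3.xsl", split_marcxml("part01.xml", "part01-splits")),
where part01.xml stands for the yaz-marcdump output for one part. The chunk
size of 10,000 and the limit of 10 parallel transforms simply mirror the
numbers mentioned in the thread; both are parameters.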