dc-rda  

Re: datasets for testing rda at scale

Corey A Harper
Mon, 16 Feb 2009 07:26:33 -0800

Hi Alistair,

I think I may have mentioned this to you before, but if not, have you seen the early MIT / SIMILE work on MODS->RDF? [1] While I think there are a few inaccuracies in it, and it certainly doesn't help with the RDA/FRBR bits of your analysis, it might still be worth looking at, if only to inform or augment the work you've got going.

I'm really excited to see some of this in action as you continue to make progress.

Thanks,
-Corey

[1] http://simile.mit.edu/wiki/MARC/MODS_RDFizer

Alistair Miles wrote:
Hi Karen,

On Fri, Feb 13, 2009 at 06:46:37AM -0800, Karen Coyle wrote:
Alistair,

I did start an analysis of RDA and MARC, but didn't get very far. I'll take that up again. What I was mainly finding is that there are a lot of RDA elements that are listed for more than one MARC element, e.g.

$a Personal name* = 9.2.2 Preferred Name for the Person*
$b Numeration = *9.2.2 Preferred Name for the Person

Yes, I expect there will be lots of issues like this, in both
directions. Please do continue your analysis, this type of insight is
very useful.

I should say that I don't hope to create either a complete or perfect
mapping from mods to RDF/RDA/FRBR. Rather I hope to map just enough to
capture a significant amount of useful information, to demonstrate the
potential for further work in this direction.

Cheers,

Alistair

There are ones that go the other way, as well, where RDA is more specific than MARC. It made me wonder how it is that we use the specific MARC elements: are they needed for display? do they help input? are they arbitrary?

I haven't looked at MODS, however, and there isn't a mapping provided between MODS and RDA. I'll think about that, however.

kc

Alistair Miles wrote:
Hi all,

This is just an update to say that I've converted the LOC/scriblio
data to marc xml and from there to mods xml. My next step is to do
some analysis of the loc data in mods xml to get an overview of the
elements used, then to try to design at least a partial mapping from
mods xml to RDF using the RDA and FRBR schemas.
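
An overview of the elements used could be gathered with something like the sketch below. It is only an illustration in Python, assuming the MODS XML splits are ordinary namespaced XML files; the function name count_elements and the tiny sample document are made up for the example and are not part of the workflow described here.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(source):
    """Count how often each element name occurs in an XML file or stream."""
    counts = Counter()
    # iterparse streams the document, so a large split is never fully in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Tags arrive as '{namespace}local'; keep only the local name.
        counts[elem.tag.rsplit("}", 1)[-1]] += 1
        # Children were already counted (their end events fire first).
        elem.clear()
    return counts


# Hypothetical two-record sample standing in for a real MODS split.
sample = io.BytesIO(
    b'<modsCollection xmlns="http://www.loc.gov/mods/v3">'
    b"<mods><titleInfo><title>A</title></titleInfo></mods>"
    b"<mods><titleInfo><title>B</title></titleInfo><name/></mods>"
    b"</modsCollection>"
)
counts = count_elements(sample)
```

Summing the counters from every split would give a rough picture of which MODS elements are common enough to be worth mapping first.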

FYI the marc xml and mods xml versions of the LOC/scriblio data can be
downloaded from the links below...

http://dcmi-rda.s3.amazonaws.com/locdata/part01-marcxml.tar.gz
http://dcmi-rda.s3.amazonaws.com/locdata/part01-modsxml.tar.gz
http://dcmi-rda.s3.amazonaws.com/locdata/part02-marcxml.tar.gz
http://dcmi-rda.s3.amazonaws.com/locdata/part02-modsxml.tar.gz
[...]
http://dcmi-rda.s3.amazonaws.com/locdata/part29-marcxml.tar.gz
http://dcmi-rda.s3.amazonaws.com/locdata/part29-modsxml.tar.gz

Each download is a gzipped tar containing a *set* of up to 25 xml
files. Each of these files is a 10,000 record split of the data in the
corresponding part. I broke each part into 10,000 record splits so I
could process the transformations more easily.
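
The splitting step could look roughly like the following Python sketch. The tool actually used for the splits is not stated here, so this is an assumption-laden illustration: it assumes MARCXML input in the standard MARC21 slim namespace, and the names split_records and wrap_collection are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
# Serialise records with a default namespace instead of an ns0: prefix.
ET.register_namespace("", MARC_NS)


def split_records(source, size=10000):
    """Yield lists of up to `size` serialised <record> elements."""
    batch = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == f"{{{MARC_NS}}}record":
            batch.append(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # drop the record's content once it is serialised
            if len(batch) == size:
                yield batch
                batch = []
    if batch:
        # The last split may hold fewer than `size` records.
        yield batch


def wrap_collection(records):
    """Wrap serialised records in a <collection> root to form one split."""
    return f'<collection xmlns="{MARC_NS}">' + "".join(records) + "</collection>"
```

Each yielded batch can be passed to wrap_collection and written out as its own file, which keeps every later transformation small.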

N.B. there is a bug in part 13 split 25: for some reason the marc xml
output was incomplete, so up to 10,000 records could be missing.

FWIW I initially tried the conversions without splitting each
part. I.e. I converted each original marc file into a single marc xml
file, then tried to transform that to a mods xml file via
xsltproc. However I found you need more than 7GB ram to do the marcxml
to modsxml transform on a whole part (I tried it on a large ec2
instance), so that's when I decided to split each part into smaller
chunks, which I figured would be faster to process and more amenable
to parallel processing (transforming all the splits from marcxml to
modsxml took a couple of hours on a c1.xlarge ec2 instance, running up
to 10 transformations in parallel; it can also be done on a laptop,
but takes ~10 times longer).
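
Running the per-split transformations in parallel might be sketched as below. This is a hedged Python illustration rather than the script actually used: it assumes xsltproc is on the PATH, and the function names, the file names, and the injectable run parameter (there so the logic can be exercised without xsltproc installed) are all hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def xsltproc_command(stylesheet, source, target):
    """Build the call: xsltproc -o TARGET STYLESHEET SOURCE."""
    return ["xsltproc", "-o", target, stylesheet, source]


def transform_all(stylesheet, jobs, workers=10, run=subprocess.run):
    """Run one xsltproc process per (source, target) pair, `workers` at a time."""

    def one(job):
        source, target = job
        # check=True raises if xsltproc exits with a non-zero status.
        run(xsltproc_command(stylesheet, source, target), check=True)
        return target

    # Threads are enough here: the real work happens in the child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(one, jobs))
```

A call such as transform_all("MARC21slim2MODS3.xsl", [("split01.xml", "split01-mods.xml")]) would transform one split; the default of 10 workers mirrors the "up to 10 transformations in parallel" mentioned above.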

Btw if anyone else has experience of the marcxml->modsxml transform on
a file of similar size do let me know, I don't do a lot of xslt-ing so
may be missing some tricks for making it work on smaller computers.

Cheers,

Alistair


On Mon, Dec 22, 2008 at 03:31:50PM -0500, Ed Summers wrote:
Hey Alistair:

On Mon, Dec 22, 2008 at 1:16 PM, Alistair Miles
<alistair.mi...@zoo.ox.ac.uk> wrote:
Any tips for how I could turn these data into RDF?

If you want to work specifically with that dataset you could download
the different parts Karen pointed you to, and convert to MARCXML using
an efficient tool like yaz-marcdump [1]. yaz-marcdump is nice: it will
convert from MARC-8 to UTF-8.
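
As a rough illustration of that conversion step, the Python sketch below wraps a yaz-marcdump call. It assumes yaz-marcdump is installed and uses its -f/-t character set options and -o marcxml output format; the function names and file paths are hypothetical, and the run parameter is only there so the wrapper can be tested without the tool.

```python
import subprocess


def marcdump_command(marc_path):
    """Build a yaz-marcdump call that reads MARC-8 and emits UTF-8 MARCXML."""
    return ["yaz-marcdump", "-f", "MARC-8", "-t", "UTF-8", "-o", "marcxml", marc_path]


def marc_to_marcxml(marc_path, xml_path, run=subprocess.run):
    """Write the MARCXML that yaz-marcdump prints on stdout to xml_path."""
    with open(xml_path, "wb") as out:
        run(marcdump_command(marc_path), stdout=out, check=True)
```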

Once you've got it in MARCXML you could then use a stylesheet like
LC's [2] to convert to DublinCore flavored RDF. This might be kinda
lossy for your RDA work though, so you might want MARCXML->MODS [3],
and then use the MODS->RDF conversion that the Simile folks created
(which Karen also pointed you to) [4].

In fact Simile used that stylesheet on their own MIT Library Catalog
MARC data (Barton) and still seem to have the result online [5]. So
perhaps just using the Barton data is the quickest way to begin
playing with what once was MARC data as RDF? To my knowledge Stefano
Mazzocchi simply created an RDF vocabulary that mirrors the MODS XML
Schema, but I haven't looked at it in a while.

Another thing worth checking out might be Rob Styles' work [6] with
other people at Talis on converting MARC with full fidelity to RDF.
Perhaps he has some tools (or data) at his disposal? Rob, you are on
here, right?

I'd be willing to lend a hand with some of this if necessary, so just
let me know if you think I can help.

//Ed

[1] http://www.indexdata.com/yaz/doc/yaz-marcdump.tkl
[2] http://www.loc.gov/standards/marcxml/xslt/MARC21slim2RDFDC.xsl
[3] http://www.loc.gov/standards/mods/v3/MARC21slim2MODS3.xsl
[4] http://simile.mit.edu/wiki/MARC/MODS_RDFizer
[5] http://simile.mit.edu/wiki/Dataset:_Barton
[6] http://events.linkeddata.org/ldow2008/papers/02-styles-ayers-semantic-marc.pdf
--
-----------------------------------
Karen Coyle / Digital Library Consultant
kco...@kcoyle.net http://www.kcoyle.net
ph.: 510-540-7596   skype: kcoylenet
fx.: 510-848-3913
mo.: 510-435-8234
------------------------------------


--
Corey A Harper
Metadata Services Librarian
Bobst Library, B42-LL1
New York University
70 Washington Square South
New York, NY  10012
212.998.2479
corey.har...@nyu.edu