Re: [CODE4LIB] MODS and DCTERMS
Having read the rest of this thread, I find that nothing that's been said changes my initial gut reaction on reading this question: DO NOT USE DCTERMS. Its vocabulary is Just Plain Inadequate, and not only for esoteric cases like the "Alternative Chronological Designation of First Issue or Part of Sequence" field that Karen mentioned. Despite having 70 (seventy!) elements, it's lacking fundamental fields for describing articles in journals -- there are no journalTitle, volume, issue, startPage or endPage fields. That, for me, is a deal-breaker.

(For anyone who wonders: MODS does have a way to represent these elements, although they are unnecessarily complicated, as the example at http://www.loc.gov/standards/mods/v3/modsjournal.xml shows.)

For anyone who enjoys weeping freely, I recommend the document Guidelines for Encoding Bibliographic Citation Information in Dublin Core Metadata, available at http://dublincore.org/documents/dc-citation-guidelines/index.shtml

On 28 April 2010 17:56, MJ Suhonos <m...@suhonos.ca> wrote:

> Hi all,
>
> I'm digging into earlier threads on Code4Lib and NGC4lib and trying to get some concrete examples around the DCTERMS element set -- maybe I haven't been a subscriber for long enough. What I'm looking for in particular are things I can work with *in code/implementation*, most notably:
>
> - does there exist a MODS-to-DCTERMS (or vice-versa) crosswalk anywhere? I see one for collections: http://www.loc.gov/standards/mods/v3/mods-collection-description.html and http://www.loc.gov/marc/marc2dc.html for MARC, but my ideal use case is, e.g., an XSLT to turn a MODS document into an XML-encoded DCTERMS document. Surely someone has done this? (I'm sure I've oversimplified or misunderstood something, but hopefully the general approach is understandable.)
>
> - for that matter, is there a good example of how to properly serialize DCTERMS for, e.g., a converted MARC/MODS record in XML (or RDF/XML)? I see, e.g., http://dublincore.org/documents/dcq-rdf-xml/ which has been replaced by http://dublincore.org/documents/dc-rdf/ but I'm not sure if the latter obviates the former entirely? Also, the examples at the bottom of the latter don't show, e.g., repeated elements or DCMES elements. Do we abandon http://purl.org/dc/elements/1.1/ entirely? For example, is this valid?
>
> <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>          xmlns:dcterms="http://purl.org/dc/terms/"
>          xmlns:dc="http://purl.org/dc/elements/1.1/">
>   <rdf:Description rdf:about="http://example.org/123">
>     <dc:title xml:lang="en">Learning Biology</dc:title>
>     <dcterms:title xml:lang="en">Learning Biology</dcterms:title>
>     <dcterms:alternative xml:lang="en">A primer on biological processes</dcterms:alternative>
>     <dcterms:creator xml:lang="en">Bar, Foo</dcterms:creator>
>     <dcterms:creator xml:lang="en">Smith, Jane</dcterms:creator>
>     <dc:creator xml:lang="en">Bar, Foo</dc:creator>
>     <dc:creator xml:lang="en">Smith, Jane</dc:creator>
>   </rdf:Description>
> </rdf:RDF>
>
> Apologies for any questions that seem silly or naive -- I think I have a pretty firm grasp on the levels of abstraction involved, but for the life of me, I can't find much solid stuff about DCTERMS outside of the DCMI website, which can be a bit of a challenge to navigate at times.
>
> Thanks,
> MJ
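[Editor's note: for what it's worth, the namespace mechanics in the example above can be reproduced with nothing but the Python standard library. This is only a sketch of serializing a description that mixes the legacy dc and the dcterms namespaces -- it says nothing about which DCMI binding blesses the result, and the resource URI is the example's own made-up one.]

```python
# Sketch: build a minimal mixed dc/dcterms description as RDF/XML with
# the stdlib ElementTree. Not a DCMI-sanctioned serializer.
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS = "http://purl.org/dc/terms/"
DC = "http://purl.org/dc/elements/1.1/"
XML = "http://www.w3.org/XML/1998/namespace"  # for xml:lang

ET.register_namespace("rdf", RDF)
ET.register_namespace("dcterms", DCTERMS)
ET.register_namespace("dc", DC)

root = ET.Element(f"{{{RDF}}}RDF")
desc = ET.SubElement(root, f"{{{RDF}}}Description",
                     {f"{{{RDF}}}about": "http://example.org/123"})

# the same title asserted with both the legacy and the dcterms property
for ns in (DC, DCTERMS):
    title = ET.SubElement(desc, f"{{{ns}}}title", {f"{{{XML}}}lang": "en"})
    title.text = "Learning Biology"

xml_out = ET.tostring(root, encoding="unicode")
print(xml_out)
```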
Re: [CODE4LIB] MODS and DCTERMS
On Tue, May 4, 2010 at 7:55 AM, Mike Taylor <m...@indexdata.com> wrote:

> Having read the rest of this thread, I find that nothing that's been said changes my initial gut reaction on reading this question: DO NOT USE DCTERMS. Its vocabulary is Just Plain Inadequate, and not only for esoteric cases like the "Alternative Chronological Designation of First Issue or Part of Sequence" field that Karen mentioned. Despite having 70 (seventy!) elements, it's lacking fundamental fields for describing articles in journals -- there are no journalTitle, volume, issue, startPage or endPage fields. That, for me, is a deal-breaker.

If you're using Dublin Core as XML, I agree with this. If you're using Dublin Core as RDF (which is, honestly, the only thing it's really good for), this is a non-issue.

-Ross.
Re: [CODE4LIB] MODS and DCTERMS
On 4 May 2010 13:19, Ross Singer <rossfsin...@gmail.com> wrote:

> On Tue, May 4, 2010 at 7:55 AM, Mike Taylor <m...@indexdata.com> wrote:
>
>> Having read the rest of this thread, I find that nothing that's been said changes my initial gut reaction on reading this question: DO NOT USE DCTERMS. Its vocabulary is Just Plain Inadequate, and not only for esoteric cases like the "Alternative Chronological Designation of First Issue or Part of Sequence" field that Karen mentioned. Despite having 70 (seventy!) elements, it's lacking fundamental fields for describing articles in journals -- there are no journalTitle, volume, issue, startPage or endPage fields. That, for me, is a deal-breaker.
>
> If you're using Dublin Core as XML, I agree with this. If you're using Dublin Core as RDF (which is, honestly, the only thing it's really good for), this is a non-issue.

Oh -- what is the solution when using it in RDF?
Re: [CODE4LIB] MODS and DCTERMS
On Tue, May 4, 2010 at 8:24 AM, Mike Taylor <m...@indexdata.com> wrote:

> Oh, what is the solution when using it in RDF?

I've been using the Bibliographic Ontology myself: http://bibliontology.com/

Lots of stuff in there for journals, etc. -- and reuse of other vocabularies like event, foaf, prism and (ahem) dcterms.

//Ed
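[Editor's note: to make Ed's pointer concrete, in RDF the "missing fields" complaint dissolves because article-level and journal-level facts live on separate, linked resources. A sketch with real BIBO/DCTERMS property URIs but made-up subject URIs, using plain tuples rather than any RDF library:]

```python
# Sketch: an article/journal description as bare (s, p, o) triples.
# BIBO and DCTERMS property names are real; the example.org subject
# URIs are invented for illustration.
BIBO = "http://purl.org/ontology/bibo/"
DCTERMS = "http://purl.org/dc/terms/"

article = "http://example.org/article/1"
journal = "http://example.org/journal/palaeontology"

triples = [
    (article, DCTERMS + "title", "An unusual new neosauropod dinosaur"),
    (article, DCTERMS + "isPartOf", journal),   # links article to journal
    (article, BIBO + "volume", "50"),
    (article, BIBO + "issue", "6"),
    (article, BIBO + "pageStart", "1547"),
    (article, BIBO + "pageEnd", "1564"),
    (journal, DCTERMS + "title", "Palaeontology"),
    (journal, BIBO + "issn", "0031-0239"),
]

# journalTitle / volume / issue / startPage / endPage all have homes:
# journal facts hang off the journal resource, article facts use bibo.
for s, p, o in triples:
    print(s, p, o, sep="\t")
```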
Re: [CODE4LIB] MODS and DCTERMS
I'd just like to say a word of thanks to everyone who has contributed so far on this thread. The viewpoints raised certainly help clarify at least my understanding of some of the issues and concepts involved.

> MARCXML is a step in the right direction. MODS goes even further. Neither really go far enough.

And that succinctly, Eric manages to summarize my (and I strongly suspect, many others') sentiment on the issue at hand. Of course, the natural follow-on question is: go far enough for *what*, exactly? And this is where my original question came from.

It sounds like once again we have the issue that our current tools (MODS, DCTERMS) aren't good enough, which means we either have to:

a) stop doing things while we build new, better tools like Karen's MARC-in-triples (which seems like a really interesting idea), or

b) start building imperfect -- perhaps highly flawed -- things with our current, imperfect tools

I'm not nearly smart enough to do a), so my intent is to take a stab at b), or else sit back and consider a new line of work entirely (which happens distressingly often, usually after reading enough discouraging statements from librarians in a given day).

> I think there's a fundamental difference between MODS and DCTERMS that makes this nearly impossible. I've sometimes described this as the difference between metadata as record format (MARC, oai_dc, MODS, etc.) and metadata as vocabulary (DCTERMS, DCAM, RDF vocabs in general).

This is a great clarification, and one of the main frustrations I have with MODS: it is bound nearly inseparably to XML as a format (and this is coming from someone who knows and loves XML dearly). The idea of DCTERMS/DC/etc. as a format-independent model seems like a step in the right direction, IMO.

> RDF's grammar comes from the RDF Data Model, and DC's comes from DCAM as well as directly from RDF. The process that Karen Coyle describes is really the only way forward in making a good-faith effort to put MARC (the bibliographic data) onto the Semantic Web.

Fair enough. But I would contend that putting MARC / bib data on the Semantic Web is just one use case, even though I realize that to Semantic Web advocates it's the *only* use case worth considering. I find it difficult to imagine that building a record format from just a list of words is completely useless, especially given that right now there's next to *zero* access to bibliographic data from libraries. Maybe the way to go is to just make the MARCXML available via OAI-PMH and OpenSearch and leave it at that.

> A more rational approach, IMO, would create a general description set (probably numbering 20-50), then expand that for more detail and for different materials. Users of the sets could define the zones they wish to use in an application profile, so no one would have to carry around data elements that they are sure they will not use. It would also provide a simple but compatible set for folks who don't want to do the whole library description bit.

I agree with this 100%, and conceptually that's what DC and DCTERMS seemed to be the basis of, at least to me. This seems to parallel the MARC approach to refinement, which can be expressed as either a hierarchy or a set of independent assertions. Moreover, it's format-independent, so it could be serialized as XML, or RDF, or JSON for that matter. Is this what the RDA entities are supposed to achieve?

Let me give another example: the Open Library API returns a JSON tree, e.g. http://openlibrary.org/books/OL1M.json

But what schema is this? And if it doesn't conform to a standard schema, does that make it useless? If it were based on DCTERMS, at least I'd have a reference at http://dublincore.org/documents/dcmi-terms/ to define the semantics being used (and an RDF namespace at http://purl.org/dc/terms/ to boot).

MJ
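[Editor's note: as a thought experiment on the crosswalk question raised in this thread, a flat mapping table gets you surprisingly far for simple records. The MODS paths and DCTERMS property choices below are illustrative assumptions, not an official crosswalk, and real MODS nesting would need fuller XPath handling than this sketch attempts.]

```python
# Sketch of a MODS-to-DCTERMS crosswalk: a mapping table applied with
# the stdlib ElementTree. Element choices are illustrative only.
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

# hypothetical mapping: MODS child path -> DCTERMS property local name
CROSSWALK = {
    "titleInfo/title": "title",
    "name/namePart": "creator",
    "originInfo/dateIssued": "issued",
    "abstract": "abstract",
}

def mods_to_dcterms(mods_xml: str) -> dict:
    """Return {dcterms property: [values]} for the mapped fields."""
    root = ET.fromstring(mods_xml)
    out = {}
    for path, prop in CROSSWALK.items():
        # expand each path step with the MODS namespace
        qpath = "/".join(f"{{{MODS_NS}}}{step}" for step in path.split("/"))
        for el in root.findall(qpath):
            if el.text:
                out.setdefault(prop, []).append(el.text.strip())
    return out

record = f"""
<mods xmlns="{MODS_NS}">
  <titleInfo><title>Learning Biology</title></titleInfo>
  <name><namePart>Smith, Jane</namePart></name>
  <originInfo><dateIssued>2007</dateIssued></originInfo>
</mods>"""

print(mods_to_dcterms(record))
```

An XSLT doing the same thing would be the production route; this just shows that the mapping itself is tabular.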
Re: [CODE4LIB] MODS and DCTERMS
On 5/4/2010 9:54 AM, Karen Coyle wrote:

> BIBO, which many people seem to like, has almost 200 data elements and classes, and is greatly lacking in some areas (e.g. maps, music).

What makes BIBO useful, in my limited experience, is that it integrates commonly used ontologies like FOAF and DCTERMS. Also, since it is an ontology for RDF description, you can supplement it with other vocabularies for specific cases that BIBO doesn't handle. As Ross Singer posted just as I was writing this:

On 5/4/2010 9:57 AM, Ross Singer wrote:

> In RDF, you can pull in predicates from other namespaces, where the attributes you're looking for may be defined.

What's nice about this is that it works sort of like how namespaces are *supposed* to work in XML: that is, an agent that comes along and grabs your triples will parse the assertions from vocabularies it understands and ignore those it doesn't.

It's important that we don't look at BIBO or any other bibliographic ontology as an uber-vocabulary. One of the many elegant features of RDF, IMHO, is that each specialization can contribute its own vocabulary; e.g., general vocabularies like FOAF, DCTERMS, and BIBO can be refined by more domain-specific vocabularies like the music ontology [1], or ontologies for describing archival collections, sheet music, maps...

In fact, having only 200 properties and classes gives BIBO an advantage: it's easy to grok and plays nicely with other vocabularies, which can do the heavy lifting for specific resources. I feel it makes the most sense to let domain specialists create domain-specific vocabularies rather than try to cover every conceivable situation in one vocabulary written by a centralized body.

One last thought... BIBO in particular is developed by a community. There is an active listserv [2] and the project leads are very receptive to comment. If there is something important missing, let's help them.

[1] http://musicontology.com/
[2] http://bibliontology.com/community

Aaron
Re: [CODE4LIB] MODS and DCTERMS
On Tue, May 4, 2010 at 10:26 AM, Mike Taylor <m...@indexdata.com> wrote:

> Ross, I think that got mangled in the sending -- either that, or it's some strange format that I've never seen before. That said, I am tremendously impressed by all the information you obtained there. What software did you use, how much of this did you have to feed it by hand, and how much did it intuit from existing structured datasets?

Oh, that's probably not mangled -- that's probably just how Turtle looks. :) I'll also send it as RDF/XML.

That graph was compiled by a Google Scholar search on "Mike Taylor dinosaur", the Ingenta page describing your article, a text editor (TextMate) and 30 minutes of my life I'll never get back.

Ok, here's the graph as RDF/XML:

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:bibo="http://purl.org/ontology/bibo/"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
  <bibo:AcademicArticle rdf:nodeID="article1">
    <dcterms:abstract xml:lang="en">Xenoposeidon proneneukos gen. et sp. nov. is a neosauropod represented by BMNH R2095, a well-preserved partial mid-to-posterior dorsal vertebra from the Berriasian-Valanginian Hastings Beds Group of Ecclesbourne Glen, East Sussex, England. It was briefly described by Lydekker in 1893, but it has subsequently been overlooked. This specimen's concave cotyle, large lateral pneumatic fossae, complex system of bony laminae and camerate internal structure show that it represents a neosauropod dinosaur. However, it differs from all other sauropods in the form of its neural arch, which is taller than the centrum, covers the entire dorsal surface of the centrum, has its posterior margin continuous with that of the cotyle, and slopes forward at 35 degrees relative to the vertical. Also unique is a broad, flat area of featureless bone on the lateral face of the arch; the accessory infraparapophyseal and postzygapophyseal laminae which meet in a V; and the asymmetric neural canal, small and round posteriorly but large and teardrop-shaped anteriorly, bounded by arched supporting laminae. The specimen cannot be referred to any known sauropod genus, and clearly represents a new genus and possibly a new `family'. Other sauropod remains from the Hastings Beds Group represent basal Titanosauriformes, Titanosauria and Diplodocidae; X. proneneukos may bring to four the number of sauropod `families' represented in this unit. Sauropods may in general have been much less morphologically conservative than is usually assumed. Since neurocentral fusion is complete in R2095, it is probably from a mature or nearly mature animal. Nevertheless, size comparisons of R2095 with corresponding vertebrae in the Brachiosaurus brancai holotype HMN SII and Diplodocus carnegii holotype CM 84 suggest a rather small sauropod: perhaps 15 m long and 7600 kg in mass if built like a brachiosaurid, or 20 m and 2800 kg if built like a diplodocid.</dcterms:abstract>
    <dcterms:creator rdf:nodeID="author1"/>
    <dcterms:creator rdf:nodeID="author2"/>
    <dcterms:isPartOf rdf:nodeID="journal1"/>
    <dcterms:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2007-11</dcterms:issued>
    <dcterms:language rdf:resource="http://purl.org/NET/marccodes/languages/eng#lang"/>
    <dcterms:subject rdf:resource="http://id.loc.gov/authorities/sh85038094#concept"/>
    <dcterms:subject rdf:resource="http://id.loc.gov/authorities/sh85097127#concept"/>
    <dcterms:subject rdf:resource="http://id.loc.gov/authorities/sh85117730#concept"/>
    <dcterms:title xml:lang="en">AN UNUSUAL NEW NEOSAUROPOD DINOSAUR FROM THE LOWER CRETACEOUS HASTINGS BEDS GROUP OF EAST SUSSEX, ENGLAND</dcterms:title>
    <bibo:authorList>
      <rdf:Description>
        <rdf:first rdf:nodeID="author1"/>
        <rdf:rest>
          <rdf:Description>
            <rdf:first rdf:nodeID="author2"/>
            <rdf:rest rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#nil"/>
          </rdf:Description>
        </rdf:rest>
      </rdf:Description>
    </bibo:authorList>
    <bibo:doi>10./j.1475-4983.2007.00728.x</bibo:doi>
    <bibo:issue rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">6</bibo:issue>
    <bibo:numPages rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">18</bibo:numPages>
    <bibo:pageEnd rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1564</bibo:pageEnd>
    <bibo:pageStart rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1547</bibo:pageStart>
    <bibo:pages>1547-1564</bibo:pages>
    <bibo:volume rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">50</bibo:volume>
  </bibo:AcademicArticle>
  <bibo:Journal rdf:nodeID="journal1">
    <dcterms:publisher rdf:nodeID="publisher1"/>
    <dcterms:title>Palaeontology</dcterms:title>
    <bibo:issn>0031-0239</bibo:issn>
    <foaf:homepage rdf:resource="http://www3.interscience.wiley.com/journal/118531917/home?CRETRY=1&amp;SRETRY=0"/>
  </bibo:Journal>
</rdf:RDF>
Re: [CODE4LIB] MODS and DCTERMS
> Let me give another example: the Open Library API returns a JSON tree, e.g. http://openlibrary.org/books/OL1M.json But what schema is this? And if it doesn't conform to a standard schema, does that make it useless? If it were based on DCTERMS, at least I'd have a reference at http://dublincore.org/documents/dcmi-terms/ to define the semantics being used (and an RDF namespace at http://purl.org/dc/terms/ to boot).

Ah, after my own heart! I have tried to convince the OL folks to translate their data to dcterms, even did a crosswalk for them. Right now they're in panic mode over a major milestone, but once that's over I may ping you to make this request directly to them on one of their lists. If they only hear it from me, it might just be a personal quirk of mine, right?

See, we're on the same page after all. :-) Considering one of my primary use cases is direct interoperation with Open Library, then yes, I'm all over it.

I'll at least harass Edward and the OL list that DC output is important to others beyond just you alone.

I was starting to get discouraged, but now I realize that many of you thought I was proposing DCTERMS as a replacement for MARC; not at all. Imagine Open Library's internal data schema being an easily-serializable model based on DCTERMS. Now imagine every library has a queryable API exactly like theirs. That's where I'm going, and I think (answering my own question above) that it *is* potentially useful.

p.s. The JSON API output doesn't require any programming when it uses their data elements; it's doing the crosswalk to dcterms that's been the hangup. Then again... their code is open source, and the crosswalk I did is linked from the launchpad entry here [1], so if anyone wants to contribute...

Unfortunately I'm not adept at Python, so writing the code by hand is probably a bit beyond me at this point. But it might make a fun learn-Python-in-a-rainy-weekend project.

MJ
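[Editor's note: to sketch the idea discussed here, mapping Open Library's JSON keys onto DCTERMS properties is essentially a small dictionary exercise. The OL key names and the property choices below are assumptions for illustration, and this is not the crosswalk Karen mentions; the sample record is invented.]

```python
# Sketch: map a few Open Library-style JSON keys to DCTERMS properties.
# Key and property choices are illustrative assumptions only.
OL_TO_DCTERMS = {
    "title": "dcterms:title",
    "publish_date": "dcterms:issued",
    "publishers": "dcterms:publisher",
    "subjects": "dcterms:subject",
}

def ol_to_dcterms(record: dict) -> dict:
    out = {}
    for key, prop in OL_TO_DCTERMS.items():
        if key in record:
            value = record[key]
            # normalize scalars to lists so every property is repeatable
            out[prop] = value if isinstance(value, list) else [value]
    return out

sample = {"title": "The Great Gatsby",
          "publish_date": "1925",
          "publishers": ["Scribner"]}
print(ol_to_dcterms(sample))
```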
Re: [CODE4LIB] MODS and DCTERMS
No apologies required -- your dissection of the (very important) differences between MODS and DCTERMS, both in concept and format, was extremely enlightening and helpful, as was all the other input. Any misunderstandings are much more my fault for not being clearer when Ross asked what my use case was. I also made the mistake of referencing RDF, which I (now better) understand incorporates a whole universe of world-views that unnecessarily complicated things.

Much learned, and as always, much obliged.

MJ

On 2010-05-04, at 3:48 PM, Corey Harper wrote:

> Thank you for this clarification, MJ. I apologize for my initial reaction that there was little value here. Knowing the use-case you define below, I think there's a great deal of value.
>
> Beyond just the pragmatic short-term gains, I think a development like this would help pin-point those areas where said schema functionally requires semantics beyond those in DCTERMS. All the better if some of those terms just happen to be available in Bibliontology or some other namespace...
>
> Thanks again,
> -Corey
>
> MJ Suhonos wrote:
>
>> Let me give another example: the Open Library API returns a JSON tree, e.g. http://openlibrary.org/books/OL1M.json But what schema is this? And if it doesn't conform to a standard schema, does that make it useless? If it were based on DCTERMS, at least I'd have a reference at http://dublincore.org/documents/dcmi-terms/ to define the semantics being used (and an RDF namespace at http://purl.org/dc/terms/ to boot).
>>
>> Ah, after my own heart! I have tried to convince the OL folks to translate their data to dcterms, even did a crosswalk for them. Right now they're in panic mode over a major milestone, but once that's over I may ping you to make this request directly to them on one of their lists. If they only hear it from me, it might just be a personal quirk of mine, right?
>>
>> See, we're on the same page after all. :-) Considering one of my primary use cases is direct interoperation with Open Library, then yes, I'm all over it.
>>
>> I'll at least harass Edward and the OL list that DC output is important to others beyond just you alone.
>>
>> I was starting to get discouraged, but now I realize that many of you thought I was proposing DCTERMS as a replacement for MARC; not at all. Imagine Open Library's internal data schema being an easily-serializable model based on DCTERMS. Now imagine every library has a queryable API exactly like theirs. That's where I'm going, and I think (answering my own question above) that it *is* potentially useful.
>>
>> p.s. The JSON API output doesn't require any programming when it uses their data elements; it's doing the crosswalk to dcterms that's been the hangup. Then again... their code is open source, and the crosswalk I did is linked from the launchpad entry here [1], so if anyone wants to contribute...
>>
>> Unfortunately I'm not adept at Python, so writing the code by hand is probably a bit beyond me at this point. But it might make a fun learn-Python-in-a-rainy-weekend project.
>>
>> MJ

--
Corey A Harper
Metadata Services Librarian
New York University Libraries
20 Cooper Square, 3rd Floor
New York, NY 10003-7112
212.998.2479
corey.har...@nyu.edu
Re: [CODE4LIB] MODS and DCTERMS
Hi MJ,

> - for that matter, is there a good example of how to properly serialize DCTERMS for, e.g., a converted MARC/MODS record in XML (or RDF/XML)? I see, e.g., http://dublincore.org/documents/dcq-rdf-xml/ which has been replaced by http://dublincore.org/documents/dc-rdf/ but I'm not sure if the latter obviates the former entirely? Also, the examples at the bottom of the latter don't show, e.g., repeated elements or DCMES elements. Do we abandon http://purl.org/dc/elements/1.1/ entirely?

This has always been ridiculously confusing! Here's my understanding (though anyone else, please chime in and correct me if I've misunderstood):

- With the maturation of the DCMI Abstract Model http://dublincore.org/documents/abstract-model/, new bindings were needed to express features of the model not obvious in the old RDF, XML, and XHTML bindings.

- For RDF, http://dublincore.org/documents/dc-rdf/ is stable and fully intended to replace http://dublincore.org/documents/dcq-rdf-xml/.

- For XML (the non-RDF sort), the most current document is http://dublincore.org/documents/dc-ds-xml/, though note its status is still (after 18 months) only a proposed recommendation. This document itself replaces a transition document http://dublincore.org/documents/2006/05/29/dc-xml/ from 2006 that never got beyond Working Draft status. To get a stable XML binding, you have to go all the way back to 2003 http://dublincore.org/documents/dc-xml-guidelines/index.shtml, a binding which predates much of the current DCMI Abstract Model.

- Many found the 2003 XML binding unsatisfactory in that it prescribed the format for individual dc and dcterms properties, but not a full XML format -- that is, there was no DC-sanctioned XML root element for a qualified DC record. (This gets at the very heart of the difference in perspective between RDF and XML, properties and elements, etc., I think, but I digress...) The folks I'm aware of that developed workarounds for this were those sharing QDC over OAI-PMH. I find the UIUC OAI registry http://oai.grainger.uiuc.edu/registry/ helpful for investigations of this sort. A quick glance at their report on Distinct Metadata Schemas used in OAI-PMH data providers http://oai.grainger.uiuc.edu/registry/ListSchemas.asp seems to suggest that CONTENTdm uses this schema for QDC http://epubs.cclrc.ac.uk/xsd/qdc.xsd and DSpace uses this one http://dublincore.org/schemas/xmls/qdc/2006/01/06/dcterms.xsd. The latter doesn't actually define a root element either, but since here at least the QDC is inside the wrappers the OAI-PMH response requires, it's well-formed. What someone does with that once they get it and unpack it, I don't know, since without a container it won't be well-formed XML. The former goes through several levels of importing other things and eventually ends up importing from an .xsd on the Dublin Core site, but they define a root element themselves along the way. (I think.)

- So what does one do? I guess it depends on who the target consumers of this data are. If you're looking to work with more traditional library environments, perhaps those that are using CONTENTdm, etc., the legacy hack-ish format might be the best. (I'm part of an initiative to revitalize the Sheet Music Consortium http://digital.library.ucla.edu/sheetmusic/ and lots of our potential contributors are CONTENTdm users, so I think this is the direction I'm going to take that project.) But if you're wanting to talk to DCMI-style folks, the dc-ds-xml, or more likely the dc-rdf, option seems more attractive. I'm afraid I'm not much help with the implementation details of dc-rdf, though. One of the DC mailing lists would be, though, I suspect; there are a lot of active members there.

Ick, huh? :-)

Jenn

Jenn Riley
Metadata Librarian
Digital Library Program
Indiana University - Bloomington
Wells Library W501
(812) 856-5759
www.dlib.indiana.edu
Inquiring Librarian blog: www.inquiringlibrarian.blogspot.com
Re: [CODE4LIB] MODS and DCTERMS
I'm still confused about all this stuff too, but I've often seen the oai_dc format (for OAI-PMH) used as a 'standard' way to expose simple DC attributes.

One thing I was confused about was whether the oai_dc format _required_ the use of the old-style DC URIs, or also allowed the use of the DCTERMS URIs. Anyone know? I kind of think it actually requires the old-style DC URIs, as it was written before dcterms. At least it is one standardized way to expose the old basic DC elements, with a specific XML schema.

Jonathan

Riley, Jenn wrote:

> Hi MJ,
>
>> - for that matter, is there a good example of how to properly serialize DCTERMS for, e.g., a converted MARC/MODS record in XML (or RDF/XML)? I see, e.g., http://dublincore.org/documents/dcq-rdf-xml/ which has been replaced by http://dublincore.org/documents/dc-rdf/ but I'm not sure if the latter obviates the former entirely? Also, the examples at the bottom of the latter don't show, e.g., repeated elements or DCMES elements. Do we abandon http://purl.org/dc/elements/1.1/ entirely?
>
> This has always been ridiculously confusing! Here's my understanding (though anyone else, please chime in and correct me if I've misunderstood):
>
> - With the maturation of the DCMI Abstract Model http://dublincore.org/documents/abstract-model/, new bindings were needed to express features of the model not obvious in the old RDF, XML, and XHTML bindings.
>
> - For RDF, http://dublincore.org/documents/dc-rdf/ is stable and fully intended to replace http://dublincore.org/documents/dcq-rdf-xml/.
>
> - For XML (the non-RDF sort), the most current document is http://dublincore.org/documents/dc-ds-xml/, though note its status is still (after 18 months) only a proposed recommendation. This document itself replaces a transition document http://dublincore.org/documents/2006/05/29/dc-xml/ from 2006 that never got beyond Working Draft status. To get a stable XML binding, you have to go all the way back to 2003 http://dublincore.org/documents/dc-xml-guidelines/index.shtml, a binding which predates much of the current DCMI Abstract Model.
>
> - Many found the 2003 XML binding unsatisfactory in that it prescribed the format for individual dc and dcterms properties, but not a full XML format -- that is, there was no DC-sanctioned XML root element for a qualified DC record. (This gets at the very heart of the difference in perspective between RDF and XML, properties and elements, etc., I think, but I digress...) The folks I'm aware of that developed workarounds for this were those sharing QDC over OAI-PMH. I find the UIUC OAI registry http://oai.grainger.uiuc.edu/registry/ helpful for investigations of this sort. A quick glance at their report on Distinct Metadata Schemas used in OAI-PMH data providers http://oai.grainger.uiuc.edu/registry/ListSchemas.asp seems to suggest that CONTENTdm uses this schema for QDC http://epubs.cclrc.ac.uk/xsd/qdc.xsd and DSpace uses this one http://dublincore.org/schemas/xmls/qdc/2006/01/06/dcterms.xsd. The latter doesn't actually define a root element either, but since here at least the QDC is inside the wrappers the OAI-PMH response requires, it's well-formed. What someone does with that once they get it and unpack it, I don't know, since without a container it won't be well-formed XML. The former goes through several levels of importing other things and eventually ends up importing from an .xsd on the Dublin Core site, but they define a root element themselves along the way. (I think.)
>
> - So what does one do? I guess it depends on who the target consumers of this data are. If you're looking to work with more traditional library environments, perhaps those that are using CONTENTdm, etc., the legacy hack-ish format might be the best. (I'm part of an initiative to revitalize the Sheet Music Consortium http://digital.library.ucla.edu/sheetmusic/ and lots of our potential contributors are CONTENTdm users, so I think this is the direction I'm going to take that project.) But if you're wanting to talk to DCMI-style folks, the dc-ds-xml, or more likely the dc-rdf, option seems more attractive. I'm afraid I'm not much help with the implementation details of dc-rdf, though. One of the DC mailing lists would be, though, I suspect; there are a lot of active members there.
>
> Ick, huh? :-)
>
> Jenn
>
> Jenn Riley
> Metadata Librarian
> Digital Library Program
> Indiana University - Bloomington
> Wells Library W501
> (812) 856-5759
> www.dlib.indiana.edu
> Inquiring Librarian blog: www.inquiringlibrarian.blogspot.com
Re: [CODE4LIB] MODS and DCTERMS
Out of curiosity, what is your use case for turning this into DC? That might help those of us who are struggling to figure out where to start with trying to help you with an answer.

-Ross.

On Mon, May 3, 2010 at 11:46 AM, MJ Suhonos <m...@suhonos.ca> wrote:

> Thanks for your comments, guys. I was beginning to think the lack of response indicated that I'd asked something either heretical or painfully obvious. :-)
>
>> That's my understanding as well. oai_dc predates the defining of the 15 legacy DC properties in the dcterms namespace, and it's my guess nobody saw a reason to update the oai_dc definition after this happened.
>
> This is at least part of my use case -- we do a lot of work with OAI on both ends, and oai_dc is pretty limited due to the original 15 elements. My thinking at this point is that there's no reason we couldn't define something like oai_dcterms and use the full QDC set based on the updated profile. Right?
>
> FWIW, I'm not limited to any legacy ties; in fact, my project is aimed at pushing the newer, DC-sanctioned ideas forward, so I suspect in my case using an XML serialization that validates against http://purl.org/dc/terms/ is probably sufficient (whether that's RDF or not doesn't matter at this point).
>
> So, back to the other part of the question: has anybody seen a MODS-to-DCTERMS crosswalk in the wild? It looks like there's a lot of similarity between the two, but before I go too deep down that rabbit hole, I'd like to make sure someone else hasn't already experienced that, erm, joy.
>
> MJ
Re: [CODE4LIB] MODS and DCTERMS
dcterms so so terribly lossy that it would be a shame to reduce MARC to it. This is *precisely* the other half of my rationale — a shame? Why? If MARC is the mind prison that some purport it to be, then let's see what a system built devoid of MARC, but based on the best alternative we have looks like. That may well *not* be DCTERMS, but I do like the DCAM model, and there are plenty of non-library systems out there that speak simple DC (OAI-PMH is one example from this thread alone). Being conceptually RDF-compatible is just a bonus for me. This would be an incentive for them to at least consider implementing DCTERMS, which may be terribly lossy compared to MARC, but is a huge increase in expressivity compared to simple DC. Integrating MARC-based records and DC-based records from OAI sources in a single database could be a useful thing to play with. What we need, ASAP, is a triple form of MARC (and I know some folks have experimented with this...) and a translate from MARC to the RDA elements that have been registered in RDF. However, I hear that JSC is going to be adding more detail to the RDA elements so that could mean changes coming down the pike. I am interested in working on MARC as triples, which I see as a transformation format. I have a database of MARC elements that might be a crude basis for this. This seems like it's looking to accomplish different goals than I am, but obviously if there's a MARC-as-triples intermediary that's workable *today* then I'd be happy to use that instead. But I wonder: how navigable is it by people who don't understand MARC? How much loss is potentially involved? QDC basically represents the same things has dcterms, so you can probably just take the existing XSLT and hack on it until it until it represents something that looks more like dcterms than qdc. Yeah, that might be easier than mapping from MODS, though I'll have to see how much I can look at a MARC-based XSLT before my brain melts. 
Hopefully it wouldn't take *too* much work. That won't address the issue of breaking up the MARC into individual resources, however. You mention that you are looking for the short hop to RDF, but this is just going to give you a big pile of literals for things like creator/contributor/subject, etc. I'm not really sure what the win would be, there. Well, a MARC-as-triples approach would suffer from the same problem just as much, at least initially. I think the issue of converting literals into URIs is an important second step, but let's get the literals into a workable format first. I should clarify that my ultimate goal isn't to find a magical easy way to RDF, but rather to try to realize a way for libraries to get their data into a format that others are able and willing to play with. I'm betting on the notion that the majority of (presumably non-librarian) users would rather have incomplete data in a format that they can understand and manipulate, rather than have to learn MARC. I certainly would, and I'm a librarian (though probably a poor one because I don't understand or highly value MARC). Naive? Heretical? Probably. But worth a shot, I think. MJ
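To make the "big pile of literals" point concrete, here is a minimal sketch (Python, stdlib only) of the naive MARC-to-DCTERMS direction; the three-entry tag map is a toy stand-in for a real crosswalk, and everything it doesn't know about simply falls on the floor, which is exactly the lossiness being discussed:

```python
# Toy tag-to-property map; a real crosswalk (e.g. LC's MARC-to-DC mapping)
# covers far more tags and consults indicators and subfields.
TAG_MAP = {
    "245": "http://purl.org/dc/terms/title",
    "100": "http://purl.org/dc/terms/creator",
    "650": "http://purl.org/dc/terms/subject",
}

def record_to_triples(subject_uri, fields):
    """fields: (tag, string value) pairs from an already-parsed record.
    Yields (subject, predicate, literal) triples; unmapped tags are dropped."""
    for tag, value in fields:
        predicate = TAG_MAP.get(tag)
        if predicate is not None:
            yield (subject_uri, predicate, value)

triples = list(record_to_triples("http://example.org/123", [
    ("245", "Learning Biology"),
    ("100", "Bar, Foo"),
    ("035", "(OCoLC)12345678"),   # no mapping: silently lost
]))
```

Every object here is a bare literal; swapping literals for URIs (the "important second step") would mean reconciling "Bar, Foo" against an authority before emitting the triple.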
Re: [CODE4LIB] MODS and DCTERMS
NB: When Karen Coyle, Eric Morgan, and Roy Tennant all reply to your thread within half an hour of each other, you know you've hit the big time. Time to retire young I think. That would be Eric *Lease* Morgan — oh my god, you're right! I'm already losing data! It *is* insidious! I repent! MJ
Re: [CODE4LIB] MODS and DCTERMS
On 5/3/2010 1:55 PM, Karen Coyle wrote: 1. MARC the data format -- too rigid, needs to go away 2. MARC21 bib data -- very detailed, well over 1,000 different data elements, some well-coded data (not all); unfortunately trapped in #1 For the sake of my own understanding, I would love an explanation of the distinction between #1 and #2... Re: #2, how is bibliographic data encoded in MARC any different than bibliographic data encoded in some other format? Without the encoding format, you just have a pile of strings, right? I agree that we have lots of rich bibliographic data encoded in MARC and it is an exciting possibility to move it out of MARC into other, more flexible formats. Why, then, do we need to migrate the 'elements' of the encoding format as well? Taking one look at MARCXML makes it clear that the structure of MARC is not well suited to contemporary, *interoperable*, data formats. Is there something specific to MARC that is not potentially covered by MODS/DCTERMS/BIBO/??? that I'm missing? Thanks, Aaron
Re: [CODE4LIB] MODS and DCTERMS
On Mon, May 3, 2010 at 2:40 PM, MJ Suhonos m...@suhonos.ca wrote: Yes, even to me as a librarian but not a cataloguer, many (most?) of these elements seem like overkill. I have no doubt there is an edge-case for having this fine level of descriptive detail, but I wonder: a) what proportion of records have this level of description b) what kind of (or how much) user access justifies the effort in creating and preserving it On many levels, I agree. Or I wish I could. If you look at a business model like Amazon, for example, it's easy to imagine that their overriding goal is, "Make the easy-to-find stuff ridiculously easy to find." The revenue they get from someone finding an edge-case book is exactly the same as the revenue they get from someone buying Harry Potter. The ROI is easy to think about. But I work in an academic library. In a lot of ways, our *primary audience* is some grad student 12 years from now who needs one trivial piece of crap to make it all come together in her head. I know we have thousands of books that have never been looked at, but computing the ROI on someone being able to see them some day is difficult. Maybe it's zero. Maybe not. We just can't tell. Now, none of this is to say that MARC/AACR2 is necessarily the best (or even a good) way to go about making these works findable. I'm just saying that evaluating the edge cases in terms of user access is a complicated business. -Bill- -- Bill Dueber Library Systems Programmer University of Michigan Library
Re: [CODE4LIB] MODS and DCTERMS
Although I agree with Roy's suggestion that librarians not gloat about our metadata, the notion that the value of a data element can be elicited from the frequency of its use in the overall domain of library materials is misleading and contrary to the report Roy cites. The sub-section of that very useful and informative OCLC report is very good on this point. Section 2, MARC Tag Usage in WorldCat, by Karen Smith-Yoshimura, clearly lays out the data in the context of WorldCat and the cataloging practice of the OCLC members. Library holdings are dominated by texts, and in terms of titles cataloged, texts are dominated by books. This preponderance of books tilts the ratios of use per individual data elements. Many data elements pertain to a specific form of material (manuscripts, for instance). Others pertain to specific content (musical notation, for instance). Some pertain to both (manuscript scores, for instance). Within the total aggregate of library materials, data elements that are specific per material or content do not rise in usage rates to anything near 20% of the aggregate total of titles. Yet these elements are necessary or valuable to those wishing to discover and use the materials, and when one recalls that 1% use rates in WorldCat equal about 1,000,000 titles, the usefulness of many MARC data elements can be seen as widespread. According to the report, 69 MARC tags occur in more than 1% of the records in WorldCat. That is quite a few more than Roy's 11, but even accounting for Karen's data elements being equivalent to the number of MARC sub-fields, this is far fewer than the 1,000 data elements available to a cataloger in MARC.
Matthew Beacom

By the way, the descriptive fields used in more than 20% of the MARC records in WorldCat are:

245 Title statement 100%
260 Imprint statement 96%
300 Physical description 91%
100 Main entry - personal name 61%
650 Subject added entry - topical term 46%
500 General note 44%
700 Added entry - personal name 28%

They answer, more or less, a few basic questions a user might have about the material: What is it called? Who made it? When was it made? How big is it? What is it about? Answers to the question, How can I get it?, are usually given in the associated MARC holdings record. -Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Roy Tennant Sent: Monday, May 03, 2010 2:15 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] MODS and DCTERMS I would even argue with the statement "very detailed, well over 1,000 different data elements, some well-coded data (not all)". There are only 11 (yes, eleven) MARC fields that appear in 20% or more of MARC records currently in WorldCat[1], and at least three of those elements are control numbers or other elements that contribute nothing to actual description. I would say overall that we would do well to not gloat about our metadata until we've reviewed the facts on the ground. Luckily, now we can. Roy [1] http://www.oclc.org/research/publications/library/2010/2010-06.pdf On Mon, May 3, 2010 at 11:03 AM, Eric Lease Morgan emor...@nd.edu wrote: On May 3, 2010, at 1:55 PM, Karen Coyle wrote: 1. MARC the data format -- too rigid, needs to go away 2. MARC21 bib data -- very detailed, well over 1,000 different data elements, some well-coded data (not all); unfortunately trapped in #1 The differences between the two points enumerated above, IMHO, seem to be at the heart of the never-ending debate between computer types and cataloger types when it comes to library metadata. The non-library computer types don't appreciate the value of human-aided systematic description.
And the cataloger types don't understand why MARC is a really terrible bit bucket, especially considering the current environment. All too often the two camps don't know to what the other is speaking. MARC must die. Long live MARC. -- Eric Lease Morgan
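The WorldCat percentages being argued over are per-record presence counts, which is worth keeping in mind when interpreting them. A sketch of the arithmetic (Python; the three toy records are invented for illustration):

```python
from collections import Counter

def tag_frequencies(records):
    """records: iterable of tag lists, one per bibliographic record.
    A tag is counted at most once per record, so the result is the share
    of records containing the field, not total occurrences."""
    total = 0
    counts = Counter()
    for tags in records:
        total += 1
        counts.update(set(tags))
    return {tag: count / total for tag, count in counts.items()}

freqs = tag_frequencies([
    ["245", "100", "650"],
    ["245", "100"],
    ["245", "500", "500"],   # repeated 500 still counts once for this record
])
```

Under this counting, a field used in every music record but never elsewhere scores low overall, which is exactly Matthew's point about books tilting the ratios.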
Re: [CODE4LIB] MODS and DCTERMS
Quoting Beacom, Matthew matthew.bea...@yale.edu: According to the report, 69 MARC tags occur in more than 1% of the records in WorldCat. That is quite a few more than Roy's 11, but even accounting for Karen's data elements being equivalent to the number of MARC sub-fields this is far fewer than the 1,000 data elements available to a cataloger in MARC. So much depends on how you count things, so at the http://kcoyle.net/rda/ site I have put two MARC-related files. The first is just a list of elements (variable subfields) in alpha order with duplicates removed. Yes, I realize how imperfect this is, and that we will need to look beyond names to *meaning* of elements to determine what we really have. This file does not include indicators, and sometimes indicators really do create a separate element, as when a personal name becomes a family name based on its indicator. That file has over 560 entries. The next file probably needs some more thought, but it is a list of the variable field indicators and subfields, leaving in subfields that are duplicated in different fields. I removed some of the numeric subfields that didn't seem to result in an actual element (2, 3, 5, 6, 8), but could be wrong about that. I also did not include indicators that are = Undefined. We can debate whether a personal name in an added entry is the same element as a personal name in a subject heading, and similarly for the various places where geographic names are used, titles, etc etc etc. This is the analysis that is needed to reduce MARC21 to a cleaner set of data elements. That file has 1421 entries. Neither of these contains any of the fixed field elements (many of which, IMO, should replace textual elements now carried in MARC21).
When I looked at the fixed fields (and this is reported at http://futurelib.pbworks.com/Data+and+Studies), I came up with this count of *unique* fixed field elements (each with multiple values): 008 - 58, 007 - 55. Each one of these should become a controlled value list in a SemWeb implementation of MARC. RDA appears to have a total of 68 defined value lists, but I don't believe that those include ones defined elsewhere, such as languages, country codes, etc. kc p.s. linked from that same page is the file I am using for this analysis, in CSV format, if anyone else wants to play with it. I have tried to keep it up to date with MARBI proposals. -- Karen Coyle kco...@kcoyle.net http://kcoyle.net ph: 1-510-540-7596 m: 1-510-435-8234 skype: kcoylenet
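Karen's two counts differ mainly in whether duplicates are collapsed, which is easy to reproduce once her CSV is in hand. A sketch (Python, stdlib; the four sample rows and the field/subfield/name column layout are my assumption, not her actual file format):

```python
import csv
import io

# Stand-in for the CSV linked from the futurelib page -- assumed columns:
# field, subfield, element name. The real file's layout may differ.
sample = """\
100,a,Personal name
700,a,Personal name
245,a,Title
600,a,Personal name
"""

rows = list(csv.reader(io.StringIO(sample)))
with_duplicates = len(rows)                        # per-field listing (1421-style count)
unique_names = len({name for _, _, name in rows})  # duplicates removed (560-style count)
```

As she notes, name identity is only a first approximation: whether 700$a and 600$a are "the same element" is a semantic question a set comprehension cannot answer.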
Re: [CODE4LIB] MODS and DCTERMS
Thanks, Matthew, for a much more nuanced and accurate depiction of the data. I would encourage anyone interested in this topic to spend some time with this report, which was one result of a great deal of work by many people in research institutions around the world. The findings and recommendations are well worth your time. Roy
Re: [CODE4LIB] MODS and DCTERMS
On May 3, 2010, at 2:47 PM, Aaron Rubinstein wrote: 1. MARC the data format -- too rigid, needs to go away 2. MARC21 bib data -- very detailed, well over 1,000 different data elements, some well-coded data (not all); unfortunately trapped in #1 For the sake of my own understanding, I would love an explanation of the distinction between #1 and #2... Item #1 The first item (#1) is MARC, the data structure -- a container for holding various types of bibliographic information. From one of my older publications [1]: ...the MARC record is a highly structured piece of information. It is like a sentence with a subject, predicate, objects, separated with commas, semicolons, and one period. In data structure language, the MARC record is a hybrid sequential/random access record. The MARC record is made up of three parts: the leader, the directory, the bibliographic data. The leader (or subject in our analogy) is always represented by the first 24 characters of each record. The numbers and letters within the leader describe the record's characteristics. For example, the length of the record is in positions 1 to 5. The type of material the record represents (authority, bibliographic, holdings, et cetera) is signified by the character at position 7. More importantly, the characters from positions 13 to 17 represent the base. The base is a number pointing to the position in the record where the bibliographic information begins. The directory is the second part of a MARC record. (It is the predicate in our analogy.) The directory describes the record's bibliographic information with directory entries. Each entry lists the types of bibliographic information (items called tags), how long the bibliographic information is, and where the information is stored in relation to the base. The end of the directory and all variable length fields are marked with a special character, the ASCII character 30. The last part of a MARC record is the bibliographic information. 
(It is the object in our sentence analogy.) It is simply all the information (and more) on a catalog card. Each part of the bibliographic information is separated from the rest with the ASCII character 30. Within most of the bibliographic fields are indicators and subfields describing in more detail the fields themselves. The subfields are delimited from the rest of the field with the ASCII character 31. The end of a MARC record is punctuated with an end-of-record mark, ASCII character 29. The ASCII characters 31, 30, and 29 represent our commas, semicolons, and periods, respectively. At the time, MARC -- the data structure -- was really cool. Consider the environment in 1965. No hard disks. Tape drives instead. Data storage was expensive. The medium had to be read from beginning to end. No (or rarely any) random data access. Thus, the record and field lengths were relatively short. (No MARC record can be longer than 99,999 characters, and no MARC field can be longer than 999 characters.) Remember too the purpose of MARC -- to transmit the content of catalog cards. Given the leader, the directory, and the bibliographic sections of a MARC record all preceded by pseudo checksums and delimited by non-printable ASCII characters, the MARC record -- the data structure -- comes with a plethora of checks and balances. Very nice. Fast forward to the present day. Disk space is cheap. Tapes are not the norm. More importantly, the wider computing environment uses XML as its data structure of choice. If libraries are about sharing information, then we need to communicate with the wider world in its language. The language of the Net is XML, not MARC. Not only is MARC -- the data structure -- stuck on 50-year-old technology, but more importantly it is not the language of the people with whom we want to share. Item #2 Our bibliographic data (item #2) is the metadata of the Web. While it is important, and it adds a great deal of value, it is not as important as it used to be. It too needs to change.
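Eric's anatomy of the record is concrete enough to execute. A toy reader (Python; the one-field sample record is fabricated, and real ISO 2709 processing is byte-oriented with more rules than this sketch honors) shows the leader, directory, and terminator mechanics:

```python
FT, SF, RT = chr(30), chr(31), chr(29)   # field / subfield / record terminators

def parse_marc(record):
    """Split a flat MARC-style string into its leader and tagged fields."""
    leader = record[:24]
    base = int(leader[12:17])            # base address of data (0-indexed slice)
    directory = record[24:base - 1]      # 12-char entries; an FT closes the directory
    fields = {}
    for i in range(0, len(directory), 12):
        tag = directory[i:i + 3]
        length = int(directory[i + 3:i + 7])    # length includes the trailing FT
        start = int(directory[i + 7:i + 12])
        data = record[base + start:base + start + length - 1]
        fields.setdefault(tag, []).append(data.split(SF))
    return leader, fields

# Fabricated one-field record: record length at leader 0-4, base address at 12-16.
record = ("00059nam a22" + "00037" + "   4500" +
          "245" + "0021" + "00000" + FT +        # one directory entry
          "10" + SF + "aLearning Biology" + FT + RT)
leader, fields = parse_marc(record)
```

The "checks and balances" Eric praises are visible here: the declared record length (59), the base address (37), and the per-field lengths must all agree, or the parse falls apart.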
Remember, MARC was originally designed to print catalog cards. Author. Title. Pagination. Series. Notes. Subject headings. Added entries. Looking back, these were relatively simple data elements, but what about system numbers? ISBN numbers? Holdings information? Tables of contents? Abstracts? Ratings? We have stuffed these things into MARC every which way and we call MARC flexible. More importantly, and as many have said previously, string values in MARC records lead to maintenance nightmares. Instead, like a relational database model, values need to be described using keys -- pointers -- to the canonical values. This makes find/replace operations painless and enables the use of different languages, as well as numerous other advantages. ISBD is also a pain. Take the following string: Kilgour, Frederick Gridley (1914–2006) There is way too much punctuation going on here. Yes,