Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All
Hi,

I summarized my thoughts about identifiers for data formats in a blog posting: http://jakoblog.de/2009/05/10/who-identifies-the-identifiers/

In short, it’s not a technology issue but a commitment issue, and the problem of identifying the right identifiers for data formats can be reduced to two fundamental rules of thumb:

1. reuse: don’t create new identifiers for things that already have one.
2. document: if you have to create an identifier, describe its referent as openly, clearly, and in as much detail as possible to make it reusable.

A format should be described with a schema (XML Schema, OWL etc.) or at least a standard. Mostly this schema already has a namespace or similar identifier that can be used for the whole format. For instance MODS Version 3 (currently 3.0, 3.1, 3.2, 3.4) has the XML Namespace http://www.loc.gov/mods/v3 so this is the best identifier to identify MODS.

If you need to identify a specific version then you should *first* check whether such identifiers already exist, *second* push the publisher (LOC) to assign official URIs for MODS versions if these do not already exist, or *third* create and document specific URIs and make sure that everyone knows about these identifiers. At the moment there are:

MODS Version 3    http://www.loc.gov/mods/v3
MODS Version 3.0  info:srw/schema/1/mods-v3.0
MODS Version 3.1  info:srw/schema/1/mods-v3.1
MODS Version 3.2  info:srw/schema/1/mods-v3.2
                  info:ofi/fmt:xml:xsd:mods
MODS Version 3.3  info:srw/schema/1/mods-v3.3

The SRU Schemas registry links the info:srw/schema/1/mods-v3* identifiers to their XML Schemas, which is very little documentation, but it links to http://www.loc.gov/mods/v3 at least in some way.

Ross wrote:

First, and most importantly, how do we reconcile these different identifiers for the same thing? Can we come up with some agreement on which ones we should really use?

Use the one that is documented best.
Secondly, and this gets to the reason why any of this was brought up in the first place, how can we coordinate these identifiers more effectively and efficiently to reuse among various specs and protocols, but not:

1) be tied to a particular community
2) require some laborious and lengthy submission and review process to just say hey, here's my FOAF available via UnAPI

The identifier for FOAF is http://xmlns.com/foaf/0.1/. Forget about identifiers that are not URIs. OAI-PMH at least includes a mechanism to map metadataPrefixes to official URIs, but this mechanism is not always used. If unAPI lacks a way to map a local name to a global URI, we had better fix unAPI to tell us:

    <?xml version="1.0" encoding="UTF-8"?>
    <formats xmlns="http://unapi.info/">
      <format name="foaf" uri="http://xmlns.com/foaf/0.1/"/>
    </formats>

unAPI should be revised and specified more strictly to become an RFC anyway. Yes, this requires a laborious and lengthy submission and review process, but there is no such thing as a free lunch.

3) be so lax that it throws all hope of authority out the window

Reuse existing authorities and document better to create authority.

I would expect the various communities to still maintain their own registries of approved data formats (well, OpenURL and SRU, anyway -- it's not as appropriate to UnAPI or Jangle).

There should be a distinction between descriptive registries, which only list identifiers and formats that are defined elsewhere, and authoritative registries, which define new identifiers and formats. The number of authoritatively defined identifiers should be small for a given API because an identifier should preferably be defined by the creator of the format instead of by a user of the format. If the creator does not support usable identifiers, then better talk to him instead of creating something in parallel.
Greetings, Jakob -- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
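To make the unAPI proposal above concrete, here is a minimal sketch of how a client could turn local format names into global URIs. It assumes the server answers with the XML suggested in this message; the `uri` attribute and the `http://unapi.info/` namespace come from that suggestion and are not something I can confirm is in unAPI as published. The sample response and the `format_uris` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical unAPI response following the proposal above: the `uri`
# attribute is the suggested extension, not confirmed unAPI behaviour.
RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<formats xmlns="http://unapi.info/">
  <format name="foaf" uri="http://xmlns.com/foaf/0.1/"/>
  <format name="mods" uri="http://www.loc.gov/mods/v3"/>
  <format name="local"/>
</formats>"""

UNAPI_NS = "{http://unapi.info/}"


def format_uris(xml_text):
    """Map each local format name to its global URI (None if undeclared)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return {f.get("name"): f.get("uri") for f in root.iter(UNAPI_NS + "format")}


uris = format_uris(RESPONSE)
for name, uri in sorted(uris.items()):
    print(name, "->", uri or "(no global identifier declared)")
```

A format without a `uri` stays usable under its local name, but a client has no way to tell whether it is the same thing another server calls by that name, which is the point of the proposal.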
Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All
On Mon, 2009-05-11 at 11:31 +0100, Jakob Voss wrote:

A format should be described with a schema (XML Schema, OWL etc.) or at least a standard. Mostly this schema already has a namespace or similar identifier that can be used for the whole format.

This is unfortunately not the case.

For instance MODS Version 3 (currently 3.0, 3.1, 3.2, 3.4) has the XML Namespace http://www.loc.gov/mods/v3 so this is the best identifier to identify MODS.

And this is a perfect example of why this is not the case. The same mods schema (let alone namespace) defines TWO formats, mods and modsCollection. To quote from the schema:

    <!-- An instance of this schema is (1) a single MODS record: -->
    <xsd:element name="mods" type="modsType"/>
    <!-- or (2) a collection of MODS records: -->
    <xsd:element name="modsCollection">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element ref="mods" maxOccurs="unbounded"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>
    <!-- End of instance definition -->

So you're using the same identifier to identify two different things at the same time. We discussed this a lot during the development of SRU and there simply isn't an existing identifier for an XML 'format'.

Also consider the following more hypothetical, but perfectly feasible situations:

* One namespace is used to define two _totally_ separate sets of elements. There's no reason why this can't be done.
* One namespace defines so many elements that it's meaningless to call it a format at all. Even though the top level tag might be the same, the contents are so varied that you're unable to realistically process it.

Rob
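The point about mods versus modsCollection can be checked mechanically. The sketch below parses two tiny documents (made up for illustration, not valid complete MODS records) and shows that the root element's namespace is identical while the root element's local name differs. The `root_identity` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

# Toy documents: just enough structure to have a namespaced root element.
SINGLE = f'<mods xmlns="{MODS_NS}"><titleInfo><title>A</title></titleInfo></mods>'
COLLECTION = f'<modsCollection xmlns="{MODS_NS}"><mods/><mods/></modsCollection>'


def root_identity(xml_text):
    """Return (namespace, local name) of the document's root element."""
    tag = ET.fromstring(xml_text).tag  # ElementTree spells it "{namespace}local"
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return None, tag


single = root_identity(SINGLE)
collection = root_identity(COLLECTION)
print(single)
print(collection)
```

A client that dispatches on the namespace alone cannot tell these two documents apart; it needs at least the pair of namespace and root element name, or a separate identifier.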
Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All
On Mon, May 11, 2009 at 16:04, Rob Sanderson azar...@liverpool.ac.uk wrote: * One namespace is used to define two _totally_ separate sets of elements. There's no reason why this can't be done. As opposed to all the reasons for not doing it. :) This is crap design of a higher magnitude, and the designers should be either a) whipped in public and thrown out in shame, or b) repent and made to fix the problem. Even I would opt for the latter, but such a simple task not being done seems to suggest that perhaps the former needs to be put in place. * One namespace defines so many elements that it's meaningless to call it a format at all. Even though the top level tag might be the same, the contents are so varied that you're unable to realistically process it. Yeah, don't use MODS in general; it's a hack. It's even crazier still that many versions have the same namespace. What were they thinking?! Anyway, even if the namespace is botched, you can still (if I'll dare go by the Topic Maps moniker) have multiple namespaces for the same subject (the format in question), and simply publish and use your own and let the TM mechanics handle the ambiguity for you. If enough people do this, and perhaps even use your unofficial identifiers, maybe LOC will see the errors of their ways and repent. Regards, Alex -- --- Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps -- http://shelter.nu/blog/
Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All
On Mon, 2009-05-11 at 12:02 +0100, Alexander Johannesen wrote: On Mon, May 11, 2009 at 16:04, Rob Sanderson azar...@liverpool.ac.uk wrote: * One namespace is used to define two _totally_ separate sets of elements. There's no reason why this can't be done. As opposed to all the reasons for not doing it. :) This is crap design of a higher magnitude, and the designers should be either a) whipped in public and thrown out in shame, or b) repent and made to fix the problem. Even I would opt for the latter, but such a simple task not being done seems to suggest that perhaps the former needs to be put in place. I totally agree that it's an awful design choice. However it's a demonstration that XML namespaces _do not identify format_. And hence, we need another identifier which is not the namespace of the top level element. * One namespace defines so many elements that it's meaningless to call it a format at all. Even though the top level tag might be the same, the contents are so varied that you're unable to realistically process it. Yeah, don't use MODS in general; it's a hack. It's even crazier still that many versions have the same namespace. What were they thinking?! Or TEI for that matter. However I wouldn't call either of them a 'hack' and there are many people who do want to use both of these schemas. Therefore, again, we need another identifier. Q.E.D. Rob
[CODE4LIB] Amazon product API will require a crypto signature
The Amazon products API keeps changing its name, and has just been changed to Amazon Product Advertising API -- it's the one you use to look up books in Amazon and get metadata for them, though.

It looks from an email I got from Amazon that as of August 15th, you'll need to cryptographically sign requests to this API to have them responded to. It looks like kind of a pain. I think a bunch of people on this list may be using this API. Beware.

Instructions for how to cryptographically sign requests the way they want can be found here:

http://docs.amazonwebservices.com/AWSECommerceService/latest/DG/Query_QueryAuth.html
http://docs.amazonwebservices.com/AWSECommerceService/latest/DG/rest-signature.html

Like I said, it's looking like a pain to me. There are lots of details to get right. If you don't URI-escape _exactly_ the same way they do, it's not going to work. Etc.

Jonathan
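For anyone who wants a starting point, here is a rough sketch of the signing steps as I read the linked instructions: sort the parameters by name, percent-encode names and values per RFC 3986, build a string from the HTTP verb, host, path and query, and sign it with HMAC-SHA256 using your secret key. The host, path, parameter values and the fake credentials are assumptions for illustration, and the function names are made up; check the Amazon documentation before relying on any detail.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

UNRESERVED = "-_.~"  # besides letters and digits, the only bytes left unescaped


def canonical_query(params):
    """Sort parameters by name and percent-encode names and values."""
    pairs = sorted(params.items())
    return "&".join(
        f"{quote(k, safe=UNRESERVED)}={quote(v, safe=UNRESERVED)}" for k, v in pairs
    )


def sign_request(params, secret_key, host="webservices.amazon.com", path="/onca/xml"):
    """Return a signed GET URL (host and path are assumed defaults)."""
    query = canonical_query(params)
    string_to_sign = "\n".join(["GET", host, path, query])
    digest = hmac.new(
        secret_key.encode("utf-8"), string_to_sign.encode("utf-8"), hashlib.sha256
    ).digest()
    signature = base64.b64encode(digest).decode("ascii")
    # The signature itself must be escaped too ('+', '/' and '=' occur in base64).
    return f"http://{host}{path}?{query}&Signature={quote(signature, safe=UNRESERVED)}"


# Placeholder credentials and values, not real ones.
EXAMPLE = {
    "Service": "AWSECommerceService",
    "AWSAccessKeyId": "EXAMPLEKEYID",
    "Operation": "ItemLookup",
    "ItemId": "0306406152",
    "Timestamp": "2009-08-15T12:00:00Z",
}
url = sign_request(EXAMPLE, "not-a-real-secret")
print(url)
```

The escaping is where it tends to go wrong: a space has to become `%20` rather than `+`, and the colons in the timestamp have to become `%3A`, or the signature computed on the server side will not match.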
[CODE4LIB] Formats and its identifiers
Hi Rob,

You wrote:

A format should be described with a schema (XML Schema, OWL etc.) or at least a standard. Mostly this schema already has a namespace or similar identifier that can be used for the whole format.

This is unfortunately not the case.

It is mostly the case - but people like to misinterpret schemas and tailor them to their needs.

For instance MODS Version 3 (currently 3.0, 3.1, 3.2, 3.4) has the XML Namespace http://www.loc.gov/mods/v3 so this is the best identifier to identify MODS.

And this is a perfect example of why this is not the case. The same mods schema (let alone namespace) defines TWO formats, mods and modsCollection.

That's your interpretation. According to the schema, the MODS format *is* either a single mods-element or a modsCollection-element. That's exactly what you can refer to with the namespace identifier http://www.loc.gov/mods/v3. If you need to identify the specific element 'mods' of the format only, then you need another identifier. Up to now there is no default way to create an identifier for a specific element in an XML format, see http://www.w3.org/TR/webarch/#xml-fragids

But if the MODS specification defines that you can refer to any element with a URI fragment identifier, then the right identifier would be http://www.loc.gov/mods/v3#mods

You wrote:

I totally agree that it's an awful design choice. However it's a demonstration that XML namespaces _do not identify format_. And hence, we need another identifier which is not the namespace of the top level element.

The namespace http://www.loc.gov/mods/v3 of the top level element 'mods' does not identify the top level element but the MODS *format* (in any of the versions 3.0-3.4) itself. This format *includes* the top level element 'mods'.

Also consider the following more hypothetical, but perfectly feasible situations:

* One namespace is used to define two _totally_ separate sets of elements. There's no reason why this can't be done.

Ok, let A and B be two formats with two totally separate sets of elements (and rules for how to use them). If you put them into one namespace, then you get a new format C that is the union of A and B.

* One namespace defines so many elements that it's meaningless to call it a format at all. Even though the top level tag might be the same, the contents are so varied that you're unable to realistically process it.

Sad but true: the word format in the context of library applications does not make sense anyway in most cases. Technically a format is just a set of possible instances, defined as a formal language or with any other type of specification. The problem of library formats is that many people refer to them without providing a proper specification.

Coming back to the mods example: if the SRU Schema registry lists info:srw/schema/1/mods-v3.3 as the identifier for MODS Schema Version 3.3 with a pointer to the XML Schema http://www.loc.gov/standards/mods/v3/mods-3-3.xsd then *any* XML document that validates against this schema must be considered to be a MODS 3.3 document - either with 'mods' or with 'modsCollection' as root element.

Greetings
Jakob

--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de
Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All
Alexander Johannesen wrote: Yeah, don't use MODS in general; it's a hack. It's even crazier still that many versions have the same namespace. What were they thinking?! Um, MODS is awfully useful for a bunch of reasons. I'm not going to stop using it because they've used namespaces in a way you don't approve of. In the real world, we use things when they solve the problem in front of us in as easy a way as possible, bonus when they are actually standards used by a few other people (like MODS is). If you have the luxury to avoid using things that you don't believe are theoretically sound (and inter-operating with anyone who does use those things), good on you, I guess. Jonathan
Re: [CODE4LIB] Formats and its identifiers
On Mon, May 11, 2009 at 9:53 AM, Jakob Voss jakob.v...@gbv.de wrote:

That's your interpretation. According to the schema, the MODS format *is* either a single mods-element or a modsCollection-element. That's exactly what you can refer to with the namespace identifier http://www.loc.gov/mods/v3.

Agreed. The same is true, of course, of MARC and, by extension, MARCXML. Part of the format is that it can be one record or multiple. I don't think this is a particularly strong argument against using the namespace as an identifier.

The namespace http://www.loc.gov/mods/v3 of the top level element 'mods' does not identify the top level element but the MODS *format* (in any of the versions 3.0-3.4) itself. This format *includes* the top level element 'mods'.

I'm not really sure of the changes between MODS v.3.0-3.3 -- are they basically backwards and forwards compatible? I imagine there are a lot of cases where the client doesn't care what point release of MODS the thing is serialized as, just that it's MODS and that it can find generally what it's looking for in that structure, right?

-Ross.
Re: [CODE4LIB] Formats and its identifiers
On Mon, 2009-05-11 at 14:53 +0100, Jakob Voss wrote:

A format should be described with a schema (XML Schema, OWL etc.) or at least a standard. Mostly this schema already has a namespace or similar identifier that can be used for the whole format.

This is unfortunately not the case.

It is mostly the case - but people like to misinterpret schemas and tailor them to their needs.

You're advocating an approach that mostly works, as opposed to one that works in all cases?

For instance MODS Version 3 (currently 3.0, 3.1, 3.2, 3.4) has the XML Namespace http://www.loc.gov/mods/v3 so this is the best identifier to identify MODS.

And this is a perfect example of why this is not the case. The same mods schema (let alone namespace) defines TWO formats, mods and modsCollection.

That's your interpretation. According to the schema, the MODS format *is* either a single mods-element or a modsCollection-element.

According to the __schema__ yes. Not according to the namespace. The namespace is a collection of names only and says precisely nothing about structure.

And, yes, given no definition of format, my interpretation is that the mods schema defines two formats, as it defines two top level elements with different contents (eg one may contain the other). This is typically how people would define format in this context, I would say.

This is, of course, tangential to the fact that you cannot use the __XML Namespace__ as an identifier for the format, no matter how you define it.

That's exactly what you can refer to with the namespace identifier http://www.loc.gov/mods/v3.

No, that's a collection of elements, not a schema.

If you need to identify the specific element 'mods' of the format only, then you need another identifier.

Correct. I'm glad you agree with me. Given that namespaces do not specify anything to do with structure, you thus need a new identifier for EVERY element in a namespace as they could be used as the top level tag of ANY schema.
There isn't a widely accepted identifier system for schemas, only schema locations. There are also many methods for defining schemas (schematron, relax-ng, DTDs, xml schema) which can all define exactly the same format. But if the MODS specification defines that you can refer to any element with an URI fragment identifier, then the right identifier would be http://www.loc.gov/mods/v3#mods That would be an identifier for the *element*. The namespace http://www.loc.gov/mods/v3 of the top level element 'mods' does not identify the top level element but the MODS *format* (in any of the versions 3.0-3.4) itself. This format *includes* the top level element 'mods'. No, it identifies a collection of names. These names are structured according to a schema, which is what we need an identifier for. Beyond that, we may also need identifiers for which structure we mean within the schema (eg mods vs modsCollection) Rob
Re: [CODE4LIB] Amazon product API will require a crypto signature
They've also tightened up the API in various ways, and renamed it the Amazon.com Product Advertising API. Although I know of no case when Amazon has shut down a library, it would be hard for any to claim their site had as their principal purpose advertising and marketing the Amazon Site and driving sales of products and services on the Amazon Site.

I think it's a terrible mistake for them. Their marginal cost is zero; they don't need to do this. Data openness was a key factor in Amazon's rise. And that was when there were no other options. With viable other options just emerging -- Open Library, Google, at least -- now is hardly the time to make it less attractive.

Tim

On Mon, May 11, 2009 at 9:40 AM, Jonathan Rochkind rochk...@jhu.edu wrote:

The Amazon products API keeps changing its name, and has just been changed to Amazon Product Advertising API -- it's the one you use to look up books in Amazon and get metadata for them, though.

It looks from an email I got from Amazon that as of August 15th, you'll need to cryptographically sign requests to this API to have them responded to. It looks like kind of a pain. I think a bunch of people on this list may be using this API. Beware.

Instructions for how to cryptographically sign requests the way they want can be found here:

http://docs.amazonwebservices.com/AWSECommerceService/latest/DG/Query_QueryAuth.html
http://docs.amazonwebservices.com/AWSECommerceService/latest/DG/rest-signature.html

Like I said, it's looking like a pain to me. There are lots of details to get right. If you don't URI-escape _exactly_ the same way they do, it's not going to work. Etc.

Jonathan

--
Check out my library at http://www.librarything.com/profile/timspalding
Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All
On Mon, May 11, 2009 at 19:34, Jonathan Rochkind rochk...@jhu.edu wrote:

In the real world, we use things when they solve the problem in front of us in as easy a way as possible

And somehow you're suggesting that I don't live in the real world? :) Good try, but as far as I've experienced, people in the library world live quite a distance away from the real one.

Alex

--
--- Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps -- http://shelter.nu/blog/
Re: [CODE4LIB] Amazon product API will require a crypto signature
In fact, I believe that library-sector developers have asked Amazon and been told that their use is allowed. But definitely, there's no guarantee this will always continue to be true. The terms of use don't seem to have substantially changed to me, but they could always start enforcing them more strictly -- for new accounts created to use the Product Advertising API, it looks like there actually will be a manual review step where Amazon staff approves you or doesn't, which never existed before.

So, while I'm still using it, I'm also keeping in mind what backup plans I have if they ever ask me to stop. Here are the things I use Amazon API for, with alternates:

1) To take an ISBN, and look up more complete metadata for it. Alternatives:
   A) Google Books Data API (free for everyone; yes, there is a GBS API which is explicitly authorized for non-javascript access. GBS API will also allow you to find OCLCnums and LCCNs that correspond to an ISBN, when GBS has that data, which it often does thanks to the OCLC relationship.)
   B) WorldCat API (OCLC members)
   C) Books In Print API, although BiP seems to be making up their mind about whether they'll throw this in for free with an existing BiP online subscription, or charge extra for it.
   D) OpenLibrary? (Is this true?)

2) Cover images. Alternatives:
   A) CoverThing
   B) Google Books
   C) OpenLibrary

3) To find an ASIN, in order to make a link to the Amazon page. Ironically, this is actually what the API is _for_, and what Amazon would actually WANT you to do, but it's the thing that's least replaceable. If you have the ISBN, and if you assume the ASIN is the same as the ISBN, you don't need an API. This is often true, but not guaranteed to be true, and I think will become less true when the new ISBN-13 namespace starts to be used.
In my case, I use the ASIN to identify if Amazon has a search-inside and/or limited-excerpts available, but the API actually doesn't support that; I've been screen-scraping all along for that, once I have the ASIN.

Tim Spalding wrote:

They've also tightened up the API in various ways, and renamed it the Amazon.com Product Advertising API. Although I know of no case when Amazon has shut down a library, it would be hard for any to claim their site had as their principal purpose advertising and marketing the Amazon Site and driving sales of products and services on the Amazon Site.

I think it's a terrible mistake for them. Their marginal cost is zero; they don't need to do this. Data openness was a key factor in Amazon's rise. And that was when there were no other options. With viable other options just emerging -- Open Library, Google, at least -- now is hardly the time to make it less attractive.

Tim

On Mon, May 11, 2009 at 9:40 AM, Jonathan Rochkind rochk...@jhu.edu wrote:

The Amazon products API keeps changing its name, and has just been changed to Amazon Product Advertising API -- it's the one you use to look up books in Amazon and get metadata for them, though.

It looks from an email I got from Amazon that as of August 15th, you'll need to cryptographically sign requests to this API to have them responded to. It looks like kind of a pain. I think a bunch of people on this list may be using this API. Beware.

Instructions for how to cryptographically sign requests the way they want can be found here:

http://docs.amazonwebservices.com/AWSECommerceService/latest/DG/Query_QueryAuth.html
http://docs.amazonwebservices.com/AWSECommerceService/latest/DG/rest-signature.html

Like I said, it's looking like a pain to me. There are lots of details to get right. If you don't URI-escape _exactly_ the same way they do, it's not going to work. Etc.

Jonathan
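On point 3 above, guessing the ASIN from the ISBN: the usual assumption, and it is only an assumption, is that for books the ASIN equals the ISBN-10. A minimal sketch of the conversion from ISBN-13 follows. It only works for the 978 prefix; a 979 ISBN-13 has no ISBN-10 equivalent, which is one way the guess becomes "less true". The function name is made up for illustration.

```python
def isbn13_to_isbn10(isbn13):
    """Convert a 978-prefixed ISBN-13 to ISBN-10, or return None if impossible.

    The ISBN-13 check digit is not validated here; only the ISBN-10
    check digit is recomputed.
    """
    digits = isbn13.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdigit() or not digits.startswith("978"):
        return None  # malformed, or a 979 number with no ISBN-10 form
    core = digits[3:12]  # drop the 978 prefix and the ISBN-13 check digit
    total = sum((10 - i) * int(d) for i, d in enumerate(core))  # weights 10..2
    check = (11 - total % 11) % 11
    return core + ("X" if check == 10 else str(check))


guess = isbn13_to_isbn10("978-0-306-40615-7")
print(guess)  # candidate ASIN, to be verified against Amazon before linking
```

Since the match is "often true, but not guaranteed", the result is best treated as a candidate to check rather than as a confirmed ASIN.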
Re: [CODE4LIB] Amazon product API will require a crypto signature
On Mon, May 11, 2009 at 9:31 AM, Tim Spalding t...@librarything.com wrote: I think it's a terrible mistake for them. Their marginal cost is zero; they don't need to do this. Their marginal cost may be quite low, but I'm fairly sure it's not zero. Cycles, storage, and bandwidth aren't free. Amazon has never struck me as a stupid or capricious company -- witness the fact that they survived the .com bust. They've probably thought rather hard about whether they need to spend developer cycles and client goodwill before making this change. Cheers, -Nate
[CODE4LIB] Job Opening: Digital Technologies Development Librarian, NCSU Libraries
Apologies for any cross-postings. North Carolina State University (NCSU) Libraries is pleased to announce a new position opening for a Digital Technologies Development Librarian. This position is based in Raleigh, NC. The full announcement and more information is located at: http://www.lib.ncsu.edu/jobs/epa/dli2/dliva.html - - - - - - - - - - - - - - - - - - - - - - - - - NORTH CAROLINA STATE UNIVERSITY LIBRARIES The North Carolina State University Libraries, recognized as the first recipient of the Association of College and Research Libraries’ Excellence in Academic Libraries Award, offers a working environment of innovation, teamwork, and continuous interaction with students and faculty to further the educational mission of NC State University. The Libraries invites applications and nominations for the following position: DIGITAL TECHNOLOGIES DEVELOPMENT LIBRARIAN Provides technical leadership and hands-on programming expertise for digital library projects. Identifies emerging technologies that have potential for new and improved library services. Working both independently and in team settings, develops new digital library services through an iterative process that emphasizes performance, sustainability, and usability. Develops tools that support ongoing data analysis of library services and digital library projects. Maintains and provides enhancements to existing digital library applications and collaborates closely with Information Technology staff to develop and maintain supporting technology infrastructure. Qualifications: ALA-accredited MLS or equivalent advanced degree in information science, computer science or related field; two or more years of programming experience in a Unix environment; demonstrated application development experience with one or more open source programming languages; strong SQL and database development skills. Demonstrated ability to plan, document and complete projects is expected. 
Position Number: C-60-0905 Application process and schedule Applications will be reviewed upon receipt; applications will be accepted until finalist candidates are selected. Candidates are encouraged to apply as soon as possible to receive full consideration. The nomination committee may invite candidates for confidential, pre-interview screenings. Appointment requires successful completion of background check. For assistance with this process contact NCSU Libraries Personnel Services Office (919) 515-3522. Affirmative Action/Equal Opportunity Employer
Re: [CODE4LIB] Formats and its identifiers
Ross Singer wrote: Agreed. The same is true, of course, of MARC and, by extension, MARCXML. Part of the format is that it can be one record or multiple. I don't think this a particularly strong argument against using the namespace as an identifier. Actually, the MARC format (not MARCXML) is very much a single-record format. There is a standard for tape headers but no wrapper for MARC (Z39.2) records, since the MARC format doesn't have a way to do that. Having worked for way too long with MARC, I had a lot of trouble with the collection concept in MARCXML and MODS, and am still not sure I see the utility of it beyond what a file of records provides. I'm assuming its main purpose is to provide valid XML when you have a file with more than one bibliographic record. However, it seems that the collection and the records within the collection are part and parcel of the same schema, making the things we think of as records subordinate to the collection, even if it is a collection of one. kc -- --- Karen Coyle / Digital Library Consultant kco...@kcoyle.net http://www.kcoyle.net ph.: 510-540-7596 skype: kcoylenet fx.: 510-848-3913 mo.: 510-435-8234