[CODE4LIB] Archivists' Toolkit: Adding Digital Objects via MySQL
Hi Everyone,

I posted this over on the Archivists' Toolkit listserv and got no response (yet), so I thought I might try here as well.

I have a large quantity (300+) of digital objects that I need to add to Archivists' Toolkit. I think I've figured out what queries I need to run in order to do this in MySQL (rather than the interface), but I wanted to get opinions from the peanut gallery before trying it out on my test instance.

It seems that there are actually two insert queries that need to be used when creating a Digital Object. They are:

insert into ArchDescriptionInstances (instanceType, resourceComponentId, resourceId, parentResourceId, instanceDescriminator, archDescriptionInstancesId) values ('Digital object', 336673, null, 543, 'digital', 22567003)

and...

insert into DigitalObjects (version, lastUpdated, created, lastUpdatedBy, createdBy, title, dateExpression, dateBegin, dateEnd, languageCode, restrictionsApply, eadDaoActuate, eadDaoShow, metsIdentifier, objectType, label, objectOrder, componentId, parentDigitalObjectId, archDescriptionInstancesId, repositoryId) values (0, '2012-04-17 12:05:15', '2012-04-17 12:05:15', 'username', 'username', 'title', '1938-1959', null, null, '', 0, 'onRequest', 'new', '678.1829', 'text', '', 0, '', null, 22567003, 1)

There also appear to be some update queries, but I'm guessing that they are less important (please correct me if I'm wrong).

Has anyone tried to do this in the past? If so, do you have scripts that will create Digital Objects for you that you wouldn't mind sharing? Is there anything you think I should know before testing this out in my test instance of AT? Any caveats for me?

Any help anyone can provide would be greatly appreciated.

Thanks,
Rosalyn
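For a batch of 300+ objects, the two inserts above could be scripted so that each pair shares one archDescriptionInstancesId and the whole batch runs in a single transaction. The following is only a sketch under stated assumptions: it uses Python's built-in sqlite3 as a stand-in for MySQL so that it runs anywhere, the tables are cut down to a handful of the columns from the queries above, and the function name add_digital_objects and the "first id plus offset" numbering are hypothetical, not how AT allocates ids. Against a real AT database you would use a MySQL driver (whose placeholder is usually %s rather than ?) and fill in the remaining columns.

```python
import sqlite3

def add_digital_objects(conn, objects, resource_id, first_instance_id):
    """Insert one ArchDescriptionInstances row and one DigitalObjects row per object.

    Both rows share archDescriptionInstancesId. Everything runs in one
    transaction, so a failure leaves no half-created objects behind.
    """
    with conn:  # commits on success, rolls back on exception
        for offset, obj in enumerate(objects):
            instance_id = first_instance_id + offset  # hypothetical id scheme
            conn.execute(
                "insert into ArchDescriptionInstances "
                "(instanceType, resourceComponentId, parentResourceId, "
                "instanceDescriminator, archDescriptionInstancesId) "
                "values (?, ?, ?, ?, ?)",
                ("Digital object", obj["component_id"], resource_id,
                 "digital", instance_id),
            )
            conn.execute(
                "insert into DigitalObjects "
                "(title, dateExpression, metsIdentifier, objectType, "
                "archDescriptionInstancesId, repositoryId) "
                "values (?, ?, ?, ?, ?, ?)",
                (obj["title"], obj["date"], obj["mets_id"], "text",
                 instance_id, 1),
            )

# Stand-in schema with only the columns used above (an assumption, not AT's DDL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table ArchDescriptionInstances (instanceType, resourceComponentId, "
    "parentResourceId, instanceDescriminator, archDescriptionInstancesId primary key)"
)
conn.execute(
    "create table DigitalObjects (title, dateExpression, metsIdentifier, "
    "objectType, archDescriptionInstancesId, repositoryId)"
)

add_digital_objects(
    conn,
    [{"component_id": 336673, "title": "title",
      "date": "1938-1959", "mets_id": "678.1829"}],
    resource_id=543,
    first_instance_id=22567003,
)
```

Running the batch inside one transaction is the main design choice here: if any row fails, nothing is committed, which is the behavior you probably want on a test instance before touching production data.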
[CODE4LIB] Job: Two-Year Research Fellowship in Digital Curation at University of Colorado at Boulder
Two-Year Research Fellowship in Digital Curation
Journalism and Mass Communication
University of Colorado at Boulder

We are seeking to hire a research fellow with a degree in Library and/or Information Science, or an arts, humanities or social science discipline in which the candidate has acquired significant research and practical expertise in the area of digital curation. The ideal candidate should provide evidence of past practical experience in digital curation and possess a clear research and/or creative work agenda in which digital curation is the central activity.

We seek to hire someone who has earned a graduate degree within the past three years that emphasizes digital archiving, preservation and curation. A Ph.D. is preferred, but strong candidates with M.A. or M.S. degrees also will be considered.

During his/her tenure as a digital curation fellow, the person hired will:

1) curate an original or collaborative project on campus;
2) make occasional campus presentations about the subject of digital curation and its value across disciplines;
3) conduct a graduate seminar, open to graduate students from multiple disciplines, surveying research and best practices within the field of digital curation; and
4) advise faculty and administrators on the development of curriculum in the field of digital curation.

The person hired would provide outreach to various constituencies on campus, particularly visual artists, musicians, journalists, filmmakers, librarians, museum curators and archivists who seek to acquire curation skills for creating digital archives of primary data (image, sound, text), and for accessing, analyzing, and presenting such data.

The salary would be US $50,000 per year for a two-year contract. Full faculty benefits would be provided for the period of the contract.

Screening of applications will begin May 1, 2012 and will continue until the position is filled. For guidelines on applying, go to: www.jobsatcu.com. The job posting number is: 817241

The University of Colorado Boulder is an equal opportunity employer committed to diversity and equality in education and employment.

Brought to you by code4lib jobs: http://jobs.code4lib.org/job/896/
Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21
It has to mean UTF-8. ISO 2709 is very byte-oriented, from the directory structure to the byte-offsets in the fixed fields. The values in these places all assume 8-bit character data, it's completely baked in to the file format. -Tod On Apr 17, 2012, at 6:55 PM, Jonathan Rochkind wrote: Okay, forget XML for a moment, let's just look at marc 'binary'. First, for Anglophone-centric MARC21. The LC docs don't actually say quite what I thought about leader byte 09, used to advertise encoding: a - UCS/Unicode Character coding in the record makes use of characters from the Universal Coded Character Set (UCS) (ISO 10646), or Unicode™, an industry subset. That doesn't say UTF-8. It says UCS or Unicode. What does that actually mean? Does it mean UTF-8, or does it mean UTF-16 (closer to what used to be called UCS I think?). Whatever it actually means, do people violate it in the wild? Now we get to non-Anglophone centric marc. I think all of which is ISO_2709? A standard which of course is not open access, so I can't get it to see what it says. But leader 09 being used for encoding -- is that Marc21 specific, or is it true of any ISO-2709? Marc8 and unicode being the only valid encodings can't be true of any ISO-2709, right? Is there a generic ISO-2709 way to deal with this, or not so much?
Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21
We cried our eyes out in 1976 when this first came to our attention at the BL. Even more crying when we couldn't get rid of it in the MARC-I to MARC-II conversion (well before MARC21 was even a twinkle) - a lot of tears are gathering somewhere.

Peter

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Bill Dueber
Sent: Tuesday, April 17, 2012 5:50 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

On Tue, Apr 17, 2012 at 8:46 PM, Simon Spero sesunc...@gmail.com wrote:

Actually Anglo and Francophone centric. And the USMARC style 245 was a poor replacement for the UKMARC approach (someone at the British Library hosted Linked Data meeting wondered why there were punctuation characters included in the data in the title field. The catalogers wept slightly).

Simon

Slightly? I cry my eyes out *every single day* about that. Well, every weekday, anyway.

-- Bill Dueber
Library Systems Programmer
University of Michigan Library
Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21
On 4/18/2012 6:04 AM, Tod Olson wrote:

It has to mean UTF-8. ISO 2709 is very byte-oriented, from the directory structure to the byte-offsets in the fixed fields. The values in these places all assume 8-bit character data, it's completely baked in to the file format.

I'm not sure that follows. One could certainly have UTF-16 in a MARC record, and still count bytes to get a directory structure and byte offsets. (In some ways it'd be easier since every char would be two bytes.)

In fact, I worry that the standard may pre-date UTF-8, with its reference to UCS --- if I understand things right, at one point there was only one Unicode encoding, called UCS, which is basically a backwards-compatible subset of what became UTF-16. So I worry the standard really means UCS/UTF-16.

But if in fact records in the wild with the 'a' value are far more likely to be UTF-8... well, it's certainly not the first time the MARC21 standard was useless/ignored as a standard in answering such questions.
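To put numbers on the byte-counting point, here is a small sketch in Python (the helper name byte_lengths is made up for illustration). It shows that either encoding can be counted in bytes, and also that UTF-16 is not strictly two bytes per character: anything outside the Basic Multilingual Plane takes four.

```python
def byte_lengths(text):
    """Return (characters, UTF-8 bytes, UTF-16 bytes) for a string."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-be")),  # big-endian, no byte-order mark
    )

print(byte_lengths("Señor"))        # (5, 6, 10)
print(byte_lengths("\U0001D11E"))   # musical G clef: (1, 4, 4)
```

A directory that stores byte lengths works for both encodings; a directory that stored character counts would be ambiguous for either one.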
Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21
Hi Tod,

I'm not understanding how UTF-8 would be considered 8-bit character data (other than the ASCII-range of the Unicode repertoire, natch). I don't think ISO 2709 knows from characters, only bytes.

-- Michael
# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Tod Olson
Sent: Wednesday, April 18, 2012 5:04 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

It has to mean UTF-8. ISO 2709 is very byte-oriented, from the directory structure to the byte-offsets in the fixed fields. The values in these places all assume 8-bit character data, it's completely baked in to the file format.

-Tod
Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21
In fact, I worry that the standard may pre-date UTF-8, with its reference to UCS --- if I understand things right, at one point there was only one Unicode encoding, called UCS, which is basically a backwards-compatible subset of what became UTF-16. So I worry the standard really means UCS/UTF-16.

Now you're just trying to scare yourself. I've never seen UTF-16 MarcXML. I've never seen anything but UTF-8 encoded MarcXML.

Ralph
Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21
UTF-8 was the MARC standard from the beginning: http://www.loc.gov/marc/marbi/1998/98-18.html

The first proposals were a character mapping between Unicode and MARC-8 and didn't mention the character encodings, thus the term UCS, which was a common term for Unicode at that time (see: http://www.loc.gov/marc/marbi/1996/96-10.html). But when it got down to brass tacks, it was UTF-8, and left open the possibility of UTF-16 (which was still a viable rival to UTF-8 at the time, as I recall). UTF-16 had the advantage of every character being of uniform length, but it also did not cover all of the characters of interest to libraries.

The decision was also made to use byte count rather than character count in the directory. This was influenced by the UTF-8 decision.

kc

On 4/18/12 7:04 AM, Jonathan Rochkind wrote:

On 4/18/2012 6:04 AM, Tod Olson wrote:

It has to mean UTF-8. ISO 2709 is very byte-oriented, from the directory structure to the byte-offsets in the fixed fields. The values in these places all assume 8-bit character data, it's completely baked in to the file format.

I'm not sure that follows. One could certainly have UTF-16 in a MARC record, and still count bytes to get a directory structure and byte offsets. (In some ways it'd be easier since every char would be two bytes.)

In fact, I worry that the standard may pre-date UTF-8, with its reference to UCS --- if I understand things right, at one point there was only one Unicode encoding, called UCS, which is basically a backwards-compatible subset of what became UTF-16. So I worry the standard really means UCS/UTF-16.

But if in fact records in the wild with the 'a' value are far more likely to be UTF-8... well, it's certainly not the first time the MARC21 standard was useless/ignored as a standard in answering such questions.

-- Karen Coyle
kco...@kcoyle.net
http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet
Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21
I could be mistaken (never having had the pleasure of reading it), but isn't ISO-2709 specified as a fixed number of characters, and any conflation of characters and 8-bit bytes is on the part of users and implementations? I think ISO 2709 might not know from bytes, only characters.

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Doran, Michael D
Sent: Wednesday, April 18, 2012 10:05 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

Hi Tod,

I'm not understanding how UTF-8 would be considered 8-bit character data (other than the ASCII-range of the Unicode repertoire, natch). I don't think ISO 2709 knows from characters, only bytes.

-- Michael
# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/
Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21
I could be mistaken (never having had the pleasure of reading it), but isn't ISO-2709 specified as a fixed number of characters, and any conflation of characters and 8-bit bytes is on the part of users and implementations?

I don't believe that is the case. Take UTF-8 out of the picture, and consider the MARC-8 character set with its escape sequences and combining characters. A character such as an n with a tilde would consist of two bytes. The Greek small letter alpha, if invoked in accordance with ANSI X3.41, would consist of five bytes (two bytes for the initial escape sequence, a byte for the character, and then two bytes for the escape sequence returning to the default character set).

-- Michael
# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Huwig,Steve
Sent: Wednesday, April 18, 2012 9:21 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

I could be mistaken (never having had the pleasure of reading it), but isn't ISO-2709 specified as a fixed number of characters, and any conflation of characters and 8-bit bytes is on the part of users and implementations? I think ISO 2709 might not know from bytes, only characters.
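The two MARC-8 byte counts Michael gives can be spelled out as raw bytes. This is a sketch, not an authoritative table: the specific byte values (0xE4 for the combining tilde, ESC g for the Greek symbols set, ESC s for the return to ASCII, and 0x61 for alpha within that set) are taken from my reading of the LC MARC-8 code tables and should be checked against them before being relied on.

```python
ESC = b"\x1b"

# MARC-8 puts the combining tilde (0xE4) BEFORE the base letter it modifies.
n_tilde = b"\xe4" + b"n"

# ESC g designates the Greek symbols set, 0x61 is alpha within that set,
# and ESC s switches back to the default (ASCII) set.
alpha = ESC + b"g" + b"\x61" + ESC + b"s"

print(len(n_tilde), len(alpha))  # 2 5
```

In both cases one character, as a cataloger would see it, occupies more than one byte, so a length measured in bytes and a length measured in characters disagree even without Unicode in the picture.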
Re: [CODE4LIB] Archivists' Toolkit: Adding Digital Objects via MySQL
Rosalyn,

I've written a number of scripts of this nature. Here's a quick one I wrote recently to add DAOs to our AT for an audio digitization project (note it does not include file versions, just Components, Instances and DAOs). It starts at the ResourceComponent identified by the long at the top of the script. The resourceId is also hard-coded in a number of places. I've got some tidier Java that runs as part of an automated process for a large digitization project, but all the basic inserts are in this:

https://github.com/yalemssa/ATK_DAO_Scripts/blob/master/components_atk.groovy

Don Mennerich
donald.menner...@yale.edu

From: Rosalyn Metz rosalynm...@gmail.com
Date: Wed, Apr 18, 2012 at 9:23 AM
Subject: [CODE4LIB] Archivists' Toolkit: Adding Digital Objects via MySQL
To: CODE4LIB@listserv.nd.edu

Hi Everyone,

I posted this over on the Archivists' Toolkit listserv and got no response (yet), so I thought I might try here as well.

I have a large quantity (300+) of digital objects that I need to add to Archivists' Toolkit. I think I've figured out what queries I need to run in order to do this in MySQL (rather than the interface), but I wanted to get opinions from the peanut gallery before trying it out on my test instance.

It seems that there are actually two insert queries that need to be used when creating a Digital Object. They are:

insert into ArchDescriptionInstances (instanceType, resourceComponentId, resourceId, parentResourceId, instanceDescriminator, archDescriptionInstancesId) values ('Digital object', 336673, null, 543, 'digital', 22567003)

and...

insert into DigitalObjects (version, lastUpdated, created, lastUpdatedBy, createdBy, title, dateExpression, dateBegin, dateEnd, languageCode, restrictionsApply, eadDaoActuate, eadDaoShow, metsIdentifier, objectType, label, objectOrder, componentId, parentDigitalObjectId, archDescriptionInstancesId, repositoryId) values (0, '2012-04-17 12:05:15', '2012-04-17 12:05:15', 'username', 'username', 'title', '1938-1959', null, null, '', 0, 'onRequest', 'new', '678.1829', 'text', '', 0, '', null, 22567003, 1)

There also appear to be some update queries, but I'm guessing that they are less important (please correct me if I'm wrong).

Has anyone tried to do this in the past? If so, do you have scripts that will create Digital Objects for you that you wouldn't mind sharing? Is there anything you think I should know before testing this out in my test instance of AT? Any caveats for me?

Any help anyone can provide would be greatly appreciated.

Thanks,
Rosalyn
Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21
I don't know about ISO 2709 itself, but the MARC21 implementation of it refers to octets, aka 8-bit bytes: http://www.loc.gov/marc/specifications/specrecstruc.html Characters may be encoded using one or more than one octet, depending on the character set. All ASCII characters are encoded using one octet in the ASCII encoding and the Unicode UTF-8 encoding, thus a character is equivalent in length to an octet when an element's values are restricted to ASCII. --Andy On Wed, Apr 18, 2012 at 7:20 AM, Huwig,Steve huw...@oclc.org wrote: I could be mistaken (never having had the pleasure of reading it), but isn't ISO-2709 specified as a fixed number of characters, and any conflation of characters and 8-bit bytes is on the part of users and implementations? I think ISO 2709 might not know from bytes, only characters.
Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21
-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jonathan Rochkind
Sent: Tuesday, April 17, 2012 19:55
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

Okay, forget XML for a moment, let's just look at marc 'binary'. First, for Anglophone-centric MARC21. The LC docs don't actually say quite what I thought about leader byte 09, used to advertise encoding:

a - UCS/Unicode
Character coding in the record makes use of characters from the Universal Coded Character Set (UCS) (ISO 10646), or Unicode™, an industry subset.

That doesn't say UTF-8. It says UCS or Unicode. What does that actually mean? Does it mean UTF-8, or does it mean UTF-16 (closer to what used to be called UCS I think?). Whatever it actually means, do people violate it in the wild?

First, UCS and Unicode basically mean the same thing. Second, UTF-8, UTF-16, and UTF-32 are encoding forms for UCS/Unicode. The MARC documentation does actually say MARC binary records *must* be encoded as UTF-8 when LDR/09 has the value 'a'. You need to refer to the appropriate standards for this information and these definitions:

http://www.loc.gov/marc/specifications/speccharucs.html#implementation

Unicode specifies three encoding forms, of which only one, UTF-8 (UCS Transformation Format 8), is authorized for use in MARC 21 records.

http://www.unicode.org/glossary/#UCS

UCS. Acronym for Universal Character Set, which is specified by International Standard ISO/IEC 10646, which is equivalent in repertoire to the Unicode Standard.

http://www.unicode.org/glossary/#unicode_encoding_form

Unicode Encoding Form. A character encoding form that assigns each Unicode scalar value to a unique code unit sequence. The Unicode Standard defines three Unicode encoding forms: UTF-8, UTF-16, and UTF-32. (See definition D79 in Section 3.9, Unicode Encoding Forms.)

http://www.unicode.org/glossary/#UTF_8

UTF-8. A multibyte encoding for text that represents each Unicode character with 1 to 4 bytes, and which is backward-compatible with ASCII. UTF-8 is the predominant form of Unicode in web pages. More technically: (1) The UTF-8 encoding form. (2) The UTF-8 encoding scheme. (3) “UCS Transformation Format 8,” defined in Annex D of ISO/IEC 10646:2003, technically equivalent to the definitions in the Unicode Standard.

http://www.unicode.org/glossary/#UTF_16

UTF-16. A multibyte encoding for text that represents each Unicode character with 2 or 4 bytes; it is not backward-compatible with ASCII. It is the internal form of Unicode in many programming languages, such as Java, C#, and JavaScript, and in many operating systems. More technically: (1) The UTF-16 encoding form. (2) The UTF-16 encoding scheme. (3) “Transformation format for 16 planes of Group 00,” defined in Annex C of ISO/IEC 10646:2003; technically equivalent to the definitions in the Unicode Standard.

Andy
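In code, the rule Andy cites comes down to looking at one byte of the leader. A minimal sketch in Python, assuming a raw MARC 21 record held as bytes; the function name declared_encoding is made up for illustration, and "marc-8" is just a label here, since Python ships no MARC-8 codec.

```python
def declared_encoding(record: bytes) -> str:
    """Return the character coding scheme a MARC 21 record declares in LDR/09."""
    if len(record) < 24:
        raise ValueError("a MARC 21 leader is 24 bytes long")
    flag = record[9:10]          # leader position 09, zero-based
    if flag == b"a":
        return "utf-8"           # 'a' = UCS/Unicode, which MARC 21 restricts to UTF-8
    if flag == b" ":
        return "marc-8"          # blank = MARC-8
    raise ValueError(f"unexpected LDR/09 value: {flag!r}")

print(declared_encoding(b"00714cam a2200205 a 4500"))  # utf-8
print(declared_encoding(b"00714cam  2200205 a 4500"))  # marc-8
```

This reports only what the record claims. Whether records in the wild honor the flag is a separate question, so a real loader would probably still want to catch decoding errors rather than trust it blindly.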
Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21
ISO 2709 doesn't care how many bytes your characters are. The directory and offsets and other things count bytes, not characters.

That was exactly my point. (Which I am stating since you quoted me and I couldn't tell if you were refuting my point, or using it to support your conclusion.) ;-)

-- Michael

-----Original Message-----
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Wednesday, April 18, 2012 11:09 AM
To: Code for Libraries
Cc: Doran, Michael D
Subject: Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

On 4/18/2012 11:09 AM, Doran, Michael D wrote:

I don't believe that is the case. Take UTF-8 out of the picture, and consider the MARC-8 character set with its escape sequences and combining characters. A character such as an n with a tilde would consist of two bytes. The Greek small letter alpha, if invoked in accordance with ANSI X3.41, would consist of five bytes (two bytes for the initial escape sequence, a byte for the character, and then two bytes for the escape sequence returning to the default character set).

ISO 2709 doesn't care how many bytes your characters are. The directory and offsets and other things count bytes, not characters. (Which was, in my opinion, the _right_ decision, for once with MARC!)

How bytes translate into characters is not a concern of ISO 2709. The majority of non-7-bit-ASCII encodings will have chars that are more than one byte, either sometimes or always. This is true of MARC8 (some chars), UTF8 (some chars), and UTF16 (all chars). (It is not true of Latin-1 though, for instance, I don't think.)

ISO 2709 doesn't care what char encodings you use, and there's no standard ISO 2709 way to determine what char encodings are used for _data_ in the MARC record. ISO 2709 does say that _structural_ elements like field names, subfield names, the directory itself, separator chars, etc., all need to be (essentially, over-simplifying) 7-bit ASCII. The actual data itself is application dependent; 2709 doesn't care, and 2709 doesn't give any standard cross-2709 way to determine it.

That is my conclusion at the moment, helped by all of you all in this thread, thanks!
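The conclusion above can be demonstrated with a toy record: the directory is sliced purely by byte offsets, and the data is only decoded afterwards, with whatever encoding the application decides on. This sketch assumes the MARC 21 layout (24-byte leader, 12-byte directory entries of tag, 4-digit length and 5-digit start, 0x1E field terminator, 0x1D record terminator); other ISO 2709 formats may size the entries differently. The helpers build_record and read_fields are hypothetical names, and the fields omit indicators and subfield delimiters to keep the example short.

```python
FT, RT = b"\x1e", b"\x1d"  # field terminator, record terminator

def build_record(fields):
    """Assemble a minimal MARC 21-style record from (tag, bytes) pairs."""
    directory, data = b"", b""
    for tag, value in fields:
        body = value + FT
        # Length and start offset are counted in bytes, never characters.
        directory += b"%3s%04d%05d" % (tag, len(body), len(data))
        data += body
    base = 24 + len(directory) + 1   # leader + directory + its terminator
    total = base + len(data) + 1     # plus the record terminator
    leader = b"%05dnam a22%05d   4500" % (total, base)
    return leader + directory + FT + data + RT

def read_fields(record):
    """Yield (tag, raw bytes) for each field, slicing by byte offsets only."""
    base = int(record[12:17])            # base address of data, leader 12-16
    directory = record[24:base - 1]      # drop the directory's terminator
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        length, start = int(entry[3:7]), int(entry[7:12])
        raw = record[base + start:base + start + length - 1]  # strip FT
        yield entry[:3].decode("ascii"), raw

title = "Señor".encode("utf-8")          # 5 characters, 6 bytes
record = build_record([(b"245", title), (b"500", b"note")])
for tag, raw in read_fields(record):
    print(tag, len(raw), raw.decode("utf-8"))
```

Note that read_fields never needs to know the encoding. Swapping the title for UTF-16 or MARC-8 bytes would change the lengths in the directory but not a line of the parsing code, which is the sense in which ISO 2709 "doesn't care".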
[CODE4LIB] Islandora Camp 2012 Registration Public Brainstorm/Call for Proposals
* Apologies for cross-posting * We're excited to invite you all to the third annual Islandora Camp (Aug 1-3, 2012). Islandora Camp welcomes developers, administrators, and users of Islandora to meet, learn, and grow the ecosystem! Registration for Islandora Camp is now open, and is available via the following link: http://islandora.ca/node/add/islandora-camp-registration Registration is $350, and includes a banquet dinner at Stanhope (http://www.stanhopebeachresort.com) on August 2nd. The agenda is still pending our call for proposals (see below). However, we expect a similar structure to last year, with concurrent sessions running all three days appropriate to both beginners and advanced Islandorians. Public Brainstorm and Call for Proposals We've created a Google Moderator stream for Islandora Camp here: http://www.google.com/moderator/#16/e=1fe634. You can view all of the presentation ideas and vote on your favourites! You can also suggest your own ideas for posters, presentations, papers, user groups, and workshops - just indicate whether you're volunteering to present or just interested in attending a session on a particular topic. Please get your suggestions into the system by the end of May to make sure they're considered for the conference schedule. Mark your calendars: The Red Island Repository Institute will be back in 2012. Tentative dates are September 24-28, 2012. We will post more information as it becomes available. -- David Wilcox, BA, MLIS Islandora Training/Support Coordinator Robertson Library University of Prince Edward Island dwil...@upei.ca Skype Name: david.wilcox82 902.620.5167
Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21
In practice it seems to mean UTF-8. At least I've only seen UTF-8, and I can't imagine the code that processes this stuff being safe for UTF-16 or UTF-32. All of the offsets are byte-oriented, and there's too much legacy code that makes assumptions about null-terminated strings.

-Tod

On Apr 17, 2012, at 6:55 PM, Jonathan Rochkind wrote:

Okay, forget XML for a moment, let's just look at marc 'binary'. First, for Anglophone-centric MARC21. The LC docs don't actually say quite what I thought about leader byte 09, used to advertise encoding:

a - UCS/Unicode
Character coding in the record makes use of characters from the Universal Coded Character Set (UCS) (ISO 10646), or Unicode™, an industry subset.

That doesn't say UTF-8. It says UCS or Unicode. What does that actually mean? Does it mean UTF-8, or does it mean UTF-16 (closer to what used to be called UCS I think?). Whatever it actually means, do people violate it in the wild?

Now we get to non-Anglophone centric marc. I think all of which is ISO_2709? A standard which of course is not open access, so I can't get it to see what it says. But leader 09 being used for encoding -- is that Marc21 specific, or is it true of any ISO-2709? Marc8 and unicode being the only valid encodings can't be true of any ISO-2709, right? Is there a generic ISO-2709 way to deal with this, or not so much?
[CODE4LIB] Job: Records Management Archivist at Johns Hopkins University
The Johns Hopkins University Sheridan Libraries is hiring a Records Management Archivist to work with the University Archivist to develop an innovative approach to records management with the purpose of improving our stewardship of a university history that exists in print, digitized, and born-digital form and can come from within and outside the boundaries of official university activity. Essential skills and knowledge areas include: strong technological literacy and curiosity; deep understanding of born-digital archives and the emerging tools and techniques used to manage born-digital archives; experience with traditional archives functions and processes; comprehension of holistic approaches to information management that account for recorded memory that takes a variety of forms, including analog, digitized, and born-digital; excellent interpersonal and communication skills needed when interviewing a diverse community of records creators for the purpose of discerning information management behavior and its impact on the documentation of university memory; excellent writing and information visualization skills needed to create records retention schedules, functional requirements and other documentation, and information models; the creativity, entrepreneurial spirit, and critical thinking competencies needed to play a crucial role in redefining institutional records management for an increasingly born-digital world in which important analog traces will endure. Questions about the position should be addressed to jsteele at jhu dot edu (no phone calls, please). For a more prescribed list of duties and additional qualifications and to apply to this unique opportunity, please visit https://hrnt.jhu.edu/jhujobs/job_view.cfm?view_req_id=52166. Brought to you by code4lib jobs: http://jobs.code4lib.org/job/895/