[CODE4LIB] Archivists' Toolkit: Adding Digital Objects via MySQL

2012-04-18 Thread Rosalyn Metz
Hi Everyone,

I posted this over on the Archivists' Toolkit listserv and got no response
(yet), so I thought I might try here as well.

I have a large quantity (around 300+) of digital objects that I need to add
to Archivists' Toolkit.  I think I've figured out what queries I need to
run in order to do this in MySQL (rather than the interface) but I wanted
to get opinions from the peanut gallery before trying it out on my test
instance.

It seems that there are actually two insert queries that need to be run
when creating a Digital Object.  They are:

insert into ArchDescriptionInstances
(instanceType, resourceComponentId, resourceId, parentResourceId,
instanceDescriminator, archDescriptionInstancesId)
values
('Digital object', 336673, null, 543, 'digital', 22567003)


and...

insert into DigitalObjects
(version, lastUpdated, created, lastUpdatedBy, createdBy, title,
dateExpression, dateBegin, dateEnd, languageCode, restrictionsApply,
eadDaoActuate, eadDaoShow, metsIdentifier, objectType, label, objectOrder,
componentId, parentDigitalObjectId, archDescriptionInstancesId,
repositoryId)
values
(0, '2012-04-17 12:05:15', '2012-04-17 12:05:15', 'username', 'username',
'title', '1938-1959', null, null, '', 0, 'onRequest', 'new', '678.1829',
'text', '', 0, '', null, 22567003, 1)
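
A minimal sketch of how the two inserts above could be generated for many
objects from Python, assuming a DB-API driver that uses %s placeholders (as
the common MySQL drivers do). The function name build_statements, the
dictionary keys, and the idea that the caller supplies
archDescriptionInstancesId are assumptions made for illustration, not
something Archivists' Toolkit documents. The sketch only builds statements
and does not touch a database.

```python
from datetime import datetime

INSTANCE_SQL = (
    "INSERT INTO ArchDescriptionInstances "
    "(instanceType, resourceComponentId, resourceId, parentResourceId, "
    "instanceDescriminator, archDescriptionInstancesId) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)

# Column order copied from the DigitalObjects insert in the message above.
DIGITAL_OBJECT_COLUMNS = [
    "version", "lastUpdated", "created", "lastUpdatedBy", "createdBy",
    "title", "dateExpression", "dateBegin", "dateEnd", "languageCode",
    "restrictionsApply", "eadDaoActuate", "eadDaoShow", "metsIdentifier",
    "objectType", "label", "objectOrder", "componentId",
    "parentDigitalObjectId", "archDescriptionInstancesId", "repositoryId",
]
DIGITAL_OBJECT_SQL = "INSERT INTO DigitalObjects ({}) VALUES ({})".format(
    ", ".join(DIGITAL_OBJECT_COLUMNS),
    ", ".join(["%s"] * len(DIGITAL_OBJECT_COLUMNS)),
)


def build_statements(obj, instance_id, username, now):
    """Return the two (sql, params) pairs for one digital object.

    Both statements share instance_id, which is what links the
    DigitalObjects row to its ArchDescriptionInstances row.
    """
    stamp = now.strftime("%Y-%m-%d %H:%M:%S")
    instance_params = (
        "Digital object", obj["resourceComponentId"], None,
        obj["parentResourceId"], "digital", instance_id,
    )
    object_params = (
        0, stamp, stamp, username, username, obj["title"],
        obj["dateExpression"], None, None, "", 0, "onRequest", "new",
        obj["metsIdentifier"], "text", "", 0, "", None, instance_id, 1,
    )
    return [(INSTANCE_SQL, instance_params),
            (DIGITAL_OBJECT_SQL, object_params)]


# Hypothetical input row, using the values from the message above.
example = {
    "resourceComponentId": 336673,
    "parentResourceId": 543,
    "title": "title",
    "dateExpression": "1938-1959",
    "metsIdentifier": "678.1829",
}
statements = build_statements(
    example, 22567003, "username", datetime(2012, 4, 17, 12, 5, 15)
)
```

Each pair would presumably be executed inside one transaction on the test
instance, so that a failure cannot leave an instance row without its
DigitalObjects row.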


There also appear to be some update queries, but I'm guessing that
they are less important (please correct me if I'm wrong).  Has anyone tried
to do this in the past? If so do you have scripts that will create Digital
Objects for you that you wouldn't mind sharing?  Is there anything you
think I should know before testing this out in my test instance of AT?  Any
caveats for me?

Any help anyone can provide would be greatly appreciated.

Thanks,
Rosalyn


[CODE4LIB] Job: Two-Year Research Fellowship in Digital Curation at University of Colorado at Boulder

2012-04-18 Thread jobs
Two-Year Research Fellowship in Digital Curation

Journalism and Mass Communication

University of Colorado at Boulder

  
We are seeking to hire a research fellow with a degree in Library and/or
Information Science, or an arts, humanities or social science discipline in
which the candidate has acquired significant research and practical expertise
in the area of digital curation. The ideal candidate should provide evidence
of past practical experience in digital curation and possess a clear research
and/or creative work agenda in which digital curation is the central activity.
We seek to hire someone who has earned a graduate degree within the past three
years that emphasizes digital archiving, preservation and curation. A Ph.D. is
preferred, but strong candidates with M.A. or M.S. degrees also will be
considered.

  
During his/her tenure as a digital curation fellow, the person hired will: (1)
curate an original or collaborative project on campus; (2) make occasional
campus presentations about the subject of digital curation and its value
across disciplines; (3) conduct a graduate seminar, open to
graduate students from multiple disciplines, surveying research and best
practices within the field of digital curation; and (4) advise faculty and
administrators on the development of curriculum in the field of digital
curation.

  
The person hired would provide outreach to various constituencies on campus,
particularly visual artists, musicians, journalists, filmmakers, librarians,
museum curators and archivists who seek to acquire curation skills for
creating digital archives of primary data (image, sound, text), and for
accessing, analyzing, and presenting such data.

  
The salary would be US $50,000 per year for a two-year contract. Full faculty
benefits would be provided for the period of the contract.

  
Screening of applications will begin May 1, 2012 and will continue until the
position is filled. For guidelines on applying, go to: www.jobsatcu.com. The
job posting number is: 817241

  
The University of Colorado Boulder is an equal opportunity employer committed
to diversity and equality in education and employment.



Brought to you by code4lib jobs: http://jobs.code4lib.org/job/896/


Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Tod Olson
It has to mean UTF-8. ISO 2709 is very byte-oriented, from the directory 
structure to the byte-offsets in the fixed fields. The values in these places 
all assume 8-bit character data; it's completely baked into the file format.

-Tod

On Apr 17, 2012, at 6:55 PM, Jonathan Rochkind wrote:

 Okay, forget XML for a moment, let's just look at marc 'binary'.
 
 First, for Anglophone-centric MARC21.
 
 The LC docs don't actually say quite what I thought about leader byte 09, 
 used to advertise encoding:
 
 
 a - UCS/Unicode
 Character coding in the record makes use of characters from the Universal 
 Coded Character Set (UCS) (ISO 10646), or Unicode™, an industry subset.
 
 
 
 That doesn't say UTF-8. It says UCS or Unicode. What does that actually 
 mean?  Does it mean UTF-8, or does it mean UTF-16 (closer to what used to be 
 called UCS I think?).  Whatever it actually means, do people violate it in 
 the wild?
 
 
 
 Now we get to non-Anglophone-centric MARC, all of which I think is ISO 2709?  
 A standard which of course is not open access, so I can't get it to see what 
 it says.
 
 But leader 09 being used for encoding -- is that Marc21 specific, or is it 
 true of any ISO-2709?  Marc8 and unicode being the only valid encodings 
 can't be true of any ISO-2709, right?
 
 Is there a generic ISO-2709 way to deal with this, or not so much?


Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Peter Noerr
We cried our eyes out in 1976 when this first came to our attention at the BL. 
Even more crying when we couldn't get rid of it in the MARC-I to MARC-II 
conversion (well before MARC21 was even a twinkle) - a lot of tears are 
gathering somewhere.

Peter



 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Bill 
 Dueber
 Sent: Tuesday, April 17, 2012 5:50 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 
 and MARC21
 
 On Tue, Apr 17, 2012 at 8:46 PM, Simon Spero sesunc...@gmail.com wrote:
 
  Actually Anglo and Francophone centric. And the USMARC style 245 was a
  poor replacement for the UKMARC approach (someone at the British
  Library hosted Linked Data meeting wondered why there were punctuation
  characters included in the data in the title field. The catalogers wept 
  slightly).
 
  Simon
 
 
 
 Slightly? I cry my eyes out *every single day* about that. Well, every 
 weekday, anyway.
 
 
 --
 Bill Dueber
 Library Systems Programmer
 University of Michigan Library


Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Jonathan Rochkind

On 4/18/2012 6:04 AM, Tod Olson wrote:

It has to mean UTF-8. ISO 2709 is very byte-oriented, from the directory 
structure to the byte-offsets in the fixed fields. The values in these places 
all assume 8-bit character data, it's completely baked in to the file format.


I'm not sure that follows. One could certainly have UTF-16 in a Marc 
record, and still count bytes to get a directory structure and byte 
offsets. (In some ways it'd be easier since every char would be two bytes).


In fact, I worry that the standard may pre-date UTF-8, with its 
reference to UCS ---  if I understand things right, at one point there 
was only one unicode encoding, called UCS, which is basically a 
backwards-compatible subset of what became UTF-16.


So I worry the standard really means UCS/UTF-16.

But if in fact records in the wild with the 'a' value are far more 
likely to be UTF-8... well it's certainly not the first time the MARC21 
standard was useless/ignored as a standard in answering such questions.


Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Doran, Michael D
Hi Tod,

I'm not understanding how UTF-8 would be considered 8-bit character data (other 
than the ASCII-range of the Unicode repertoire, natch).  I don't think ISO 2709 
knows from characters, only bytes.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/




Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread LeVan,Ralph
 In fact, I worry that the standard may pre-date UTF-8, with its
 reference to UCS ---  if I understand things right, at one point there
 was only one unicode encoding, called UCS, which is basically a
 backwards-compatible subset of what became UTF-16.

 So I worry the standard really means UCS/UTF-16.

Now you're just trying to scare yourself.  I've never seen UTF-16
MarcXML.  I've never seen anything but UTF-8 encoded MarcXML.

Ralph


Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Karen Coyle

UTF-8 was the marc standard from the beginning:

http://www.loc.gov/marc/marbi/1998/98-18.html

The first proposals were a character mapping between Unicode and MARC-8 
and didn't mention the character encodings, thus the term UCS which 
was a common term for Unicode at that time. (see: 
http://www.loc.gov/marc/marbi/1996/96-10.html). But when it got down to 
brass tacks, it was UTF-8, and left open the possibility of UTF-16 
(which was still a viable rival to UTF-8 at the time, as I recall.) 
UTF-16 had the advantage of every character being of uniform length, but 
it also did not cover all of the characters of interest to libraries.


The decision was also made to use byte count rather than character count 
in the directory. This was influenced by the UTF-8 decision.
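
A small sketch of what counting bytes rather than characters means for one
directory entry. The 12-byte entry layout (3-character tag, 4-digit field
length, 5-digit starting position) follows the MARC 21 record structure; the
function name directory_entry and the sample title are made up for
illustration.

```python
FIELD_TERMINATOR = b"\x1e"


def directory_entry(tag, field_text, start):
    """Return (12-byte directory entry, encoded field) for UTF-8 data.

    The length recorded in the entry is the number of bytes in the
    encoded field, including the field terminator.
    """
    encoded = field_text.encode("utf-8") + FIELD_TERMINATOR
    entry = "{}{:04d}{:05d}".format(tag, len(encoded), start)
    return entry.encode("ascii"), encoded


# Two blank indicators, subfield delimiter, subfield code "a", then data.
title = "  \x1faPe\u00f1a"
entry, field = directory_entry("245", title, 0)

chars_with_terminator = len(title) + 1   # 9 characters
bytes_with_terminator = len(field)       # 10 bytes: the n-tilde takes two

print(entry, chars_with_terminator, bytes_with_terminator)
```

With a character count the entry would have recorded 9; with a byte count it
records 10, which is what lets a reader slice the field out without decoding
anything first.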


kc


--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet


Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Huwig,Steve
I could be mistaken (never having had the pleasure of reading it), but
isn't ISO-2709 specified as a fixed number of characters, and any
conflation of characters and 8-bit bytes is on the part of users and
implementations?

I think ISO 2709 might not know from bytes, only characters. 



Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Doran, Michael D
 I could be mistaken (never having had the pleasure of reading it), but
 isn't ISO-2709 specified as a fixed number of characters, and any
 conflation of characters and 8-bit bytes is on the part of users and
 implementations?

I don't believe that is the case.  Take UTF-8 out of the picture, and consider 
the MARC-8 character set with its escape sequences and combining characters.  A 
character such as an n with a tilde would consist of two bytes.  The Greek 
small letter alpha, if invoked in accordance with ANSI X3.41, would consist of 
five bytes (two bytes for the initial escape sequence, a byte for the 
character, and then two bytes for the escape sequence returning to the default 
character set).
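
The byte counts above can be checked with a few lines of Python. This is a
sketch under assumptions: Python's standard library has no MARC-8 codec, so
the MARC-8 byte strings are built by hand, and the specific values (0xE4 for
the combining tilde, ESC g and ESC s to switch to and from the Greek symbols
set, 0x61 for alpha within that set) are written from memory of the MARC-8
code tables and should be verified against the LC specification.

```python
ESC = b"\x1b"

# MARC-8, built by hand (assumed byte values, see the note above).
marc8_n_tilde = b"\xe4" + b"n"                 # combining tilde precedes the base letter
marc8_alpha = ESC + b"g" + b"a" + ESC + b"s"   # escape in, alpha, escape back out

# The same characters in UTF-8, using Python's built-in codec.
utf8_n_tilde = "\u00f1".encode("utf-8")   # precomposed n with tilde
utf8_alpha = "\u03b1".encode("utf-8")     # Greek small letter alpha

for label, data in [
    ("MARC-8 n-tilde", marc8_n_tilde),
    ("MARC-8 alpha", marc8_alpha),
    ("UTF-8 n-tilde", utf8_n_tilde),
    ("UTF-8 alpha", utf8_alpha),
]:
    print(label, len(data), data.hex())
```

In both encodings a single character can occupy more than one byte, so a
fixed-width "character" cannot be what the record structure is counting.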

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/



Re: [CODE4LIB] Archivists' Toolkit: Adding Digital Objects via MySQL

2012-04-18 Thread Mennerich, Donald
Rosalyn,

I've written a number of scripts of this nature. Here's a quick one I wrote 
recently to add DAOs to our AT for an audio digitization project (note it does 
not include file versions, just Components, Instances and DAOs).
It starts at the ResourceComponent identified by the long at the top of the 
script. The resourceId is also hard-coded in a number of places. I've got some 
tidier Java that runs as part of an automated process for a large digitization 
project, but all the basic Inserts are in this: 
https://github.com/yalemssa/ATK_DAO_Scripts/blob/master/components_atk.groovy

Don Mennerich
donald.menner...@yale.edu




Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Andy Kohler
I don't know about ISO 2709 itself, but the MARC21 implementation of
it refers to octets, aka 8-bit bytes:
http://www.loc.gov/marc/specifications/specrecstruc.html

Characters may be encoded using one or more than one octet, depending
on the character set. All ASCII characters are encoded using one octet
in the ASCII encoding and the Unicode UTF-8 encoding, thus a character
is equivalent in length to an octet when an element's values are
restricted to ASCII.

--Andy

On Wed, Apr 18, 2012 at 7:20 AM, Huwig,Steve huw...@oclc.org wrote:
 I could be mistaken (never having had the pleasure of reading it), but
 isn't ISO-2709 specified as a fixed number of characters, and any
 conflation of characters and 8-bit bytes is on the part of users and
 implementations?

 I think ISO 2709 might not know from bytes, only characters.


Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Houghton,Andrew
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Jonathan Rochkind
 Sent: Tuesday, April 17, 2012 19:55
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: [CODE4LIB] more on MARC char encoding: Now we're about
 ISO_2709 and MARC21
 
 Okay, forget XML for a moment, let's just look at marc 'binary'.
 
 First, for Anglophone-centric MARC21.
 
 The LC docs don't actually say quite what I thought about leader byte
 09, used to advertise encoding:
 
 
 a - UCS/Unicode
 Character coding in the record makes use of characters from the
 Universal Coded Character Set (UCS) (ISO 10646), or Unicode™, an
 industry subset.
 
 
 
 That doesn't say UTF-8. It says UCS or Unicode. What does that
 actually mean?  Does it mean UTF-8, or does it mean UTF-16 (closer to
 what used to be called UCS I think?).  Whatever it actually means, do
 people violate it in the wild?
 
First, UCS and Unicode basically mean the same thing. Second, UTF-8, UTF-16, 
and UTF-32 are encoding forms for UCS/Unicode. The MARC documentation does 
actually say that MARC binary records *must* be encoded in UTF-8 when LDR/09 
has the value 'a'.

You need to refer to the appropriate standards for this information and 
definitions:

http://www.loc.gov/marc/specifications/speccharucs.html#implementation
Unicode specifies three encoding forms, of which only one, UTF-8 (UCS 
Transformation Format 8), is authorized for use in MARC 21 records.

http://www.unicode.org/glossary/#UCS
UCS. Acronym for Universal Character Set, which is specified by International 
Standard ISO/IEC 10646, which is equivalent in repertoire to the Unicode 
Standard.

http://www.unicode.org/glossary/#unicode_encoding_form
Unicode Encoding Form. A character encoding form that assigns each Unicode 
scalar value to a unique code unit sequence. The Unicode Standard defines three 
Unicode encoding forms: UTF-8, UTF-16, and UTF-32. (See definition D79 in 
Section 3.9, Unicode Encoding Forms.)

http://www.unicode.org/glossary/#UTF_8
UTF-8. A multibyte encoding for text that represents each Unicode character 
with 1 to 4 bytes, and which is backward-compatible with ASCII. UTF-8 is the 
predominant form of Unicode in web pages. More technically: (1) The UTF-8 
encoding form. (2) The UTF-8 encoding scheme. (3) “UCS Transformation Format 
8,” defined in Annex D of ISO/IEC 10646:2003, technically equivalent to the 
definitions in the Unicode Standard.

http://www.unicode.org/glossary/#UTF_16
UTF-16. A multibyte encoding for text that represents each Unicode character 
with 2 or 4 bytes; it is not backward-compatible with ASCII. It is the internal 
form of Unicode in many programming languages, such as Java, C#, and 
JavaScript, and in many operating systems. More technically: (1) The UTF-16 
encoding form. (2) The UTF-16 encoding scheme. (3) “Transformation format for 
16 planes of Group 00,” defined in Annex C of ISO/IEC 10646:2003; technically 
equivalent to the definitions in the Unicode Standard.
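
The rule cited above (only UTF-8 is authorized when LDR/09 is 'a') can be
sketched as a small check on leader byte 09. The function name
data_encoding and the sample leaders are made up for illustration, and
"marc-8" is only a label here, since Python's standard library has no MARC-8
codec.

```python
def data_encoding(record):
    """Return a label for the character coding scheme declared in LDR/09.

    In MARC 21, a blank in leader position 09 means MARC-8 and 'a' means
    UCS/Unicode, which the specification restricts to UTF-8.
    """
    flag = record[9:10]
    if flag == b"a":
        return "utf-8"
    if flag == b" ":
        return "marc-8"
    raise ValueError("unexpected LDR/09 value: {!r}".format(flag))


# Hypothetical 24-byte leaders that differ only in position 09.
leader_unicode = b"00000nam a2200000 a 4500"
leader_marc8 = b"00000nam  2200000 a 4500"

print(data_encoding(leader_unicode))
print(data_encoding(leader_marc8))
```

This only reports what the record claims. Whether records in the wild honor
the flag is a separate question.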

Andy


Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Doran, Michael D
 ISO 2709 doesn't care how many bytes your characters are. The directory
 and offsets and other things count bytes, not characters.

That was exactly my point.  (Which I am stating since you quoted me and I 
couldn't tell if you were refuting my point, or using it to support your 
conclusion.)  ;-)

-- Michael

 -Original Message-
 From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
 Sent: Wednesday, April 18, 2012 11:09 AM
 To: Code for Libraries
 Cc: Doran, Michael D
 Subject: Re: [CODE4LIB] more on MARC char encoding: Now we're about
 ISO_2709 and MARC21
 
 On 4/18/2012 11:09 AM, Doran, Michael D wrote:
  I don't believe that is the case.  Take UTF-8 out of the picture, and
 consider the MARC-8 character set with its escape sequences and combining
 characters.  A character such as an n with a tilde would consist of two
 bytes.  The Greek small letter alpha, if invoked in accordance with ANSI
 X3.41, would consist of five bytes (two bytes for the initial escape
 sequence, a byte for the character, and then two bytes for the escape
 sequence returning to the default character set).
 
 ISO 2709 doesn't care how many bytes your characters are. The directory
 and offsets and other things count bytes, not characters. (which was, in
 my opinion, the _right_ decision, for once with marc!)
 
 How bytes translate into characters is not a concern of ISO 2709.
 
 The majority of non-7-bit-ASCII encodings will have chars that are more
 than one byte, either sometimes or always. This is true of MARC8 (some
 chars), UTF8 (some chars), and UTF16 (all chars), all of them. (It is
 not true of Latin-1 though, for instance, I don't think).
 
 ISO 2709 doesn't care what char encodings you use, and there's no
 standard ISO 2709 way to determine what char encodings are used for
 _data_ in the MARC record. ISO 2709 does say that _structural_ elements
 like field names, subfield names, the directory itself, separator chars,
 etc, all need to be (essentially, over-simplifying) 7-bit-ASCII. The
 actual data itself is application dependent, 2709 doesn't care, and 2709
 doesn't give any standard cross-2709 way to determine it.
 
 That is my conclusion at the moment, helped by all of you all in this
 thread, thanks!
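
The conclusion above (structure is walked by bytes, and decoding the data is
a separate, application-level step) can be sketched as follows. The leader
positions used (00-04 record length, 12-16 base address of data) and the
12-byte directory entries follow the MARC 21 record structure; build_record
and read_fields are hypothetical helper names, and the record is a toy with
one field.

```python
FT, RT, SF = b"\x1e", b"\x1d", b"\x1f"   # field/record terminators, subfield delimiter


def build_record(fields):
    """Assemble leader + directory + data for a list of (tag, bytes)."""
    directory, body, start = b"", b"", 0
    for tag, data in fields:
        data += FT
        entry = "{}{:04d}{:05d}".format(tag, len(data), start)
        directory += entry.encode("ascii")
        body += data
        start += len(data)
    base = 24 + len(directory) + 1          # leader + directory + its terminator
    total = base + len(body) + 1            # plus the record terminator
    leader = "{:05d}nam a22{:05d} a 4500".format(total, base).encode("ascii")
    return leader + directory + FT + body + RT


def read_fields(record):
    """Walk the directory with byte offsets only; return (tag, raw bytes)."""
    base = int(record[12:17])
    directory = record[24:base - 1]
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length, start = int(entry[3:7]), int(entry[7:12])
        # Drop the trailing field terminator from the slice.
        fields.append((tag, record[base + start:base + start + length - 1]))
    return fields


record = build_record([("245", b"10" + SF + "aPe\u00f1a".encode("utf-8"))])
fields = read_fields(record)

# Only now is a character encoding applied, and only to the data.
tag, raw = fields[0]
print(tag, raw.decode("utf-8"))
```

Nothing in read_fields knows or cares which encoding the field data uses. It
would slice MARC-8 data just as well.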


[CODE4LIB] Islandora Camp 2012 Registration Public Brainstorm/Call for Proposals

2012-04-18 Thread David Wilcox
* Apologies for cross-posting *

We're excited to invite you all to the third annual Islandora Camp
(Aug 1-3, 2012).  Islandora Camp welcomes developers, administrators,
and users of Islandora  to meet, learn, and grow the ecosystem!
Registration for Islandora Camp is now open, and is available via the
following link:
http://islandora.ca/node/add/islandora-camp-registration

Registration is $350, and includes a banquet dinner at Stanhope
(http://www.stanhopebeachresort.com) on August 2nd. The agenda is
still pending our call for proposals (see below). However, we expect a
similar structure to last year, with concurrent sessions running all
three days appropriate to both beginners and advanced Islandorians.

Public Brainstorm and Call for Proposals

We've created a Google Moderator stream for Islandora Camp here:
http://www.google.com/moderator/#16/e=1fe634. You can view all of the
presentation ideas and vote on your favourites! You can also suggest
your own ideas for posters, presentations, papers, user groups, and
workshops - just indicate whether you're volunteering to present or
just interested in attending a session on a particular topic. Please
get your suggestions into the system by the end of May to make sure
they're considered for the conference schedule.

Mark your calendars: The Red Island Repository Institute will be back
in 2012. Tentative dates are September 24-28, 2012. We will post more
information as it becomes available.

-- 
David Wilcox, BA, MLIS
Islandora Training/Support Coordinator
Robertson Library
University of Prince Edward Island
dwil...@upei.ca
Skype Name: david.wilcox82
902.620.5167


[CODE4LIB] Representing geographic hiearchy in linked data

2012-04-18 Thread Ethan Gruber
 No Message Collected 


Re: [CODE4LIB] Job: Senior Application Developer at New York Public Library

2012-04-18 Thread Ross Singer
 No Message Collected 


Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Tod Olson
In practice it seems to mean UTF-8. At least I've only seen UTF-8, and I can't 
imagine the code that processes this stuff being safe for UTF-16 or UTF-32. All 
of the offsets are byte-oriented, and there's too much legacy code that makes 
assumptions about null-terminated strings.
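
A short illustration of the null-terminated string concern, as a sketch
only. The helper c_string_length is hypothetical and merely imitates code
that stops reading at the first NUL byte.

```python
def c_string_length(data):
    """Length as seen by code that stops at the first NUL byte."""
    return data.index(b"\x00") if b"\x00" in data else len(data)


text = "MARC"
utf8 = text.encode("utf-8")        # 4 bytes, no NUL bytes
utf16 = text.encode("utf-16-le")   # 8 bytes, every second byte is NUL

print(len(utf8), c_string_length(utf8))
print(len(utf16), c_string_length(utf16))
```

For plain ASCII text, UTF-16 puts a zero byte after every character, so
such code would see the string end after the first letter.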

-Tod



[CODE4LIB] JCDL 2012 registration opens today, April 5

2012-04-18 Thread Howard, Barrie
 No Message Collected 


Re: [CODE4LIB] Job: Senior Application Developer at New York Public Library

2012-04-18 Thread Cary Gordon
 No Message Collected 


[CODE4LIB] Google Scholar Indexing Guidelines: Highwire Press vs. Eprints vs. BE Press vs. PRISM?

2012-04-18 Thread Brett Bonfield
 No Message Collected 


[CODE4LIB] Job: Records Management Archivist at Johns Hopkins University

2012-04-18 Thread jobs
The Johns Hopkins University Sheridan Libraries is hiring a Records Management
Archivist to work with the University Archivist to develop an innovative
approach to records management with the purpose of improving our stewardship
of a university history that exists in print, digitized, and born-digital form
and can come from within and outside the boundaries of official university
activity.

  
Essential skills and knowledge areas include:

  
strong technological literacy and curiosity;

deep understanding of born-digital archives and the emerging tools and
techniques used to manage born-digital archives;

experience with traditional archives functions and processes;

comprehension of holistic approaches to information management that account
for recorded memory that takes a variety of forms, including analog,
digitized, and born-digital;

excellent interpersonal and communication skills needed when interviewing a
diverse community of records creators for the purpose of discerning
information management behavior and its impact on the documentation of
university memory;

excellent writing and information visualization skills needed to create
records retention schedules, functional requirements and other documentation,
and information models;

the creativity, entrepreneurial spirit, and critical thinking competencies
needed to play a crucial role in redefining institutional records management
for an increasingly born-digital world in which important analog traces will
endure.

  
Questions about the position should be addressed to jsteele at jhu dot edu (no
phone calls, please). For a more prescribed list of duties
and additional qualifications and to apply to this unique opportunity, please
visit https://hrnt.jhu.edu/jhujobs/job_view.cfm?view_req_id=52166.



Brought to you by code4lib jobs: http://jobs.code4lib.org/job/895/