RE: MARC::Charset

Doran, Michael D Wed, 14 Mar 2007 07:02:09 -0800

Hi Ashley,

> I think &#12345; is now legal in MARC-8 to indicate a 
> Unicode character that isn't in the MARC-8 repertoire.


Yes, that's also my understanding [1,2], though I've not personally come across 
any records yet that use that method.  (Although not being a cataloger, I don't 
routinely examine a lot of MARC records.)
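
For anyone who does meet such records, here is a minimal Python sketch of 
expanding numeric character references once the rest of the field has been 
converted to text.  The function name expand_ncrs is hypothetical, and accepting 
both the decimal form (&#12345;) and a hexadecimal form (&#x1EF6;) is an 
assumption on my part; check the proposals cited below [1,2] for the form real 
records are expected to use.

```python
import re

# Matches &#NNNN; (decimal) or &#xHHHH; (hexadecimal). Accepting both forms
# is an assumption for illustration, not something taken from the proposals.
NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]{1,6})|([0-9]{1,7}));")


def expand_ncrs(text: str) -> str:
    """Replace numeric character references with the characters they name."""

    def repl(match: re.Match) -> str:
        hex_digits, dec_digits = match.group(1), match.group(2)
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Leave anything beyond the Unicode range untouched.
        return chr(code) if code <= 0x10FFFF else match.group(0)

    return NCR.sub(repl, text)
```

A bare ampersand that is not part of a reference is left alone, so ordinary 
text such as "Smith & Jones" passes through unchanged.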

> So, basically, you either need prior knowledge about the 
> actual character encoding used, or you have to test. Testing 
> for UTF-8 is fairly straightforward...

How are you testing for UTF-8?
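
One common approach, sketched here in Python purely as an illustration (I don't 
know whether it matches what you do), is to attempt a strict UTF-8 decode and 
also require at least one non-ASCII byte.  The function name looks_like_utf8 is 
hypothetical.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8 and contains a non-ASCII byte."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # Pure ASCII is valid UTF-8, Latin-1 and MARC-8 alike, so it is
    # not evidence for any one of them; report False in that case.
    return any(byte >= 0x80 for byte in data)
```

A MARC-8 string such as the bytes for "erg" + 0xE1 + "o" fails this test, 
because 0xE1 would have to be followed by UTF-8 continuation bytes rather than 
an ASCII letter.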

> Distinguishing Latin-1 from MARC-8 is a bit more like guess work.
> As a test for MARC-8 I look for the common combining diacritics
> followed by a vowel.

Do you have a programmatic way to do that test, or are you "eye-balling" the 
records?
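
A programmatic version of that test could look something like the Python sketch 
below.  It is only a heuristic under stated assumptions: the byte values are the 
MARC-8 (ANSEL) combining grave, acute, circumflex, tilde and diaeresis, the 
choice of those five as "common" is mine, and the name count_marc8_hits is 
hypothetical.

```python
import re

# MARC-8 (ANSEL) combining marks precede the letter they modify:
# 0xE1 grave, 0xE2 acute, 0xE3 circumflex, 0xE4 tilde, 0xE8 diaeresis.
# Treating these five as the "common" ones is an assumption.
MARC8_DIACRITIC_VOWEL = re.compile(rb"[\xE1\xE2\xE3\xE4\xE8][AEIOUaeiou]")


def count_marc8_hits(data: bytes) -> int:
    """Count combining-diacritic-then-vowel byte pairs in raw record data."""
    return len(MARC8_DIACRITIC_VOWEL.findall(data))
```

Latin-1 text can still trigger false positives, since 0xE1 is "a with acute" in 
Latin-1 and may legitimately be followed by a vowel, so a count is more useful 
as evidence across a whole batch than as a verdict on one record.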

Since MARC-8, Latin-1, and UTF-8 all share the same single-octet encodings for 
the ASCII repertoire of characters, it can be a bit of a problem determining the 
character set for a batch of MARC records for English-language items, due to 
the paucity of combining accent characters.  And the fact that, as you point 
out, you cannot always trust the MARC leader 09 position, and that you might in 
fact have a batch that is actually encoded in more than one character set, makes 
it even more interesting. 

-- Michael

[1] MARC PROPOSAL NO. 2006-04: Technique for conversion of Unicode to MARC-8 
    http://www.loc.gov/marc/marbi/2006/2006-04.html

[2] MARC PROPOSAL NO. 2006-09: Lossless technique for conversion of Unicode to 
    MARC-8
    http://www.loc.gov/marc/marbi/2006/2006-09.html

Plug: For more resources on character sets, with an emphasis on library 
automation, see
 - Coded Character Sets
   http://rocky.uta.edu/doran/charsets/
 - and especially
   http://rocky.uta.edu/doran/charsets/resources.html

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -----Original Message-----
> From: Ashley Sanders [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, March 14, 2007 4:59 AM
> Cc: perl4lib
> Subject: Re: MARC::Charset
> 
> > Your MARC records appear to be encoded in MARC-8 as evidenced by 
> > "ergáo" in which the combining accent character comes before the 
> > character to be modified.  I.e. the byte string that displays as 
> > "ergáo" in your email would display as "ergò" (with a Latin 
> > small letter o with grave) in a MARC-8 aware client.
> 
> I'd just like to relate my recent experiences of retrieving 
> MARC21 records through various library Z39.50 servers. Put 
> simply, you cannot trust the MARC leader character 9 to
> correctly indicate the character set used.
> 
> From libraries that have set the leader to indicate the 
> records are in the MARC-8 character set, I have retrieved 
> records encoded as Latin-1, UTF-8 and MARC-8.
> 
> From libraries that set the leader to indicate Unicode, I 
> get records in MARC-8 and UTF-8.
> 
> You also get encodings in MARC-8 records like \1EF6 to 
> indicate a Unicode character.
> I think &#12345; is now legal in MARC-8 to indicate a 
> Unicode character that isn't in the MARC-8 repertoire.
> 
> So, basically, you either need prior knowledge about the 
> actual character encoding used, or you have to test. Testing 
> for UTF-8 is fairly straightforward and a long string of text 
> (which admittedly you don't tend to get in MARC records) that
> tests as UTF-8 is very unlikely to be anything else. Distinguishing
> Latin-1 from MARC-8 is a bit more like guess work. As a test for
> MARC-8 I look for the common combining diacritics followed by a vowel.
> 
> Regards,
> 
> Ashley.
> -- 
> Ashley Sanders               [EMAIL PROTECTED]
> Copac http://copac.ac.uk A MIMAS Service funded by JISC
>
