RE: MARC::Charset question

Doran, Michael D Fri, 18 May 2007 11:17:10 -0700

Hi Michael,

> An example is the author (personal name) of the book that can 
> be found at http://catalog.loc.gov/ by searching for ISBN 
> 5040039875 (I'm guessing the fact that the website appears to 
> be displaying a corrupted name may be part of the problem here).


The Library of Congress catalog is outputting the MARC data to your browser in 
Unicode UTF-8 and it looks correct to me.  It may *appear* corrupted, depending 
on what font you choose to display the encoding (try Arial Unicode MS if you 
are in a Windows environment).

> This name is 'Dontsova, Daria' (approximately),

Below are the Unicode code points of the name in question (all are in the 
Basic Multilingual Plane, so they equal the UTF-16 code units), based on a 
copy-and-paste directly from the browser 
(http://catalog.loc.gov/cgi-bin/Pwebrecon.cgi?BBID=12550873).

U+0044  LATIN CAPITAL LETTER D
U+006F  LATIN SMALL LETTER O
U+006E  LATIN SMALL LETTER N
U+0074  LATIN SMALL LETTER T
U+FE20  LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
U+0073  LATIN SMALL LETTER S
U+FE21  LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF
U+006F  LATIN SMALL LETTER O
U+0076  LATIN SMALL LETTER V
U+0061  LATIN SMALL LETTER A
U+002C  COMMA
U+0020  SPACE, BLANK / SPACE
U+0044  LATIN CAPITAL LETTER D
U+0061  LATIN SMALL LETTER A
U+0072  LATIN SMALL LETTER R
U+02B9  SOFT SIGN, PRIME / MODIFIER LETTER PRIME
U+0069  LATIN SMALL LETTER I
U+FE20  LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
U+0061  LATIN SMALL LETTER A
U+FE21  LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF
U+002E  PERIOD, DECIMAL POINT / FULL STOP
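
As a quick cross-check of the list above, a small Python sketch can rebuild the 
string from those code points and print its UTF-16 and UTF-8 byte forms.  The 
variable names are illustrative and only the standard library is used.

```python
import unicodedata

# Code points copied from the list above (LoC catalog display of the name).
CODE_POINTS = [
    0x0044, 0x006F, 0x006E, 0x0074, 0xFE20, 0x0073, 0xFE21,
    0x006F, 0x0076, 0x0061, 0x002C, 0x0020, 0x0044, 0x0061,
    0x0072, 0x02B9, 0x0069, 0xFE20, 0x0061, 0xFE21, 0x002E,
]

name = "".join(chr(cp) for cp in CODE_POINTS)

# Every code point is in the Basic Multilingual Plane, so each one is a
# single UTF-16 code unit with the same numeric value.
utf16_hex = name.encode("utf-16-be").hex()
utf8_hex = name.encode("utf-8").hex()

print("UTF-16BE:", utf16_hex)
print("UTF-8:   ", utf8_hex)
for ch in name:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch, '<unnamed>')}")
```

In UTF-8 the two ligature halves are three bytes each (ef b8 a0 and ef b8 a1), 
and the string contains no U+0000.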


> ... in hex:
> 446f6eeb74ec736f76612c20446172a7eb69ec612e.
> When transcoded by marc8_to_utf8() the result is
> 446f6e74cda173006f76612c20446172cab969cda161002e
> - which contains 2 null (00) characters.

MARC-8: 44 6f 6e [eb] 74    [ec] 73      6f 76 61 2c 20 44 61 72 [a7]    [eb] 69 [ec]    61      2e
UTF-8:  44 6f 6e      74 [cd a1] 73 [00] 6f 76 61 2c 20 44 61 72 [ca b9]      69 [cd a1] 61 [00] 2e
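
The alignment above shows each MARC-8 combining byte, which precedes its base 
letter, reappearing after that letter in the UTF-8 output.  Here is a minimal 
Python sketch of that reordering.  It assumes a tiny hand-written mapping that 
covers only the three non-ASCII bytes in this name, with targets taken from the 
LoC catalog display; the function name marc8_name_to_unicode is hypothetical, 
and this is not how MARC::Charset is implemented.

```python
# Hypothetical, tiny mapping: only the three non-ASCII bytes in this name.
SPACING = {0xA7: "\u02b9"}                     # soft sign -> MODIFIER LETTER PRIME
COMBINING = {0xEB: "\ufe20", 0xEC: "\ufe21"}   # ligature left / right half


def marc8_name_to_unicode(data: bytes) -> str:
    """Convert the sample bytes, moving combining marks after their base."""
    out = []
    pending = []  # combining marks seen before the next base character
    for byte in data:
        if byte in COMBINING:
            pending.append(COMBINING[byte])
            continue
        if byte in SPACING:
            out.append(SPACING[byte])
        elif byte < 0x80:
            out.append(chr(byte))
        else:
            raise ValueError(f"byte 0x{byte:02X} is outside this sketch")
        # MARC-8 puts combining marks first; Unicode puts them after the base.
        out.extend(pending)
        pending.clear()
    if pending:
        raise ValueError("combining mark with no base character")
    return "".join(out)


marc8 = bytes.fromhex("446f6eeb74ec736f76612c20446172a7eb69ec612e")
converted = marc8_name_to_unicode(marc8)
print(converted.encode("utf-8").hex())
```

For this input the result contains no U+0000 and matches the code points listed 
from the catalog display.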

Hmmmm.  It looks like the MARC-8 'COMBINING LIGATURE LEFT HALF' ("0xEB") and/or 
the MARC-8 'COMBINING LIGATURE RIGHT HALF' ("0xEC") got converted to a Unicode 
'COMBINING DOUBLE INVERTED BREVE' ("0xCD 0xA1" in UTF-8 [1]).  That doesn't 
sound like something that MARC::Charset would do.
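
To check that reading of the bytes, here is a small Python sketch that decodes 
the reported output as UTF-8 and names each code point.  The variable names are 
illustrative; the hex string is the one from the quoted message.

```python
import unicodedata

# Output reported from marc8_to_utf8(), copied from the quoted message.
reported = bytes.fromhex("446f6e74cda173006f76612c20446172cab969cda161002e")
text = reported.decode("utf-8")

for index, ch in enumerate(text):
    # U+0000 has no character name, so supply a fallback label.
    label = unicodedata.name(ch, "<control>")
    print(f"{index:2d}  U+{ord(ch):04X}  {label}")

null_positions = [i for i, ch in enumerate(text) if ch == "\x00"]
breve_positions = [i for i, ch in enumerate(text) if ch == "\u0361"]
print("nulls at:", null_positions)
print("U+0361 at:", breve_positions)
```

Each U+0361 follows the first letter of a ligature pair, and each null sits 
after the second letter, which is where U+FE21 appears in the catalog's own 
rendering.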

-- Michael

[1] Unicode Character 'COMBINING DOUBLE INVERTED BREVE' (U+0361)
    http://www.fileformat.info/info/unicode/char/0361/index.htm

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/


> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
> Sent: Friday, May 18, 2007 5:49 AM
> To: perl4lib@perl.org; [EMAIL PROTECTED]
> Subject: MARC::Charset question
> 
> Hi,
> 
> I'm using marc8_to_utf8() on Library of Congress data. I'm 
> finding that I get occasional null characters inserted in the 
> output text, and I'm wondering what this means.
> 
> An example is the author (personal name) of the book that can 
> be found at http://catalog.loc.gov/ by searching for ISBN 
> 5040039875 (I'm guessing the fact that the website appears to 
> be displaying a corrupted name may be part of the problem here).
> 
> This name is 'Dontsova, Daria' (approximately), in hex:
> 446f6eeb74ec736f76612c20446172a7eb69ec612e. When transcoded by
> marc8_to_utf8() the result is
> 446f6e74cda173006f76612c20446172cab969cda161002e - which 
> contains 2 null (00) characters.
> 
> Is it safe to ignore these null characters (i.e. strip them 
> out of the result, which otherwise seems good)?
> 
> Thanks,
> 
> Michael
>
