First, we probably want to figure out what character set the records are
encoded in as received from LOC.  Since only the non-ASCII characters will
give us a clue, we can look at the umlauted-u ("ü") in Dürer.

Charset    Hex character(s) used to represent "ü"
-------    --------------------------------------
MARC-8     0xE8 0x75 (combining umlaut/diaeresis preceding latin small
           letter u)
MARC-UCS*  0x75 0xCC 0x88 (latin small letter u followed by combining
           umlaut/diaeresis -- in this case the combining character is
           represented by two bytes)
Latin-1    0xFC (precomposed latin small letter u with umlaut/diaeresis)

* MARC-UCS/Unicode is UTF-8 encoded, therefore U+0075 becomes 0x75 and
U+0308 becomes 0xCC 0x88.  The MARC-21 specification does not allow the use
of the precomposed Unicode character for an umlauted-u.
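The table can be turned into a quick sanity check.  Here is a minimal
sketch (Python, purely to illustrate the byte sequences; the function name
and sample bytes are my own, not from any MARC library):

```python
# The three byte sequences from the table above, checked against a raw
# record.  Order matters only in that each pattern is unambiguous here:
# 0xE8 0x75 is not valid UTF-8, and 0xFC appears in neither of the others.
MARC8_UMLAUT_U   = b"\xE8\x75"      # combining diaeresis, then "u"
UTF8_U_DIAERESIS = b"\x75\xCC\x88"  # "u", then combining diaeresis (UTF-8)
LATIN1_U_UMLAUT  = b"\xFC"          # precomposed u-with-diaeresis

def guess_charset(raw: bytes) -> str:
    """Guess a record's encoding from how it spells the u-umlaut."""
    if MARC8_UMLAUT_U in raw:
        return "MARC-8"
    if UTF8_U_DIAERESIS in raw:
        return "MARC-UCS/UTF-8"
    if LATIN1_U_UMLAUT in raw:
        return "Latin-1"
    return "unknown"

print(guess_charset(b"The world of D\xE8\x75rer"))   # MARC-8
```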

>   aRussell, Francis,d1910-14aThe world of D^Žurer,

Since you are getting the base character OK (latin small letter u), we
should probably assume a base-plus-combining character scheme, and since the
combining character(s) come *before* the base character, we can probably
assume MARC-8.  If we could actually *verify* the hex encoding, we could go
on to what is happening to the records subsequent to the Z39.50 download...
and to what to do with MARC-8, since it is not a character set used outside
of library-specific software applications.  ;-)
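One way to actually *verify* the hex encoding is to dump the bytes around
the part of the string that survives intact.  A quick sketch (Python for
illustration; the helper name and the sample fragment are made up):

```python
def hex_context(raw: bytes, needle: bytes, pad: int = 4) -> str:
    """Return the bytes around `needle` as a hex string for inspection."""
    pos = raw.find(needle)
    if pos < 0:
        return ""
    chunk = raw[max(pos - pad, 0):pos + len(needle) + pad]
    return " ".join(f"{b:02X}" for b in chunk)

# With a (made-up) MARC-8 fragment, the E8 75 pair shows up right away:
print(hex_context(b"of D\xE8\x75rer,", b"rer"))
# → 20 44 E8 75 72 65 72 2C
```

If you see E8 75 immediately before the "rer", the record is MARC-8.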

BTW, the character set should also agree with the value in character
position 9 in the leader of the MARC record:
        09 - Character coding scheme
        Identifies the character coding scheme used in the record. 
        # - MARC-8 (the pound symbol "#" represents a blank in this case)
        a - UCS/Unicode 
        [from http://www.loc.gov/marc/bibliographic/ecbdldrd.html#mrcblea ]
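Reading that flag out of a raw record is a one-liner; sketched here in
Python (the helper is hypothetical, not part of any MARC library):

```python
def coding_scheme(raw: bytes) -> str:
    """Map Leader character position 9 to the coding scheme it declares."""
    flag = raw[9:10]
    if flag == b" ":          # blank ("#" in the LOC documentation)
        return "MARC-8"
    if flag == b"a":
        return "UCS/Unicode"
    return f"undefined ({flag!r})"

# A fabricated leader with a blank at position 9, i.e. MARC-8:
print(coding_scheme(b"00901nam  22002651  4500"))   # MARC-8
```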

> From: Tajoli Zeno [mailto:[EMAIL PROTECTED]
> The best option, now, is to use charset where 1 character
> is always 1 byte, for example ISO 8859_1

Be aware that converting MARC-8 to Latin-1 has the potential for data loss,
since many more characters can be represented in MARC-8 than in Latin-1.
The better bet is to convert to Unicode UTF-8 (or to get the records in
that character set to begin with, if that is an option).
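The data-loss point is easy to demonstrate with plain string encoding
(Python here, just for illustration): a character such as the Polish "ł",
which the MARC-8 repertoire covers, round-trips through UTF-8 but has no
Latin-1 code point at all.

```python
text = "Bia\u0142ystok"   # "Białystok" -- a plausible heading string

# UTF-8 handles it: the "ł" becomes the two bytes 0xC5 0x82.
print(text.encode("utf-8"))

# Latin-1 simply has no slot for it, so the conversion fails outright
# (or, with a lossy error handler, silently drops/replaces the character).
try:
    text.encode("latin-1")
except UnicodeEncodeError:
    print("Latin-1 cannot represent:", "\u0142")
```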

> > From: Tajoli Zeno [mailto:[EMAIL PROTECTED]
> > 3)In the leader, position 0-4 you have the number of 
> > character, NOT the number of bytes. 
>
> From: Colin Campbell [mailto:[EMAIL PROTECTED]
> No it should be number of bytes (LOC has clarified this in 
> their spec by saying "number of octets".) It has always
> been the length in bytes.

From the MARC 21 specifications...

  UCS/Unicode Markers and the MARC 21 Record Leader
  
  In MARC 21 records, Leader character position 9 contains value
  a if the data is encoded using UCS/Unicode characters. If any
  UCS/Unicode characters are to be included in the MARC 21 record,
  the entire MARC record must be encoded using UCS/Unicode characters.
  The record length contained in Leader positions 0-4 is a count of
  the number of octets in the record, not characters. The Leader
  position 9 value is not dependent on the character encoding used.
  This rule applies to MARC 21 records encoded using both the MARC-8
  and UCS/Unicode character sets.
  [from http://www.loc.gov/marc/specifications/speccharucs.html]
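In other words, for a UTF-8 record the Leader/00-04 count diverges from the
character count as soon as a non-ASCII character appears.  A tiny Python
illustration (using a precomposed ü only to keep the string short; as the
footnote above notes, MARC 21 itself requires the decomposed form):

```python
title = "The world of D\u00fcrer"        # 18 characters

chars  = len(title)                      # character count: 18
octets = len(title.encode("utf-8"))      # octet count: 19 -- the ü is 2 bytes
print(chars, octets)                     # → 18 19
```

It is the 19, not the 18, that belongs in Leader positions 0-4.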

-- Michael

#  Michael Doran, Systems Librarian
#  University of Texas at Arlington
#  817-272-5326 office 
#  817-239-5368 cell
#  [EMAIL PROTECTED]
#  http://rocky.uta.edu/doran/



> -----Original Message-----
> From: Eric Lease Morgan [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, December 16, 2003 7:57 AM
> To: Perl4Lib
> Subject: Net::Z3950 and diacritics
> 
> 
> On 12/15/03 8:54 AM, Eric Lease Morgan <[EMAIL PROTECTED]> wrote:
> 
> > In order to get the MARC records for my "catalog" I have been
> > searching the LOC catalog, identifying the record I desire, and
> > using Net::Z3950 to download the desired record via the MARC 001
> > tag. Tastes great. Less filling.
> > 
> > When I loop through my MARC records MARC::Batch sometimes warns
> > that the MARC leader is incorrect. This happens when the record
> > contains a diacritic. Specifically, my MARC::Batch object returns
> > "Invalid record length..." I have discovered that I can plow right
> > through the record anyway by turning on strict_off, but my
> > resulting records get really ugly at the point of the diacritic:
> > 
> > http://infomotions.com/books/?cmd=search&query=id=russell-world-107149566
> 
> Upon further investigation, it seems that MARC::Batch is not
> necessarily causing my problem with diacritics; instead, the problem
> may lie in the way I am downloading my records using Net::Z3950.
> 
> How do I tell Net::Z3950 to download a specific MARC record using a
> specific character set?
> 
> To download my MARC records from the LOC I feed a locally developed
> Perl script, using Net::Z3950, the value from a LOC MARC record,
> field 001. This retrieves one and only one record. I then suck up the
> found record and put it into a MARC::Record object. It is all done
> like this:
> 
> 
>   # define some constants
>   my $DATABASE = 'voyager';
>   my $SERVER   = 'z3950.loc.gov';
>   my $PORT     = '7090';
>   
>   # create a LOC (Voyager) 001 query
>   my $query = "\@attr 1=7 3118006";
>   
>   # create a z39.50 object
>   my $z3950 = Net::Z3950::Manager->new(databaseName => $DATABASE);
>   
>   # assign the object some z39.50 characteristics
>   $z3950->option(elementSetName => "f");
>   $z3950->option(preferredRecordSyntax => Net::Z3950::RecordSyntax::USMARC);
>       
>   # connect to the server and check for success
>   my $connection = $z3950->connect($SERVER, $PORT);
>       
>   # search
>   my $results = $connection->search($query);
>   
>   # get the found record and turn it into a MARC::Record object
>   my $record = $results->record(1);
>   $record = MARC::Record->new_from_usmarc($record->rawdata());
> 
>   # create a file name
>   my $id = time;
> 
>   # write the record
>   open MARC, "> $id.marc";
>   print MARC $record->as_usmarc;
>   close MARC;
> 
> 
> This process works just fine for records that contain no diacritics,
> but when diacritics are in the records extra characters end up in my
> saved files, like this:
> 
>   00901nam  22002651
>     ^^^
>   45000010008000000050017000080080041000250350021000669060045000870
>   10001700132040001800149050001800167082001000185100002900195245009
>   20022426000340031630000470035049000290039750400260042660000340045
>   27100021004869910044005079910055005510990029006063118006
>   19740417000000.0731207s1967    nyuabf   b    000 0beng  
>   9(DLC)   67029856  a7bcbccorignewdueocipf19gy-gencatlg
>   a   67029856   aDLCcDLCdDLC00aND588.D9bR8500a759.31
>   aRussell, Francis,d1910-14aThe world of D^Žurer,
>                                               ^^^^^^^
>   1471-1528,cby Francis Russell and the editors of Time-Life
>   Books.  aNew York,bTime, inc.c[1967]  a183 p.billus.,
>   maps, col. plates.c32 cm.0 aTime-Life library of art
>   aBibliography: p. 177.10aD^Žurer, Albrecht,d1471-1528.2
>                               ^^^^^^^
>   aTime-Life Books.  bc-GenCollhND588.D9iR85tCopy 1wBOOKS
>   bc-GenCollhND588.D9iR85p00034015107tCopy 2wCCF
>   arussell-world-1071495663
> 
> Notice how Dürer got munged into D^Žurer, twice, and consequently the
> record length is not 901 but 903 instead.
> 
> Some people say I must be sure to request a specific character set
> from the LOC when downloading my MARC records, specifically MARC-8 or
> MARC-UCS. Which one of these character sets do I want and how do I
> tell the remote database which one I want?
> 
> -- 
> Eric "The Ugly American Who Doesn't Understand Diacritics" Morgan
> University Libraries of Notre Dame
> 
> (574) 631-8604
> 
> 
