First, we probably want to figure out what character set the records are encoded in as received from LOC. Since only the non-ASCII characters will give us a clue, we can look at the umlauted-u ("ü") in Dürer.
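If you still have one of the saved files lying around, a quick hex dump will show exactly which bytes are standing in for the "ü". Something along these lines would do it (a rough sketch -- the filename here is hypothetical, use whatever your script wrote out):

   use strict;
   use warnings;

   # slurp one of the saved records as raw bytes
   open my $fh, '<', '107149566.marc' or die "can't open record: $!";
   binmode $fh;
   my $raw = do { local $/; <$fh> };
   close $fh;

   # print hex values for 10 bytes before and after the first "urer"
   my $pos = index $raw, 'urer';
   if ($pos >= 10) {
       for my $byte (split //, substr($raw, $pos - 10, 20)) {
           printf '%02X ', ord $byte;
       }
       print "\n";
   }

Whatever turns up can be matched against these possibilities: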
   Charset     hex character(s) used to represent "ü"
   -------     ---------------------------------------
   MARC-8      0xE8 0x75  (combining umlaut/diaeresis preceding latin
               small letter u)
   MARC-UCS*   0x75 0xCC 0x88  (latin small letter u followed by
               combining umlaut/diaeresis -- in this case the combining
               character is represented by two bytes)
   Latin-1     0xFC  (precomposed latin small letter u with
               umlaut/diaeresis)

   * MARC-UCS/Unicode is UTF-8 encoded, therefore U+0075 becomes 0x75
     and U+0308 becomes 0xCC 0x88.  The MARC-21 specification does not
     allow the use of the precomposed Unicode character for an
     umlauted u.

> aRussell, Francis,d1910-14aThe world of D^Žurer,

Since you are getting the base character OK (latin small letter u), we
should probably assume a base-plus-combining character scheme, and since
the combining character(s) come *before* the base character, we can
probably assume MARC-8.

If we can actually *verify* the hex encoding, we can go on to what is
happening to the records subsequent to the Z39.50 download... and to
what to do with MARC-8, since it is not a character set used outside of
library-specific software applications. ;-)

BTW, the character set should also agree with the value in character
position 9 in the leader of the MARC record:

   09 - Character coding scheme
        Identifies the character coding scheme used in the record.
        # - MARC-8 (the pound symbol "#" represents a blank in this case)
        a - UCS/Unicode

   [from http://www.loc.gov/marc/bibliographic/ecbdldrd.html#mrcblea ]

> From: Tajoli Zeno [mailto:[EMAIL PROTECTED]
> The best option, now, is to use charset where 1 character
> is always 1 byte, for example ISO 8859_1

Be aware that converting MARC-8 to Latin-1 has the potential for data
loss, since there are many more characters that can be represented in
MARC-8 than can be represented in Latin-1.  The better bet is to convert
to Unicode UTF-8 (or to get the records in that character set to begin
with, if that is an option).

> > From: Tajoli Zeno [mailto:[EMAIL PROTECTED]
> > 3) In the leader, position 0-4 you have the number of
> > characters, NOT the number of bytes.
>
> From: Colin Campbell [mailto:[EMAIL PROTECTED]
> No, it should be the number of bytes (LOC has clarified this in
> their spec by saying "number of octets").  It has always
> been the length in bytes.

From the MARC 21 specifications...

   UCS/Unicode Markers and the MARC 21 Record Leader

   In MARC 21 records, Leader character position 9 contains value a if
   the data is encoded using UCS/Unicode characters.  If any UCS/Unicode
   characters are to be included in the MARC 21 record, the entire MARC
   record must be encoded using UCS/Unicode characters.  The record
   length contained in Leader positions 0-4 is a count of the number of
   octets in the record, not characters.  The Leader position 9 value is
   not dependent on the character encoding used.  This rule applies to
   MARC 21 records encoded using both the MARC-8 and UCS/Unicode
   character sets.

   [from http://www.loc.gov/marc/specifications/speccharucs.html]
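To pull those checks together, a rough sketch along these lines (again
assuming the record has already been saved to a file, named on the
command line) would report the leader values and hunt for the three
byte patterns:

   use strict;
   use warnings;

   my $file = shift or die "usage: $0 file.marc\n";
   open my $fh, '<', $file or die "can't open $file: $!";
   binmode $fh;
   my $raw = do { local $/; <$fh> };
   close $fh;

   # leader/09: character coding scheme (blank = MARC-8, 'a' = UCS/Unicode)
   my $scheme = substr $raw, 9, 1;
   printf "leader/09 = '%s' (%s)\n", $scheme,
       $scheme eq ' ' ? 'MARC-8' : $scheme eq 'a' ? 'UCS/Unicode' : 'unexpected';

   # leader/00-04: record length, which should be a count of octets
   printf "leader says %d octets, record is actually %d octets\n",
       substr($raw, 0, 5), length $raw;

   # crude test for the three representations of the u-umlaut; false
   # positives are possible, so treat this as a hint, not proof
   print "found MARC-8 style  (0xE8 0x75)\n"      if $raw =~ /\xE8\x75/;
   print "found UTF-8 style   (0x75 0xCC 0x88)\n" if $raw =~ /\x75\xCC\x88/;
   print "found Latin-1 style (0xFC)\n"           if $raw =~ /\xFC/;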
-- 
Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-239-5368 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/


> -----Original Message-----
> From: Eric Lease Morgan [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, December 16, 2003 7:57 AM
> To: Perl4Lib
> Subject: Net::Z3950 and diacritics
>
>
> On 12/15/03 8:54 AM, Eric Lease Morgan <[EMAIL PROTECTED]> wrote:
>
> > In order to get the MARC records for my "catalog" I have been
> > searching the LOC catalog, identifying the record I desire, and
> > using Net::Z3950 to download the desired record via the MARC 001
> > tag. Tastes great. Less filling.
> >
> > When I loop through my MARC records MARC::Batch sometimes warns that
> > the MARC leader is incorrect. This happens when the record contains
> > a diacritic. Specifically, my MARC::Batch object returns "Invalid
> > record length..." I have discovered that I can plow right through
> > the record anyway by turning on strict_off, but my resulting records
> > get really ugly at the point of the diacritic:
> >
> > http://infomotions.com/books/?cmd=search&query=id=russell-world-107149566
>
> Upon further investigation, it seems that MARC::Batch is not
> necessarily causing my problem with diacritics; instead, the problem
> may lie in the way I am downloading my records using Net::Z3950.
>
> How do I tell Net::Z3950 to download a specific MARC record using a
> specific character set?
>
> To download my MARC records from the LOC I feed a locally developed
> Perl script, using Net::Z3950, the value from a LOC MARC record, field
> 001. This retrieves one and only one record. I then suck up the found
> record and put it into a MARC::Record object. It is all done like
> this:
>
>
>   # define some constants
>   my $DATABASE = 'voyager';
>   my $SERVER   = 'z3950.loc.gov';
>   my $PORT     = '7090';
>
>   # create a LOC (Voyager) 001 query
>   my $query = '@attr 1=7 3118006';
>
>   # create a z39.50 object
>   my $z3950 = Net::Z3950::Manager->new(databaseName => $DATABASE);
>
>   # assign the object some z39.50 characteristics
>   $z3950->option(elementSetName => "f");
>   $z3950->option(preferredRecordSyntax => Net::Z3950::RecordSyntax::USMARC);
>
>   # connect to the server and check for success
>   my $connection = $z3950->connect($SERVER, $PORT);
>
>   # search
>   my $results = $connection->search($query);
>
>   # get the found record and turn it into a MARC::Record object
>   my $record = $results->record(1);
>   $record = MARC::Record->new_from_usmarc($record->rawdata());
>
>   # create a file name
>   my $id = time;
>
>   # write the record
>   open MARC, "> $id.marc";
>   print MARC $record->as_usmarc;
>   close MARC;
>
>
> This process works just fine for records that contain no diacritics,
> but when diacritics are in the records extra characters end up in my
> saved files, like this:
>
> 00901nam 22002651
>   ^^^
> 45000010008000000050017000080080041000250350021000669060045000870
> 10001700132040001800149050001800167082001000185100002900195245009
> 20022426000340031630000470035049000290039750400260042660000340045
> 27100021004869910044005079910055005510990029006063118006
> 19740417000000.0731207s1967 nyuabf b 000 0beng
> 9(DLC) 67029856 a7bcbccorignewdueocipf19gy-gencatlg
> a 67029856 aDLCcDLCdDLC00aND588.D9bR8500a759.31
> aRussell, Francis,d1910-14aThe world of D^Žurer,
>                                          ^^^^^^^
> 1471-1528,cby Francis Russell and the editors of Time-Life
> Books.
> aNew York,bTime, inc.c[1967] a183 p.billus.,
> maps, col. plates.c32 cm.0 aTime-Life library of art
> aBibliography: p. 177.10aD^Žurer, Albrecht,d1471-1528.2
>                          ^^^^^^^
> aTime-Life Books. bc-GenCollhND588.D9iR85tCopy 1wBOOKS
> bc-GenCollhND588.D9iR85p00034015107tCopy 2wCCF
> arussell-world-1071495663
>
> Notice how Dürer got munged into D^Žurer, twice, and consequently the
> record length is not 901 but 903 instead.
>
> Some people say I must be sure to request a specific character set
> from the LOC when downloading my MARC records, specifically MARC-8 or
> MARC-UCS. Which one of these character sets do I want, and how do I
> tell the remote database which one I want?
>
> --
> Eric "The Ugly American Who Doesn't Understand Diacritics" Morgan
> University Libraries of Notre Dame
>
> (574) 631-8604
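P.S.  As a stopgap until the character set question gets sorted out on
the Z39.50 side, something along these lines might at least let
MARC::Record parse what you are getting now.  This is only a rough
sketch: it assumes a recent enough MARC::Charset that exports
marc8_to_utf8, and recomputing the length only quiets the "Invalid
record length" complaint -- if bytes really were altered in transit, the
directory offsets can still be off.

   use strict;
   use warnings;
   use MARC::Record;
   use MARC::Charset qw(marc8_to_utf8);

   # Rebuild leader/00-04 from the octets actually received, then parse.
   sub parse_raw_usmarc {
       my $raw = shift;
       substr($raw, 0, 5) = sprintf '%05d', length $raw;   # leader/00-04
       return MARC::Record->new_from_usmarc($raw);
   }

   # Hypothetical use inside the download script above, once the record
   # is known to really be MARC-8 (leader/09 blank):
   #   my $record = parse_raw_usmarc($results->record(1)->rawdata());
   #   my $title  = marc8_to_utf8($record->subfield('245', 'a'));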