> I was under the impression that the MARC record length in the
> Leader was the record length in bytes rather than the number
> of characters.

According to this source, the Leader record length is in bytes:

  MARC Leader > record length = "Five numeric characters equal to the
  total number of bytes in the logical record" [1]

I also checked my charset mail folder and found this in a message from
way back in 2003:

  "...there is some difficulty computing the record length properly,
  since MARC::Record uses character length, rather than byte length,
  which are the same thing when you are dealing with 8 bit characters."
  -- Ed Summers [2]

I looked through the MARC::Record CHANGES file [3]. Although there are
some enhancements/fixes regarding the use of UTF-8, I don't see anything
that explicitly states that more current versions of MARC::Record now
compute the record length in bytes. It seems like that would be a good
thing.

-- Michael

[1] MARC 21 Record Builder
    http://www.loc.gov/marc/marc2onix.html
[2] "MARC-Charset-0.5 questions" July 2003 thread on perl4lib
[3] CHANGES : Revision history for Perl extension MARC::Record.
    http://search.cpan.org/src/MIKERY/MARC-Record-2.0.0/Changes

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/

> -----Original Message-----
> From: Doran, Michael D
> Sent: Monday, March 03, 2008 10:36 AM
> To: 'Leif Andersson'; perl4lib@perl.org
> Subject: RE: Help for utf-8 output
>
> Hi Leif,
>
> I really appreciate you taking a look at this and responding.
> Although I consider myself somewhat knowledgeable about
> character sets, I still find these kinds of problems to be confusing.
>
> > In this case the leader and actual length will not agree, as your
> > utf8 characters have turned into latin1.
>
> I was under the impression that the MARC record length in the
> Leader was the record length in bytes rather than the number
> of characters. Is that your understanding?
>
> Also, I am still troubleshooting my particular set of records
> (I was out of town last week) since this problem only appears
> to manifest itself for records with non-ASCII characters in
> the 100 and 245 fields. Records with a note field having
> non-ASCII characters don't cause a problem.
>
> -- Michael
>
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 mobile
> # [EMAIL PROTECTED]
> # http://rocky.uta.edu/doran/
>
>
> > -----Original Message-----
> > From: Leif Andersson [mailto:[EMAIL PROTECTED]
> > Sent: Saturday, March 01, 2008 2:51 PM
> > To: Doran, Michael D; perl4lib@perl.org; [EMAIL PROTECTED]
> > Subject: Re: Help for utf-8 output
> >
> > It seems there is a little bug (by design) kicking in.
> >
> > The leader comes out wrong, and so do some characters, in this case:
> > + Reading a raw marc record (utf8) from file
> > + Turning it into a MARC::Record object
> > + Without modification writing it out to file.
> > Yes. Even without modification the bug manifests itself!
> >
> > Let's start with code simply copying one record from a file utf8.mrc
> > containing one or more marc records. This basic operation not
> > involving MARC::Record is OK.
> >
> > #!perl -w
> > use strict;
> > #
> > open(IN, "utf8.mrc") || die "1";
> > open(OUT, ">out_good.mrc") || die "2";
> > binmode IN;
> > binmode OUT;
> > #
> > # Read in raw MARC
> > $/ = "\x1D";
> > my $marc = <IN>;
> > print OUT $marc;
> > __END__
> >
> > Now, we're adding MARC::Record to the process, along with some debug
> > info.
> > Example code producing *faulty* record:
> >
> > #!perl -w
> > use strict;
> > use MARC::Record;
> > use Devel::Peek;
> > #
> > open(IN, "utf8.mrc") || die "1";
> > open(OUT, ">out_bad.mrc") || die "2";
> > binmode IN;
> > binmode OUT;
> > #
> > # Read in raw MARC
> > $/ = "\x1D";
> > my $marc = <IN>;
> > Dump($marc); # the utf8-flag is not on
> > my $obj = MARC::Record->new_from_usmarc( $marc );
> > # Convert back to raw MARC
> > my $marc2 = $obj->as_usmarc();
> > Dump($marc2); # the utf8-flag IS on
> > print OUT $marc2;
> > __END__
> >
> >
> > In this case the leader and actual length will not agree, as your
> > utf8 characters have turned into latin1.
> > The problem is that $marc2 has the utf8 flag set internally by Perl.
> > And the conversion on output is made in spite of binmode.
> >
> > We can get around the problem by either (for instance)
> > use bytes;
> > or
> > Encode::_utf8_off($marc2);
> > before printing to file.
> >
> > But shouldn't MARC::Record take care of this for us?
> > A file of MARC records may contain records in different encodings.
> > The text parts of a MARC record can be treated as made up by certain
> > encodings, but the "blob" itself, I suppose, should be exposed to the
> > caller as pure binary.
> >
> > Are there any drawbacks in letting MARC::Record strip off any eventual
> > utf8 flag before returning the record as_usmarc() ?
> > If not I suggest this change be made to a future release of
> > MARC::Record.
> >
> > I shall also add that this character mess only sets in when doing IO.
> > If you are updating your databases through one API or another you are
> > probably OK!
> >
> >
> > Leif
> > ======================================
> > Leif Andersson, Systems Librarian
> > Stockholm University Library
> > SE-106 91 Stockholm
> > SWEDEN
> > Phone : +46 8 162769
> > Mobile: +46 70 6904281
> >
> > -----Original message-----
> > From: Doran, Michael D [mailto:[EMAIL PROTECTED]
> > Sent: 21 February 2008 18:49
> > To: perl4lib@perl.org
> > Subject: RE: Help for utf-8 output
> >
> > Hi Jackie,
> >
> > I'm working on a very similar problem... converting
> > theses/dissertations records (in XML) to MARC records. I'm still in
> > the testing stage, but have had similar problems with records with
> > diacritics in the 100 or 245 fields (however diacritics in a 520a
> > field don't seem to cause any problems). Since our records are not
> > "diacritic rich" it's hard to determine the exact extent of the
> > problem.
> >
> > I am using these versions:
> > Perl v5.8.8
> > MARC::Charset 0.98
> > MARC::Lint 1.43
> > MARC::Record 2.0
> > XML::LibXML 1.66
> >
> > Here's an example "bad" record (which I have minimized to just the 245
> > field):
> >
> > marcdump test.mrc
> > test.mrc
> > LDR 00127cam a2200037 4500
> > 245 13 _aAn Empirical Test Of The Situational Leadership® Model In
> > Japan /
> > _cRiho Yoshioka.
> >
> > Recs Errs Filename
> > ----- ----- --------
> > 1 1 test.mrc
> >
> > When I run test.mrc through MARC::Lint, I get this message:
> >
> > Invalid record length in record 1: Leader says 00127 bytes but it's
> > actually 125
> > Invalid length in directory for tag 245 in record 1
> > field does not end in end of field character in tag 245 in record 1
> >
> > When examined in vi the character in question, a Registered Sign,
> > appears to be correctly UTF-8 encoded C2AE, and the bib Leader
> > (position 09=a) indicates that it is Unicode encoded.
> > I've attached the MARC record.
> >
> > I noticed that when I run your record (ck245.dat) through MARC::Lint,
> > I get the same invalid record length message:
> >
> > Invalid record length in record 3: Leader says 00567 bytes but it's
> > actually 569
> > field does not end in end of field character in tag 100 in record 3
> > field does not end in end of field character in tag 245 in record 3
> > Invalid indicators ".10" forced to blanks in record 3 for tag 245
> >
> > field does not end in end of field character in tag 260 in record 3
> > Invalid indicators ". " forced to blanks in record 3 for tag 260
> >
> > field does not end in end of field character in tag 300 in record 3
> > Invalid indicators ". " forced to blanks in record 3 for tag 300
> >
> > field does not end in end of field character in tag 502 in record 3
> > Invalid indicators ". " forced to blanks in record 3 for tag 502
> >
> > field does not end in end of field character in tag 504 in record 3
> > Invalid indicators ". " forced to blanks in record 3 for tag 504
> >
> > field does not end in end of field character in tag 690 in record 3
> > Invalid indicators ". 4" forced to blanks in record 3 for tag 690
> >
> > Anybody have any ideas?
> >
> > -- Michael
> >
> > # Michael Doran, Systems Librarian
> > # University of Texas at Arlington
> > # 817-272-5326 office
> > # 817-688-1926 mobile
> > # [EMAIL PROTECTED]
> > # http://rocky.uta.edu/doran/
> >
> >
> > > -----Original Message-----
> > > From: Shieh, Jackie [mailto:[EMAIL PROTECTED]
> > > Sent: Tuesday, February 19, 2008 10:50 AM
> > > To: perl4lib@perl.org
> > > Subject: Help for utf-8 output
> > >
> > > I was wondering if anyone has similar experience and has come up with
> > > good solutions to help solving the challenge below?!
> > >
> > > What I have is an Excel spreadsheet for dissertations which I have
> > > saved as a tab delimited file (examining the file in TextPad, the
> > > diacritics appear to be fine), then read in and output the file as a
> > > utf-8 MARC file. I <print> title field confirming author field that
> > > contains diacritics with the title showing proper indicator values.
> > >
> > > But when I looked at the MARC itself, the fields that follow the
> > > field containing diacritics are all off their original positions.
> > > See attached zip file. Examples below (the first two have diacritics
> > > in a 100 field; in the last one the diacritic is in 245 subfield b):
> > >
> > > 001 diss 34001
> > > 100 1 _aP<E9>rez, Nancy L.
> > > 245 _aSynchronic and Diachronic Matlatzinkan Phonology.
> > >
> > > 001 diss 34042
> > > 100 1 _aValent<ED>n-M<E1>rquez, Wilfredo
> > > 245 _aDoing being boricua :
> > >
> > > 001 diss 33892
> > > 100 1 _aDavis, Jennifer M.
> > > 245 14 _aThe Functional Complexities of Inherited Cardiac Troponin I
> > > Mutations :
> > > _bIdentification of Ca<B2>+ Independent Contractile
> > > Dysfunction.
> > >
> > > I would greatly appreciate any suggestions to solve this.
> > > Thank you most kindly.
> > >
> > > Regards,
> > >
> > > --Jackie
> > >
> > > |Jackie Shieh
> > > |Data Loads & Development
> > > |Harlan Hatcher Graduate Library
> > > |University of Michigan
> > > |920 North University
> > > |Ann Arbor, MI 48109-1205
> > > |Phone: 734.763.6070 FAX: 734.615.9788
> > > |E-mail: JShieh [AT] umich [DOT] edu
> > >
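
For reference, the byte-versus-character distinction discussed in this
thread can be seen without MARC::Record at all. The following is only a
minimal sketch using the core Encode module: the title string is a
made-up stand-in for a 245 $a value (not taken from any attached
record), and it makes no claim about how any MARC::Record release
computes the Leader length.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical 245 $a value; \x{AE} is the Registered Sign (U+00AE).
my $title = "Situational Leadership\x{AE} Model";

# length() on a character string counts characters, not bytes.
my $char_len = length($title);

# Encode to UTF-8 octets first; length() on the result counts bytes,
# which is the unit the MARC Leader record length is defined in.
my $bytes    = encode('UTF-8', $title);
my $byte_len = length($bytes);

printf "characters: %d, bytes: %d\n", $char_len, $byte_len;

# The encoded copy is a plain byte string (utf8 flag off), so a
# filehandle in binmode writes it out unchanged.
printf "utf8 flag on encoded copy: %s\n",
    utf8::is_utf8($bytes) ? "on" : "off";
```

The Registered Sign takes two bytes in UTF-8 (C2 AE), so the byte count
comes out one higher than the character count. A length computed on
characters and a length computed on bytes will therefore disagree for
any record containing non-ASCII text.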