RE: Help for utf-8 output - followup on Record Length

Doran, Michael D Mon, 03 Mar 2008 13:15:58 -0800

> I was under the impression that the MARC record length in the 
> Leader was the record length in bytes rather than the number 
> of characters.


According to this source, the Leader record length is in bytes:

  MARC Leader > record length = "Five numeric characters equal
  to the total number of bytes in the logical record" [1]
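
A quick way to test a raw record against that definition is to compare the first five bytes of the Leader with the byte length of the whole record. The following is only a sketch: the record is synthetic, and leader_length_matches() is a hypothetical helper of my own, not part of MARC::Record or MARC::Lint.

```perl
use strict;
use warnings;

# Build a tiny synthetic record as raw bytes (illustrative data only).
# "\xC2\xAE" is the two-byte UTF-8 encoding of the Registered Sign.
my $field     = "10" . "\x1F" . "a" . "Leadership\xC2\xAE Model" . "\x1E";
my $directory = sprintf("%s%04d%05d", "245", length($field), 0) . "\x1E";
my $base      = 24 + length($directory);      # the Leader is always 24 bytes
my $total     = $base + length($field) + 1;   # +1 for the record terminator
my $leader    = sprintf("%05dnam a22%05d   4500", $total, $base);
my $record    = $leader . $directory . $field . "\x1D";

# Hypothetical helper: returns 1 if Leader/00-04 equals the byte length
# of $raw, 0 otherwise. Assumes $raw is a byte string (read with binmode
# and never decoded), so that length() counts bytes.
sub leader_length_matches {
    my ($raw) = @_;
    my $declared = substr($raw, 0, 5);
    return 0 unless $declared =~ /\A\d{5}\z/;
    return $declared == length($raw) ? 1 : 0;
}

print leader_length_matches($record) ? "ok\n" : "mismatch\n";
```

If the two UTF-8 bytes of the Registered Sign are replaced by a single Latin-1 byte after the Leader has been computed, the same check fails, because the record is then one byte shorter than the Leader claims.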

I also checked my charset mail folder and found this in a message from way back 
in 2003:

  "...there is some difficulty computing the record length properly,
  since MARC::Record uses character length, rather than byte length,
  which are the same thing when you are dealing with 8 bit characters."
   -- Ed Summers [2]
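
To illustrate the distinction Ed describes, here is a minimal sketch using only the core Encode module. The sample string is my own; the point is that length() on a decoded (character) string and length() on the encoded octets diverge as soon as a multi-byte character appears.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 octets, as they would sit in a file read with binmode.
my $octets = "Leadership\xC2\xAE Model";

# Decoding yields a Perl character string: the Registered Sign is now
# a single character (U+00AE) rather than two bytes.
my $chars = decode('UTF-8', $octets);

my $char_len = length($chars);                    # counts characters
my $byte_len = length(encode('UTF-8', $chars));   # counts bytes

printf "characters: %d, bytes: %d\n", $char_len, $byte_len;
```

For pure ASCII data the two numbers are equal, which is why the problem only shows up in records containing diacritics or other non-ASCII characters.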

I looked through the MARC::Record CHANGES file [3].  Although there are some 
enhancements/fixes regarding the use of UTF-8, I don't see anything that 
explicitly states that more current versions of MARC::Record now compute the 
record length in bytes.  It seems like that would be a good thing.
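
As far as I can tell, the output side of the problem (described by Leif in the quoted message below) can be reproduced without MARC::Record at all. This sketch uses in-memory filehandles and my own sample data, and it assumes every character in the string is below U+0100. Printing a character string to a raw handle then writes one Latin-1 byte per character, while encoding it first writes the UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# One character, U+00AE (Registered Sign); the utf8 flag is on.
my $chars = decode('UTF-8', "\xC2\xAE");

# Case 1: print the character string directly to a raw (binmode) handle.
open(my $raw, '>:raw', \my $downgraded) or die "cannot open: $!";
print {$raw} $chars;
close($raw);

# Case 2: explicitly encode to UTF-8 octets before printing.
open(my $out, '>:raw', \my $encoded) or die "cannot open: $!";
print {$out} encode('UTF-8', $chars);
close($out);

printf "without encode: %d byte(s), with encode: %d byte(s)\n",
    length($downgraded), length($encoded);
```

If that holds, any record length computed from the UTF-8 bytes no longer matches what case 1 writes to the file, which would be consistent with the MARC::Lint complaints quoted below.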

-- Michael

[1] MARC 21 Record Builder
    http://www.loc.gov/marc/marc2onix.html

[2] "MARC-Charset-0.5 questions" July 2003 thread on perl4lib

[3] CHANGES : Revision history for Perl extension MARC::Record.
    http://search.cpan.org/src/MIKERY/MARC-Record-2.0.0/Changes

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -----Original Message-----
> From: Doran, Michael D 
> Sent: Monday, March 03, 2008 10:36 AM
> To: 'Leif Andersson'; perl4lib@perl.org
> Subject: RE: Help for utf-8 output
> 
> Hi Leif,
> 
> I really appreciate you taking a look at this and responding. 
>  Although I consider myself somewhat knowledgeable about 
> character sets, I still find these kinds of problems to be confusing.
> 
> > In this case the leader and actual length will not agree, as your
> > utf8 characters have turned into latin1.
> 
> I was under the impression that the MARC record length in the 
> Leader was the record length in bytes rather than the number 
> of characters.  Is that your understanding?
> 
> Also, I am still troubleshooting my particular set of records 
> (I was out of town last week) since this problem only appears 
> to manifest itself for records with non-ASCII characters in 
> the 100 and 245 fields.  Records with a note field having 
> non-ASCII characters don't cause a problem. 
> 
> -- Michael
> 
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 mobile
> # [EMAIL PROTECTED]
> # http://rocky.uta.edu/doran/
>  
> 
> > -----Original Message-----
> > From: Leif Andersson [mailto:[EMAIL PROTECTED]
> > Sent: Saturday, March 01, 2008 2:51 PM
> > To: Doran, Michael D; perl4lib@perl.org; [EMAIL PROTECTED]
> > Subject: Re: Help for utf-8 output
> > 
> > It seems there is a little bug (by design) kicking in.
> > 
> > The leader gets wrong and some characters get wrong in this case:
> >    + Reading a raw marc record (utf8) from file
> >    + Turning it into a MARC::Record object
> >    + Without modification writing it out to file.
> >      Yes. Even without modification the bug manifests itself!
> > 
> > Let's start with code simply copying one record from a file utf8.mrc
> > containing one or more marc records. This basic operation, not
> > involving MARC::Record, is OK.
> > 
> > #!perl -w
> > use strict;
> > #
> > open(IN, "utf8.mrc")  || die "1";
> > open(OUT, ">out_good.mrc") || die "2";
> > binmode IN;
> > binmode OUT;
> > #
> > # Read in raw MARC
> > $/ = "\x1D";
> > my $marc = <IN>;
> > print OUT $marc;
> > __END__
> > 
> > Now, we're adding MARC::Record to the process, along with some debug
> > info.
> > Example code producing *faulty* record:
> > 
> > #!perl -w
> > use strict;
> > use MARC::Record;
> > use Devel::Peek;
> > #
> > open(IN, "utf8.mrc")  || die "1";
> > open(OUT, ">out_bad.mrc") || die "2";
> > binmode IN;
> > binmode OUT;
> > #
> > # Read in raw MARC
> > $/ = "\x1D";
> > my $marc = <IN>;
> > Dump($marc);  # the utf8-flag is not on
> > my $obj  = MARC::Record->new_from_usmarc( $marc );
> > # Convert back to raw MARC
> > my $marc2 = $obj->as_usmarc();
> > Dump($marc2); # the utf8-flag IS on
> > print OUT $marc2;
> > __END__
> > 
> > 
> > In this case the leader and actual length will not agree, as your
> > utf8 characters have turned into latin1.
> > The problem is that $marc2 has the utf8 flag set internally by Perl.
> > And the conversion on output is made in spite of binmode.
> > 
> > We can get around the problem by either (for instance)
> >   use bytes;
> > or
> >   Encode::_utf8_off($marc2);
> > before printing to file.
> > 
> > But shouldn't MARC::Record take care of this for us?
> > A file of MARC records may contain records in different encodings.
> > The text parts of a MARC record can be treated as made up by certain
> > encodings, but the "blob" itself, I suppose, should be exposed to the
> > caller as pure binary.
> > 
> > Are there any drawbacks in letting MARC::Record strip off any utf8
> > flag that may be set before returning the record from as_usmarc()?
> > If not, I suggest this change be made in a future release of
> > MARC::Record.
> > 
> > I should also add that this character mess only sets in when doing IO.
> > If you are updating your databases through one API or another you are
> > probably OK!
> > 
> > 
> > Leif
> > ======================================
> > Leif Andersson, Systems Librarian
> > Stockholm University Library
> > SE-106 91 Stockholm
> > SWEDEN
> > Phone : +46 8 162769
> > Mobile: +46 70 6904281
> > 
> > -----Original Message-----
> > From: Doran, Michael D [mailto:[EMAIL PROTECTED]
> > Sent: February 21, 2008 18:49
> > To: perl4lib@perl.org
> > Subject: RE: Help for utf-8 output
> > 
> > Hi Jackie,
> > 
> > I'm working on a very similar problem... converting
> > theses/dissertations records (in XML) to MARC records.  I'm still in
> > the testing stage, but have had similar problems with records with
> > diacritics in the 100 or 245 fields (however diacritics in a 520a
> > field don't seem to cause any problems).  Since our records are not
> > "diacritic rich" it's hard to determine the exact extent of the
> > problem.
> > 
> > I am using these versions:
> >   Perl v5.8.8
> >   MARC::Charset 0.98
> >   MARC::Lint 1.43
> >   MARC::Record 2.0
> >   XML::LibXML 1.66
> > 
> > Here's an example "bad" record (which I have minimized to just the
> > 245 field):
> > 
> > marcdump test.mrc
> > test.mrc
> > LDR 00127cam a2200037   4500
> > 245 13 _aAn Empirical Test Of The Situational Leadership® Model In 
> > Japan /
> >        _cRiho Yoshioka.
> > 
> >  Recs  Errs Filename
> > ----- ----- --------
> >     1     1 test.mrc
> > 
> > When I run test.mrc through MARC::Lint, I get this message:
> > 
> >  Invalid record length in record 1: Leader says 00127 bytes but it's actually 125
> >  Invalid length in directory for tag 245 in record 1
> >  field does not end in end of field character in tag 245 in record 1
> > 
> > When examined in vi the character in question, a Registered Sign, 
> > appears to be correctly UTF-8 encoded C2AE, and the bib Leader 
> > (position 09=a) indicates that it is Unicode encoded.
> > I've attached the MARC record.
> > 
> > I noticed that when I run your record (ck245.dat) through MARC::Lint,
> > I get the same invalid record length message:
> > 
> >  Invalid record length in record 3: Leader says 00567 bytes but it's actually 569
> >  field does not end in end of field character in tag 100 in record 3
> >  field does not end in end of field character in tag 245 in record 3
> >  Invalid indicators ".10" forced to blanks in record 3 for tag 245
> >  field does not end in end of field character in tag 260 in record 3
> >  Invalid indicators ".  " forced to blanks in record 3 for tag 260
> >  field does not end in end of field character in tag 300 in record 3
> >  Invalid indicators ".  " forced to blanks in record 3 for tag 300
> >  field does not end in end of field character in tag 502 in record 3
> >  Invalid indicators ".  " forced to blanks in record 3 for tag 502
> >  field does not end in end of field character in tag 504 in record 3
> >  Invalid indicators ".  " forced to blanks in record 3 for tag 504
> >  field does not end in end of field character in tag 690 in record 3
> >  Invalid indicators ". 4" forced to blanks in record 3 for tag 690
> > 
> > Anybody have any ideas?
> > 
> > -- Michael
> > 
> > # Michael Doran, Systems Librarian
> > # University of Texas at Arlington
> > # 817-272-5326 office
> > # 817-688-1926 mobile
> > # [EMAIL PROTECTED]
> > # http://rocky.uta.edu/doran/
> >  
> > 
> > > -----Original Message-----
> > > From: Shieh, Jackie [mailto:[EMAIL PROTECTED]
> > > Sent: Tuesday, February 19, 2008 10:50 AM
> > > To: perl4lib@perl.org
> > > Subject: Help for utf-8 output
> > > 
> > > I was wondering if anyone has had a similar experience and has come
> > > up with good solutions to help solve the challenge below?!
> > > 
> > > What I have is an Excel spreadsheet for dissertations which I have
> > > saved as a tab delimited file (examining the file in TextPad, the
> > > diacritics appear to be fine), then read in and output the file as a
> > > utf-8 MARC file. I <print> title field confirming author field that
> > > contains diacritics with the title showing proper indicator values.
> > > 
> > > But when I looked at the MARC itself, the fields that follow the
> > > field containing diacritics are all off their original position.
> > > See attached zip file.  Examples below (the first two have
> > > diacritics in a 100 field; in the last one the diacritic is in 245
> > > subfield b):
> > > 
> > > 001     diss 34001
> > > 100 1  _aP<E9>rez, Nancy L.
> > > 245     _aSynchronic and Diachronic Matlatzinkan Phonology.
> > > 
> > > 001     diss 34042
> > > 100 1  _aValent<ED>n-M<E1>rquez, Wilfredo
> > > 245     _aDoing being boricua :
> > > 
> > > 001     diss 33892
> > > 100 1   _aDavis, Jennifer M.
> > > 245 14 _aThe Functional Complexities of Inherited Cardiac Troponin I Mutations :
> > >             _bIdentification of Ca<B2>+ Independent Contractile Dysfunction.
> > > 
> > > I would greatly appreciate any suggestions to solve this. 
> > > Thank you most kindly. 
> > > 
> > > Regards,
> > >  
> > > --Jackie
> > >  
> > > |Jackie Shieh
> > > |Data Loads & Development
> > > |Harlan Hatcher Graduate Library
> > > |University of Michigan
> > > |920 North University
> > > |Ann Arbor, MI 48109-1205
> > > |Phone: 734.763.6070 FAX: 734.615.9788
> > > |E-mail: JShieh [AT] umich [DOT] edu
> > > 
> >
