I use MarcEdit to view the records, checking whether the mnemonic form of a diacritic (e.g. {eacute}) appears and what the LDR/09 value is. That's the best way I've come up with so far. MarcEdit is pretty good at guessing the character encoding without relying on the LDR/09 value. I think there are some Perl modules that can "guess" the encoding of text, but I've never used them. I'm interested in finding out about other methods (preferably automated) for detecting wrong or mixed character encodings in a MARC record.
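
One possible automated check is sketched below, in Python rather than Perl so that it needs no MARC library. It is not how MarcEdit or MARC::Batch work. It reads LDR/09 from the raw record bytes and tests whether the whole record decodes as strict UTF-8. The names `EncodingReport` and `check_record` are made up for this example. The double-encoding test is only a heuristic and assumes the extra round trip went through Latin-1. A clean decode also proves little for an ASCII-only record, since ASCII bytes are valid in both MARC-8 and UTF-8.

```python
from dataclasses import dataclass

LEADER_LENGTH = 24  # a MARC 21 leader is always 24 bytes


@dataclass
class EncodingReport:
    declared: str               # "UTF-8" if LDR/09 is "a", otherwise "MARC-8"
    valid_utf8: bool            # the raw bytes decode as strict UTF-8
    has_non_ascii: bool         # at least one byte above 0x7F
    maybe_double_encoded: bool  # heuristic only, see check_record()

    @property
    def consistent(self) -> bool:
        """True when the bytes do not contradict what LDR/09 declares."""
        if self.declared == "UTF-8":
            return self.valid_utf8 and not self.maybe_double_encoded
        # Declared MARC-8: high bytes that happen to form valid UTF-8
        # are suspicious (assumption: real MARC-8 rarely does this).
        return not (self.has_non_ascii and self.valid_utf8)


def check_record(raw: bytes) -> EncodingReport:
    """Compare LDR/09 with what the bytes of one raw MARC record look like."""
    if len(raw) < LEADER_LENGTH:
        raise ValueError("record is shorter than a MARC leader")

    declared = "UTF-8" if raw[9:10] == b"a" else "MARC-8"
    has_non_ascii = any(byte > 0x7F for byte in raw)

    try:
        text = raw.decode("utf-8", errors="strict")
        valid_utf8 = True
    except UnicodeDecodeError:
        text = ""
        valid_utf8 = False

    # Heuristic: if the decoded text can be squeezed back into Latin-1
    # bytes and those bytes are again valid UTF-8, the data was probably
    # encoded to UTF-8 twice.
    maybe_double_encoded = False
    if valid_utf8 and has_non_ascii:
        try:
            text.encode("latin-1").decode("utf-8")
            maybe_double_encoded = True
        except UnicodeError:
            pass

    return EncodingReport(declared, valid_utf8, has_non_ascii, maybe_double_encoded)
```

To run this over a whole file, split the file's bytes on the MARC record terminator (0x1D) and call `check_record` on each piece. Any record whose report has `consistent` set to false is worth opening in MarcEdit for a closer look.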

Shelley

----- Original Message -----
> From: "Eric Lease Morgan" <emor...@nd.edu>
> To: perl4lib@perl.org
> Sent: Wednesday, March 27, 2013 2:11:26 PM
> Subject: Re: reading and writing of utf-8 with marc::batch [double encoding]
>
> On Mar 27, 2013, at 4:59 PM, Eric Lease Morgan <emor...@nd.edu> wrote:
>
> > When it calls as_usmarc, I think MARC::Batch tries to honor the
> > value set in position #9 of the leader. In other words, if the
> > leader is empty, then it tries to output records as MARC-8, and
> > when the leader is a value of "a", it tries to encode the data as
> > UTF-8.
>
> How can I figure out whether or not a MARC record contains ONLY
> characters from the UTF-8 character set?
>
> Put another way, how can I determine whether or not position #9 of a
> given MARC leader is accurate? If position #9 is an "a", then how
> can I read the balance of the record to determine whether or not all
> the characters really and truly are UTF-8 encoded?
>
> --
> Eric "This Is Almost Too Much For Me" Morgan