Re: reading and writing of utf-8 with marc::batch [double encoding]
Eric,

> How can I figure out whether or not a MARC record contains ONLY characters from the UTF-8 character set?

You can use a regex to check whether a string is valid UTF-8. There are various examples floating around the internet; one is here: http://www.w3.org/International/questions/qa-forms-utf-8

You'll need to add the MARC control characters ^_ (0x1F), ^^ (0x1E), and ^] (0x1D) to the ASCII part of the expression on that page. (I think the W3C example is aimed at XML 1.0, in which the MARC control characters are not allowed.)

Ashley.
--
Ashley Sanders a.sand...@manchester.ac.uk
http://copac.ac.uk -- A Mimas service funded by JISC at the University of Manchester
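As a rough sketch of the regex approach Ashley describes: the pattern below follows the W3C page's byte ranges, with 0x1D-0x1F added to the ASCII alternative. The function name is_utf8_marc is made up for illustration, and the input is assumed to be a raw byte string (not a decoded Perl character string). Note that the W3C ranges still reject other control characters, such as the escape byte used in MARC-8.

```perl
use strict;
use warnings;

# One well-formed UTF-8 character, per the W3C byte ranges,
# with the MARC delimiters 0x1D-0x1F allowed as single bytes.
my $UTF8_CHAR = qr/
      [\x09\x0A\x0D\x1D-\x1F\x20-\x7E]   # ASCII plus MARC delimiters
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
    |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
/x;

# Hypothetical helper: true if every byte of $bytes belongs to a
# well-formed UTF-8 sequence (or is an allowed control character).
sub is_utf8_marc {
    my ($bytes) = @_;
    pos($bytes) = 0;
    # Consume one character at a time; \G anchors at the current position.
    1 while $bytes =~ /\G$UTF8_CHAR/gc;
    return pos($bytes) == length($bytes) ? 1 : 0;
}

print is_utf8_marc("caf\xC3\xA9\x1F") ? "valid\n" : "invalid\n";
print is_utf8_marc("caf\xE9")         ? "valid\n" : "invalid\n";
```

Walking the string one character at a time with \G avoids putting a single large quantifier around the whole alternation, which some Perl versions limit on long inputs.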
Re: reading and writing of utf-8 with marc::batch [double encoding]
A number of people have alluded to the problem of double encoding, and I'm beginning to think this is true. I have isolated a number of problem records. They all contain diacritics, but they do not have an a in position #9 of the leader -- http://dh.crc.nd.edu/tmp/original.marc Can someone verify that the file contains UTF-8 characters for me? For these same records I have also added an a in position #9 and created a similar file -- http://dh.crc.nd.edu/tmp/fixed.marc Is it true that original.marc is not denoted correctly, but fixed.marc is denoted correctly? -- Eric Morgan
Re: reading and writing of utf-8 with marc::batch [double encoding]
Hi,

On Wed, Mar 27, 2013 at 11:20 AM, Eric Lease Morgan emor...@nd.edu wrote:

> I have isolated a number of problem records. They all contain diacritics, but they do not have an a in position #9 of the leader -- http://dh.crc.nd.edu/tmp/original.marc Can someone verify that the file contains UTF-8 characters for me?

I've eyeballed it and confirm that the encoding of that file is UTF-8.

> For these same records I have also added an a in position #9 and created a similar file -- http://dh.crc.nd.edu/tmp/fixed.marc

I've looked this over as well.

> Is it true that original.marc is not denoted correctly, but fixed.marc is denoted correctly?

Yes. The Leader/09 must be set to 'a' if the character encoding in use is UTF-8.

Regards,
Galen
--
Galen Charlton gmcha...@gmail.com
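To illustrate the Leader/09 rule Galen states, here is a minimal sketch that works on the 24-character leader as a plain string. The helper names leader_declares_utf8 and mark_leader_utf8 and the sample leader are made up for illustration; with MARC::Record the leader string would come from, and go back through, its leader() accessor.

```perl
use strict;
use warnings;

# Leader/09 is the tenth character of the leader (offset 9).
# 'a' means UCS/Unicode (UTF-8); a blank means MARC-8.
sub leader_declares_utf8 {
    my ($leader) = @_;
    return substr($leader, 9, 1) eq 'a' ? 1 : 0;
}

# Return a copy of the leader with position 9 set to 'a'.
sub mark_leader_utf8 {
    my ($leader) = @_;
    substr($leader, 9, 1) = 'a';
    return $leader;
}

my $leader = "00714cam  2200205 a 4500";   # sample leader, blank at offset 9
my $fixed  = mark_leader_utf8($leader);
print "before: ", leader_declares_utf8($leader), "\n";
print "after:  ", leader_declares_utf8($fixed), "\n";
```

Setting the flag only changes what the record claims about itself; it does not convert any field data, so it is appropriate only when the bytes are already UTF-8.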
Re: reading and writing of utf-8 with marc::batch [double encoding]
On Mar 27, 2013, at 4:59 PM, Eric Lease Morgan emor...@nd.edu wrote: When it calls as_usmarc, I think MARC::Batch tries to honor the value set in position #9 of the leader. In other words, if the leader is empty, then it tries to output records as MARC-8, and when the leader is a value of a, it tries to encode the data as UTF-8. How can I figure out whether or not a MARC record contains ONLY characters from the UTF-8 character set? Put another way, how can I determine whether or not position #9 of a given MARC leader is accurate? If position #9 is an a, then how can I read the balance of the record to determine whether or not all the characters really and truly are UTF-8 encoded? -- Eric This Is Almost Too Much For Me Morgan
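One way to approach the question of whether Leader/09 is accurate is to compare what the leader declares with whether the raw bytes actually decode as UTF-8. The sketch below assumes one raw MARC record per call, as bytes; the function name check_leader_encoding and its return labels are invented for illustration. A clean decode is strong evidence rather than proof, and this check cannot spot double-encoded data, which is still valid UTF-8.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: classify one raw MARC record (a byte string)
# by comparing Leader/09 with the bytes actually present.
sub check_leader_encoding {
    my ($raw) = @_;
    my $declared_utf8 = substr($raw, 9, 1) eq 'a';

    # Pure ASCII is valid in both MARC-8 and UTF-8, so the flag cannot be wrong.
    return 'ascii-only' unless $raw =~ /[\x80-\xFF]/;

    # decode() with a CHECK value may modify its argument, so work on a copy.
    my $copy    = $raw;
    my $decodes = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 };

    return 'consistent-utf8'             if  $declared_utf8 &&  $decodes;
    return 'leader-a-but-not-utf8'       if  $declared_utf8 && !$decodes;
    return 'utf8-bytes-but-leader-not-a' if !$declared_utf8 &&  $decodes;
    return 'not-utf8';                   # plausibly MARC-8 or another legacy encoding
}

my $raw = "00714cam  2200205 a 4500" . "caf\xC3\xA9\x1E\x1D";
print check_leader_encoding($raw), "\n";
```

Records like those in original.marc, as described in this thread, would be expected to fall into the 'utf8-bytes-but-leader-not-a' group.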
Re: reading and writing of utf-8 with marc::batch [double encoding]
I use MarcEdit to view records and check whether the mnemonic form of a diacritic (e.g. {eacute}) appears, and what the LDR/09 value is. That's the best way I've come up with so far. MarcEdit is pretty good at guessing the character encoding without relying on the LDR/09 value. I think there are some Perl modules that can guess the encoding of a string, but I've never used them. I'm interested in finding out other methods (preferably automated) for detecting wrong or mixed character encodings in a MARC record.

Shelley
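On the automated side, one heuristic for the double encoding mentioned in the subject line is to decode the bytes as UTF-8, map the resulting characters back to single bytes, and see whether those bytes are themselves valid UTF-8. This is only a sketch under assumptions: it assumes the second encoding pass treated the original bytes as ISO-8859-1 (not Windows-1252 or MARC-8), the name looks_double_encoded is made up, and text that legitimately contains sequences such as "Ã©" would be a false positive.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical heuristic: true if $bytes looks like UTF-8 that was
# mistaken for ISO-8859-1 and encoded to UTF-8 a second time.
sub looks_double_encoded {
    my ($bytes) = @_;

    # First pass: the data must at least be valid UTF-8.
    my $copy  = $bytes;    # decode() with a CHECK value may modify its argument
    my $chars = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };
    return 0 unless defined $chars;

    # Candidates contain U+0080..U+00FF and nothing above U+00FF.
    return 0 unless $chars =~ /[\x{80}-\x{FF}]/;
    return 0 if     $chars =~ /[^\x{00}-\x{FF}]/;

    # Second pass: undo the assumed Latin-1 step and try decoding again.
    my $inner = Encode::encode('ISO-8859-1', $chars);
    my $ok    = eval { Encode::decode('UTF-8', $inner, Encode::FB_CROAK); 1 };
    return $ok ? 1 : 0;
}

print looks_double_encoded("caf\xC3\x83\xC2\xA9") ? "double\n" : "single\n";
print looks_double_encoded("caf\xC3\xA9")         ? "double\n" : "single\n";
```

A record flagged this way is worth inspecting by eye before any bulk repair is attempted.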
Re: reading and writing of utf-8 with marc::batch [double encoding]
Hi,

On Wed, Mar 27, 2013 at 2:11 PM, Eric Lease Morgan emor...@nd.edu wrote:

> Put another way, how can I determine whether or not position #9 of a given MARC leader is accurate? If position #9 is an a, then how can I read the balance of the record to determine whether or not all the characters really and truly are UTF-8 encoded?

The following program will read a file of MARC records from standard input and classify each as either being valid UTF-8 or not.

___START
#!/usr/bin/perl
use strict;
use warnings;
use Encode;

binmode STDIN, ':bytes';
$/ = "\035"; # MARC record terminator (0x1D)

my $i = 0;
while (<STDIN>) {
    $i++;
    my $bytes = $_;
    # FB_CROAK makes decode() die on the first malformed sequence
    eval { my $utf8str = Encode::decode('UTF-8', $bytes, Encode::FB_CROAK); };
    if ($@) {
        print "Record $i definitely not valid UTF-8\n";
    } else {
        print "Record $i is valid UTF-8\n";
    }
}
___END

Regards,
Galen
--
Galen Charlton gmcha...@gmail.com