Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-28 Thread Ashley Sanders
Eric,

> How can I figure out whether or not a MARC record contains ONLY characters
> from the UTF-8 character set?

You can use a regex to check whether a string is valid UTF-8. There are various
examples floating around the internet. One is here:

   http://www.w3.org/International/questions/qa-forms-utf-8

You'll need to add the MARC control characters ^_ (0x1F, subfield delimiter),
^^ (0x1E, field terminator), and ^] (0x1D, record terminator) to the ASCII part
of the expression on that page. (I think the W3C example is aimed at XML 1.0,
in which the MARC control characters are not allowed.)
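
A minimal sketch of that approach in Perl, assuming the record is held as raw bytes (read with binmode, not yet decoded). The byte ranges follow the W3C page's expression, with 0x1D-0x1F added to the ASCII class; the sub name looks_like_utf8 is made up for illustration and is not part of any MARC module.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $bytes is well-formed UTF-8
# (plus the three MARC delimiters), 0 otherwise. Expects raw bytes.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return $bytes =~ m/\A(?:
        [\x09\x0A\x0D\x20-\x7E\x1D-\x1F]    # ASCII plus MARC delimiters
      | [\xC2-\xDF][\x80-\xBF]              # non-overlong 2-byte
      |  \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte
      |  \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
      |  \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}           # planes 4-15
      |  \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*\z/x ? 1 : 0;
}

# Example: a field with a UTF-8 e-acute passes, a bare Latin-1 byte does not.
print looks_like_utf8("\x1Facaf\xC3\xA9\x1E\x1D") ? "ok\n" : "not UTF-8\n";
print looks_like_utf8("caf\xE9")                  ? "ok\n" : "not UTF-8\n";
```

Note that a record containing only ASCII passes this test whatever its declared encoding, so a match only says the bytes are well-formed UTF-8, not that UTF-8 was intended.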

Ashley.
--
Ashley Sanders a.sand...@manchester.ac.uk
http://copac.ac.uk -- A Mimas service funded by JISC at the University of 
Manchester



Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-27 Thread Eric Lease Morgan

A number of people have alluded to the problem of double encoding, and I'm 
beginning to think this is true. 

I have isolated a number of problem records. They all contain diacritics, but 
they do not have an 'a' in position #9 of the leader -- 
http://dh.crc.nd.edu/tmp/original.marc  Can someone verify that the file 
contains UTF-8 characters for me?

For these same records I have also added an 'a' in position #9 and created a 
similar file -- http://dh.crc.nd.edu/tmp/fixed.marc  

Is it true that original.marc is not denoted correctly, but fixed.marc is 
denoted correctly?

-- 
Eric Morgan



Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-27 Thread Galen Charlton
Hi,

On Wed, Mar 27, 2013 at 11:20 AM, Eric Lease Morgan emor...@nd.edu wrote:

> I have isolated a number of problem records. They all contain diacritics,
> but they do not have an 'a' in position #9 of the leader --
> http://dh.crc.nd.edu/tmp/original.marc  Can someone verify that the file
> contains UTF-8 characters for me?


I've eyeballed it and confirm that the encoding of that file is UTF-8.

> For these same records I have also added an 'a' in position #9 and created
> a similar file -- http://dh.crc.nd.edu/tmp/fixed.marc


I've looked this over as well.


> Is it true that original.marc is not denoted correctly, but fixed.marc is
> denoted correctly?


Yes.  The Leader/09 must be set to 'a' if the character encoding in use is
UTF-8.
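
One way to check that rule mechanically is sketched below, under stated assumptions: the record is a raw ISO 2709 byte string, Leader/09 is the byte at offset 9, and "is UTF-8" means "survives a strict decode with the core Encode module". The sub name check_leader09 and its return strings are invented for this example.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: compares Leader/09 of one raw MARC record with
# what a strict UTF-8 decode says about the record's bytes.
sub check_leader09 {
    my ($raw) = @_;
    my $flag = substr($raw, 9, 1);    # 'a' = Unicode, blank = MARC-8

    # LEAVE_SRC keeps decode() from consuming $raw in place.
    my $is_utf8 = eval {
        Encode::decode('UTF-8', $raw, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    } ? 1 : 0;
    my $has_high = $raw =~ /[\x80-\xFF]/ ? 1 : 0;    # any non-ASCII byte?

    return 'flagged-but-not-utf8' if $flag eq 'a' && !$is_utf8;
    return 'unflagged-utf8'       if $flag ne 'a' && $is_utf8 && $has_high;
    return 'consistent';
}
```

Records that are pure ASCII are reported as consistent either way, since their bytes cannot show which encoding was meant.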

Regards,

Galen
-- 
Galen Charlton
gmcha...@gmail.com


Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-27 Thread Eric Lease Morgan

On Mar 27, 2013, at 4:59 PM, Eric Lease Morgan emor...@nd.edu wrote:

> When it calls as_usmarc, I think MARC::Batch tries to honor the value set in 
> position #9 of the leader. In other words, if the leader is empty, then it 
> tries to output records as MARC-8, and when the leader is a value of 'a', it 
> tries to encode the data as UTF-8.

How can I figure out whether or not a MARC record contains ONLY characters from 
the UTF-8 character set?

Put another way, how can I determine whether or not position #9 of a given MARC 
leader is accurate? If position #9 is an 'a', then how can I read the balance 
of the record to determine whether or not all the characters really and truly 
are UTF-8 encoded?

--
Eric "This Is Almost Too Much For Me" Morgan



Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-27 Thread Shelley Doljack
I use MarcEdit to view records and check whether the mnemonic form of a 
diacritic (e.g. {eacute}) appears and what the LDR/09 value is. That's the 
best way I've come up with so far. MarcEdit is pretty good at guessing the 
character encoding without relying on the LDR/09 value. I think there are 
some Perl modules that guess the encoding of a string, but I've never used 
them. I'm interested in finding out other methods (preferably automated) for 
detecting wrong or mixed character encodings in a MARC record. 
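
For the double-encoding case in this thread's subject line, one automated heuristic is sketched below. It assumes the damage came from UTF-8 bytes being treated as Latin-1 and encoded to UTF-8 a second time; other kinds of mixed encoding are not covered. The sub name looks_double_encoded is hypothetical, and only the core Encode module is used.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if $bytes looks like UTF-8 that was
# encoded twice (UTF-8 bytes misread as Latin-1, then encoded again).
sub looks_double_encoded {
    my ($bytes) = @_;
    my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;

    # First pass: the bytes must at least be valid UTF-8.
    my $chars = eval { Encode::decode('UTF-8', $bytes, $check) };
    return 0 unless defined $chars;

    return 0 unless $chars =~ /[\x80-\xFF]/;    # nothing non-ASCII to test
    return 0 if     $chars =~ /[^\x00-\xFF]/;   # beyond Latin-1: not this pattern

    # Second pass: map the characters back to single bytes and see
    # whether those bytes are themselves valid UTF-8.
    my $inner = Encode::encode('ISO-8859-1', $chars);
    my $ok = eval { Encode::decode('UTF-8', $inner, $check); 1 };
    return $ok ? 1 : 0;
}
```

This works on one string at a time, so it is best applied per field or subfield; a record that mixes damaged and undamaged fields would otherwise slip through. It can also give a false positive on text that legitimately contains sequences such as a capital A-tilde followed by a copyright sign.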

Shelley

- Original Message -
> From: Eric Lease Morgan emor...@nd.edu
> To: perl4lib@perl.org
> Sent: Wednesday, March 27, 2013 2:11:26 PM
> Subject: Re: reading and writing of utf-8 with marc::batch [double encoding]
>
> On Mar 27, 2013, at 4:59 PM, Eric Lease Morgan emor...@nd.edu
> wrote:
>
> > When it calls as_usmarc, I think MARC::Batch tries to honor the
> > value set in position #9 of the leader. In other words, if the
> > leader is empty, then it tries to output records as MARC-8, and
> > when the leader is a value of 'a', it tries to encode the data as
> > UTF-8.
>
> How can I figure out whether or not a MARC record contains ONLY
> characters from the UTF-8 character set?
>
> Put another way, how can I determine whether or not position #9 of a
> given MARC leader is accurate? If position #9 is an 'a', then how
> can I read the balance of the record to determine whether or not all
> the characters really and truly are UTF-8 encoded?
>
> --
> Eric "This Is Almost Too Much For Me" Morgan
 
 


Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-27 Thread Galen Charlton
Hi,


On Wed, Mar 27, 2013 at 2:11 PM, Eric Lease Morgan emor...@nd.edu wrote:

> Put another way, how can I determine whether or not position #9 of a given
> MARC leader is accurate? If position #9 is an 'a', then how can I read the
> balance of the record to determine whether or not all the characters really
> and truly are UTF-8 encoded?


The following program will read a file of MARC records from standard input
and classify each as either being valid UTF-8 or not.

___START
#!/usr/bin/perl

use strict;
use warnings;
use Encode;

binmode STDIN, ':bytes';

$/ = "\035"; # MARC record terminator (0x1D)
my $i = 0;
while (<STDIN>) {
    $i++;
    my $bytes = $_;
    # FB_CROAK makes decode() die on the first malformed byte sequence
    eval {
        my $utf8str = Encode::decode('UTF-8', $bytes, Encode::FB_CROAK);
    };
    if ($@) {
        print "Record $i definitely not valid UTF-8\n";
    } else {
        print "Record $i is valid UTF-8\n";
    }
}
___END

Regards,

Galen
-- 
Galen Charlton
gmcha...@gmail.com