Hi,

On Wed, Mar 27, 2013 at 2:11 PM, Eric Lease Morgan <emor...@nd.edu> wrote:

> Put another way, how can I determine whether or not position #9 of a given
> MARC leader is accurate? If position #9 is an "a", then how can I read the
> balance of the record to determine whether or not all the characters really
> and truly are UTF-8 encoded?
>

The following program reads a file of MARC records from standard input
and reports whether each record is valid UTF-8.

___START____
#!/usr/bin/perl

use Encode;

binmode STDIN, ':bytes';

$/ = "\035"; # MARC record terminator
my $i = 0;
while (<>) {
    $i++;
    my $bytes = $_;
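    # With FB_CROAK, decode() dies on the first malformed byte sequence,
    # so a non-empty $@ means the record is NOT valid UTF-8.
    eval {
        my $utf8str = Encode::decode('UTF-8', $bytes, Encode::FB_CROAK);
    };
    if ($@) {
        print "Record $i definitely not valid UTF-8\n";
    } else {
        print "Record $i is valid UTF-8\n";
    }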
}
___END____

Regards,

Galen
-- 
Galen Charlton
gmcha...@gmail.com
