> Put another way, how can I determine whether or not position #9 of a given
> MARC leader is accurate? If position #9 is an "a", then how can I read the
> balance of the record to determine whether or not all the characters really
> and truly are UTF-8 encoded?

The following Perl program reads a file of MARC records from standard input
and reports, for each record, whether its bytes are valid UTF-8.


use strict;
use warnings;
use Encode;

binmode STDIN, ':bytes';

$/ = "\035"; # MARC record terminator
my $i = 0;
while (<>) {
    $i++; # count records starting from 1
    my $bytes = $_;
    # With FB_CROAK, decode() dies on the first malformed byte sequence
    eval {
        my $utf8str = Encode::decode('UTF-8', $bytes, Encode::FB_CROAK);
    };
    if ($@) {
        print "Record $i definitely not valid UTF-8\n";
    } else {
        print "Record $i is valid UTF-8\n";
    }
}


Galen Charlton

