In Perl, something like this might do the trick:

# Fix non-UTF-8 characters with two highest bits set (we assume they are actually ISO-8859-1)
# Rule: there can't be a single byte with the high bits set followed by a byte in range 00-7F or C0-FF

$str =~ s/([\xC0-\xFF])(?=[\x00-\x7f\xC0-\xFF])/chr(0xC0 + (ord($1) >> 6)) . chr(0x80 + (ord($1) & 0x3F))/seg;

No wrapping there to keep it single-line. :)
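
To see what the substitution does, here is a minimal sketch that wraps it in a helper and runs it on two sample byte strings. The helper name fix_latin1_bytes and the sample strings are assumptions made for illustration only; the regex itself is the one above.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption): applies the substitution
# above to a copy of the byte string and returns the result.
sub fix_latin1_bytes {
    my ($str) = @_;
    $str =~ s/([\xC0-\xFF])(?=[\x00-\x7F\xC0-\xFF])/chr(0xC0 + (ord($1) >> 6)) . chr(0x80 + (ord($1) & 0x3F))/seg;
    return $str;
}

# Render a byte string as space-separated hex, for inspection.
sub hex_bytes {
    my ($str) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $str;
}

# A lone ISO-8859-1 e-acute (E9) followed by an ASCII letter: gets rewritten
# as the two-byte UTF-8 sequence C3 A9.
my $broken = "Mus\xE9e";
print hex_bytes($broken), " => ", hex_bytes(fix_latin1_bytes($broken)), "\n";

# Already valid UTF-8 (C3 followed by continuation byte A9): left untouched.
my $valid = "Mus\xC3\xA9e";
print hex_bytes($valid), " => ", hex_bytes(fix_latin1_bytes($valid)), "\n";
```

Note that the lookahead needs a following byte, so a high byte at the very end of the string is not converted by this regex.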

--Ere

On 7.10.2010 14:56, Cowles, Esme wrote:
Eric-

I don't know the original source of those MARC files, but I've worked
with files from an III system where diacritics had to be entered as
character code escapes like "Muse{226}e du Louvre" (where 226 is the
ANSEL code for a combining acute accent).  So if somebody made a typo
and entered something like "Muse{22}6e du Louvre" instead, you'd get
some bogus invalid character.  I was working with MARCXML files in
Java, so I wrote a FilterReader class that removed any characters
that were invalid in UTF-8 XML.  I assume you could do something
similar in Perl (probably with a fancy one-line regex).

-Esme
--
Esme Cowles <[email protected]>

"We've all heard that a million monkeys banging on a million
typewriters will eventually reproduce the works of Shakespeare. Now,
thanks to the Internet, we know this is not true." -- Robert
Wilensky

On Oct 7, 2010, at 6:51 AM, Eric Lease Morgan wrote:

How do I trap for unwanted (bogus) characters in MARC records?

I have a set of Internet Archive identifiers, and have written the
following Perl loop to get the MARC records associated with each
one:

# process each identifier
my $ua = LWP::UserAgent->new( agent => AGENT );
while ( <DATA> ) {

  # get the identifier
  chop;
  my $identifier = $_;
  print $identifier, "\n";

  # get its corresponding MARC record
  my $response = $ua->get( ROOT . "$identifier/$identifier" . "_meta.mrc" );
  if ( ! $response->is_success ) {

    warn $response->status_line;
    next;

  }

  # save it
  open MARC, "> $identifier.mrc" or die "Can't open $identifier.mrc: $!\n";
  binmode MARC, ":utf8";
  print MARC $response->content;
  close MARC;

}

I then use the venerable marcdump to see the fruits of my labors:
marcdump *.mrc. Unfortunately, marcdump returns the following error
against (at least) one of my files:

bienfaitsducatho00pina.mrc
utf8 "\xC3" does not map to Unicode at /System/Library/Perl/5.10.0/darwin-thread-multi-2level/Encode.pm line 162.

What is going on here? Am I saving my files incorrectly? Is the
original MARC data inherently incorrect? Is there some way I can
fix the MARC record in question?

-- Eric Lease Morgan



--
Ere Maijala
Kansalliskirjasto
