Re: Finding non-unicode characters

Patrick Hochstenbach Mon, 30 Jun 2014 08:02:46 -0700

Hi

You can use this regular expression to check whether a piece of text contains 
byte sequences that are not valid UTF-8. It only tests that the encoding is 
well-formed; it can’t tell you whether the Unicode characters themselves are 
the correct ones.


perl -l -ne '/
 ^( ([\x00-\x1D])              # 1-byte pattern
   |([\x1F-\x7F])              # 1-byte pattern
   |([\xC2-\xDF][\x80-\xBF])   # 2-byte pattern
   |((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))   # 3-byte pattern
   |((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))   # 4-byte pattern
  )*$ /x or print'  | od -c
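
If it helps to know where in a string the problem sits, the same byte ranges 
can be wrapped in a small sub. This is only a rough sketch, not a tested 
tool: the name first_invalid_utf8 and the sample strings are made up for 
illustration. One deliberate difference from the one-liner is that it accepts 
byte 0x1E (the MARC field terminator) as an ordinary 1-byte character, on the 
assumption that you want to run it over whole record strings.

```perl
use strict;
use warnings;

# One well-formed UTF-8 sequence, written as byte ranges.
my $utf8_char = qr/
      [\x00-\x7F]                          # 1 byte (0x1E included)
    | [\xC2-\xDF][\x80-\xBF]               # 2 bytes
    | \xE0[\xA0-\xBF][\x80-\xBF]           # 3 bytes, no overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3 bytes
    | \xED[\x80-\x9F][\x80-\xBF]           # 3 bytes, no surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4 bytes, no overlongs
    | [\xF1-\xF3][\x80-\xBF]{3}            # 4 bytes
    | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4 bytes, up to U+10FFFF
/x;

# Hypothetical helper: returns the byte offset of the first ill-formed
# sequence in $bytes, or -1 if the whole string is well-formed UTF-8.
# $bytes must be a raw byte string (not a decoded character string).
sub first_invalid_utf8 {
    my ($bytes) = @_;
    pos($bytes) = 0;
    # Walk forward in chunks of valid sequences; with the c flag a failed
    # match leaves pos() where the last successful match ended.
    1 while $bytes =~ /\G(?:$utf8_char){1,1000}/gc;
    my $pos = pos($bytes);
    return $pos == length($bytes) ? -1 : $pos;
}

# Made-up samples: a valid UTF-8 "e acute", then a bare Latin-1 0xE9.
for my $sample ("caf\xC3\xA9", "caf\xE9 au lait") {
    my $at  = first_invalid_utf8($sample);
    my $msg = $at < 0 ? "well-formed" : "ill-formed sequence at byte offset $at";
    print "$msg\n";
}
```

To use it on real data you would read the record as raw bytes (for example 
from a filehandle opened with the :raw layer) and pass the string in before 
any decoding happens.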

Cheers
Patrick


From: Anne Highsmith <hism...@library.tamu.edu>
Date: Monday 30 June 2014 16:51
To: "perl4lib@perl.org" <perl4lib@perl.org>
Subject: Finding non-unicode characters

Can someone suggest a way to identify whether a MARC record coded LDR/09 = ‘a’ 
has non-Unicode characters in it? I tried the following, kind of grasping at 
straws, against a record that I know has non-Unicode characters. It didn’t 
report any errors.

        # $bib_id is defined as 001 field
        my $bib_marc = [subroutine defined elsewhere to get a marc record string];
        eval {
                $bib_rec = MARC::Record->new_from_usmarc($bib_marc);
        };

        if ($@) {
                print ERRORS "$bib_id\t$@\n";
                next;
        }

We have a group of records in our database that are mostly Unicode but have 
some erroneous characters. I’d like to have a script to run against them to see 
if they’ve been completely cleaned up after the catalogers work on them.


Anne L. Highsmith

Director of Consortia Systems

Texas A&M University

5000 TAMU

College Station, TX   77843-5000

Phone: 979 862 4234

Fax: 979 845 6238

Email: hism...@tamu.edu
