Hi,

You can use this regular expression to find lines in a piece of text that are not valid UTF-8 (it only checks the byte encoding, so it can’t tell you whether the Unicode characters themselves are the right ones):
    perl -l -ne '/
      ^(
         ([\x00-\x1D])               # 1-byte pattern
        |([\x1F-\x7F])               # 1-byte pattern
        |([\xC2-\xDF][\x80-\xBF])    # 2-byte pattern
        |((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))     # 3-byte pattern
        |((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))           # 4-byte pattern
      )*$
    /x or print' | od -c

Cheers
Patrick

From: Anne Highsmith <hism...@library.tamu.edu>
Date: Monday 30 June 2014 16:51
To: "perl4lib@perl.org" <perl4lib@perl.org>
Subject: Finding non-unicode characters

Can someone suggest a way to identify if a MARC record, coded at LDR/09 = ‘a’ has non-unicode characters in it? I tried the following, kind of grasping at straws, against a record that I know has non-unicode characters. It didn’t report any errors.

    # $bib_id is defined as 001 field
    my $bib_marc = [subroutine defined elsewhere to get a marc record string];
    eval { $bib_rec = MARC::Record->new_from_usmarc($bib_marc); } ;
    if ($@) {
        print ERRORS "$bib_id\t$@\n";
        next;
    }

We have a group of records in our database that are mostly Unicode but have some erroneous characters. I’d like to have a script to run against them to see if they’ve been completely cleaned up after the catalogers work on them.

Anne L. Highsmith
Director of Consortia Systems
Texas A&M University
5000 TAMU
College Station, TX 77843-5000
Phone: 979 862 4234
Fax: 979 845 6238
Email: hism...@tamu.edu
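For running the same kind of check inside a script rather than from the command line, here is a minimal sketch. It assumes the record is available as a raw byte string, and it swaps the regular expression for the strict "UTF-8" decoder in the core Encode module, so it is an alternative to the one-liner and not a transcription of it. The function name is_valid_utf8 is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns 1 if $bytes is well-formed UTF-8, 0 otherwise.
# 'UTF-8' (with the hyphen) selects Encode's strict decoder, which dies under
# FB_CROAK on malformed input such as overlong forms or encoded surrogates.
# LEAVE_SRC keeps decode() from modifying the caller's string.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}
```

In the quoted script this might be called on $bib_marc before the record is parsed, for example: print ERRORS "$bib_id\tinvalid UTF-8\n" unless is_valid_utf8($bib_marc); Like the regular expression, it only reports whether the bytes are well-formed, not whether they are the characters the cataloger intended.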