What encoding is your data in? UTF-8? A single-byte encoding? MARC-8? That information matters a lot in determining whether your idea would work. If it is in a single-byte encoding, there is often no way to determine which script a character belongs to.
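To illustrate why the encoding matters, here is a minimal Python sketch. The byte value and the two code pages (ISO 8859-1 and ISO 8859-7) are chosen only for illustration and are not taken from the records in question. The same byte decodes to a Latin letter under one code page and to a Greek letter under the other, so the script cannot be read off the byte alone.

```python
import unicodedata

# One and the same byte, as it might sit in a single-byte-encoded record.
raw = b"\xef"

# Decoded under two different (assumed) code pages.
as_latin1 = raw.decode("iso8859_1")   # Western European
as_greek = raw.decode("iso8859_7")    # Greek

# The Unicode character names show that the script depends on the code page.
latin1_name = unicodedata.name(as_latin1)
greek_name = unicodedata.name(as_greek)
print(latin1_name)  # LATIN SMALL LETTER I WITH DIAERESIS
print(greek_name)   # GREEK SMALL LETTER OMICRON
```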
Wouter Kool
Metadata Specialist · OCLC B.V.
Schipholweg 99 · P.O. Box 876 · 2300 AW Leiden · The Netherlands
t +31-(0)71-524 6500
wouter.k...@oclc.org · www.oclc.org

From: George Milten [mailto:george.mil...@gmail.com]
Sent: Tuesday, 10 February 2015 13:27
To: perl4lib@perl.org
Subject: UNICODE character identification

Hello friendly folks,

Here is what I am trying to do, and I am looking for your help to find the cleverest way to achieve it.

We have records that include typos like this: take a word, say Plato, where the last o was typed with the keyboard set to Greek. So we need something that parses all the metadata character by character, determines which script the majority of the characters in each word belong to, and returns the odd characters, the script they belong to, and the identifier of the record they were found in, so that we can correct them.

Thank you in advance
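The per-word check described in the message above could be sketched roughly as follows, assuming the data has already been decoded to Unicode strings. This is only an illustrative sketch in Python, not an existing tool: the function names (script_of, odd_characters, scan) and the sample records are hypothetical, and the script is approximated by the first word of the Unicode character name, which is a heuristic rather than the real Unicode Script property.

```python
import unicodedata
from collections import Counter

def script_of(ch):
    """Approximate the script by the first word of the character name.

    Heuristic only (an assumption of this sketch): e.g. "LATIN", "GREEK".
    Returns None for non-letters such as digits and punctuation.
    """
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None

def odd_characters(word):
    """Return (position, character, script) for letters outside the majority script."""
    tagged = [(ch, script_of(ch)) for ch in word]
    counts = Counter(script for _, script in tagged if script)
    if len(counts) < 2:
        return []  # zero or one script in this word: nothing odd
    # On a tie, the script seen first in the word counts as the majority.
    majority = counts.most_common(1)[0][0]
    return [(i, ch, s) for i, (ch, s) in enumerate(tagged) if s and s != majority]

def scan(records):
    """Yield (record id, word, position, character, script) for each odd character."""
    for rec_id, text in records.items():
        for word in text.split():
            for pos, ch, script in odd_characters(word):
                yield rec_id, word, pos, ch, script

# Hypothetical sample data: "Plato" ending in a Greek omicron (U+03BF).
records = {"rec1": "Plat\u03bf wrote", "rec2": "\u03a0\u03bb\u03ac\u03c4\u03c9\u03bd"}
hits = list(scan(records))
print(hits)  # [('rec1', 'Platο', 4, 'ο', 'GREEK')]
```

Words are split on whitespace and positions are zero-based within the word. A word with an even split between two scripts is ambiguous and would probably need manual review.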