What encoding is your data in? UTF-8? A single-byte encoding? MARC-8? That
information matters a lot in determining whether your idea would work. If it is
in a single-byte encoding, there is often no way to determine the script a
character belongs to.
Wouter Kool
Metadata Specialist · OCLC B.V.
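On the point above about single-byte encodings, here is a minimal sketch in Python. It is not something posted in this thread, and the helper names script_guess and script_counts are made up for illustration. It assumes the first word of a character's Unicode name is a good enough stand-in for its script, which is a heuristic and not the real Unicode Script property.

```python
import unicodedata
from collections import Counter


def script_guess(ch):
    """Rough script label: the first word of the character's Unicode name.

    Assumption: names like 'GREEK SMALL LETTER ALPHA' start with the script.
    This is a heuristic, not the Unicode Script property.
    """
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # code point has no name (e.g. control characters)
        return "UNKNOWN"


def script_counts(text):
    """Count letters per guessed script; digits, spaces, punctuation are skipped."""
    return Counter(script_guess(ch) for ch in text if ch.isalpha())


# One and the same byte belongs to a different script in each single-byte
# encoding, so the byte alone cannot tell you the script.
raw = b"\xe1"
by_encoding = {
    enc: script_guess(raw.decode(enc))
    for enc in ("latin-1", "iso8859_7", "iso8859_5")
}
print(by_encoding)

# Once the data is known to be UTF-8, every character decodes to one
# unambiguous code point and the scripts can simply be counted.
sample = "Αθήνα Athens".encode("utf-8")
print(script_counts(sample.decode("utf-8")))
```

The byte 0xE1 comes out as a Latin, a Greek and a Cyrillic letter respectively, depending only on which encoding you assume. With UTF-8 input that ambiguity goes away, so counting scripts per field or per record becomes possible.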
that most characters match and you look for the
exceptions. Would that help?
From: George Milten [mailto:george.mil...@gmail.com]
Sent: Tuesday, 10 February 2015 15:56
To: Kool,Wouter
Cc: perl4lib@perl.org
Subject: Re: UNICODE character identification
utf-8,
thank you
2015-02-10 16:54 GMT+02:00
Perhaps the WWW::Babelfish module helps. It seems to support connecting to the
Google and Yahoo services as well. I haven't tried it, but it seems interesting:
http://search.cpan.org/~durist/WWW-Babelfish-0.16/Babelfish.pm
Wouter
From: Eileen Pinto [mailto:epi...@library.berkeley.edu]
Sent:
At OCLC we have some good results detecting frequent encodings and recurring
encoding problems using Naïve Bayesian classification. You have to have
training data for the classes you want to detect. And language comes into play,
because the distribution of characters is dependent on it. No