RE: UNICODE character identification

2015-02-10 Thread Kool,Wouter
What encoding is your data in? UTF-8? A single-byte encoding? MARC-8? That matters a lot in determining whether your idea would work. If it is in a single-byte encoding, there is often no way to determine which script a character belongs to. Wouter Kool Metadata Specialist · OCLC B.V.
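
To illustrate the point about encodings, here is a minimal sketch in Python (not from the thread, and not a Perl solution). It assumes the data really is UTF-8 and uses the first word of each character's Unicode name as a rough stand-in for its script, which is an approximation rather than the official Script property. The helper names `rough_script` and `script_counts` are made up for this example. The last lines show why a single-byte encoding is ambiguous: the same byte decodes to different scripts depending on the code page you assume.

```python
import unicodedata
from collections import Counter

def rough_script(ch):
    """Approximate a character's script by the first word of its Unicode name.

    This is a heuristic (e.g. 'LATIN', 'GREEK', 'CYRILLIC'), not the
    official Unicode Script property.
    """
    if not ch.isalpha():
        return None
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # character has no name in the database
        return None

def script_counts(data: bytes) -> Counter:
    """Decode UTF-8 bytes and count letters per rough script."""
    text = data.decode("utf-8")
    return Counter(s for s in map(rough_script, text) if s)

sample = "Αθήνα Athens Москва".encode("utf-8")
counts = script_counts(sample)

# The same single byte means different things in different code pages:
as_latin1 = b"\xe1".decode("latin-1")    # a Latin letter
as_greek = b"\xe1".decode("iso8859_7")   # a Greek letter
```

With UTF-8 input the counts tell you which scripts dominate a record; with single-byte input there is nothing in the byte itself to say which decoding is right.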

RE: UNICODE character identification

2015-02-10 Thread Kool,Wouter
that most characters match and you look for the exceptions. Would that help? From: George Milten [mailto:george.mil...@gmail.com] Sent: Tuesday, 10 February 2015 15:56 To: Kool,Wouter Cc: perl4lib@perl.org Subject: Re: UNICODE character identification utf-8, thank you 2015-02-10 16:54 GMT+02:00

RE: Options for translating languages within perl scripts

2015-02-27 Thread Kool,Wouter
Perhaps the WWW::Babelfish module helps. It seems to support connecting to the Google and Yahoo services as well. Haven't tried it, but it seems interesting: http://search.cpan.org/~durist/WWW-Babelfish-0.16/Babelfish.pm Wouter From: Eileen Pinto [mailto:epi...@library.berkeley.edu] Sent:

RE: identify encoding from a file

2016-02-08 Thread Kool,Wouter
At OCLC we have had some good results detecting frequent encodings and recurring encoding problems using Naïve Bayesian classification. You need training data for the classes you want to detect. Language also comes into play, because the distribution of characters depends on it. No
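
As a rough illustration of the general idea (not OCLC's actual implementation, which the message does not describe), the following Python sketch trains a multinomial Naïve Bayes classifier on byte frequencies, one class per encoding, with add-one smoothing. The class name `ByteNaiveBayes`, the labels, and the tiny training text are all assumptions made for this example; real training data would need to be much larger and in the languages you expect to see.

```python
import math
from collections import Counter

class ByteNaiveBayes:
    """Toy multinomial Naive Bayes over byte values (hypothetical helper)."""

    def __init__(self):
        self.counts = {}       # label -> Counter of byte values seen in training
        self.docs = Counter()  # label -> number of training samples

    def train(self, label, data: bytes):
        self.counts.setdefault(label, Counter()).update(data)
        self.docs[label] += 1

    def classify(self, data: bytes):
        total_docs = sum(self.docs.values())
        best, best_score = None, -math.inf
        for label, c in self.counts.items():
            n = sum(c.values())
            # log prior plus sum of smoothed log likelihoods per byte
            score = math.log(self.docs[label] / total_docs)
            for b in data:
                score += math.log((c[b] + 1) / (n + 256))
            if score > best_score:
                best, best_score = label, score
        return best

# Assumed toy training text; the same text encoded two ways gives two classes.
training_text = "naïve café résumé über señor crème"
clf = ByteNaiveBayes()
clf.train("utf-8", training_text.encode("utf-8"))
clf.train("latin-1", training_text.encode("latin-1"))

guess_latin1 = clf.classify("déjà vu élève".encode("latin-1"))
guess_utf8 = clf.classify("déjà vu élève".encode("utf-8"))
```

Because the byte distribution depends on the language of the text, a model trained on one language may do poorly on another, which matches the caveat in the message.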