RE: identify encoding from a file

2016-02-08 Thread Kool,Wouter
At OCLC we have had good results detecting frequent encodings and recurring 
encoding problems using Naïve Bayesian classification. You need training data 
for the classes you want to detect, and language comes into play, because the 
distribution of characters depends on it. No silver bullet yet...
That said, you might check how often this problem recurs in your data, for 
instance using Algorithm::NaiveBayes or another classifier algorithm.
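
As a rough sketch of that idea (assuming the CPAN module Algorithm::NaiveBayes 
is installed; the training strings and class labels below are invented for 
illustration, not OCLC's actual training data):

```perl
use strict;
use warnings;
use Algorithm::NaiveBayes;    # CPAN module, not core

# Use per-character frequency counts as the classifier's attributes.
sub char_counts {
    my ($text) = @_;
    my %counts;
    $counts{$_}++ for split //, $text;
    return \%counts;
}

my $nb = Algorithm::NaiveBayes->new;

# Hypothetical training instances: double-encoded UTF-8 vs. clean Latin-1.
$nb->add_instance(attributes => char_counts("caf\x{c3}\x{a9} na\x{c3}\x{af}ve"),
                  label      => 'utf8-read-as-latin1');
$nb->add_instance(attributes => char_counts("caf\x{e9} na\x{ef}ve"),
                  label      => 'latin1');
$nb->train;

# predict() returns a hashref of label => score; take the top-scoring label.
my $scores = $nb->predict(attributes => char_counts("d\x{c3}\x{a9}j\x{c3}\x{a0} vu"));
my ($best) = sort { $scores->{$b} <=> $scores->{$a} } keys %$scores;
print "Most likely class: $best\n";
```

In practice you would train on many real records per class, and per language, 
since (as noted above) the character distribution depends on the language.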

Wouter

-----Original Message-----
From: Thomas Krichel [mailto:kric...@openlib.org] 
Sent: Saturday, 6 February 2016 18:52
To: Marios lyberak
Cc: perl4lib@perl.org
Subject: Re: identify encoding from a file

  Marios lyberak writes

> I have a file which is generated out of an old Paradox database,
>
> and I am trying to figure out the encoding of these strangely represented
> characters

  I know of no way to automate this, and I don't think anybody else
  does. You simply need to read the file with various encodings set
  at parsing time and manually inspect whether you get the right
  output.

  Your Paradox manual may help to reduce the number of candidate
  character sets.
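
  A minimal sketch of that manual approach using the core Encode
  module — the candidate list here is a guess for DOS-era Paradox
  data; adjust it to whatever the manual suggests:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode the same bytes with each candidate encoding so the results
# can be eyeballed side by side. These candidates are guesses for
# DOS-era Paradox data.
my @candidates = ('cp437', 'cp850', 'cp1252', 'iso-8859-1');

sub show_candidates {
    my ($bytes) = @_;
    my %decoded;
    for my $enc (@candidates) {
        $decoded{$enc} = decode($enc, $bytes);   # lenient decode
    }
    return \%decoded;
}

# Example bytes: 0x82 is "é" in cp437/cp850 but a low quote in cp1252.
binmode STDOUT, ':encoding(UTF-8)';
my $views = show_candidates("caf\x82");
printf "%-12s %s\n", $_, $views->{$_} for @candidates;
```

  For a real file, slurp it in `:raw` mode and print a short slice of
  each decoding; the right encoding is usually obvious to the eye.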

-- 

  Cheers,

  Thomas Krichel  http://openlib.org/home/krichel
  skype:thomaskrichel


RE: Options for translating languages within perl scripts

2015-02-27 Thread Kool,Wouter
Perhaps the WWW::Babelfish module helps. It seems to support connecting to the 
Google and Yahoo services as well. I haven't tried it, but it looks interesting:
http://search.cpan.org/~durist/WWW-Babelfish-0.16/Babelfish.pm

Wouter 

From: Eileen Pinto [mailto:epi...@library.berkeley.edu] 
Sent: Thursday, 26 February 2015 23:29
To: perl4lib@perl.org
Subject: Options for translating languages within perl scripts

Hi,

I've been tasked with massaging a large batch of French-language MARC records 
from a vendor. Aside from the usual MARC field manipulation/cleanup we do with 
Perl, I've been asked to run the 520 field through a translation routine/API, 
etc., to convert (possibly crudely) from French to English. I thought that 
Babelfish or 
http://api.yandex.com/translate/doc/dg/reference/translate.xml might be 
options, but Babelfish appears to be dead, and when I clicked to get the 
required key for the Yandex API, the link led to a dead end.

Is anyone incorporating POST queries or other methods to translate fields in 
MARC records?  I'd appreciate any leads or pointers.

Thanks in advance,

Eileen Pinto
Library Systems Office
University of California, Berkeley
Berkeley, CA  94720-6000



RE: UNICODE character identification

2015-02-10 Thread Kool,Wouter
What encoding is your data in? UTF-8? A single-byte encoding? MARC-8? That 
information matters a lot for determining whether your idea would work. If it 
is in a single-byte encoding, there is often no way to determine the script a 
character belongs to.


Wouter Kool
Metadata Specialist · OCLC B.V.
Schipholweg 99 · P.O. Box 876 · 2300 AW Leiden · The Netherlands
t +31-(0)71-524 6500
wouter.k...@oclc.org · www.oclc.org





From: George Milten [mailto:george.mil...@gmail.com]
Sent: Tuesday, 10 February 2015 13:27
To: perl4lib@perl.org
Subject: UNICODE character identification

Hello friendly folks,

Here is what I am trying to do, and I am looking for your help in finding the 
cleverest way to achieve it:

We have records that include typos like this: in a word, say Plato, the last 
o was typed with the keyboard set to the Greek language. So we need something 
that parses all metadata character by character, checks which script the 
majority of characters in the word belong to, and returns the odd characters, 
the script they belong to, and the identifier of the record they were found 
in, so that we can correct them.

Thank you in advance


RE: UNICODE character identification

2015-02-10 Thread Kool,Wouter
Apologies, I missed the subject line...

Then you might use regex character classes. For instance, $text =~ 
m/\p{Hiragana}/ matches any Japanese Hiragana character. I have not tested 
it, but I suppose /[^\p{Latin}]/ would match any non-Latin character. So you 
find the character class that most characters match and look for the 
exceptions. Would that help?
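
A rough sketch of that idea using the core Unicode::UCD module (the word and 
the final-Greek-omicron example are invented for illustration; real records 
would supply the words and identifiers):

```perl
use strict;
use warnings;
use utf8;
use Unicode::UCD qw(charscript);

# For each word, find the script the majority of its characters belong
# to, then report characters from any other script as suspects.
sub odd_chars {
    my ($word) = @_;
    my %by_script;
    for my $ch (split //, $word) {
        next unless $ch =~ /\w/;    # skip punctuation etc.
        push @{ $by_script{ charscript(ord $ch) } }, $ch;
    }
    my ($majority) = sort { @{ $by_script{$b} } <=> @{ $by_script{$a} } }
                     keys %by_script;
    return map { @{ $by_script{$_} } } grep { $_ ne $majority } keys %by_script;
}

# "Platο" — the last letter is Greek omicron U+03BF, not Latin "o".
binmode STDOUT, ':encoding(UTF-8)';
my $word = "Plat\x{3bf}";
for my $odd (odd_chars($word)) {
    printf "odd character U+%04X (%s) in '%s'\n",
           ord $odd, charscript(ord $odd), $word;
}
```

In a full run you would loop over the records, apply this per word, and emit 
the record identifier alongside each odd character, as the original message 
asks.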




From: George Milten [mailto:george.mil...@gmail.com]
Sent: Tuesday, 10 February 2015 15:56
To: Kool,Wouter
Cc: perl4lib@perl.org
Subject: Re: UNICODE character identification

utf-8,

thank you
