RE: identify encoding from a file

2016-02-08 Thread Kool,Wouter
At OCLC we have had good results detecting frequent encodings and recurring 
encoding problems using naive Bayes classification. You need training data 
for each class you want to detect, and language comes into play because the 
distribution of characters depends on it. No silver bullet yet...
That said, if this problem recurs for you, it may be worth trying 
Algorithm::NaiveBayes or another classifier.
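
The classification idea above can be sketched in a few lines. This is not the
Algorithm::NaiveBayes API and not OCLC's actual setup; it is a hand-rolled
Python illustration that assumes byte unigrams as features, add-one smoothing,
and equal priors for every class. The names train, classify, TRAINING_TEXT and
ENCODINGS are hypothetical, and a single training sentence is far too little
data for real use.

```python
import math
from collections import Counter

# Hypothetical training sentence; real use needs much more text per class.
TRAINING_TEXT = "οι μαθητές και οι καθηγητές διαβάζουν βιβλία στη βιβλιοθήκη"
ENCODINGS = ("iso8859_7", "cp737", "utf-8")  # assumed candidate classes


def train(samples):
    """Map each label to smoothed log-probabilities of the 256 byte values."""
    model = {}
    for label, data in samples.items():
        counts = Counter(data)       # iterating bytes yields ints 0..255
        total = len(data) + 256      # add-one smoothing over all byte values
        model[label] = [math.log((counts[b] + 1) / total) for b in range(256)]
    return model


def classify(model, data):
    """Return the label whose byte distribution best explains data.

    Priors are assumed equal, so only the likelihoods are compared.
    """
    scores = {label: sum(logp[b] for b in data) for label, logp in model.items()}
    return max(scores, key=scores.get)


# Train one class per candidate encoding of the same text, then classify.
model = train({enc: TRAINING_TEXT.encode(enc) for enc in ENCODINGS})
print(classify(model, "οι μαθητές".encode("cp737")))
```

Because the features are raw bytes, the model is only as good as the match
between the training language and the language of the file being examined.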

Wouter

-----Original Message-----
From: Thomas Krichel [mailto:kric...@openlib.org] 
Sent: Saturday 6 February 2016 18:52
To: Marios lyberak
Cc: perl4lib@perl.org
Subject: Re: identify encoding from a file

  Marios lyberak writes

> i have a file which is generated out of an old Paradox database,
>
> and i try to figure out what is the encoding of these strangely represented
> characters

  I know of no way to automate this, and I don't think anybody else
  does. You simply need to read the file with various encodings
  set at parsing, and manually inspect whether you get the right
  output.

  Your Paradox manual may be of help to reduce the number of candidate
  character sets.

-- 

  Cheers,

  Thomas Krichel  http://openlib.org/home/krichel
  skype:thomaskrichel


Re: identify encoding from a file

2016-02-06 Thread Thomas Krichel
  Marios lyberak writes

> i have a file which is generated out of an old Paradox database,
>
> and i try to figure out what is the encoding of these strangely represented
> characters

  I know of no way to automate this, and I don't think anybody else
  does. You simply need to read the file with various encodings
  set at parsing, and manually inspect whether you get the right
  output.

  Your Paradox manual may be of help to reduce the number of candidate
  character sets.
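
  The manual procedure described above could look roughly like this in
  Python. The candidate list is an assumption (common Greek code pages
  plus UTF-8), and preview is a hypothetical helper name; the point is
  only to decode the same bytes several ways and compare by eye.

```python
# Assumed candidates: adjust after consulting the Paradox documentation.
CANDIDATES = ["iso8859_7", "cp1253", "cp737", "cp869", "utf-8"]


def preview(raw, encodings=CANDIDATES):
    """Decode raw bytes with each candidate encoding.

    Returns a dict mapping encoding name to the decoded text, or to None
    when the bytes are not valid in that encoding.
    """
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


# Example with a known word; for a real file, read a chunk in binary mode.
sample = "Μαθητές".encode("iso8859_7")
for enc, text in preview(sample).items():
    print(f"{enc:10} {text!r}")
```

  A decode that succeeds is not necessarily right: single-byte code pages
  accept almost any input, so the inspection step remains manual.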

-- 

  Cheers,

  Thomas Krichel  http://openlib.org/home/krichel
  skype:thomaskrichel


Re: identify encoding from a file

2016-02-06 Thread Galen Charlton
Hi,

On Sat, Feb 6, 2016 at 7:39 AM, Marios lyberak wrote:
> in 
>
> ̡觴ݲ -> Μαθητές
>
> and in
>
> 
>
> ʡ解紝 -> Καθητητές

Based on the fact that the output of "iconv -f iso-8859-7
LibGroup.xml" shows some of the expected Greek characters, I suspect
that the original Paradox database was using the ISO-8859-7 or
Windows-1253 character encoding, although whatever export routine
generated the file obviously mishandled its attempt to convert it to
UTF-8.
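
One hypothesis consistent with both samples quoted above is that the
export read ISO-8859-7 bytes through a lenient UTF-8 decoder, which
treated Greek letter bytes as UTF-8 lead bytes and kept only the low six
bits of whatever followed. The Python sketch below tries to undo that. It
is a guess at the failure mode, not a description of what the Paradox
export actually did; undo_lenient_utf8 is a hypothetical name, and the
step that restores bit 0x40 assumes every swallowed byte was originally
in the range 0xC0-0xFF.

```python
def undo_lenient_utf8(garbled, source_encoding="iso8859_7"):
    """Attempt to recover text that was mis-decoded as UTF-8.

    Re-encodes the garbled string to UTF-8, then sets bit 0x40 on every
    continuation byte (0x80-0xBF) before decoding with source_encoding.
    Assumption: the original bytes in those positions were 0xC0-0xFF.
    """
    raw = bytearray(garbled.encode("utf-8"))
    for i, byte in enumerate(raw):
        if 0x80 <= byte <= 0xBF:
            raw[i] = byte | 0x40
    return raw.decode(source_encoding, errors="replace")


# The two garbled samples from the thread, written as code point escapes.
print(undo_lenient_utf8("\u0321\u89f4\u0772"))
print(undo_lenient_utf8("\u02a1\u89e3\u7d1d"))
```

Under this assumption the first sample comes back as Μαθητές. The second
comes back as Καθηγητέ, without its last letter, presumably because the
final byte had nothing after it to combine with; that would suggest the
intended word was Καθηγητές.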

Regards,

Galen
-- 
Galen Charlton
gmcha...@gmail.com