Re: Text Encoding is driving me crazy

joe Fri, 23 Feb 2007 08:27:19 -0800

On Feb 23, 2007, at 16:08 UTC, Joe Huber wrote:

> You either have to know a priori what encoding was used by the app 
> that created the text, or you can make various levels of educated 
> guesses depending on how much analysis you want to do on the text.


Maybe this would be a good addition to the StringUtils module: a
GuessEncoding method, which analyzes the string given and (we hope)
does a decent job of guessing what encoding it is.

Let's brainstorm what sort of clues it might use...

1. Look for a BOM at the beginning of the string (indicates Unicode and
tells us which kind and what byte order).
2. Check for no byte values over 127 (indicates ASCII).
3. Look for nulls in every other byte (suggests UTF-16, and whether
they fall on even or odd offsets indicates byte order).
4. Look for common accented letters (é, á, ö, etc.) in various common
encodings, and if we find them, only count them if their case matches
the case of the letters around them (since incorrect guessing often
results in the wrong case as well as the wrong character).  Suggests
whatever encoding produces the most reasonable matches.
5. Use a dictionary of common words containing non-ASCII characters in
various languages, and again, check for these in various encodings. 
Suggests whatever encoding produces the most matches.
6. Check for invalid UTF-8 sequences (these are pretty easy to detect).
If found, indicates NOT UTF-8, and might suggest SystemDefault.
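
To make the cheap clues concrete, here is a rough sketch of how 1, 2, 3,
and 6 might fit together. It is in Python purely for illustration, not
REALbasic, and the name guess_encoding, the 40%/10% null thresholds, and
the "cp1252" fallback (standing in for SystemDefault) are all
assumptions of mine. Note that the null check has to run before the
ASCII check, since UTF-16 text made of plain ASCII letters also has no
byte over 127.

```python
import codecs

# Clue 1: known BOMs. UTF-32 LE comes first because its BOM begins
# with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Guess an encoding name for raw bytes using clues 1, 3, 2, 6."""
    # Clue 1: a BOM tells us the Unicode flavor and byte order.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name

    # Clue 3: nulls in every other byte suggest UTF-16.
    # The thresholds here are arbitrary guesses, not tuned values.
    pairs = len(data) // 2
    if pairs:
        even_nulls = data[0::2].count(0)
        odd_nulls = data[1::2].count(0)
        if odd_nulls >= 0.4 * pairs and even_nulls < 0.1 * pairs:
            return "utf-16-le"  # high (zero) byte comes second
        if even_nulls >= 0.4 * pairs and odd_nulls < 0.1 * pairs:
            return "utf-16-be"  # high (zero) byte comes first

    # Clue 2: no byte over 127 means plain ASCII.
    if all(b < 128 for b in data):
        return "ascii"

    # Clue 6: an invalid UTF-8 sequence rules UTF-8 out.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return fallback  # hypothetical stand-in for SystemDefault
    return "utf-8"
```

Bytes that pass the UTF-8 check are reported as UTF-8 here, which is
only a likely guess, since some other encodings can produce valid UTF-8
by accident.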

It's a tricky problem, but 1, 2, 3, and 6 would probably cover the most
common cases.  Adding 4 or 5 would make it slower but smarter, if we do
a good job of it anyway.
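
For clue 4, one possible shape (again a Python sketch, with made-up
names score_decoding and guess_by_accents and an arbitrary candidate
list) is to decode the bytes under each candidate encoding and count
only the non-ASCII letters whose case agrees with the letters next to
them:

```python
# Candidate encodings to try; this short list is only an assumption.
CANDIDATES = ["utf-8", "cp1252", "mac_roman"]


def score_decoding(text: str) -> int:
    """Count non-ASCII letters whose case matches adjacent letters."""
    score = 0
    for i, ch in enumerate(text):
        if ord(ch) < 128 or not ch.isalpha():
            continue
        # Letters immediately before and after this one, if any.
        around = text[max(i - 1, 0):i] + text[i + 1:i + 2]
        neighbors = [c for c in around if c.isalpha()]
        if neighbors and all(c.islower() == ch.islower() for c in neighbors):
            score += 1
    return score


def guess_by_accents(data: bytes, candidates=CANDIDATES):
    """Return the candidate with the most plausible accented letters,
    or None if no candidate produces any."""
    best_name, best_score = None, 0
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        score = score_decoding(text)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

The case test does real work: a MacRoman "é" read as Windows-1252 comes
out as an uppercase "Ž" in the middle of a lowercase word, so it scores
nothing under the wrong guess.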

Any other thoughts on this problem?

Best,
- Joe



--
Joe Strout -- [EMAIL PROTECTED]
Verified Express, LLC     "Making the Internet a Better Place"
http://www.verex.com/

