On Feb 23, 2007, at 16:08 UTC, Joe Huber wrote:

> You either have to know a priori what encoding was used by the app
> that created the text, or you can make various levels of educated
> guesses depending on how much analysis you want to do on the text.
Maybe this would be a good addition to the StringUtils module: a GuessEncoding method, which analyzes the given string and (we hope) does a decent job of guessing what encoding it is. Let's brainstorm what sort of clues it might use...

1. Look for a BOM at the beginning of the string (indicates Unicode, and tells us which kind and what byte order).

2. Check for no byte values over 127 (indicates ASCII).

3. Look for lots of nulls in every other byte (suggests UTF-16, and indicates byte order).

4. Look for common accented letters (é, á, ö, etc.) in various common encodings, and if we find them, count them only if their case matches the case of the letters around them (since an incorrect guess often produces the wrong case as well as the wrong character). Suggests whichever encoding produces the most reasonable matches.

5. Use a dictionary of common words containing non-ASCII characters in various languages, and again, check for these in various encodings. Suggests whichever encoding produces the most matches.

6. Check for invalid UTF-8 sequences (these are pretty easy to detect). If found, this indicates NOT UTF-8, and might suggest SystemDefault.

It's a tricky problem, but 1, 2, 3, and 6 would probably cover the most common cases (there's a rough sketch of those four below). Adding 4 or 5 would make it slower but smarter, if we do a good job of it, anyway.

Any other thoughts on this problem?

Best,
- Joe

--
Joe Strout -- [EMAIL PROTECTED]
Verified Express, LLC    "Making the Internet a Better Place"
http://www.verex.com/
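
To make that concrete, here's a rough sketch of how heuristics 1, 2, 3, and 6 might fit together. It's in Python just to keep it short (the real method would of course be REALbasic in StringUtils), and the name guess_encoding and the 70% null threshold are placeholders I made up, not anything that exists today:

    def guess_encoding(data: bytes) -> str:
        """Guess the encoding of raw text bytes using a few cheap heuristics."""
        # 1. Byte-order mark: unambiguous when present.
        if data.startswith(b"\xef\xbb\xbf"):
            return "UTF-8"
        if data.startswith(b"\xff\xfe"):
            return "UTF-16LE"
        if data.startswith(b"\xfe\xff"):
            return "UTF-16BE"

        # 2. No byte values over 127: plain ASCII.
        if all(b < 128 for b in data):
            return "ASCII"

        # 3. Lots of nulls in every other byte: likely UTF-16;
        #    which half is null tells us the byte order.
        if len(data) >= 4:
            even_nulls = data[0::2].count(0)
            odd_nulls = data[1::2].count(0)
            half = len(data) // 2
            if odd_nulls > half * 0.7:
                return "UTF-16LE"   # ASCII-range chars put the null in the high byte
            if even_nulls > half * 0.7:
                return "UTF-16BE"

        # 6. Invalid UTF-8 sequences are easy to detect: if decoding fails,
        #    it is NOT UTF-8, so fall back to the system default encoding.
        try:
            data.decode("utf-8")
            return "UTF-8"
        except UnicodeDecodeError:
            return "SystemDefault"

So, for example, "café" encoded as UTF-8 comes back "UTF-8", the same text as little-endian UTF-16 comes back "UTF-16LE" (the null-byte check), and as Latin-1 it comes back "SystemDefault" because the lone 0xE9 byte is an invalid UTF-8 sequence. The null check only fires when a large majority of the high (or low) bytes are null, so mostly-ASCII text in UTF-16 trips it while arbitrary 8-bit text usually won't.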
