Another battle-tested piece of code would be Mozilla's sniffer, if external 
libraries and its license suit you. 

This document is out of date, but explains the ideas. 

http://www.mozilla.org/projects/intl/detectorsrc.html



On Apr 26, 2011, at 3:39 PM, John Pannell <[email protected]> wrote:

> Hi Laurent-
> 
> I have an app that collects a lot of text off the web; my string creation 
> algorithm is something like the following:
> 
> 1.  Attempt to create an NSString with NSUTF8StringEncoding.
> 2.  If the string is nil, attempt to create the string using the encoding 
> returned from the server.
> 3.  If string is still nil, ask the Text Encoding Conversion Manager to sniff 
> out the encoding from the data.
>    3a.  This returns an array of likely encodings.  For each item in the 
> array:
>    3b.  Attempt to create a string with the encoding.
> 
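
The fallback chain described above can be sketched in a few lines. This is 
only an illustration in Python, not the Cocoa code John mentions: the function 
name decode_with_fallbacks and its parameters are hypothetical, and the list of 
sniffed encodings is assumed to come from whatever detector you use (the Text 
Encoding Conversion Manager, Mozilla's detector, etc.). Where NSString would 
return nil, Python's strict decoder raises an exception instead.

```python
from typing import Iterable, Optional, Tuple


def decode_with_fallbacks(
    data: bytes,
    server_encoding: Optional[str] = None,
    sniffed_encodings: Iterable[str] = (),
) -> Tuple[Optional[str], Optional[str]]:
    """Return (text, encoding_used), or (None, None) if every attempt fails.

    Hypothetical helper: the names and the return shape are assumptions made
    for this sketch, not part of any Apple or Mozilla API.
    """
    # Step 1: UTF-8 first. Step 2: the encoding the server reported.
    # Step 3: each candidate returned by an external encoding sniffer.
    candidates = ["utf-8"]
    if server_encoding:
        candidates.append(server_encoding)
    candidates.extend(sniffed_encodings)

    for encoding in candidates:
        try:
            # Strict decoding raises UnicodeDecodeError on invalid bytes,
            # which plays the role of "the string is nil".
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # LookupError covers encoding names Python does not recognise.
            continue
    return None, None
```

The order matters: a single-byte encoding such as ISO-8859-1 accepts any byte 
sequence, so once such an encoding is tried nothing after it will ever be 
reached. That is one reason to try UTF-8 before anything else.
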
> There was a little too much code associated with this to copy/paste into 
> email, but I'd be happy to share... I have a wrapper object for the needed 
> interaction with the Text Encoding Conversion Manager.  Some more about it:
> 
> http://developer.apple.com/library/mac/#documentation/Carbon/Reference/Text_Encodin_sion_Manager/Reference/reference.html%23//apple_ref/doc/uid/TP30000123
> 
> Hope this helps!
> 
> John
> 
> 
> John Pannell
> http://www.positivespinmedia.com
> 
> On Apr 26, 2011, at 12:53 PM, Nick Zitzmann wrote:
> 
>> 
>> On Apr 26, 2011, at 12:49 PM, Laurent Daudelin wrote:
>> 
>>>> TextEdit's encoding guesser just uses the built-in NSAttributedString 
>>>> method -initWithURL:options:documentAttributes:error:, which will guess 
>>>> the file's encoding when opening it. But it has been mentioned that 
>>>> heuristics are not infallible, and this method's heuristics are no 
>>>> exception. It does a good job overall, but I've found that it usually 
>>>> misinterprets UTF-8 format text.
>>> 
>>> Yes, I know that all the guess jobs can fail. I was starting to get excited 
>>> when I started reading your reply, but if it usually misinterprets UTF-8, 
>>> that's a pretty significant problem...
>> 
>> That was a long time ago, so it may have been fixed. But if it's still 
>> happening, then one workaround would be to try and open the file as UTF-8 
>> first, and if that fails, then fall back on the above method. The UTF-8 
>> parser often returns nil on text that is not in UTF-8 format IIRC.
>> 
> 
> _______________________________________________
> 
> Cocoa-dev mailing list ([email protected])
> 
> Please do not post admin requests or moderator comments to the list.
> Contact the moderators at cocoa-dev-admins(at)lists.apple.com
> 
> Help/Unsubscribe/Update your Subscription:
> http://lists.apple.com/mailman/options/cocoa-dev/lordpixel%40mac.com
> 
> This email sent to [email protected]
