Re: NSString to bit pattern

Alastair Houghton Fri, 08 May 2009 16:14:17 -0700

On 8 May 2009, at 23:13, Robert Claeson wrote:

On 8 May 2009, at 23:00, Andrew Farmer wrote:
On 08 May 09, at 08:47, Greg Guerin wrote:
A string is a sequence of characters. Retrieve each character, determine its bit-pattern, then append that pattern to an NSMutableString. Now you have to figure out how to turn a character into its bit-pattern. So break that down.
One extra complication: By Cocoa's standards, a string is not a sequence of bytes: it's a sequence of Unicode codepoints.* To treat a string as a "bag of bytes", you will first need to choose a text encoding to treat the text as, then convert it using the NSString dataUsingEncoding: method.
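
Once the text has been converted to bytes in a chosen encoding, turning those bytes into a bit pattern is mechanical. The following is a minimal sketch in plain C rather than Cocoa; the function name bytes_to_bits is hypothetical, and it assumes the caller has already obtained the encoded bytes (for instance from dataUsingEncoding: as described above) and supplies a large enough output buffer.

```c
#include <stddef.h>

/* Write the bits of each byte, most significant bit first, into out.
   out must have room for 8 * n characters plus the terminating NUL.
   (bytes_to_bits is an illustrative name, not an existing API.) */
void bytes_to_bits(const unsigned char *bytes, size_t n, char *out)
{
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            out[pos++] = ((bytes[i] >> bit) & 1) ? '1' : '0';
        }
    }
    out[pos] = '\0';
}
```

The result depends entirely on the encoding chosen beforehand: the same string yields different bit patterns as UTF-8, UTF-16 or UTF-32.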
The UTF encoding that will allow you to treat a string as bag of words (not bytes), where each Unicode codepoint takes exactly the same space, is UTF32.

It's a mistake to think that way. Even with UTF-32, there is not a one to one correspondence between what the user will think of as a single "character" (which IIRC in "Unicode speak" is now referred to as a grapheme cluster) and a Unicode code point. There are plenty of obvious examples, like combining accents, but there are much more complicated examples, for instance in some of the Indic scripts.
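
To make the combining-accent case concrete, here is a small sketch in C. The array and function names are illustrative only. It stores "é" as UTF-32 in two ways, precomposed and as a base letter plus a combining acute accent; a user sees one character either way, but the number of code points differs.

```c
#include <stddef.h>
#include <stdint.h>

/* Count UTF-32 code units (one per code point) up to a zero terminator. */
size_t count_code_points(const uint32_t *s)
{
    size_t n = 0;
    while (s[n] != 0) {
        n++;
    }
    return n;
}

/* "é" written two ways; both display as a single user-visible character. */
static const uint32_t precomposed[] = { 0x00E9, 0 };         /* U+00E9 */
static const uint32_t decomposed[]  = { 0x0065, 0x0301, 0 }; /* 'e' + combining acute */
```

Counting code points gives 1 for the first form and 2 for the second, so even fixed-width UTF-32 does not give one unit per perceived character.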

UTF-32 is largely pointless, actually. UTF-16 (which is what Cocoa uses, and it's also what is used by ICU, the Unicode reference implementation) is never any larger, and is usually only half the size. Moreover, UTF-16 is only twice as large as "necessary" for U.S. ASCII (not four times like UTF-32), and for non-Latin languages in particular, UTF-16 is often smaller than UTF-8.
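
The size comparison can be checked per code point. The sketch below, in C with hypothetical helper names, returns the number of bytes each encoding form needs for one code point. It assumes the input is a valid Unicode scalar value (0 to 0x10FFFF, excluding surrogates) and does no validation.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one code point in UTF-8.
   Assumes cp is a valid Unicode scalar value; no validation is done. */
size_t utf8_size(uint32_t cp)
{
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* UTF-16: one 16-bit unit inside the BMP, a surrogate pair outside it. */
size_t utf16_size(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4;
}

/* UTF-32: always one 32-bit unit. */
size_t utf32_size(uint32_t cp)
{
    (void)cp;
    return 4;
}
```

For U+0041 (Latin "A") the sizes are 1, 2 and 4 bytes; for U+3042 (Hiragana "あ") they are 3, 2 and 4; for U+1F600, outside the BMP, all three forms need 4 bytes.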

UTF32 is also what C++ expects for its std::wstring type under Unix.

No. C++ doesn't "expect" any particular encoding for std::wstring, in the same way that C doesn't "expect" any particular encoding for a wchar_t.

It is also worth pointing out that both the C wide character APIs and their C++ brethren are ill suited to Unicode. They were designed under the assumption that one wide character really was one end-user "character", and that each unit can be treated separately with no consideration of context. This is true for some of the older wide string encodings that were used historically in Asia, which is what wchar_t et al. were actually designed for. It is *not* the case for any encoding of Unicode.

Now it is true that on OS X and Linux, wchar_t is usually used to hold UTF-32, and that is often the case on other platforms also. It is also true that on Windows it normally holds UCS-2 or UTF-16, depending on the APIs you're calling. People would of course immediately say "but UTF-16 breaks the spec, so Windows is wrong", and that's true, but so does UTF-32 because of combining characters and the like.

Anyway, unless you have special knowledge on a particular platform, the *only* things you can do with the C or C++ wide strings and characters are:


- You can use them with the wide string/character APIs.

- You can convert them to or from the system's multibyte string format using e.g. wcstombs() or mbstowcs(). Incidentally, you can't portably make assumptions about that format either.

- You can pass them to other functions that accept C/C++ wide strings or wide characters.

You cannot portably make *any* assumption about the meaning of a wide character in C or C++. For all you know, your code could be running on a Japanese system using some kind of JIS encoding rather than Unicode.
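
As an illustration of the portable operations listed above, here is a minimal C sketch that converts a multibyte string to wide characters with mbstowcs() and back with wcstombs(), without inspecting the wide characters themselves. The function name round_trip and the fixed buffer sizes are assumptions made for the example.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert a multibyte string to wide characters and back using the
   current locale. Returns 0 if the round trip reproduces the input,
   -1 on a conversion error or if a fixed-size buffer is too small.
   (round_trip and the buffer sizes are illustrative choices.) */
int round_trip(const char *mb)
{
    wchar_t wide[64];
    char back[256];

    size_t nwide = mbstowcs(wide, mb, 64);
    if (nwide == (size_t)-1 || nwide >= 64) {
        return -1; /* invalid sequence, or result not NUL-terminated */
    }

    size_t nback = wcstombs(back, wide, sizeof back);
    if (nback == (size_t)-1 || nback >= sizeof back) {
        return -1; /* unconvertible character, or result not NUL-terminated */
    }

    return strcmp(mb, back) == 0 ? 0 : -1;
}
```

Both conversions depend on the current locale, so a program would normally call setlocale(LC_ALL, "") first to pick up the user's environment. The code never looks at the numeric values stored in wide, because nothing portable can be said about them.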

I have a framework for Unicode conversion and transformation that internally uses all UTF32 for ease of processing. It is unfortunately not open source at the present time.

There is a very comprehensive Open Source Unicode library called ICU. It actually ships as part of Mac OS X, though the headers aren't installed by default. If you can't do what you need with Cocoa or Core Foundation, or you need portability, I strongly recommend that you use ICU.

However, before jumping straight for ICU, check out both Cocoa *and* Core Foundation. Chances are that what you need is already in there.


Kind regards,

Alastair.

--
http://alastairs-place.net


