On 8 May 2009, at 23:13, Robert Claeson wrote:

> On 8 May 2009, at 23:00, Andrew Farmer wrote:
>
>> On 08 May 09, at 08:47, Greg Guerin wrote:
>>> A string is a sequence of characters. Retrieve each character, determine its bit-pattern, then append that pattern to an NSMutableString. Now you have to figure out how to turn a character into its bit-pattern. So break that down.
>>
>> One extra complication: By Cocoa's standards, a string is not a sequence of bytes: it's a sequence of Unicode codepoints.* To treat a string as a "bag of bytes", you will first need to choose a text encoding to treat the text as, then convert it using the NSString dataUsingEncoding: method.
>
> The UTF encoding that will allow you to treat a string as bag of words (not bytes), where each Unicode codepoint takes exactly the same space, is UTF32.

It's a mistake to think that way. Even with UTF-32, there is no one-to-one correspondence between a Unicode code point and what the user will think of as a single "character" (which IIRC in "Unicode speak" is now referred to as a grapheme cluster). There are plenty of obvious examples, like combining accents, but there are also much more complicated ones, for instance in some of the Indic scripts.
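To make the combining-accent case concrete, here is a minimal sketch in plain C. The helper name utf32_length and the two sample buffers are made up for illustration; the only facts relied on are that U+00E9 is the precomposed "é" and that U+0065 followed by U+0301 (COMBINING ACUTE ACCENT) is the decomposed spelling of the same user-visible character.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the code points in a zero-terminated UTF-32 buffer.
 * In UTF-32 one code unit is one code point, so this is just a length. */
static size_t utf32_length(const uint32_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* The same user-visible "é", spelled two ways. */
static const uint32_t precomposed[] = { 0x00E9, 0 };          /* é            */
static const uint32_t decomposed[]  = { 0x0065, 0x0301, 0 };  /* e + U+0301   */
```

utf32_length() reports one code point for the first buffer and two for the second, even though a user sees a single character in both cases. Counting code points, in any encoding, does not count what the user calls characters.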

UTF-32 is largely pointless, actually. UTF-16 (which is what Cocoa uses, and also what ICU, effectively the Unicode reference implementation, uses) is never any larger, and is usually only half the size. Moreover, UTF-16 is only twice as large as "necessary" for US-ASCII (not four times, like UTF-32), and for many non-Latin scripts UTF-16 is smaller than UTF-8: most CJK characters, for example, take three bytes in UTF-8 but two in UTF-16.
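The size comparison can be checked per code point. This is only a sketch: the function names are invented for the example, and the thresholds are the standard encoding boundaries (UTF-8 uses 1, 2, 3 or 4 bytes; UTF-16 uses one 16-bit unit inside the BMP and a surrogate pair above it; UTF-32 is always 4 bytes). It assumes the input is a valid code point.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one code point in UTF-8. */
static size_t utf8_bytes(uint32_t cp)
{
    if (cp < 0x80)    return 1;   /* ASCII                      */
    if (cp < 0x800)   return 2;   /* e.g. Latin-1, Greek        */
    if (cp < 0x10000) return 3;   /* rest of the BMP, e.g. CJK  */
    return 4;                     /* supplementary planes       */
}

/* Bytes needed in UTF-16: one unit in the BMP, a surrogate pair above it. */
static size_t utf16_bytes(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4;
}

/* Bytes needed in UTF-32: always one 32-bit unit. */
static size_t utf32_bytes(uint32_t cp)
{
    (void)cp;
    return 4;
}
```

For 'A' (U+0041) the three functions give 1, 2 and 4; for HIRAGANA LETTER A (U+3042) they give 3, 2 and 4; for U+1F600 all three give 4. So UTF-16 never exceeds UTF-32, and it beats UTF-8 for BMP characters at or above U+0800.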

> UTF32 is also what C++ expects for its std::wstring type under Unix.

No. C++ doesn't "expect" any particular encoding for std::wstring, in the same way that C doesn't "expect" any particular encoding for a wchar_t.

It is also worth pointing out that both the C wide character APIs and their C++ brethren are ill-suited to Unicode. They were designed under the assumption that one wide character really was one end-user "character", and that each unit can be treated separately with no consideration of context. This is true for some of the older wide string encodings that were used historically in Asia, which is what wchar_t et al. were actually designed for. It is *not* the case for any encoding of Unicode.

Now it is true that on OS X and Linux, wchar_t is usually used to hold UTF-32, and that is often the case on other platforms too. It is also true that on Windows it normally holds UCS-2 or UTF-16, depending on the APIs you're calling. People will of course immediately say "but UTF-16 breaks the spec, so Windows is wrong", and that's true, but UTF-32 breaks it as well, because of combining characters and the like.
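If you do want to find out what a given platform gives you, about the most that standard C offers is sketched below. The two function names are invented for the example. WCHAR_MAX comes from <wchar.h>, and __STDC_ISO_10646__ is the macro that C99 says an implementation defines when wchar_t values are ISO 10646 code points. Whether your platform defines it is something to check rather than assume.

```c
#include <wchar.h>

/* 1 if a single wchar_t is wide enough to hold any code point up to U+10FFFF.
 * This says nothing about which encoding is actually stored in it. */
static int wchar_fits_any_code_point(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}

/* 1 if the implementation documents wchar_t values as ISO 10646 code points. */
static int wchar_is_documented_unicode(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

Note that even when both return 1, all you know is that one wchar_t holds one code point. The grapheme cluster problem is untouched.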

Anyway, unless you have special knowledge on a particular platform, the *only* things you can do with the C or C++ wide strings and characters are:

- You can use them with the wide string/character APIs.

- You can convert them to or from the system's multibyte string format using e.g. wcstombs() or mbstowcs(). Incidentally, you can't portably make assumptions about that format either.

- You can pass them to other functions that accept C/C++ wide strings or wide characters.

You cannot portably make *any* assumption about the meaning of a wide character in C or C++. For all you know, your code could be running on a Japanese system using some kind of JIS encoding rather than Unicode.
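As an illustration of staying inside those limits, here is a rough sketch that round-trips a wide string through the system's multibyte format using only wcstombs() and mbstowcs(). The function name and the fixed buffer sizes are arbitrary choices for the example. It never looks at the numeric value of a wchar_t, so it makes no assumption about the encoding on either side.

```c
#include <stdlib.h>
#include <wchar.h>

/* Convert wide -> multibyte -> wide in the current locale and report
 * whether the result matches the input. Returns 1 on a clean round trip,
 * 0 on a conversion error or if the text did not fit the buffers. */
static int wide_round_trip_ok(const wchar_t *in)
{
    char    mb[256];
    wchar_t back[64];

    /* wcstombs() does not write a terminator when it fills the buffer,
     * so reserve one byte and terminate by hand. */
    size_t n = wcstombs(mb, in, sizeof mb - 1);
    if (n == (size_t)-1)
        return 0;               /* not representable in this locale */
    mb[n] = '\0';

    size_t m = mbstowcs(back, mb, 63);
    if (m == (size_t)-1)
        return 0;               /* invalid multibyte sequence */
    back[m] = L'\0';

    /* Comparing wide strings with the wide API is allowed;
     * interpreting their contents is not. */
    return wcscmp(in, back) == 0;
}
```

In the default "C" locale a plain ASCII string such as L"hello" should survive the trip. Anything beyond that depends on the locale that is in effect, which is exactly the point made above.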

> I have a framework for Unicode conversion and transformation that internally uses all UTF32 for ease of processing. It is unfortunately not open source at the present time.

There is a very comprehensive open source Unicode library called ICU. It actually ships as part of Mac OS X, though the headers aren't installed by default. If you can't do what you need with Cocoa or Core Foundation, or you need portability, I strongly recommend that you use ICU.

However, before jumping straight for ICU, check out both Cocoa *and* Core Foundation. Chances are that what you need is already in there.

Kind regards,

Alastair.

--
http://alastairs-place.net



_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)
