Agh, this pointless rabbit-hole was what I was trying to avoid with my suggestion to take it off-list. :(
On Tue, Nov 26, 2013 at 10:31 AM, Jonathan Schleifer <[email protected]> wrote:
> On 26.11.2013 at 19:23, Jean-Daniel Dupas <[email protected]> wrote:
>
>> That's a rather strange way to express it. UTF-16 is not more a workaround
>> than UTF-32 or UTF-8. They are all first class encodings.
>> Cocoa supports all Unicode planes and encode them using UTF-16 (or even
>> ASCII internally) which is generally far more space efficient than using
>> UTF-32.
>>
>> FWIW, it is even possible to use emoji in constant NSString generated at
>> compilation time. So telling that Cocoa can only handle UCS-2 is plainly
>> wrong.
>
> How can a single unichar (which is typedef'd to unsigned short) store more
> than UCS-2? We are talking about the type for a single character (unichar vs.
> of_unichar_t) here. Strings internally use UTF-8 in ObjFW, but if you use
> characterAtIndex:, you get the whole character and not a surrogate. With
> Cocoa, you get a surrogate, as a single character can only be UCS-2. Try it
> yourself:
>
> [@"😄" length] returns 2 in Cocoa. The same returns 1 in ObjFW, because it is
> one of_unichar_t.
> [@"😄" characterAtIndex: 0] returns the surrogate in Cocoa. In ObjFW, it
> returns a single character 😄, because it fits into one of_unichar_t.
>
> Try this:
> NSLog(@"%C", [@"😄" characterAtIndex: 0]);
> It won't output 😄.
>
> OTOH with ObjFW this:
> of_log(@"%C", [@"😄" characterAtIndex: 0]);
> will output 😄.
>
> But in order to make this work, Clang may not assume that ObjFW is Cocoa and
> thus reject the format string.
>
> And yet, the internal representation is not UTF-32 in ObjFW. So this has
> nothing to do with internal representation, but with how you export a single
> Unicode character - it's part of the API. And Cocoa decided to export a
> single Unicode characters as surrogates if necessary, because a unichar is an
> unsigned short and 😄 doesn't fit.
>
> So where is it wrong what I said? It can handle UTF-16, sure. But it can't
> handle UCS-4 in a single character.
>
> --
> Jonathan
>
> _______________________________________________
> cfe-commits mailing list
> [email protected]
> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
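
For anyone following along, the length-of-2 behaviour quoted above falls straight out of the UTF-16 encoding rules. Here is a minimal sketch in plain C, not taken from Cocoa or ObjFW; the names utf16_unit and utf16_encode are made up for illustration, with utf16_unit standing in for a 16-bit unichar:

```c
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit character type such as unichar. */
typedef uint16_t utf16_unit;

/*
 * Encode one Unicode code point as UTF-16.
 * Returns the number of 16-bit units written to out (1 or 2),
 * or 0 if cp is not a valid Unicode scalar value.
 */
static int utf16_encode(uint32_t cp, utf16_unit out[2])
{
    /* Reject values beyond U+10FFFF and the surrogate range itself. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    /* Code points in the Basic Multilingual Plane fit in one unit. */
    if (cp < 0x10000) {
        out[0] = (utf16_unit)cp;
        return 1;
    }

    /* Everything else is split into a high and a low surrogate. */
    cp -= 0x10000;
    out[0] = (utf16_unit)(0xD800 + (cp >> 10));   /* top 10 bits */
    out[1] = (utf16_unit)(0xDC00 + (cp & 0x3FF)); /* bottom 10 bits */
    return 2;
}
```

Called with 0x1F604 (😄), this returns 2 and yields the units 0xD83D and 0xDE04, so an API that hands out one 16-bit unit per index can only return half of the pair at index 0.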
