On Sat, Mar 24, 2012 at 7:16 PM, Johan Tibell <johan.tib...@gmail.com> wrote:
> On Sat, Mar 24, 2012 at 4:42 PM, Gabriel Dos Reis
> <g...@integrable-solutions.net> wrote:
>> Hmm, std::u16string, std::u32string, and std::wstring are C++ standard
>> types to process Unicode texts.
>
> Note that at least u16string is too small to encode all of Unicode, and
> wstring might be as well, since 16 bits is not enough to encode all of
> Unicode.
>
I think there is a confusion here. A Unicode character is an abstract
entity. For it to exist in some concrete form in a program, you need an
encoding. The fact that char16_t is 16 bits wide is irrelevant to whether
it can be used in a representation of Unicode text, just as uint8_t
(e.g. 'unsigned char') can be used to encode Unicode strings despite being
only 8 bits wide. You do not need to make the character type exactly equal
to the type of the individual elements in the text representation.

Now, if you want a one-to-one correspondence between the individual
elements of a std::basic_string and Unicode characters, you would of
course go for char32_t, which might be wasteful depending on the
circumstances. Text-processing languages like Perl long ago decided to
de-emphasize one-character-at-a-time processing; for most common cases it
is simply inefficient. I understand that the efficiency argument may not
be as strong in the context of Haskell, but I believe particular attention
must be paid to the correctness of the semantics.

Note also that an encoding by itself (whether UTF-8, UTF-16, etc.) is
insufficient as far as text processing goes; you also need a locale at the
minimum. It is the combination of the two that gives meaning to text
representation and operations. I have been following the discussion, but I
have not seen anything said about locales.

-- Gaby

_______________________________________________
Haskell-prime mailing list
Haskell-prime@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-prime