In message <[EMAIL PROTECTED]> Jonathan Cast <[EMAIL PROTECTED]> writes:
> On Wed, 2007-09-26 at 09:05 +0200, Johan Tibell wrote:
> > If UTF-16 is what's used by everyone else (how about Java? Python?) I
> > think that's a strong reason to use it. I don't know Unicode well
> > enough to say otherwise.
>
> I disagree. I realize I'm a dissenter in this regard, but my position
> is: excellent Unix support first, portability second, excellent support
> for Win32/MacOS a distant third. That seems to be the opposite of every
> language's position. Unix absolutely needs UTF-8 for backward
> compatibility.

I think you're talking about different things, internal vs external
representations.

Certainly we must support UTF-8 as an external representation. The
choice of internal representation is independent of that. It could be
[Char] or some memory-efficient packed format in a standard encoding
like UTF-8, UTF-16 or UTF-32. The choice depends mostly on ease of
implementation and performance. Some formats are easier/faster to
process, but there are also conversion costs, so in some use cases there
is a performance benefit to the internal representation being the same
as the external representation.

So, the obvious choices of internal representation are UTF-8 and UTF-16.
UTF-8 has the advantage of being the same as a common external
representation, so conversion is cheap (only need to validate rather
than copy). UTF-8 is more compact for western languages but less compact
for eastern languages compared to UTF-16. UTF-8 is a more complex
encoding in the common cases than UTF-16. In the common case UTF-16 is
effectively fixed width. According to the ICU implementors this has
speed advantages (probably due to branch prediction and smaller code
size).

One solution is to do both and benchmark them.

Duncan
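
To make the compactness trade-off concrete, here is a minimal sketch in
Python (used only for illustration; the function name encoded_sizes and
the sample strings are made up for this example). It counts how many
bytes the same text needs in each encoding. It says nothing about
processing speed, which would still need the kind of benchmark suggested
above.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes `text` occupies in each encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The "-le" codecs are used so that no byte order mark is counted.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


if __name__ == "__main__":
    samples = {
        "ASCII": "hello",
        "Japanese": "\u65e5\u672c\u8a9e",        # three BMP code points
        "outside the BMP": "\U0001D11E",         # MUSICAL SYMBOL G CLEF
    }
    for label, text in samples.items():
        print(label, encoded_sizes(text))
```

For ASCII text UTF-8 needs one byte per character and UTF-16 needs two.
For the Japanese sample UTF-8 needs three bytes per character and UTF-16
still needs two. A character outside the Basic Multilingual Plane takes
four bytes in both, which is the uncommon case where UTF-16 stops being
fixed width.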