Hi Simon,

2009/11/3 Simon Michael <[email protected]>:
> Hi Reinier, thanks a lot for working on this. I haven't studied the patch or
> the issue page, so please take this for what it's worth.

I'm interested in your experiences, but I do not entirely understand what you mean in some places. Below I highlight the places where I am confused, and I explain what I did and why I did it.

> I have a project
> that stored content as utf-8 for some years. We were able to stamp out most
> encoding/decoding-related errors, but not all of them, and there were
> certain features we could never support, like searching for partial unicode
> strings.

What do you mean by a "partial Unicode string"?

> For the last while we have been converting to unicode as the
> internal format, with strict decoding of incoming data and encoding of
> outgoing.

What do you mean by "format"? Unicode is a /character set/, but it does not specify an encoding. There are multiple possible encodings for characters of the Unicode character set. UTF-8 is one of them. In some places people use "Unicode" to mean UTF-16 or UCS-2, which are other encodings for the Unicode character set. They do so mostly for legacy reasons. Our Unicode support is not legacy: it's so brand new it's not even merged in yet. So let's keep Unicode and UTF-16 separate.

Then the question becomes: what can you do with UTF-16 that you can't do with UTF-8? AFAIK, UTF-16 is more space efficient in storing Chinese, Korean or Japanese text (two bytes per character instead of three), while UTF-8 is more space efficient in storing Latin text (one byte per ASCII character instead of two). Greek and Cyrillic take two bytes per character in either encoding. Also, if you use only characters from the Basic Multilingual Plane of Unicode, you can treat UTF-16 as a fixed-width 16-bit encoding. UTF-8 on the other hand is sort-of backwards compatible with ASCII, so that you can feed UTF-8 to tools that expect ASCII and still get something sensible out. Generally, Unix prefers UTF-8 and Windows prefers UTF-16 (which it calls "Unicode"). For this reason of backwards compatibility, and because darcs is more Unixy than Windowsy, I chose UTF-8 over UTF-16.

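To put rough numbers on the space trade-off, here is a small Python sketch. It is not taken from the darcs patch; the sample strings and the helper name `encoded_sizes` are made up purely for illustration.

```python
# Illustrative only: the sample strings are arbitrary, not darcs data.
samples = {
    "Latin": "hello world",
    "Greek": "αβγδε",
    "Japanese": "日本語",
}

def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

for name, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{name}: {utf8} bytes as UTF-8, {utf16} bytes as UTF-16")

# Pure ASCII text yields the same bytes under UTF-8 and ASCII,
# which is the backwards compatibility mentioned above.
assert "hello world".encode("utf-8") == "hello world".encode("ascii")
```

For these samples the Latin string is half the size in UTF-8, the Japanese string is smaller in UTF-16, and the Greek string comes out the same size in both.
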
Also, all strings are stored normalized to NFC to make comparison of strings easier and/or more meaningful (that is, darcs will treat every 'ü' the same, even though there are multiple ways to express that letter in Unicode).

Regards,
Reinier
_______________________________________________
darcs-users mailing list
[email protected]
http://lists.osuosl.org/mailman/listinfo/darcs-users
