[darcs-users] Fwd: [patch37] Store textual patch metadata encoded in UTF-8

Reinier Lamers Wed, 04 Nov 2009 01:27:53 -0800

Hi Simon,

2009/11/3 Simon Michael <[email protected]>:
> Hi Reinier, thanks a lot for working on this. I haven't studied the patch or
> the issue page, so please take this for what it's worth.


I'm interested in your experiences, but I don't entirely understand
what you mean in some places. Below I point out where I am confused,
and I explain what I did and why I did it.

> I have a project
> that stored content as utf-8 for some years. We were able to stamp out most
> encoding/decoding-related errors, but not all of them, and there were
> certain features we could never support, like searching for partial unicode
> strings.

What do you mean by a "partial Unicode string"?

> For the last while we have been converting to unicode as the
> internal format, with strict decoding of incoming data and encoding of
> outgoing.

What do you mean by "format"? Unicode is a /character set/, but it
does not specify an encoding.

There are multiple possible encodings for characters of the Unicode
character set. UTF-8 is one of them. In some places people use
"Unicode" to mean UTF-16 or UCS-2, which are other encodings for the
Unicode character set.  They do so mostly for legacy reasons. Our
Unicode support is not legacy - it's actually so brand new it's not
even merged in. So let's keep Unicode and UTF-16 separate.
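
To make the character set / encoding distinction concrete, here is a
minimal Python sketch (not taken from the patch; the sample character
and the variable names are picked only for illustration). It encodes one
Unicode code point in three different encodings:

```python
# One abstract character (a Unicode code point) ...
ch = "\u00fc"  # LATIN SMALL LETTER U WITH DIAERESIS, i.e. 'ü'
code_point = ord(ch)

# ... and three different byte-level encodings of that same code point.
# The "-be" variants are used so that no byte order mark is prepended.
encodings = {
    name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")
}

for name, raw in encodings.items():
    print(f"U+{code_point:04X} as {name}: {raw.hex()}")
```

The code point stays the same; only the bytes differ per encoding.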

Then the question becomes: what can you do with UTF-16 that you can't
do with UTF-8? AFAIK, UTF-16 is more space efficient in storing
Chinese, Korean or Japanese text (typically two bytes per character
instead of three), while UTF-8 is more space efficient in storing
Latin text. Greek and Cyrillic take two bytes per character in both
encodings. Also, if you use only characters from
the Basic Multilingual Plane of Unicode, you can treat UTF-16 as a
fixed-width 16-bit encoding. UTF-8 on the other hand is sort-of
backwards compatible with ASCII, so that you can feed UTF-8 to tools
that expect ASCII and still get something sensible out.
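
The size and compatibility trade-offs above can be checked with a small
Python sketch. The helper name `encoded_sizes` and the sample strings
are assumptions made up for this illustration; nothing here comes from
the darcs code.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    # "utf-16-le" is used so that no byte order mark is added to the count.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Hypothetical sample strings, one per script.
samples = {
    "Latin": "darcs",
    "Greek": "\u03b1\u03b2\u03b3",    # alpha, beta, gamma
    "Cyrillic": "\u0434\u0430",       # "da"
    "Chinese": "\u4e2d\u6587",        # "zhongwen"
    "outside the BMP": "\U0001d11e",  # MUSICAL SYMBOL G CLEF
}

sizes = {label: encoded_sizes(text) for label, text in samples.items()}
for label, (utf8_len, utf16_len) in sizes.items():
    print(f"{label}: {utf8_len} bytes as UTF-8, {utf16_len} bytes as UTF-16")

# Pure ASCII text is byte-for-byte identical when encoded as UTF-8.
ascii_compatible = "darcs".encode("utf-8") == "darcs".encode("ascii")
```

In these samples the Latin string is half the size in UTF-8, the Chinese
one is smaller in UTF-16, and the Greek and Cyrillic ones come out equal.
The last sample needs two 16-bit units (a surrogate pair) in UTF-16, which
is why the fixed-width view only holds inside the Basic Multilingual
Plane.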

Generally, Unix prefers UTF-8 and Windows prefers UTF-16 (which it
calls "Unicode").

For this reason of backwards compatibility, and because darcs is more
Unixy than Windowsy, I chose UTF-8 over UTF-16.

Also, all strings are stored normalized to NFC to make comparison of
strings easier and/or more meaningful (that is, darcs will treat every
'ü' the same, even though there are multiple ways to express that
letter in Unicode).
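
As an illustration of why the normalization matters, here is a minimal
Python sketch using the standard `unicodedata` module. It is not the
darcs implementation; the helper name `nfc` is an assumption made for
this example.

```python
import unicodedata

# Two ways to write the same letter:
precomposed = "\u00fc"   # single code point: u with diaeresis
decomposed = "u\u0308"   # 'u' followed by COMBINING DIAERESIS

# Without normalization the two strings compare as different.
raw_equal = precomposed == decomposed


def nfc(text):
    """Normalize text to Unicode Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", text)


# After NFC normalization both spellings become the same string.
normalized_equal = nfc(precomposed) == nfc(decomposed)

print(raw_equal, normalized_equal)
```

With everything stored in NFC, a plain string comparison is enough to
treat both spellings as the same text.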

Regards,
Reinier
_______________________________________________
darcs-users mailing list
[email protected]
http://lists.osuosl.org/mailman/listinfo/darcs-users
