On Friday, 24 May 2013 at 17:43:03 UTC, Peter Alexander wrote:
> Simple: backwards compatibility with all ASCII APIs (e.g. most C libraries), and because I don't want my strings to consume multiple bytes per character when I don't need it.
And yet here we are today, where an early decision made solely to accommodate the authors of then-dominant all-ASCII APIs has foisted an unnecessarily complex encoding on all of us, with reduced performance as a result. You do realize that my encoding would represent almost all languages' characters in single bytes, unlike UTF-8, right? Your second point is really an argument against UTF-8, since it is UTF-8 that makes strings consume multiple bytes per character for most non-English text.

> Your language header idea is no good for at least three reasons:

> 1. What happens if I want to take a substring slice of your string? I'll need to allocate a new string to add the header in.
Good point. The solution that comes to mind right now is that you'd parse my format and store it in memory as a String class: the characters go in an internal array with the header stripped out, and the language is kept in a property. That way even a slice can be made to refer to the same language, simply by referring to the language of the containing array.
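
Something like this minimal sketch is what I have in mind; the names (LangString, lang, data) are mine for illustration only, not anything in Phobos:

struct LangString
{
    ubyte lang;               // language/code page ID parsed out of the header
    immutable(ubyte)[] data;  // single-byte characters, header already stripped

    // Slicing just slices the array; the slice shares the containing
    // array's language, so no new header ever needs to be allocated.
    LangString opSlice(size_t lo, size_t hi) const
    {
        return LangString(lang, data[lo .. hi]);
    }

    size_t length() const { return data.length; }
}

unittest
{
    auto s = LangString(0x01, cast(immutable(ubyte)[]) "Hello");
    auto sub = s[1 .. 4];
    assert(sub.lang == s.lang);  // the slice inherits the language for free
    assert(sub.length == 3);
}

The point is simply that the language travels with the reference to the array rather than with the bytes themselves.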

Strictly speaking, this solution could also be implemented with UTF-8, simply by changing the format of the data structure you use in memory to the one I've outlined, as opposed to using the UTF-8 encoding for both transmission and processing. But if you're going to use my format for processing, you might as well use it for transmission too, since it is much smaller for non-ASCII text.

Before you ridicule my solution as somehow unworkable, let me remind you of the current monstrosity. Currently, the language is effectively stored in every single UTF-8 character, by having its length vary from one to four bytes depending on the language. This leads to Phobos converting every UTF-8 string to UTF-32, so that it can run its algorithms on a constant-width 32-bit character set, with the resulting performance penalties. Perhaps the biggest loss is that programmers everywhere are pushed to wrap their heads around this mess, predictably leading to either ignorance or broken code.
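
To put that width difference in concrete terms, here is a small snippet (std.utf.toUTF32 is real Phobos; the string and counts are just an example):

import std.utf : toUTF32;

unittest
{
    string s = "héllo";      // UTF-8: the 'é' alone takes two bytes
    assert(s.length == 6);   // 6 code units for only 5 characters

    // To get one element per character, the string has to be decoded,
    // e.g. to UTF-32, where every character costs four bytes:
    auto wide = toUTF32(s);
    assert(wide.length == 5);
}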

Which seems more unworkable to you?

> 2. What if I have a long string with the ASCII header and want to append a non-ASCII character on the end? I'll need to reallocate the whole string and widen it with the new header.
How often does this happen in practice? I suspect it almost never does. But if it does, it would be solved by the String class I outlined above, since the header isn't stored in the array anymore.
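
A rough sketch of what I mean, continuing the LangString idea above and assuming one language per string, as in my simple format (how mixed-language text would best be encoded is a separate question):

void appendSketch()
{
    auto s = LangString(0x00, cast(immutable(ubyte)[]) "mostly ASCII text");

    s.data ~= cast(ubyte) '!';  // appending: nothing in the array to widen

    s.lang = 0x07;              // hypothetical new language ID: a one-byte
                                // field write, not a reallocation of the
                                // whole string to make room for a new header
}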

> 3. Even if I have a string that is 99% ASCII then I have to pay extra bytes for every character just because 1% wasn't ASCII. With UTF-8, I only pay the extra bytes when needed.
I don't understand what you mean here. If your string has a thousand non-ASCII characters, the UTF-8 version will have one or two thousand more bytes, i.e. 1 or 2 KB more. My format would only add a couple of bytes to the header for each non-ASCII language used, that's it. It's a clear win for my format.
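
Rough figures for the 99%-ASCII case, under two assumptions of mine just to make the comparison concrete: two-byte UTF-8 sequences for the non-ASCII characters, and a two-byte-per-language header in my format.

enum asciiChars    = 99_000;
enum nonAsciiChars =  1_000;
enum languagesUsed =      1;

// UTF-8: one byte per ASCII character, two (or more) per non-ASCII one.
enum utf8Bytes   = asciiChars + nonAsciiChars * 2;                  // 101_000

// Header format: one byte per character plus a small per-language header.
enum headerBytes = asciiChars + nonAsciiChars + languagesUsed * 2;  // 100_002

static assert(utf8Bytes - headerBytes == 998);  // roughly 1 KB saved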

In any case, I just came up with the simplest format I could off the top of my head; maybe there are gaping holes in it. My point is not that my format is best, but that we should be able to come up with a much simpler format along these lines, one that keeps most characters to a single byte. All I want to argue is that UTF-8 is the worst. ;)
