On Friday, 24 May 2013 at 17:43:03 UTC, Peter Alexander wrote:
> Simple: backwards compatibility with all ASCII APIs (e.g. most C libraries), and because I don't want my strings to consume multiple bytes per character when I don't need it.
And yet here we are today, where an early decision made solely to accommodate the authors of then-dominant all-ASCII APIs has foisted an unnecessarily complex encoding on all of us, with reduced performance as a result. You do realize that my encoding would represent almost all languages' characters in single bytes, unlike UTF-8, right? Your second point is really an argument against UTF-8, since it is UTF-8 that makes strings consume multiple bytes per character for most non-English text.

> Your language header idea is no good for at least three reasons:

> 1. What happens if I want to take a substring slice of your string? I'll need to allocate a new string to add the header in.
Good point. The solution that comes to mind right now is that you'd parse my format and store it in memory as a String class: the characters go in an internal array with the header stripped out, and the language is kept in a property. That way even a slice can be made to refer to the same language, simply by referring to the language of the containing array.
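
Something like this minimal sketch is what I have in mind; the names (LangString, lang, data) are mine for illustration only, not anything in Phobos:

struct LangString
{
    ubyte lang;               // language/code page ID parsed out of the header
    immutable(ubyte)[] data;  // single-byte characters, header already stripped

    // Slicing just slices the array; the slice shares the containing
    // array's language, so no new header ever needs to be allocated.
    LangString opSlice(size_t lo, size_t hi) const
    {
        return LangString(lang, data[lo .. hi]);
    }

    size_t length() const { return data.length; }
}

unittest
{
    auto s = LangString(0x01, cast(immutable(ubyte)[]) "Hello");
    auto sub = s[1 .. 4];
    assert(sub.lang == s.lang);  // the slice inherits the language for free
    assert(sub.length == 3);
}

The point is simply that the language travels with the reference to the array rather than with the bytes themselves.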

Strictly speaking, this solution could also be implemented with UTF-8, simply by changing the format of the data structure you use in memory to the one I've outlined, as opposed to using the UTF-8 encoding for both transmission and processing. But if you're going to use my format for processing, you might as well use it for transmission too, since it is much smaller for non-ASCII text.

Before you ridicule my solution as somehow unworkable, let me remind you of the current monstrosity. Currently, the language is effectively stored in every single UTF-8 character, by having its length vary from one to four bytes depending on the language. This leads to Phobos converting every UTF-8 string to UTF-32, so that it can run its algorithms on a constant-width 32-bit character set, with the resulting performance penalties. Perhaps the biggest loss is that programmers everywhere are pushed to wrap their heads around this mess, predictably leading to either ignorance or broken code.
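
To put that width difference in concrete terms, here is a small snippet (std.utf.toUTF32 is real Phobos; the string and counts are just an example):

import std.utf : toUTF32;

unittest
{
    string s = "héllo";      // UTF-8: the 'é' alone takes two bytes
    assert(s.length == 6);   // 6 code units for only 5 characters

    // To get one element per character, the string has to be decoded,
    // e.g. to UTF-32, where every character costs four bytes:
    auto wide = toUTF32(s);
    assert(wide.length == 5);
}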

Which seems more unworkable to you?

> 2. What if I have a long string with the ASCII header and want to append a non-ASCII character on the end? I'll need to reallocate the whole string and widen it with the new header.
How often does this happen in practice? I suspect it almost never does. But if it does, it would be solved by the String class I outlined above, since the header isn't stored in the array anymore.
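
A rough sketch of what I mean, continuing the LangString idea above and assuming one language per string, as in my simple format (how mixed-language text would best be encoded is a separate question):

void appendSketch()
{
    auto s = LangString(0x00, cast(immutable(ubyte)[]) "mostly ASCII text");

    s.data ~= cast(ubyte) '!';  // appending: nothing in the array to widen

    s.lang = 0x07;              // hypothetical new language ID: a one-byte
                                // field write, not a reallocation of the
                                // whole string to make room for a new header
}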

> 3. Even if I have a string that is 99% ASCII then I have to pay extra bytes for every character just because 1% wasn't ASCII. With UTF-8, I only pay the extra bytes when needed.
I don't understand what you mean here. If your string has a thousand non-ASCII characters, the UTF-8 version will have one or two thousand more bytes, i.e. 1 or 2 KB more. My format would only add a couple of bytes to the header for each non-ASCII language used, that's it. It's a clear win for my format.
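
Rough figures for the 99%-ASCII case, under two assumptions of mine just to make the comparison concrete: two-byte UTF-8 sequences for the non-ASCII characters, and a two-byte-per-language header in my format.

enum asciiChars    = 99_000;
enum nonAsciiChars =  1_000;
enum languagesUsed =      1;

// UTF-8: one byte per ASCII character, two (or more) per non-ASCII one.
enum utf8Bytes   = asciiChars + nonAsciiChars * 2;                  // 101_000

// Header format: one byte per character plus a small per-language header.
enum headerBytes = asciiChars + nonAsciiChars + languagesUsed * 2;  // 100_002

static assert(utf8Bytes - headerBytes == 998);  // roughly 1 KB saved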

In any case, I just came up with the simplest format I could off the top of my head; maybe there are gaping holes in it. My point is not that my format is best, but that we should be able to come up with a much simpler format along these lines, one that keeps most characters to a single byte. All I want to argue is that UTF-8 is the worst. ;)
