On Friday, 24 May 2013 at 17:05:57 UTC, Joakim wrote:
> This triggered a long-standing bugbear of mine: why are we using these variable-length encodings at all?

Simple: backwards compatibility with all ASCII APIs (e.g. most C libraries), and because I don't want my strings to consume multiple bytes per character when I don't need them to.
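To make the compatibility point concrete, here is a minimal sketch in Python (chosen only for illustration; the variable names are made up for this example). It shows that a pure-ASCII string has exactly the same bytes whether it is encoded as ASCII or as UTF-8, which is why a byte-oriented ASCII API can take such a string unchanged.

```python
# Illustrative only: an ASCII-only string encodes to identical bytes
# under ASCII and UTF-8, one byte per character.
text = "hello, world"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

print(ascii_bytes == utf8_bytes)     # True
print(len(utf8_bytes) == len(text))  # True: one byte per character
```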

Your language header idea is no good for at least three reasons:

1. What happens if I want to take a substring slice of your string? I'll need to allocate a new string to add the header in.

2. What if I have a long string with the ASCII header and want to append a non-ASCII character on the end? I'll need to reallocate the whole string and widen it with the new header.

3. Even if I have a string that is 99% ASCII, I still have to pay extra bytes for every character just because 1% wasn't ASCII. With UTF-8, I only pay the extra bytes when needed.
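Points 2 and 3 can be sketched roughly as follows, again in Python purely for illustration. UTF-32 stands in here for "any scheme that widens every character"; that is an assumption made for the example, not the header format proposed upthread, and the helper name `encoded_size` is hypothetical.

```python
def encoded_size(text, encoding):
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))

# 99 ASCII characters followed by a single non-ASCII one.
ascii_part = "a" * 99
sample = ascii_part + "é"

utf8_size = encoded_size(sample, "utf-8")       # 99 * 1 + 2 bytes
utf32_size = encoded_size(sample, "utf-32-le")  # 100 * 4 bytes, no BOM

print(utf8_size)   # 101
print(utf32_size)  # 400

# Appending the non-ASCII character leaves the existing UTF-8 bytes
# untouched: the old encoding is a prefix of the new one, so nothing
# already stored has to be widened.
print(sample.encode("utf-8").startswith(ascii_part.encode("utf-8")))  # True
```

With UTF-8 the one non-ASCII character costs one extra byte in total, while the widened encoding charges three extra bytes for each of the 100 characters.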
