On Saturday, 25 May 2013 at 14:18:32 UTC, Vladimir Panteleev
wrote:
On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:
Are you sure _you_ understand it properly? Both encodings
have to check every single character to test for whitespace,
but the single-byte encoding simply has to load each byte in
the string and compare it against the whitespace-signifying
bytes, while the variable-length code has to first load and
parse potentially 4 bytes before it can compare, because it
has to go through the state machine that you linked to above.
Obviously the constant-width encoding will be faster. Did I
really need to explain this?
It looks like you've missed an important property of UTF-8:
lower ASCII remains encoded the same, and UTF-8 code units
encoding non-ASCII characters cannot be confused with ASCII
characters. Code that does not need Unicode code points can
treat UTF-8 strings as ASCII strings, and does not need to
decode each character individually - because a 0x20 byte will
mean "space" regardless of context. That's why a function that
splits a string by ASCII whitespace does NOT need to perform
UTF-8 decoding.
I hope this clears up the misunderstanding :)
OK, you got me with this particular special case: it is not
necessary to decode every UTF-8 character if you are simply
comparing against ASCII space characters. My mixup was because I
was unaware whether every language uses its own space character in
UTF-8 or whether they all reuse the ASCII space character;
apparently it's the latter.
However, my overall point stands. You still have to check 2-4
times as many bytes if you do it the way Peter suggests, as
opposed to a single-byte encoding. There is a shortcut: you
could also check the first byte to see if it's ASCII or not and
then skip the right number of ensuing bytes in a character's
encoding if it isn't ASCII, but at that point you have begun
partially decoding the UTF-8 encoding, which you claimed wasn't
necessary and which will degrade performance anyway.
On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
I suggest you read up on UTF-8. You really don't understand it.
There is no need to decode, you just treat the UTF-8 string as
if it is an ASCII string.
Not being aware of this shortcut doesn't mean not understanding
UTF-8.
This code will count all spaces in a string whether it is
encoded as ASCII or UTF-8:
int countSpaces(const(char)* c)
{
    int n = 0;
    while (*c)
    {
        if (*c == ' ')
            ++n;
        ++c;    // advance one byte; no decoding needed
    }
    return n;
}
I repeat: there is no need to decode. Please read up on UTF-8.
You do not understand it. The reason you don't need to decode
is because UTF-8 is self-synchronising.
Not quite. The reason you don't need to decode is because of the
particular encoding scheme chosen for UTF-8, a side effect of
ASCII backwards compatibility and reusing the ASCII space
character; it has nothing to do with whether it's
self-synchronizing or not.
The code above tests for spaces only, but it works the same
when searching for any substring or single character. It is no
slower than fixed-width encoding for these operations.
It doesn't work the same "for any substring or single character";
it works the same for any single ASCII character.
Of course it's slower than a fixed-width single-byte encoding.
You have to check every single byte of a non-ASCII character in
UTF-8, whereas a single-byte encoding only has to check a single
byte per language character. There is a shortcut if you
partially decode the first byte in UTF-8, mentioned above, but
you seem dead-set against decoding. ;)
Again, I urge you, please read up on UTF-8. It is very well
designed.
I disagree. It is very badly designed, but the ASCII
compatibility does hack in some shortcuts like this, which still
don't save its performance.