On Saturday, 25 May 2013 at 14:18:32 UTC, Vladimir Panteleev wrote:
On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:
Are you sure _you_ understand it properly? Both encodings have to check every single character to test for whitespace, but the single-byte encoding simply has to load each byte in the string and compare it against the whitespace-signifying bytes, while the variable-length encoding has to first load and parse up to 4 bytes before it can compare, because it has to go through the state machine that you linked to above. Obviously the constant-width encoding will be faster. Did I really need to explain this?

It looks like you've missed an important property of UTF-8: lower ASCII remains encoded the same, and the UTF-8 code units that encode non-ASCII characters can never be confused with ASCII characters. Code that does not need Unicode code points can treat UTF-8 strings as ASCII strings, and does not need to decode each character individually - because a 0x20 byte will mean "space" regardless of context. That's why a function that splits a string by ASCII whitespace does NOT need to perform UTF-8 decoding.
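To make that property concrete, here is a minimal C sketch (the function name is mine, not from the thread): every byte of a multi-byte UTF-8 sequence has its high bit set, so none of them can ever equal the ASCII space byte 0x20, and a byte-wise scan splits words correctly without any decoding.

```c
#include <assert.h>

/* Count tokens separated by ASCII spaces in a UTF-8 string, byte by byte.
 * No decoding is needed: in UTF-8, every byte of a multi-byte sequence is
 * >= 0x80, so no such byte can collide with the ASCII space byte 0x20. */
static int count_words(const char *s)
{
    int n = 0, in_word = 0;
    for (; *s; ++s) {
        if (*s == ' ')
            in_word = 0;        /* space byte: end of any current word */
        else if (!in_word) {
            in_word = 1;        /* first byte of a new word (ASCII or not) */
            ++n;
        }
    }
    return n;
}
```

For example, `count_words("na\xC3\xAFve caf\xC3\xA9")` ("naïve café") yields 2, even though both words contain multi-byte sequences.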

I hope this clears up the misunderstanding :)
OK, you got me with this particular special case: it is not necessary to decode every UTF-8 character if you are simply comparing against the ASCII space character. My mixup was because I didn't know whether every language uses its own space character in UTF-8 or whether they all reuse the ASCII space character; apparently it's the latter.

However, my overall point stands. You still have to check 2-4 times as many bytes if you do it the way Peter suggests, as opposed to a single-byte encoding. There is a shortcut: you could also check the first byte to see if it's ASCII or not and then skip the right number of ensuing bytes in a character's encoding if it isn't ASCII, but at that point you have begun partially decoding the UTF-8 encoding, which you claimed wasn't necessary and which will degrade performance anyway.
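For concreteness, here is a C sketch of that shortcut (helper names are mine): the leading byte alone determines the sequence length, so you can jump over the continuation bytes - a partial decode of the first byte only, as described above.

```c
#include <stddef.h>

/* Sequence length from a UTF-8 leading byte (assumes valid UTF-8):
 * 0xxxxxxx -> 1, 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)          return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Count code points by inspecting only each leading byte and skipping
 * the continuation bytes, rather than fully decoding each character. */
static size_t count_code_points(const char *s)
{
    size_t n = 0;
    const unsigned char *p = (const unsigned char *)s;
    while (*p) {
        p += utf8_seq_len(*p);
        ++n;
    }
    return n;
}
```

So `count_code_points("caf\xC3\xA9")` ("café", 5 bytes) yields 4: the two-byte sequence for 'é' is skipped in one step.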

On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
I suggest you read up on UTF-8. You really don't understand it. There is no need to decode, you just treat the UTF-8 string as if it is an ASCII string.
Not being aware of this shortcut doesn't mean not understanding UTF-8.

This code will count all spaces in a string whether it is encoded as ASCII or UTF-8:

int countSpaces(const(char)* c)
{
    int n = 0;
    for (; *c; ++c)     // advance one byte at a time; no decoding needed
        if (*c == ' ')
            ++n;
    return n;
}

I repeat: there is no need to decode. Please read up on UTF-8. You do not understand it. The reason you don't need to decode is because UTF-8 is self-synchronising.
Not quite. The reason you don't need to decode is because of the particular encoding scheme chosen for UTF-8, a side effect of ASCII backwards compatibility and reusing the ASCII space character; it has nothing to do with whether it's self-synchronizing or not.
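To illustrate the distinction (a C sketch, function name mine): what self-synchronization actually buys you is that continuation bytes always match the bit pattern 10xxxxxx, so from an arbitrary byte offset you can back up to the start of the current code point without any context - a separate property from the ASCII transparency that makes the space-counting trick work.

```c
#include <stddef.h>

/* Given a byte index i into UTF-8 string s, step backwards over any
 * continuation bytes (those matching 10xxxxxx, i.e. (b & 0xC0) == 0x80)
 * to find the start of the code point containing s[i]. This works from
 * any offset precisely because UTF-8 is self-synchronizing. */
static size_t sync_to_boundary(const char *s, size_t i)
{
    while (i > 0 && (((unsigned char)s[i]) & 0xC0) == 0x80)
        --i;
    return i;
}
```

In "caf\xC3\xA9" ("café"), byte index 4 is the continuation byte 0xA9, and `sync_to_boundary` steps back to index 3, where the two-byte sequence for 'é' begins.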

The code above tests for spaces only, but it works the same when searching for any substring or single character. It is no slower than fixed-width encoding for these operations.
It doesn't work the same "for any substring or single character," it works the same for any single ASCII character.

Of course it's slower than a fixed-width single-byte encoding. You have to check every single byte of a non-ASCII character in UTF-8, whereas a single-byte encoding only has to check a single byte per language character. There is a shortcut if you partially decode the first byte in UTF-8, mentioned above, but you seem dead-set against decoding. ;)

Again, I urge you, please read up on UTF-8. It is very well designed.
I disagree. It is very badly designed, but the ASCII compatibility does hack in some shortcuts like this, which still don't save its performance.
