On Saturday, 25 May 2013 at 14:18:32 UTC, Vladimir Panteleev
wrote:
On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:
Are you sure _you_ understand it properly? Both encodings
have to check every single character to test for whitespace,
but the single-byte encoding simply has to load each byte in
the string and compare it against the whitespace-signifying
bytes, while the variable-length code has to first load and
parse potentially 4 bytes before it can compare, because it
has to go through the state machine that you linked to above.
Obviously the constant-width encoding will be faster. Did I
really need to explain this?
It looks like you've missed an important property of UTF-8:
lower ASCII remains encoded the same, and UTF-8 code units
encoding non-ASCII characters cannot be confused with ASCII
characters. Code that does not need Unicode code points can
treat UTF-8 strings as ASCII strings, and does not need to
decode each character individually - because a 0x20 byte will
mean "space" regardless of context. That's why a function that
splits a string by ASCII whitespace does NOT need to perform
UTF-8 decoding.
I hope this clears up the misunderstanding :)
OK, you got me with this particular special case: it is not
necessary to decode every UTF-8 character if you are simply
comparing against ASCII space characters. My mixup was because I
was unaware whether every language uses its own space character in
UTF-8 or whether they all reuse the ASCII space character;
apparently it's the latter.
However, my overall point stands. You still have to check 2-4
times as many bytes if you do it the way Peter suggests, as
opposed to a single-byte encoding. There is a shortcut: you
could also check the first byte to see if it's ASCII or not and
then skip the right number of ensuing bytes in a character's
encoding if it isn't ASCII, but at that point you have begun
partially decoding the UTF-8 encoding, which you claimed wasn't
necessary and which will degrade performance anyway.
On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
I suggest you read up on UTF-8. You really don't understand it.
There is no need to decode, you just treat the UTF-8 string as
if it is an ASCII string.
Not being aware of this shortcut doesn't mean not understanding
UTF-8.
This code will count all spaces in a string whether it is
encoded as ASCII or UTF-8:
int countSpaces(const(char)* c)
{
    int n = 0;
    while (*c)
    {
        if (*c == ' ')
            ++n;
        ++c;    // advance one byte; no decoding needed
    }
    return n;
}
I repeat: there is no need to decode. Please read up on UTF-8.
You do not understand it. The reason you don't need to decode
is because UTF-8 is self-synchronising.
Not quite. The reason you don't need to decode is because of the
particular encoding scheme chosen for UTF-8, a side effect of
ASCII backwards compatibility and reusing the ASCII space
character; it has nothing to do with whether it's
self-synchronizing or not.
The code above tests for spaces only, but it works the same
when searching for any substring or single character. It is no
slower than fixed-width encoding for these operations.
It doesn't work the same "for any substring or single character";
it works the same for any single ASCII character.
Of course it's slower than a fixed-width single-byte encoding.
You have to check every single byte of a non-ASCII character in
UTF-8, whereas a single-byte encoding only has to check a single
byte per language character. There is a shortcut if you
partially decode the first byte in UTF-8, mentioned above, but
you seem dead-set against decoding. ;)
Again, I urge you, please read up on UTF-8. It is very well
designed.
I disagree. It is very badly designed, but the ASCII
compatibility does hack in some shortcuts like this, which still
don't save its performance.