On 25-May-2013 12:58, Vladimir Panteleev wrote:
On Saturday, 25 May 2013 at 07:33:15 UTC, Joakim wrote:
This is more a problem with the algorithms taking the easy way than a
problem with UTF-8. You can do all the string algorithms, including
regex, by working with the UTF-8 directly rather than converting to
UTF-32. Then the algorithms work at full speed.
I call BS on this.  There's no way working on a variable-width
encoding can be as "full speed" as a constant-width encoding. Perhaps
you mean that the slowdown is minimal, but I doubt that also.

For the record, I noticed that programmers (myself included) that had an
incomplete understanding of Unicode / UTF exaggerate this point, and
sometimes needlessly assume that their code needs to operate on
individual characters (code points), when in fact it does not - and
that code will work just fine even though it was written as if it only
had to handle ASCII. The
example Walter quoted (regex - assuming you don't want Unicode ranges or
case-insensitivity) is one such case.
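For instance (a minimal Python sketch; the sample text and pattern are made up), an ASCII regex can be run directly over the UTF-8 bytes, because every byte of a multi-byte UTF-8 sequence has the high bit set and can never collide with an ASCII pattern:

```python
import re

# Hypothetical sample text; "naïve" and "café" contain two-byte
# UTF-8 sequences, while "user" is pure ASCII.
text = "naïve café user".encode("utf-8")

# Search for an ASCII pattern directly in the UTF-8 bytes.  Every
# byte of a multi-byte UTF-8 sequence is >= 0x80, so an ASCII
# pattern can never falsely match inside one -- no decoding to
# UTF-32 is needed for the search itself.
m = re.search(rb"user", text)
print(m.start())                          # 13 -- a *byte* offset
print("naïve café user".index("user"))    # 11 -- the code-point index
```

The only caveat, as the last two lines show, is that match positions are byte offsets, not code-point indices.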

+1
BTW, regex even with Unicode ranges and case-insensitivity is doable, just not easy (yet).

Another thing I noticed: sometimes when you think you really need to
operate on individual characters (and that your code will not be correct
unless you do that), the assumption will be incorrect due to the
existence of combining characters in Unicode. Two of the often-quoted
use cases of working on individual code points are calculating the string
width (assuming a fixed-width font), and slicing the string - both of
these will break with combining characters if those are not accounted
for. I believe the proper way to approach such tasks is to implement the
respective Unicode algorithms; those are non-trivial enough that the
relative overhead of working with a variable-width encoding becomes
acceptable.
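A small Python illustration of the slicing pitfall (the sample word is made up; U+0301 is COMBINING ACUTE ACCENT):

```python
import unicodedata

# "café" written two ways: precomposed U+00E9 "é", versus a plain
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

# Both render identically, yet their code-point counts differ, so
# "display width == number of code points" is already wrong:
assert len(precomposed) == 4 and len(decomposed) == 5

# Naive code-point slicing silently strips the accent:
assert decomposed[:4] == "cafe"

# Proper Unicode machinery (normalization, here NFC) reconciles them:
assert unicodedata.normalize("NFC", decomposed) == precomposed
```

So even full code-point decoding does not by itself make slicing or width calculations correct.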

Another +1. Algorithms defined on a code-point basis are quite complex, so the benefit of not decoding won't be that large. The benefit of transparently special-casing ASCII in UTF-8 is far larger.
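One way such special-casing can look in practice (a hypothetical helper, sketched in Python rather than D): because UTF-8 is a strict superset of ASCII, a single scan for high bits picks between a cheap byte-wise path and the general one.

```python
def upper_utf8(data: bytes) -> str:
    """Uppercase UTF-8 text, transparently special-casing ASCII.

    Hypothetical helper for illustration: when no byte has the
    high bit set, the input is pure ASCII and a cheap byte-wise
    pass suffices; otherwise fall back to full decoding.
    """
    if all(b < 0x80 for b in data):           # ASCII fast path
        return data.upper().decode("ascii")   # byte-wise, no UTF-8 decoding
    return data.decode("utf-8").upper()       # general (slow) path

print(upper_utf8(b"hello"))                 # HELLO (fast path)
print(upper_utf8("héllo".encode("utf-8")))  # HÉLLO (slow path)
```

A real library would vectorize the high-bit scan, but the structure is the same: the common ASCII case never pays for decoding.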

Can you post some specific cases where the benefits of a constant-width
encoding are obvious and, in your opinion, make constant-width encodings
more useful than all the benefits of UTF-8?

Also, I don't think this has been posted in this thread. Not sure if it
answers your points, though:

http://www.utf8everywhere.org/

And here's a simple and correct UTF-8 decoder:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
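The linked decoder is table-driven; for comparison, the straightforward branchy version (a minimal Python sketch, not the DFA from the link) looks roughly like this:

```python
def decode_utf8(data: bytes) -> list[int]:
    """Decode UTF-8 bytes into a list of code points.

    Minimal branchy sketch for illustration -- not the table-driven
    DFA from the link above.  Raises ValueError on malformed input.
    """
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # 1-byte sequence (ASCII)
            cp, extra = b, 0
        elif 0xC2 <= b <= 0xDF:           # 2-byte lead
            cp, extra = b & 0x1F, 1
        elif 0xE0 <= b <= 0xEF:           # 3-byte lead
            cp, extra = b & 0x0F, 2
        elif 0xF0 <= b <= 0xF4:           # 4-byte lead
            cp, extra = b & 0x07, 3
        else:                             # 0x80-0xC1, 0xF5-0xFF
            raise ValueError(f"invalid lead byte at {i}")
        if i + extra >= len(data):
            raise ValueError("truncated sequence")
        for j in range(1, extra + 1):
            cont = data[i + j]
            if cont & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte at {i + j}")
            cp = (cp << 6) | (cont & 0x3F)
        # Reject overlong encodings, surrogates, and out-of-range values.
        if (extra == 2 and cp < 0x800) or (extra == 3 and cp < 0x10000) \
                or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            raise ValueError("overlong or out-of-range sequence")
        out.append(cp)
        i += extra + 1
    return out
```

The branchy version makes the validation rules visible; the DFA folds all of them into one state table and one loop body, which is what makes it fast.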


--
Dmitry Olshansky
