>
> UTF-16 is much faster in many situations than UTF-8.
>

An encoding is not a speed; it is a format. Both formats are
variable-length encodings, and therefore decoding either one has the same
time and space complexity (although the implementation of UTF-16 does
appear to be simpler, judging from the length of the Julia decoding
functions).

> as well as bloat the size of any buffers you need - because you'll need
> to allocate 50% more space than for UTF-16, to be sure you can hold the
> same # of characters.

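A Unicode code point fits in 32 bits (wchar_t is 32 bits on most Unix-like
platforms, though only 16 bits on Windows), and both UTF-8 and UTF-16 need
at most 4 bytes per code point. That is the maximum space needed to be
certain of holding a block of a given number of unknown Unicode code
points*. That quantity is not format-dependent.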
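
To make the worst case concrete, here is a minimal Python sketch (my own
illustration, not something from the thread; the helper names are
hypothetical). It measures the highest code point, U+10FFFF, and the last
BMP code point, U+FFFF, with Python's built-in codecs. As far as I can tell,
the 50% figure only appears if the input is restricted to the BMP, where
UTF-8 tops out at 3 bytes and UTF-16 at 2.

```python
# Worst-case bytes per code point, measured with Python's built-in codecs.
# The "-le" codec variants are used so that no byte order mark is counted.
HIGHEST = chr(0x10FFFF)   # last code point in the Unicode codespace
BMP_MAX = chr(0xFFFF)     # last code point in the Basic Multilingual Plane


def encoded_len(text, codec):
    """Number of bytes `text` occupies in the given codec."""
    return len(text.encode(codec))


def worst_case_bytes(n_code_points):
    """Hypothetical helper: a buffer size that is safe for any n code points
    in either UTF-8 or UTF-16 (both max out at 4 bytes per code point)."""
    return 4 * n_code_points


for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    print(codec, encoded_len(HIGHEST, codec), encoded_len(BMP_MAX, codec))
```

So a buffer sized for n arbitrary code points needs 4*n bytes in either
format; the formats only differ once you know something about the data.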

What is certain is that, for the ASCII subset, the UTF-16 encoding requires
exactly double the space of the UTF-8 encoding. For any other measurement,
you would first need to define a representative data sample.
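
As a rough illustration of why the sample matters, here is a small Python
sketch. The three sample strings are arbitrary choices of mine, not a
representative data set, and the `sizes` helper is hypothetical. It simply
counts the encoded bytes of each string in both formats.

```python
# Compare encoded sizes of a few arbitrary sample strings.
# "utf-16-le" is used so that the byte order mark is not counted.
SAMPLES = {
    "ascii": "hello world",
    "latin": "caf\u00e9",            # "cafe" with a precomposed e-acute
    "cjk": "\u65e5\u672c\u8a9e",     # three CJK ideographs, all in the BMP
}


def sizes(text):
    """Return (utf8_bytes, utf16_bytes) for `text`."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for name, text in SAMPLES.items():
    utf8_bytes, utf16_bytes = sizes(text)
    print(f"{name}: utf-8={utf8_bytes} utf-16={utf16_bytes}")
```

The ASCII sample doubles in UTF-16, while the CJK sample is smaller in
UTF-16, so which format is more compact depends entirely on the text.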

* My understanding is that Unicode doesn't have a definition for
"character" per se, and that "code point" is the more accurate term for
indicating a particular index into the codespace.

> UTF-16, but with no surrogate pairs, when there are any characters >
> 0xff, but none > 0xffff

Isn't that technically UCS-2, not UTF-16?
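
For what it's worth, the difference only shows up above 0xffff. The Python
sketch below is my own illustration (the helper names `utf16_code_units`
and `fits_ucs2` are hypothetical). It splits a string into 16-bit code
units to show that a BMP code point is a single unit, while anything beyond
the BMP needs a surrogate pair, which UCS-2 cannot express.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units of `text` encoded as UTF-16."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


def fits_ucs2(text):
    """True if every code point is in the BMP, so no surrogates are needed."""
    return all(ord(ch) <= 0xFFFF for ch in text)


bmp = "\u20ac"          # U+20AC, inside the BMP
astral = "\U0001F600"   # U+1F600, outside the BMP

print([hex(unit) for unit in utf16_code_units(bmp)], fits_ucs2(bmp))
print([hex(unit) for unit in utf16_code_units(astral)], fits_ucs2(astral))
```

A string for which `fits_ucs2` is true has the same bytes in UCS-2 and
UTF-16, so the two names describe the same data in that case.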


On Sun, Sep 27, 2015 at 4:29 PM Scott Jones <scott.paul.jo...@gmail.com>
wrote:

> UTF-16 is much faster in many situations than UTF-8.
>
> It really depends a lot on just what you are doing, and the data you are
> processing.
> If it is mainly in North/South America, Western Europe, or Australia/NZ,
> UTF-8 does OK.
> UTF-8 is great for data interchange, but can really slow things down if
> you have many non-ASCII characters
> (as well as bloat the size of any buffers you need - because you'll need
> to allocate 50% more space than for UTF-16, to be sure you can hold the
> same # of characters).
>
> UTF-16 is used by Windows APIs, but also ICU, Java, C++ UnicodeString.
> Python 3 actually picks a 1,2,4 byte representation depending on what
> characters are in the string (so UTF-16, but with no surrogate pairs, when
> there are any characters > 0xff, but none > 0xffff).
>
> Scott
>
