On 16/08/17 11:05, Juha Manninen via Lazarus wrote:
> On Mon, Aug 14, 2017 at 4:21 PM, Tony Whyman via Lazarus
> <lazarus@lists.lazarus-ide.org> wrote:
>> UTF-16/Unicode can only store 65,536 characters while the Unicode standard
>> (that covers UTF8 as well) defines 136,755 characters.
>> UTF-16/Unicode's main advantage seems to be for rapid indexing of large
>> strings.
> That shows complete ignorance from your side about Unicode.
> You consider UTF-16 as a fixed-width encoding.  :(
> Unfortunately many other programmers had the same wrong idea or they
> were just lazy. The result anyway is a lot of broken UTF-16 code out
> there.
You do like to use the word "ignorance", don't you? You can, if you want, take the view that all the "other programmers" who got the wrong idea are "stupid monkeys that don't know any better" or, alternatively, that they just wanted a nice cup of tea rather than the not-quite-tea drink that was served up.

Wikipedia sums the problem up nicely: "The early 2-byte encoding was usually called "Unicode", but is now called "UCS-2". UCS-2 differs from UTF-16 by being a constant length encoding and only capable of encoding characters of BMP, it is supported by many programs."

This is where the problem starts. The definition of "Unicode" was changed (foolishly, in my opinion) after it had been accepted by the community, and the result is confusion. Hence my first point about not even using the term. In writing "UTF-16/Unicode" I was trying to convey the common use of the term, which is to treat UTF-16 as what is now defined as UCS-2. This is because hardly anyone I know says "UCS-2"; they say "Unicode" instead. Perhaps I just spend too much time amongst the ignorant.

Wikipedia also makes the wonderful point that "The UTF-16 encoding scheme was developed as a compromise to resolve this impasse in version 2.0". The impasse having resulted from "4 bytes per character wasted a lot of disk space and memory, and because some manufacturers were already heavily invested in 2-byte-per-character technology".

Finally: "In UTF-16, code points greater or equal to 2^16 are encoded using /two/ 16-bit code units. The standards organizations chose the largest block available of un-allocated 16-bit code points to use as these code units (since most existing UCS-2 data did not use these code points and would be valid UTF-16). Unlike UTF-8 they did not provide a means to encode these code points".
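To make that concrete, here is a small Free Pascal sketch of the surrogate arithmetic (U+1F600 is just an arbitrary example of a code point outside the BMP): subtract $10000, then split the remaining 20 bits across a lead surrogate ($D800..$DBFF) and a trail surrogate ($DC00..$DFFF).

program SurrogateDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;
var
  cp, hi, lo: Cardinal;
begin
  cp := $1F600;                  // U+1F600, a code point outside the BMP
  Dec(cp, $10000);               // leaves a 20-bit value
  hi := $D800 or (cp shr 10);    // lead surrogate: top 10 bits
  lo := $DC00 or (cp and $3FF);  // trail surrogate: bottom 10 bits
  WriteLn('U+1F600 -> ', IntToHex(hi, 4), ' ', IntToHex(lo, 4)); // D83D DE00
end.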

This is where I get my own view that UTF-16, as defined by the standards, is pointless. If you keep to a UCS-2-like subset then you get rapid indexing of character arrays. But as soon as you introduce the possibility of some characters being encoded as two 16-bit units, you lose rapid indexing and I can see no advantage over UTF-8 - plus you get all the fun of worrying about byte order.
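A minimal sketch of what goes wrong (again using U+1F600 as the example): once a string contains a character outside the BMP, Length() and s[i] operate on 16-bit code units, not characters.

program IndexingDemo;
{$mode objfpc}{$H+}
var
  s: UnicodeString;
begin
  // One character, U+1F600, stored as the surrogate pair D83D DE00
  SetLength(s, 2);
  s[1] := WideChar($D83D);
  s[2] := WideChar($DE00);
  WriteLn(Length(s));            // prints 2 code units, not 1 character
  WriteLn(Ord(s[1]) = $D83D);    // TRUE: s[1] is only half a character
end.

Any code that assumes s[i] is the i-th character silently breaks on input like that - which is exactly the rapid indexing you lose.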

Indeed, I believe the lazy programmers you referred to are actually making a conscious decision to work with a 16-bit-only UTF-16 subset (i.e. the Basic Multilingual Plane) precisely so that they can do rapid indexing. As soon as you bring in code points encoded as two 16-bit code units, you lose that benefit - and perhaps you should be using UTF-32 instead.

IMHO, Linux has got it right by using UTF-8 as the standard character encoding, and one of Lazarus's USPs is that it follows that lead - even on Windows. I can see why a program that does intensive text scanning might use UTF-16 constrained to the BMP (i.e. 16-bit code units only), but not why anyone would prefer an unconstrained UTF-16 over UTF-8.
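For what it's worth, here is a rough sketch of how that works out in practice in Lazarus, assuming the LazUTF8 unit from the LazUtils package: the helpers there are code-point aware, so the variable width of UTF-8 only matters where you genuinely need character counts rather than bytes.

program Utf8Demo;
{$mode objfpc}{$H+}
uses
  LazUTF8;
var
  s: string;
begin
  // U+1F600 followed by 'abc'; the emoji is hard-coded as its four
  // UTF-8 bytes to avoid source-file encoding issues in this sketch
  s := #$F0#$9F#$98#$80 + 'abc';
  WriteLn(Length(s));         // 7: Length still counts bytes
  WriteLn(UTF8Length(s));     // 4: code-point-aware length
  WriteLn(UTF8Copy(s, 2, 3)); // 'abc': copy by character, not by byte
end.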
