On 16/08/17 11:05, Juha Manninen via Lazarus wrote:
> On Mon, Aug 14, 2017 at 4:21 PM, Tony Whyman via Lazarus
> <lazarus@lists.lazarus-ide.org> wrote:
>> UTF-16/Unicode can only store 65,536 characters while the Unicode standard
>> (that covers UTF8 as well) defines 136,755 characters.
>> UTF-16/Unicode's main advantage seems to be for rapid indexing of large
>> strings.
> That shows complete ignorance from your side about Unicode.
> You consider UTF-16 as a fixed-width encoding.  :(
> Unfortunately many other programmers had the same wrong idea or they
> were just lazy. The result anyway is a lot of broken UTF-16 code out
> there.
You do like to use the word "ignorance", don't you? You can, if you want, take the view that all the "other programmers" who got the wrong idea are "stupid monkeys that don't know any better" or, alternatively, that they just wanted a nice cup of tea rather than the not-quite-tea drink that was served up.

Wikipedia sums the problem up nicely: "The early 2-byte encoding was usually called "Unicode", but is now called "UCS-2". UCS-2 differs from UTF-16 by being a constant length encoding and only capable of encoding characters of BMP, it is supported by many programs."

This is where the problem starts. The definition of "Unicode" was changed (foolishly, in my opinion) after it had been accepted by the community, and the result is confusion. Hence my first point about not even using the term. In writing "UTF-16/Unicode" I was trying to convey the common use of the term, which is to treat UTF-16 as what is now defined as UCS-2. This is because hardly anyone I know says "UCS-2"; they say "Unicode" instead. Perhaps I just spend too much time amongst the ignorant.

Wikipedia also makes the wonderful point that "The UTF-16 encoding scheme was developed as a compromise to resolve this impasse in version 2.0". The impasse having resulted from "4 bytes per character wasted a lot of disk space and memory, and because some manufacturers were already heavily invested in 2-byte-per-character technology".

Finally: "In UTF-16, code points greater or equal to 2^16 are encoded using /two/ 16-bit code units. The standards organizations chose the largest block available of un-allocated 16-bit code points to use as these code units (since most existing UCS-2 data did not use these code points and would be valid UTF-16). Unlike UTF-8 they did not provide a means to encode these code points".
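To make that concrete, here is a small Free Pascal sketch of the surrogate arithmetic (U+1F600 is just an arbitrary example of a code point outside the BMP): subtract $10000, then split the remaining 20 bits across a lead surrogate ($D800..$DBFF) and a trail surrogate ($DC00..$DFFF).

program SurrogateDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;
var
  cp, hi, lo: Cardinal;
begin
  cp := $1F600;                  // U+1F600, a code point outside the BMP
  Dec(cp, $10000);               // leaves a 20-bit value
  hi := $D800 or (cp shr 10);    // lead surrogate: top 10 bits
  lo := $DC00 or (cp and $3FF);  // trail surrogate: bottom 10 bits
  WriteLn('U+1F600 -> ', IntToHex(hi, 4), ' ', IntToHex(lo, 4)); // D83D DE00
end.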

This is where I get my own view that UTF-16, as defined by the standards, is pointless. If you keep to a UCS-2-like subset then you get rapid indexing of character arrays. But as soon as you introduce the possibility of some characters being encoded as two 16-bit units, you lose rapid indexing and I can see no advantage over UTF-8 - plus you get all the fun of worrying about byte order.
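A minimal sketch of what goes wrong (again using U+1F600 as the example): once a string contains a character outside the BMP, Length() and s[i] operate on 16-bit code units, not characters.

program IndexingDemo;
{$mode objfpc}{$H+}
var
  s: UnicodeString;
begin
  // One character, U+1F600, stored as the surrogate pair D83D DE00
  SetLength(s, 2);
  s[1] := WideChar($D83D);
  s[2] := WideChar($DE00);
  WriteLn(Length(s));            // prints 2 code units, not 1 character
  WriteLn(Ord(s[1]) = $D83D);    // TRUE: s[1] is only half a character
end.

Any code that assumes s[i] is the i-th character silently breaks on input like that - which is exactly the rapid indexing you lose.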

Indeed, I believe the lazy programmers you referred to are actually making a conscious decision to work with a 16-bit-only UTF-16 subset (i.e. the Basic Multilingual Plane) precisely so that they can do rapid indexing. As soon as you bring in code points encoded as two 16-bit code units, you lose that benefit - and perhaps you should be using UTF-32 instead.

IMHO, Linux has got it right by using UTF-8 as the standard character encoding, and one of Lazarus's USPs is that it follows that lead - even on Windows. I can see why a program that does intensive text scanning might use UTF-16 constrained to the BMP (i.e. 16-bit code units only), but not why anyone would prefer an unconstrained UTF-16 over UTF-8.
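For what it's worth, here is a rough sketch of how that works out in practice in Lazarus, assuming the LazUTF8 unit from the LazUtils package: the helpers there are code-point aware, so the variable width of UTF-8 only matters where you genuinely need character counts rather than bytes.

program Utf8Demo;
{$mode objfpc}{$H+}
uses
  LazUTF8;
var
  s: string;
begin
  // U+1F600 followed by 'abc'; the emoji is hard-coded as its four
  // UTF-8 bytes to avoid source-file encoding issues in this sketch
  s := #$F0#$9F#$98#$80 + 'abc';
  WriteLn(Length(s));         // 7: Length still counts bytes
  WriteLn(UTF8Length(s));     // 4: code-point-aware length
  WriteLn(UTF8Copy(s, 2, 3)); // 'abc': copy by character, not by byte
end.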
