On 2017-05-04 09:56, Tony Whyman via Lazarus wrote:
> I don't believe that string indexing even works for UTF8 strings at
> present - at least not in a simple s[i] way.

It's simple: STOP using array indexes into strings. It doesn't work for
Unicode! Use specialised code-point iterators or something similar instead.

If you expect a byte value from s[i], then fine. But if you expect a
"character" (something you see on the screen), then no, it will never
work. Why? See below:

  * UTF-16: s[i] returns a 2-byte code unit, which isn't big enough to
    cover the full Unicode range. It covers only the BMP; anything above
    the BMP needs a surrogate pair (two code units).

  * UTF-8: s[i] returns a 1-byte value, which again isn't big enough to
    cover all possible code points in Unicode. In UTF-8 a single code
    point takes anywhere from 1 to 4 bytes.

  * A "character seen on the screen" could be made up of multiple code
    points, e.g. U+0065 (e) + U+0302 (combining ^) gives you ê. So although
    it might look like one "character", it is *not*.

How is array indexing into a string supposed to handle this? It can't,
unless it first normalises all Unicode strings, but even that will not
work in all cases, because not every combining sequence has a precomposed
form to normalise to.

Regards,
  Graeme

-- 
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/

My public PGP key: http://tinyurl.com/graeme-pgp
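
P.S. A rough sketch of the three points above. It is written in Python only because its standard library exposes UTF-8/UTF-16 encoding and normalisation directly. The helper names (utf8_units, utf16_units, code_points, nfc_code_points) are made up for this illustration, and none of this is Free Pascal or Lazarus API. It counts how many storage units each sample needs, and shows that NFC normalisation collapses e + U+0302 into one code point but leaves q + U+0302 as two, since (as far as I know) Unicode has no precomposed q-circumflex.

```python
import unicodedata


def utf8_units(text):
    """Number of bytes (UTF-8 code units) needed to store text."""
    return len(text.encode("utf-8"))


def utf16_units(text):
    """Number of 16-bit code units needed to store text."""
    return len(text.encode("utf-16-le")) // 2


def code_points(text):
    """Number of Unicode code points (Python's str is code-point indexed)."""
    return len(text)


def nfc_code_points(text):
    """Number of code points left after NFC (composing) normalisation."""
    return len(unicodedata.normalize("NFC", text))


# Illustrative samples only; the labels are descriptive, not official names.
samples = {
    "plain e": "e",
    "precomposed e-circumflex": "\u00EA",
    "e + combining circumflex": "e\u0302",
    "q + combining circumflex": "q\u0302",
    "emoji above the BMP": "\U0001F600",
}

for label, text in samples.items():
    print(
        f"{label:26} utf8={utf8_units(text)} utf16={utf16_units(text)} "
        f"code points={code_points(text)} after NFC={nfc_code_points(text)}"
    )

# Indexing the raw UTF-8 bytes of "e + combining circumflex" at position 1
# yields only the first byte of U+0302, not a character.
raw = "e\u0302".encode("utf-8")
print(hex(raw[1]))
```

Even the code-point count is not the "characters on screen" count: the two combining samples each render as one glyph but hold two code points, and normalisation fixes only the first of them. Counting what the user actually sees needs grapheme-cluster segmentation, which this sketch does not attempt.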
