John, the problem is that in Unicode a "single character" is meaningless unless you have performed some pre-processing to GIVE that term some meaning. There are standard forms for such processing, called "normalisation forms" (NFC, NFD, NFKC and NFKD).
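As a rough illustration of what normalisation does, here is a small sketch. It is written in Python rather than Delphi, purely because Python's standard library ships `unicodedata.normalize`; the helper `codepoints` and the variable names are made up for this example. It shows one visible character stored in two different ways, and that the two only compare equal once both have been normalised to the same form.

```python
import unicodedata

# Two ways of storing what looks like the same accented "a".
precomposed = "\u00E1"    # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"    # U+0061 "a" followed by U+0301 COMBINING ACUTE ACCENT


def codepoints(s):
    """Hypothetical helper: list the codepoints of s as 'U+XXXX' labels."""
    return ["U+%04X" % ord(ch) for ch in s]


print(codepoints(precomposed), codepoints(decomposed))
print(precomposed == decomposed)    # the raw data differs

# NFC composes: both strings become the single codepoint U+00E1.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b, codepoints(nfc_b))

# NFD decomposes: the single codepoint becomes "a" + combining accent.
nfd_a = unicodedata.normalize("NFD", precomposed)
print(codepoints(nfd_a))
```

Which form you normalise to matters less than normalising both sides to the SAME form before you compare or iterate.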
The problem is that a single "character" to your eyes, e.g. an accented "a", could be represented in a Unicode string in at least two ways:

1. A single codepoint representing that accented "a"
2. TWO codepoints - the first representing "a" and the second a combining diacritic codepoint for the accent

> Iterating over a string is for the purpose of doing something with each
> individual character

That's fine, but in Unicode what you have is a string not of characters but of codepoints. The concept of a "character" is not synonymous with "codepoint" in Unicode in the way that it is with ASCII or even ANSI.

So you have compounded complications:

a. Depending on encoding, a single codepoint (a value from U+0000 to U+10FFFF, which fits in 32 bits) may be encoded in 1, 2, or more bytes. Each byte may represent a whole codepoint or only part of a codepoint encoding.
b. Each codepoint may represent a whole character or only PART of a character.

Complication 'a' can be avoided by adopting UTF-32 encoding - 4 bytes for EVERY codepoint. That is hugely wasteful in terms of memory/storage for most applications.

UTF-16 - the encoding used by Delphi and indeed natively by Windows itself - is a compromise. It is less efficient than ANSI for ASCII, but more efficient than UTF-32 for the ANSI character sets represented in the BMP (the Basic Multilingual Plane, codepoints U+0000 to U+FFFF). For applications working entirely in the BMP, UTF-16 is also relatively easy to process - for NORMALISED (composed) strings, each codepoint is, in most cases, a character (in the BMP). But for non-normalised data that is still not necessarily the case.

> could I build a string like this?
> setlength(String1,7);
> string1[1] := 'f';
> string1[2] := 'i';
> string1[3] := 'a';
> string1[4] := 'n';
> string1[5] := 'c';
> string1[6] := 'e';
> string1[7] := 'e'; //I would want the full e acute here

Yes, you can.
But you might also *receive*, from another source, a string that is apparently the same at the visual level but different at the data level, where:

  string1[1] = 'f';
  string1[2] = 'i';
  string1[3] = 'a';
  string1[4] = 'n';
  string1[5] = 'c';
  string1[6] = 'e';
  string1[7] = 'e';     // Normal 'e' character, i.e. identical to string1[6]
  string1[8] = U+0301;  // Combining acute diacritic

When displayed on screen this string will appear identical to your string, but it is represented in the data in a different way.

> hence I want to be able to go
> for i :=1 to length(string1) do
> begin
> ..
> end
> Now everything Jolyon are saying and Cary also implies that this is
> not going to work. This looks to be a real nuisance!

I don't know what gave you that impression from what I said. Yes, Unicode is/can be a real nuisance - *properly* supporting it is a lot more work than people think - but what you want to do here can be done.

> Now I think the e acute could be one unicode character (as there is likely
> to be a representation using one character, one code point and one code
> unit) or as one character, two code units, 2*2 bytes - a surrogate pair -
> where eg one supplies the e and one the acute.

NO!!! This is NOT what a surrogate pair is.

A surrogate pair is encountered ONLY in UTF-16, and is found when you have a codepoint that is not in the BMP, i.e. a value > 65535 that cannot be encoded in a single 16-bit value. Typical examples are the less common CJKV ideographs (Chinese/Japanese/Korean/Vietnamese), sometimes called Han or Kanji characters.

The first 16-bit value (the high surrogate) indicates a "page" in the non-BMP range. The following 16-bit value (the low surrogate) then identifies an entry in that "page". To obtain the codepoint that the PAIR of VALUES represents, you have to apply a transform, combining the page selector with the page entry. But what you get is a single codepoint. (You don't have to do this yourself - there are routines to do it for you, but you have to invoke them as appropriate.)
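For the curious, the transform can be sketched as follows. This is Python, not Delphi, and is only an illustration of the arithmetic: `decode_surrogate_pair` and `utf16_units` are hypothetical helpers written for this example, not library routines. The constants are the standard UTF-16 surrogate ranges (0xD800-0xDBFF for the first value, 0xDC00-0xDFFF for the second).

```python
def decode_surrogate_pair(high, low):
    """Combine a UTF-16 high/low surrogate pair into ONE codepoint."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a surrogate pair")
    # The high surrogate carries the top 10 bits, the low surrogate the
    # bottom 10 bits, of the codepoint's offset above 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf16_units(s):
    """Return the 16-bit code units of s when encoded as UTF-16."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


# U+20000 is a CJK ideograph outside the BMP: one codepoint, two code units.
units = utf16_units("\U00020000")
print([hex(u) for u in units])
print(hex(decode_surrogate_pair(units[0], units[1])))

# "e" + U+0301 is two BMP codepoints: also two code units, but neither
# value lies in a surrogate range, so this is not a surrogate pair.
combining = utf16_units("e\u0301")
print([hex(u) for u in combining])
```

Both strings occupy two 16-bit values, which is probably where the confusion comes from, but only the first is one codepoint split in two.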
A surrogate pair is a representation of a single codepoint, NOT a relationship between TWO codepoints.

When you have a visual character encoded as a codepoint + a following, combining codepoint, that is simply TWO Unicode codepoints that are combined to form one VISUAL "character". That is NOT a surrogate pair, however. It is merely two codepoints that have to be combined.

_______________________________________________
NZ Borland Developers Group - Delphi mailing list
Post: delphi@delphi.org.nz
Admin: http://delphi.org.nz/mailman/listinfo/delphi
Unsubscribe: send an email to delphi-requ...@delphi.org.nz with Subject: unsubscribe