> As I understand it iterating over a string with Chars does get around the problem of surrogate pairs

It depends what you mean by "get around the problem".

    for c in s do
      WorkWith(c);

will iterate once for each c (a WideChar) in s. Some of those c's may belong to surrogate pairs, but you will get only one half of a pair at a time.

So if your WorkWith() routine simply ignores surrogate pairs then yes, you got around the problem. But if WorkWith() needs to work on discrete codepoints beyond the BMP then you have some extra work to do before you can call WorkWith(), and you must call it with a UTF-32 parameter, NOT a UTF-16 WideChar (unless WorkWith() has some way of keeping track of calls made to it and combining surrogates for itself, which is unlikely I think).

But crucially, "for c in s" is absolutely no different from:

    for i := 1 to Length(s) do
      WorkWith(s[i]);

They do exactly the same thing - namely iterate over each WideChar in the string.

> as any character you are currently on might be either 1,2 or more bytes if it contains surrogate pairs, but just one unicode character

This makes no sense. *Every* character (WideChar) that you "are on" will be 2 bytes. No more. No less. The number of the bytes shall be 2, and 2 shall be the number.

What those 2 bytes represent may be either a complete Unicode codepoint (in the BMP) or one half (hi or lo) of a surrogate pair, whose two halves must be combined to derive the codepoint they represent.

> what do you use instead of length to get the number of characters in the string in general?

Length(s) returns the number of WideChars: the number of "n" for which s[n] is valid.

> length is not the number of characters, it's the number of code-points (including surrogate pairs counted separately) if I understand correctly.

Nope - you understand incorrectly. :) Length(s) counts WideChars (UTF-16 code units), not codepoints, so a codepoint outside the BMP contributes 2 to Length(s).

> Separate issue - I understand that if one wants to iterate over the bytes of a string then one uses byte rather than char, and then one does have to investigate each byte to see if it is part of a surrogate pair.

No, this is what you have to do with WideChars in a string. You use bytes if you don't care about the characters at all and simply want to work with the raw byte data. Unlikely in the context of the questions you are asking here, I would add.
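To make the "extra work" concrete, here is a rough sketch of walking a string one codepoint at a time and counting codepoints instead of WideChars. It assumes Delphi 2009 or later (where string is a UnicodeString of 2-byte WideChars). IsHighSurrogate, IsLowSurrogate, NextCodePoint and CountCodePoints are names made up for this example, not RTL routines; only Length, Ord, Inc, Writeln and the UCS4Char type come from the RTL. Treat it as an illustration rather than production code: an unpaired surrogate is simply passed through as-is.

```pascal
program CodePointDemo;

{$APPTYPE CONSOLE}

// Hypothetical helpers, written out by hand for illustration.

// True if c is the first (high) half of a surrogate pair.
function IsHighSurrogate(c: WideChar): Boolean;
begin
  Result := (Ord(c) >= $D800) and (Ord(c) <= $DBFF);
end;

// True if c is the second (low) half of a surrogate pair.
function IsLowSurrogate(c: WideChar): Boolean;
begin
  Result := (Ord(c) >= $DC00) and (Ord(c) <= $DFFF);
end;

// Decodes the codepoint that starts at s[i] and advances i past it.
// A well-formed pair consumes 2 WideChars; anything else (including
// an unpaired surrogate) consumes 1 and is returned unchanged.
function NextCodePoint(const s: UnicodeString; var i: Integer): UCS4Char;
begin
  if IsHighSurrogate(s[i]) and (i < Length(s)) and IsLowSurrogate(s[i + 1]) then
  begin
    Result := UCS4Char($10000 + ((Ord(s[i]) - $D800) shl 10)
                              + (Ord(s[i + 1]) - $DC00));
    Inc(i, 2);
  end
  else
  begin
    Result := UCS4Char(Ord(s[i]));
    Inc(i);
  end;
end;

// Counts codepoints, i.e. a surrogate pair counts once.
function CountCodePoints(const s: UnicodeString): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := 1;
  while i <= Length(s) do
  begin
    NextCodePoint(s, i);  // result not needed here, only the advance of i
    Inc(Result);
  end;
end;

var
  s: UnicodeString;
  i: Integer;
  cp: UCS4Char;
begin
  // 'a', then U+1D11E written as its surrogate pair D834 DD1E, then 'b'
  s := 'a'#$D834#$DD1E'b';

  Writeln('Length(s)       = ', Length(s));           // 4 WideChars
  Writeln('CountCodePoints = ', CountCodePoints(s));  // 3 codepoints

  // This is the loop you would use in place of "for c in s" when
  // WorkWith() needs whole codepoints: pass it cp, not a WideChar.
  i := 1;
  while i <= Length(s) do
  begin
    cp := NextCodePoint(s, i);
    Writeln(cp);  // decimal value of each codepoint
  end;
end.
```

Note that NextCodePoint still indexes the string WideChar by WideChar. The only difference from the plain loop is that it peeks at s[i + 1] when s[i] is a high surrogate and steps over both halves together.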