> Yes, this is precisely my point - 'one or more'. The string-length with
> invalid embedded sequences is not guaranteed to be consistent, which seems
> like a problem. Doing a decode to ensure all points are valid - even if in
> the undefined sequences - seems to be a good idea to prevent secondary issues.
The validation is done in "utf8->string". Once a string from some other, unknown source has been created as an internal string object, any subsequent modifications will use valid UTF-8 sequences, unless you explicitly inject U+DCxx characters (the latter should probably be disallowed).

felix
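
P.S. For anyone following along, here is a rough sketch of the "validate once at the decode boundary" idea. It is written in Python rather than Scheme, and only as an analogy: Python's "surrogateescape" error handler maps each undecodable byte to a lone code point in U+DC80..U+DCFF, which resembles the U+DCxx characters mentioned above. I am not claiming that "utf8->string" is implemented this way; the helper names (decode_strict, decode_lenient, has_escaped_bytes) are made up for illustration.

```python
def decode_strict(raw: bytes) -> str:
    """Validate at the boundary: ill-formed UTF-8 raises UnicodeDecodeError."""
    return raw.decode("utf-8")


def decode_lenient(raw: bytes) -> str:
    """Keep undecodable bytes, mapping each one to a code point in U+DC80..U+DCFF."""
    return raw.decode("utf-8", errors="surrogateescape")


def has_escaped_bytes(s: str) -> bool:
    """True if the string carries any U+DC80..U+DCFF escape code points."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in s)


raw = b"ab\xffcd"              # 0xFF can never appear in well-formed UTF-8
text = decode_lenient(raw)     # the bad byte becomes U+DCFF

# One code point per bad byte, so the length is deterministic.
print(len(text), has_escaped_bytes(text))

# A strict encoder refuses the escape code points ...
try:
    text.encode("utf-8")
except UnicodeEncodeError:
    print("strict encode rejected the U+DCxx code point")

# ... while the matching lenient encoder restores the original bytes.
print(text.encode("utf-8", errors="surrogateescape") == raw)
```

In this model every string that passed through the decoder is either fully valid or carries explicitly marked escape code points, so later operations never see raw ill-formed byte sequences. A check like has_escaped_bytes is also roughly what disallowing explicit injection of U+DCxx characters would amount to.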
