Re: What does it mean to not be a valid string in Unicode?
On 2013/01/08 14:43, Stephan Stiller wrote: Wouldn't the clean way be to ensure valid strings (only) when they're built Of course, the earlier erroneous data gets caught, the better. The problem is that error checking is expensive, both in lines of code and in execution time (I think there is data showing that in any real-life programs, more than 50% or 80% or so is error checking, but I forgot the details). So indeed as Ken has explained with a very good example, it doesn't make sense to check at every corner. and then make sure that string algorithms (only) preserve well-formedness of input? Perhaps this is how the system grew, but it seems to be that it's yet another legacy of C pointer arithmetic and about convenience of implementation rather than a safety or performance issue. Convenience of implementation is an important aspect in programming. Things like this are called garbage in, garbage-out (GIGO). It may be harmless, or it may hurt you later. So in this kind of a case, what we are actually dealing with is: garbage in, principled, correct results out. ;-) Sorry, but I have to disagree here. If a list of strings contains items with lone surrogates (garbage), then sorting them doesn't make the garbage go away, even if the items may be sorted in correct order according to some criterion. Regards, Martin.
Re: What does it mean to not be a valid string in Unicode?
Wouldn't the clean way be to ensure valid strings (only) when they're built Of course, the earlier erroneous data gets caught, the better. The problem is that error checking is expensive, both in lines of code and in execution time (I think there is data showing that in any real-life programs, more than 50% or 80% or so is error checking, but I forgot the details). So indeed as Ken has explained with a very good example, it doesn't make sense to check at every corner. What I meant: The idea was to check only when a string is constructed. As soon as it's been fed into a collation/whatever algorithm, the algorithm should assume the original input was well-formed and shouldn't do any more error-checking, yes. Not having facilities for dealing with ill-formed values (U+D800 .. U+DFFF) in an algorithm will surely make *something* faster, even if it's just some table that's being used indirectly having fewer entries. What I had in mind is a library where the public interface only ever allows Unicode scalar values to be in- and output. This will lead to a cleaner interface. A data structure that can hold surrogate values can and should be used algorithm-*internally*, if that makes things more efficient, safer, etc. Convenience of implementation is an important aspect in programming. For a user yes, but not for a library writer/maintainer, I would suggest. The STL uses red-black trees; these are annoyingly difficult to implement but invisible to the user. Stephan
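A minimal Java sketch of the kind of scalar-value-only public interface Stephan describes: the constructor rejects anything that is not a Unicode scalar value, so downstream algorithms may assume well-formed input. The class name and checks are illustrative only, not from any existing library.

    /** Illustrative wrapper: the public constructor rejects anything that is
     *  not a well-formed sequence of Unicode scalar values. */
    public final class ScalarString {
        private final int[] scalars; // scalar values only, never surrogates

        public ScalarString(int[] codePoints) {
            for (int cp : codePoints) {
                if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
                    throw new IllegalArgumentException(
                        "not a Unicode scalar value: 0x" + Integer.toHexString(cp));
                }
            }
            this.scalars = codePoints.clone();
        }

        /** Algorithms downstream may assume well-formed input and skip re-checking. */
        public int scalarAt(int index) { return scalars[index]; }
        public int length() { return scalars.length; }
    }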
RE: What does it mean to not be a valid string in Unicode?
Sorry, but I have to disagree here. If a list of strings contains items with lone surrogates (garbage), then sorting them doesn't make the garbage go away, even if the items may be sorted in correct order according to some criterion. Well, yeah, I wasn't claiming that the principled, correct output made the garbage go away. Let me put it this way: if my choices are 1) garbage in, garbage reliably sorted out into garbage bin, versus 2) garbage in, sorting fails with exception, then I'll pick #1. ;-) To give a concrete example, my implementation of UCA reliably passes the SHIFTED test cases in the conformance test, even though those test cases (deliberately) contain some ill-formed strings. If I instead did validation testing on input strings in my base implementation, it would be slower, *and* to pass the conformance test I would have to add a separate preprocessing stage that probed all the input data for ill-formed strings and filtered those cases out before engaging the test, so that it wouldn't fail with an exception when it hit the bad data. --Ken
Re: What does it mean to not be a valid string in Unicode?
Unicode libraries commonly provide functions that take a code point and return a value, for example a property value. Such a function normally accepts the whole range 0..10FFFF (and may even return a default value for out-of-range inputs). Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). That would not be well-formed UTF-16, but it's generally harmless in text processing. markus
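Java's standard library behaves the way Markus describes: when iterating a 16-bit string by code point, an unpaired surrogate is handed back as itself. A small demonstration (the example string is made up):

    public class CodePointDemo {
        public static void main(String[] args) {
            // "a", an unpaired high surrogate, and a Han character
            String s = "a\uD800\u4E00";
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);          // unpaired surrogate comes back as itself
                System.out.printf("U+%04X%n", cp);  // prints U+0061, U+D800, U+4E00
                i += Character.charCount(cp);       // 1 for BMP units and unpaired surrogates, 2 for pairs
            }
        }
    }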
RE: What does it mean to not be a valid string in Unicode?
Markus Scherer markus dot icu at gmail dot com wrote: Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). That would not be well-formed UTF-16, but it's generally harmless in text processing. But still non-conformant. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell
Re: What does it mean to not be a valid string in Unicode?
On Mon, Jan 7, 2013 at 10:48 AM, Doug Ewell d...@ewellic.org wrote: Markus Scherer markus dot icu at gmail dot com wrote: Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). That would not be well-formed UTF-16, but it's generally harmless in text processing. But still non-conformant. Not really, that's why there is a definition of a 16-bit Unicode string in the standard. markus
Re: What does it mean to not be a valid string in Unicode?
But still non-conformant. That's incorrect. The point I was making above is that in order to say that something is non-conformant, you have to be very clear what it is non-conformant *TO* . Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). - That *is* conformant for *Unicode 16-bit strings.* - That is *not* conformant for *UTF-16*. There is an important difference. Mark https://plus.google.com/114199149796022210033 * * *— Il meglio è l’inimico del bene —* ** On Mon, Jan 7, 2013 at 10:48 AM, Doug Ewell d...@ewellic.org wrote: But still non-conformant.
RE: What does it mean to not be a valid string in Unicode?
You're right, and I stand corrected. I read Markus's post too quickly. Mark Davis ☕ mark at macchiato dot com wrote: But still non-conformant. That's incorrect. The point I was making above is that in order to say that something is non-conformant, you have to be very clear what it is non-conformant TO. Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). + That is conformant for Unicode 16-bit strings. + That is not conformant for UTF-16. There is an important difference. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell
Re: What does it mean to not be a valid string in Unicode?
Well then I don't know why you need a definition of a Unicode 16-bit string. For me it just means exactly the same as 16-bit string, and the encoding in it is not relevant given you can put anything in it without even needing to be conformant to Unicode. So a Java string is exactly the same, a 16-bit string. The same also as Windows API 16-bit strings, or wide strings in a C compiler where wide is mapped by a compiler option to 16-bit code units for wchar_t (or short but more safely as UINT16 if you don't want to be dependent on compiler options or OS environments when compiling, when you need to manage the exact memory allocation), or the same as a U-string in Perl. Only UTF-16 (not UTF-16BE and UTF-16LE which are encoding schemes with concrete byte orders, without any leading BOM) is relevant to Unicode because a 16-bit string does not itself specify any encoding scheme or byte order. One confusion comes with the name UTF-16 when it is also used as an encoding scheme with a possible leading BOM and implied default UTF-16LE determined by guesses on the first few characters: this encoding scheme (with support of BOM and implicit guess of byte order if it's missing) should have been given a distinct encoding name like 'UTF-16XE'. Reserving UTF-16 for what the standard discusses as a 16-bit string, except that it should still require UTF-16 conformance (no unpaired surrogates and no non-characters) plus **no** BOM supported for this level (which is still not materialized by a concrete byte order or by an implicit size in storage bits, as long as it can store distinctly the whole range of code units 0x0000..0xFFFF minus the few non-characters, and enforces all surrogates to be paired, but does not enforce any character to be allocated). Note that such a relaxed version of UTF-16 would still allow an internal alternate representation of 0x0000 for interoperating with various APIs without changing the storage requirement: 0xFFFF could perfectly be used to replace 0x0000 if that last code unit plays a special role as a string terminator. But even if this is done, a storage unit like 0xFFFF would still be perceived as if it was really the code unit 0x0000. In other words, the concept of completely relaxed Unicode 16-bit string is unneeded, given that its single requirement is to make sure that it defines a length in terms of 16-bit code units, and code units being large enough to store any unsigned 16-bit value (internally it could still be 18-bit on systems with 6-bit or 9-bit addressable memory cells; the sizeof() property of these code units could still be 2, or 3, or other, as long as it is large enough to store the value). On some devices (not so exotic...) there are memory areas that are 4-bit addressable or even 1-bit addressable (in that latter case the sizeof() property for the code unit type would return 16, not 2). Some devices only have 16-bit or 32-bit addressable memory and sizeof() would return 1 (and the C types char and wchar_t would most likely be the same). 2013/1/7 Doug Ewell d...@ewellic.org: You're right, and I stand corrected. I read Markus's post too quickly. Mark Davis ☕ mark at macchiato dot com wrote: But still non-conformant. That's incorrect. The point I was making above is that in order to say that something is non-conformant, you have to be very clear what it is non-conformant TO. Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). + That is conformant for Unicode 16-bit strings. 
+ That is not conformant for UTF-16. There is an important difference. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell
RE: What does it mean to not be a valid string in Unicode?
Philippe Verdy said: Well then I don't know why you need a definition of an Unicode 16-bit string. For me it just means exactly the same as 16-bit string, and the encoding in it is not relevant given you can put anything in it without even needing to be conformant to Unicode. So a Java string is exactly the same, a 16-bit string. The same also as Windows API 16-bit strings, or wide strings in a C compiler where wide is mapped by a compiler option to 16-bit code units for wchar_t ... And elaborating on Mark's response a little:

[0x0061, 0x0062, 0x4E00, 0xFFFF, 0x0410]

is a Unicode 16-bit string. It contains a, b, a Han character, a noncharacter, and a Cyrillic character. Because it is also well-formed as UTF-16, it is also a UTF-16 string, by the definitions in the standard.

[0x0061, 0xD800, 0x4E00, 0xFFFF, 0x0410]

is a Unicode 16-bit string. It contains a, a high-surrogate code unit, a Han character, a noncharacter, and a Cyrillic character. Because an unpaired high-surrogate code unit is not allowed in UTF-16, this is *NOT* a UTF-16 string. On the other hand, consider:

[0x0061, 0x0062, 0x88EA, 0x8440]

That is *NOT* a Unicode 16-bit string. It contains a, b, a Han character, and a Cyrillic character. How do I know? Because I know the character set context. It is a wchar_t implementation of the Shift-JIS code page 932. The difference is the declaration of the standard one uses to interpret what the 16-bit units mean. In a Unicode 16-bit string I go to the Unicode Standard to figure out how to interpret the numbers. In a wide Code Page 932 string I go to the specification of Code Page 932 to figure out how to interpret the numbers. This is no different, really, than talking about a Latin-1 string versus a KOI-8 string. --Ken
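As an illustration of the distinction Ken draws, a small Java sketch (the helper name is made up) that tests whether a Unicode 16-bit string is also well-formed UTF-16:

    /** Sketch: true if the 16-bit Unicode string is also well-formed UTF-16,
     *  i.e. every surrogate code unit is part of a high+low pair. */
    static boolean isWellFormedUtf16(char[] units) {
        for (int i = 0; i < units.length; i++) {
            char u = units[i];
            if (Character.isHighSurrogate(u)) {
                if (i + 1 >= units.length || !Character.isLowSurrogate(units[i + 1])) {
                    return false;   // unpaired high surrogate
                }
                i++;                // skip the low surrogate of the pair
            } else if (Character.isLowSurrogate(u)) {
                return false;       // low surrogate with no preceding high surrogate
            }
        }
        return true;
    }

Applied to Ken's second example, with the unpaired 0xD800, this returns false; the first example returns true even though it contains a noncharacter.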
RE: What does it mean to not be a valid string in Unicode?
Philippe also said: ... Reserving UTF-16 for what the stadnard discusses as a 16-bit string, except that it should still require UTF-16 conformance (no unpaired surrogates and no non-characters) ... For those following along, conformance to UTF-16 does *NOT* require no non-characters. Noncharacters are perfectly valid in UTF-16. --Ken
Re: What does it mean to not be a valid string in Unicode?
On 2013/01/08 3:27, Markus Scherer wrote: Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). That would not be well-formed UTF-16, but it's generally harmless in text processing. Things like this are called garbage in, garbage-out (GIGO). It may be harmless, or it may hurt you later. Regards, Martin.
Re: What does it mean to not be a valid string in Unicode?
That's not the point (see successive messages). Mark https://plus.google.com/114199149796022210033 * * *— Il meglio è l’inimico del bene —* ** On Mon, Jan 7, 2013 at 4:59 PM, Martin J. Dürst due...@it.aoyama.ac.jpwrote: On 2013/01/08 3:27, Markus Scherer wrote: Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). That would not be well-formed UTF-16, but it's generally harmless in text processing. Things like this are called garbage in, garbage-out (GIGO). It may be harmless, or it may hurt you later. Regards, Martin.
RE: What does it mean to not be a valid string in Unicode?
Martin, The kind of situation Markus is talking about is illustrated particularly well in collation. And there is a section 7.1.1 in UTS #10 specifically devoted to this issue: http://www.unicode.org/reports/tr10/#Handline_Illformed When weighting Unicode 16-bit strings for collation, you can, of course, always detect an unpaired surrogate and return an error code or throw an exception, but that may not be the best strategy for an implementation. The problem derives in part from the fact that for sorting, the comparison routine is generally buried deep down as a primitive comparison function in what may be a rather complicated sorting algorithm. Those algorithms often assume that the comparison routine is analogous to strcmp(), and will always return -1/0/1 (or negative/0/positive), and that it is not going to fail because it decides that some byte value in an input string is not valid in some particular character encoding. (Of course, the calling code needs to ensure it isn't handing off null pointers or unallocated objects, but that is par for the course for any string handling.) Now if I want to adapt a particular sorting algorithm so it uses a UCA-compliant, multi-level collation algorithm for the actual string comparison, then by far the easiest way to do so is to build a function essentially comparable to strcmp() in structure, e.g. UCA_strcmp(context, string1, string2), which also always returns -1/0/1 for any two Unicode 16-bit strings. If I introduce a string validation aspect to this comparison routine, and return an error code or raise an exception, then I run the risk of marginally slowing down the most time-critical part of the sorting loop, as well as complicating the adaptation of the sorting code, to deal with extra error conditions. It is faster, more reliable and robust, and easier to adapt the code, if I simply specify for the weighting exactly what happens to any isolated surrogate in input strings, and compare accordingly. Hence the two alternative strategies suggested in Section 7.1.1 of UTS #10: either weight each maximal ill-formed subsequence as if it were U+FFFD (with a primary weight), or weight each surrogate code point with a generated implicit weight, as if it were an unassigned code point. Either strategy works. And in fact, the conformance tests in CollationTest.zip for UCA include some ill-formed strings in the test data, so that implementations can test their handling of them, if they choose. So in this kind of a case, what we are actually dealing with is: garbage in, principled, correct results out. ;-) --Ken -Original Message- On 2013/01/08 3:27, Markus Scherer wrote: Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). That would not be well-formed UTF-16, but it's generally harmless in text processing. Things like this are called garbage in, garbage-out (GIGO). It may be harmless, or it may hurt you later. Regards, Martin.
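A rough Java sketch of the first strategy Ken cites from UTS #10 Section 7.1.1: inside the weighting loop, an unpaired surrogate is simply weighted as if it were U+FFFD, so the comparison function never has to fail on ill-formed input. The function name is hypothetical and not taken from any particular UCA implementation.

    /** Sketch: return the next code point to weight, mapping any unpaired
     *  surrogate to U+FFFD instead of raising an error. The caller advances
     *  by Character.charCount(result), which matches the units consumed. */
    static int nextCodePointForWeighting(CharSequence s, int index) {
        char u = s.charAt(index);
        if (Character.isHighSurrogate(u) && index + 1 < s.length()
                && Character.isLowSurrogate(s.charAt(index + 1))) {
            return Character.toCodePoint(u, s.charAt(index + 1)); // well-formed pair
        }
        if (Character.isSurrogate(u)) {
            return 0xFFFD;  // unpaired surrogate: weight as REPLACEMENT CHARACTER
        }
        return u;           // ordinary BMP code unit
    }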
RE: What does it mean to not be a valid string in Unicode?
http://www.unicode.org/reports/tr10/#Handline_Illformed Grrr. http://www.unicode.org/reports/tr10/#Handling_Illformed I seem unable to handle ill-formed spelling today. :( --Ken
Re: What does it mean to not be a valid string in Unicode?
Things like this are called garbage in, garbage-out (GIGO). It may be harmless, or it may hurt you later. So in this kind of a case, what we are actually dealing with is: garbage in, principled, correct results out. ;-) Wouldn't the clean way be to ensure valid strings (only) when they're built and then make sure that string algorithms (only) preserve well-formedness of input? Perhaps this is how the system grew, but it seems to me that it's yet another legacy of C pointer arithmetic and about convenience of implementation rather than a safety or performance issue. Stephan
Re: What does it mean to not be a valid string in Unicode?
In practice and by design, treating isolated surrogates the same as reserved code points in processing, and then cleaning up on conversion to UTFs works just fine. It is a tradeoff that is up to the implementation. It has nothing to do with a legacy of C pointer arithmetic. It does represent a pragmatic choice some time ago, but there is no need getting worked up about it. Human scripts and their representation on computers is quite complex enough; in the grand scheme of things the handling of surrogates in implementations pales in significance. Mark https://plus.google.com/114199149796022210033 * * *— Il meglio è l’inimico del bene —* ** On Mon, Jan 7, 2013 at 9:43 PM, Stephan Stiller stephan.stil...@gmail.comwrote: Things like this are called garbage in, garbage-out (GIGO). It may be harmless, or it may hurt you later. So in this kind of a case, what we are actually dealing with is: garbage in, principled, correct results out. ;-) Wouldn't the clean way be to ensure valid strings (only) when they're built and then make sure that string algorithms (only) preserve well-formedness of input? Perhaps this is how the system grew, but it seems to be that it's yet another legacy of C pointer arithmetic and about convenience of implementation rather than a safety or performance issue. Stephan
Re: What does it mean to not be a valid string in Unicode?
Some of this is simply historical: had Unicode been designed from the start with 8- and 16-bit forms in mind, some of this could be avoided. But that is water long under the bridge. Here is a simple example of why we have both UTFs and Unicode Strings. Java uses Unicode 16-bit Strings. The following code is copying all the code units from string to buffer.

    StringBuilder buffer = new StringBuilder();
    for (int i = 0; i < string.length(); ++i) {
        buffer.append(string.charAt(i));
    }

If Java always enforced well-formedness of strings, then
1. The above code would break, since there is an intermediate step where buffer is ill-formed (when just the first of a surrogate pair has been copied).
2. It would involve extra checks in all of the low-level string code, with some impact on performance.
Newer implementations of strings, such as Python's, can avoid these issues because they use a Uniform Model, always dealing in code points. (A code-point-based variant of the loop above is sketched after this message.) For more information, see also http://macchiati.blogspot.com/2012/07/unicode-string-models-many-programming.html (There are many, many discussions of this in the Unicode email archives if you have more questions.) Mark https://plus.google.com/114199149796022210033 * * *— Il meglio è l’inimico del bene —* ** On Sat, Jan 5, 2013 at 11:14 PM, Stephan Stiller stephan.stil...@gmail.comwrote: If for example I sit on a committee that devises a new encoding form, I would need to be concerned with the question which *sequences of Unicode code points* are sound. If this is the same as sequences of Unicode scalar values, I would need to exclude surrogates, if I read the standard correctly (this wasn't obvious to me on first inspection btw). If for example I sit on a committee that designs an optimized compression algorithm for Unicode strings (yep, I do know about SCSU), I might want to first convert them to some canonical internal form (say, my array of non-negative integers). If U+surrogate values can be assumed to not exist, there are 2048 fewer values a code point can assume; that's good for compression, and I'll subtract 2048 from those large scalar values in a first step. Etc etc. So I do think there are a number of very general use cases where this question arises. In fact, these questions have arisen in the past and have found answers then. A present-day use case is if I author a programming language and need to decide which values for val I accept in a statement like this: someEncodingFormIndependentUnicodeStringType str = val, specified in some PL-specific way I've looked at the Standard, and I must admit I'm a bit perplexed. Because of C1, which explicitly states A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character. I do not know why surrogate values are defined as code points in the first place. It seems to me that surrogates are (or should be) an encoding form–specific notion, whereas I have always thought of code points as encoding form–independent. Turns out this was wrong. I have always been thinking that code point conceptually meant Unicode scalar value, which is explicitly forbidden to have a surrogate value. Is this only terminological confusion? I would like to ask: Why do we need the notion of a surrogate code point; why isn't the notion of surrogate code units [in some specific encoding form] enough? Conceptually surrogate values are byte sequences used in encoding forms (modulo endianness). 
Why would one define an expression (Unicode code point) that conceptually lumps Unicode scalar value (an encoding form–independent notion) and surrogate code point (a notion that I wouldn't expect to exist outside of specific encoding forms) together? An encoding form maps only Unicode scalar values (that is all Unicode code points excluding the surrogate code points), by definition. D80 and what follows (Unicode string and Unicode X-bit string) exist, as I understand it, *only* in order for us to be able to have terminology for discussing ill-formed code unit sequences in the various encoding forms; but all of this talk seems to me to be encoding form–dependent. I think the answer to the question I had in mind is that the legal sequences of Unicode scalar values are (by definition) ({U+0000, ..., U+10FFFF} \ {U+D800, ..., U+DFFF})* . But then there is the notion of Unicode string, which is conceptually different, by definition. Maybe this is a terminological issue only. But is there an expression in the Standard that is defined as sequence of Unicode scalar values, a notion that seems to me to be conceptually important? I can see that the Standard defines the various well-formed encoding form code unit sequence. Have I overlooked something? Why is it even possible to store a surrogate value in something like the icu::UnicodeString datatype? In other words, why are we concerned with storing Unicode *code points* in data structures instead
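For contrast with the code-unit loop in Mark's message above, a minimal sketch of the same copy written in terms of code points; with well-formed input the intermediate buffer never holds half of a surrogate pair. This variant is illustrative and not from the original message.

    StringBuilder buffer = new StringBuilder();
    for (int i = 0; i < string.length(); ) {
        int cp = string.codePointAt(i);   // reads a whole surrogate pair at once
        buffer.appendCodePoint(cp);       // appends both code units together
        i += Character.charCount(cp);     // advance by 1 or 2 code units
    }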
Re: What does it mean to not be a valid string in Unicode?
On Sun, Jan 6, 2013 at 12:34 PM, Mark Davis ☕ m...@macchiato.com wrote: [...] What you write and that the UTFs have historical artifact in their design makes sense to me. (There are many, many discussions of this in the Unicode email archives if you have more questions.) Okay. I am fine with ending this thread. *But ...* I do want to rephrase what baffled me just now. After sleeping over this, it's clearer what the issue was: Most Unicode discourse is about code points and talks about them, with the implication (everywhere, pretty much) that we're encoding *code points* in encoding forms. Maybe I've just read this into the discourse, but if Unicode discussions used the expression scalar value more, there would be no potential for such misunderstanding. (1) Any expression containing surrogate *should* be relevant only for UTF-16. (2) The notion of code point covers scalar values *plus* U+surrogate value. (3) The expression code point is used in an encoding form–independent context, for the most part. (4) So, it's very confusing to ever write surrogate values (say, D813_hex) in U+-notation. Surrogate values are UTF-16-internal byte values. Nobody should be thinking about them outside of UTF-16. Now the terminology is a jumble. Stephan
Re: What does it mean to not be a valid string in Unicode?
If for example I sit on a committee that devises a new encoding form, I would need to be concerned with the question which *sequences of Unicode code points* are sound. If this is the same as sequences of Unicode scalar values, I would need to exclude surrogates, if I read the standard correctly (this wasn't obvious to me on first inspection btw). If for example I sit on a committee that designs an optimized compression algorithm for Unicode strings (yep, I do know about SCSU), I might want to first convert them to some canonical internal form (say, my array of non-negative integers). If U+surrogate values can be assumed to not exist, there are 2048 fewer values a code point can assume; that's good for compression, and I'll subtract 2048 from those large scalar values in a first step. Etc etc. So I do think there are a number of very general use cases where this question arises. In fact, these questions have arisen in the past and have found answers then. A present-day use case is if I author a programming language and need to decide which values for val I accept in a statement like this: someEncodingFormIndependentUnicodeStringType str = val, specified in some PL-specific way I've looked at the Standard, and I must admit I'm a bit perplexed. Because of C1, which explicitly states A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character. I do not know why surrogate values are defined as code points in the first place. It seems to me that surrogates are (or should be) an encoding form–specific notion, whereas I have always thought of code points as encoding form–independent. Turns out this was wrong. I have always been thinking that code point conceptually meant Unicode scalar value, which is explicitly forbidden to have a surrogate value. Is this only terminological confusion? I would like to ask: Why do we need the notion of a surrogate code point; why isn't the notion of surrogate code units [in some specific encoding form] enough? Conceptually surrogate values are byte sequences used in encoding forms (modulo endianness). Why would one define an expression (Unicode code point) that conceptually lumps Unicode scalar value (an encoding form–independent notion) and surrogate code point (a notion that I wouldn't expect to exist outside of specific encoding forms) together? An encoding form maps only Unicode scalar values (that is all Unicode code points excluding the surrogate code points), by definition. D80 and what follows (Unicode string and Unicode X-bit string) exist, as I understand it, *only* in order for us to be able to have terminology for discussing ill-formed code unit sequences in the various encoding forms; but all of this talk seems to me to be encoding form–dependent. I think the answer to the question I had in mind is that the legal sequences of Unicode scalar values are (by definition) ({U+0000, ..., U+10FFFF} \ {U+D800, ..., U+DFFF})* . But then there is the notion of Unicode string, which is conceptually different, by definition. Maybe this is a terminological issue only. But is there an expression in the Standard that is defined as sequence of Unicode scalar values, a notion that seems to me to be conceptually important? I can see that the Standard defines the various well-formed encoding form code unit sequence. Have I overlooked something? Why is it even possible to store a surrogate value in something like the icu::UnicodeString datatype? 
In other words, why are we concerned with storing Unicode *code points* in data structures instead of Unicode *scalar values* (which can be serialized via encoding forms)? Stephan
What does it mean to not be a valid string in Unicode?
Hi Folks, In the book Fonts & Encodings (p. 61, first paragraph) it says: "... we select a substring that begins with a combining character, this new string will not be a valid string in Unicode." What does it mean to not be a valid string in Unicode? /Roger
RE: What does it mean to not be a valid string in Unicode?
Yannis' use of the terminology not ... a valid string in Unicode is a little confusing there. A Unicode string with the sequence, say, U+0300, U+0061 (a combining grave mark, followed by a), is valid Unicode in the sense that it just consists of two Unicode characters in a sequence. It is aberrant, certainly, but the way to describe that aberrancy is that the string starts with a defective combining character sequence (a combining mark, with no base character to apply to). And it would be non-conformant to the standard to claim that that sequence actually represented (or was equivalent to) the Latin small letter a-grave. (à) There is a second potential issue, which is whether any particular Unicode string is ill-formed or not. That issue comes up when examining actual code units laid out in memory in a particular encoding form. A Unicode string in UTF-8 encoding form could be ill-formed if the bytes don't follow the specification for UTF-8, for example. That is a separate issue from whether the string starts with a defective combining character sequence. For defective combining character sequence, see D57 in the standard. (p. 81) For ill-formed, see D84 in the standard. (p. 91) http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf --Ken In the book, Fonts Encodings (p. 61, first paragraph) it says: ... we select a substring that begins with a combining character, this new string will not be a valid string in Unicode. What does it mean to not be a valid string in Unicode? /Roger
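A small Java sketch of the "defective combining character sequence" condition Ken describes, approximating "combining mark" by the general categories Mn, Mc, and Me; the helper name is made up and this is a simplification of D57, not a full implementation.

    /** Sketch: does the string start with a defective combining character
     *  sequence, i.e. is its first code point a combining mark? */
    static boolean startsDefective(String s) {
        if (s.isEmpty()) return false;
        int first = s.codePointAt(0);
        int type = Character.getType(first);
        return type == Character.NON_SPACING_MARK       // Mn, e.g. U+0300
            || type == Character.COMBINING_SPACING_MARK // Mc
            || type == Character.ENCLOSING_MARK;        // Me
    }

For the example in the question, startsDefective("\u0300a") would return true: the string is a legitimate Unicode string, but it begins with a combining grave accent that has no base character.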
Re: What does it mean to not be a valid string in Unicode?
What does it mean to not be a valid string in Unicode? Is there a concise answer in one place? For example, if one uses the noncharacters just mentioned by Ken Whistler (intended for process-internal uses, but [...] not permitted for interchange), what precisely does that mean? /Naively/, all strings over the alphabet {U+0000, ..., U+10FFFF} seem valid, but section 16.7 clarifies that noncharacters are forbidden for use in open interchange of Unicode text data. I'm assuming there is a set of isValidString(...)-type ICU calls that deals with this? Yes, I'm sure this has been asked before and ICU documentation has an answer, but this page http://www.unicode.org/faq/utf_bom.html contains lots of distributed factlets where it's imo unclear how to add them up. An implementation can use characters that are invalid in interchange, but I wouldn't expect implementation-internal aspects of anything to be subject to any standard in the first place (so, why write this?). Also it makes me wonder about the runtime of the algorithm checking for valid Unicode strings of a particular length. Of course the answer is linear complexity-wise, but as it or a variation of it (depending on how one treats holes and noncharacters) will be dependent on the positioning of those special characters, how fast does this function perform in practice? This also relates to Markus Scherer's reply to the holes thread just now. Stephan
Re: What does it mean to not be a valid string in Unicode?
A Unicode string in UTF-8 encoding form could be ill-formed if the bytes don't follow the specification for UTF-8, for example. Given that answer, add in UTF-32 to my email just now, for simplicity's sake. Or let's simply assume we're dealing with some sort of sequence of abstract integers from 0_hex to 10FFFF_hex, to abstract away from encoding form issues. Stephan
RE: What does it mean to not be a valid string in Unicode?
One of the reasons why the Unicode Standard avoids the term “valid string” is that it immediately begs the question, valid *for what*? The Unicode string U+0061, U+FFFF, U+0062 is just a sequence of 3 Unicode characters. It is valid *for* use in internal processing, because for my own processing I can decide I need to use the noncharacter value U+FFFF for some internal sentinel (or whatever). It is not, however, valid *for* open interchange, because there is no conformant way by the standard (by design) for me to communicate to you how to interpret U+FFFF in that string. However, the string U+0061, U+FFFF, U+0062 is valid *as* an NFC-normalized Unicode string, because the normalization algorithm must correctly process all Unicode code points, including noncharacters. The Unicode string U+0061, U+E000, U+0062 contains a private use character U+E000. That is valid *for* open interchange, but it is not interpretable according to the standard itself. It requires an external agreement as to the interpretation of U+E000. The Unicode string U+0061, U+002A, U+0062 (“a*b”) is not valid *as* an identifier, because it contains a pattern-syntax character, the asterisk. However, it is certainly valid *for* use as an expression, for example. And so on up the chain of potential uses to which a Unicode string could be put. People (and particularly programmers) should not get too hung up on the notion of validity of a Unicode string, IMO. It is not some absolute kind of condition which should be tested in code with a bunch of assert() conditions every time a string hits an API. That way lies bad implementations of bad code. ;-) Essentially, most Unicode string handling APIs just pass through string pointers (or string objects) the same way old ASCII-based programs passed around ASCII strings. Checks for “validity” are only done at points where they make sense, and where the context is available for determining what the conditions for validity actually are. For example, a character set conversion API absolutely should be checking for ill-formedness for UTF-8, for example, and have appropriate error-handling, as well as checking for uninterpretable conversions (mapping not in the table), again with appropriate error-handling. But, on the other hand, an API which converts Unicode strings between UTF-8 and UTF-16, for example, absolutely should not – must not – concern itself with the presence of a defective combining character sequence. If it doesn’t convert the defective combining character sequence in UTF-8 into the corresponding defective combining character sequence in UTF-16, then the API is just broken. Never mind the fact that the defective combining character sequence itself might not then be valid *for* some other operation, say a display algorithm which detects that as an unacceptable edge condition and inserts a virtual base for the combining mark in order not to break the display. --Ken What does it mean to not be a valid string in Unicode? Is there a concise answer in one place? For example, if one uses the noncharacters just mentioned by Ken Whistler (intended for process-internal uses, but [...] not permitted for interchange), what precisely does that mean? Naively, all strings over the alphabet {U+0000, ..., U+10FFFF} seem valid, but section 16.7 clarifies that noncharacters are forbidden for use in open interchange of Unicode text data. I'm assuming there is a set of isValidString(...)-type ICU calls that deals with this? 
Yes, I'm sure this has been asked before and ICU documentation has an answer, but this page http://www.unicode.org/faq/utf_bom.html contains lots of distributed factlets where it's imo unclear how to add them up. An implementation can use characters that are invalid in interchange, but I wouldn't expect implementation-internal aspects of anything to be subject to any standard in the first place (so, why write this?). Also it makes me wonder about the runtime of the algorithm checking for valid Unicode strings of a particular length. Of course the answer is linear complexity-wise, but as it or a variation of it (depending on how one treats holes and noncharacters) will be dependent on the positioning of those special characters, how fast does this function perform in practice? This also relates to Markus Scherer's reply to the holes thread just now. Stephan
Re: What does it mean to not be a valid string in Unicode?
To assess whether a string is invalid, it all depends on what the string is supposed to be. 1. As Ken says, if a string is supposed to be in a given encoding form (UTF), but it consists of an ill-formed sequence of code units for that encoding form, it would be invalid. So an isolated surrogate (eg 0xD800) in UTF-16 or any surrogate (eg 0xD800) in UTF-32 would make the string invalid. For example, a Java String may be an invalid UTF-16 string. See http://www.unicode.org/glossary/#unicode_encoding_form 2. However, a Unicode X-bit string does not have the same restrictions: it may contain sequences that would be ill-formed in the corresponding UTF-X encoding form. So a Java String is always a valid Unicode 16-bit string. See http://www.unicode.org/glossary/#unicode_string 3. Noncharacters are also valid in interchange, depending on the sense of interchange. The TUS says In effect, noncharacters can be thought of as application-internal private-use code points. If I couldn't interchange them ever, even internal to my application, or between different modules that compose my application, they'd be pointless. They are, however, strongly discouraged in *public* interchange. The glossary entry and some of the standard text is a bit old here, and needs to be clarified. 4. The quotation we select a substring that begins with a combining character, this new string will not be a valid string in Unicode. is wrong. It *is* a valid Unicode string. It isn't particularly useful in isolation, but it is valid. For some *specific purpose*, any particular string might be invalid. For example, the string mark#d might be invalid in some systems as a password, where # is disallowed, or where passwords might be required to be 8 characters long. Mark https://plus.google.com/114199149796022210033 * * *— Il meglio è l’inimico del bene —* ** On Fri, Jan 4, 2013 at 3:10 PM, Stephan Stiller stephan.stil...@gmail.comwrote: A Unicode string in UTF-8 encoding form could be ill-formed if the bytes don't follow the specification for UTF-8, for example. Given that answer, add in UTF-32 to my email just now, for simplicity's sake. Or let's simply assume we're dealing with some sort of sequence of abstract integers from hex+0 to hex+10, to abstract away from encoding form issues. Stephan
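A compact Java sketch that summarizes the code-point-level part of Mark's points for a single value: the classification strings and the function name are illustrative only.

    /** Sketch: classify a single int with respect to the Unicode code space. */
    static String classify(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            return "not a code point";
        }
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return "surrogate code point (allowed in a Unicode 16-bit string, ill-formed in any UTF)";
        }
        if ((cp & 0xFFFE) == 0xFFFE || (cp >= 0xFDD0 && cp <= 0xFDEF)) {
            return "noncharacter (valid, but strongly discouraged in public interchange)";
        }
        return "Unicode scalar value";
    }

Whether a whole string is "valid" then still depends on the specific purpose, as Mark's point 4 emphasizes.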
Re: What does it mean to not be a valid string in Unicode?
Thanks for all the information. Is there a most general sense in which there are constraints beyond all characters being from within the range U+0000 ... U+10FFFF? If one is concerned with computer security, oddities that are absolute should raise a flag; somebody could be messing with my system. Perhaps, for internal purposes, I have stored my Unicode string in an array of non-negative integers, and now I'm passing around this array. I don't know anything else about that string besides it being a Unicode string. There are no /absolute/ constraints against having any of those 1114112_dec (110000_hex) code points appearing anywhere, correct? Oh wait, actually there are the surrogates (D800 ... DFFF); perhaps I need to exclude them. So what else might I have overlooked? For example, the original C datatype named string, as it is understood and manipulated by the C standard library, has an /absolute/ prohibition against U+0000 anywhere inside. UTF-32 has an /absolute/ prohibition against anything above 10FFFF. UTF-16 has an /absolute/ prohibition against broken surrogate pairs. (Or so is my understanding. Mark Davis mentioned Unicode X-bit strings, but D76 (in sec. 3.9 of the standard) suggests that there is no place for surrogate values outside of an encoding form; that is: a surrogate is not a Unicode scalar value. Perhaps Unicode X-bit string should be outside of this discussion then, or I'll need to read up on this more.) Mark Davis' quote (In effect, noncharacters can be thought of as application-internal private-use code points.) would really suggest that there are really no absolute constraints. I'm just checking that my understanding of the matter is correct. Stephan
Re: What does it mean to not be a valid string in Unicode?
On Fri, Jan 4, 2013 at 6:08 PM, Stephan Stiller stephan.stil...@gmail.comwrote: Is there a most general sense in which there are constraints beyond all characters being from within the range U+0000 ... U+10FFFF? If one is concerned with computer security, oddities that are absolute should raise a flag; somebody could be messing with my system. If you are concerned with computer security, then I suggest you read http://www.unicode.org/reports/tr36/ Unicode Security Considerations. For example, the original C datatype named string, as it is understood and manipulated by the C standard library, has an *absolute* prohibition against U+0000 anywhere inside. That's not as much a prohibition as an artifact of NUL-termination of strings. In more modern libraries, the string contents and its explicit length are stored together, and you can store a 00 byte just fine, for example in a C++ string. markus
Re: What does it mean to not be a valid string in Unicode?
If you are concerned with computer security If for example I sit on a committee that devises a new encoding form, I would need to be concerned with the question which /sequences of Unicode code points/ are sound. If this is the same as sequences of Unicode scalar values, I would need to exclude surrogates, if I read the standard correctly (this wasn't obvious to me on first inspection btw). If for example I sit on a committee that designs an optimized compression algorithm for Unicode strings (yep, I do know about SCSU), I might want to first convert them to some canonical internal form (say, my array of non-negative integers). If U+surrogate values can be assumed to not exist, there are 2048 fewer values a code point can assume; that's good for compression, and I'll subtract 2048 from those large scalar values in a first step. Etc etc. So I do think there are a number of very general use cases where this question arises. For example, the original C datatype named string, as it is understood and manipulated by the C standard library, has an /absolute/ prohibition against U+0000 anywhere inside. That's not as much a prohibition as an artifact of NUL-termination of strings. In more modern libraries, the string contents and its explicit length are stored together, and you can store a 00 byte just fine, for example in a C++ string. Yep. If my question is really underspecified or ill-formed, a listing of possible interpretations somewhere (with case-specific answers) might be useful. Stephan