Re: What does it mean to not be a valid string in Unicode?

2013-01-08 Thread Martin J. Dürst

On 2013/01/08 14:43, Stephan Stiller wrote:


Wouldn't the clean way be to ensure valid strings (only) when they're
built


Of course, the earlier erroneous data gets caught, the better. The 
problem is that error checking is expensive, both in lines of code and 
in execution time (I think there is data showing that in many real-life 
programs, more than 50% or 80% or so of the code is error checking, but I 
forgot the details).


So indeed as Ken has explained with a very good example, it doesn't make 
sense to check at every corner.



and then make sure that string algorithms (only) preserve
well-formedness of input?

Perhaps this is how the system grew, but it seems to me that it's
yet another legacy of C pointer arithmetic and about convenience of
implementation rather than a safety or performance issue.


Convenience of implementation is an important aspect in programming.

 Things like this are called garbage in, garbage-out (GIGO). It may be
 harmless, or it may hurt you later.
 So in this kind of a case, what we are actually dealing with is:
 garbage in, principled, correct results out. ;-)

Sorry, but I have to disagree here. If a list of strings contains items 
with lone surrogates (garbage), then sorting them doesn't make the 
garbage go away, even if the items may be sorted in correct order 
according to some criterion.


Regards,   Martin.



Re: What does it mean to not be a valid string in Unicode?

2013-01-08 Thread Stephan Stiller
 Wouldn't the clean way be to ensure valid strings (only) when they're
 built


 Of course, the earlier erroneous data gets caught, the better. The problem
 is that error checking is expensive, both in lines of code and in execution
 time (I think there is data showing that in any real-life programs, more
 than 50% or 80% or so is error checking, but I forgot the details).

 So indeed as Ken has explained with a very good example, it doesn't make
 sense to check at every corner.


What I meant: The idea was to check only when a string is constructed. As
soon as it's been fed into a collation/whatever algorithm, the algorithm
should assume the original input was well-formed and shouldn't do any more
error-checking, yes.

Not having facilities for dealing with ill-formed values (U+D800 ..
U+DFFF) in an algorithm will surely make *something* faster, even if it's
just that some table used indirectly has fewer entries.

What I had in mind is a library where the public interface only ever allows
Unicode scalar values to be in- and output. This will lead to a cleaner
interface. A data structure that can hold surrogate values can and should
be used algorithm-*internally*, if that makes things more efficient, safer,
etc.
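To make that idea concrete, here is a minimal Java sketch of such an interface; the class and method names are hypothetical, not an existing library. The public API accepts and yields only Unicode scalar values and rejects surrogate code points at construction time:

final class ScalarString {
  private final int[] scalars;  // internal representation: one scalar value per slot

  ScalarString(int... codePoints) {
    for (int cp : codePoints) {
      if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        throw new IllegalArgumentException("not a Unicode scalar value: " + cp);
      }
    }
    this.scalars = codePoints.clone();
  }

  int scalarAt(int index) { return scalars[index]; }  // only scalar values ever come out
  int length() { return scalars.length; }
}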

Convenience of implementation is an important aspect in programming.


For a user yes, but not for a library writer/maintainer, I would suggest.
The STL uses red-black trees; these are annoyingly difficult to implement
but invisible to the user.

Stephan


RE: What does it mean to not be a valid string in Unicode?

2013-01-08 Thread Whistler, Ken
 Sorry, but I have to disagree here. If a list of strings contains items
 with lone surrogates (garbage), then sorting them doesn't make the
 garbage go away, even if the items may be sorted in correct order
 according to some criterion.

Well, yeah, I wasn't claiming that the principled, correct output made the 
garbage go away.

Let me put it this way: if my choices are 1) garbage in, garbage reliably 
sorted out into garbage bin, versus 2) garbage in, sorting fails with 
exception, then I'll pick #1. ;-)

To give a concrete example, my implementation of UCA reliably passes the 
SHIFTED test cases in the conformance test, even though those test cases 
(deliberately) contain some ill-formed strings. If I instead did validation 
testing on input strings in my base implementation, it would be slower, *and* 
to pass the conformance test I would have to add a separate preprocessing stage 
that probed all the input data for ill-formed strings and filtered those cases 
out before engaging the test, so that it wouldn't fail with an exception when 
it hit the bad data. 

--Ken





Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Markus Scherer
Unicode libraries commonly provide functions that take a code point and
return a value, for example a property value. Such a function normally
accepts the whole range 0..10FFFF (and may even return a default value for
out-of-range inputs).

Also, we commonly read code points from 16-bit Unicode strings, and
unpaired surrogates are returned as themselves and treated as such (e.g.,
in collation). That would not be well-formed UTF-16, but it's generally
harmless in text processing.
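In Java, for instance, String.codePointAt already behaves this way; a small sketch using only the JDK:

String s = "a\uD800b";       // a Unicode 16-bit string with an unpaired high surrogate
int cp = s.codePointAt(1);   // 0xD800: the lone surrogate is returned as itself
System.out.printf("U+%04X%n", cp);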

markus


RE: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Doug Ewell
Markus Scherer markus dot icu at gmail dot com wrote:

 Also, we commonly read code points from 16-bit Unicode strings, and
 unpaired surrogates are returned as themselves and treated as such
 (e.g., in collation). That would not be well-formed UTF-16, but it's
 generally harmless in text processing.

But still non-conformant.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell





Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Markus Scherer
On Mon, Jan 7, 2013 at 10:48 AM, Doug Ewell d...@ewellic.org wrote:

 Markus Scherer markus dot icu at gmail dot com wrote:

  Also, we commonly read code points from 16-bit Unicode strings, and
  unpaired surrogates are returned as themselves and treated as such
  (e.g., in collation). That would not be well-formed UTF-16, but it's
  generally harmless in text processing.

 But still non-conformant.


Not really, that's why there is a definition of a 16-bit Unicode string in
the standard.

markus


Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Mark Davis ☕
 But still non-conformant.

That's incorrect.

The point I was making above is that in order to say that something is
non-conformant, you have to be very clear what it is non-conformant *TO*.

 Also, we commonly read code points from 16-bit Unicode strings, and
 unpaired surrogates are returned as themselves and treated as such
 (e.g., in collation).

   - That *is* conformant for *Unicode 16-bit strings.*
   - That is *not* conformant for *UTF-16*.

There is an important difference.
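A small Java illustration of that difference (a sketch, assuming nothing beyond the JDK; the class name is made up): the same char sequence is a perfectly good Unicode 16-bit string, but a strict UTF-16 encoder rejects it because of the unpaired surrogate.

import java.nio.CharBuffer;
import java.nio.charset.*;

public class SixteenBitVsUtf16 {
  public static void main(String[] args) {
    String s = "a\uD800b";   // valid as a Unicode 16-bit string, ill-formed as UTF-16
    CharsetEncoder strict = StandardCharsets.UTF_16BE.newEncoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
      strict.encode(CharBuffer.wrap(s));
      System.out.println("well-formed UTF-16");
    } catch (CharacterCodingException e) {
      System.out.println("not well-formed UTF-16: " + e);  // this branch is taken
    }
  }
}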

Mark https://plus.google.com/114199149796022210033
— Il meglio è l’inimico del bene —


On Mon, Jan 7, 2013 at 10:48 AM, Doug Ewell d...@ewellic.org wrote:

 But still non-conformant.


RE: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Doug Ewell
You're right, and I stand corrected. I read Markus's post too quickly.

Mark Davis ☕ mark at macchiato dot com wrote:

 But still non-conformant.

 That's incorrect. 

 The point I was making above is that in order to say that something is 
 non-conformant, you have to be very clear what it is non-conformant TO.

 Also, we commonly read code points from 16-bit Unicode strings, and
 unpaired surrogates are returned as themselves and treated as such
 (e.g., in collation).

 + That is conformant for Unicode 16-bit strings.

 + That is not conformant for UTF-16.

 There is an important difference.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell





Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Philippe Verdy
Well then I don't know why you need a definition of a "Unicode 16-bit
string". For me it just means exactly the same as "16-bit string", and
the encoding in it is not relevant, given that you can put anything in it
without even needing to be conformant to Unicode. So a Java string is
exactly the same, a 16-bit string. The same goes for Windows API 16-bit
strings, or "wide" strings in a C compiler where "wide" is mapped by a
compiler option to 16-bit code units for wchar_t (or short, but more
safely as UINT16 if you don't want to be dependent on compiler options
or OS environments when compiling, when you need to manage the exact
memory allocation), or a U-string in Perl.

Only UTF-16 (not UTF-16BE and UTF-16LE, which are encoding schemes with
concrete byte orders, without any leading BOM) is relevant to Unicode,
because a 16-bit string does not itself specify any encoding scheme or
byte order.

One confusion comes with the name UTF-16 when it is also used as an
encoding scheme with a possible leading BOM and an implied default of
UTF-16LE determined by guesses on the first few characters: this
encoding scheme (with support for a BOM and an implicit guess of byte
order if it's missing) should have been given a distinct encoding name
like 'UTF-16XE', reserving UTF-16 for what the standard discusses as a
16-bit string, except that it should still require UTF-16 conformance
(no unpaired surrogates and no non-characters) plus **no** BOM supported
at this level (which is still not materialized by a concrete byte order
or by an implicit size in storage bits, as long as it can store
distinctly the whole range of code units 0x0000..0xFFFF minus the few
non-characters, and enforces all surrogates to be paired, but does not
enforce any character to be allocated).

Note that such a relaxed version of UTF-16 would still allow an internal
alternate representation of 0x0000 for interoperating with various
APIs without changing the storage requirement: 0xFFFF could perfectly
well be used to replace 0x0000 if that last code unit plays a special
role as a string terminator. But even if this is done, a storage unit
like 0xFFFF would still be perceived as if it were really the code unit
0x0000.

In other words, the concept of a completely relaxed Unicode 16-bit
string is unneeded, given that its single requirement is to make
sure that it defines a length in terms of 16-bit code units, with code
units being large enough to store any unsigned 16-bit value
(internally they could still be 18-bit on systems with 6-bit or 9-bit
addressable memory cells; the sizeof() property of these code units
could still be 2, or 3, or something else, as long as it is large
enough to store the value). On some devices (not so exotic...) there
are memory areas that are 4-bit addressable or even 1-bit addressable
(in that latter case the sizeof() property for the code unit type would
return 16, not 2). Some devices only have 16-bit or 32-bit addressable
memory, and sizeof() would return 1 (and the C types char and wchar_t
would most likely be the same).


2013/1/7 Doug Ewell d...@ewellic.org:
 You're right, and I stand corrected. I read Markus's post too quickly.

 Mark Davis ☕ mark at macchiato dot com wrote:

 But still non-conformant.

 That's incorrect.

 The point I was making above is that in order to say that something is 
 non-conformant, you have to be very clear what it is non-conformant TO.

 Also, we commonly read code points from 16-bit Unicode strings, and
 unpaired surrogates are returned as themselves and treated as such
 (e.g., in collation).

 + That is conformant for Unicode 16-bit strings.

 + That is not conformant for UTF-16.

 There is an important difference.

 --
 Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell







RE: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Whistler, Ken
Philippe Verdy said:

 Well then I don't know why you need a definition of an Unicode 16-bit
 string. For me it just means exactly the same as 16-bit string, and
 the encoding in it is not relevant given you can put anything in it
 without even needing to be conformant to Unicode. So a Java string is
 exactly the same, a 16-bit string. The same also as Windows API 16-bit
 strings, or wide strings in a C compiler where wide is mapped by a
 compiler option to 16-bit code units for wchar_t ...

And elaborating on Mark's response a little:

[0x0061,0x0062,0x4E00,0xFFFF,0x0410]

Is a Unicode 16-bit string. It contains "a", "b", a Han character, a 
noncharacter, and a Cyrillic character.

Because it is also well-formed as UTF-16, it is also a UTF-16 string, by the 
definitions in the standard. 

[0x0061,0xD800,0x4E00,0xFFFF,0x0410]

Is a Unicode 16-bit string. It contains "a", a high-surrogate code unit, a 
Han character, a noncharacter, and a Cyrillic character.

Because an unpaired high-surrogate code unit is not allowed in UTF-16, this is 
*NOT* a UTF-16 string.

On the other hand, consider:

[0x0061,0x0062,0x88EA,0x8440]

That is *NOT* a Unicode 16-bit string. It contains "a", "b", a Han character, 
and a Cyrillic character. How do I know? Because I know the character set 
context. It is a wchar_t implementation of the Shift-JIS code page 932.

The difference is the declaration of the standard one uses to interpret what 
the 16-bit units mean. In a Unicode 16-bit string I go to the Unicode 
Standard to figure out how to interpret the numbers. In a wide Code Page 932 
string I go to the specification of Code Page 932 to figure out how to 
interpret the numbers.

This is no different, really, than talking about a Latin-1 string versus a 
KOI-8 string.

--Ken






RE: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Whistler, Ken
Philippe also said:

 ... Reserving UTF-16 for what the standard discusses as a
 16-bit string, except that it should still require UTF-16
 conformance (no unpaired surrogates and no non-characters) ...

For those following along, conformance to UTF-16 does *NOT* require "no 
non-characters". Noncharacters are perfectly valid in UTF-16.

--Ken





Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Martin J. Dürst

On 2013/01/08 3:27, Markus Scherer wrote:


Also, we commonly read code points from 16-bit Unicode strings, and
unpaired surrogates are returned as themselves and treated as such (e.g.,
in collation). That would not be well-formed UTF-16, but it's generally
harmless in text processing.


Things like this are called garbage in, garbage-out (GIGO). It may be 
harmless, or it may hurt you later.


Regards,   Martin.



Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Mark Davis ☕
That's not the point (see successive messages).


Mark https://plus.google.com/114199149796022210033
— Il meglio è l’inimico del bene —


On Mon, Jan 7, 2013 at 4:59 PM, Martin J. Dürst due...@it.aoyama.ac.jp wrote:

 On 2013/01/08 3:27, Markus Scherer wrote:

  Also, we commonly read code points from 16-bit Unicode strings, and
 unpaired surrogates are returned as themselves and treated as such (e.g.,
 in collation). That would not be well-formed UTF-16, but it's generally
 harmless in text processing.


 Things like this are called garbage in, garbage-out (GIGO). It may be
 harmless, or it may hurt you later.

 Regards,   Martin.




RE: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Whistler, Ken
Martin,

The kind of situation Markus is talking about is illustrated particularly well 
in collation. And there is a section 7.1.1 in UTS #10 specifically devoted to 
this issue:

http://www.unicode.org/reports/tr10/#Handline_Illformed

When weighting Unicode 16-bit strings for collation, you can, of course, always 
detect an unpaired surrogate and return an error code or throw an exception, 
but that may not be the best strategy for an implementation.

The problem derives in part from the fact that for sorting, the comparison 
routine is generally buried deep down as a primitive comparison function in 
what may be a rather complicated sorting algorithm. Those algorithms often 
assume that the comparison routine is analogous to strcmp(), and will always 
return -1/0/1 (or negative/0/positive), and that it is not going to fail 
because it decides that some byte value in an input string is not valid in some 
particular character encoding. (Of course, the calling code needs to ensure it 
isn't handing off null pointers or unallocated objects, but that is par for the 
course for any string handling.)

Now if I want to adapt a particular sorting algorithm so it uses a 
UCA-compliant, multi-level collation algorithm for the actual string 
comparison, then by far the easiest way to do so is to build a function 
essentially comparable to strcmp() in structure, e.g. UCA_strcmp(context, 
string1, string2), which also always returns -1/0/1 for any two Unicode 16-bit 
strings. If I introduce a string validation aspect to this comparison routine, 
and return an error code or raise an exception, then I run the risk of 
marginally slowing down the most time-critical part of the sorting loop, as 
well as complicating the adaptation of the sorting code, to deal with extra 
error conditions. It is faster, more reliable and robust, and easier to adapt 
the code, if I simply specify for the weighting exactly what happens to any 
isolated surrogate in input strings, and compare accordingly. Hence the two 
alternative strategies suggested in Section 7.1.1 of UTS #10: either weight 
each maximal ill-formed subsequence as if it were U+FFFD (with a primary weight), or weight each 
surrogate code point with a generated implicit weight, as if it were an 
unassigned code point. Either strategy works. And in fact, the conformance 
tests in CollationTest.zip for UCA include some ill-formed strings in the test 
data, so that implementations can test their handling of them, if they choose.
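As a rough Java sketch of the first of those two strategies (the names are illustrative, and java.text.Collator merely stands in for a UCA-compliant comparator): map each unpaired surrogate to U+FFFD before weighting, so the comparison primitive stays strcmp-shaped and never has to fail.

static String replaceIllFormed(String s) {
  StringBuilder sb = new StringBuilder(s.length());
  for (int i = 0; i < s.length(); ) {
    int cp = s.codePointAt(i);                       // a lone surrogate comes back as itself
    sb.appendCodePoint(cp >= 0xD800 && cp <= 0xDFFF ? 0xFFFD : cp);
    i += Character.charCount(cp);
  }
  return sb.toString();
}

static int ucaStrcmp(java.text.Collator weighter, String s1, String s2) {
  // always returns negative/0/positive, never an error code or exception
  return weighter.compare(replaceIllFormed(s1), replaceIllFormed(s2));
}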

So in this kind of a case, what we are actually dealing with is: garbage in, 
principled, correct results out. ;-)

--Ken

 -Original Message-
 
 On 2013/01/08 3:27, Markus Scherer wrote:
 
  Also, we commonly read code points from 16-bit Unicode strings, and
  unpaired surrogates are returned as themselves and treated as such (e.g.,
  in collation). That would not be well-formed UTF-16, but it's generally
  harmless in text processing.
 
 Things like this are called garbage in, garbage-out (GIGO). It may be
 harmless, or it may hurt you later.
 
 Regards,   Martin.





RE: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Whistler, Ken

 
 http://www.unicode.org/reports/tr10/#Handline_Illformed

Grrr. 

http://www.unicode.org/reports/tr10/#Handling_Illformed

I seem unable to handle ill-formed spelling today. :(

--Ken






Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Stephan Stiller



Things like this are called garbage in, garbage-out (GIGO). It may be
harmless, or it may hurt you later.

So in this kind of a case, what we are actually dealing with is: garbage in, 
principled, correct results out. ;-)


Wouldn't the clean way be to ensure valid strings (only) when they're 
built and then make sure that string algorithms (only) preserve 
well-formedness of input?


Perhaps this is how the system grew, but it seems to me that it's
yet another legacy of C pointer arithmetic and about convenience of
implementation rather than a safety or performance issue.

Stephan




Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Mark Davis ☕
In practice and by design, treating isolated surrogates the same as
reserved code points in processing, and then cleaning up on conversion to
UTFs works just fine. It is a tradeoff that is up to the implementation.

It has nothing to do with a legacy of C pointer arithmetic. It does
represent a pragmatic choice made some time ago, but there is no need to get
worked up about it. Human scripts and their representation on computers are
quite complex enough; in the grand scheme of things the handling of
surrogates in implementations pales in significance.
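A minimal Java sketch of the "clean up on conversion to UTFs" step (JDK only; the substitution choice is an implementation decision, and the class name is made up): internally the string may carry an isolated surrogate, and only the encoder boundary substitutes it.

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.*;

public class CleanOnConversion {
  public static void main(String[] args) throws CharacterCodingException {
    String internal = "a\uDC00b";                        // isolated low surrogate, fine internally
    CharsetEncoder utf8 = StandardCharsets.UTF_8.newEncoder()
        .onMalformedInput(CodingErrorAction.REPLACE)     // substitute instead of failing
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
    ByteBuffer bytes = utf8.encode(CharBuffer.wrap(internal));
    System.out.println(bytes.remaining() + " UTF-8 bytes; the surrogate was replaced");
  }
}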


Mark https://plus.google.com/114199149796022210033
— Il meglio è l’inimico del bene —


On Mon, Jan 7, 2013 at 9:43 PM, Stephan Stiller
stephan.stil...@gmail.com wrote:


  Things like this are called garbage in, garbage-out (GIGO). It may be
 harmless, or it may hurt you later.

 So in this kind of a case, what we are actually dealing with is: garbage
 in, principled, correct results out. ;-)


 Wouldn't the clean way be to ensure valid strings (only) when they're
 built and then make sure that string algorithms (only) preserve
 well-formedness of input?

 Perhaps this is how the system grew, but it seems to me that it's
 yet another legacy of C pointer arithmetic and about convenience of
 implementation rather than a safety or performance issue.

 Stephan





Re: What does it mean to not be a valid string in Unicode?

2013-01-06 Thread Mark Davis ☕
Some of this is simply historical: had Unicode been designed from the start
with 8 and 16 bit forms in mind, some of this could be avoided. But that is
water long under the bridge. Here is a simple example of why we have both
UTFs and Unicode Strings.

Java uses Unicode 16-bit Strings. The following code is copying all the
code units from string to buffer.

StringBuilder buffer = new StringBuilder();
for (int i = 0; i < string.length(); ++i) {
  buffer.append(string.charAt(i));   // copies one UTF-16 code unit at a time
}

If Java always enforced well-formedness of strings, then

   1. The above code would break, since there is an intermediate step where
   buffer is ill-formed (when just the first of a surrogate pair has been
   copied).
   2. It would involve extra checks in all of the low-level string code,
   with some impact on performance.

Newer implementations of strings, such as Python's, can avoid these issues
because they use a Uniform Model, always dealing in code points. For more
information, see also
http://macchiati.blogspot.com/2012/07/unicode-string-models-many-programming.html
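For comparison, here is the same copy written at the code-point level (a sketch reusing the string variable from the snippet above); it never leaves buffer holding only half of a surrogate pair:

StringBuilder buffer = new StringBuilder();
for (int i = 0; i < string.length(); ) {
  int cp = string.codePointAt(i);
  buffer.appendCodePoint(cp);        // appends both code units of a pair in one step
  i += Character.charCount(cp);
}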

(There are many, many discussions of this in the Unicode email archives if
you have more questions.)


Mark https://plus.google.com/114199149796022210033
— Il meglio è l’inimico del bene —


On Sat, Jan 5, 2013 at 11:14 PM, Stephan Stiller
stephan.stil...@gmail.com wrote:


 If for example I sit on a committee that devises a new encoding form, I
 would need to be concerned with the question which *sequences of Unicode
 code points* are sound. If this is the same as sequences of Unicode
 scalar values, I would need to exclude surrogates, if I read the standard
 correctly (this wasn't obvious to me on first inspection btw). If for
 example I sit on a committee that designs an optimized compression
 algorithm for Unicode strings (yep, I do know about SCSU), I might want to
 first convert them to some canonical internal form (say, my array of
 non-negative integers). If U+surrogate values can be assumed to not
 exist, there are 2048 fewer values a code point can assume; that's good for
 compression, and I'll subtract 2048 from those large scalar values in a
 first step. Etc etc. So I do think there are a number of very general use
 cases where this question arises.


 In fact, these questions have arisen in the past and have found answers
 then. A present-day use case is if I author a programming language and need
 to decide which values for val I accept in a statement like this:
 someEncodingFormIndependentUnicodeStringType str = val, specified in
 some PL-specific way

 I've looked at the Standard, and I must admit I'm a bit perplexed. Because
 of C1, which explicitly states

 A process shall not interpret a high-surrogate code point or a
 low-surrogate code point as an abstract character.

 I do not know why surrogate values are defined as code points in the
 first place. It seems to me that surrogates are (or should be) an encoding
 form–specific notion, whereas I have always thought of code points as
 encoding form–independent. Turns out this was wrong. I have always been
 thinking that code point conceptually meant Unicode scalar value, which
 is explicitly forbidden to have a surrogate value. Is this only
 terminological confusion? I would like to ask: Why do we need the notion of
 a surrogate code point; why isn't the notion of surrogate code units [in
 some specific encoding form] enough? Conceptually surrogate values are
 byte sequences used in encoding forms (modulo endianness). Why would one
 define an expression (Unicode code point) that conceptually lumps
 Unicode scalar value (an encoding form–independent notion) and surrogate
 code point (a notion that I wouldn't expect to exist outside of specific
 encoding forms) together?

 An encoding form maps only Unicode scalar values (that is all Unicode code
 points excluding the surrogate code points), by definition. D80 and what
 follows (Unicode string and Unicode X-bit string) exist, as I
 understand it, *only* in order for us to be able to have terminology for
 discussing ill-formed code unit sequences in the various encoding forms;
 but all of this talk seems to me to be encoding form–dependent.

 I think the answer to the question I had in mind is that the legal
 sequences of Unicode scalar values are (by definition)
  ({U+0000, ..., U+10FFFF} \ {U+D800, ..., U+DFFF})*.
 But then there is the notion of Unicode string, which is conceptually
 different, by definition. Maybe this is a terminological issue only. But is
 there an expression in the Standard that is defined as sequence of Unicode
 scalar values, a notion that seems to me to be conceptually important? I
 can see that the Standard defines the various well-formed encoding form
 code unit sequence. Have I overlooked something?

 Why is it even possible to store a surrogate value in something like the
 icu::UnicodeString datatype? In other words, why are we concerned with
storing Unicode *code points* in data structures instead of Unicode *scalar
values* (which can be serialized via encoding forms)?

Re: What does it mean to not be a valid string in Unicode?

2013-01-06 Thread Stephan Stiller
On Sun, Jan 6, 2013 at 12:34 PM, Mark Davis ☕ m...@macchiato.com wrote:

 [...]


What you write, and that the UTFs have historical artifacts in their design,
makes sense to me.

(There are many, many discussions of this in the Unicode email archives if
 you have more questions.)


Okay. I am fine with ending this thread. *But ...*

I do want to rephrase what baffled me just now. After sleeping over this,
it's clearer what the issue was: Most Unicode discourse is about code
points and talks about them, with the implication (everywhere, pretty much)
that we're encoding *code points* in encoding forms. Maybe I've just read
this into the discourse, but if Unicode discussions used the expression
"scalar value" more, there would be no potential for such misunderstanding.
(1) Any expression containing "surrogate" *should* be relevant only for
UTF-16.
(2) The notion of "code point" covers scalar values *plus* U+surrogate
value.
(3) The expression "code point" is used in an encoding form–independent
context, for the most part.
(4) So, it's very confusing to ever write surrogate values (say, D813_hex)
in U+-notation. Surrogate values are UTF-16-internal byte values. Nobody
should be thinking about them outside of UTF-16. Now the terminology is a
jumble.

Stephan


Re: What does it mean to not be a valid string in Unicode?

2013-01-05 Thread Stephan Stiller
 If for example I sit on a committee that devises a new encoding form, I
 would need to be concerned with the question which *sequences of Unicode
 code points* are sound. If this is the same as sequences of Unicode
 scalar values, I would need to exclude surrogates, if I read the standard
 correctly (this wasn't obvious to me on first inspection btw). If for
 example I sit on a committee that designs an optimized compression
 algorithm for Unicode strings (yep, I do know about SCSU), I might want to
 first convert them to some canonical internal form (say, my array of
 non-negative integers). If U+surrogate values can be assumed to not
 exist, there are 2048 fewer values a code point can assume; that's good for
 compression, and I'll subtract 2048 from those large scalar values in a
 first step. Etc etc. So I do think there are a number of very general use
 cases where this question arises.


In fact, these questions have arisen in the past and have found answers
then. A present-day use case is if I author a programming language and need
to decide which values for val I accept in a statement like this:
someEncodingFormIndependentUnicodeStringType str = val, specified in
some PL-specific way

I've looked at the Standard, and I must admit I'm a bit perplexed. Because
of C1, which explicitly states

A process shall not interpret a high-surrogate code point or a
low-surrogate code point as an abstract character.

I do not know why surrogate values are defined as code points in the
first place. It seems to me that surrogates are (or should be) an encoding
form–specific notion, whereas I have always thought of code points as
encoding form–independent. Turns out this was wrong. I have always been
thinking that code point conceptually meant Unicode scalar value, which
is explicitly forbidden to have a surrogate value. Is this only
terminological confusion? I would like to ask: Why do we need the notion of
a surrogate code point; why isn't the notion of surrogate code units [in
some specific encoding form] enough? Conceptually surrogate values are
byte sequences used in encoding forms (modulo endianness). Why would one
define an expression (Unicode code point) that conceptually lumps
Unicode scalar value (an encoding form–independent notion) and surrogate
code point (a notion that I wouldn't expect to exist outside of specific
encoding forms) together?

An encoding form maps only Unicode scalar values (that is all Unicode code
points excluding the surrogate code points), by definition. D80 and what
follows (Unicode string and Unicode X-bit string) exist, as I
understand it, *only* in order for us to be able to have terminology for
discussing ill-formed code unit sequences in the various encoding forms;
but all of this talk seems to me to be encoding form–dependent.

I think the answer to the question I had in mind is that the legal
sequences of Unicode scalar values are (by definition)
({U+0000, ..., U+10FFFF} \ {U+D800, ..., U+DFFF})*.
But then there is the notion of Unicode string, which is conceptually
different, by definition. Maybe this is a terminological issue only. But is
there an expression in the Standard that is defined as sequence of Unicode
scalar values, a notion that seems to me to be conceptually important? I
can see that the Standard defines the various well-formed encoding form
code unit sequence. Have I overlooked something?

Why is it even possible to store a surrogate value in something like the
icu::UnicodeString datatype? In other words, why are we concerned with
storing Unicode *code points* in data structures instead of Unicode *scalar
values* (which can be serialized via encoding forms)?

Stephan


What does it mean to not be a valid string in Unicode?

2013-01-04 Thread Costello, Roger L.
Hi Folks,

In the book Fonts & Encodings (p. 61, first paragraph) it says:

"... we select a substring that begins with a combining character, this new
string will not be a valid string in Unicode."

What does it mean to not be a valid string in Unicode?

/Roger




RE: What does it mean to not be a valid string in Unicode?

2013-01-04 Thread Whistler, Ken
Yannis' use of the terminology "not ... a valid string in Unicode" is a little 
confusing there.

A Unicode string with the sequence, say, U+0300, U+0061 (a combining grave 
mark, followed by "a"), is valid Unicode in the sense that it just consists 
of two Unicode characters in a sequence. It is aberrant, certainly, but the way 
to describe that aberrancy is that the string starts with a defective combining 
character sequence (a combining mark, with no base character to apply to). And 
it would be non-conformant to the standard to claim that that sequence actually 
represented (or was equivalent to) the Latin small letter a-grave (à).

There is a second potential issue, which is whether any particular Unicode 
string is ill-formed or not. That issue comes up when examining actual code 
units laid out in memory in a particular encoding form. A Unicode string in 
UTF-8 encoding form could be ill-formed if the bytes don't follow the 
specification for UTF-8, for example. That is a separate issue from whether the 
string starts with a defective combining character sequence.

For "defective combining character sequence", see D57 in the standard (p. 81).

For "ill-formed", see D84 in the standard (p. 91).

http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf

--Ken

 In the book Fonts & Encodings (p. 61, first paragraph) it says:
 
 ... we select a substring that begins
 with a combining character, this new
 string will not be a valid string in
  Unicode.
 
 What does it mean to not be a valid string in Unicode?
 
 /Roger
 





Re: What does it mean to not be a valid string in Unicode?

2013-01-04 Thread Stephan Stiller



What does it mean to not be a valid string in Unicode?


Is there a concise answer in one place? For example, if one uses the 
noncharacters just mentioned by Ken Whistler ("intended for 
process-internal uses, but [...] not permitted for interchange"), what 
precisely does that mean? /Naively/, all strings over the alphabet 
{U+0000, ..., U+10FFFF} seem valid, but section 16.7 clarifies that 
noncharacters are "forbidden for use in open interchange of Unicode text 
data". I'm assuming there is a set of isValidString(...)-type ICU calls 
that deals with this? Yes, I'm sure this has been asked before and ICU 
documentation has an answer, but this page

http://www.unicode.org/faq/utf_bom.html
contains lots of distributed factlets where it's imo unclear how to add 
them up. An implementation can use characters that are invalid in 
interchange, but I wouldn't expect implementation-internal aspects of 
anything to be subject to any standard in the first place (so, why write 
this?). Also it makes me wonder about the runtime of the algorithm 
checking for valid Unicode strings of a particular length. Of course the 
answer is linear complexity-wise, but as it or a variation of it 
(depending on how one treats holes and noncharacters) will be dependent 
on the positioning of those special characters, how fast does this 
function perform in practice? This also relates to Markus Scherer's 
reply to the holes thread just now.


Stephan



Re: What does it mean to not be a valid string in Unicode?

2013-01-04 Thread Stephan Stiller



A Unicode string in UTF-8 encoding form could be ill-formed if the bytes don't 
follow the specification for UTF-8, for example.
Given that answer, add in UTF-32 to my email just now, for 
simplicity's sake. Or let's simply assume we're dealing with some sort 
of sequence of abstract integers from hex 0 to hex 10FFFF, to abstract 
away from encoding form issues.


Stephan




RE: What does it mean to not be a valid string in Unicode?

2013-01-04 Thread Whistler, Ken
One of the reasons why the Unicode Standard avoids the term “valid string” is 
that it immediately begs the question, valid *for what*?

The Unicode string U+0061, U+FFFF, U+0062 is just a sequence of 3 Unicode 
characters. It is valid *for* use in internal processing, because for my own 
processing I can decide I need to use the noncharacter value U+FFFF for some 
internal sentinel (or whatever). It is not, however, valid *for* open 
interchange, because there is no conformant way by the standard (by design) for 
me to communicate to you how to interpret U+FFFF in that string. However, the 
string U+0061, U+FFFF, U+0062 is valid *as* an NFC-normalized Unicode string, 
because the normalization algorithm must correctly process all Unicode code 
points, including noncharacters.
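A quick Java check of that last point (a sketch using only java.text.Normalizer; the class name is made up): NFC must process every code point, and the noncharacter passes through unchanged.

import java.text.Normalizer;

public class NoncharNfc {
  public static void main(String[] args) {
    String s = "a\uFFFFb";                                  // contains the noncharacter U+FFFF
    String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
    System.out.println(s.equals(nfc));                      // true: already NFC, noncharacter kept
  }
}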

The Unicode string U+0061, U+E000, U+0062 contains a private use character 
U+E000. That is valid *for* open interchange, but it is not interpretable 
according to the standard itself. It requires an external agreement as to the 
interpretation of U+E000.

The Unicode string U+0061, U+002A, U+0062 (“a*b”) is not valid *as* an 
identifier, because it contains a pattern-syntax character, the asterisk. 
However, it is certainly valid *for* use as an expression, for example.

And so on up the chain of potential uses to which a Unicode string could be put.

People (and particularly programmers) should not get too hung up on the notion 
of validity of a Unicode string, IMO. It is not some absolute kind of condition 
which should be tested in code with a bunch of assert() conditions every time a 
string hits an API. That way lies bad implementations of bad code. ;-)

Essentially, most Unicode string handling APIs just pass through string 
pointers (or string objects) the same way old ASCII-based programs passed 
around ASCII strings. Checks for “validity” are only done at points where they 
make sense, and where the context is available for determining what the 
conditions for validity actually are. For example, a character set conversion 
API absolutely should be checking for ill-formedness for UTF-8, for example, 
and have appropriate error-handling, as well as checking for uninterpretable 
conversions (mapping not in the table), again with appropriate error-handling.

But, on the other hand, an API which converts Unicode strings between UTF-8 and 
UTF-16, for example, absolutely should not – must not – concern itself with the 
presence of a defective combining character sequence. If it doesn’t convert the 
defective combining character sequence in UTF-8 into the corresponding 
defective combining character sequence in UTF-16, then the API is just broken. 
Never mind the fact that the defective combining character sequence itself 
might not then be valid *for* some other operation, say a display algorithm 
which detects that as an unacceptable edge condition and inserts a virtual base 
for the combining mark in order not to break the display.
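A small Java sketch of that conversion behavior (JDK only; the class name is made up): the defective combining character sequence survives a UTF-8 round trip untouched.

import java.nio.charset.StandardCharsets;

public class DefectiveRoundTrip {
  public static void main(String[] args) {
    String defective = "\u0300abc";                      // starts with a combining grave, no base
    byte[] utf8 = defective.getBytes(StandardCharsets.UTF_8);
    String back = new String(utf8, StandardCharsets.UTF_8);
    System.out.println(defective.equals(back));          // true: the converter just passes it through
  }
}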

--Ken




What does it mean to not be a valid string in Unicode?

Is there a concise answer in one place? For example, if one uses the 
noncharacters just mentioned by Ken Whistler (intended for process-internal 
uses, but [...] not permitted for interchange), what precisely does that mean? 
Naively, all strings over the alphabet {U+0000, ..., U+10FFFF} seem valid, 
but section 16.7 clarifies that noncharacters are forbidden for use in open 
interchange of Unicode text data. I'm assuming there is a set of 
isValidString(...)-type ICU calls that deals with this? Yes, I'm sure this has 
been asked before and ICU documentation has an answer, but this page
http://www.unicode.org/faq/utf_bom.html
contains lots of distributed factlets where it's imo unclear how to add them 
up. An implementation can use characters that are invalid in interchange, but 
I wouldn't expect implementation-internal aspects of anything to be subject to 
any standard in the first place (so, why write this?). Also it makes me wonder 
about the runtime of the algorithm checking for valid Unicode strings of a 
particular length. Of course the answer is linear complexity-wise, but as it 
or a variation of it (depending on how one treats holes and noncharacters) will 
be dependent on the positioning of those special characters, how fast does this 
function perform in practice? This also relates to Markus Scherer's reply to 
the holes thread just now.

Stephan


Re: What does it mean to not be a valid string in Unicode?

2013-01-04 Thread Mark Davis ☕
To assess whether a string is invalid, it all depends on what the string is
supposed to be.

1. As Ken says, if a string is supposed to be in a given encoding form
(UTF), but it consists of an ill-formed sequence of code units for that
encoding form, it would be invalid. So an isolated surrogate (eg 0xD800) in
UTF-16 or any surrogate (eg 0xD800) in UTF-32 would make the string
invalid. For example, a Java String may be an invalid UTF-16 string. See
http://www.unicode.org/glossary/#unicode_encoding_form

2. However, a Unicode X-bit string does not have the same restrictions:
it may contain sequences that would be ill-formed in the corresponding UTF-X
encoding form. So a Java String is always a valid Unicode 16-bit string.
See http://www.unicode.org/glossary/#unicode_string

3. Noncharacters are also valid in interchange, depending on the sense of
interchange. The TUS says "In effect, noncharacters can be thought of as
application-internal private-use code points." If I couldn't interchange
them ever, even internal to my application, or between different modules
that compose my application, they'd be pointless. They are, however,
strongly discouraged in *public* interchange. The glossary entry and some
of the standard text is a bit old here, and needs to be clarified.

4. The quotation "we select a substring that begins with a combining
character, this new string will not be a valid string in Unicode" is
wrong. It *is* a valid Unicode string. It isn't particularly useful in
isolation, but it is valid. For some *specific purpose*, any particular
string might be invalid. For example, the string mark#d might be invalid in
some systems as a password, where # is disallowed, or where passwords might
be required to be 8 characters long.




Mark https://plus.google.com/114199149796022210033
— Il meglio è l’inimico del bene —


On Fri, Jan 4, 2013 at 3:10 PM, Stephan Stiller
stephan.stil...@gmail.com wrote:


  A Unicode string in UTF-8 encoding form could be ill-formed if the bytes
 don't follow the specification for UTF-8, for example.

 Given that answer, add in UTF-32 to my email just now, for simplicity's
 sake. Or let's simply assume we're dealing with some sort of sequence of
  abstract integers from hex 0 to hex 10FFFF, to abstract away from encoding
 form issues.

 Stephan





Re: What does it mean to not be a valid string in Unicode?

2013-01-04 Thread Stephan Stiller

Thanks for all the information.

Is there a most general sense in which there are constraints beyond all 
characters being from within the range U+0000 ... U+10FFFF? If one is 
concerned with computer security, oddities that are absolute should 
raise a flag; somebody could be messing with my system. Perhaps, for 
internal purposes, I have stored my Unicode string in an array of 
non-negative integers, and now I'm passing around this array. I don't 
know anything else about that string besides it being a Unicode string. 
There are no /absolute/ constraints against having any of those 
1114112_dec (11_hex) code points appearing anywhere, correct? Oh 
wait, actually there are the surrogates (D800 ... DFFF); perhaps I need 
to exclude them. So what else might I have overlooked? For example, the 
original C datatype named string, as it is understood and manipulated 
by the C standard library, has an /absolute/ prohibition against U+ 
anywhere inside. UTF-32 has an /absolute/ prohibition against anything 
above 10. UTF-16 has an /absolute/ prohibition against broken 
surrogate pairs. (Or so is my understanding. Mark Davis mentioned 
Unicode X-bit strings, but D76 (in sec. 3.9 of the standard) suggests 
that there is no place for surrogate values outside of an encoding form; 
that is: a surrogate is not a Unicode scalar value. Perhaps Unicode 
X-bit string should be outside of this discussion then, or I'll need to 
read up on this more.)


Mark Davis' quote ("In effect, noncharacters can be thought of as 
application-internal private-use code points.") would really suggest 
that there are really no absolute constraints. I'm just checking that my 
understanding of the matter is correct.


Stephan



Re: What does it mean to not be a valid string in Unicode?

2013-01-04 Thread Markus Scherer
On Fri, Jan 4, 2013 at 6:08 PM, Stephan Stiller
stephan.stil...@gmail.com wrote:

 Is there a most general sense in which there are constraints beyond all
 characters being from within the range U+0000 ... U+10FFFF? If one is
 concerned with computer security, oddities that are absolute should raise a
 flag; somebody could be messing with my system.


If you are concerned with computer security, then I suggest you read
http://www.unicode.org/reports/tr36/ Unicode Security Considerations.

For example, the original C datatype named "string", as it is understood
 and manipulated by the C standard library, has an *absolute* prohibition
 against U+0000 anywhere inside.


That's not as much a prohibition as an artifact of NUL-termination of
strings. In more modern libraries, the string contents and its explicit
length are stored together, and you can store a 00 byte just fine, for
example in a C++ string.
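The same point holds in Java, which also stores the length explicitly; a one-line sketch:

String s = "a\u0000b";          // the embedded U+0000 is ordinary content
System.out.println(s.length()); // 3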

markus


Re: What does it mean to not be a valid string in Unicode?

2013-01-04 Thread Stephan Stiller



If you are concerned with computer security


If for example I sit on a committee that devises a new encoding form, I 
would need to be concerned with the question which /sequences of Unicode 
code points/ are sound. If this is the same as sequences of Unicode 
scalar values, I would need to exclude surrogates, if I read the 
standard correctly (this wasn't obvious to me on first inspection btw). 
If for example I sit on a committee that designs an optimized 
compression algorithm for Unicode strings (yep, I do know about SCSU), I 
might want to first convert them to some canonical internal form (say, 
my array of non-negative integers). If U+surrogate values can be 
assumed to not exist, there are 2048 fewer values a code point can 
assume; that's good for compression, and I'll subtract 2048 from those 
large scalar values in a first step. Etc etc. So I do think there are a 
number of very general use cases where this question arises.
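A sketch of that arithmetic in Java (hypothetical helpers, just for illustration): map each scalar value to a dense index 0..0x10F7FF by skipping the 2048 surrogate code points, and back.

static int scalarToIndex(int scalar) {
  if (scalar < 0 || scalar > 0x10FFFF || (scalar >= 0xD800 && scalar <= 0xDFFF))
    throw new IllegalArgumentException("not a Unicode scalar value: " + scalar);
  return scalar < 0xD800 ? scalar : scalar - 0x800;   // 0x800 = the 2048 skipped surrogates
}

static int indexToScalar(int index) {
  if (index < 0 || index > 0x10F7FF)
    throw new IllegalArgumentException("index out of range: " + index);
  return index < 0xD800 ? index : index + 0x800;
}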



For example, the original C datatype named "string", as it is
understood and manipulated by the C standard library, has an
/absolute/ prohibition against U+0000 anywhere inside.


That's not as much a prohibition as an artifact of NUL-termination of 
strings. In more modern libraries, the string contents and its 
explicit length are stored together, and you can store a 00 byte just 
fine, for example in a C++ string.


Yep.

If my question is really underspecified or ill-formed, a listing of 
possible interpretations somewhere (with case-specific answers) might be 
useful.


Stephan