Re: [I18n-sig] Re: Unicode surrogates: just say no!

2001-06-27 Thread Rick McGowan

Martin v. Loewis [EMAIL PROTECTED] wrote:

> It seems to be unclear to many, including myself, what exactly was
> clarified with Unicode 3.1. Where exactly does it say that processing
> a six-byte two-surrogates sequence as a single character is
> non-conforming?

It's not non-conforming, it's irregular. Please read the technical
report (#27) that I pointed at yesterday (on i18n-sig@python). It
gives detailed specifications for UTF-8. Anything not in the table "UTF-8
Bit Distribution" and the accompanying description shown there is
non-conforming.

Definition D36 specifies:

<quote>
(a) UTF-8 is the Unicode Transformation Format that serializes a Unicode
code point as a sequence of one to four bytes, as specified in Table 3.1,
UTF-8 Bit Distribution.
(b) An illegal UTF-8 code unit sequence is any byte sequence that does not
match the patterns listed in Table 3.1B, Legal UTF-8 Byte Sequences.
(c) An irregular UTF-8 code unit sequence is a six-byte sequence where the
first three bytes correspond to a high surrogate, and the next three bytes
correspond to a low surrogate. As a consequence of C12, these irregular
UTF-8 sequences shall not be generated by a conformant process.
</quote>
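
To make item (c) concrete, here is a small sketch comparing the regular four-byte form of U+10000 with the irregular six-byte form. It assumes a modern Python 3 interpreter and its "surrogatepass" error handler, neither of which is part of the discussion above; the variable names are purely illustrative.

```python
# Illustration only: assumes Python 3, whose "surrogatepass" error handler
# lets surrogate code points be serialized one at a time.
ch = "\U00010000"                     # first supplementary code point

# Regular UTF-8: a single four-byte sequence.
regular = ch.encode("utf-8")

# The UTF-16 surrogate pair for the same character.
offset = ord(ch) - 0x10000
high = 0xD800 + (offset >> 10)        # high (leading) surrogate
low = 0xDC00 + (offset & 0x3FF)       # low (trailing) surrogate

# Irregular form: each surrogate serialized as its own three-byte sequence.
irregular = (chr(high) + chr(low)).encode("utf-8", "surrogatepass")

print(regular.hex(), irregular.hex())
```

The first value is what a conformant process generates; the second is the six-byte shape that item (c) calls irregular.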

In other words, it is non-conforming to generate two 3-byte things for a  
surrogate pair.  However, it remains legal but irregular to interpret  
such a pair of 3-byte entities.  Why wasn't it just made non-conforming to  
interpret such things?  Because there are old implementations of UTF-8 in  
the world that pre-date the definition of surrogates, and if they ever  
encountered codepoints in that range, they would generate those pairs of  
3-byte sequences.  So it is legal for a process to recognize them and  
either raise an exception or try to fix the situation.
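
As a rough sketch of "recognize them and try to fix the situation", the following Python 3 code locates irregular six-byte sequences and rewrites them as regular four-byte sequences. The names IRREGULAR, find_irregular and repair_irregular are hypothetical, and the byte ranges are derived from the surrogate ranges U+D800..U+DBFF and U+DC00..U+DFFF rather than copied from the report.

```python
import re

# High surrogate U+D800..U+DBFF serializes as ED A0..AF 80..BF;
# low surrogate  U+DC00..U+DFFF serializes as ED B0..BF 80..BF.
IRREGULAR = re.compile(
    rb"\xed[\xa0-\xaf][\x80-\xbf]\xed[\xb0-\xbf][\x80-\xbf]"
)

def find_irregular(data):
    """Return the byte offsets of irregular six-byte sequences in data."""
    return [m.start() for m in IRREGULAR.finditer(data)]

def repair_irregular(data):
    """Rewrite each irregular sequence as the regular four-byte form."""
    def fix(match):
        # Decode the two three-byte halves into two surrogate code points.
        high, low = match.group().decode("utf-8", "surrogatepass")
        cp = 0x10000 + ((ord(high) - 0xD800) << 10) + (ord(low) - 0xDC00)
        return chr(cp).encode("utf-8")
    return IRREGULAR.sub(fix, data)

data = b"a\xed\xa0\x80\xed\xb0\x80b"
print(find_irregular(data))
print(repair_irregular(data))
```

Unpaired three-byte surrogates are deliberately left alone here; a strict UTF-8 decoder run afterwards would still reject them.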

> What exactly does it say that the conforming behaviour
> should be?

TR27 says: "Processes that require unique representation must not
interpret irregular UTF code unit sequences as characters. They may, for
example, reject or remove those sequences."

If I were going to implement a UTF-8 interpreter for Python, I would give
it a hook to optionally return a specific error condition on irregular
sequences.
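
One possible shape for such a hook, sketched in Python 3 under stated assumptions: decode_utf8, on_irregular, IrregularSequenceError and the three policy functions are all hypothetical names, not an existing Python API, and defaulting to rejection is a choice made for this sketch only.

```python
import re

# Three bytes for a high surrogate followed by three for a low surrogate.
IRREGULAR = re.compile(
    rb"\xed[\xa0-\xaf][\x80-\xbf]\xed[\xb0-\xbf][\x80-\xbf]"
)

class IrregularSequenceError(ValueError):
    """Hypothetical error type for irregular six-byte sequences."""

def reject(seq, offset):
    """Policy: report the irregular sequence as a specific error."""
    raise IrregularSequenceError(
        f"irregular UTF-8 sequence {seq.hex()} at byte {offset}"
    )

def remove(seq, offset):
    """Policy: drop the irregular sequence from the output."""
    return ""

def combine(seq, offset):
    """Policy: interpret the pair as one supplementary character."""
    high, low = seq.decode("utf-8", "surrogatepass")
    return chr(0x10000 + ((ord(high) - 0xD800) << 10) + (ord(low) - 0xDC00))

def decode_utf8(data, on_irregular=reject):
    """Decode data strictly, passing each irregular sequence to the hook."""
    out = []
    pos = 0
    for m in IRREGULAR.finditer(data):
        out.append(data[pos:m.start()].decode("utf-8"))  # regular stretch
        out.append(on_irregular(m.group(), m.start()))   # hook decides
        pos = m.end()
    out.append(data[pos:].decode("utf-8"))
    return "".join(out)
```

A process that requires unique representation would pass reject or remove; a lenient reader of legacy data might pass combine instead, for example decode_utf8(data, on_irregular=combine).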

If you still find the definitions and discussion in the technical report  
to be unclear, then the Unicode editorial committee would undoubtedly like  
to hear about it.

Rick




Re: [I18n-sig] Re: Unicode surrogates: just say no!

2001-06-27 Thread Peter_Constable


> If you still find the definitions and discussion in the technical report
> to be unclear, then the Unicode editorial committee would undoubtedly like
> to hear about it.

There is no question that there are still things that are unclear and
things that are anachronistic in the definitions. I have been told that the
editorial committee *is* aware of these things and is looking at them with
the intent to revise them for TUS 4.0.


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]