Re: latin1 decoder implementation

2012-11-19 Thread Peter Krefting
Doug Ewell d...@ewellic.org: If he is targeting HTML5, then none of this matters, because HTML5 says that ISO 8859-1 is really Windows-1252. For example, there is no C1 control called NL in Windows-1252. There is only 0x85, which maps to U+2026 HORIZONTAL ELLIPSIS. Windows-1252, does,
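The 0x85 difference mentioned here can be checked directly. A minimal sketch, assuming Python's built-in `latin-1` and `cp1252` codecs as stand-ins for ISO 8859-1 (with control positions passed through) and Windows-1252:

```python
# Decode the single byte 0x85 with two of Python's built-in codecs.
raw = b"\x85"

as_latin1 = raw.decode("latin-1")  # lands on the C1 control position U+0085
as_cp1252 = raw.decode("cp1252")   # lands on U+2026 HORIZONTAL ELLIPSIS

print(f"latin-1: U+{ord(as_latin1):04X}")
print(f"cp1252:  U+{ord(as_cp1252):04X}")
```

The same byte therefore yields different code points depending on which decoder the label is bound to.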

Re: latin1 decoder implementation

2012-11-19 Thread Martin J. Dürst
On Friday, November 16, 2012 17:35, Philippe Verdy wrote: In fact not really, because Unicode DOES assign more precise semantics to a few of these controls, notably for those given

Re: latin1 decoder implementation

2012-11-19 Thread Philippe Verdy
2012/11/19 Martin J. Dürst due...@it.aoyama.ac.jp Note also that the W3C does not automatically endorse the Unicode and ISO/IEC 10646 standards either (there's a delay before it accepts newer releases of TUS and ISO/IEC 10646, and the W3C frequently adds several restrictions). Can

Re: latin1 decoder implementation

2012-11-18 Thread Philippe Verdy
I think that Python will instead provide a factory that returns the appropriate concrete codec when given an encoding code and the standards body to which it must conform: ISO, IETF (for MIME and the IANA database, as specified in RFCs), W3C (for HTML5), and possibly other private

Re: latin1 decoder implementation

2012-11-17 Thread Martin J. Dürst
Just in case it helps, Ruby (since version 1.9) also uses 3). Regards, Martin. On 2012/11/17 6:48, Buck Golemon wrote: When decoding bytes to unicode using the latin1 scheme, there are three options for bytes not defined in the ISO-8859-1 standard. 1) Throw an error. 2) Insert the

Re: latin1 decoder implementation

2012-11-17 Thread Martin J. Dürst
On Friday, November 16, 2012 17:35, Philippe Verdy wrote: In fact not really, because Unicode DOES assign more precise semantics to a few of these controls, notably for those given whitespace and newline

Re: latin1 decoder implementation

2012-11-17 Thread Doug Ewell
Martin J. Dürst wrote: If he is targeting HTML5, then none of this matters, because HTML5 says that ISO 8859-1 is really Windows-1252. Yes. But unless Python wants to limit its use to HTML5, this should be handled on a separate level (mapping an iso-8859-1 label to the Windows-1252 decoder
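One way to picture handling this "on a separate level" is a label-resolution step that sits in front of the decoders. This is only a sketch: `HTML5_LABEL_OVERRIDES`, `resolve_codec`, and `decode` are hypothetical names, the override table is illustrative and incomplete, and nothing here is an existing Python API beyond `bytes.decode`.

```python
# Hypothetical override table: labels that an HTML5-targeted caller would
# redirect to the Windows-1252 decoder. Illustrative, not exhaustive.
HTML5_LABEL_OVERRIDES = {
    "iso-8859-1": "cp1252",
    "latin1": "cp1252",
}

def resolve_codec(label: str, html5: bool = False) -> str:
    """Return the Python codec name to use for an encoding label."""
    key = label.strip().lower()
    if html5:
        # Only the HTML5 profile rewrites the label; the codecs stay untouched.
        return HTML5_LABEL_OVERRIDES.get(key, key)
    return key

def decode(data: bytes, label: str, html5: bool = False) -> str:
    """Decode data using the codec chosen by resolve_codec."""
    return data.decode(resolve_codec(label, html5))

print(repr(decode(b"\x85", "ISO-8859-1")))              # plain Latin-1 decoder
print(repr(decode(b"\x85", "ISO-8859-1", html5=True)))  # Windows-1252 decoder
```

Keeping the override in the label lookup leaves the `latin-1` codec itself usable for non-HTML5 callers.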

Re: latin1 decoder implementation

2012-11-16 Thread Buck Golemon
When decoding bytes to unicode using the latin1 scheme, there are three options for bytes not defined in the ISO-8859-1 standard. 1) Throw an error. 2) Insert the replacement glyph (U+FFFD), indicating an unknown character. 3) Insert the unicode character with equal value. This means that
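The three options can be sketched as modes of one small decoder. This is an illustration only: `decode_latin1` and its `undefined` parameter are hypothetical, and "not defined" is assumed to mean every byte outside the graphic ranges 0x20..0x7E and 0xA0..0xFF that Ken Whistler cites elsewhere in the thread.

```python
# Bytes that ISO 8859-1 defines as graphic characters (assumption: everything
# else counts as "not defined" for the purposes of this sketch).
GRAPHIC = set(range(0x20, 0x7F)) | set(range(0xA0, 0x100))

def decode_latin1(data: bytes, undefined: str = "passthrough") -> str:
    """Decode bytes as Latin-1, handling non-graphic bytes per `undefined`.

    "error"       -> option 1: raise on the first non-graphic byte
    "replace"     -> option 2: substitute U+FFFD
    "passthrough" -> option 3: use the code point of equal value
    """
    if undefined not in ("error", "replace", "passthrough"):
        raise ValueError(f"unknown mode: {undefined!r}")
    out = []
    for b in data:
        if b in GRAPHIC or undefined == "passthrough":
            out.append(chr(b))
        elif undefined == "replace":
            out.append("\ufffd")
        else:
            raise ValueError(f"byte 0x{b:02X} is not a graphic character in ISO 8859-1")
    return "".join(out)

print(repr(decode_latin1(b"a\x81\xe9")))
print(repr(decode_latin1(b"a\x81\xe9", undefined="replace")))
```

Note that under this literal reading, option 1 would also reject tabs and newlines, since they too lie outside the graphic ranges.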

Re: latin1 decoder implementation

2012-11-16 Thread Buck Golemon
That's my personal understanding as well, but can you help me find documentation that I can show to my skeptical workmates? On Fri, Nov 16, 2012 at 2:11 PM, Doug Ewell d...@ewellic.org wrote: Code points U+0000 through U+00FF in Unicode are identical to the corresponding code points 0x00

Re: latin1 decoder implementation

2012-11-16 Thread Doug Ewell
Buck Golemon wrote: Code points U+0000 through U+00FF in Unicode are identical to the corresponding code points 0x00 through 0xFF in ISO 8859-1. That's my personal understanding as well, but can you help me find documentation that I can show to my skeptical workmates? You can quote the

Re: latin1 decoder implementation

2012-11-16 Thread Michael Everson
On 16 Nov 2012, at 22:12, Buck Golemon b...@yelp.com wrote: That's my personal understanding as well, but can you help me find documentation that I can show to my skeptical workmates? It is the basis for most popular 8-bit character sets, including Windows-1252 and the first block of

Re: latin1 decoder implementation

2012-11-16 Thread Buck Golemon
I actually did quote that, to no avail. This seems to be the missing information though (from the Wikipedia ISO-8859-1 article, http://en.wikipedia.org/wiki/ISO/IEC_8859-1): In 1992, the IANA (http://en.wikipedia.org/wiki/Internet_Assigned_Numbers_Authority) registered the character map

RE: latin1 decoder implementation

2012-11-16 Thread Whistler, Ken
The first 256 characters of the Unicode Standard *are* compatible with ISO/IEC 8859-1 (Latin-1), but you need to distinguish what happens for the graphic characters from what happens for the control codes. ISO 8859-1 defines *graphic* characters in the ranges 0x20..0x7E, 0xA0..0xFF. Those are

Re: latin1 decoder implementation

2012-11-16 Thread Michael Everson
On 16 Nov 2012, at 22:23, Buck Golemon b...@yelp.com wrote: I actually did quote that, to no avail. Your workmates have no reason to be sceptical. Michael Everson * http://www.evertype.com/

Re: latin1 decoder implementation

2012-11-16 Thread Jukka K. Korpela
2012-11-17 0:20, Michael Everson wrote: On 16 Nov 2012, at 22:12, Buck Golemon b...@yelp.com wrote: That's my personal understanding as well, but can you help me find documentation that I can show to my skeptical workmates? It is the basis for most popular 8-bit character sets, including

RE: latin1 decoder implementation

2012-11-16 Thread Phillips, Addison
That's my personal understanding as well, but can you help me find documentation that I can show to my skeptical workmates? On Fri, Nov 16, 2012 at 2:11 PM, Doug Ewell d...@ewellic.org wrote: Code points U+0000 through U+00FF in Unicode

RE: latin1 decoder implementation

2012-11-16 Thread Whistler, Ken
Actually, what Buck really needs is Section 16.1 Control Codes: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf That explains the situation for the *non* graphic characters in the range U+0000..U+00FF, which is the source of the concern for Buck's skeptical workmates, I'm sure. --Ken

Re: latin1 decoder implementation

2012-11-16 Thread Doug Ewell
To no avail? I’m not sure why your colleagues would not believe the statement taken right out of the standard, or the mapping file taken from the Unicode Web site, but would believe the Wikipedia article. If they think there is a mismatch, where do they think it is? -- Doug Ewell | Thornton,

Re: latin1 decoder implementation

2012-11-16 Thread Michael Everson
On 16 Nov 2012, at 22:31, Jukka K. Korpela jkorp...@cs.tut.fi wrote: That’s not really adequate. Wikipedia is neither authoritative nor accurate, and should not be cited in any dispute, except as opinions of unnamed people. No, but it isn't rocket science to look at a Latin 1 table and the

Re: latin1 decoder implementation

2012-11-16 Thread Buck Golemon
latin1 explicitly doesn't define characters (or control codes) in those ranges, but unicode does. It doesn't directly follow that decoding a byte in those undefined ranges produces a Unicode code point of equal value. On Fri, Nov 16, 2012 at 2:36 PM, Doug Ewell d...@ewellic.org wrote: To no

RE: latin1 decoder implementation

2012-11-16 Thread Whistler, Ken
An IANA-registered character *map* is a very different animal from a character encoding standard per se. The actual character encoding standard, ISO/IEC 8859-1:1998 does not define the C0 and C1 control codes (and never will). That was what I was quoting from. A mapping table, on the other

Re: latin1 decoder implementation

2012-11-16 Thread Doug Ewell
Buck Golemon wrote: latin1 explicitly doesn't define characters (or control codes) in those ranges, but unicode does. See Ken's comment about Chapter 16. Both ISO 8859-1 and Unicode defer the *actual interpretation* of control characters to ISO 6429, which is what you are looking for. --

Re: latin1 decoder implementation

2012-11-16 Thread Buck Golemon
Thanks all. My current understanding is this: Latin1 explicitly gives no semantics to several byte values (for example 0x81), but acknowledges that other standards will define their semantics. Unicode provides code-points with equally-undefined semantics so that these bytes can pass through without

Re: latin1 decoder implementation

2012-11-16 Thread Doug Ewell
Buck Golemon wrote: Latin1 explicitly gives no semantics to several byte values (for example 0x81), but acknowledges that other standards will define their semantics. Unicode provides code-points with equally-undefined semantics so that these bytes can pass through without change. This allows a

RE: latin1 decoder implementation

2012-11-16 Thread Whistler, Ken
No, Unicode doesn't. But yes, it *does* follow that decoding C0/C1 control codes produces a Unicode code point of equal value. RTFM. TUS 6.2, p. 544: There are 65 code points set aside in the Unicode Standard for compatibility with the C0 and C1 control codes defined in the ISO/IEC 2022
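Both points, equal-value decoding and the count of 65, can be checked mechanically. A minimal sketch, assuming Python's built-in `latin-1` codec and the `unicodedata` module; it shows what that codec and Python's copy of the Unicode data do and is not a substitute for the text of the standard:

```python
import unicodedata

# Decode every possible byte value with Python's latin-1 codec.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# Each byte lands on the code point of equal value, and the trip is reversible.
identity = [ord(ch) for ch in decoded] == list(range(256))
round_trip = decoded.encode("latin-1") == all_bytes

# General category Cc (control) within U+0000..U+00FF:
# C0 (U+0000..U+001F), DEL (U+007F), and C1 (U+0080..U+009F).
controls = [ch for ch in decoded if unicodedata.category(ch) == "Cc"]

print(identity, round_trip, len(controls))
```

The 65 controls found this way are the 32 C0 codes, DEL, and the 32 C1 codes.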

RE: latin1 decoder implementation

2012-11-16 Thread Whistler, Ken
Yep. --Ken Latin1 explicitly gives no semantics to several byte values (for example 0x81), but acknowledges that other standards will define their semantics. Unicode provides code-points with equally-undefined semantics so that these bytes can pass through without change. This allows a

Re: latin1 decoder implementation

2012-11-16 Thread Philippe Verdy
In fact not really, because Unicode DOES assign more precise semantics to a few of these controls, notably for those given whitespace and newline properties (notably TAB, LF, CR in C0 controls and NL in C1 controls, with a few additional constraints for the CR+LF sequence) as they are part of
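The whitespace and newline behaviour of these controls is visible in Python's string methods. A small sketch, assuming `str.isspace` and `str.splitlines` as one implementation's reading of the Unicode properties (the variable names are illustrative):

```python
# The controls named above, plus 0x81 as a C1 control with no such property.
named_controls = {"TAB": "\t", "LF": "\n", "CR": "\r", "NL": "\x85"}

for name, ch in named_controls.items():
    print(f"{name} U+{ord(ch):04X} isspace={ch.isspace()}")

plain_c1 = "\x81"
print(f"U+0081 isspace={plain_c1.isspace()}")

# NL (U+0085) is treated as a line boundary, and CR+LF counts as one break.
lines = "one\x85two\r\nthree".splitlines()
print(lines)
```

So a decoder that passes 0x85 through as U+0085 hands downstream code a character that does carry line-breaking behaviour, unlike most other C1 positions.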

Re: latin1 decoder implementation

2012-11-16 Thread Doug Ewell
On Friday, November 16, 2012 17:35, Philippe Verdy wrote: In fact not really, because Unicode DOES assign more precise semantics to a few of these controls, notably

Re: latin1 decoder implementation

2012-11-16 Thread Philippe Verdy
On Friday, November 16, 2012 17:35, Philippe Verdy wrote: In fact not really, because Unicode DOES assign more precise semantics to a few of these controls, notably for those given