Doug Ewell d...@ewellic.org:
If he is targeting HTML5, then none of this matters, because HTML5 says
that ISO 8859-1 is really Windows-1252.
For example, there is no C1 control called NL in Windows-1252. There is
only 0x85, which maps to U+2026 HORIZONTAL ELLIPSIS.
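The difference at byte 0x85 can be checked directly. The following is a small sketch, assuming Python 3 and its built-in `latin-1` and `cp1252` codecs (Python's names for ISO 8859-1 and Windows-1252); the variable names are only illustrative.

```python
# Decode the single byte 0x85 under both codecs.
raw = b"\x85"

as_latin1 = raw.decode("latin-1")  # same-valued code point: U+0085, a C1 control
as_cp1252 = raw.decode("cp1252")   # Windows-1252: U+2026 HORIZONTAL ELLIPSIS

print(f"latin-1: U+{ord(as_latin1):04X}")
print(f"cp1252:  U+{ord(as_cp1252):04X}")
```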
Windows-1252, does,
| @DougEwell
From: Philippe Verdy
Sent: Friday, November 16, 2012 17:35
To: Whistler, Ken
Cc: Buck Golemon ; unicode@unicode.org
Subject: Re: latin1 decoder implementation
In fact not really, because Unicode DOES assign more precise semantics to
a few of these controls, notably for those given
2012/11/19 Martin J. Dürst due...@it.aoyama.ac.jp
Note also that the W3C
does not automatically endorse the Unicode and ISO/IEC 10646 standards
either (there is a delay before it accepts newer releases of TUS and ISO/IEC
10646, and the W3C frequently adds restrictions of its own).
Can
I think that Python will provide instead a factory that will return the
appropriate concrete codec when given an encoding code and the standard
body to which it must conform: ISO, IETF (for MIME and the IANA
database, as specified in RFCs), W3C (for HTML5), and possibly other
private
Just in case it helps, Ruby (since version 1.9) also uses 3).
Regards, Martin.
On 2012/11/17 6:48, Buck Golemon wrote:
When decoding bytes to unicode using the latin1 scheme, there are three
options for bytes not defined in the ISO-8859-1 standard.
1) Throw an error.
2) Insert the
From: Philippe Verdy
Sent: Friday, November 16, 2012 17:35
To: Whistler, Ken
Cc: Buck Golemon ; unicode@unicode.org
Subject: Re: latin1 decoder implementation
In fact not really, because Unicode DOES assign more precise semantics
to a few of these controls, notably for those given whitespace and
newline
Martin J. Dürst wrote:
If he is targeting HTML5, then none of this matters, because HTML5
says that ISO 8859-1 is really Windows-1252.
Yes. But unless Python wants to limit its use to HTML5, this should be
handled on a separate level (mapping an iso-8859-1 label to the
Windows-1252 decoder
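One way to picture such a separate level is a label-resolution step that runs before any decoder is chosen. The sketch below is hypothetical: `resolve_label` and `HTML5_OVERRIDES` are names invented here, not part of Python's `codecs` module, and the table lists only the one override mentioned in this thread.

```python
import codecs

# Hypothetical override table: labels that HTML5 treats as another encoding.
# Keys are normalized codec names so that aliases such as "latin1" also match.
HTML5_OVERRIDES = {codecs.lookup("iso-8859-1").name: "cp1252"}


def resolve_label(label: str, html5: bool = False) -> str:
    """Return the Python codec name to use for an encoding label."""
    name = codecs.lookup(label).name  # normalizes aliases and letter case
    if html5:
        return HTML5_OVERRIDES.get(name, name)
    return name


print(resolve_label("ISO-8859-1"))              # the plain ISO 8859-1 codec
print(resolve_label("ISO-8859-1", html5=True))  # cp1252
```

Keeping the override outside the decoder leaves the ISO 8859-1 codec itself unchanged for callers that are not targeting HTML5.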
When decoding bytes to unicode using the latin1 scheme, there are three
options for bytes not defined in the ISO-8859-1 standard.
1) Throw an error.
2) Insert the replacement glyph (fffd), indicating an unknown character.
3) Insert the unicode character with equal value. This means that
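The three behaviours can be observed in Python 3. This is only a sketch: Python's own `latin-1` codec always takes option 3 and never fails, so options 1 and 2 are shown with the `cp1252` codec, which leaves byte 0x81 unassigned. The variable names are illustrative.

```python
raw = b"\x81"  # a byte that Python's cp1252 codec leaves unassigned

# 1) Throw an error: strict decoding with cp1252 rejects 0x81.
try:
    raw.decode("cp1252")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 2) Insert the replacement character U+FFFD.
replaced = raw.decode("cp1252", errors="replace")

# 3) Insert the code point with equal value, which is what latin-1 does.
passed_through = raw.decode("latin-1")

print(strict_failed, f"U+{ord(replaced):04X}", f"U+{ord(passed_through):04X}")
```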
That's my personal understanding as well, but can you help me find
documentation that I can show to my skeptical workmates?
On Fri, Nov 16, 2012 at 2:11 PM, Doug Ewell d...@ewellic.org wrote:
Code points U+0000 through U+00FF in Unicode are identical to the
corresponding code points 0x00
Buck Golemon wrote:
Code points U+0000 through U+00FF in Unicode are identical to the
corresponding code points 0x00 through 0xFF in ISO 8859-1.
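As a quick check of that correspondence, the sketch below decodes every possible byte with Python 3's `latin-1` codec and compares each resulting code point with the byte value. It demonstrates the behaviour of Python's codec, not the text of either standard; the variable names are illustrative.

```python
# Every byte value from 0x00 through 0xFF.
all_bytes = bytes(range(256))

decoded = all_bytes.decode("latin-1")

# Each decoded character should have the same numeric value as its byte.
identity_holds = [ord(ch) for ch in decoded] == list(range(256))

# Encoding again should give back the original bytes unchanged.
round_trips = decoded.encode("latin-1") == all_bytes

print(identity_holds, round_trips)
```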
That's my personal understanding as well, but can you help me find
documentation that I can show to my skeptical workmates?
You can quote the
On 16 Nov 2012, at 22:12, Buck Golemon b...@yelp.com wrote:
That's my personal understanding as well, but can you help me find
documentation that I can show to my skeptical workmates?
It is the basis for most popular 8-bit character sets, including Windows-1252
and the first block of
I actually did quote that, to no avail.
This seems to be the missing information though (from the wikipedia
iso-8859-1 article http://en.wikipedia.org/wiki/ISO/IEC_8859-1):
In 1992, the IANA registered the character map
The first 256 characters of the Unicode Standard *are* compatible with ISO/IEC
8859-1 (Latin-1), but you need to distinguish what happens for the graphic
characters from what happens for the control codes.
ISO 8859-1 defines *graphic* characters in the ranges 0x20..0x7E, 0xA0..0xFF.
Those are
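The split between graphic characters and everything else can be expressed as a small predicate. This is a sketch based only on the two ranges quoted above; `is_latin1_graphic` is a hypothetical helper name.

```python
def is_latin1_graphic(byte: int) -> bool:
    """True if ISO 8859-1 defines a graphic character at this byte value."""
    return 0x20 <= byte <= 0x7E or 0xA0 <= byte <= 0xFF


# Byte values outside the graphic ranges: 0x00..0x1F, 0x7F and 0x80..0x9F.
non_graphic = [b for b in range(256) if not is_latin1_graphic(b)]

print(len(non_graphic))
```

The 65 byte values left over are exactly the positions occupied by the C0 controls, DEL and the C1 controls.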
On 16 Nov 2012, at 22:23, Buck Golemon b...@yelp.com wrote:
I actually did quote that, to no avail.
Your workmates have no reason to be sceptical.
Michael Everson * http://www.evertype.com/
2012-11-17 0:20, Michael Everson wrote:
On 16 Nov 2012, at 22:12, Buck Golemon b...@yelp.com wrote:
That's my personal understanding as well, but can you help me find
documentation that I can show to my skeptical workmates?
It is the basis for most popular 8-bit character sets, including
Subject: Re: latin1 decoder implementation
That's my personal understanding as well, but can you help me find
documentation that I can show to my skeptical workmates?
On Fri, Nov 16, 2012 at 2:11 PM, Doug Ewell
d...@ewellic.org wrote:
Code points U+0000 through U+00FF in Unicode
Actually, what Buck really needs is Section 16.1 Control Codes:
http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf
That explains the situation for the *non* graphic characters in the range
U+0000..U+00FF, which is the source of the concern for Buck's skeptical
workmates, I'm sure.
--Ken
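The control code points in that range can be listed with Python's `unicodedata` module, which exposes the General_Category property from the Unicode Character Database. This is a minimal sketch; it reports what the database bundled with the running Python version says, and the variable names are illustrative.

```python
import unicodedata

# Code points in U+0000..U+00FF whose General_Category is Cc (control).
controls = [cp for cp in range(0x100) if unicodedata.category(chr(cp)) == "Cc"]

c0_and_del = [cp for cp in controls if cp <= 0x7F]
c1 = [cp for cp in controls if cp >= 0x80]

print(len(controls), len(c0_and_del), len(c1))
```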
To no avail? I’m not sure why your colleagues would not believe the statement
taken right out of the standard, or the mapping file taken from the Unicode Web
site, but would believe the Wikipedia article.
If they think there is a mismatch, where do they think it is?
--
Doug Ewell | Thornton,
On 16 Nov 2012, at 22:31, Jukka K. Korpela jkorp...@cs.tut.fi wrote:
That’s not really adequate. Wikipedia is neither authoritative nor accurate,
and should not be cited in any dispute, except as opinions of unnamed people.
No, but it isn't rocket science to look at a Latin 1 table and the
latin1 explicitly doesn't define characters (or control codes) in those
ranges, but unicode does.
It doesn't directly follow that decoding a byte in those undefined ranges
produces a unicode-point of equal value.
On Fri, Nov 16, 2012 at 2:36 PM, Doug Ewell d...@ewellic.org wrote:
To no
An IANA-registered character *map* is a very different animal from a character
encoding standard per se.
The actual character encoding standard, ISO/IEC 8859-1:1998 does not define the
C0 and C1 control codes (and never will). That was what I was quoting from.
A mapping table, on the other
Buck Golemon wrote:
latin1 explicitly doesn't define characters (or control codes) in
those ranges, but unicode does.
See Ken's comment about Chapter 16. Both ISO 8859-1 and Unicode defer
the *actual interpretation* of control characters to ISO 6429, which is
what you are looking for.
--
Thanks all. My current understanding is as follows:
Latin1 explicitly gives no semantics to several byte values (for example
0x81), but acknowledges that other standards will define their semantics.
Unicode provides code-points with equally-undefined semantics so that these
bytes can pass through without
Buck Golemon wrote:
Latin1 explicitly gives no semantics to several byte values (for
example 0x81), but acknowledges that other standards will define their
semantics.
Unicode provides code-points with equally-undefined semantics so that
these bytes can pass through without change.
This allows a
No, Unicode doesn't. But yes, it *does* follow that decoding C0/C1 control codes
produces a Unicode code point of equal value. RTFM. TUS 6.2, p. 544:
There are 65 code points set aside in the Unicode Standard for compatibility
with the C0 and C1 control codes defined in the ISO/IEC 2022
Yep.
--Ken
Latin1 explicitly gives no semantics to several byte values (for example 0x81),
but acknowledges that other standards will define their semantics.
Unicode provides code-points with equally-undefined semantics so that these
bytes can pass through without change.
This allows a
In fact not really, because Unicode DOES assign more precise semantics to a
few of these controls, notably for those given whitespace and newline
properties (notably TAB, LF, CR in C0 controls and NL in C1 controls, with
a few additional constraints for the CR+LF sequence) as they are part of
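Some of these properties are visible through Python 3's string methods. The sketch below uses `str.isspace` and `str.splitlines`; it illustrates how Python treats these characters and is not a statement of the Unicode algorithms themselves. The variable names are illustrative.

```python
tab, lf, cr, nel = "\t", "\n", "\r", "\x85"  # NL (NEL) is U+0085 in the C1 range

# All four of these controls are reported as whitespace.
whitespace_flags = [ch.isspace() for ch in (tab, lf, cr, nel)]

# Another C1 control, such as U+0081, is not.
other_c1_is_space = "\x81".isspace()

# splitlines() treats U+0085 as a line boundary and CR+LF as a single one.
lines = "one\x85two\r\nthree".splitlines()

print(whitespace_flags, other_c1_is_space, lines)
```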
://www.ewellic.org | @DougEwell
From: Philippe Verdy
Sent: Friday, November 16, 2012 17:35
To: Whistler, Ken
Cc: Buck Golemon ; unicode@unicode.org
Subject: Re: latin1 decoder implementation
In fact not really, because Unicode DOES assign more precise semantics
to a few of these controls, notably