Re: cp1252 decoder implementation

2012-11-27 Thread Martin J. Dürst
On 2012/11/17 12:54, Buck Golemon wrote: On Fri, Nov 16, 2012 at 4:11 PM, Doug Ewell d...@ewellic.org wrote: Buck Golemon wrote: Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and to map it to the equally-non-semantic U+81 ? U+0081 (there are always at least four

RE: cp1252 decoder implementation

2012-11-24 Thread Shawn Steele
No-one would be more happy than me if we could just ditch all the legacy encodings and all switch to Unicode everywhere, but that will never happen. There is enough legacy content out there that will never be converted. That's sort of exactly the point: *NEW* content should be UTF-8 (or

RE: cp1252 decoder implementation

2012-11-24 Thread Shawn Steele
On behalf of Masatoshi Kimura. Sent: Wednesday, November 21, 2012 12:28 PM. To: unicode@unicode.org. Subject: Re: cp1252 decoder implementation. (2012/11/22 1:58), Shawn Steele wrote: We aren’t going to change names (since that’ll break anyone

Re: cp1252 decoder implementation

2012-11-23 Thread Buck Golemon
Subject: RE: cp1252 decoder implementation. Philippe commented: “(even if later Microsoft decides to map some other characters in its own windows-1252 charset, like it did several times and notably when the Euro symbol was mapped)”. Personal

Re: cp1252 decoder implementation

2012-11-23 Thread Doug Ewell
Buck Golemon wrote: The status of these 5 characters is already in the best fit mappings document pointed to by the IANA registry entry for windows-1252, which is as strong as I’m willing to go for them. I don't understand the relation between bestfit1252 and cp1252. Could you clarify it for me?

Re: cp1252 decoder implementation

2012-11-22 Thread Peter Krefting
On 2012-11-21 19:30:50, Doug Ewell d...@ewellic.org wrote: My problem is with the double standard. In some people's minds, if IE does it, it's called moronic or brain-dead. If the software with the biggest market share does it, then everyone else will have to follow it, no matter what you

Re: cp1252 decoder implementation

2012-11-22 Thread Andrew Miller
Netscape 1.0 RC 1 is available here: http://www.oldversion.com/Netscape.html

Re: cp1252 decoder implementation

2012-11-21 Thread Martin J. Dürst
On 2012/11/21 16:23, Peter Krefting wrote: Doug Ewell d...@ewellic.org: Somewhat off-topic, I find it amusing that tolerance of poorly encoded input is considered justification for changing the underlying standards, The encoding work at W3C, at least as far as I see it, is not an attempt to

RE: cp1252 decoder implementation

2012-11-21 Thread Shawn Steele
On behalf of Murray Sargent. Sent: Tuesday, November 20, 2012 8:55 PM. To: verd...@wanadoo.fr; Doug Ewell. Cc: Unicode Mailing List; Buck Golemon. Subject: RE: cp1252 decoder implementation. Philippe commented: “(even if later Microsoft decides to map some other characters in its own

RE: cp1252 decoder implementation

2012-11-21 Thread Doug Ewell
Philippe Verdy verdy underscore p at wanadoo dot fr wrote: But maybe we could ask Microsoft to officially map C1 controls onto the remaining holes of windows-1252, to help improve interoperability in HTML5 with predictable and stable behavior across HTML5 applications. In that case

RE: cp1252 decoder implementation

2012-11-21 Thread Doug Ewell
Peter Krefting peter at opera dot com wrote: Somewhat off-topic, I find it amusing that tolerance of poorly encoded input is considered justification for changing the underlying standards, when Internet Explorer has been flamed for years and years for tolerating bad input. It's called

Re: cp1252 decoder implementation

2012-11-21 Thread Philippe Verdy
Maybe you've forgotten FrontPage, a product acquired by Microsoft, then developed by Microsoft and widely promoted as part of Office, that insisted on declaring web pages as ISO 8859-1 even if they contained characters that are only in windows-1252. Even if we edited the page externally to

Re: cp1252 decoder implementation

2012-11-21 Thread Masatoshi Kimura
(2012/11/22 1:58), Shawn Steele wrote: We aren’t going to change names (since that’ll break anyone already using them), we probably won’t recognize new names (since anyone trying to use a new name wouldn’t work on millions of existing computers, so no one would add it). Hey, why Microsoft changed

RE: cp1252 decoder implementation

2012-11-21 Thread Doug Ewell
Philippe Verdy verdy underscore p at wanadoo dot fr wrote: Maybe you've forgotten FrontPage, a product acquired by Microsoft, then developed by Microsoft and widely promoted as part of Office, that insisted on declaring web pages as ISO 8859-1 even if they contained characters that are

Re: cp1252 decoder implementation

2012-11-20 Thread Doug Ewell
Buck Golemon buck at yelp dot com wrote: What effort has been spent? This is not an either/or type of proposition. If we can agree that it's an improvement (albeit small), let's update the mapping. Is it much harder than I believe it is? ISO/IEC 8859-1 is, uh, an ISO/IEC standard. CP1252 is

Re: cp1252 decoder implementation

2012-11-20 Thread Philippe Verdy
To solve the situation, it would be smarter if the W3C were not referencing the Microsoft standard itself but a standardized version of it, explaining explicitly how to handle the unassigned code positions. The W3C could describe the expected mapping of these positions explicitly in its own

RE: cp1252 decoder implementation

2012-11-20 Thread Murray Sargent
Philippe commented: (even if later Microsoft decides to map some other characters in its own windows-1252 charset, like it did several times and notably when the Euro symbol was mapped). Personal opinion, but I'd be very surprised if Microsoft ever changed the 1252 charset. The euro was added

Re: cp1252 decoder implementation

2012-11-20 Thread Philippe Verdy
But maybe we could ask Microsoft to officially map C1 controls onto the remaining holes of windows-1252, to help improve interoperability in HTML5 with predictable and stable behavior across HTML5 applications. In that case the W3C need not do anything else and there's no need to

Re: cp1252 decoder implementation

2012-11-20 Thread Peter Krefting
Doug Ewell d...@ewellic.org: Somewhat off-topic, I find it amusing that tolerance of poorly encoded input is considered justification for changing the underlying standards, when Internet Explorer has been flamed for years and years for tolerating bad input. It's called adapting to

Re: cp1252 decoder implementation

2012-11-20 Thread Andrew Cunningham
Hi. On 21 November 2012 16:42, Philippe Verdy verd...@wanadoo.fr wrote: But maybe we could ask Microsoft to officially map C1 controls onto the remaining holes of windows-1252, to help improve interoperability in HTML5 with predictable and stable behavior across HTML5 applications. In

Re: cp1252 decoder implementation

2012-11-18 Thread Philippe Verdy
The same chapter makes a normative reference to ISO/IEC 2022 for C0 controls; it does not say that this concerns ISO/IEC 8859 (which does not itself reference ISO/IEC 2022 as normative, but only as informational, just to say that it is compatible with it, as well as with ISO 6429, and a wide

Re: cp1252 decoder implementation

2012-11-18 Thread Buck Golemon
I find these to be true statements, but I don't see how they support or refute that which came before. On Sun, Nov 18, 2012 at 3:58 PM, Philippe Verdy verd...@wanadoo.fr wrote: The same chapter makes a normative reference to ISO/IEC 2022 for C0 controls; it does not say that this concerns

Re: cp1252 decoder implementation

2012-11-18 Thread Buck Golemon
On Sat, Nov 17, 2012 at 10:52 AM, Shawn Steele shawn.ste...@microsoft.com wrote: IMO this isn’t worth the effort being spent on it. MOST encodings have all sorts of interesting quirks, variations, OEM or App specific behavior, etc. These are a few code points that haven’t really caused much

RE: cp1252 decoder implementation

2012-11-18 Thread Shawn Steele
What effort has been spent? This is not an either/or type of proposition. If we can agree that it's an improvement (albeit small), let's update the mapping. Is it much harder than I believe it is? What if some application's treating it as undefined? And now the code page gets updated to

Re: cp1252 decoder implementation

2012-11-17 Thread Buck Golemon
So don't say that there are one-for-one equivalences. I was just quoting this section of the standard: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf There is a simple, one-to-one mapping between 7-bit (and 8-bit) control codes and the Unicode control codes: every 7-bit (or 8-bit)

RE: cp1252 decoder implementation

2012-11-17 Thread Shawn Steele
On behalf of Buck Golemon. Sent: Saturday, November 17, 2012 8:35 AM. To: verd...@wanadoo.fr. Cc: Doug Ewell; unicode. Subject: Re: cp1252 decoder implementation. So don't say that there are one-for-one equivalences. I was just quoting

cp1252 decoder implementation

2012-11-16 Thread Buck Golemon
cp1252 (aka windows-1252) defines 27 characters which iso-8859-1 does not. This leaves five bytes with undefined semantics. Currently the python cp1252 decoder allows us to ignore/replace/error on these bytes, but there's no facility for allowing these unknown bytes to round-trip through the
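The behaviour described in this first post can be reproduced with a short sketch. It assumes Python 3 and its built-in "cp1252" codec; the helper name `undefined_cp1252_bytes` is made up for illustration and does not come from the thread.

```python
def undefined_cp1252_bytes():
    """Return the byte values that Python's cp1252 codec refuses to decode."""
    holes = []
    for value in range(256):
        try:
            bytes([value]).decode("cp1252")  # strict error handling is the default
        except UnicodeDecodeError:
            holes.append(value)
    return holes


sample = b"caf\xe9 \x81"  # 0xE9 is defined in cp1252, 0x81 is not

ignored = sample.decode("cp1252", errors="ignore")    # silently drops the 0x81 byte
replaced = sample.decode("cp1252", errors="replace")  # substitutes U+FFFD for it

try:
    sample.decode("cp1252")  # strict mode raises on the undefined byte
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

None of the three standard handlers keeps enough information to recover the original byte, which appears to be the gap the post is pointing at.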

Re: cp1252 decoder implementation

2012-11-16 Thread Buck Golemon
So I find that the unicode.org cp1252 file leaves those bytes undefined as well, so the issue stems from there. ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and to map it to the equally-non-semantic

Re: cp1252 decoder implementation

2012-11-16 Thread Doug Ewell
Buck Golemon wrote: Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and to map it to the equally-non-semantic U+81 ? This would allow systems that follow the html5 standard and use cp1252 in place of latin1 to continue to be binary-faithful and reversible. This isn't quite
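The mapping asked about here can be sketched as follows. This is only an illustration, assuming Python 3: the function names and the `CP1252_HOLES` constant are hypothetical, and the result is not the cp1252 defined by the unicode.org mapping file, which leaves these five bytes undefined.

```python
# Byte values that Python's cp1252 codec treats as undefined.
CP1252_HOLES = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def decode_cp1252_c1(data):
    """Hypothetical decoder: cp1252, with each hole mapped to the same-numbered C1 control."""
    return "".join(
        chr(b) if b in CP1252_HOLES else bytes([b]).decode("cp1252")
        for b in data
    )


def encode_cp1252_c1(text):
    """Inverse of decode_cp1252_c1 for any text that decoder produced."""
    return b"".join(
        bytes([ord(ch)]) if ord(ch) in CP1252_HOLES else ch.encode("cp1252")
        for ch in text
    )


every_byte = bytes(range(256))
round_tripped = encode_cp1252_c1(decode_cp1252_c1(every_byte))
```

With this variant every possible byte string survives a decode and re-encode unchanged, which is the "binary-faithful and reversible" property the question is after. Whether such a mapping should be called cp1252 at all is what the rest of the thread debates.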

Re: cp1252 decoder implementation

2012-11-16 Thread martin
Quoting Buck Golemon b...@yelp.com: cp1252 (aka windows-1252) defines 27 characters which iso-8859-1 does not. This leaves five bytes with undefined semantics. Currently the python cp1252 decoder allows us to ignore/replace/error on these bytes, but there's no facility for allowing these

RE: cp1252 decoder implementation

2012-11-16 Thread Shawn Steele
Subject: Re: cp1252 decoder implementation. Buck Golemon wrote: Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and to map it to the equally-non-semantic U+81 ? This would allow systems that follow the html5 standard and use cp1252 in place of latin1

Re: cp1252 decoder implementation

2012-11-16 Thread Buck Golemon
1) I did this and was criticized for inventing my own frankensteined encoding, although I believe it's conceptually consistent with the idea that cp1252 is to be used as a superset of latin1. It's true that what I wrote is not consistent with the unicode.org definition:

Re: cp1252 decoder implementation

2012-11-16 Thread Buck Golemon
On Fri, Nov 16, 2012 at 4:11 PM, Doug Ewell d...@ewellic.org wrote: Buck Golemon wrote: Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and to map it to the equally-non-semantic U+81 ? This would allow systems that follow the html5 standard and use cp1252 in place of

Re: cp1252 decoder implementation

2012-11-16 Thread Doug Ewell
Buck Golemon wrote: This isn't quite as black-and-white as the question about Latin-1. If you are targeting HTML5, you are probably safe in treating an incoming 0x81 (for example) as either U+0081 or U+FFFD, or throwing some kind of error. Why do you make this conditional on targeting html5?

Re: cp1252 decoder implementation

2012-11-16 Thread Philippe Verdy
If you are thinking about byte values you are working at the encoding scheme level (in fact another lower level which defines a protocol presentation layer, e.g. transport syntaxes in MIME). Unicode codepoints are conceptually not an encoding scheme, just a coded character set (independent of the