Re: latin1 decoder implementation

2012-11-17 Thread Martin J. Dürst
Just in case it helps, Ruby (since version 1.9) also uses 3). Regards, Martin. On 2012/11/17 6:48, Buck Golemon wrote: When decoding bytes to unicode using the latin1 scheme, there are three options for bytes not defined in the ISO-8859-1 standard. 1) Throw an error. 2) Insert the
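
The snippet above is cut off after the second option, so what follows does not reconstruct the original list. It is only a minimal sketch of how Python 3's built-in `latin-1` and `cp1252` codecs treat bytes in the 0x80-0x9F range; the sample bytes and the helper name `decode_cp1252` are made up for illustration.

```python
# Sample input: 0xE9 is defined in both encodings; 0x80 and 0x81 fall in the
# 0x80-0x9F range, where ISO-8859-1 defines no graphic characters.
data = b"caf\xe9 \x80 \x81"

# Python's latin-1 codec maps every byte to the code point with the same value,
# so 0x80 and 0x81 come out as U+0080 and U+0081.
as_latin1 = data.decode("latin-1")


def decode_cp1252(raw, errors="strict"):
    """Hypothetical helper: decode with Python's cp1252 codec."""
    return raw.decode("cp1252", errors=errors)


# Python's cp1252 codec maps 0x80 to the euro sign but leaves 0x81 undefined,
# so strict decoding raises an error.
try:
    decode_cp1252(data)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With errors="replace", the undefined byte becomes U+FFFD instead.
replaced = decode_cp1252(data, errors="replace")
```

In this sketch the strict call corresponds to throwing an error, and the `latin-1` result corresponds to passing each byte through as the code point of the same value.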

Re: latin1 decoder implementation

2012-11-17 Thread Martin J. Dürst
On 2012/11/17 9:45, Doug Ewell wrote: If he is targeting HTML5, then none of this matters, because HTML5 says that ISO 8859-1 is really Windows-1252. Yes. But unless Python wants to limit its use to HTML5, this should be handled on a separate level (mapping an iso-8859-1 label to the
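
One way to picture the "separate level" mentioned above is a label-resolution step that sits in front of the unchanged decoders. This is a rough sketch under assumptions: the table `HTML5_LABEL_OVERRIDES` and the function `decode_labelled` are hypothetical names, and the table lists only a few of the labels that HTML5 treats as Windows-1252.

```python
# Hypothetical, deliberately incomplete override table: labels that HTML5
# resolves to Windows-1252, mapped to Python's codec name for it.
HTML5_LABEL_OVERRIDES = {
    "iso-8859-1": "cp1252",
    "latin1": "cp1252",
    "us-ascii": "cp1252",
}


def decode_labelled(data, label, html5=False):
    """Decode bytes using a declared label.

    The label is rewritten only when html5=True; the underlying
    latin-1 and cp1252 decoders are left untouched.
    """
    name = label.strip().lower()
    if html5:
        name = HTML5_LABEL_OVERRIDES.get(name, name)
    return data.decode(name)


plain = decode_labelled(b"\x80", "ISO-8859-1")             # U+0080
web = decode_labelled(b"\x80", "ISO-8859-1", html5=True)   # euro sign
```

Keeping the override in the label lookup means callers outside an HTML5 context still get the plain ISO-8859-1 behaviour.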

Re: cp1252 decoder implementation

2012-11-17 Thread Buck Golemon
So don't say that there are one-for-one equivalences. I was just quoting this section of the standard: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf There is a simple, one-to-one mapping between 7-bit (and 8-bit) control codes and the Unicode control codes: every 7-bit (or 8-bit)
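
The quoted passage about control codes can be checked against Python's `latin-1` codec and the `unicodedata` module. This is only an illustrative check, not part of the original message; the variable names are invented here.

```python
import unicodedata

# The 7-bit control codes (0x00-0x1F and 0x7F) and the 8-bit ones (0x80-0x9F).
control_bytes = list(range(0x00, 0x20)) + [0x7F] + list(range(0x80, 0xA0))

# Decode each byte on its own with latin-1 and record the resulting character.
decoded = {b: bytes([b]).decode("latin-1") for b in control_bytes}

# Each byte value should equal the code point it decodes to ...
same_value = all(ord(ch) == b for b, ch in decoded.items())

# ... and each resulting character should have general category Cc (control).
all_controls = all(unicodedata.category(ch) == "Cc" for ch in decoded.values())
```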

Re: latin1 decoder implementation

2012-11-17 Thread Doug Ewell
Martin J. Dürst wrote: If he is targeting HTML5, then none of this matters, because HTML5 says that ISO 8859-1 is really Windows-1252. Yes. But unless Python wants to limit its use to HTML5, this should be handled on a separate level (mapping an iso-8859-1 label to the Windows-1252 decoder

RE: cp1252 decoder implementation

2012-11-17 Thread Shawn Steele
IMO this isn't worth the effort being spent on it. MOST encodings have all sorts of interesting quirks, variations, OEM or App specific behavior, etc. These are a few code points that haven't really caused much confusion, and other code pages are much more confusing (like the CJK ones in