On 2022-08-17, Barry <ba...@barrys-emacs.org> wrote:
>> On 17 Aug 2022, at 18:30, Jon Ribbens via Python-list
>> <python-list@python.org> wrote:
>> On 2022-08-17, Tobiah <t...@tobiah.org> wrote:
>>> I get data from various sources; client emails, spreadsheets, and
>>> data from web applications. I find that I can do
>>>     some_string.decode('latin1')
>>> to get unicode that I can use with xlsxwriter,
>>> or put <meta charset="latin1"> in the header of a web page to display
>>> European characters correctly. But normally UTF-8 is recommended as
>>> the encoding to use today. latin1 works correctly more often when I
>>> am using data from the wild. It's frustrating that I have to play
>>> a guessing game to figure out how to use incoming text. I'm just wondering
>>> if there are any thoughts. What if we just globally decided to use utf-8?
>>> Could that ever happen?
>>
>> That has already been decided, as much as it ever can be. UTF-8 is
>> essentially always the correct encoding to use on output, and almost
>> always the correct encoding to assume on input absent any explicit
>> indication of another encoding. (e.g. the HTML "standard" says that
>> all HTML files must be UTF-8.)
>>
>> If you are finding that your specific sources are often encoded with
>> latin-1 instead then you could always try something like:
>>
>>     try:
>>         text = data.decode('utf-8')
>>     except UnicodeDecodeError:
>>         text = data.decode('latin-1')
>>
>> (I think latin-1 text will almost always fail to be decoded as utf-8,
>> so this would work fairly reliably assuming those are the only two
>> encodings you see.)
>
> Only if a reserved byte is used in the string.
> It will often work in either.

Because it's actually ASCII and hence there's no difference between
interpreting it as utf-8 or iso-8859-1? In which case, who cares?

> For web pages it cannot be assumed that markup saying it’s utf-8 is
> correct. Many pages are in fact cp1252. Usually you find out because
> of a smart quote that is 0xa0 in cp1252 and illegal in utf-8.

Hence what I said above. But if a source explicitly states an encoding
and it's false then these days I see little need for sympathy.
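
To make the cases discussed above concrete, here is a minimal sketch,
assuming Python 3 and bytes input. The name decode_text and the sample
byte strings are hypothetical, chosen only for illustration. It wraps
the try/except fallback quoted earlier and shows that pure ASCII decodes
identically either way, that typical latin-1 text is rejected by the
UTF-8 decoder, and that the rare latin-1 byte sequence which happens to
be valid UTF-8 is silently decoded as UTF-8.

```python
def decode_text(data: bytes) -> str:
    """Hypothetical helper: try UTF-8 first, then fall back to latin-1."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this cannot fail.
        return data.decode("latin-1")


# Pure ASCII: both interpretations give the same string.
ascii_bytes = b"plain ASCII text"
same_either_way = ascii_bytes.decode("utf-8") == ascii_bytes.decode("latin-1")

# "caf\u00e9" encoded as latin-1 is b"caf\xe9", which is not valid UTF-8,
# so the fallback branch is taken.
from_latin1 = decode_text("caf\u00e9".encode("latin-1"))

# The same text encoded as UTF-8 decodes on the first attempt.
from_utf8 = decode_text("caf\u00e9".encode("utf-8"))

# Rare ambiguous case: the latin-1 bytes for "\u00c3\u00a9" are b"\xc3\xa9",
# which is also the UTF-8 encoding of "\u00e9", so UTF-8 wins.
ambiguous = decode_text(b"\xc3\xa9")

# cp1252 curly double quotes (bytes 0x93 and 0x94) are rejected by UTF-8;
# the latin-1 fallback then yields C1 control characters, not quote marks.
smart_quotes = decode_text(b"\x93hi\x94")
```

If cp1252 input is common in your sources, decoding with "cp1252" in the
fallback branch may give more useful text than "latin-1", with the caveat
that Python's cp1252 codec raises UnicodeDecodeError on the few byte
values that encoding leaves undefined.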