D'Arcy J.M. Cain wrote:

> On Sat, 12 Dec 2015 21:35:36 +0100
> Peter Otten <__pete...@web.de> wrote:
>> def read_file(filename):
>>     for encoding in ["utf-8", "iso-8859-1"]:
>>         try:
>>             with open(filename, encoding=encoding) as f:
>>                 return f.read()
>>         except UnicodeDecodeError:
>>             pass
>>     raise AssertionError("unreachable")
>
> I replaced this in my test and it works.  However, I still have a
> problem with my actual code.  The point of this code was that I expect
> all the files that I am reading to be either ASCII, UTF-8 or LATIN-1
> and I want to normalize my input.  My problem may actually be elsewhere.
>
> My application is a web page of my wife's recipes.  She has hundreds of
> files with a recipe in each one.  Often she simply typed them in but
> sometimes she cuts and pastes from another source and gets non-ASCII
> characters.  So far they seem to fit in the three categories above.
>
> I added test prints to sys.stderr so that I can see what is happening.
> In one particular case I have this "73 61 75 74 c3 a9" in the file.
> When I open the file with
> "open(filename, "r", encoding="utf-8").read()" I get what appears to be
> a latin-1 string.

No, you get unicode. The escape code for the 'LATIN SMALL LETTER E WITH
ACUTE' codepoint just happens to be the same as its latin-1 value:

>>> print(ascii("é"))
'\xe9'
>>> print("é".encode("latin1"))
b'\xe9'

> I print it to stderr and view it in the web log.
> The above string prints as "saut\xe9".  The last is four actual
> characters in the file.
>
> When I try to print it to the web page it fails because the \xe9
> character is not valid ASCII.

Can you give some code that reproduces the error? What is the traceback?

> However, my default encoding is utf-8.

That doesn't matter. sys.stdout.encoding and sys.stderr.encoding are the
relevant settings.

> Other web pages on the same server display fine.
>
> I have the following in the Apache config by the way.
>
> SetEnv PYTHONIOENCODING utf8
>
> So, my file is utf-8, I am reading it as utf-8, my Apache server output
> is set to utf-8.  How is ASCII sneaking in?

I don't know. Have you verified that python "sees" the setting, e.g. with

import os
import sys
ioencoding = os.environ.get("PYTHONIOENCODING")
assert ioencoding == "utf8"
assert sys.stdout.encoding == ioencoding
assert sys.stderr.encoding == ioencoding

Have you tried setting LANG as Oscar suggested in the other thread?
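
For what it's worth, the difference between the escaped display and the
actual string can be checked without involving the web server at all. The
following is only a minimal sketch, assuming Python 3. The bytes are the ones
quoted from the recipe file; render() is a hypothetical helper, not anything
from the original code, that stands in for a text stream such as sys.stdout
opened with a given encoding.

```python
import io

# The bytes quoted from the recipe file: "saut" followed by UTF-8 for "é".
raw = bytes.fromhex("73 61 75 74 c3 a9")

# Decoding as UTF-8 gives a five-character str, not a latin-1 byte string.
text = raw.decode("utf-8")
print(len(text))    # 5
print(ascii(text))  # 'saut\xe9' -- an escaped display of U+00E9


def render(text, encoding):
    """Hypothetical helper: write text through a stream with this encoding.

    This mimics what print() does when sys.stdout.encoding is `encoding`.
    """
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding)
    stream.write(text)
    stream.flush()
    stream.detach()  # keep the buffer open after the wrapper goes away
    return buffer.getvalue()


# A UTF-8 stream reproduces the original bytes from the file.
utf8_bytes = render(text, "utf-8")

# An ASCII stream cannot represent U+00E9 and raises UnicodeEncodeError.
try:
    render(text, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(utf8_bytes == raw, ascii_failed)
```

If the ASCII case is what happens on the server, that would suggest the
stream used for the page output was not created with the encoding you
expect, whatever the default encoding says.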