D'Arcy J.M. Cain wrote:

> More Unicode bafflement.  What I am trying to do is pretty simple I
> think.  I have a bunch of files that I am pretty sure are either utf-8
> or iso-8859-1.  I try utf-8 and fall back to iso-8859-1 if it throws a
> UnicodeError.  Here is my test.
>
> #! /usr/pkg/bin/python3.4
> # Running on a NetBSD 7.0 server
> # Installed with pkgsrc
>
> import codecs
> test_file = "StreamRecoder.txt"
>
> def read_file(fn):
>     try: return open(fn, "r", encoding='utf-8').read()
>     except UnicodeError:
>         return codecs.StreamRecoder(open(fn),
A recoder converts bytes to bytes, so you have to open the file in binary
mode. However, ...

>             codecs.getencoder('utf-8'),
>             codecs.getdecoder('utf-8'),
>             codecs.getreader('iso-8859-1'),
>             codecs.getwriter('iso-8859-1'), "r").read()
>
> # plain ASCII
> open(test_file, 'wb').write(b'abc - cents\n')
> print(read_file(test_file))
>
> # utf-8
> open(test_file, 'wb').write(b'abc - \xc2\xa2\n')
> print(read_file(test_file))
>
> # iso-8859-1
> open(test_file, 'wb').write(b'abc - \xa2\n')
> print(read_file(test_file))

...when the recoder kicks in read_file() will return bytes which is probably
not what you want. Why not just try the two encodings as in

def read_file(filename):
    for encoding in ["utf-8", "iso-8859-1"]:
        try:
            with open(filename, encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            pass
    raise AssertionError("unreachable")

>
> I expected all three to return UTF-8 strings but here is my output:
>
> abc - cents
>
> abc - ¢
>
> Traceback (most recent call last):
>   File "./StreamRecoder_test", line 9, in read_file
>     try: return open(fn, "r", encoding='utf-8').read()
>   File "/usr/pkg/lib/python3.4/codecs.py", line 319, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2 in position 6:
> invalid start byte
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
>   File "./StreamRecoder_test", line 27, in <module>
>     print(read_file(test_file))
>   File "./StreamRecoder_test", line 15, in read_file
>     codecs.getwriter('iso-8859-1'), "r").read()
>   File "/usr/pkg/lib/python3.4/codecs.py", line 798, in read
>     data = self.reader.read(size)
>   File "/usr/pkg/lib/python3.4/codecs.py", line 489, in read
>     newdata = self.stream.read()
>   File "/usr/pkg/lib/python3.4/encodings/ascii.py", line 26, in decode
>     return codecs.ascii_decode(input, self.errors)[0]
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xa2 in position 6:
> ordinal not in range(128)
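
For what it's worth, the two-encoding loop suggested earlier can be checked against the same three byte strings the original test script writes. This is only a minimal sketch: the temporary directory and the demo() helper are made up here for illustration and are not part of the original script. Since iso-8859-1 assigns a character to every possible byte, the second attempt should never fail, which is why the final raise is marked unreachable.

```python
import os
import tempfile


def read_file(filename):
    """Return the file's text, trying utf-8 first and then iso-8859-1."""
    for encoding in ["utf-8", "iso-8859-1"]:
        try:
            with open(filename, encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            pass
    # iso-8859-1 maps all 256 byte values, so decoding with it cannot fail.
    raise AssertionError("unreachable")


def demo():
    """Hypothetical helper: write each sample to a scratch file and read it back."""
    samples = [
        b"abc - cents\n",       # plain ASCII
        b"abc - \xc2\xa2\n",    # cent sign encoded as utf-8
        b"abc - \xa2\n",        # cent sign encoded as iso-8859-1
    ]
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "StreamRecoder.txt")
        for raw in samples:
            with open(path, "wb") as f:
                f.write(raw)
            results.append(read_file(path))
    return results


if __name__ == "__main__":
    for text in demo():
        print(repr(text))
```

Each call returns a str rather than bytes, and the last two samples should decode to the same text even though the bytes on disk differ. One caveat of this approach: a file that is really iso-8859-1 but happens to be valid utf-8 as well will be read as utf-8.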