New submission from RalfM:

I have a UTF-8 encoded file containing single surrogates. Reading this file with errors="surrogatepass" works fine when I read the whole file with .read(), but raises a UnicodeDecodeError when I read the file line by line:
----- start of demo -----
Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:44:40) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> with open("Demo.txt", encoding="utf-8", errors="surrogatepass") as f:
...     s = f.read()
...
>>> "\ud900" in s
True
>>> with open("Demo.txt", encoding="utf-8", errors="surrogatepass") as f:
...     for line in f:
...         pass
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Python\34x64\lib\codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 8190: invalid continuation byte
>>>
----- end of demo -----

I attached the file used for the demo such that you can reproduce the problem.

If I change all 0xED bytes in the file to 0xEC (i.e. effectively change all surrogates to non-surrogates), the problem disappears.

The original file I noticed the problem with was 73 MB. The demo file was derived from the original by removing data around the critical section, keeping the alignment to 16-k-blocks, and then replacing all printable ASCII characters by x.

If I change the file length by adding or removing 16 bytes to / from the beginning of the demo file, the problem disappears, so alignment seems to be an issue.

All this seems to indicate that the utf-8 decoder has problems when used for incremental decoding and a single surrogate appears around the block boundary.
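
The suspected cause can be checked without the attached file. The following is a minimal sketch, not the reporter's script: it feeds a short byte string to the incremental UTF-8 decoder in two chunks, trying every possible split position, and compares each result with one-shot decoding. The helper name failing_split_points and the sample data are made up for illustration. On an affected interpreter, the split positions that fall inside the three-byte surrogate sequence would be expected to show up as failures; on an interpreter where the incremental decoder handles surrogatepass correctly, the list should be empty.

```python
import codecs


def failing_split_points(data, encoding="utf-8", errors="surrogatepass"):
    """Return the split positions at which two-chunk incremental decoding
    raises UnicodeDecodeError or disagrees with one-shot decoding.

    Hypothetical helper for illustration only.
    """
    expected = data.decode(encoding, errors)
    failures = []
    for split in range(1, len(data)):
        # A fresh decoder per attempt, so no state leaks between splits.
        decoder = codecs.getincrementaldecoder(encoding)(errors)
        try:
            text = decoder.decode(data[:split])
            text += decoder.decode(data[split:], final=True)
        except UnicodeDecodeError:
            failures.append(split)
            continue
        if text != expected:
            failures.append(split)
    return failures


# Illustrative sample: a lone surrogate (U+D900) surrounded by ASCII,
# mirroring the "x" padding used in Demo.txt.
sample = ("x" * 4 + "\ud900" + "x" * 4).encode("utf-8", "surrogatepass")
bad = failing_split_points(sample)
print("surrogate bytes:", sample[4:7])
print("failing split positions:", bad)
```

Splitting by hand like this takes the file system and the buffer size out of the picture, which is why it does not depend on the 16-byte alignment seen with Demo.txt.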
----------
components: Unicode
files: Demo.txt
messages: 243376
nosy: RalfM, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: Exception with utf-8, surrogatepass and incremental decoding
type: behavior
versions: Python 3.4

Added file: http://bugs.python.org/file39400/Demo.txt

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue24214>
_______________________________________