tomer filiba wrote:
> [...]
> besides, encoding suffers from many issues. suppose you have a
> damaged UTF8 file, which you read char-by-char. when we reach the
> damaged part, you'll never be able to "skip" it, as we'll just keep
> read()ing bytes, hoping to make a character out of it, until we
> reach EOF, i.e.:
>
> def read_char(self):
>     buf = ""
>     while not self._stream.eof:
>         buf += self._stream.read(1)
>         try:
>             return buf.decode("utf8")
>         except ValueError:
>             pass
>
> which leads me to the following thought: maybe we should have
> an "enhanced" encoding library for py3k, which would report
> *incomplete* data differently from *invalid* data. today it's just a
> ValueError: suppose decode() would raise IncompleteDataError
> when the given data is not sufficient to be decoded successfully,
> and ValueError when the data is just corrupted.
>
> that could aid iostack greatly.

We *do* have that functionality in Python 2.5: incremental decoders can
retain incomplete byte sequences on the call to the decode() method until
the next call. Only when final=True is passed in the decode() call will it
treat incomplete and invalid data in the same way: by raising an exception.

Incomplete input:

>>> import codecs
>>> d = codecs.lookup("utf-8").incrementaldecoder()
>>> d.decode("\xe1")
u''
>>> d.decode("\x88")
u''
>>> d.decode("\xb4")
u'\u1234'

Invalid input:

>>> import codecs
>>> d = codecs.lookup("utf-8").incrementaldecoder()
>>> d.decode("\x80")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/var/home/walter/checkouts/Python/test/Lib/codecs.py", line 256, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: unexpected code byte

Incomplete input with final=True:

>>> import codecs
>>> d = codecs.lookup("utf-8").incrementaldecoder()
>>> d.decode("\xe1", final=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/var/home/walter/checkouts/Python/test/Lib/codecs.py", line 256, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 0: unexpected end of data

Servus,
   Walter
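
For illustration, here is a rough sketch of how the quoted read_char() loop might look when built on an incremental decoder. It is not code from the thread: the CharReader class name is hypothetical, the stream is assumed to be any object whose read(1) returns an empty bytes object at EOF, and it is written in Python 3 syntax (bytes literals, codecs.getincrementaldecoder) rather than the Python 2.5 syntax used in the sessions above. It also assumes strict error handling, so each byte fed in yields at most one character.

```python
import codecs
import io


class CharReader:
    """Hypothetical helper: read one decoded character at a time."""

    def __init__(self, stream, encoding="utf-8"):
        self._stream = stream
        # The decoder buffers incomplete byte sequences between calls.
        self._decoder = codecs.getincrementaldecoder(encoding)("strict")

    def read_char(self):
        """Return the next character, or "" at a clean end of stream.

        Raises UnicodeDecodeError as soon as an invalid byte is seen,
        or at EOF if an incomplete sequence is still buffered.
        """
        while True:
            byte = self._stream.read(1)
            if not byte:
                # EOF: final=True turns leftover incomplete input into an error.
                return self._decoder.decode(b"", final=True)
            char = self._decoder.decode(byte)
            if char:
                return char
            # Otherwise the sequence is still incomplete: keep reading.


# Usage: a three-byte UTF-8 sequence is assembled across several reads.
reader = CharReader(io.BytesIO("a\u1234b".encode("utf-8")))
chars = [reader.read_char() for _ in range(3)]
```

Unlike the quoted loop, invalid data fails on the offending byte instead of being accumulated until EOF, and only input that is truly cut short is reported at end of stream. To carry on after an error, a caller would probably want to call the decoder's reset() method first, since a partially buffered sequence may otherwise still be pending.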