Guido van Rossum wrote:

>> >> > I wonder if it would be possible to return the state as a pair
>> >> > (unread, flags) where unread is a (byte) string of unprocessed bytes
>> >> > and flags is some other state, with the constraint that in the initial
>> >> > state the flags must be zero. Then I can optimize the case where flags
>> >> > is returned as zero by subtracting len(unread) from the current
>> >> > position and that'd be the correct seek position.
>> >>
>> >> I'd say that bytestream.tell() is the correct position.
>> >>
>> >> Or should seek() return to the last position where the codec was in a
>> >> default state without anything buffered? (This can't work for UTF-16,
>> >> because the codec almost never is in the default state.)
>> >
>> > That was my hope, yes (and I realize that UTF-16 is an exception).
>>
>> We could designate natural endianness as the default state, but that
>> would mean that a codec state can't be transferred to a different
>> machine (or we could declare little (or big) endianness to be the
>> default state).
>
> I think it's okay for file positions involving codec states not to be
> transferable between platforms. I think they wouldn't even be
> guaranteed between subsequent runs of the same program.

OK, done in the third version of the patch.

>> > Consider UTF-8 though. If the chunk we read from the byte stream ended
>> > in the middle of a multi-byte character, the codec will have the first
>> > part of that character buffered. In general we want to subtract
>> > buffered data from the byte stream's position when reporting the
>> > position of the text stream. The idea is that if we later seek to the
>> > reported position, we should be reading the same character data. This
>> > can be accomplished in two ways: by backing up the byte stream to the
>> > previous character boundary, and resetting the decoder to neutral; or
>> > by positioning the byte stream to where it was originally and setting
>> > the state of the decoder to what it was before. However, backing up
>> > the byte stream has the advantage that no decoder state needs to be
>> > encoded in the position cookie.
>>
>> OK, so for decoders getstate() should always return a tuple, with the
>> first entry being the buffered byte string (or bytes object?) and the
>> second being additional state data.
>>
>> Do we need any specification for encoders?
>
> I don't need this for encoders at all -- we don't use incremental
> encoders, only incremental decoders.

True for reading, but what about writing?

>> >> The state returned from getstate() should be treated as an opaque value
>> >> (e.g. for the buffered incremental codecs it is the buffered string, for
>> >> the UTF-16 encoder it's the flag indicating whether a BOM has been
>> >> written etc.). The codecs try to return None, if they are in some kind
>> >> of default state (e.g. there's nothing buffered).
>> >
>> > I would like to await completion of those unit tests;
>>
>> The second version of the patch includes the unit tests (and fixes the
>> utf-8-sig codec).
>>
>> > there seem to be
>> > some subtle issues.
>>
>> Can you be more concrete?
>
> I think I just meant the str/bytes issue I already mentioned.

Since the new version never sets the buffer to an explicit value except
in the constructor, this problem should have disappeared.

>> > I wonder if setstate() should call self.reset()
>> > first.
>>
>> Calling reset() and calling setstate() with the initial state should
>> have the same effect.
>
> OK, I should do that anyway. (I wasn't aware of reset() until I saw
> your patch. ;-)
>
>> > I'd also like to ask if setstate() could default to "" only if
>> > the argument is None, not if it is empty; I'd like to use it to change
>> > the buffer to be a bytes object.
>>
>> I'd say for Python 3000 it should always be a bytes object.
>
> Eventually, yes. But right now we're in a world where sometimes there
> are bytes and sometimes there are (8-bit) strings -- and I'd like to
> get as many tests passing with the new IO library without making it
> the default first.

OK.

>> Will this
>> interoperate seamlessly with the C part of the codec machinery?
>
> It should if it uses the buffer API as it should. When I encounter
> places where it requires 8-bit strings I'll fix them
> opportunistically.
>
>> > (And yes, I need to maintain more
>> > hacks for that, alas).
>>
>> I'll try to update the patch tomorrow or over the weekend.
>
> Thanks!

Done. I've also added documentation (the description of the constraints
on the decoder state sounds quite esoteric ;)).

Servus,
   Walter
_______________________________________________
Python-3000 mailing list
http://mail.python.org/mailman/listinfo/python-3000
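
To make the (unread, flags) convention discussed in this thread concrete, here is a minimal sketch. It uses the incremental UTF-8 decoder from today's Python 3 codecs module, on the assumption that it behaves like the patched decoders described above; the patch versions mentioned in the thread may have differed in detail. The sample bytes, the 3-byte chunk size, and all variable names are made up for illustration.

```python
import codecs
import io

# 'a', a euro sign (3 bytes in UTF-8), 'b'  ->  b'a\xe2\x82\xacb'
data = "a\u20acb".encode("utf-8")
stream = io.BytesIO(data)
decoder = codecs.getincrementaldecoder("utf-8")()

# Read a chunk that ends in the middle of the euro sign.
text = decoder.decode(stream.read(3))

# The decoder reports the bytes it has buffered plus extra state flags.
unread, flags = decoder.getstate()

# If flags is zero, the decoder is otherwise in its initial state, so the
# text position is the byte position minus the buffered bytes.
position = stream.tell() - len(unread)

# Seeking to that position with a reset decoder yields the same characters
# that would have followed without the seek.
stream.seek(position)
decoder.reset()
rest = decoder.decode(stream.read(), final=True)
```

Backing the byte stream up to the character boundary, as done here, means the position is a plain byte offset and no decoder state has to be stored with it. For a codec whose flags are non-zero, that shortcut would presumably not apply and the state would need to be saved alongside the offset.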