> >> > I wonder if it would be possible to return the state as a pair
> >> > (unread, flags) where unread is a (byte) string of unprocessed bytes
> >> > and flags is some other state, with the constraint that in the initial
> >> > state the flags must be zero. Then I can optimize the case where flags
> >> > is returned as zero by subtracting len(unread) from the current
> >> > position and that'd be the correct seek position.
> >>
> >> I'd say that bytestream.tell() is the correct position.
> >>
> >> Or should seek() return to the last position where the codec was in a
> >> default state without anything buffered? (This can't work for UTF-16,
> >> because the codec almost never is in the default state.)
> >
> > That was my hope, yes (and I realize that UTF-16 is an exception).
>
> We could designate natural endianness as the default state, but that
> would mean that a codec state can't be transferred to a different
> machine (or we could declare little (or big) endianness to be the
> default state).

I think it's okay for file positions involving codec states not to be
transferable between platforms. I think they wouldn't even be guaranteed
between subsequent runs of the same program.

> > Consider UTF-8 though. If the chunk we read from the byte stream ended
> > in the middle of a multi-byte character, the codec will have the first
> > part of that character buffered. In general we want to subtract
> > buffered data from the byte stream's position when reporting the
> > position of the text stream. The idea is that if we later seek to the
> > reported position, we should be reading the same character data. This
> > can be accomplished in two ways: by backing up the byte stream to the
> > previous character boundary, and resetting the decoder to neutral; or
> > by positioning the byte stream to where it was originally and setting
> > the state of the decoder to what it was before. However, backing up
> > the byte stream has the advantage that no decoder state needs to be
> > encoded in the position cookie.
>
> OK, so for decoders getstate() should always return a tuple, with the
> first entry being the buffered byte string (or bytes object?) and the
> second being additional state data.
>
> Do we need any specification for encoders?

I don't need this for encoders at all -- we don't use incremental
encoders, only incremental decoders.

> >> The state returned from getstate() should be treated as an opaque value
> >> (e.g. for the buffered incremental codecs it is the buffered string, for
> >> the UTF-16 encoder it's the flag indicating whether a BOM has been
> >> written etc.). The codecs try to return None, if they are in some kind
> >> of default state (e.g. there's nothing buffered).
> >
> > I would like to await completion of those unit tests;
>
> The second version of the patch includes the unit tests (and fixes the
> utf-8-sig codec).
>
> > there seem to be
> > some subtle issues.
>
> Can you be more concrete?

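To make the UTF-8 case above concrete, here is a minimal sketch. It
assumes the incremental decoder API as it exists in the current codecs
module, where getstate() returns a (buffered_bytes, flags) pair; the
helper safe_position() is a hypothetical name used only for
illustration and is not part of the patch under discussion.

```python
import codecs

# Hypothetical helper (illustration only): report a seek position that
# needs no decoder state, i.e. bytes fed minus bytes still buffered.
def safe_position(bytes_fed, decoder):
    unread, flags = decoder.getstate()
    if flags != 0:
        # Decoder is not in a neutral state; the shortcut does not apply.
        return None
    return bytes_fed - len(unread)

data = "a\u20acb".encode("utf-8")    # 'a', EURO SIGN (3 bytes), 'b'
decoder = codecs.getincrementaldecoder("utf-8")()

chunk = data[:3]                     # chunk ends inside the euro sign
text = decoder.decode(chunk)         # only 'a' can be decoded so far
unread, flags = decoder.getstate()   # the two leftover bytes are buffered
pos = safe_position(len(chunk), decoder)

# Seeking to pos with a fresh (neutral) decoder yields the same
# character data that followed the text returned so far.
fresh = codecs.getincrementaldecoder("utf-8")()
rest = fresh.decode(data[pos:], final=True)
```

Here the reported position is the character boundary just after 'a',
so nothing about the decoder has to be stored in the position cookie.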
I think I just meant the str/bytes issue I already mentioned.

> > I wonder if setstate() should call self.reset()
> > first.
>
> Calling reset() and calling setstate() with the initial state should
> have the same effect.

OK, I should do that anyway. (I wasn't aware of reset() until I saw
your patch. ;-)

> > I'd also like to ask if setstate() could default to "" only if
> > the argument is None, not if it is empty; I'd like to use it to change
> > the buffer to be a bytes object.
>
> I'd say for Python 3000 it should always be a bytes object.

Eventually, yes. But right now we're in a world where sometimes there
are bytes and sometimes there are (8-bit) strings -- and I'd like to
get as many tests passing with the new IO library without making it
the default first.

> Will this
> interoperate seamlessly with the C part of the codec machinery?

It should if it uses the buffer API as it should. When I encounter
places where it requires 8-bit strings I'll fix them opportunistically.

> > (And yes, I need to maintain more
> > hacks for that, alas).
>
> I'll try to update the patch tomorrow or over the weekend.

Thanks!

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)
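
As an illustration of the reset()/setstate() point in this message, the
following sketch checks that both paths leave a decoder in the same
state, and that a saved state can be restored later. It assumes the
behaviour of the UTF-8 incremental decoder in the current codecs
module, which may differ in detail from the patch discussed here.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()
initial = decoder.getstate()          # state of a freshly created decoder

decoder.decode(b"\xe2\x82")           # buffer an incomplete character
dirty = decoder.getstate()            # remember the non-empty state
decoder.setstate(initial)             # path 1: restore the initial state
after_setstate = decoder.getstate()

decoder.decode(b"\xe2\x82")           # buffer the same bytes again
decoder.reset()                       # path 2: plain reset()
after_reset = decoder.getstate()

# Restoring the saved state lets decoding continue mid-character,
# which is the "set the decoder state to what it was" alternative.
decoder.setstate(dirty)
tail = decoder.decode(b"\xac", final=True)
```

Both paths end in the same state as a new decoder, which is what makes
the initial state usable as the "neutral" value in a position cookie.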