i haven't been online for the last couple of days, so i'll unify my replies into one post.
[Talin]
> Right now, a typical file handle consists of 3 "layers" - one
> representing the backing store (file, memory, network, etc.), one for
> adding buffering, and one representing the program-level API for
> reading strings, bytes, decoded text, etc.

yes, and it's also good you noted *typical*. the design allows a virtually
unlimited number of such layers, stacked one after the other, giving you
very fine-grained control without having to write a single line of
"procedural" or tailored code. you just mix in what you want.

[Talin]
> I wonder if it wouldn't be better to cut that down to two. Specifically,
> I would like to suggest eliminating the buffering layer.
> My reasoning is fairly straightforward: Most file system handles,
> network handles and other operating system handles already support
> buffering, and they do a far better job of it than we can.

indeed, but as guido said (and i believe my wiki page says so as well),
stdio cannot be trusted, let alone the different ways OSes implement
things. buffering, for one, is a horrible issue. i remember an old C
program of mine that worked fine on windows but not on linux, because i
didn't print a newline and stdout was line-buffered... i couldn't see the
output, and it was a nightmare to debug.

[Talin]
> Well, as far as readline goes: In order to split the text into lines,
> you have to decode the text first anyway, which is a layer 3 operation.
> You can't just read bytes until you get a \n, because the file you are
> reading might be encoded in UCS2 or something.

well, the LineBufferedLayer can be "configured" to split on any "marker",
i.e.:

    LineBufferedLayer(stream, marker = "\x00\x0a")

and of course layer 3, which creates layer 2, can set this marker to any
byte sequence. note it's a *byte* sequence, not chars, since this passes
down to layer 1 transparently. i.e.:
    delimiters = {"utf8" : "\x0a", "utf16" : "\x00\x0a"}

    def textfile(filename, mode, encoding = None):
        f = FileStream(filename, mode)
        # fall back to a plain \n delimiter when no encoding is given
        f = LineBufferingLayer(f, delimiters.get(encoding, "\x0a"))
        f = TextInterface(f, encoding)
        return f

[Talin]
> It seems to me that no matter how you slice it, you can't have an
> abstract "buffering" layer that is independent of both the layer beneath
> and the layer above.

but that's the whole idea! buffering is a complicated task that must *not*
be rewritten for every type of underlying storage. if one wanted to read
or write lines over a socket, one shouldn't need to reimplement file-like
line buffering, as socket.py does today. i want to be able to read lines
directly from any stream: socket, file, or memory. how i choose to
implement my HTTP parser is my only concern; i don't want to be limited by
the kind of stream my parser works over.

[Nick]
> You'd insert a buffering layer at the appropriate point for whatever you're
> trying to do. The advantage of pulling the buffering out into a separate layer
> is that it can be reused with different byte sources & sinks by supplying the
> appropriate configuration parameters, instead of having to reimplement it for
> each different source/sink.

indeed

[Marcin]
> I think buffering makes sense as the topmost layer, and typically only
> there.
> Encoding conversion and newline conversion should be performed a block
> at a time, below buffering, so not only I/O syscalls, but also
> invocations of the recoding machinery are amortized by buffering.

you have a good point, which i also stumbled upon when implementing the
TextInterface. but how would you suggest solving it? write()ing is always
simpler, because you already have the entire buffer, which you can encode
as one chunk. when read()ing, you can decode() the entire pre-read buffer
first, but then you have a "tail" of undecodable data (an incomplete
character or record), which would be quite nasty to handle.
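for what it's worth, here is a rough sketch of one way to handle that
read()-side "tail" with an incremental decoder: the decoder buffers an
incomplete trailing byte sequence internally between calls, and only
raises on genuinely invalid bytes. (this uses today's stdlib `codecs`
module, not iostack; the `read_text` helper and the chunk size are just
illustrative.)

```python
import codecs
import io

def read_text(stream, encoding, chunk_size=4096):
    """Decode a byte stream chunk by chunk, carrying the undecodable
    'tail' of each chunk over to the next read instead of choking on it."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # an incomplete multi-byte sequence at the end of the chunk is
        # buffered inside the decoder; only truly invalid bytes raise
        yield decoder.decode(chunk)
    # flush: raises UnicodeDecodeError if the buffered tail never completed
    yield decoder.decode(b"", final=True)

# "é" is the two bytes \xc3\xa9 in UTF-8; chunk_size=2 splits them apart,
# so the decoder must carry \xc3 over to the next chunk
data = io.BytesIO("héllo".encode("utf-8"))
text = "".join(read_text(data, "utf-8", chunk_size=2))

# a plain invalid byte, by contrast, fails immediately instead of being
# buffered as "maybe incomplete"
caught = False
try:
    codecs.getincrementaldecoder("utf-8")().decode(b"\xff")
except UnicodeDecodeError:
    caught = True
```

note that the decoder itself already distinguishes "not enough bytes yet"
(silently buffered) from "corrupted bytes" (an immediate error), which is
exactly the property a buffering-below-decoding design needs.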
besides, encoding suffers from many issues. suppose you have a damaged
UTF8 file, which you read char-by-char. when you reach the damaged part,
you'll never be able to "skip" it, as you'll just keep read()ing bytes,
hoping to make a character out of them, until you reach EOF, i.e.:

    def read_char(self):
        buf = ""
        while not self._stream.eof:
            buf += self._stream.read(1)
            try:
                return buf.decode("utf8")
            except ValueError:
                pass

which leads me to the following thought: maybe we should have an
"enhanced" encoding library for py3k, which would report *incomplete* data
differently from *invalid* data. today both are just a ValueError: suppose
decode() raised IncompleteDataError when the given data is not sufficient
to be decoded successfully, and ValueError when the data is plain
corrupted. that could aid iostack greatly.

-tomer

_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
Unsubscribe: http://mail.python.org/mailman/options/python-3000/archive%40mail-archive.com