Alaric Snell-Pym wrote:

> The behaviour of read-char in terms of read-octet will need careful
> specifying for funny encodings, mind; some encodings have control
> characters that shift modes and the like, but aren't part of any
> character, so the byte on which a character boundary sits is a bit
> vague. I guess the best approach to that is to say that read-char
> reads 0 or more non-character octets, if present, then reads enough
> octets to decode one character, and anything it's buffered, it shares
> the buffer with read-octet.
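(For concreteness, here is a minimal Python sketch of the semantics the quote proposes. The Port class, the use of 0x0E/0x0F as stand-ins for mode-shifting control octets, and the ASCII-only decoding are all invented for illustration; a real Scheme port would look quite different.)

```python
SHIFT_BYTES = {0x0E, 0x0F}  # hypothetical non-character mode-shift octets

class Port:
    def __init__(self, data: bytes):
        self._buf = bytearray(data)   # one buffer shared by both readers

    def read_octet(self):
        # Byte-level read: takes the next octet straight from the shared
        # buffer, regardless of character boundaries.
        if not self._buf:
            return None  # end of input
        return self._buf.pop(0)

    def read_char(self):
        # 1. Consume zero or more non-character (mode-shift) octets.
        while self._buf and self._buf[0] in SHIFT_BYTES:
            self._buf.pop(0)
        # 2. Consume exactly enough octets to decode one character
        #    (ASCII here, so a single octet suffices).
        if not self._buf:
            return None
        return chr(self._buf.pop(0))

p = Port(b"\x0eA\x0fB")
assert p.read_char() == "A"    # shift octet skipped, then 'A' decoded
assert p.read_octet() == 0x0F  # the byte reader sees the next raw octet
assert p.read_char() == "B"
```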
That *sounds* good, but it's horribly slow in practice, and interpreters
(without JITs) will suffer especially badly from it.  Character
encoding/decoding needs to be done in big buffers, for the same reason
that actual I/O does.  Making those buffers the same buffer is horribly
messy: if the internal character format is UTF-16 and the file encoding
is ASCII, you need a decoding buffer twice the size of the I/O buffer to
get any decent efficiency at all.

> This will run into issues with any hypothetical character encoding
> that uses sub-octet character boundaries, but that can be dealt with
> too, I think: if you do a read-octet when the character reader is in
> mid-octet, then the spare bits are discarded and you get the next
> octet.

Character encodings can be weird, but not *that* weird.  Bit-level
compression, when present, is usually expanded/compressed by a layer
between binary I/O and character I/O.

-- 
A rose by any other name                John Cowan
may smell as sweet,                     http://www.ccil.org/~cowan
but if you called it an onion           [email protected]
you'd get cooks very confused.  --RMS

_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
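(The "twice as big" figure in the reply follows from simple arithmetic: ASCII carries one character per octet, while UTF-16 stores two octets per character, so a full I/O buffer of ASCII decodes to twice as many bytes of UTF-16. A quick Python check, with the buffer size chosen purely as an example:)

```python
IO_BUFFER_SIZE = 4096                 # assumed bytes per low-level read
io_buffer = b"x" * IO_BUFFER_SIZE     # one full ASCII I/O buffer

# Decode the ASCII octets, then re-encode in the (hypothetical) internal
# UTF-16 character format: every input octet becomes two output octets.
decoded = io_buffer.decode("ascii").encode("utf-16-le")
assert len(decoded) == 2 * IO_BUFFER_SIZE
```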
