Date: Thu, 19 Feb 2015 12:32:04 +0100
From: <cra...@gmx.net>

Is there a way to globally (or for a port) tell MIT/GNU Scheme to never slashify anything?  Whatever I send in, I want out, in exactly the same bytes.  No special handling of ISO-8859-1, UTF-8 or whatever.
DISPLAY and WRITE-STRING will do exactly that, on a binary port.  WRITE never will: WRITE always escapes `"' and `\', at the very least, and usually many other octets that are not graphic, such as control characters.  In particular, it is designed to escape any octets that do not represent graphic characters in ISO-8859-1.  I think that's a little silly -- it should be limited to US-ASCII, not ISO-8859-1, by default.

Currently the S-expression notation that MIT Scheme uses is defined in terms of ISO-8859-1 sequences.  If you changed that to UTF-8 sequences, it would still work to limit the octets written verbatim in strings to the US-ASCII graphic ones.  But if you want a string containing the UTF-8 sequence for eszett to be written as the UTF-8 sequence for a double-quoted eszett, it's not simply a matter of changing which octets are escaped rather than written verbatim when unparsing a string: what are called `strings' in MIT Scheme are more accurately `octet vectors', and do not necessarily contain only valid UTF-8 sequences.  (Operations on `utf8-strings' are operations on strings which are expected to contain only valid UTF-8 sequences.)

I wouldn't object to changing the S-expression notation so that it is defined in terms of UTF-8 sequences, although maybe it should be made a configurable option, to avoid breaking any existing ISO-8859-1 S-expressions.  We already have a few such configurable options in the parser and unparser, such as the keyword style.

You might:

(a) add a new parser file attribute, coding;
(b) change the parser to do (port/set-coding port <coding>) [*];
(c) change HANDLER:STRING to do (port/set-coding port* <coding>);
(d) add a new unparser variable *UNPARSER-CODING*; and
(e) add logic to the string unparser to write verbatim all longest
    substrings of the string that are valid octet sequences in the
    current coding system (and don't contain `"', `\', or control
    characters), and escape all other octets.

Similar considerations would have to apply to character literals and symbols.

If you want to limit the allowed coding systems for the parser and unparser to US-ASCII, ISO-8859-1, and UTF-8, that's OK too -- I don't think anyone actually cares about writing Scheme code in UTF-32BE.

I know this isn't easy, and I know it's frustrating for anyone who wants to work with languages other than English.  But anything less than this is going to cause even more problems for everybody.

[*] As an aside, our scheme for binary I/O and coding systems is not very sensible.  There should really be one concept of binary I/O sources/sinks, and a separate concept of decoding/encoding text in particular coding systems.  But for now, maybe we should have an operation PORT/WITH-CODING that dynamically binds the coding system; the parser should use that instead of modifying the port it is given, and if you pass the parser a port in a non-binary coding system you shouldn't expect anything good to come of it.
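To make the PORT/WITH-CODING idea concrete, here is a minimal, untested sketch; I'm assuming PORT/CODING is the accessor for a port's current coding, which I haven't double-checked:

  (define (port/with-coding port coding thunk)
    ;; Dynamically bind PORT's coding to CODING around THUNK,
    ;; restoring whatever coding was in effect before.
    (let ((outside #f))
      (dynamic-wind
       (lambda ()
         (set! outside (port/coding port))
         (port/set-coding port coding))
       thunk
       (lambda ()
         (port/set-coding port outside)))))

The parser would then wrap its work in (port/with-coding port <coding> (lambda () ...)) instead of permanently clobbering the coding of the port it was handed.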
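To be concrete about the eszett example above: the UTF-8 sequence for eszett is the two octets #xC3 #x9F, so in octet-vector terms the string in question is

  (define eszett (string (integer->char #xC3) (integer->char #x9F)))

and what you'd want is for WRITE to emit the four octets #x22 #xC3 #x9F #x22 verbatim.  But (string (integer->char #xC3)) alone is also a perfectly good MIT Scheme string, and it is not valid UTF-8, which is why the unparser has to be prepared to escape invalid sequences rather than assume they can't occur.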
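The logic I have in mind for step (e), then, is roughly the following sketch.  VALID-RUN-END and WRITE-ESCAPED-OCTET are hypothetical helpers that would have to be written: the first finds the end of the longest run from START that is a valid octet sequence in CODING and contains no `"', `\', or control octets; the second writes a single octet in escaped form.

  (define (unparse-string-octets string coding port)
    ;; Write the longest safe runs verbatim; escape every other
    ;; octet individually.
    (let ((end (string-length string)))
      (let loop ((start 0))
        (if (< start end)
            (let ((run-end (valid-run-end string start end coding)))
              (if (> run-end start)
                  (begin
                    (write-string (substring string start run-end) port)
                    (loop run-end))
                  (begin
                    (write-escaped-octet
                     (char->integer (string-ref string start))
                     port)
                    (loop (+ start 1)))))))))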
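Finally, to restate the short answer to the original question in code: as things stand today, something like this should pass octets through untouched, assuming BINARY is accepted as a coding name here (I haven't double-checked):

  (define (write-octets-verbatim string port)
    ;; No slashification: on a binary port, WRITE-STRING emits exactly
    ;; the octets of STRING, with no escaping and no ISO-8859-1 or
    ;; UTF-8 interpretation.  WRITE, by contrast, would still escape.
    (port/set-coding port 'binary)
    (write-string string port))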