Hello,

I feel the need to correct points in this mail for the benefit of
guile-user. No reply is needed.

On Wed 15 Feb 2017 00:58, David Kastrup <[email protected]> writes:

> Mike Gran <[email protected]> writes:
>
>> But, for what it is worth, the Latin-1/UCS-32 design decision came
>> from a couple of conflicting requirements. The switch happened in the
>> 1.9.x series.
>>
>> There was several examples of legacy C code using Guile for an
>> extension language that accessed the bytes of a string directly, using
>>
>> SCM_STRING_CHARS or scm_i_string_chars. To keep from breaking legacy
>> code, we needed to retain the capability to use this (then already
>> deprecated) capability to have C programs access 8-bit-locale string
>> internals directly.
>
> But if you don't know whether the strings are Latin-1 or UCS-32, that's
> sort of academical.

Not at all. Legacy programs don't use codepoints >255. For UTF-32,
attempting to get the string data would throw an exception. The
SCM_STRING_CHARS hack was a good trade-off.

> The problem is that Guile is _constantly_ required to recode strings it
> is processing. And to add insult to injury, it cannot do this without
> data loss when its string encoding assumptions are wrong.

In Scheme, strings are sequences of characters. Encoding and decoding
is only needed when going to and from bytes. Guile supports a finite
number of encodings, so in general some encoding/decoding will always be
needed. The specific encoding may change over time.

> PostScript files are usually encoded in Latin-1 with occasional UCS-16
> passages. Reading and writing and copying such files byte-correctly
> while trying to actually parse their contents is not feasible with
> Guile.

Works perfectly well. The web server for example reads the request as
Latin-1 and the body as something else. Just re-set the port encoding
and there you go.

>> I still maintain that this design decision was a good one based on the
>> simplicity of implementation.
>
> As I said: the problem is not the chosen internal representation. The
> problem is that there is no API to access it, and it does not even map
> to string ports.

String ports have nothing to do with the discussion AFAIU.

(Ports in Guile are sequences of bytes also. They may be accessed using
textual interfaces as well. Therefore a string port must have an
associated encoding, to read/write the bytes. But no error is possible
for textual I/O with the default UTF-8 encoding as all characters are
representable. Encoding to UTF-8 is fast and space-efficient.)

Andy
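
For readers who want to try the "re-set the port encoding" point, here
is a minimal sketch, assuming a Guile 2.x installation. The byte stream
and the names `bytes`, `port`, `header`, and `body` are made up for
illustration and do not come from the web server code. It reads one line
as Latin-1 and the next line from the same port as UTF-8.

```scheme
;; Sketch only: the byte layout and all variable names are hypothetical.
(use-modules (rnrs bytevectors)     ; u8-list->bytevector
             (ice-9 binary-ports)   ; open-bytevector-input-port
             (ice-9 rdelim))        ; read-line

;; One byte stream holding two lines in different encodings:
;;   "caf" + #xE9 (e-acute as a single Latin-1 byte), newline,
;;   #xCE #xBB (Greek small lambda as two UTF-8 bytes), newline.
(define bytes
  (u8-list->bytevector
   '(#x63 #x61 #x66 #xE9 #x0A #xCE #xBB #x0A)))

;; A port is a sequence of bytes; the encoding only affects how the
;; textual interfaces decode them.
(define port (open-bytevector-input-port bytes))

;; Decode the first line as Latin-1.
(set-port-encoding! port "ISO-8859-1")
(define header (read-line port))

;; Switch the same port to UTF-8 and keep reading.
(set-port-encoding! port "UTF-8")
(define body (read-line port))

;; How these print depends on the encoding of the output port.
(write (list header body))
(newline)
```

Both results are ordinary Scheme strings, that is, sequences of
characters. The encoding mattered only at the boundary where bytes were
turned into characters.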
