Hello!

Mike Gran <spk...@yahoo.com> writes:

> Strings are internally encoded either as "narrow" 8-bit ISO-8859-1
> strings or as "wide" UTF-32 strings.  Strings are usually created as
> narrow strings.  Narrow strings get automatically widened to wide
> strings if non-8-bit characters are set! or appended to them.

Great!

> The machine-readable "write" form of strings has been changed.  Before,
> non-printable characters were given as hex escapes, for example \xFF.
> Now there are three levels of hex escape for 8, 16, and 24 bit
> characters: \xFF, \uFFFF, \UFFFFFF.  This is a pretty common convention.
> But after I coded this, I noticed that R6RS has a different convention
> and I'll probably go with that.

OK.  I think it's probably good to follow R6RS when it has something to
say.

> The internal representation of strings seems to work already, but, the
> reader doesn't work yet.  For now, one can make wide strings like this:
>
> > (setlocale LC_ALL "")
> ==> "en_US.UTF-8"
> > (define str (apply string (map integer->char '(100 200 300 400 500))))
> > (write str)
> ==>"d\xc8\u012c\u0190\u01f4"
> > (display str)
> ==>dÈĬƐǴ

Eh eh, looks nice.  Looking forward to typing `(λ (x y) (+ x y))'.  ;-)

> This is all going to be slower than before because of the string
> conversion operations, but, I didn't want to do any premature
> optimization.  First, I wanted to get it working, but, there is plenty
> of room for optimization later.

Good.  Maybe it'd be nice to add simple micro-benchmarks for
`string-ref', `string-set!' et al. under `benchmarks'.

> Character encoding needs to be a property of ports, so that not all
> string operations are done in the current locale.  This is necessary so
> that UTF-8-encoded source files are not interpreted differently based on
> the current locale.

You seem to imply that `scm_getc ()' will now return a Unicode
codepoint, is that right?  What about `scm_c_{read,write} ()', and
`scm_{get,put}s ()'?

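To make the question concrete, here is a minimal sketch, in plain C
and not taken from Guile's port code, of the decoding step that a
codepoint-returning `scm_getc ()' would presumably need on a UTF-8
port.  `decode_utf8' is a hypothetical helper name, not an existing
libguile function, and the error convention (return 0) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of libguile: decode one UTF-8 sequence
   from BUF (LEN bytes available) and store the codepoint in *CP.
   Return the number of bytes consumed, or 0 if the input is truncated
   or malformed.  */
size_t
decode_utf8 (const unsigned char *buf, size_t len, uint32_t *cp)
{
  size_t need, i;
  uint32_t c, min;

  assert (buf != NULL && cp != NULL);

  if (len == 0)
    return 0;

  if (buf[0] < 0x80)                    /* 0xxxxxxx: plain ASCII */
    {
      *cp = buf[0];
      return 1;
    }
  else if ((buf[0] & 0xE0) == 0xC0)     /* 110xxxxx: 2-byte sequence */
    { need = 2; c = buf[0] & 0x1F; min = 0x80; }
  else if ((buf[0] & 0xF0) == 0xE0)     /* 1110xxxx: 3-byte sequence */
    { need = 3; c = buf[0] & 0x0F; min = 0x800; }
  else if ((buf[0] & 0xF8) == 0xF0)     /* 11110xxx: 4-byte sequence */
    { need = 4; c = buf[0] & 0x07; min = 0x10000; }
  else
    return 0;                           /* stray continuation or bad lead byte */

  if (len < need)
    return 0;                           /* truncated input */

  for (i = 1; i < need; i++)
    {
      if ((buf[i] & 0xC0) != 0x80)      /* must be 10xxxxxx */
        return 0;
      c = (c << 6) | (buf[i] & 0x3F);
    }

  /* Reject overlong encodings, surrogates, and values past U+10FFFF.  */
  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return 0;

  *cp = c;
  return need;
}
```

With this sketch, the two bytes 0xCE 0xBB decode to U+03BB (λ), which
is why a byte-oriented `scm_getc ()' and a codepoint-oriented one would
behave differently on the same port.
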
> The VM and interpreter need to be updated to deal with wide chars and
> probably in other ways that are unclear to me now.  Wide strings are
> currently getting truncated to 8-bit somewhere in there.

The compiler could use bytevectors when dealing with bytecode.  Maybe
that would clarify things.

Thanks,
Ludo'.