On Tue, 2009-04-21 at 23:37 +0200, Ludovic Courtès wrote:
> > This is all going to be slower than before because of the string
> > conversion operations, but, I didn't want to do any premature
> > optimization.  First, I wanted to get it working, but, there is plenty
> > of room for optimization later.
>
> Good.  Maybe it'd be nice to add simple micro-benchmarks for
> `string-ref', `string-set!' et al. under `benchmarks'.
>

I'll put it on my todo list.

> > Character encoding needs to be a property of ports, so that not all
> > string operations are done in the current locale.  This is necessary so
> > that UTF-8-encoded source files are not interpreted differently based on
> > the current locale.
>
> You seem to imply that `scm_getc ()' will now return a Unicode
> codepoint, is that right?  What about `scm_c_{read,write} ()', and
> `scm_{get,put}s ()'?
>

I vacillate on this, but I think the most logical approach is to have
scm_getc return codepoints and to have the rest of those functions
handle strings that could contain wide characters.  This applies if and
only if the port has been assigned a character encoding.

If a port doesn't have an associated encoding, it will be treated as
de facto ISO-8859-1: character values between 0 and 255 are stored
without any interpretation, and characters greater than 255 are
invalid.  (Unicode codepoints 0 to 255 are by design the same as
ISO-8859-1.)

> > The VM and interpreter need to be updated to deal with wide chars and
> > probably in other ways that are unclear to me now.  Wide strings are
> > currently getting truncated to 8-bit somewhere in there.
>
> The compiler could use bytevectors when dealing with bytecode.  Maybe
> that would clarify things.

On those issues, I'll have to defer to the wisdom of others.  I'll do
what I can with the C code, and then I'll need help.

>
> Thanks,
> Ludo'.
>

Thanks for taking the time.

-Mike
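
For illustration, the per-port decoding rule described above (bytes
pass through unchanged as codepoints 0 to 255 when the port has no
encoding, and are decoded when it has one) could be sketched in C as
follows.  This is a minimal sketch under stated assumptions, not
Guile's implementation: the names `sketch_encoding' and
`sketch_decode_char' are hypothetical, and UTF-8 is used only as an
example of an assigned encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration only; not part of Guile's API. */
enum sketch_encoding { SKETCH_ENC_NONE, SKETCH_ENC_UTF8 };

/* Decode one character from buf[0..len) into *out.
   Returns the number of bytes consumed, or 0 if the input is empty,
   truncated, or not valid for the given encoding. */
size_t
sketch_decode_char (enum sketch_encoding enc, const unsigned char *buf,
                    size_t len, uint32_t *out)
{
  size_t n, i;
  uint32_t cp, min;

  if (len == 0)
    return 0;

  /* No encoding assigned: treat the byte as de facto ISO-8859-1,
     i.e. the byte value is the codepoint (0..255). */
  if (enc == SKETCH_ENC_NONE || buf[0] < 0x80)
    {
      *out = buf[0];
      return 1;
    }

  /* UTF-8: the lead byte determines the sequence length. */
  if ((buf[0] & 0xE0) == 0xC0)
    { n = 2; cp = buf[0] & 0x1F; min = 0x80; }
  else if ((buf[0] & 0xF0) == 0xE0)
    { n = 3; cp = buf[0] & 0x0F; min = 0x800; }
  else if ((buf[0] & 0xF8) == 0xF0)
    { n = 4; cp = buf[0] & 0x07; min = 0x10000; }
  else
    return 0;                   /* stray continuation or invalid byte */

  if (len < n)
    return 0;                   /* truncated sequence */

  for (i = 1; i < n; i++)
    {
      if ((buf[i] & 0xC0) != 0x80)
        return 0;               /* not a continuation byte */
      cp = (cp << 6) | (buf[i] & 0x3F);
    }

  /* Reject overlong forms, surrogates, and out-of-range values. */
  if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;

  *out = cp;
  return n;
}
```

With this sketch, the two bytes 0xC3 0xA9 read from a port without an
encoding yield two separate codepoints (0xC3, then 0xA9), while the
same bytes on a UTF-8 port yield the single codepoint 0xE9.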