Re: Wide strings status

Mike Gran Tue, 21 Apr 2009 20:26:30 -0700

On Tue, 2009-04-21 at 23:37 +0200, Ludovic Courtès wrote:

> > This is all going to be slower than before because of the string
> > conversion operations, but I didn't want to do any premature
> > optimization.  First I wanted to get it working; there is plenty
> > of room for optimization later.
> 
> Good.  Maybe it'd be nice to add simple micro-benchmarks for
> `string-ref', `string-set!' et al. under `benchmarks'.
>


I'll put it on my todo list.

> > Character encoding needs to be a property of ports, so that not all
> > string operations are done in the current locale.  This is necessary so
> > that UTF-8-encoded source files are not interpreted differently based on
> > the current locale.
> 
> You seem to imply that `scm_getc ()' will now return a Unicode
> codepoint, is that right?  What about `scm_c_{read,write} ()', and
> `scm_{get,put}s ()'?
> 

I vacillate on this, but I think the most logical approach is to have
scm_getc return codepoints and to have the rest of those functions
operate on strings that could contain wide characters.  This would
apply only if the port has been assigned a character encoding.  If a
port has no associated encoding, it will be treated as de facto
ISO-8859-1: byte values between 0 and 255 are stored without any
interpretation, and characters greater than 255 are invalid.  (Unicode
codepoints 0 to 255 are by design the same as ISO-8859-1.)
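
To make the distinction concrete, here is a rough sketch in plain C of
the decoding rule described above.  It is not Guile code: the
`port_encoding' enum and `decode_codepoint' are hypothetical names made
up for illustration, and UTF-8 is only assumed as an example of an
encoding a port might be assigned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding tag for a port; not Guile's representation.  */
enum port_encoding { PORT_ENC_NONE, PORT_ENC_UTF8 };

/* Decode one codepoint from BUF[0..LEN).  On success, store the number
   of bytes consumed in *USED and return the codepoint.  On truncated
   or malformed input, set *USED to 0 and return -1.  */
int32_t
decode_codepoint (enum port_encoding enc, const unsigned char *buf,
                  size_t len, size_t *used)
{
  assert (buf != NULL && used != NULL);
  *used = 0;
  if (len == 0)
    return -1;

  unsigned char b0 = buf[0];
  if (enc == PORT_ENC_NONE || b0 < 0x80)
    {
      /* No encoding: the byte value is the codepoint (ISO-8859-1).
         ASCII bytes decode the same way under UTF-8.  */
      *used = 1;
      return b0;
    }

  size_t n;      /* total length of the UTF-8 sequence */
  int32_t cp;    /* codepoint accumulated so far */
  int32_t min;   /* smallest value allowed for this length */
  if ((b0 & 0xE0) == 0xC0)      { n = 2; cp = b0 & 0x1F; min = 0x80; }
  else if ((b0 & 0xF0) == 0xE0) { n = 3; cp = b0 & 0x0F; min = 0x800; }
  else if ((b0 & 0xF8) == 0xF0) { n = 4; cp = b0 & 0x07; min = 0x10000; }
  else
    return -1;   /* stray continuation byte or invalid lead byte */

  if (len < n)
    return -1;   /* sequence is cut short */
  for (size_t i = 1; i < n; i++)
    {
      if ((buf[i] & 0xC0) != 0x80)
        return -1;
      cp = (cp << 6) | (buf[i] & 0x3F);
    }

  /* Reject overlong forms, surrogates, and values past U+10FFFF.  */
  if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return -1;

  *used = n;
  return cp;
}
```

With the two bytes 0xC3 0xA9, a port with no encoding would yield the
codepoint 0xC3 and consume one byte, while a UTF-8 port would yield
0xE9 and consume both.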

> > The VM and interpreter need to be updated to deal with wide chars and
> > probably in other ways that are unclear to me now.  Wide strings are
> > currently getting truncated to 8-bit somewhere in there.
> 
> The compiler could use bytevectors when dealing with bytecode.  Maybe
> that would clarify things.

On those issues, I'll have to defer to the wisdom of others.  I'll do
what I can with the C code, and then I'll need help.

> 
> Thanks,
> Ludo'.
> 

Thanks for taking the time.

-Mike
