Wide strings

2009-01-25 Thread Mike Gran
Hi.  I know there has been a lot of talk about wide characters and
Unicode over the years.  I'd like to see it happen because how they are
implemented will determine the future of a couple of my side-projects.
I could pitch in, if you needed some help.

I looked over the history of guile-devel, and there has been a
tremendous amount of discussion about it.  Also, the Schemes seem to
be each inventing their own solution.

Tom Lord's 2003 proposal
    http://lists.gnu.org/archive/html/guile-devel/2003-11/msg00036.html
Marius Vollmer's idea
    http://lists.gnu.org/archive/html/guile-devel/2005-08/msg00029.html
R6RS
    http://www.r6rs.org/final/html/r6rs-lib/r6rs-lib-Z-H-2.html#node_chap_1
MIT Scheme
    http://www.gnu.org/software/mit-scheme/documentation/mit-scheme-ref/Internal-Representation-of-Characters.html

There has also been some back-and-forth about to what extent the
internal representation of strings should be accessible, whether the
internal representation should be a vector or if it can be something
more efficient, and how not to completely break regular expressions.

Also, there is the question as to whether a wide character is a
codepoint or a grapheme.

Is there a current proposal on the table for how to reach this?

If you are suffering from a dearth of opinions, I certainly have some
ideas.




Re: Wide strings

2009-01-25 Thread Ludovic Courtès
Hello!

Mike Gran spk...@yahoo.com writes:

 Hi.  I know there has been a lot of talk about wide characters and
 Unicode over the years.  I'd like to see it happen because how they are
 implemented will determine the future of a couple of my side-projects.
 I could pitch in, if you needed some help.

Indeed, it looks like you have some experience with GuCu!  ;-)

I agree it would be really nice to have Unicode support, but I'm not
aware of any plan, so please go ahead!  :-)

A few considerations regarding the inevitable debate about the internal
string representation:

  1. IMO it'd be nice to have ASCII strings special-cased so that they
     are always encoded in ASCII.  This would allow for memory savings
     since, e.g., most symbols are expected to contain only ASCII
     characters.  It might also simplify interaction with C in certain
     cases; for instance, it would make it easy to have statically
     initialized ASCII Scheme strings [0].

  2. O(1) `string-{ref,set!}' is somewhat mandated by parts of SRFI-13.
     For instance, `substring' takes indices as parameters,
     `string-index' returns an index, etc. (John Cowan once argued that
     an abstract type to represent the position would remove this
     limitation [1], but the fact is that we have to live with SRFI-13).

  3. GLib et al. like UTF-8, and it'd be nice to minimize the overhead
     when interfacing with these libs (e.g., by avoiding translations
     from one string representation to another).

  4. It might be nice to be friendly to `wchar_t' and friends.

Interestingly, some of these things are contradictory.
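
To make the tension between points 2 and 3 concrete, here is a minimal C
sketch of `string-ref' over the two kinds of storage.  It is illustrative
only and is not taken from Guile or GLib; the function names are
hypothetical, and the decoder assumes well-formed UTF-8 with no error
handling.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the codepoint starting at s and store its byte length in *len.
   Assumption: s points at valid, well-formed UTF-8. */
static uint32_t utf8_decode(const unsigned char *s, size_t *len)
{
    if (s[0] < 0x80) {
        *len = 1;
        return s[0];
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *len = 2;
        return ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *len = 3;
        return ((uint32_t)(s[0] & 0x0F) << 12)
             | ((uint32_t)(s[1] & 0x3F) << 6)
             | (s[2] & 0x3F);
    }
    *len = 4;
    return ((uint32_t)(s[0] & 0x07) << 18)
         | ((uint32_t)(s[1] & 0x3F) << 12)
         | ((uint32_t)(s[2] & 0x3F) << 6)
         | (s[3] & 0x3F);
}

/* O(n): every preceding character has to be walked to find the k-th.
   Assumption: k is a valid character index into str. */
uint32_t utf8_string_ref(const char *str, size_t k)
{
    const unsigned char *p = (const unsigned char *)str;
    size_t len;
    uint32_t cp = utf8_decode(p, &len);
    while (k-- > 0) {
        p += len;
        cp = utf8_decode(p, &len);
    }
    return cp;
}

/* O(1): fixed-width storage can be indexed directly. */
uint32_t utf32_string_ref(const uint32_t *str, size_t k)
{
    return str[k];
}
```

The UTF-8 version can be handed to GLib-style APIs without conversion but
pays a scan on every index; the fixed-width version is the reverse.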

Will Clinger has a good summary of a range of possible implementations:

  https://trac.ccs.neu.edu/trac/larceny/wiki/StringRepresentations

Thanks,
Ludo'.

[0] http://thread.gmane.org/gmane.lisp.guile.devel/7998
[1] http://lists.r6rs.org/pipermail/r6rs-discuss/2007-April/002252.html





Re: Wide strings

2009-01-25 Thread Neil Jerram
2009/1/25 Ludovic Courtès l...@gnu.org:

 I agree it would be really nice to have Unicode support, but I'm not
 aware of any plan, so please go ahead!  :-)

Indeed.

 A few considerations regarding the inevitable debate about the internal
 string representation:

[...]

But what about the other possible debate, about the API?  Are you
thinking that we should accept R6RS's choice?

(I really haven't read up on all this enough - however when reading
Tom Lord's analysis just now, I was thinking why not just specify
that things like char-upcase don't work in the difficult cases, and
it seems to me that this is what R6RS chose to do.  So at first glance
the R6RS API looks OK to me.

(Although I read them at the time, I can't remember now what Tom's
remaining concerns with the R6RS proposal were; should probably go
back and read those again.  On the other hand, Tom did eventually vote
for R6RS, so I would guess that they can't have been that bad.))

Regards,
Neil




Re: Wide strings

2009-01-25 Thread Mike Gran
 From: Ludovic Courtès l...@gnu.org

I believe that we should aim for R6RS strings.

I think the most important thing is to have humility in the face of an
impossible problem: how to encode all textual information.  It is
important to stand on the shoulders of giants here.  It becomes a
matter of deciding which actively developed library of wide character
functions is to be used and how to integrate it.

There are 3 good, actively developed solutions of which I am aware.

1.  Use GNU libc functionality.  Encode wide strings as wchar_t.

2.  Use GLib functionality.  Encode wide strings as UTF-8.  Possibly
give up on O(1).  Possibly add indexing information to string to allow
O(1), which might negate the space advantage of UTF-8.
 
3.  Use IBM's ICU4c.  Encode wide strings as UTF-16.  Thus, add an
obscure dependency.

Option 3 is likely a non-starter, because it seems that Guile has
tried to avoid adding new non-GNU dependencies.  It is technologically
a great solution, IMHO.

Option 1 is probably the way to go, because it keeps Guile close to
the metal and keeps dependencies out of it.  Unfortunately, UTF-8
strings would require conversion.

  1. IMO it'd be nice to have ASCII strings special-cased so that they
    are always encoded in ASCII.  This would allow for memory savings
    since, e.g., most symbols are expected to contain only ASCII
    characters.  It might also simplify interaction with C in certain
    cases; for instance, it would make it easy to have statically
    initialized ASCII Scheme strings.

Why not?  It does solve the initialization problem of dealing with strings
before setlocale has been called.

Let's say that a string is a union of either an ASCII char vector or a
wchar_t vector.  A character then is just a Unicode codepoint.
String-ref returns a wchar_t.  This is all in line with R6RS as I
understand it.
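
A rough C sketch of that union follows.  The struct layout and every name
in it are hypothetical, chosen only to illustrate the idea; this is not
Guile's actual string representation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical layout; names are illustrative, not Guile's. */
enum str_kind { STR_ASCII, STR_WIDE };

struct wstr {
    enum str_kind kind;
    size_t len;                 /* length in characters, not bytes */
    union {
        const char    *ascii;   /* one byte per character, all < 0x80 */
        const wchar_t *wide;    /* one wchar_t per codepoint */
    } u;
};

/* O(1) in both cases: each storage unit holds exactly one codepoint. */
uint32_t wstr_ref(const struct wstr *s, size_t k)
{
    assert(k < s->len);
    if (s->kind == STR_ASCII)
        return (unsigned char)s->u.ascii[k];
    return (uint32_t)s->u.wide[k];
}

/* A string may use the narrow form only if every codepoint is ASCII.
   The cast makes the check work whether wchar_t is signed or unsigned. */
int all_ascii(const wchar_t *w, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if ((unsigned long)w[i] > 0x7F)
            return 0;
    return 1;
}
```

A `string-set!' that stores a non-ASCII character into a narrow string
would have to widen the storage first, which is the main cost of
special-casing ASCII.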

There could then be a separate iterator and function set that does
(likely O(n)) operations on the grapheme clusters of strings.  A
grapheme cluster is a single written symbol which may be made up of
several codepoints.  Unicode Standard Annex #29 describes how to
partition a string into a set of graphemes.[1]
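
As a deliberately simplified illustration of such an iterator, the C
sketch below treats only the Combining Diacritical Marks block
(U+0300..U+036F) as extending a cluster.  That is an assumption made to
keep the example short; the real rules in UAX #29 cover far more cases.
The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplification: only U+0300..U+036F counts as a cluster extender.
   UAX #29 defines many more rules than this. */
static int is_combining_mark(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Return the index one past the cluster that starts at position i. */
size_t next_cluster(const uint32_t *s, size_t len, size_t i)
{
    if (i >= len)
        return len;
    i++;
    while (i < len && is_combining_mark(s[i]))
        i++;
    return i;
}

/* O(n): count clusters by stepping from one cluster start to the next. */
size_t count_clusters(const uint32_t *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i = next_cluster(s, len, i))
        n++;
    return n;
}
```

For the three codepoints U+0065, U+0301, U+0078 ("e", combining acute
accent, "x"), `string-length' would report 3 while this iterator sees 2
clusters.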

There is the problem of systems where wchar_t is 2 bytes instead of 4
bytes, like Cygwin.  For those systems, I'd recommend
restricting functionality to 16-bit characters instead of trying to
add an extra UTF-16 encoding/decoding step.  I think there should
always be a complete codepoint in each wchar_t.
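
A small C sketch of that restriction: a check for whether a codepoint can
be stored whole in one storage unit of a given width.  The function names
are hypothetical, and passing the unit size as a parameter is only there
so the 2-byte case can be exercised on any platform.

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Return 1 if cp can be stored as a complete codepoint in one unit of
   unit_bytes bytes.  Surrogates (U+D800..U+DFFF) are not Unicode scalar
   values, so they are rejected at every width. */
int codepoint_fits(uint32_t cp, size_t unit_bytes)
{
    if (cp >= 0xD800u && cp <= 0xDFFFu)
        return 0;
    if (unit_bytes >= 4)
        return cp <= 0x10FFFFu;   /* full Unicode range */
    return cp <= 0xFFFFu;         /* Basic Multilingual Plane only */
}

/* The same check for the platform's own wchar_t. */
int codepoint_fits_wchar(uint32_t cp)
{
    return codepoint_fits(cp, sizeof(wchar_t));
}
```

On a 2-byte wchar_t system, `integer->char' on anything above U+FFFF
would then signal an error rather than produce a surrogate pair.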

-- 
Mike Gran

[1] http://www.unicode.org/reports/tr29/




Re: r6rs libraries

2009-01-25 Thread Julian Graham
Hi everyone,

(Switching this conversation to guile-devel from guile-user, since it
seems more appropriate to this list...)

Alright, so I've been studying the van Tonder and Dybvig-Ghuloum
implementations and banging my head against chapter 7 of R6RS, all
with an eye towards mapping them onto Guile's module system, and I
can't for the life of me figure out why the existing implementations
are as complicated as they are.  Maybe some more advanced Schemers
than I can shed some light on the following:

* Import and export levels seem to be a fancy way of notifying the
library system of the time at which a library needs to be
loaded/evaluated -- that is, if you import something from [library
foo] for the expand phase of [library bar] you've got to evaluate
(i.e., convert to a Guile module) the S-exp for [library foo] before
you can evaluate the S-exp for [library bar].  The levels system is
simply a numerical way of encapsulating this information, but the
proper order of evaluation can also be inferred by inspecting the
import- and export-specs of the libraries being loaded -- i.e., if the
header of [library bar] specifies an import of anything from [library
foo], no matter at what level, it's a safe move to evaluate [library
foo] (if you haven't already done so) before finishing the evaluation
of [library bar].  Is that right?

* R6RS says that a library's imports need to be visited/instantiated
at the time the bindings they export are referenced.  Why?  As
above, why can't they be visited/instantiated at the time the imports
for the importing library are processed?  Is there any noticeable
difference to the user?  Or do you guys read R6RS 7.2 to mean that the
side-effects of top-level expressions absolutely need to happen at a
time determined by the import level?

* R6RS also says that implementations are free to visit/instantiate
libraries more or less often than is required by the import-export
graph.  Why would you want to visit/instantiate a library more than
once?  Why not just do it once, turn it into a module, and cache it?
Andy Wingo noted that some implementations do a fresh visit for every
phase (and that it's problematic), but I can't even see why you'd want
to if the spec lets you off the hook for it.
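
The ordering described in the first point, combined with the "do it once
and cache it" idea from the third, can be sketched in C as a depth-first
walk over the import graph.  The struct and function names are
hypothetical, levels are ignored entirely, and the sketch assumes the
import graph has no cycles.

```c
#include <stddef.h>

#define MAX_LIBS 16

/* Hypothetical library record: imports[] holds indices of other
   libraries in the same table, regardless of import level. */
struct library {
    const char *name;
    size_t n_imports;
    size_t imports[MAX_LIBS];
    int evaluated;      /* set once; later imports reuse the result */
};

/* Evaluate every import before the library itself, and append each
   library to order[] the first time it is evaluated.
   Assumption: the import graph is acyclic. */
void evaluate(struct library *libs, size_t id,
              size_t *order, size_t *n_order)
{
    struct library *lib = &libs[id];
    if (lib->evaluated)
        return;
    for (size_t i = 0; i < lib->n_imports; i++)
        evaluate(libs, lib->imports[i], order, n_order);
    lib->evaluated = 1;
    order[(*n_order)++] = id;
}
```

With [library bar] importing [library foo] and a base library, and
[library foo] importing the base library, the walk evaluates base, then
foo, then bar, and each exactly once.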

I understand that the authors of the reference implementations
re-created a lot of machinery out of whole cloth since they were
avoiding assumptions about features of their target Scheme platforms,
but, man, both van Tonder and Dybvig-Ghuloum look like overkill for
Guile.  Am I missing a major piece of understanding here?


Regards,
Julian