Wide strings
Hi. I know there has been a lot of talk about wide characters and Unicode over the years. I'd like to see it happen because how they are implemented will determine the future of a couple of my side-projects. I could pitch in, if you needed some help.

I looked over the history of guile-devel, and there has been a tremendous amount of discussion about it. Also, the Schemes each seem to be inventing their own solution.

Tom Lord's 2003 proposal
http://lists.gnu.org/archive/html/guile-devel/2003-11/msg00036.html

Marius Vollmer's idea
http://lists.gnu.org/archive/html/guile-devel/2005-08/msg00029.html

R6RS
http://www.r6rs.org/final/html/r6rs-lib/r6rs-lib-Z-H-2.html#node_chap_1

MIT Scheme
http://www.gnu.org/software/mit-scheme/documentation/mit-scheme-ref/Internal-Representation-of-Characters.html

There has also been some back-and-forth about to what extent the internal representation of strings should be accessible, whether the internal representation should be a vector or can be something more efficient, and how not to completely break regular expressions. There is also the question of whether a wide character is a codepoint or a grapheme.

Is there a current proposal on the table for how to get there? If you are suffering from a dearth of opinions, I certainly have some ideas.
Re: Wide strings
Hello!

Mike Gran spk...@yahoo.com writes:

> Hi. I know there has been a lot of talk about wide characters and Unicode over the years. I'd like to see it happen because how they are implemented will determine the future of a couple of my side-projects. I could pitch in, if you needed some help.

Indeed, it looks like you have some experience with GuCu! ;-)

I agree it would be really nice to have Unicode support, but I'm not aware of any plan, so please go ahead! :-)

A few considerations regarding the inevitable debate about the internal string representation:

1. IMO it'd be nice to have ASCII strings special-cased so that they are always encoded in ASCII. This would allow for memory savings since, e.g., most symbols are expected to contain only ASCII characters. It might also simplify interaction with C in certain cases; for instance, it would make it easy to have statically initialized ASCII Scheme strings [0].

2. O(1) `string-{ref,set!}' is somewhat mandated by parts of SRFI-13. For instance, `substring' takes indices as parameters, `string-index' returns an index, etc. (John Cowan once argued that an abstract type to represent the position would remove this limitation [1], but the fact is that we have to live with SRFI-13).

3. GLib et al. like UTF-8, and it'd be nice to minimize the overhead when interfacing with these libs (e.g., by avoiding translations from one string representation to another).

4. It might be nice to be friendly to `wchar_t' and friends.

Interestingly, some of these things are contradictory. Will Clinger has a good summary of a range of possible implementations:

https://trac.ccs.neu.edu/trac/larceny/wiki/StringRepresentations

Thanks,
Ludo'.

[0] http://thread.gmane.org/gmane.lisp.guile.devel/7998
[1] http://lists.r6rs.org/pipermail/r6rs-discuss/2007-April/002252.html
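To make the tension between points 2 and 3 concrete, here is a minimal C sketch. It is not Guile code; the function names `utf8_offset_of` and `fixed_ref` are made up for illustration, and the UTF-8 input is assumed to be well-formed and NUL-terminated. Finding the k-th code point in UTF-8 needs a scan, while a fixed-width array gives a direct access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte offset of the k-th code point (0-based) in a
   well-formed, NUL-terminated UTF-8 string, or -1 if the string has fewer
   than k+1 code points. It has to walk the string, so it is O(n). */
long utf8_offset_of(const char *s, size_t k)
{
    size_t seen = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        /* Continuation bytes look like 10xxxxxx; any other byte starts
           a new code point. */
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            if (seen == k)
                return (long)i;
            seen++;
        }
    }
    return -1;
}

/* Hypothetical helper: with one array element per code point, the k-th
   code point is a plain array access, so it is O(1). */
uint32_t fixed_ref(const uint32_t *s, size_t len, size_t k)
{
    assert(k < len);
    return s[k];
}
```

For the three code points of "aéz" encoded as UTF-8 (bytes 61 C3 A9 7A), the offsets are 0, 1, and 3, so an index-based API such as `substring' would pay for a scan on every call unless extra indexing data is kept.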
Re: Wide strings
2009/1/25 Ludovic Courtès l...@gnu.org:

> I agree it would be really nice to have Unicode support, but I'm not aware of any plan, so please go ahead! :-)

Indeed.

> A few considerations regarding the inevitable debate about the internal string representation: [...]

But what about the other possible debate, about the API? Are you thinking that we should accept R6RS's choice?

(I really haven't read up on all this enough - however when reading Tom Lord's analysis just now, I was thinking why not just specify that things like char-upcase don't work in the difficult cases, and it seems to me that this is what R6RS chose to do. So at first glance the R6RS API looks OK to me.

(Although I read them at the time, I can't remember now what Tom's remaining concerns with the R6RS proposal were; should probably go back and read those again. On the other hand, Tom did eventually vote for R6RS, so I would guess that they can't have been that bad.))

Regards,
Neil
Re: Wide strings
From: Ludovic Courtès l...@gnu.org

I believe that we should aim for R6RS strings.

I think the most important thing is to have humility in the face of an impossible problem: how to encode all textual information. It is important to stand on the shoulders of giants here. It becomes a matter of deciding which actively developed library of wide character functions is to be used and how to integrate it.

There are 3 good, actively developed solutions of which I am aware.

1. Use GNU libc functionality. Encode wide strings as wchar_t.

2. Use GLib functionality. Encode wide strings as UTF-8. Possibly give up on O(1). Possibly add indexing information to string to allow O(1), which might negate the space advantage of UTF-8.

3. Use IBM's ICU4c. Encode wide strings as UTF-16. Thus, add an obscure dependency.

Option 3 is likely a non-starter, because it seems that Guile has tried to avoid adding new non-GNU dependencies. It is technologically a great solution, IMHO.

Option 1 is probably the way to go, because it keeps Guile close to the metal and keeps dependencies out of it. Unfortunately, UTF-8 strings would require conversion.

> 1. IMO it'd be nice to have ASCII strings special-cased so that they are always encoded in ASCII. This would allow for memory savings since, e.g., most symbols are expected to contain only ASCII characters. It might also simplify interaction with C in certain cases; for instance, it would make it easy to have statically initialized ASCII Scheme strings.

Why not? It does solve the initialization problem of dealing with strings before setlocale has been called.

Let's say that a string is a union of either an ASCII char vector or a wchar_t vector. A character then is just a Unicode codepoint. String-ref returns a wchar_t. This is all in line with R6RS as I understand it.

There could then be a separate iterator and function set that does (likely O(n)) operations on the grapheme clusters of strings. A grapheme cluster is a single written symbol which may be made up of several codepoints. Unicode Standard Annex #29 describes how to partition a string into a set of graphemes.[1]

There is the problem of systems where wchar_t is 2 bytes instead of 4 bytes, like Cygwin. For those systems, I'd recommend restricting functionality to 16-bit characters instead of trying to add an extra UTF-16 encoding/decoding step. I think there should always be a complete codepoint in each wchar_t.

--
Mike Gran

[1] http://www.unicode.org/reports/tr29/
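The "union of either an ASCII char vector or a wchar_t vector" idea could look roughly like the C sketch below. This is only an illustration under assumptions: `struct str`, `str_ref`, and `codepoint_fits` are hypothetical names, not Guile's actual string layout or API, and the check ignores finer points such as surrogate code points.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical layout, for illustration only. */
enum str_kind { STR_NARROW, STR_WIDE };

struct str {
    enum str_kind kind;
    size_t len;                 /* length in code points */
    union {
        const char *narrow;     /* ASCII only: one byte per code point */
        const wchar_t *wide;    /* one wchar_t per code point */
    } u;
};

/* O(1) for both kinds, because every element holds one complete
   code point. */
wchar_t str_ref(const struct str *s, size_t k)
{
    assert(k < s->len);
    if (s->kind == STR_NARROW)
        return (wchar_t)(unsigned char)s->u.narrow[k];
    return s->u.wide[k];
}

/* A code point is accepted only if a single wchar_t can hold it whole.
   Where wchar_t is 16 bits wide, this rejects everything above U+FFFF
   rather than falling back to UTF-16 surrogate pairs. */
int codepoint_fits(unsigned long cp)
{
    return cp <= 0x10FFFFUL && cp <= (unsigned long)WCHAR_MAX;
}
```

A narrow string can point at a C string literal without any conversion, which is the static-initialization benefit mentioned in the quoted point 1. Grapheme-cluster iteration would sit on top of `str_ref` as a separate O(n) layer.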
Re: r6rs libraries
Hi everyone,

(Switching this conversation to guile-devel from guile-user, since it seems more appropriate to this list...)

Alright, so I've been studying the van Tonder and Dybvig-Ghuloum implementations and banging my head against chapter 7 of R6RS, all with an eye towards mapping them onto Guile's module system, and I can't for the life of me figure out why the existing implementations are as complicated as they are. Maybe some more advanced Schemers than I can shed some light on the following:

* Import and export levels seem to be a fancy way of notifying the library system of the time at which a library needs to be loaded/evaluated -- that is, if you import something from [library foo] for the expand phase of [library bar] you've got to evaluate (i.e., convert to a Guile module) the S-exp for [library foo] before you can evaluate the S-exp for [library bar]. The levels system is simply a numerical way of encapsulating this information, but the proper order of evaluation can also be inferred by inspecting the import- and export-specs of the libraries being loaded -- i.e., if the header of [library bar] specifies an import of anything from [library foo], no matter at what level, it's a safe move to evaluate [library foo] (if you haven't already done so) before finishing the evaluation of [library bar]. Is that right?

* R6RS says that a library's imports need to be visited/instantiated at the time the bindings they export are referenced. Why? As above, why can't they be visited/instantiated at the time the imports for the importing library are processed? Is there any noticeable difference to the user? Or do you guys read R6RS 7.2 to mean that the side-effects of top-level expressions absolutely need to happen at a time determined by the import level?

* R6RS also says that implementations are free to visit/instantiate libraries more or less often than is required by the import-export graph. Why would you want to visit/instantiate a library more than once? Why not just do it once, turn it into a module, and cache it? Andy Wingo noted that some implementations do a fresh visit for every phase (and that it's problematic), but I can't even see why you'd want to if the spec lets you off the hook for it.

I understand that the authors of the reference implementation re-created a lot of machinery out of whole cloth since they were avoiding assumptions about features of their target Scheme platforms, but, man, both van Tonder and Dybvig-Ghuloum look like overkill for Guile. Am I missing a major piece of understanding here?

Regards,
Julian
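The "evaluate each import first, do it once, and cache it" strategy asked about above can be modelled as a depth-first walk over the import graph. The C sketch below is only a toy model under assumptions: `struct library` and `instantiate` are hypothetical names, import levels are deliberately ignored, and the import graph is assumed to have no cycles. It says nothing about whether R6RS phase semantics permit this ordering in every case.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IMPORTS 4

/* Hypothetical model of a library: a name, the libraries it imports,
   and a flag recording whether it has already been instantiated. */
struct library {
    const char *name;
    struct library *imports[MAX_IMPORTS];
    size_t n_imports;
    int instantiated;
};

/* Instantiate every import first (whatever its level), then the library
   itself, at most once. Names are appended to `order` in evaluation
   order; the new count is returned. Assumes an acyclic import graph. */
size_t instantiate(struct library *lib, const char **order, size_t count)
{
    if (lib->instantiated)
        return count;            /* cached: nothing to do */
    lib->instantiated = 1;
    for (size_t i = 0; i < lib->n_imports; i++)
        count = instantiate(lib->imports[i], order, count);
    order[count++] = lib->name;  /* the library body would be evaluated here */
    return count;
}
```

If [library bar] imports [library foo], calling `instantiate` on bar records foo before bar, and a later call on either library is a no-op because of the cached flag.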