At Thu, 18 Jun 2009 11:30:53 -0400, Carl Eastlund wrote:
> On Thu, Jun 18, 2009 at 3:35 AM, Matthew Flatt<mfl...@cs.utah.edu> wrote:
> > At Wed, 17 Jun 2009 20:28:10 -0400, Carl Eastlund wrote:
> >> Why do symbol->string and keyword->string produce mutable strings? In
> >> so doing, they have to allocate a new string every time. Is there any
> >> way to get at an immutable string that is not allocated more than
> >> once? I would prefer that this be the default behavior; R6RS already
> >> specifies that symbol->string produces an immutable string, for
> >> instance.
> >
> > Symbols and keywords are represented internally in UTF-8, while strings
> > are represented internally as UTF-32. So, there's not an obvious way to
> > have `symbol->string' avoid allocation, except by either caching a
> > string reference in the symbol (probably not worth the extra space,
> > since most symbols are never converted) or keeping a symbol-to-string
> > mapping in a hash table (which any programmer can do externally).
> >
> > I think it would be a good idea to switch to an immutable-string
> > result, but considering potential incompatibility, it has never seemed
> > worthwhile in the short run.
>
> I see. I have contracts set up to accept only symbols and keywords
> whose names are ASCII strings; I was planning to use a weak, eq?-based
> hash of their names to shortcut the test. Apparently, though, I
> cannot get eq?-unique names for symbols and strings. If I hash the
> symbols and keywords themselves, I believe the weak table can never
> reclaim the space (since interned symbols and keywords are forgeable);

No --- symbols and keywords are GCed, so a weak hash table would work.
(And weakness in hash tables isn't about whether you could synthesize
the key. We have `equal?'-based hash tables with weak keys, after all.)

> However, while I'm musing out loud... would it be possible to have
> symbol->bytes and keyword->bytes that produce the UTF-8 representation
> (presumably with guarantees of uniqueness, immutability, and proper
> UTF-8 encoding)?

Do you mean that `symbol->bytes' would avoid allocation, which is
possible because the symbol is UTF-8 encoded?

If so, there's another part of the representation story that I left out
last time. A symbol's content is "inlined" into the allocated symbol
record, while a string or a byte string is a record containing a
pointer to the string's characters. This difference has to do with C
interoperability and a GC-based prohibition on pointers into the
interior of an allocated object.

So, there are many ways in which the current representations don't
yield a cheap `symbol->bytes' operation.
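
For illustration, the external weak-table approach mentioned in this
thread might look roughly like the sketch below. It is only a sketch
under stated assumptions: the names `name->immutable-string' and
`ascii-name?' are hypothetical (nothing in the libraries provides
them), and it is written for `#lang racket/base', a language name the
thread predates. Both tables are keyed on the symbol or keyword itself
with `eq?' and hold their keys weakly, so entries can be reclaimed once
the key is otherwise unreachable.

```racket
#lang racket/base

;; Hypothetical helper: the name of a symbol or keyword as a fresh string.
(define (name-of v)
  (if (symbol? v)
      (symbol->string v)
      (keyword->string v)))

;; Weak, eq?-based cache: one immutable string per symbol/keyword.
;; The cached string does not refer back to its key, so the weak
;; table does not keep the key alive.
(define name-cache (make-weak-hasheq))

(define (name->immutable-string v)
  (hash-ref! name-cache v
             (lambda () (string->immutable-string (name-of v)))))

;; Weak, eq?-based cache of the "name is ASCII" test, so the
;; character scan runs at most once per live symbol/keyword.
(define ascii-cache (make-weak-hasheq))

(define (ascii-name? v)
  (hash-ref! ascii-cache v
             (lambda ()
               (for/and ([c (in-string (name-of v))])
                 (< (char->integer c) 128)))))
```

With this in place, repeated calls such as
(name->immutable-string 'abc) return the same `eq?' string for as long
as the symbol stays reachable, at the cost of one table entry per
converted name.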