[r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

John Cowan Fri, 18 Sep 2009 22:25:45 -0700

This is a proposal for the removal of string-set! (and consequently
string-fill!) from the R7RS small Scheme language.  I am publishing this
document to invite wide comment.  There is nothing official about it.
I very gratefully acknowledge the kind help of Alex Shinn, who provided
the topic sentences for most of the paragraphs below.  However, I retain
sole responsibility for this document, including all errors.


I believe that despite the prescription of the draft WG1 charter that
no features of IEEE Scheme (a subset of R4RS) should be removed from
R7RS small Scheme, an exception should be made for string-set!, for at
least the following reasons:

1) Immutable strings are more purely functional, and allow many
optimizations, such as being transparently and freely shareable between
procedures and between threads without concern for uncontrolled mutation.
For this and other reasons, the general trend in new languages/runtimes
such as Java and C# is toward immutable strings; unfortunately, this is
the kind of argument that Schemers usually don't like, so I won't bother
mentioning it.  :-)

2) Algorithms where you want to modify strings in the middle are rare,
and many of the classic devices (such as string-upcase!, a procedure that
mutates a string in place) are awkward or impossible with representations
that make use of characters of variable length such as UTF-8.  Typical
string algorithms want to also be able to do insertions and deletions,
which are not directly possible with classical Scheme strings.  Better
representations such as trees of immutable strings do allow such changes,
as well as making string appends O(n) in the number of strings rather
than in the sum of their lengths.

3) If strings are immutable, it's possible to have both fast O(1)
access to individual characters or substrings, and fairly space-efficient
representation of full Unicode strings, by using different representations
for strings drawn from diferent character repertoires.  For example,
an implementation might use 8-bit code units when all characters are
less than \#x100, 16-bit code units when all characters are less than
\#x10000, and 32-bit code units otherwise.

Unfortunately, mutating even a single character in such a representation
may require the entire string to be copied, which means that it also
requires indirection through a separate header that can be redirected
to point to the newly allocated code unit sequence.  Immutable strings
can just *be* their sequences, with a few extra bits indicating the
size of the code units, although this design does prevent easy sharing
of substrings.

4) As currently designed, strings are functionally just vectors of
characters.  In an 8-bit world, using the traditional representation
of strings carries a 4:1 storage advantage, making it worthwhile
to distinguish them clearly from general vectors  But 21-bit Unicode
characters are a much better fit, if represented as immediate (unboxed)
values, for general vectors using 32-bit pointers.  Granted that not all
small Scheme systems will provide full Unicode support, general vectors
start to look much less expensive than they once were.  In short: if
you want something that behaves like a vector of characters, simply use
a general vector that contains characters.

5) Making strings immutable also permits a design in which all strings
are Unicode-normalized.  Though this has its own costs (for example,
appending two strings may create a new string whose length is different
from the lengths of the two source strings), it would be effectively
impossible where arbitrary mutation is allowed.


As a consequence of removing string-set!, string-fill! (not in IEEE
Scheme) becomes impossible and string-copy less useful.  I do not propose
to remove string-copy, however, because it can eliminate space leaks
that are caused by taking a small shared substring of a large existing
string: when the larger string should be GC'ed, it is retained as a
whole because of the shared substring.  Using string-copy judiciously
can prevent such leaks.

-- 
John Cowan  [email protected]  http://ccil.org/~cowan
If he has seen farther than others,
        it is because he is standing on a stack of dwarves.
                --Mike Champion, describing Tim Berners-Lee (adapted)

_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss

[r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

Reply via email to