On Mar 19, 2007, at 8:17 PM, [EMAIL PROTECTED] wrote:

UTF-8 and UTF-16 require one or more code units to represent a given
scalar value. Since the number of code units depends on the scalar value
being encoded, there's no constant-time arithmetic that maps the i'th
scalar value to the j'th code unit. If you want the i'th scalar value in a
UTF-8 or UTF-16 string you have to search for it. And that, of course, is
what string-ref is: a request for the i'th scalar value (returned as a
character).
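
Just to make that cost concrete for myself: the "search" is a linear scan
over lead bytes. Here is a minimal sketch in R6RS-ish Scheme, assuming
well-formed UTF-8 held in a bytevector (utf8-index->offset is just a name
I made up, not anything from the draft):

  (import (rnrs))

  ;; Return the byte offset of the i'th scalar value in a bytevector holding
  ;; well-formed UTF-8.  Continuation bytes look like 10xxxxxx, so we step
  ;; only over lead bytes, advancing by the length each lead byte encodes.
  (define (utf8-index->offset bv i)
    (let loop ((offset 0) (remaining i))
      (cond ((>= offset (bytevector-length bv))
             (error 'utf8-index->offset "index out of range" i))
            ((zero? remaining) offset)
            (else
             (let ((b (bytevector-u8-ref bv offset)))
               (loop (+ offset (cond ((< b #x80) 1)   ; 0xxxxxxx
                                     ((< b #xE0) 2)   ; 110xxxxx
                                     ((< b #xF0) 3)   ; 1110xxxx
                                     (else 4)))       ; 11110xxx
                     (- remaining 1)))))))

Each lookup is O(i), and that is exactly the price string-ref would pay
over such a representation.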

From what I understand, UTF-8, UTF-16, and UTF-32 are interchange formats:
Unicode text encoded in any one of them can be converted to any other
without loss of information (right?). Moreover, the internal representation
of strings does not have to match the external representation. For example,
you can read a UTF-32-encoded file into a variable-length buffer to save
some space (sometimes); or, alternatively, you can read a UTF-8-encoded
file into a fixed-length buffer to save time on random access (sometimes).
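
For what it's worth, the (rnrs bytevectors) library already exposes the
conversions that make this a round trip; the sketch below is just me
convincing myself there is no information loss (the endianness arguments
are there only because the UTF-16/UTF-32 procedures need one):

  (import (rnrs))

  (define s "caf\xE9; \x3BB;")   ; "café λ" -- contains non-ASCII scalar values

  (define u8  (string->utf8  s))                   ; 1-4 bytes per scalar value
  (define u16 (string->utf16 s (endianness big)))  ; 2 or 4 bytes per scalar value
  (define u32 (string->utf32 s (endianness big)))  ; always 4 bytes per scalar value

  ;; each round trip recovers the original string
  (string=? s (utf8->string  u8))                    ; => #t
  (string=? s (utf16->string u16 (endianness big)))  ; => #t
  (string=? s (utf32->string u32 (endianness big)))  ; => #t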

Is the following a valid summary of the issue?

  The existence of the string-ref and string-set! operations seems to
  imply that a variable-length internal representation is not an option,
  while a fixed-length representation wastes space and is therefore
  inefficient (mostly in an ASCII-centered world).
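
To put a toy number on the space point: for ASCII-only text a fixed-width
UTF-32 buffer is four times the size of the UTF-8 form, e.g.

  (import (rnrs))

  (bytevector-length (string->utf8  "hello world"))   ; => 11
  (bytevector-length (string->utf32 "hello world"))   ; => 44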

Aziz,,,

_______________________________________________
r6rs-discuss mailing list
r6rs-discuss@lists.r6rs.org
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
