Re: Representation of String and ByteString in Mozart VM 2.0

Sébastien Doeraene Sun, 03 Jun 2012 01:27:52 -0700

Hi Kenny and everyone,

The question on strings is indeed a delicate issue. One thing we have to
take into account is backward compatibility, which we kind of need.

The fact that Mozart 1.4 uses lists of ISO-8859-1 code points to represent
strings is half an issue. Per se, that's a list of integers, right? And we
still support lists, and integers (obviously). The fact that such a list is
treated as a string is completely algorithm-dependent. (Much like the way
RAX holds an integer or a boolean, depending on the code that uses it.)
So, in order to keep support for such strings, we only need to keep the
APIs that deal with such strings (both in input and output). That's not
difficult. It only means that if we want another kind of string to be the
default in Mozart 2.0, we'll have a duplicate set of procedures in the API.
That's annoying for core procedures (like VirtualString.toString), but not
a serious problem.
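
To illustrate the point that a legacy string is just a list of integers
until some procedure decides to read it as text, here is a minimal sketch
in Python (not Oz). The helper names legacy_to_text and text_to_legacy are
hypothetical and are not part of any Mozart API.

```python
def is_legacy_string(value):
    """True if value is a list of integers in the ISO-8859-1 range 0..255."""
    return isinstance(value, list) and all(
        isinstance(x, int) and not isinstance(x, bool) and 0 <= x <= 255
        for x in value
    )


def legacy_to_text(codes):
    """Read a list of ISO-8859-1 code points as text (hypothetical helper)."""
    if not is_legacy_string(codes):
        raise ValueError("not a list of ISO-8859-1 code points")
    return bytes(codes).decode("latin-1")


def text_to_legacy(text):
    """Turn text back into a list of ISO-8859-1 code points."""
    return list(text.encode("latin-1"))


data = [72, 101, 108, 108, 111]
print(sum(data))             # the same value used as plain integers
print(legacy_to_text(data))  # the same value used as a string
```

The list itself never changes; only the procedure applied to it decides
whether it is "a string".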

Now, regarding ByteString. We will keep it, both for backward compatibility
and forward progress. For new code, a ByteString can be used for what it's
named after: a string of *bytes*. Very useful for I/O, for example. Again,
APIs that used ByteString as strings [of characters] will need duplicates.

Now, for the new representation. First, I think I need to put a little
reminder about the following: Unicode and UTF-x are two very different
things. Unicode is independent of the encoding that is used to represent
it. Per se, it is basically a mapping (an injective function) from natural
numbers (N) to abstract characters. That's all (well that's all for UCS,
but Unicode also defines many other related *properties* of characters).

UTF-x and other encodings define how to encode/decode a list of such
natural numbers as a list of bytes. Strictly speaking, they have nothing to
do with text at all. One might very well think of using UTF-8 as a means to
encode natural numbers of arbitrary length.
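
As a rough illustration of UTF-8 as a pure number-to-bytes scheme, the
following Python sketch encodes integers without any reference to text.
The function name utf8_encode_number is made up for this example, and the
sketch assumes the modern UTF-8 range of 0 to 0x10FFFF rather than truly
arbitrary length. Unlike a real text codec, it does not reject surrogate
values, since it treats its input as plain numbers.

```python
def utf8_encode_number(n):
    """Encode one natural number using the UTF-8 bit layout."""
    if n < 0 or n > 0x10FFFF:
        raise ValueError("number out of the range covered by this sketch")
    if n < 0x80:
        # 7 bits fit in a single byte: 0xxxxxxx
        return bytes([n])
    if n < 0x800:
        # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (n >> 6), 0x80 | (n & 0x3F)])
    if n < 0x10000:
        # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (n >> 12),
            0x80 | ((n >> 6) & 0x3F),
            0x80 | (n & 0x3F),
        ])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (n >> 18),
        0x80 | ((n >> 12) & 0x3F),
        0x80 | ((n >> 6) & 0x3F),
        0x80 | (n & 0x3F),
    ])


def utf8_encode_numbers(numbers):
    """Encode a list of natural numbers as one byte sequence."""
    return b"".join(utf8_encode_number(n) for n in numbers)


print(utf8_encode_numbers([65, 233, 0x20AC]).hex())
```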

So, from a *language* point of view, I'd like to see the following. We have
two new types: Character and String (or CharString? I prefer the former).

A Character is a Unicode character (and I don't care how it's supposed to
be encoded). Operations applicable to Characters are the ones defined by the
Unicode standard: obviously converting to and from an Integer, but also
asking for Unicode properties of the character (is it lowercase, uppercase,
etc.).
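
To give an idea of what such Character operations look like in practice,
here is a small Python sketch using the standard unicodedata module. It
only shows the kind of interface meant here; the function name describe is
hypothetical and nothing in it reflects an actual Mozart 2.0 API.

```python
import unicodedata


def describe(code_point):
    """Collect a few Unicode properties of the character at code_point."""
    ch = chr(code_point)              # Integer -> Character
    return {
        "code_point": ord(ch),        # Character -> Integer
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),
        "is_lower": ch.islower(),
        "is_upper": ch.isupper(),
    }


info = describe(0xE9)
print(info["name"], info["category"], info["is_lower"])
```

None of these operations depends on how the character would be encoded as
bytes.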

A String is a finite sequence of Characters (and again, I don't care how
it's supposed to be encoded). I can get its length, extract the ith
character in the sequence, etc., and convert it from/to a List/Array/Tuple
of Characters.

And then, of course, I have procedures to encode/decode a String as a
ByteString (and/or a List of Characters as a List of Bytes), according to a
certain encoding.
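
The separation between a String and its encoded ByteString can be sketched
in Python, where str plays the role of String and bytes the role of
ByteString. The function names encode_string and decode_bytes are
placeholders chosen for this example only.

```python
def encode_string(text, encoding):
    """String -> ByteString, according to the given encoding."""
    return text.encode(encoding)


def decode_bytes(data, encoding):
    """ByteString -> String, according to the given encoding."""
    return data.decode(encoding)


name = "S\u00e9bastien"
for encoding in ("latin-1", "utf-8", "utf-16-le", "utf-32-le"):
    data = encode_string(name, encoding)
    # Decoding with the same encoding gives the original String back.
    assert decode_bytes(data, encoding) == name
    # Character count stays the same; byte count depends on the encoding.
    print(encoding, len(name), len(data))
```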

From an *implementation* point of view, of course we need a way to store
Characters and Strings in memory.

For Characters, I think it's pretty obvious to simply store the UTF-32
encoded character. Primitive data occupy at least 32/64 bits anyway, so
let's use them.

For Strings, that's not so obvious. I guess all of UTF-8, UTF-16 and UTF-32
are acceptable and have their merits. It seems to me, however, that both
Strings and Atoms should follow the same encoding (yes, this sentence
implies that I am open to changing the encoding of Atoms). Which of these
to pick is, from my point of view, the only unknown in the design, and the
only thing I have no strong position on.
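
One concrete difference between the three candidates is how many code units
a given String needs. The following Python sketch counts them for a sample
containing an ASCII letter, a BMP character and a non-BMP character. It is
only meant to make the trade-off visible; the function name code_units is
made up, and the sample string is arbitrary.

```python
def code_units(text):
    """Number of code units needed to store text in each encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),            # 1-byte units
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 2-byte units
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 4-byte units
    }


# 'a', EURO SIGN (U+20AC), and U+1F600, which lies outside the BMP.
sample = "a\u20ac\U0001F600"
print(len(sample), code_units(sample))
```

Only with UTF-32 does the number of code units equal the number of
Characters, so extracting the ith character is a direct index there, while
UTF-8 and UTF-16 need either a scan or some auxiliary index.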

According to this manifesto [1], we should use UTF-8 for internal encoding.
UTF-16 might be a better bet, if only because of the existence of the ICU
library [2], which uses UTF-16 internally, and which could be used for all
text-related features of Mozart.

Cheers,
Sébastien

[1] http://www.utf8everywhere.org/
[2] http://userguide.icu-project.org/

On Sun, Jun 3, 2012 at 8:58 AM, Kenny TM~ <kenn...@gmail.com> wrote:

> Hi all,
>
> In Mozart 1.4 strings are represented as a list of ISO-8859-1 code
> points. In 2.0, Unicode will be supported, so how will strings and
> characters be represented? Will they become a list of UTF-32 code
> points (a list of integers from 0 to 0x10ffff), a list of UTF-16 code
> units (0 to 0xffff), a list of UTF-8 code units (0 to 255), or
> something else as indicated in [1]?
>
> UTF-32 looks like the most natural representation for a linked list.
> However, in 2.0, atoms are internally represented as a UTF-16 string
> (const char16_t*)...
>
> Also when a (virtual) string is converted to a ByteString, how will it
> be represented? An array of UTF-8 / UTF-16LE / UTF-16BE / UTF-32LE /
> UTF-32BE code units or something else, or will ByteString be
> deprecated?
>
> Thanks,
> -- Kenny.
>
> [1]:
> http://lists.gforge.info.ucl.ac.be/pipermail/mozart-hackers/2012/003406.html
>
> _________________________________________________________________________________
> mozart-hackers mailing list
> mozart-hackers@mozart-oz.org
> http://www.mozart-oz.org/mailman/listinfo/mozart-hackers
>
