Hi Kenny and everyone,

The question of strings is indeed a delicate issue. One thing we have to take into account is backward compatibility, which we kind of need.
The fact that Mozart 1.4 uses lists of ISO-8859-1 code points to represent strings is only half an issue. Per se, that's a list of integers, right? And we still support lists, and integers (obviously). Whether such a list is treated as a string depends entirely on the algorithm that consumes it. (Much like the way RAX holds an integer or a boolean, depending on the code that uses it.) So, in order to keep support for such strings, we only need to keep the APIs that deal with them (both as input and as output). That's not difficult. It only means that if we want another kind of string to be the default in Mozart 2.0, we'll have a duplicate set of procedures in the API. That's somewhat annoying for core procedures (like VirtualString.toString), but not a big deal.

Now, regarding ByteString. We will keep it, both for backward compatibility and forward progress. For new code, a ByteString can be used for what it's named after: a string of *bytes*. Very useful for I/O, for example. Again, APIs that used ByteString as strings [of characters] will need duplicates.

Now, for the new representation. First, I think I need to put in a little reminder: Unicode and UTF-x are two very different things. Unicode is independent of the encoding that is used to represent it. Per se, it is basically a mapping (an injective function) from natural numbers (N) to abstract characters. That's all (well, that's all for UCS, but Unicode also defines many other related *properties* of characters). UTF-x and other encodings define how to encode/decode a list of such natural numbers as a list of bytes. Strictly speaking, they need not have anything to do with text at all. One might very well think of using UTF-8 as a variable-length encoding for plain natural numbers (within its range).

So, from a *language* point of view, I'd like to see the following. We have two new types: Character and String (or CharString? I prefer the former). A Character is a Unicode character (and I don't care how it's supposed to be encoded).
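To make the code point versus encoding distinction concrete, here is a minimal sketch. It is written in Python rather than Oz, and it is not a proposal for the Mozart API: Python's str merely stands in for the proposed String, bytes stands in for ByteString, and all variable names are made up for illustration.

```python
# Illustration only: Python's str plays the role of the proposed String,
# bytes the role of ByteString. Nothing here is Mozart/Oz API.

text = "Sébastien"

# The abstract string: a sequence of Unicode code points, no encoding involved.
code_points = [ord(ch) for ch in text]

# The same sequence serialized to bytes under three different encodings.
utf8 = text.encode("utf-8")       # 'é' takes 2 bytes, the rest 1 byte each
utf16 = text.encode("utf-16-le")  # 2 bytes per character here
utf32 = text.encode("utf-32-le")  # always 4 bytes per character

# Decoding any of them gives back the same abstract string.
assert utf8.decode("utf-8") == text
assert utf16.decode("utf-16-le") == text
assert utf32.decode("utf-32-le") == text

# A Mozart 1.4-style string: a plain list of ISO-8859-1 code points.
# For characters below 256 it coincides with the list of Unicode code points.
latin1_list = list(text.encode("iso-8859-1"))
assert latin1_list == code_points

# A character outside the Basic Multilingual Plane: one code point,
# but two UTF-16 code units (a surrogate pair) and four UTF-8 bytes.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
utf16_units = len(clef.encode("utf-16-le")) // 2
utf8_bytes = len(clef.encode("utf-8"))

print(code_points)
print(len(utf8), len(utf16), len(utf32))
print(len(clef), utf16_units, utf8_bytes)
```

The point of the sketch is that the list of code points is the same whichever encoding is chosen for storage or I/O; only the byte sequences differ.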
Operations applicable to Characters are the ones defined by the Unicode standard. Obviously, converting to and from an Integer. But also querying the Unicode properties of the character: is it lowercase, uppercase? etc.

A String is a finite sequence of Characters (and again, I don't care how it's supposed to be encoded). I can get its length, extract the ith character in the sequence, etc. Convert it from/to a List/Array/Tuple of Characters.

And then, of course, I have procedures to encode/decode a String as a ByteString (and/or a List of Characters as a List of Bytes), according to a certain encoding.

From an *implementation* point of view, of course we need a way to store Characters and Strings in memory. For Characters, I think it's pretty obvious to simply store the UTF-32 encoded character. Primitive data occupy at least 32/64 bits anyway, so let's use them.

For Strings, that's not so obvious. I guess all of UTF-8, UTF-16 and UTF-32 are acceptable and have their merits. It seems to me, however, that both Strings and Atoms should follow the same encoding (yes, this sentence implies that I am open to changing the encoding of Atoms). Which of these to pick is, from my point of view, the only unknown in the design, and the only thing I have no strong position on. According to this manifesto [1], we should use UTF-8 for internal encoding. UTF-16 might be a better bet, if only because of the existence of the ICU library [2], which uses UTF-16 internally, and which could be used for all text-related features of Mozart.

Cheers,
Sébastien

[1] http://www.utf8everywhere.org/
[2] http://userguide.icu-project.org/

On Sun, Jun 3, 2012 at 8:58 AM, Kenny TM~ <kenn...@gmail.com> wrote:
> Hi all,
>
> In Mozart 1.4 strings are represented as a list of ISO-8859-1 code points. In 2.0, Unicode will be supported, so how will strings and characters be represented?
> Will they become a list of UTF-32 code points (a list of integers from 0 to 0x10ffff), a list of UTF-16 code units (0 to 0xffff), a list of UTF-8 code units (0 to 255), or something else as indicated in [1]?
>
> UTF-32 looks like the most natural representation for a linked list. However, in 2.0, atoms are internally represented as a UTF-16 string (const char16_t*)...
>
> Also when a (virtual) string is converted to a ByteString, how will it be represented? An array of UTF-8 / UTF-16LE / UTF-16BE / UTF-32LE / UTF-32BE code units or something else, or will ByteString be deprecated?
>
> Thanks,
> -- Kenny.
>
> [1]: http://lists.gforge.info.ucl.ac.be/pipermail/mozart-hackers/2012/003406.html
>
> _________________________________________________________________________________
> mozart-hackers mailing list
> mozart-hackers@mozart-oz.org
> http://www.mozart-oz.org/mailman/listinfo/mozart-hackers