Hi everyone,

My apologies for answering this late. I had a draft written for this mail, and forgot that I had not sent it...
OK, so, apparently, you don't like the idea ;-) Well, that's why we ask, too.

First, I'll answer Torsten's direct question.

> What I am missing are strings in the VL definition. We currently have the
> convenience notation "like this", which is implicitly translated into a
> list of integers as you all know. I assume there will be some convenient
> string notation that is then implicitly translated into a list of
> UnicodeChar? I assume doing so will not break any existing code (if we
> ignore the case that VS could be processed "directly")?

Yes, we will provide a convenience notation to write literal UnicodeChars and UnicodeStrings (without breaking existing code). I don't know which notation yet. My current guess is to mimic C++: u"Some UnicodeString" and u&é, and to add u'é'. But I'm not a fan of it myself, so if you have a better idea, let me know.

I acknowledge the opposition to the encode and decode tuples. This might indeed be over-engineering, so let's drop them from this discussion.

However, I'd like to argue against the "VirtualByteString is over-engineering" argument.

> Give up on the VirtualByteString. It looks like over-engineering to me.
> With the types above, you only need three fundamental operations:
>
> - ToString: VirtualString -> String,
> - Encode: String x Encoding -> ByteString,
> - Decode: ByteString x Encoding -> String.

If you want to argue in favor of minimality, then I think these three operations are already too much: we only need Encode and Decode. VirtualString is already a step beyond minimalism. So I think that either VirtualString was over-engineering in the first place, or VirtualByteString is not over-engineering either. It seems to me that, language-related operations aside, one should be able to perform the same set of operations on a sequence of characters as on a sequence of bytes. I find it very nice to have a good duality between sequences of bytes and sequences of characters.
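To make the minimality argument concrete, here is the Encode/Decode pair in plain Python (illustration only: this is Python's existing API, not a proposal for the Oz one):

```python
# Illustration in Python (not the proposed Oz API): with only
# Encode and Decode, text and bytes already convert in both directions.
text = "héllo"                 # a sequence of characters
data = text.encode("utf-8")    # Encode: String x Encoding -> ByteString
back = data.decode("utf-8")    # Decode: ByteString x Encoding -> String

assert isinstance(data, bytes)
assert back == text
```

With these two in place, anything extra (ToString, a virtual byte string, ...) has to be justified on construction-convenience grounds rather than on conversion grounds.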
In Mozart 1.4.0, this duality already existed, because both sequences of characters and sequences of bytes could be represented by either:

- a list of integers -> for manipulation,
- a byte string -> for storage,
- a virtual string -> for construction.

The problem is that now, a character is not equal to a byte anymore. So we have invented (but not yet implemented) the type UnicodeChar, which is a character. This already gives us the first representation. We have also created the type UnicodeString, which is the compact representation for storage. To support the construction representation, we have transformed the virtual string concept to be Unicode-enabled. So for text, we have all three representations. But for bytes, we have kept the first and second, and destroyed the third.

So, my point of view is: we have actually *removed* something in the current design: we cannot construct sequences of bytes anymore. Hence, I would like to reintroduce this possibility, and in that regard, I don't think it is over-engineering. So I'd like to argue that there should exist some kind of virtual byte string representation. I believe in this quite strongly, even if the encode/decode tuples are not part of it.

Put that way, do you still think it's over-engineering?

> I don't know how String should be represented. A list of unicode
> characters looks nice conceptually, but this is a very costly
> representation. Ideally, you would like to store a string as an atomic
> data structure, but it should "behave" like a list of characters. Maybe
> the unification of a string S with X|T could bind X to a (unicode)
> character and T to a proxy that represents S without its first character.

You might already have got my answer to this from my previous explanation. Before, there existed three representations for strings: the list of integers, the compact string (ByteString), and the virtual string. All three will exist, in their Unicode versions, in Mozart 2.
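For concreteness, here is what the missing "construction" representation is about, sketched in Python (illustration only: `"".join` / `b"".join` play roughly the role virtual strings play in Oz, minus the lazy, copy-free semantics):

```python
# Illustration: the same construction pattern applies to text and to bytes.
# In Mozart 1.4.0 both could be built as virtual strings; the point of the
# argument is to keep that symmetry for bytes in Mozart 2.
text_parts = ["GET ", "/index.html", " HTTP/1.1\r\n"]
request_line = "".join(text_parts)        # constructing a sequence of characters

byte_parts = [b"\x89PNG", b"\r\n\x1a\n"]
png_magic = b"".join(byte_parts)          # constructing a sequence of bytes

assert request_line == "GET /index.html HTTP/1.1\r\n"
assert png_magic == b"\x89PNG\r\n\x1a\n"
```

Without some virtual-byte-string notion, the bytes side of this symmetry has no cheap compositional construction, which is exactly the removal being argued against.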
I have no intention to redesign that.

> I must confess that I find the object model of Python strings very
> attractive. A string is an object with atomic storage, and it supports
> array-like operations (including iterators) when you need an explicit
> decomposition. And characters don't have a specific type: they are strings
> of length 1. It is simple, clean, pragmatic and efficient.

This was more or less implemented by ByteString before, and is now by UnicodeString, except that characters have their own type. We could change that when going to Unicode, and not introduce a Character type at all. Do you think it would make sense in Mozart/Oz?

You might want to take into account that, in Mozart, small things (smaller than or equal to a memory word) can be represented very efficiently, and, if located in mutable storage (like registers), never allocate external storage that needs to be garbage collected. A Character type falls into this category. A string of one element does not (unless we add an efficient representation for one-character strings, like the efficient representation we have for Cons pairs).

Cheers,
Sébastien
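P.S. For anyone not familiar with Python, the quoted "characters are strings of length 1" model behaves like this (plain Python, illustration only):

```python
# Illustration: Python has no separate character type; indexing or
# iterating a string yields strings of length 1.
s = "abc"
c = s[0]

assert type(c) is str and len(c) == 1
assert list(s) == ["a", "b", "c"]   # array-like decomposition on demand
assert s[1:] == "bc"                # atomic storage, sliced without a char type
```

Note that every one of those length-1 strings is a full heap object in CPython, which is precisely the allocation cost the memory-word argument above is about.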
_________________________________________________________________________________
mozart-hackers mailing list
[email protected]
http://www.mozart-oz.org/mailman/listinfo/mozart-hackers
