Hi everyone,

My apologies for answering this late. I had a draft written for this mail,
and forgot that I had not sent it ...

OK, so, apparently, you don't like the idea ;-) Well, that's why we ask,
too.

First, I'll answer Torsten's direct question.

> What I am missing are strings in the VL definition. We currently have the
> convenience notation "like this", which is implicitly translated into a
> list of integers as you all know. I assume there be some convenient string
> notation that is then implicitly translated into a list of UnicodeChar? I
> assume doing so will not break any existing code (if we ignore the case
> that VS could be processed "directly")?
>

Yes, we will provide a convenience notation for writing literal UnicodeChars
and UnicodeStrings (without breaking existing code). I don't know which
notation yet. My current guess is to mimic C++: u"Some UnicodeString" and
u&é, and perhaps to add u'é' as well. But I'm not a fan of it myself, so if
you have a better idea, let me know.
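To ground the "implicitly translated into a list of integers" point from the quote, here is the analogous picture in Python (an illustration only, not Oz syntax): a Unicode-aware string notation would denote one integer per code point, whereas a byte-oriented notation denotes the encoded byte values.

```python
# Illustration in Python, not Oz: a Unicode string notation stands for a
# list of code points, while a byte notation stands for encoded bytes.
text = "héllo"

# One integer per character, like a hypothetical list of UnicodeChars.
code_points = [ord(c) for c in text]

# Contrast: the UTF-8 byte values, where 'é' becomes two bytes.
byte_values = list(text.encode("utf-8"))

print(code_points)  # [104, 233, 108, 108, 111]
print(byte_values)  # [104, 195, 169, 108, 108, 111]
```

The two lists differ exactly where a character no longer fits in one byte, which is the whole design problem under discussion.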

I acknowledge the opposition against the encode and decode tuples. This
might indeed be over-engineering. So let's drop them from this discussion.

However, I'd like to argue against the "VirtualByteString is
over-engineering" argument.

> Give up with the VirtualByteString.  It looks like over-engineering to me.
>  With the types above, you only need three fundamental operations:
>
>    - ToString: VirtualString -> String,
>    - Encode: String x Encoding -> ByteString,
>    - Decode: ByteString x Encoding -> String.
>

If you want to argue in favor of minimality, then I think these three
operations are already too much: we only need Encode and Decode.
VirtualString is already a step beyond minimalism.
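For concreteness, here is a sketch of those two operations in Python terms, assuming Python's str and bytes stand in for Mozart's String and ByteString (the function names are illustrative, not a proposed API):

```python
# Sketch only: Python's str/bytes stand in for Mozart's String/ByteString.

def encode(s: str, encoding: str) -> bytes:
    # Encode: String x Encoding -> ByteString
    return s.encode(encoding)

def decode(b: bytes, encoding: str) -> str:
    # Decode: ByteString x Encoding -> String
    return b.decode(encoding)

# Encode and decode form a round trip for any encoding that covers the text.
round_trip = decode(encode("Sébastien", "utf-8"), "utf-8")
print(round_trip)  # Sébastien
```

Note that ToString is absent here: with these two operations alone, any conversion between text and bytes is already expressible.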

I think that either VirtualString was over-engineering in the first place,
or VirtualByteString is not over-engineering either. It seems to me that,
language-related operations aside, one should be able to perform the same
set of operations on a sequence of bytes as on a sequence of characters. I
find the duality between sequences of bytes and sequences of characters
very valuable.

In Mozart 1.4.0, this duality already existed, because both sequences of
characters and of bytes could be represented by either

   - a list of integers -> for manipulations
   - a byte string -> for storage
   - a virtual string -> for construction
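The same three roles can be illustrated in Python terms (an analogy only, not the Oz types):

```python
# Analogy in Python for the three representations in Mozart 1.4.0.
data = "abc"

# 1. A list of integers -> for manipulation (indexing, mapping, matching).
as_list = [ord(c) for c in data]

# 2. A compact string -> for storage (one contiguous block of memory).
as_compact = data.encode("ascii")

# 3. A virtual string -> for construction: assemble the result from pieces
#    and only materialize the full sequence at the end.
as_constructed = "".join(["a", "bc"])

print(as_list)         # [97, 98, 99]
print(as_compact)      # b'abc'
print(as_constructed)  # abc
```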

The problem is that a character is no longer equal to a byte. So we have
invented (but not yet implemented) the type UnicodeChar, which is a
character; this covers the first representation. We have also created the
type UnicodeString, which is the compact representation for storage. To
support the construction representation, we have made the virtual string
concept Unicode-enabled.

So for text, we have all three representations. But for bytes, we have kept
the first and second while dropping the third.

So, my point of view is: we have actually *removed* something in the
current design: we cannot construct sequences of bytes anymore. Hence, I
would like to reintroduce this possibility. In that regard, I don't think
this is over-engineering.
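To make the missing piece concrete, here is a minimal sketch in Python of what a virtual byte string provides: constructing a byte sequence from a tree of fragments, with flattening deferred until the final sequence is actually needed (the name flatten_vbs is hypothetical, and the node shape is an assumption for illustration):

```python
# Minimal sketch of a "virtual byte string": a node is either a bytes leaf
# or a list of nodes. Construction is cheap (no copying); the whole tree is
# flattened once, at the end, instead of copying on every concatenation.

def flatten_vbs(node) -> bytes:
    if isinstance(node, bytes):
        return node
    return b"".join(flatten_vbs(child) for child in node)

# Assemble a binary message from fragments without intermediate copies.
header = b"\x89PNG"
payload = [b"\x00\x01", b"\x02"]
message = flatten_vbs([header, payload, b"\xff"])
print(message)  # b'\x89PNG\x00\x01\x02\xff'
```

This is exactly the construction role that virtual strings play for text; the argument above is that bytes deserve the same convenience.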

So, I'd like to argue that there should exist some kind of virtual byte
string representation. I believe in this quite strongly, even if the
encode/decode tuples are not part of it.

Put that way, do you still think it's over-engineering?

>
> I don't know how String should be represented.  A list of unicode
> characters looks nice conceptually, but this is a very costly
> representation.  Ideally, you would like to store a string as an atomic
> data structure, but it should "behave" like a list of characters.  Maybe
> the unification of a string S with X|T could bind X to a (unicode)
> character and T to a proxy that represents S without its first character.
>

You may already have gathered my answer to this from the explanation above.
Previously, there were three representations for strings: the list of
integers, the compact string (ByteString), and the virtual string. All
three of them will exist, in their Unicode versions, in Mozart 2. I have no
intention of redesigning that.


> I must confess that I find the object model of Python strings very
> attractive.  A string is an object with an atomic storage, and it supports
> array-like operations (including iterators) when you need an explicit
> decomposition.  And characters don't have a specific type: they are strings
> of length 1.  It is simple, clean, pragmatic and efficient.
>

This was more or less what ByteString implemented before, and what
UnicodeString implements now, except that characters have their own type.
We could change that when going to Unicode and not introduce the Character
type at all. Do you think it would make sense in Mozart/Oz?
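For reference, the Python model from the quote, shown directly: there is no separate character type, and indexing or iterating a string yields strings of length 1.

```python
# Python's model: a "character" is just a string of length 1.
s = "héllo"
first = s[0]

print(type(first).__name__)  # str
print(len(first))            # 1

# Iteration decomposes the string into length-1 strings, not a char type.
print([type(c).__name__ for c in s])
```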

You might want to take into account that, in Mozart, small values (smaller
than or equal to a memory word) can be represented very efficiently and, if
located in mutable storage (like registers), never allocate external
storage that needs to be garbage collected. A Character type falls into
this category. A one-character string does not (unless we add an efficient
representation for it, as we have an efficient representation for Cons
pairs).

Cheers,
Sébastien
_________________________________________________________________________________
mozart-hackers mailing list                           
[email protected]      
http://www.mozart-oz.org/mailman/listinfo/mozart-hackers
