Re: [bitc-dev] String encoding, again

Jonathan S. Shapiro Thu, 17 Mar 2011 11:27:54 -0700

On Wed, Mar 16, 2011 at 7:15 PM, Sandro Magi <[email protected]> wrote:


> For #1, we can perhaps amortize character size overheads by storing
> string encoding in a header somehow. A string then becomes a sum:
>
> type string = Uint8 of byte[] | UInt16 of uint16[] | UInt32 of uint32[]
>

Since I started down this same path, I want to explain why it's a bad idea.

First, knowing the string *representation* isn't enough. The mere fact that
you are dealing with ucs2 code units doesn't tell you how to extract a code
point from the encoding. Are they Unicode code units or Shift-JIS code
units? I personally believe that the best solution to this is to declare
that the type String always holds Unicode, and if you need to process
something else, use something else. My view on this is purely defensive.
There is a limit to how many miracles should be performed by every library
that handles a String (which is to say: every library).
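
A minimal sketch of this first point, in Python (chosen here purely for
illustration; nothing in it is BitC). It reads one 16-bit value under two
encodings using Python's standard shift_jis and utf-16-be codecs. The
helper name interpret_unit is hypothetical.

```python
def interpret_unit(unit: int, encoding: str) -> str:
    """Decode one 16-bit value as text under the named encoding.

    Hypothetical helper: the value is serialized big-endian and handed
    to one of Python's standard codecs.
    """
    return unit.to_bytes(2, "big").decode(encoding)


TE = "\u30c6"  # KATAKANA LETTER TE

# The 16-bit value that Shift-JIS uses for this character.
unit = int.from_bytes(TE.encode("shift_jis"), "big")

as_shift_jis = interpret_unit(unit, "shift_jis")  # the katakana we started with
as_unicode = interpret_unit(unit, "utf-16-be")    # the code point with that number

# ascii() keeps the output printable on any terminal.
print(hex(unit), ascii(as_shift_jis), ascii(as_unicode))
```

The array of 16-bit values is identical in both cases; only knowledge of
the encoding, which the representation tag does not carry, says which
character was meant.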

Second, we really don't want a string to be all one thing in this way. We
want a string to be made up of threads, each of which has a homogeneous
representation. So we want something more like:

  type Thread = ucs1 of uint8[] | ucs2 of uint16[] | ucs4 of uint32[]
  type String = ropes of Thread[]
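
As a rough illustration of that shape, here is a Python sketch. Several
things in it are assumptions rather than anything stated above: the rule
for splitting text into threads (maximal runs sharing one unit width), the
reading of ucs1 as "code points up to 0xFF", and the names width_of,
to_rope and from_rope.

```python
from array import array
from dataclasses import dataclass
from typing import List

# Array typecode for each unit width. "L" is used for the 4-byte case
# because it is guaranteed to hold at least 32 bits.
TYPECODES = {1: "B", 2: "H", 4: "L"}


def width_of(cp: int) -> int:
    """Bytes per unit needed to store this code point directly."""
    return 1 if cp <= 0xFF else 2 if cp <= 0xFFFF else 4


@dataclass
class Thread:
    width: int    # 1, 2 or 4: plays the role of the ucs1/ucs2/ucs4 tag
    units: array  # every element has the same width


def to_rope(text: str) -> List[Thread]:
    """Split text into maximal runs that share one unit width."""
    rope: List[Thread] = []
    for ch in text:
        cp = ord(ch)
        w = width_of(cp)
        if not rope or rope[-1].width != w:
            rope.append(Thread(w, array(TYPECODES[w])))
        rope[-1].units.append(cp)
    return rope


def from_rope(rope: List[Thread]) -> str:
    """Inverse of to_rope: each unit is already a whole code point."""
    return "".join(chr(u) for thread in rope for u in thread.units)


rope = to_rope("abc\u03a9\u03a9\U0001F600z")
print([(t.width, len(t.units)) for t in rope])
```

With this splitting rule, one wide character costs a short 4-byte thread
instead of widening the whole string.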

Third, the practical impact of defining String or Thread as a union type in
the way that Sandro suggests is that every single piece of code that
processes code points is going to have to do the code point reconstruction
(from code units) for itself. From the standpoint of error suppression, I
believe that *most* code should be strongly discouraged from doing that.

And here's the thing: if every bit of string handling code that operates on
code points has to reconstruct them, then we aren't arguing about *whether*
they will be reconstructed. We are only arguing about what the chunking
factor will be. For a whole lot of reasons, it is better when processing
eight or more characters to do the decoding in a separate loop and return
the result as a chunk.

I think what I'm saying is that in a world of encoded representation, the
decoding phase needs to be part of the processing model.
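
The chunked decoding described above might look something like the
following Python sketch. It assumes the input is a sequence of UTF-16 code
units and uses a chunking factor of 8, echoing the "eight or more" figure.
The name decode_utf16_chunks and the choice to pass unpaired surrogates
through unchanged are assumptions made for the example.

```python
from typing import Iterable, Iterator, List

CHUNK = 8  # assumed chunking factor


def decode_utf16_chunks(units: Iterable[int],
                        chunk: int = CHUNK) -> Iterator[List[int]]:
    """Yield lists of up to `chunk` code points decoded from UTF-16 units.

    Unpaired surrogates are passed through unchanged; a real decoder
    would need an explicit error policy.
    """
    out: List[int] = []
    high = None  # high surrogate still waiting for its low half
    for u in units:
        if high is not None and 0xDC00 <= u <= 0xDFFF:
            # Complete surrogate pair: combine into one code point.
            out.append(0x10000 + ((high - 0xD800) << 10) + (u - 0xDC00))
            high = None
        else:
            if high is not None:
                out.append(high)  # unpaired high surrogate
                high = None
            if 0xD800 <= u <= 0xDBFF:
                high = u          # wait for the next unit
            else:
                out.append(u)
        # Hand over full chunks as soon as they are available.
        while len(out) >= chunk:
            yield out[:chunk]
            out = out[chunk:]
    if high is not None:
        out.append(high)
    if out:
        yield out
```

Code that consumes these chunks sees only whole code points; the surrogate
handling lives in one loop instead of being repeated in every consumer.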


shap

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
