On Wed, Mar 16, 2011 at 7:15 PM, Sandro Magi <[email protected]> wrote:
> For #1, we can perhaps amortize character size overheads by storing
> string encoding in a header somehow. A string then becomes a sum:
>
> type string = Uint8 of byte[] | UInt16 of uint16[] | UInt32 of uint32[]
>

Since I started down this same path, I want to explain why it's a bad
idea.

First, knowing the string *representation* isn't enough. The mere fact
that you are dealing with ucs2 code units doesn't tell you how to
extract a code point from the encoding. Are they Unicode code units or
Shift-JIS code units? I personally believe that the best solution to
this is to declare that the type String always holds Unicode, and if you
need to process something else, use something else. My view on this is
purely defensive. There is a limit to how many miracles should be
performed by every library that handles a String (which is to say: every
library).

Second, we really don't want a string to be all one thing in this way.
We want a string to be made up of threads, each of which has a
homogeneous representation. So we want something more like:

  type Thread = ucs1 of uint8[] | ucs2 of uint16[] | ucs4 of uint32[]
  type String = ropes of Thread[]

Third, the practical impact of defining String or Thread as a union type
in the way that Sandro suggests is that every single piece of code that
processes code points is going to have to do the code point
reconstruction (from code units) for itself. From the standpoint of
error suppression, I believe that *most* code should be strongly
discouraged from doing that.

And here's the thing: if every bit of string handling code that operates
on code points has to reconstruct them, then we aren't arguing about
*whether* they will be reconstructed. We are only arguing about what the
chunking factor will be. For a whole lot of reasons, it is better when
processing eight or more characters to do the decoding in a separate
loop and return the result as a chunk.
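To make the thread/rope layout and the chunked decoding loop concrete, here is a minimal sketch in Python rather than BitC. Every name in it (Thread, thread_from_text, decode_chunks, CHUNK) is hypothetical and invented for illustration. It assumes fixed-width Unicode threads in which one code unit is exactly one code point, so "decoding" is just widening. A variable-width encoding would need a real decoder in the inner loop.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence

CHUNK = 8  # assumed chunking factor; the post only says "eight or more"


@dataclass(frozen=True)
class Thread:
    """A run of Unicode code units that all share one width (hypothetical type)."""
    width: int            # bytes per code unit: 1 (ucs1), 2 (ucs2) or 4 (ucs4)
    units: Sequence[int]  # the code units themselves

    def __post_init__(self):
        if self.width not in (1, 2, 4):
            raise ValueError("width must be 1, 2 or 4")
        limit = 1 << (8 * self.width)
        if any(not 0 <= u < limit for u in self.units):
            raise ValueError("code unit does not fit the thread width")


def thread_from_text(text: str) -> Thread:
    """Build a Thread using the narrowest width that holds every code point."""
    units = [ord(c) for c in text]
    top = max(units, default=0)
    width = 1 if top < 0x100 else 2 if top < 0x10000 else 4
    return Thread(width, units)


def decode_chunks(rope: Sequence[Thread], chunk: int = CHUNK) -> Iterator[List[int]]:
    """Yield code points in lists of up to `chunk` items.

    The decoding loop lives here, once, so callers never touch code units.
    Chunks may span thread boundaries.
    """
    buf: List[int] = []
    for thread in rope:
        for unit in thread.units:
            buf.append(unit)  # fixed-width assumption: unit == code point
            if len(buf) == chunk:
                yield buf
                buf = []
    if buf:
        yield buf


# A rope (the String of the post) is a sequence of Threads of mixed widths.
rope = [
    thread_from_text("na\u00efve "),        # fits in 1-byte units
    thread_from_text("\u03bb-calculus "),   # needs 2-byte units
    thread_from_text("\U0001F600"),         # needs 4-byte units
]

for cps in decode_chunks(rope):
    print([hex(cp) for cp in cps])
```

Client code in this sketch only ever sees lists of code points, so the width of each thread stays an implementation detail of the decoder.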
I think what I'm saying is that in a world of encoded representation,
the decoding phase needs to be part of the processing model.

shap
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
