Hi Dan & Michael,

As a guy who speaks a strange language (multi-byte chars, multi-glyph
chars, caseless text and half vowels), I think you have made it more
complicated than it should be.

> charset end of things, offsets will be in graphemes (or Freds. I
> don't remember what we finally decided to name the things))

Writing Unicode composition and rendering is VERY VERY HARD ...
Find a way to leech that from existing code ... (I've tried writing a pango
module for Malayalam and it's really really hard to do).

> When dealing with variable-length encodings, removal of codepoints in
> the middle may make the string shrink, and adding them may make it
> grow. The encoding layer is responsible for managing the underlying
> byte buffer to maintain consistency.

It was so easy with immutable strings ... I think that is why Java could
implement Unicode properly :)

> >>  void to_encoding(STRING *);
> >>
> >>    Make the string the new encoding, in place

A String should always be Unicode IMHO; it should be converted to byte
buffers when encoding, and back from byte buffers when decoding.
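
To make that concrete, here is a rough sketch of what I mean, using UTF-8 as the
example encoding. The names encode_utf8 and decode_utf8 are made up for this
mail, UINTVAL is assumed to be a 32-bit unsigned integer, and the decoder
assumes well-formed input (no validation at all), so treat it as an
illustration and not a real implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t UINTVAL;  /* assumption: 32 bits, enough for any code point */

/* Hypothetical helper: encode `count` code points as UTF-8 into `out`.
   `out` must have room for 4 * count bytes. Returns the bytes written. */
size_t encode_utf8(const UINTVAL *cp, size_t count, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < count; i++) {
        UINTVAL c = cp[i];
        assert(c <= 0x10FFFF);  /* catch fire on anything past the Unicode range */
        if (c < 0x80) {
            out[n++] = (unsigned char)c;
        } else if (c < 0x800) {
            out[n++] = (unsigned char)(0xC0 | (c >> 6));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out[n++] = (unsigned char)(0xE0 | (c >> 12));
            out[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            out[n++] = (unsigned char)(0xF0 | (c >> 18));
            out[n++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            out[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return n;
}

/* Hypothetical helper: decode well-formed UTF-8 back into code points.
   `out` must have room for `nbytes` entries. Returns the code points written. */
size_t decode_utf8(const unsigned char *in, size_t nbytes, UINTVAL *out)
{
    size_t n = 0, i = 0;
    while (i < nbytes) {
        unsigned char b = in[i++];
        UINTVAL c;
        int extra;  /* number of continuation bytes that follow the lead byte */
        if (b < 0x80)      { c = b;        extra = 0; }
        else if (b < 0xE0) { c = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { c = b & 0x0F; extra = 2; }
        else               { c = b & 0x07; extra = 3; }
        while (extra-- > 0 && i < nbytes)
            c = (c << 6) | (in[i++] & 0x3F);
        out[n++] = c;
    }
    return n;
}
```

Only these two functions ever see bytes; everything above them works on
code points.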

> >>  UINTVAL get_codepoint(STRING *, offset);
> >>  void set_codepoint(STRING, offset, UINTVAL codepoint);

*If* a String always contains (length, UINTVAL[]), doesn't that make
life easier?
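
A small sketch of what I have in mind, assuming UINTVAL is a 32-bit unsigned
integer and that STRING is nothing more than a length plus an array (this
struct layout is my guess for illustration, not what your proposal defines):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t UINTVAL;  /* assumption: wide enough for any code point */

/* Assumed layout: one array slot per code point, no byte buffer in sight. */
typedef struct {
    size_t   length;  /* number of code points, not bytes */
    UINTVAL *data;
} STRING;

/* Reading a code point is a plain array lookup. */
UINTVAL get_codepoint(const STRING *s, size_t offset)
{
    assert(offset < s->length);  /* validate params or catch fire and exit */
    return s->data[offset];
}

/* Writing overwrites one slot; the buffer never grows or shrinks. */
void set_codepoint(STRING *s, size_t offset, UINTVAL codepoint)
{
    assert(offset < s->length);
    s->data[offset] = codepoint;
}
```

With fixed-width slots there is no shrinking or growing to manage when a
code point in the middle changes.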


> >>  UINTVAL get_byte(STRING *, offset)
...
> Byte offset. Needs more clarity.
...
> >>   void set_byte(STRING *, offset, UINTVAL byte);
...

My advice would be to never let the layer above the encoding know that
we're storing it in bytes :)

> >>   STRING *get_codepoints(STRING, offset, count);

Immutability of the returned string (and of the original) would save
memory, since the two can share the same data .. especially if the
UINTVAL array is GC allocated :) .. of course, what you have here is the
substring operation under a new and obfuscated name :)

// some pseudo code as I see it.
STRING *substring(STRING *string, UINTVAL offset, UINTVAL count)
{
    // validate params or catch fire and exit
    STRING *string2 = gc_alloc(string_header);
    string2->length = count;
    string2->data = &(string->data[offset]); // shared, not copied; hopefully data is also gc_alloc'd
    return string2;
}
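
If you want something that actually compiles, a version of the same idea could
look like the following. Here malloc stands in for gc_alloc (so the caller has
to free the header by hand), the STRING layout is my assumption, and the data
pointer is const to spell out the immutability that makes sharing safe:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t UINTVAL;  /* assumption: 32-bit code point slots */

/* Assumed layout; `const` data means nobody can write through a STRING. */
typedef struct {
    size_t         length;
    const UINTVAL *data;
} STRING;

/* Returns a new header that points into the original's data. Nothing is
   copied, so the cost does not depend on `count`. */
STRING *substring(const STRING *s, size_t offset, size_t count)
{
    /* validate params or catch fire and exit */
    assert(offset <= s->length && count <= s->length - offset);

    STRING *sub = malloc(sizeof *sub);  /* stand-in for gc_alloc */
    if (sub == NULL)
        abort();
    sub->length = count;
    sub->data = s->data + offset;       /* shared with the original */
    return sub;
}
```

Without a GC, somebody has to keep the original data alive for as long as any
substring points into it, which is exactly why I hope it is gc_alloc'd.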

I'm afraid your design is waaay too complicated, at least for an
average guy like me. I'd like to suggest that every STRING be Unicode,
converted to byte buffers and back for all other purposes. But that's
just a suggestion :)

Gopal
