Hi Dan & Michael,

As a guy who speaks a strange language (multi-byte chars, multi-glyph chars, caseless text and half vowels), I think you have made it more complicated than it should be.
> charset end of things, offsets will be in graphemes (or Freds. I
> don't remember what we finally decided to name the things))

Writing Unicode composition and rendering code is VERY VERY HARD... Find a way to leech that from existing code. (I've tried writing a Pango module for Malayalam and it's really, really hard to do.)

> When dealing with variable-length encodings, removal of codepoints in
> the middle may make the string shrink, and adding them may make it
> grow. The encoding layer is responsible for managing the underlying
> byte buffer to maintain consistency.

It was so easy with immutable strings... I think that is why Java could implement Unicode properly :)

> >> void to_encoding(STRING *);
> >>
> >> Make the string the new encoding, in place

A string should always be Unicode, IMHO. It should be converted to a byte buffer when encoding and back from a byte buffer when decoding.

> >> UINTVAL get_codepoint(STRING *, offset);
> >> void set_codepoint(STRING, offset, UINTVAL codepoint);

*If* a string always contains (length, UINTVAL[]), doesn't that make life easier?

> >> UINTVAL get_byte(STRING *, offset)
...
> Byte offset. Needs more clarity.
...
> >> void set_byte(STRING *, offset, UINTVAL byte);
...

My advice would be to never let the layer above the encoding know that we're storing it in bytes :)

> >> STRING *get_codepoints(STRING, offset, count);

Immutability of the returned string (and of the original) would save memory, especially if the UINTVAL array is GC-allocated :) Of course, what you have here is the substring operation under a new and obfuscated name :)

    // some pseudo code as I see it.
    substring(string, offset, count)
    {
        // validate params or catch fire and exit
        string2 = gc_alloc(string_header);
        string2->length = count;
        // share the parent's codepoints instead of copying them;
        // hopefully data is also gc_alloc'd
        string2->data = &(string->data[offset]);
        return string2;
    }

I'm afraid your design is waaay too complicated, at least for an average guy like me.
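To make the substring idea a bit more concrete, here is a minimal C sketch of how I imagine it. Everything in it is my own assumption and not part of your proposal: the USTRING struct, the ustring_* names, a 32-bit UINTVAL, and plain malloc standing in for a GC allocator. The point is only that an immutable (length, UINTVAL[]) string can hand out substrings without copying a single codepoint.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t UINTVAL;   /* assumption: one codepoint fits in 32 bits */

/* Hypothetical immutable string: a length plus a read-only codepoint array. */
typedef struct {
    size_t         length;  /* number of codepoints */
    const UINTVAL *data;    /* shared with the parent, never modified */
} USTRING;

/* Return a view of `count` codepoints starting at `offset`.
 * No codepoints are copied: the result aliases the parent's array.
 * malloc stands in for a GC allocator here.
 * Returns NULL on bad parameters or allocation failure. */
USTRING *ustring_substring(const USTRING *s, size_t offset, size_t count)
{
    if (s == NULL || offset > s->length || count > s->length - offset)
        return NULL;

    USTRING *sub = malloc(sizeof *sub);
    if (sub == NULL)
        return NULL;

    sub->length = count;
    sub->data   = s->data + offset;
    return sub;
}

/* Read one codepoint; the offset counts codepoints, never bytes. */
UINTVAL ustring_get_codepoint(const USTRING *s, size_t offset)
{
    assert(s != NULL && offset < s->length);
    return s->data[offset];
}
```

Because nothing ever writes through data, the parent and all of its substrings can share one array safely. With a real GC the array stays alive as long as any view points into it; with malloc, as in this sketch, the caller has to keep the parent's array alive by hand.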
I'd like to suggest that every STRING be Unicode, converted to byte buffers and back for all other purposes. But that's just a suggestion :)

Gopal
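P.S. Here is a rough sketch of what "convert to byte buffers and back" could look like if the in-memory form were always a UINTVAL array and UTF-8 were just one of the codecs at the edge. The function names and signatures are made up for illustration, UINTVAL is assumed to be 32 bits, and the decoder is deliberately naive: it checks the byte structure but does not reject overlong forms or surrogates, which a real one must do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t UINTVAL;   /* assumption: one codepoint fits in 32 bits */

#define CODEC_ERROR ((size_t)-1)

/* Decode UTF-8 bytes into codepoints. Returns the number of codepoints
 * written, or CODEC_ERROR on malformed input or if `out` is too small. */
size_t utf8_decode(const unsigned char *in, size_t in_len,
                   UINTVAL *out, size_t out_cap)
{
    size_t i = 0, n = 0;
    assert(in != NULL && out != NULL);

    while (i < in_len) {
        unsigned char b = in[i];
        UINTVAL cp;
        size_t extra;   /* continuation bytes that must follow */

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return CODEC_ERROR;     /* stray continuation or invalid lead */

        if (extra > in_len - i - 1 || n == out_cap)
            return CODEC_ERROR;      /* truncated input or output full */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = in[i + k];
            if ((c & 0xC0) != 0x80)
                return CODEC_ERROR;
            cp = (cp << 6) | (UINTVAL)(c & 0x3F);
        }
        out[n++] = cp;
        i += extra + 1;
    }
    return n;
}

/* Encode codepoints as UTF-8. Returns the number of bytes written,
 * or CODEC_ERROR if a codepoint is out of range or `out` is too small. */
size_t utf8_encode(const UINTVAL *in, size_t in_len,
                   unsigned char *out, size_t out_cap)
{
    size_t n = 0;
    assert(in != NULL && out != NULL);

    for (size_t i = 0; i < in_len; i++) {
        UINTVAL cp = in[i];
        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;

        if (cp > 0x10FFFF || need > out_cap - n)
            return CODEC_ERROR;

        if (need == 1) {
            out[n++] = (unsigned char)cp;
        } else if (need == 2) {
            out[n++] = (unsigned char)(0xC0 | (cp >> 6));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[n++] = (unsigned char)(0xE0 | (cp >> 12));
            out[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[n++] = (unsigned char)(0xF0 | (cp >> 18));
            out[n++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return n;
}
```

With something like this at the I/O boundary, nothing above the codec ever sees a byte offset: get_codepoint and friends become plain array indexing, and the variable-length bookkeeping lives in exactly two functions per encoding.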