At 2:19 PM +0530 8/11/04, Gopal V wrote:
> Hi Dan & Michael,

> As a guy who speaks a strange language (multi-byte chars, multi-glyph
> chars, caseless text, and half vowels), I think you have made it more
> complicated than it should be.

This scared me some, as I've not gotten to the complicated part... :)

One thing I didn't make clear is that this is a mediating layer for code that wants very low-level access (basically direct byte access -- IO code and such) and semi-low-level access (mostly from the charset code).

This is the level that's supposed to provide the equivalent of direct memory access while still maintaining internal consistency in the buffer.
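
To make the shape of it concrete, here's a rough sketch of what an encoding layer's interface might look like. All the names here are made up for illustration -- don't take them as what'll actually land in the source:

  /* Stand-ins for Parrot's real types, just so the sketch stands
   * on its own. */
  typedef unsigned int UINTVAL;
  typedef struct STRING STRING;

  /* Each encoding supplies a table of byte- and codepoint-level
   * operations over the string's byte buffer. The layer's job is
   * keeping that buffer internally consistent no matter which
   * entry points get used. */
  typedef struct ENCODING {
      const char *name;
      UINTVAL (*get_byte)(STRING *s, UINTVAL byte_offset);
      void    (*set_byte)(STRING *s, UINTVAL byte_offset, UINTVAL b);
      UINTVAL (*get_codepoint)(STRING *s, UINTVAL cp_offset);
      void    (*set_codepoint)(STRING *s, UINTVAL cp_offset, UINTVAL cp);
  } ENCODING;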

> >> charset end of things, offsets will be in graphemes (or Freds. I
> >> don't remember what we finally decided to name the things)

> Writing Unicode composition and rendering is VERY VERY HARD... Find a
> way to leech that... (I've tried a Pango module for Malayalam and it's
> really, really hard to do).

I fully plan to do this. It wouldn't surprise me to find we don't do the right thing for a sizeable subset of Unicode, at least to start with.


> >> When dealing with variable-length encodings, removal of codepoints
> >> in the middle may make the string shrink, and adding them may make
> >> it grow. The encoding layer is responsible for managing the
> >> underlying byte buffer to maintain consistency.

> It was soo easy with immutable strings... I think that is why Java
> could implement Unicode properly :)

:) This isn't even the layer that manages Unicode. All it does is make sure that you don't have bad UTF-8 sequences in your byte buffer.
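
For the UTF-8 layer, "no bad sequences" boils down to a structural check like this after buffer-level writes. This is a bare-bones version -- it deliberately skips the overlong-form and surrogate checks a production validator would also need:

  #include <stddef.h>

  /* Returns 1 if buf holds only structurally well-formed UTF-8
   * sequences, 0 otherwise. */
  static int
  utf8_buffer_ok(const unsigned char *buf, size_t len)
  {
      size_t i = 0, k, follow;
      while (i < len) {
          unsigned char b = buf[i];
          if      (b < 0x80)           follow = 0;  /* ASCII */
          else if ((b & 0xE0) == 0xC0) follow = 1;  /* 2-byte lead */
          else if ((b & 0xF0) == 0xE0) follow = 2;  /* 3-byte lead */
          else if ((b & 0xF8) == 0xF0) follow = 3;  /* 4-byte lead */
          else return 0;               /* stray continuation byte */
          if (i + follow >= len)
              return 0;                /* sequence runs off the end */
          for (k = 1; k <= follow; k++)
              if ((buf[i + k] & 0xC0) != 0x80)
                  return 0;            /* not a continuation byte */
          i += follow + 1;
      }
      return 1;
  }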


> >> void to_encoding(STRING *);
> >>
> >>    Make the string the new encoding, in place

> A String should always be Unicode IMHO; they should be converted to
> byte buffers by encoding and back from byte buffers by decoding.

Yeah, that's a sore spot. The short answer is that All Unicode All The Time isn't going to happen, so we need a working Plan B. This is part of that plan. The charset code is another part (and I'll get to that later today, I hope), with the ops that've already been detailed as the third part of the plan.


Ultimately, bytecode *can* treat strings as all-Unicode if it chooses to, while Parrot does whatever it has to do to make that work.
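
In practice that looks something like the fragment below. Fair warning: the two-argument to_encoding and the other names here are me hand-waving, not the real API -- the signature I actually proposed takes just the STRING, with the target encoding presumably set elsewhere:

  /* An IO layer that needs bytes in a known encoding pins the
   * string down first; the bytecode above it never has to care
   * what the buffer looked like before. */
  void io_write_string(STRING *s)
  {
      to_encoding(s, utf8_encoding);   /* hypothetical 2-arg variant */
      /* ... hand s's now-guaranteed-UTF-8 byte buffer to the
       * descriptor, via get_byte or direct buffer access ... */
  }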

> >> UINTVAL get_codepoint(STRING *, offset);
> >> void set_codepoint(STRING *, offset, UINTVAL codepoint);

> *If* String always contains (length, UINTVAL[]), doesn't it make
> life easier?

Not if the underlying buffer's an mmapped file in UTF-8 it doesn't... :-P
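
That's the crux of it: in a variable-width buffer, codepoint N has no fixed byte position, so get_codepoint has to walk. Here's the gist, assuming the buffer's already been validated (a real version would cache the last byte/codepoint offset pair rather than scanning from the front every time):

  #include <stddef.h>

  typedef unsigned int UINTVAL;   /* stand-in, as before */

  /* Return codepoint number n from a validated UTF-8 buffer by
   * walking the variable-width sequences from the front. */
  static UINTVAL
  utf8_codepoint_at(const unsigned char *buf, size_t len, size_t n)
  {
      size_t i = 0, k, width;
      while (i < len) {
          unsigned char b = buf[i];
          width = (b < 0x80)           ? 1
                : ((b & 0xE0) == 0xC0) ? 2
                : ((b & 0xF0) == 0xE0) ? 3 : 4;
          if (n == 0) {
              UINTVAL cp;
              if (width == 1)
                  return b;                 /* plain ASCII */
              cp = b & (0x7F >> width);     /* lead byte's payload bits */
              for (k = 1; k < width; k++)
                  cp = (cp << 6) | (buf[i + k] & 0x3F);
              return cp;
          }
          n--;
          i += width;
      }
      return 0;   /* past the end; real code would throw instead */
  }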

> >> UINTVAL get_byte(STRING *, offset);
...
> Byte offset. Needs more clarity.
...
> >> void set_byte(STRING *, offset, UINTVAL byte);
...

> My advice would be to never let the layer above the encoding know
> that we're storing it in bytes :)

Yep. This is for layers below us -- IO layers, direct buffer access layers, and whatnot. That, and for folks who are convinced, rightly or wrongly, that they really really do need to peek at the bytes.
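
The sort of consumer I mean is something like this debugging helper, which genuinely wants the raw bytes, encoding be damned. (string_byte_length is a made-up name for whatever the real byte-length query ends up being; get_byte is from the proposal.)

  #include <stdio.h>

  void hexdump_string(STRING *s)
  {
      UINTVAL i, n = string_byte_length(s);   /* hypothetical */
      for (i = 0; i < n; i++)
          printf("%02x ", (unsigned)get_byte(s, i));
      putchar('\n');
  }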


> >> STRING *get_codepoints(STRING *, offset, count);

> Immutability of the returned string (and the original) would save
> memory...

Parrot's got a copy-on-write system, so what'll happen here is that you'll get a newly allocated string header pointing into a buffer that's marked COW, so there'll be no memory copying unless something actually goes and updates the underlying memory.
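
A toy picture of the mechanics, with invented field names (and the toy works in byte offsets and ignores encoding entirely, which the real get_codepoints obviously can't):

  #include <stddef.h>

  typedef struct toy_string {
      unsigned char *bufstart;   /* shared byte buffer */
      size_t         buflen;     /* how many bytes this header covers */
      int           *sharers;    /* how many headers share the buffer */
  } toy_string;

  /* A "substring" is just a fresh header pointing into the parent's
   * buffer; no bytes get copied until somebody writes. */
  toy_string toy_substring(toy_string *src, size_t start, size_t len)
  {
      toy_string sub;
      sub.bufstart = src->bufstart + start;
      sub.buflen   = len;
      sub.sharers  = src->sharers;
      (*sub.sharers)++;   /* a writer must check this and copy the
                           * buffer before mutating in place */
      return sub;
  }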


> I'm afraid your design is waaay too complicated, at least for an
> average guy like me.

Ah, I think you underestimate yourself. :-P

I also think that we'll end up with a small handful of encoding layers and that'll be it. We're going to have 8-bit fixed, 16-bit fixed, 32-bit fixed, UTF-8, UTF-16, and a few Asian encoding layers. I'm not expecting to see new ones come along, unless someone wants to go nuts with a zip/gzip compression encoding that pretends to be one of the standard encodings or something.
--
Dan


--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
