In message <[EMAIL PROTECTED]> Gibbs Tanton - tgibbs <[EMAIL PROTECTED]> wrote:
> This is good, unless someone has objections I'll commit this. However, we
> also need the ability to do unicode in the assembler (I'll do this later
> today if no one beats me to it), and we need some way to communicate the
> encoding number between the C and the Perl code.

It probably does still need some cleaning up, but that can be done
incrementally.

One of the main things that I wasn't sure about, but forgot to mention
in the original message, is what we want to do about malformed strings.
Are we going to assume strings are well formed and go hell for leather
in handling them, or do we want to move to the paranoid end of the
spectrum, check everything we do, and throw exceptions when something
odd is spotted?

Currently the code does a bit of both - sometimes it checks things and
sometimes it doesn't.

> I guess the question with native strings is will it always be ASCII or will
> it be Shift-JIS etc...? And the follow up to that is can, for the short
> term, we assume it will be ASCII and then improve our native string
> transcoding over time?

Well, according to string.pod, native will always be a single byte per
character encoding and never a wide character or shifted encoding, so
that rules out Shift-JIS and most other far eastern encodings.

BTW the claim in string.pod that UTF-8 needs a maximum of 3 bytes per
character is wrong, at least if you allow U+0000 to U+10FFFF as your
character space, which is what I did - any character over U+FFFF needs
four bytes.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/
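
To illustrate the UTF-8 length point above, here is a minimal sketch in Python (not the Parrot C code). The function name utf8_length is hypothetical and introduced only for this example. It takes the "paranoid" approach of rejecting out-of-range and surrogate code points, and it cross-checks each result against Python's built-in encoder.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs to encode one code point.

    Hypothetical helper for illustration only; assumes the character
    space is U+0000 to U+10FFFF, as in the message above.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not valid in UTF-8")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4  # U+10000..U+10FFFF: the case a 3-byte maximum misses


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0xFFFF, 0x10000, 0x10FFFF):
        # Cross-check against Python's own UTF-8 encoder.
        assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
        print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
```

A buffer sized on the assumption of at most 3 bytes per character would therefore be too small for any string containing characters above U+FFFF.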