On Tue, 15 Aug 2000, Simon Cozens wrote:

> (I'm not really following Perl 6, but Unicode is obviously something
> I have a concern about. Please *do* CC me replies, just this once.)
> 
> On Sat, Aug 05, 2000 at 11:16:46AM +0000, Nick Ing-Simmons wrote:
> > Agreed - but that is due to grafting it in late - and possibly 
> > trying to be too clever intuiting whether existing perl5-code is 
> > working on bytes or chars.
> 
> This is why we should:
>     i)   Make the choice of internal encoding (UTF-8/UTF-16/UTF-32) decidable 
> at compile time.

Compile time of perl itself, or of the perl program? Regardless, I don't see
any reason not to have a user-choosable default, since if we can handle
bytes and utf-8 simultaneously there's no real reason it should matter to
perl.

>     ii)  Deal with strings internally through pluggable support routines.

s/strings/variables/;

Say hi to Perl V(table) :)
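
A rough illustration of the pluggable-routines idea, in Python rather than
perl internals. This is only a sketch under assumptions: the names Vtable,
VTABLES, Value and make_value are hypothetical and do not correspond to any
real perl structure. Each value carries a table of routines for its
encoding, so callers never look at the raw buffer directly.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Vtable:
    """Hypothetical per-encoding routine table (not a real perl structure)."""
    name: str
    char_length: Callable[[bytes], int]   # length in characters, not bytes
    to_text: Callable[[bytes], str]       # convert the buffer to a common form


# One table per supported internal encoding.
VTABLES = {
    # One byte per character, so the byte count is the character count.
    "latin-1": Vtable("latin-1", len, lambda buf: buf.decode("latin-1")),
    # Variable width: the buffer must be decoded to count characters.
    "utf-8": Vtable("utf-8", lambda buf: len(buf.decode("utf-8")),
                    lambda buf: buf.decode("utf-8")),
    # Fixed width: four bytes per character.
    "utf-32-le": Vtable("utf-32-le", lambda buf: len(buf) // 4,
                        lambda buf: buf.decode("utf-32-le")),
}


@dataclass
class Value:
    """A buffer plus the routine table that knows how to interpret it."""
    buf: bytes
    vtable: Vtable

    def length(self) -> int:
        return self.vtable.char_length(self.buf)

    def text(self) -> str:
        return self.vtable.to_text(self.buf)


def make_value(text: str, encoding: str) -> Value:
    """Store text in the chosen internal encoding."""
    return Value(text.encode(encoding), VTABLES[encoding])
```

With this, make_value("na\u00efve", "utf-8") and
make_value("na\u00efve", "utf-32-le") hold different buffers (6 and 20
bytes) but both report a length of 5 characters.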

>     iii) Never assume bytes.

What, never? Not even in vectors and bitmaps? :)

I agree, though. Character and byte are separate constructs and need to be
dealt with separately.
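
To make the distinction concrete, here is a small Python sketch (Python
only because its str and bytes types keep the two apart; nothing here is
perl's behavior). The bitmap is the case where bytes really are the right
unit.

```python
# Character data: length is counted in characters.
text = "caf\u00e9"                 # four characters, the last one is e-acute
assert len(text) == 4

# The same text as UTF-8 bytes: e-acute takes two bytes.
raw = text.encode("utf-8")
assert len(raw) == 5

# Byte data with no character meaning at all: a 16-bit bitmap.
bitmap = bytearray(2)
bitmap[0] |= 1 << 3                # set bit 3 of the first byte
assert bitmap[0] == 8
```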

>     iv)  Provide the user a method of converting their input and output to and
> from the UTF Perl uses.

That'll go into the bits on line disciplines. Already there, I think.
 
> > But the goal was to avoid a 100Mbyte ASCII "string" becoming a 400Mbyte
> > UTF32 "string" with 300Mbytes of 0x000000.
> 
> Hey, if the user wants it, the user ought get it.
> "No UTF32 for you!" - Perl Nazi.

Yeah, but the problem with that is that for each additional encoding
scheme, perl needs to have some conversion to the other encodings so it
can do reasonable comparisons and suchlike things.
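
A back-of-the-envelope sketch of the cost, in Python. The function names
are hypothetical and the pivot approach is an assumption for illustration,
not a decided design: with direct converters every ordered pair of
encodings needs its own routine, while converting through one chosen
common encoding needs only two routines per other encoding.

```python
def direct_converters(n: int) -> int:
    """Converters needed if every encoding converts straight to every other."""
    return n * (n - 1)


def pivot_converters(n: int) -> int:
    """Converters needed if one of the n encodings is the common pivot."""
    return 2 * (n - 1)


def equal_across_encodings(a: bytes, a_enc: str, b: bytes, b_enc: str) -> bool:
    """Compare two buffers by converting both to a common form first."""
    return a.decode(a_enc) == b.decode(b_enc)


# The same ASCII text in two encodings: the raw buffers differ in content
# and size, so a byte-wise comparison is meaningless without conversion.
ascii_text = "a" * 1000
as_utf8 = ascii_text.encode("utf-8")
as_utf32 = ascii_text.encode("utf-32-le")
```

For UTF-8, UTF-16 and UTF-32 that is 6 direct converters against 4 through
a pivot, and the gap widens with each encoding added. The two buffers above
are 1000 and 4000 bytes, which is the 100Mbyte to 400Mbyte blow-up
mentioned earlier in miniature.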
 
> > Perhaps the regex engine should always force UTF8 form?
> 
> I think we really want to store data internally in a common, Unicode format.

Maybe we should just abstract it, though the more abstract it gets the
slower the regex engine's likely to be, as it does prefer to rip through
raw data buffers.

                                        Dan
