The standard for URLs uses a double encoding: a URL is coded in UTF-8, and then 
all bytes with the high bit set are written in the %xx format.  Therefore, if you 
just convert each %xx escape back to the corresponding byte, the result is a 
valid UTF-8 string.  You don't need to worry about multi-byte sequences, if 
UTF-8 is the result you want.
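To illustrate the two-step decoding (a sketch in Python rather than the Perl of this thread; the simplistic `split("%")` only works because the sample string consists entirely of escapes):

```python
import urllib.parse

# Percent-encoded form of the euro sign: UTF-8 bytes E2 82 AC, each escaped.
encoded = "%E2%82%AC"

# Step 1: turn each %xx escape back into the raw byte it stands for.
raw_bytes = bytes(int(h, 16) for h in encoded.split("%")[1:])

# Step 2: the resulting byte string is valid UTF-8, so decode it directly.
decoded = raw_bytes.decode("utf-8")
print(decoded)        # the euro sign, a single character
print(len(decoded))   # 1

# The standard library does both steps at once:
print(urllib.parse.unquote(encoded, encoding="utf-8"))
```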

--
Mark Biggar
[EMAIL PROTECTED]


> Hi!
> 
> > the "XXX -- correct" refers to the :16 (IIRC, Larry said on p6l that he
> > liked that, but I wasn't able to find it in the Synopses).
> > 
> > BTW, Pugs' chr does understand input > 255 correctly:
> >   pugs> ord "€"
> >   8364
> >   pugs> chr 8364
> >   '€'
> Yes, I know it.
> 
> > $decoded does contain valid UTF-8, the problem is Pugs' print/say
> > builtin -- compare:
> It's interesting, and it may be the problem, but I don't think the 
> CGI.pm way is a good solution for decoding a URL-encoded string: if you 
> write chr(0xE2)~chr(0x82)~chr(0xAC), those are 3 characters, and 
> chr(0xE2) by itself is a 2-byte coded character in UTF-8 (on an 
> iso-8859-1 terminal the output may look right, but the internal storage 
> and handling isn't). I mean, if you want to handle the string in memory 
> and query its length, this way you get 3, but the right answer is 1.
> 
> So, unless there is a trick (for example a function called "byte" that 
> is usable like "chr"), CGI.pm has to recognize %E2%82%AC as one 
> character and decode it by evaluating chr(8364).
> 
> Additionally, detecting character boundaries is not so easy, because a 
> character can be 2-4 bytes long, and two or more multi-byte characters 
> can be next to each other.
> 
> Bye,
>    Andras
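The length problem and the boundary detection described above can both be sketched in Python (`utf8_len` is a hypothetical helper, and it assumes its argument is a valid UTF-8 lead byte, not a continuation byte):

```python
# Decoding each %xx to chr(byte) yields one *character* per byte -- wrong:
wrong = chr(0xE2) + chr(0x82) + chr(0xAC)
print(len(wrong))   # 3 -- three separate code points, not the euro sign

# Decoding the bytes as UTF-8 yields the single intended character:
right = bytes([0xE2, 0x82, 0xAC]).decode("utf-8")
print(len(right))   # 1
print(ord(right))   # 8364

# Boundaries follow from the lead byte: in UTF-8 the number of leading
# 1-bits in the first byte of a sequence gives the sequence length.
def utf8_len(lead: int) -> int:
    if lead < 0x80:
        return 1    # 0xxxxxxx: plain ASCII
    if lead < 0xE0:
        return 2    # 110xxxxx
    if lead < 0xF0:
        return 3    # 1110xxxx
    return 4        # 11110xxx

print(utf8_len(0xE2))  # 3
```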
