The standard for URLs uses a double encoding: a URL is encoded in UTF-8, and then all bytes with the high bit set are written in the %xx format. Therefore, if you just convert each %xx back to the proper byte, the result is a valid UTF-8 string. You don't need to worry about multi-byte codes if UTF-8 is the result you want.
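The two-step decoding described above can be sketched as follows (a minimal illustration in Python; `percent_decode` is a hypothetical helper name, not part of any library):

```python
def percent_decode(url_component: str) -> str:
    """Replace each %xx escape with its raw byte, then decode the
    resulting byte string as UTF-8 -- the two-step process the
    post describes."""
    out = bytearray()
    i = 0
    while i < len(url_component):
        if url_component[i] == "%":
            # %xx names one byte, not one character
            out.append(int(url_component[i + 1:i + 3], 16))
            i += 3
        else:
            out.extend(url_component[i].encode("ascii"))
            i += 1
    # Only now interpret the accumulated bytes as UTF-8
    return out.decode("utf-8")

print(percent_decode("%E2%82%AC"))  # the Euro sign, one character
```

Note that the UTF-8 decode happens once, on the whole byte string, so multi-byte sequences fall out correctly without any explicit boundary detection.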
-- Mark Biggar [EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED]

> Hi!
>
> > the "XXX -- correct" refers to the :16 (IIRC, Larry said on p6l that he
> > liked that, but I wasn't able to find it in the Synopses).
> >
> > BTW, Pugs' chr does understand input > 255 correctly:
> > pugs> ord "€"
> > 8364
> > pugs> chr 8364
> > '€'
>
> Yes, I know it.
>
> > $decoded does contain valid UTF-8, the problem is Pugs' print/say
> > builtin -- compare:
>
> It's interesting, and it may be part of the problem, but I don't think
> the CGI.pm way is a good solution for decoding a URL-encoded string: if
> you write chr(0xE2)~chr(0x82)~chr(0xAC), that gives three characters,
> and chr(0xE2) is a character whose UTF-8 encoding is two bytes (on an
> iso-8859-1 terminal the output may look right, but the internal storage
> and handling are not). I mean, if you handle the string in memory and
> query its length, this way you get 3, but the right answer is 1.
>
> So, unless there is a trick (for example a function called "byte" that
> is usable in place of "chr"), CGI.pm has to recognize %E2%82%AC as one
> character and decode it by evaluating chr(8364).
>
> Additionally, detecting character boundaries is not so easy, because a
> character can be 2-4 bytes long, and two or more characters can follow
> one another.
>
> Bye,
> Andras
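The length discrepancy Andras describes can be demonstrated directly (a sketch in Python rather than Perl 6, contrasting the per-escape chr() decoding he criticizes with decoding the escapes as bytes first):

```python
import urllib.parse

encoded = "%E2%82%AC"  # the Euro sign, UTF-8 encoded then %-escaped

# Naive decoding: turn each %xx escape into its own character,
# as the post says the CGI.pm-style chr() approach does.
naive = "".join(chr(int(h, 16)) for h in encoded.split("%")[1:])
print(len(naive))  # 3 -- three chr() calls give three characters

# Correct decoding: %xx escapes yield *bytes*, which are then
# interpreted as UTF-8, giving the single character chr(8364).
correct = urllib.parse.unquote(encoded, encoding="utf-8")
print(len(correct), ord(correct))  # 1 8364
```

The naive string may even display correctly on a Latin-1 terminal, which is exactly why the bug is easy to miss: only length, indexing, and other in-memory operations reveal it.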
