Hi,
On 26.09.2004 at 18:13, Stas Bekman wrote:
Boris Zentner wrote: [...]
So you suggest that APR::Table should always look at the perl variable before it stores it, and store the PV string as is if the UTF8 flag is not set, otherwise encode it into a utf8 string and only then store it. That way no data will ever get corrupted.
On return it should always return a PV with the utf8 string, w/o setting UTF8 flag (and doing any further conversion). A user that will want to do the conversion will do that on their own.
In other words (using the Dump entries from your original example), if APR::Table gets:
SV = PV(0x1a48774) at 0x1a5b7a4
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK)
PV = 0x2b67550 "\366\344\374"\0
CUR = 3
LEN = 4
it stores "\366\344\374"\0 as is. If it gets:
SV = PVMG(0x1b17300) at 0x1a48afc
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
IV = 0
NV = 0
PV = 0x2b67cc0 "\303\266\303\244\303\274"\0 [UTF8 "\x{f6}\x{e4}\x{fc}"]
CUR = 6
LEN = 7
MAGIC = 0x2b67310
MG_VIRTUAL = &PL_vtbl_utf8
MG_TYPE = PERL_MAGIC_utf8(w)
MG_LEN = 3
it decodes "\303\266\303\244\303\274"\0 into "\366\344\374"\0 and stores the latter.
When APR::Table gives back the perl variable it should always return it as:
SV = PV(0x1acd900) at 0x1a5d484
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK)
PV = 0x2b671d0 "\366\344\374"\0
CUR = 3
LEN = 4
Is that correct? I think we could do that.
No, I think the stored strings are currently correct. The example uses latin1 chars whose unicode code points happen, by luck, to be below 256. In general we cannot convert a utf8 string back to a single-byte string: a conversion back and forth is not always possible.
The old example has three chars from the iso-8859-1 charset (öäü). These chars are represented as follows:

iso-8859-1 | unicode | utf8
ö          | f6      | c3 b6
ä          | e4      | c3 a4
ü          | fc      | c3 bc
So it is encoded to c3b6c3a4c3bc00 in utf8, and this is what is stored in the APR::Table and what I want back. The only thing I would like to get back in addition is the flag.
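To make the table concrete, here is a small sketch using only the core utf8:: functions. The helper hex_bytes is just a name I made up for illustration, and nothing here touches APR::Table itself.

```perl
use strict;
use warnings;

# Hypothetical helper: hex dump of every element of a string.
sub hex_bytes {
    my ($s) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $s;
}

# The three latin1 chars (ö ä ü) as a plain byte string, UTF8 flag off.
my $latin1 = "\xf6\xe4\xfc";

# The same three chars in perl's internal utf8 form, UTF8 flag on.
my $chars = $latin1;
utf8::upgrade($chars);

# The octets that sit in the PV slot of the flagged string.
my $octets = $chars;
utf8::encode($octets);

print "latin1: ", hex_bytes($latin1), "\n";   # f6 e4 fc
print "utf8:   ", hex_bytes($octets), "\n";   # c3 b6 c3 a4 c3 bc
print "flag:   ", (utf8::is_utf8($chars) ? "on" : "off"), "\n";
```

Note that $latin1 and $chars still compare equal with eq; only the internal representation and the flag differ.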
Look at this:
# chr 0x2663 is the black club suit char in unicode. It is encoded in utf8 as e2 99 a3 and is not included in iso-8859-1.
my $utf = "Hi " . chr(0x2663) . " there";
This results in "Hi \342\231\243 there"\0
This string is 10 chars long, but 12 bytes.
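The char/byte difference can be checked with a few lines. This is only a sketch; the variable names are mine.

```perl
use strict;
use warnings;

my $utf = "Hi " . chr(0x2663) . " there";

# length() on the flagged string counts characters.
my $chars = length($utf);

# Encode a copy to utf8 octets; length() then counts bytes.
my $octets = $utf;
utf8::encode($octets);
my $bytes = length($octets);

print "$chars chars, $bytes bytes\n";   # 10 chars, 12 bytes
```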
Just to get you mad, if the "Hi ..." string is inserted into the table and converted back to utf8 a second time, the result is
"Hi \303\242\302\231\302\243 there"\0
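The double encoding can be reproduced without APR::Table at all. In this sketch the second utf8::encode stands in for whatever code later treats the flag-less bytes as latin1 and upgrades them; octal_dump is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: show bytes above 127 as octal escapes, like Dump() does.
sub octal_dump {
    my ($s) = @_;
    return join '', map { ord($_) > 127 ? sprintf('\\%03o', ord($_)) : $_ } split //, $s;
}

my $utf = "Hi " . chr(0x2663) . " there";

# First encoding: the utf8 octets, with the flag off (what the table holds).
my $once = $utf;
utf8::encode($once);

# Second encoding: every high byte is taken as a latin1 char and encoded again.
my $twice = $once;
utf8::encode($twice);

print octal_dump($once),  "\n";   # Hi \342\231\243 there
print octal_dump($twice), "\n";   # Hi \303\242\302\231\302\243 there
```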
Here is a new example that helps to understand what happened:
use Devel::Peek;
use Encode;
my $utf = "Hi " . chr(0x2663) . " there";
Dump($utf);
Encode::_utf8_off($utf); # simulate set/get to apr::Table
$utf .= chr(0x2663);
chop $utf;
Dump($utf);
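Until the table hands the flag back, a caller could restore it by hand. This is only a caller-side sketch, assuming the value really was stored as valid utf8; it is not what APR::Table does today, and $stored merely simulates a value fetched from the table.

```perl
use strict;
use warnings;
use Encode ();

my $utf = "Hi " . chr(0x2663) . " there";

# Simulate the set/get through the table: same octets, flag lost.
my $stored = $utf;
Encode::_utf8_off($stored);

# Decode the octets in place; utf8::decode returns false on malformed utf8.
my $restored = $stored;
utf8::decode($restored) or die "value is not well-formed utf8";

print "flag restored: ", (utf8::is_utf8($restored) ? "yes" : "no"), "\n";
print "round trip ok: ", ($restored eq $utf ? "yes" : "no"), "\n";
```

This only works if the caller knows the value was utf8 in the first place, which is exactly the information the lost flag used to carry.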
Care to send a patch to the code, tests and docs? But maybe wait a bit before starting to work on that, in case someone objects to that change. First, I just wanted to know if you are willing to work on that ;)
What I'm not sure about is whether this is the only place where this kind of transformation needs to be done. I'm pretty sure that we have quite a few other APIs that grab the PV slot and pass it to APR/Apache, regardless of the state of the UTF8 flag.
-- Boris