I'm not sure whether this is a mod_perl problem or not, but I can't reproduce
it under regular perl, so I thought I'd post here. The setup is Apache 1.3.29,
mod_perl 1.29 and perl 5.8.4.

The problem is occurring in the following piece of code. I've tried creating
a test case, but I can't seem to narrow it down. Just creating a basic
handler to test this seems to work, but when it's used like this buried deep
in some code, it fails. Always a bugger of a problem to track down.

Anyway, the problem seems to be with using "join" where the array has utf-8
strings in it. The resultant string does NOT have the utf-8 flag set. The
basic problem code is this:

        $BodyText = join("\n", @Lines[0 .. (@Lines < 3 ? @Lines-1 : 2)]) . "\n";

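For comparison, here is a minimal standalone sketch of the same join outside
mod_perl. The sample lines are made up for illustration, and utf8::upgrade is
used only to force the utf-8 flag on; the real handler gets its strings from
elsewhere. Run under plain perl, I would expect this to report the flag as set
on the joined string.

```perl
use strict;
use warnings;

# Hypothetical sample data. utf8::upgrade forces the internal UTF8 flag on,
# even for pure-ASCII text such as "Hej mor,".
my @Lines = ("Hej mor,", "linje to", "linje tre", "linje fire");
utf8::upgrade($_) for @Lines;

# Same slice logic as the handler: at most the first three lines.
my $last     = @Lines < 3 ? $#Lines : 2;
my $BodyText = join("\n", @Lines[0 .. $last]) . "\n";

print "utf8 flag on joined string: ",
    (utf8::is_utf8($BodyText) ? 1 : 0), "\n";
```
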
Narrowing it down a bit, and dumping the internal structures like so:

        warn '$Lines[0]: ' . $Lines[0];
        warn 'utf-8 $Lines[0]: ' . is_utf8($Lines[0]);
        Dump($Lines[0]);

        $BodyText = join("\n", $Lines[0]);

        warn '$BodyText: ' . $BodyText;
        warn 'utf-8 $BodyText: ' . is_utf8($BodyText);
        Dump($BodyText);

I get:

$Lines[0]: Hej mor,
utf-8 $Lines[0]: 1
SV = PV(0x9a051a4) at 0xa27f828
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0xa2f0008 "Hej mor,"\0 [UTF8 "Hej mor,"]
  CUR = 8
  LEN = 9

Which looks fine, but then the joined result:

$BodyText: Hej mor,
utf-8 $BodyText:  at /home/mod_perl/hm/Data/Store/Mailbox.pm line 400.
SV = PVMG(0xa279140) at 0x8cb9228
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,GMG,SMG,pPOK)
  IV = 0
  NV = 0
  PV = 0xa2bbf50 "Hej mor,\n"\0
  CUR = 9
  LEN = 408
  MAGIC = 0xa397cd8
    MG_VIRTUAL = &PL_vtbl_taint
    MG_TYPE = PERL_MAGIC_taint(t)

Ouch, that seems wrong. No utf-8 flag, and the string seems to be marked as
tainted, even though the inputs aren't? I thought maybe it had something to
do with the fact that $BodyText had been assigned to earlier and obviously
was tainted, and wasn't losing the taint when the new value was assigned to
it. So I changed to:

        $#Lines = 0;
        warn '$Lines[0]: ' . $Lines[0];
        warn 'utf-8 $Lines[0]: ' . is_utf8($Lines[0]);
        Dump($Lines[0]);

        my $NewBodyText = join("\n", $Lines[0]);

        warn '$NewBodyText: ' . $NewBodyText;
        warn 'utf-8 $NewBodyText: ' . is_utf8($NewBodyText);
        Dump($NewBodyText);

Which gives:

$Lines[0]: Hej mor, at /home/mod_perl/hm/Data/Store/Mailbox.pm line 393.
utf-8 $Lines[0]: 1 at /home/mod_perl/hm/Data/Store/Mailbox.pm line 394.
SV = PV(0x99f7a94) at 0xa386e68
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0xa2bc188 "Hej mor,"\0 [UTF8 "Hej mor,"]
  CUR = 8
  LEN = 9
$BodyText: Hej mor,
utf-8 $BodyText:  at /home/mod_perl/hm/Data/Store/Mailbox.pm line 400.
SV = PVMG(0xa3b61a8) at 0xa346cc0
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  IV = 0
  NV = 0
  PV = 0xa3dde10 "Hej mor,\n"\0
  CUR = 9
  LEN = 162

Ah, so the magic taint stuff is now gone (though it is still a PVMG rather
than a PV?), but it still doesn't have the UTF-8 flag set (and the fact this
string doesn't have any utf-8 chars isn't the problem, it happens on all of
them, even those that do have utf-8 chars). There is no 'use bytes' or
anything at the top of the module, so I don't think that's the problem. I
also tried explicitly doing 'use utf8' to check, but it made no difference,
though I wouldn't expect it to, since it only controls how the actual source
code is interpreted.

Testing with a small standalone program from the command line, it does seem to
work as expected:

$ perl -e 'use Devel::Peek; $a="\x{1234}"; @a = ("a", $a, "b"); $c = join "d", @a; Dump($c);'
SV = PV(0x811ee40) at 0x81318d0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x812e0e8 "ad\341\210\264db"\0 [UTF8 "ad\x{1234}db"]
  CUR = 7
  LEN = 8

Which actually raises a general perl question I just wanted to check. If you
have two strings and concatenate them, and one has the utf-8 flag and the
other doesn't, does the resultant string have the utf-8 flag set? Assuming
that the non-utf8 flagged string is ASCII, this will work fine. If it has
chars > 127 in it though, it'll create a rubbish string...
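
A small sketch of the concatenation case, as far as I understand perl's
behaviour: the unflagged string is upgraded as if its bytes were Latin-1
characters, so the result carries the flag. That is fine for real Latin-1
data, but an unflagged string that actually holds UTF-8 encoded octets comes
out as mojibake unless it is decoded first. The variable names are only for
illustration.

```perl
use strict;
use warnings;

my $wide   = "\x{1234}";   # character string, UTF8 flag on
my $latin1 = "\xe9";       # one Latin-1 byte (e-acute), flag off
my $octets = "\xc3\xa9";   # UTF-8 encoded octets for e-acute, flag off

# The unflagged side is upgraded byte by byte as Latin-1.
my $ok  = $latin1 . $wide;   # U+00E9 followed by U+1234
my $bad = $octets . $wide;   # U+00C3, U+00A9, U+1234: not what was meant

# Decoding the octets first gives the intended characters.
my $decoded = $octets;
utf8::decode($decoded);      # in place: two octets become one character
my $fixed = $decoded . $wide;

printf "ok=%d bad=%d fixed=%d\n", length($ok), length($bad), length($fixed);
```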

Ok, so to summarise, I think I see two problems here:
1. Assigning an untainted value to a value that was previously tainted
leaves the new value tainted
2. join with utf-8 strings doesn't seem to leave the joined string with the
utf-8 flag on
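
To track both symptoms at once while narrowing this down, a small helper
along these lines could replace the separate warn calls. sv_state is a
hypothetical name, not something from mod_perl; it only uses utf8::is_utf8
and Scalar::Util's tainted, and the taint result is only meaningful when
perl is running in taint mode.

```perl
use strict;
use warnings;
use Scalar::Util qw(tainted);

# Hypothetical helper: report the utf-8 flag and taint state of a scalar.
# It reads $_[1] directly so the caller's variable is inspected, not a copy.
sub sv_state {
    my $label = $_[0];
    return sprintf '%s: utf8=%d tainted=%d', $label,
        (utf8::is_utf8($_[1]) ? 1 : 0),
        (tainted($_[1])       ? 1 : 0);
}

# Made-up sample value with the flag forced on.
my $sample = "Hej mor,";
utf8::upgrade($sample);
print sv_state('$sample', $sample), "\n";
```

Inside the handler this would be called as warn sv_state('$BodyText',
$BodyText) before and after the join.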

Seems all a bit weird to me...

Rob

