character encoding & regex

Tom Allison Sat, 16 Jun 2007 12:29:41 -0700

I'm trying to run some regular expressions on strings in email. They could be encoded in something, but I can't tell because I don't have a UTF-8 Unicode xterm window that will show me anything. At best I get ?????a?? and other trash like that. I think this is typical for ASCII text renderings of multi-byte characters.

Not to be deterred by the lack of anything this fancy in xterm, I thought I would plug along.


I made a character thus:
my $string = chr(0x263a);  # reported to be a smiley face...

Under 'use bytes' this prints as a ':'.
Without bytes it prints as something resembling an "a", a little box, and a little circle.
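
A minimal sketch of what is probably going on, assuming the terminal is not UTF-8 aware: chr(0x263a) is one character, but it leaves the program as three UTF-8 bytes, and a Latin-1 style terminal would draw each byte separately. The ':' under 'use bytes' is consistent with only the low eight bits (0x3a) being kept. Dumping the bytes as hex sidesteps the terminal entirely. The variable names here are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = chr(0x263a);    # U+263A, a single character

# Encode the character string to UTF-8 bytes explicitly.
my $bytes = encode('UTF-8', $string);

# Render every byte as two hex digits so any terminal can show it.
my $hex = join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;

print length($string), " character, ", length($bytes), " bytes: $hex\n";
# prints: 1 character, 3 bytes: e2 98 ba
```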


And with unicode and locales and bytes it all gets extremely ugly.

I found something that SpamAssassin uses to convert all this "goo" into a repeatable set of characters (which is all I'm really after) by running something that looks like this:


sub _quote_bytea {
    my ($str) = @_;
    my $buf = "";
    foreach my $char (split(//,$str)) {
        my $oct = sprintf ("%lo", ord($char));
        if (length( $oct ) < 2 ) { $oct = '0' . $oct; }
        if (length( $oct ) < 3 ) { $oct = '0' . $oct; }
        $buf .= '\\\\\\\\' . $oct;
    }
    return $buf;
}

Which is also "ugly" in its own right. But I found mention in perldoc -f sprintf that "%lo" is considered really backward-compatible notation and not something you might want (or need) to use.

So one question I have that might be useful is: what alternatives does modern Perl offer to "%lo"?
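
One possible direction, as a hedged sketch rather than a definitive answer: plain "%o" with a width and zero flag ("%03o") does the conversion and the padding in one step. As far as I read perldoc -f sprintf, the backward-compatibility remark is about "%O" (a synonym for "%lo"), and the "l" is only a size flag. The helper name quote_octal and its single-backslash default prefix are hypothetical, and encoding to UTF-8 first is an assumption so that every value fits in three octal digits.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: same idea as _quote_bytea, with sprintf doing the padding.
# Assumes $str is a decoded character string, not raw bytes.
sub quote_octal {
    my ($str, $prefix) = @_;
    $prefix = '\\' unless defined $prefix;    # one backslash by default
    my $bytes = encode('UTF-8', $str);        # work on bytes, each 0..255
    return join '', map { $prefix . sprintf('%03o', ord($_)) } split //, $bytes;
}

print quote_octal(chr(0x263a)), "\n";    # prints \342\230\272
print quote_octal(':'), "\n";            # prints \072
```

Passing a different second argument, for example four backslashes, would give output in the style of the SpamAssassin routine.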

I probably have a lot more questions, but I honestly am not sure if I can get an answer I can live with. I'm just trying to tokenize email and haven't seen a need to support these other character sets just yet. I would like to. But I haven't been able to find any sane way of doing it -- like can I convert everything into utf8 format or just convert everything into octal numbers? I don't need perfect human-readable conversion, I just need consistent conversions.
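
A sketch of the "convert everything into utf8" idea, assuming the charset can be taken from the message's Content-Type header: decode the raw bytes with the declared charset, then encode the result as UTF-8, so the same text always ends up as the same bytes. The helper name to_utf8 and the ISO-8859-1 fallback are assumptions for illustration only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: raw body bytes plus a declared charset in, UTF-8 bytes out.
sub to_utf8 {
    my ($raw, $charset) = @_;

    # Assumption: fall back to ISO-8859-1 when the charset is missing
    # or is not a name that Encode recognises.
    $charset = 'iso-8859-1'
        unless defined $charset && Encode::find_encoding($charset);

    my $chars = decode($charset, $raw);    # bytes -> Perl characters
    return encode('UTF-8', $chars);        # characters -> UTF-8 bytes
}

# The same word arriving in two different encodings ...
my $from_latin1 = to_utf8("caf\xe9",     'iso-8859-1');
my $from_utf8   = to_utf8("caf\xc3\xa9", 'UTF-8');

# ... comes out as identical bytes, so a tokenizer sees one token.
print $from_latin1 eq $from_utf8 ? "same\n" : "different\n";    # prints same
```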


