On 06/16/2007 02:29 PM, Tom Allison wrote:
I'm trying to run some regular expressions on strings in email. They could be encoded in some way, but I can't tell, because I don't have a UTF-8 (Unicode) xterm window that will show me anything. At best I get ?????a?? and other trash like that. I think this is typical for ASCII text renderings of multi-byte characters.

Not to be deterred by the lack of anything this fancy in xterm, I thought I would plug along.

I made a character thus:
my $string = chr(0x263a);  # reported to be a smiley face...

Under 'use bytes' this prints as a ':'.
Without bytes, this prints something resembling an a, a little box, and a little circle.
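
The two results described above can probably be explained by looking at the character and its UTF-8 bytes separately. The following is only a minimal sketch, assuming the core Encode module is available (it has shipped with perl since 5.8); it avoids printing the character itself, so it does not depend on the terminal:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = chr(0x263a);    # one character: U+263A WHITE SMILING FACE

# As a character string, this is a single code point.
printf "characters: %d\n", length($string);

# Encoded as UTF-8 it becomes three bytes (E2 98 BA). A terminal that is
# not UTF-8 aware shows those three bytes as three unrelated glyphs.
my $bytes = encode('UTF-8', $string);
printf "bytes: %d (%s)\n", length($bytes),
    join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

# Under 'use bytes', chr() keeps only the low 8 bits of its argument:
# 0x263A & 0xFF is 0x3A, which is ':'.
{
    use bytes;
    printf "under bytes: %s\n", chr(0x263a);
}
```

So the ':' is the smiley's code point truncated to one byte, and the three odd glyphs are its three UTF-8 bytes shown one at a time.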


And with unicode and locales and bytes it all gets extremely ugly.


I found something that SpamAssassin uses to convert all this "goo" into a repeatable set of characters (which is all I'm really after) by running something that looks like this:


What do you mean by a "repeatable set of characters"? Unicode characters are repeatable.

sub _quote_bytea {
    my ($str) = @_;
    my $buf = "";
    foreach my $char (split(//,$str)) {
        my $oct = sprintf ("%lo", ord($char));
        if (length( $oct ) < 2 ) { $oct = '0' . $oct; }
        if (length( $oct ) < 3 ) { $oct = '0' . $oct; }
        $buf .= '\\\\\\\\' . $oct;
    }
    return $buf;
}

Which is also "ugly" in its own right. But I found mention in perldoc -f sprintf that "%lo" is considered a really backward-compatible notation and not something you might want (or need) to use.


The way I read it, it says that %O is a "backward compatible" version of %lo.

So one question I have that might be useful is, what alternatives does modern perl offer to "%lo" ?
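
As a rough answer to that question: the "l" in "%lo" is only a size flag (treat the integer as a C long), so plain "%o" should do the same job here, and "%03o" zero-pads to three digits, which would replace the two length() checks in _quote_bytea. A minimal sketch, where the name octal_escape and the optional prefix argument are made up for illustration and are not part of SpamAssassin:

```perl
use strict;
use warnings;

# Hypothetical helper: escape every character of $str as a zero-padded
# octal number. $prefix defaults to a single backslash; pass whatever
# prefix the consumer of the string expects.
sub octal_escape {
    my ($str, $prefix) = @_;
    $prefix = '\\' unless defined $prefix;

    # %03o pads with leading zeros to at least three octal digits.
    return join '', map { $prefix . sprintf('%03o', ord($_)) } split //, $str;
}

print octal_escape("A:"), "\n";    # prints \101\072
```

Note that ord() returns code points, so a character above 0777 produces more than three digits. If the goal is one escape per byte, the string would need to be encoded to bytes first.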

I probably have a lot more, but I honestly am not sure if I can get an answer I can live with. I'm just trying to tokenize email and haven't seen a need to support these other character sets just yet. I would like to. But I haven't been able to find any sane way of doing it -- like can I convert everything into utf8 format or just convert everything into octal numbers? I don't need perfect human-readable conversion, I just need consistent conversions.


You probably should convert everything to utf8. Also, you need a utf8-enabled xterm such as rxvt-unicode or gnome-terminal. On my Debian Etch system, the text console seems to be UTF8 by default :-O
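
A minimal sketch of that conversion with the core Encode module, assuming you already know the charset the message declares (the ISO-8859-1 value and the sample bytes below are made up for the example):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: raw body bytes plus the charset named in the
# message's Content-Type header.
my $charset = 'ISO-8859-1';
my $raw     = "caf\xE9";

# decode() turns bytes in the named charset into Perl characters.
my $text = decode($charset, $raw);

# encode() turns those characters into UTF-8 bytes, giving one
# consistent representation whatever the original charset was.
my $utf8 = encode('UTF-8', $text);

printf "%d characters, %d UTF-8 bytes\n", length($text), length($utf8);
# prints: 4 characters, 5 UTF-8 bytes
```

Regular expressions are probably best run on the decoded $text, so that they see characters rather than bytes; the UTF-8 form is for storage or output.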

BTW, I still have no clue what you mean by tokenize e-mail.


