Re: character encoding & regex

Mumia W. Sat, 16 Jun 2007 14:09:17 -0700

On 06/16/2007 02:29 PM, Tom Allison wrote:

I'm trying to do some regular expression on strings in email. They could be encoded to something. But I can't tell because I don't have a utf8 unicode xterm window that will show me anything. At best I get ?????a?? and other trash like that. I think this is typical for ascii text renderings of two-bit characters.
Not to be deterred by the lack of anything this fancy in xterm, I thought I would plug along.
I made a character thus:
my $string = chr(0x263a);  # reported to be a smiley face...

under 'use bytes' this prints as a ':'
without bytes this prints as something resembling an a, a little box, and a little circle.
And with unicode and locales and bytes it all gets extremely ugly.
I found something that SpamAssassin uses to convert all this "goo" into a repeatable set of characters (which is all I'm really after) by running something that looks like this:

What do you mean by a "repeatable set of characters"? Unicode characters are repeatable.

sub _quote_bytea {
    my ($str) = @_;
    my $buf = "";
    foreach my $char (split(//,$str)) {
        my $oct = sprintf ("%lo", ord($char));
        if (length( $oct ) < 2 ) { $oct = '0' . $oct; }
        if (length( $oct ) < 3 ) { $oct = '0' . $oct; }
        $buf .= '\\\\\\\\' . $oct;
    }
    return $buf;
}
Which is also "ugly" in its own right. But I found mention in perldoc -f sprintf that "%lo" is considered really backward compatible notation and not something you might want to use (or need to).

The way I read it, it says that %O is a "backward compatible" version of %lo.
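
For what it's worth, a minimum field width with zero padding does the same job as the two length checks, so "%03o" is one possible alternative. What follows is only a sketch, assuming the goal is the same output as the _quote_bytea sub quoted above; the name quote_octal is made up for illustration and is not part of SpamAssassin.

```perl
use strict;
use warnings;

# Hypothetical helper (not from SpamAssassin): aims to produce the same
# output as the _quote_bytea sub quoted above, with "%03o" doing the
# zero-padding to at least three octal digits.
sub quote_octal {
    my ($str) = @_;
    # In single quotes, eight backslashes become four literal backslashes.
    return join '', map { sprintf('\\\\\\\\%03o', ord($_)) } split //, $str;
}

print quote_octal("A:"), "\n";   # prints \\\\101\\\\072
```

The width in "%03o" is a minimum, so code points above 0777 still come out with all of their digits, just as in the original.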

So one question I have that might be useful is, what alternatives does modern perl offer to "%lo" ?
I probably have a lot more, but I honestly am not sure if I can get an answer I can live with. I'm just trying to tokenize email and haven't seen a need to support these other character sets just yet. I would like to. But I haven't been able to find any sane way of doing it -- like can I convert everything into utf8 format or just convert everything into octal numbers? I don't need perfect human-readable conversion, I just need consistent conversions.

You probably should convert everything to utf8. Also, you need a utf8-enabled xterm such as rxvt-unicode or gnome-terminal. On my Debian Etch system, the text console seems to be UTF8 by default :-O
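
A rough sketch of what converting to utf8 can look like with the core Encode module. The ISO-8859-1 input is only an assumption for the example; for real mail the charset would have to come from the message itself. The last part shows why the smiley came out as ':' under 'use bytes': chr() then keeps only the low eight bits, and 0x263a & 0xff is 0x3a.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed input: raw ISO-8859-1 bytes ("cafe" with 0xE9 for the e-acute).
my $raw   = "caf\xE9";
my $chars = decode('ISO-8859-1', $raw);   # bytes -> Perl characters
my $utf8  = encode('UTF-8', $chars);      # characters -> UTF-8 bytes

printf "%d characters, %d UTF-8 bytes\n", length($chars), length($utf8);
# prints: 4 characters, 5 UTF-8 bytes

# The smiley is one character but three bytes once encoded as UTF-8.
my $smiley = encode('UTF-8', chr(0x263a));
print join(' ', map { sprintf('%02x', ord($_)) } split //, $smiley), "\n";
# prints: e2 98 ba

# Under 'use bytes', chr() uses only the low eight bits of its argument.
my $low = do { use bytes; chr(0x263a) };
print "$low\n";   # prints: :
```

Once everything is decoded to characters and re-encoded the same way, the same input always gives the same bytes, which sounds like the consistency being asked for.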


BTW, I still have no clue what you mean by tokenize e-mail.



