On 06/16/2007 02:29 PM, Tom Allison wrote:
I'm trying to do some regular expression on strings in email. They could
be encoded to something. But I can't tell because I don't have a utf8
unicode xterm window that will show me anything. At best I get
?????a?? and other trash like that. I think this is typical for ASCII
text renderings of two-byte characters.
Not to be deterred by the lack of anything this fancy in xterm, I thought
I would plug along.
I made a character thus:
my $string = chr(0x263a); # reported to be a smiley face...
Under 'use bytes' this prints as a ':'.
Without bytes it prints as something resembling an 'a', a little box,
and a little circle.
And with unicode and locales and bytes it all gets extremely ugly.
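The ':' and the three odd glyphs are both consistent with how U+263A is stored. A minimal sketch, assuming the Encode module is available (it ships with Perl 5.8 and later) and that the xterm was interpreting output as Latin-1; the variable names are only illustrative:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = chr(0x263a);              # U+263A WHITE SMILING FACE

# Keep only the low byte of the code point, as a byte-oriented view would.
my $low = ord($string) & 0xFF;
printf "low byte: 0x%02X (%s)\n", $low, chr($low);   # low byte: 0x3A (:)

# Encode the single character as UTF-8 and list the resulting bytes.
my $bytes = encode('UTF-8', $string);
my $hex   = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
print "UTF-8 bytes: $hex\n";           # UTF-8 bytes: E2 98 BA
```

The low byte 0x3A is ':', which would explain the 'use bytes' output. The bytes E2 98 BA read as Latin-1 are an accented a, an unprintable control character, and a small raised o, which matches the a, box, and circle.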
I found something that SpamAssassin uses to convert all this "goo" into
a repeatable set of characters (which is all I'm really after) by
running something that looks like this:
What do you mean by a "repeatable set of characters"? Unicode characters
are repeatable.
sub _quote_bytea {
    my ($str) = @_;
    my $buf = "";
    foreach my $char (split(//,$str)) {
        my $oct = sprintf ("%lo", ord($char));
        if (length( $oct ) < 2 ) { $oct = '0' . $oct; }
        if (length( $oct ) < 3 ) { $oct = '0' . $oct; }
        $buf .= '\\\\\\\\' . $oct;
    }
    return $buf;
}
Which is also "ugly" in its own right. But I found mention in
perldoc -f sprintf that "%lo" is considered a really backward-compatible
notation and not something you might want (or need) to use.
The way I read it, it says that %O is a "backward compatible" version of
%lo.
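If I read perldoc -f sprintf correctly, plain %o is the ordinary octal conversion, and a zero-padded width such as %03o would also replace the two length checks in the sub above. A sketch, where octal_codes is a made-up helper name and not anything from SpamAssassin:

```perl
use strict;
use warnings;

# Hypothetical helper: one zero-padded octal code per character.
sub octal_codes {
    my ($str) = @_;
    return map { sprintf '%03o', ord($_) } split //, $str;
}

print join(' ', octal_codes("A\n")), "\n";   # 101 012
```

The width is only a minimum, so a character above 0777 still comes out with more digits; chr(0x263a) gives 23072.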
So one question I have that might be useful is, what alternatives does
modern perl offer to "%lo" ?
I probably have a lot more, but I honestly am not sure if I can get an
answer I can live with. I'm just trying to tokenize email and haven't
seen a need to support these other character sets just yet. I would
like to. But I haven't been able to find any sane way of doing it --
like can I convert everything into utf8 format or just convert
everything into octal numbers? I don't need perfect human-readable
conversion, I just need consistent conversions.
You probably should convert everything to utf8. Also, you need a
utf8-enabled xterm such as rxvt-unicode or gnome-terminal. On my Debian
Etch system, the text console seems to be UTF8 by default :-O
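For the conversion itself, the Encode module is probably the place to start. A minimal sketch, assuming you already know each message's charset (for real mail it would have to come from the Content-Type header); to_utf8 is a hypothetical helper name:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: $charset is assumed to be known to the caller.
sub to_utf8 {
    my ($raw, $charset) = @_;
    my $chars = decode($charset, $raw);   # bytes -> Perl characters
    return encode('UTF-8', $chars);       # characters -> UTF-8 bytes
}

my $latin1 = "caf\xE9";                   # "cafe" with e-acute, ISO-8859-1
my $utf8   = to_utf8($latin1, 'ISO-8859-1');
print unpack('H*', $utf8), "\n";          # 636166c3a9
```

Once everything is in one encoding, the same text gives the same bytes every time, which sounds like the consistency you are after.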
BTW, I still have no clue what you mean by tokenize e-mail.