I'm trying to run some regular expressions against strings in email. They could
be encoded in something other than ASCII, but I can't tell, because I don't have
a UTF-8 xterm window that will show me anything. At best I get ?????a?? and
other trash like that. I think this is typical for ASCII renderings of
multi-byte characters.
Not to be deterred by the lack of anything this fancy in xterm, I thought I
would plug along.
I made a character thus:
my $string = chr(0x263a); # reported to be a smiley face...
Under 'use bytes' this prints as a ':'.
Without bytes it prints as something resembling an 'a', a little box, and a
little circle.
And with unicode and locales and bytes it all gets extremely ugly.
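One way to see what is actually in the string without a UTF-8 terminal is to print code points and bytes as plain ASCII hex. The following is only a minimal sketch, assuming the core Encode module is available; the variable names are made up for illustration. The ':' under 'use bytes' would be consistent with only the low byte of 0x263a (0x3a) being kept, and the three odd glyphs with the three UTF-8 bytes being shown one at a time.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = chr(0x263a);    # U+263A WHITE SMILING FACE

# Character view: one character, shown as its code point.
my $chars = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;

# Byte view: the UTF-8 encoding of that one character is three bytes.
my $bytes = join ' ', map { sprintf '%02x', ord($_) }
            split //, encode('UTF-8', $string);

print "$chars\n";    # U+263A
print "$bytes\n";    # e2 98 ba
```

Both lines of output are pure ASCII, so they look the same in any terminal.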
I found something that SpamAssassin uses to convert all this "goo" into a
repeatable set of characters (which is all I'm really after) by running
something that looks like this:
sub _quote_bytea {
    my ($str) = @_;
    my $buf = "";
    foreach my $char (split(//,$str)) {
        my $oct = sprintf ("%lo", ord($char));
        if (length( $oct ) < 2 ) { $oct = '0' . $oct; }
        if (length( $oct ) < 3 ) { $oct = '0' . $oct; }
        $buf .= '\\\\\\\\' . $oct;
    }
    return $buf;
}
Which is also "ugly" in its own right. But in perldoc -f sprintf I found
mention that "%lo" is considered really backward-compatible notation and not
something you might want (or need) to use.
So one question I have that might be useful is: what alternatives does modern
Perl offer to "%lo"?
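For comparison, here is a minimal sketch of one possible alternative, assuming plain "%o" with a zero-padded width is acceptable; the name quote_bytea_padded is hypothetical and is not SpamAssassin's. The "%03o" format does the padding that the two length() checks did by hand, and the backslash prefix is copied unchanged from the sub above.

```perl
use strict;
use warnings;

# Hypothetical rewrite of _quote_bytea: "%03o" zero-pads to three octal
# digits, so the manual length() checks are no longer needed.
sub quote_bytea_padded {
    my ($str) = @_;
    return join '',
           map { '\\\\\\\\' . sprintf('%03o', ord($_)) }
           split //, $str;
}

my $quoted = quote_bytea_padded("a\n");
print "$quoted\n";    # \\\\141\\\\012
```

Characters with code points above 0777 still produce more than three digits, just as they would with the original sub.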
I probably have a lot more questions, but I honestly am not sure if I can get
an answer I can live with. I'm just trying to tokenize email and haven't seen a
need to support these other character sets just yet. I would like to. But I
haven't been able to find any sane way of doing it. For example, can I convert
everything into UTF-8, or should I just convert everything into octal numbers?
I don't need perfect human-readable conversion, I just need consistent
conversions.
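To make the "convert everything into UTF-8" option concrete, here is a minimal sketch, assuming the core Encode module and assuming the charset of each message is already known (say, from its Content-Type header) and is correct. The helper name normalize_to_utf8 and the sample strings are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: $raw is the undecoded bytes, $charset is whatever
# the message claims to be encoded in (assumed known and correct).
sub normalize_to_utf8 {
    my ($raw, $charset) = @_;
    my $chars = decode($charset, $raw);    # bytes -> Perl characters
    return encode('UTF-8', $chars);        # characters -> UTF-8 bytes
}

# The same text arriving in two different encodings...
my $from_latin1 = normalize_to_utf8("caf\xe9",     'ISO-8859-1');
my $from_utf8   = normalize_to_utf8("caf\xc3\xa9", 'UTF-8');

# ...ends up as identical bytes, so a tokenizer would see one token.
print $from_latin1 eq $from_utf8 ? "same\n" : "different\n";
print join(' ', map { sprintf '%02x', ord($_) } split //, $from_utf8), "\n";
```

The hex dump on the last line is ASCII only, so the result can be checked even in a terminal that cannot display the characters themselves.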