You need a 'use utf8;' statement at the beginning of your program to tell
Perl that the source code is encoded in UTF-8.
I tested your program with that line added, and it works.
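
For illustration, here is a minimal sketch of the tokenizer with that pragma in
place. The tokenize() helper, the in-line sample string and the binmode call are
assumptions of mine, not part of the original program. 'use utf8;' only covers
the literals in the source file, so the sketch takes an already decoded string
and encodes it on output.

```perl
use strict;
use warnings;
use utf8;    # the literals in this file (the tr/// list, the sample) are UTF-8

# Hypothetical helper: expects an already decoded character string.
sub tokenize {
    my ($text) = @_;
    # Replace every run of characters outside the list with a single newline.
    $text =~ tr/a-zåàâäæçéèêëîïôöœßùûüÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ'()\-,.?!:;/\n/cs;
    # Put each punctuation mark on a line of its own.
    $text =~ s/([,.?!:;()'\-])/\n$1\n/g;
    # Collapse the empty lines created by the previous step.
    $text =~ s/\n+/\n/g;
    return $text;
}

binmode(STDOUT, ':encoding(UTF-8)');    # encode characters when printing
print tokenize("»Just den rätta!» håna de.\n");
```

When the text comes from a file or from standard input instead, it would
probably also need a decoding layer, for example binmode(STDIN,
':encoding(UTF-8)') or 'use open qw(:std :utf8);'. Otherwise the input is
compared byte by byte against the characters in the tr/// list.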
Pierre Nugues wrote:
Dear All,
I wrote a simple tokenizer for texts containing Latin9 characters. It does not
behave as expected with the Swedish text below and I would like to find a
workaround.
More precisely, Perl does not properly remove the Swedish quotation marks, »
(RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK, U+00BB), from the text. See the
first character of the first line of the text below.
When I run the program with Perl 5.8.8 on Mac OS X Snow Leopard, on the text
encoded in UTF-8, Perl outputs an invalid UTF-8 sequence for this character: <BB>
I could solve the problem by removing the û character (LATIN SMALL LETTER U
WITH CIRCUMFLEX, U+00FB) from the tr// list.
Do you know of a better, cleaner way to work around this bug?
Thank you for your help
Pierre
--
### The Perl Program
### An elementary tokenizer. Save it in UTF-8
___BEGIN
while ($line = <>) {
$text .= $line;
}
$text =~ tr/a-zåàâäæçéèêëîïôöœßùûüÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ'()\-,.?!:;/\n/cs;
# The dash character must be quoted
$text =~ s/([,.?!:;()'\-])/\n$1\n/g;
$text =~ s/\n+/\n/g;
print $text;
___END
### The text to reproduce the bug. Save it in UTF-8
___BEGIN
»Tjuvgömmare!» säga skatorna och se ut som samvetet självt. »Vi äro
polisbetjänter, vi. Hit med tjuvgodset!»
»Å, tyst, era rackare! Jag är gårdsfogden.»
»Just den rätta!» håna de.
___END
--
Pierre Nugues, Lunds Tekniska Högskola, Institutionen för datavetenskap, Box
118, S-221 00 Lund, Sweden.
Tel. (0046) 46 222 96 40, http://www.cs.lth.se/~pierre
Visitors: Lunds Tekniska Högskola, E-huset, rum 4134A, Ole Römers väg 3, S-223
63 Lund.
My book: http://www.cs.lth.se/home/Pierre_Nugues/ilppp/