Re: Workaround to a unicode bug needed

karl williamson Mon, 06 Sep 2010 09:31:20 -0700

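You need a 'use utf8;' statement at the beginning of your program to tell Perl that the source is encoded in UTF-8. Without it, Perl treats the tr/// list as raw bytes: û is C3 BB in UTF-8, which puts the byte BB in the list, so the second byte of » (C2 BB) is kept while the first is removed.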


I tested it with that, and it works.
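
A minimal sketch of that fix, assuming the script is saved as UTF-8 and its output goes to a UTF-8 terminal. The tokenize sub and the binmode line are illustrative additions, not part of the original program; when reading files through <>, an input layer such as 'use open qw(:std :encoding(UTF-8));' would probably be needed as well.

```perl
use strict;
use warnings;
use utf8;                              # the source file itself is UTF-8
binmode(STDOUT, ':encoding(UTF-8)');   # assumption: output should be UTF-8

# Hypothetical helper wrapping the three steps of the original program.
sub tokenize {
    my ($text) = @_;
    # Replace every run of characters NOT in the list with a single newline.
    $text =~ tr/a-zåàâäæçéèêëîïôöœßùûüÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ'()\-,.?!:;/\n/cs;
    # Put each punctuation mark on a line of its own.
    $text =~ s/([,.?!:;()'\-])/\n$1\n/g;
    # Collapse the empty lines created by the previous step.
    $text =~ s/\n+/\n/g;
    return $text;
}

# The string literal is decoded as characters because of 'use utf8'.
print tokenize("»Just den rätta!» håna de.");
```

With the literal decoded as characters, » is a single character outside the list, so tr/// replaces it as a whole instead of splitting its two bytes.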


Pierre Nugues wrote:

Dear All,

I wrote a simple tokenizer for texts containing Latin9 characters. It does not 
behave as expected with the Swedish text below and I would like to find a 
workaround.

More precisely, Perl does not properly remove the Swedish quotation mark » 
(RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK, U+00BB) from the text. See the 
first character of the first line of the text below.

When I run the program with Perl 5.8.8 on a Mac running Snow Leopard, on the 
text encoded in UTF-8, Perl outputs an invalid UTF-8 sequence for this 
character: <BB>
I could solve the problem by removing the û character (LATIN SMALL LETTER U 
WITH CIRCUMFLEX, U+00FB) from the tr// list.
Do you know of a better, cleaner way to work around this bug?

Thank you for your help
Pierre
--

### The Perl Program
### An elementary tokenizer. Save it in UTF-8
___BEGIN

while ($line = <>) {
    $text .= $line;
}
$text =~ tr/a-zåàâäæçéèêëîïôöœßùûüÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ'()\-,.?!:;/\n/cs;
  # The dash character must be quoted
$text =~ s/([,.?!:;()'\-])/\n$1\n/g;
$text =~ s/\n+/\n/g;
print $text;

___END

### The text to reproduce the bug. Save it in UTF-8

___BEGIN

»Tjuvgömmare!» säga skatorna och se ut som samvetet självt. »Vi äro
polisbetjänter, vi. Hit med tjuvgodset!»
»Å, tyst, era rackare! Jag är gårdsfogden.»
»Just den rätta!» håna de.

___END

--
Pierre Nugues, Lunds Tekniska Högskola, Institutionen för datavetenskap, Box 
118, S-221 00 Lund, Suède.
Tél. (0046) 46 222 96 40, http://www.cs.lth.se/~pierre
Visiteurs: Lunds Tekniska Högskola, E-huset, rum 4134A, Ole Römers väg 3, S-223 
63 Lund.
Mon livre/My book: http://www.cs.lth.se/home/Pierre_Nugues/ilppp/
