Workaround to a Unicode bug needed

Pierre Nugues Mon, 06 Sep 2010 06:11:03 -0700

Dear All,

I wrote a simple tokenizer for texts containing Latin9 characters. It does not 
behave as expected with the Swedish text below and I would like to find a 
workaround.


More precisely, Perl does not properly remove the Swedish quotation marks, » 
(RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK, U+00BB), from the text. See the 
first character of the first line of the sample text below.

When I run the program on Mac OS X Snow Leopard with Perl 5.8.8, on the text 
encoded in UTF-8, Perl outputs a defective UTF-8 code for this character: <BB>.
I could solve the problem by removing the û character (LATIN SMALL LETTER U 
WITH CIRCUMFLEX, U+00FB) from the tr// list.
Do you know of a better, cleaner way to work around this bug?

Thank you for your help
Pierre
--

### The Perl Program
### An elementary tokenizer. Save it in UTF-8
___BEGIN

while ($line = <>) { 
  $text .= $line;
}
$text =~ tr/a-zåàâäæçéèêëîïôöœßùûüÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ'()\-,.?!:;/\n/cs;
 # The dash character must be quoted
$text =~ s/([,.?!:;()'\-])/\n$1\n/g;
$text =~ s/\n+/\n/g;
print $text;

___END
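
### A possible workaround (a sketch, not a confirmed fix)
A likely cause, stated here as an assumption: without `use utf8`, Perl reads the program as bytes, so the tr// list holds the two bytes of û (C3 BB) rather than the character itself. The quote » is encoded as C2 BB, so its second byte matches the list and survives as <BB>. The sketch below declares the source as UTF-8 and encodes the output. The `tokenize` wrapper is a hypothetical name introduced only for illustration; the three substitutions are those of the program above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # assumption: this file is saved in UTF-8, so literals are characters

# tokenize() is a hypothetical wrapper around the three steps of the program above.
sub tokenize {
    my ($text) = @_;
    # Replace each run of characters outside the list with one newline.
    $text =~ tr/a-zåàâäæçéèêëîïôöœßùûüÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ'()\-,.?!:;/\n/cs;
    # Put each punctuation sign on its own line.
    $text =~ s/([,.?!:;()'\-])/\n$1\n/g;
    # Collapse consecutive newlines.
    $text =~ s/\n+/\n/g;
    return $text;
}

# Demo on the start of the sample text; output is encoded back to UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');
print tokenize("»Tjuvgömmare!» säga skatorna"), "\n";
```

To use this as a filter, the input would also have to be decoded, for example with `binmode(STDIN, ':encoding(UTF-8)')` before reading from standard input. Whether this behaves identically on Perl 5.8.8 has not been checked here.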

### The text in Swedish to reproduce the bug. Save it in UTF-8

___BEGIN
»Tjuvgömmare!» säga skatorna och se ut som samvetet självt. »Vi äro 
polisbetjänter, vi. Hit med tjuvgodset!» 
»Å, tyst, era rackare! Jag är gårdsfogden.» 
»Just den rätta!» håna de. 
___END
--
Pierre Nugues, Lunds Tekniska Högskola, Institutionen för datavetenskap, Box 
118, S-221 00 Lund, Sweden.
Tel. (0046) 46 222 96 40, http://www.cs.lth.se/~pierre
Visitors: Lunds Tekniska Högskola, E-huset, rum 4134A, Ole Römers väg 3, S-223 
63 Lund.
My book: http://www.cs.lth.se/home/Pierre_Nugues/ilppp/

