Dear All,

I wrote a simple tokenizer for texts containing Latin-9 characters. It does not behave as expected with the Swedish text below and I would like to find a workaround.
More precisely, Perl does not properly remove the Swedish quotation marks, » (RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK, U+00BB), from the text. See the first character of the first line of the text below. When I run the program with Perl 5.8.8 on Mac OS X Snow Leopard, on the text encoded in UTF-8, Perl outputs a defective UTF-8 code for this character: <BB>

I could solve the problem by removing the û character (LATIN SMALL LETTER U WITH CIRCUMFLEX, U+00FB) from the tr// list.

Do you know of a better, cleaner way to work around this bug?

Thank you for your help,
Pierre

--
### The Perl Program
### An elementary tokenizer. Save it in UTF-8
___BEGIN
while ($line = <>) {
    $text .= $line;
}
$text =~ tr/a-zåàâäæçéèêëîïôöœßùûüÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ'()\-,.?!:;/\n/cs;
# The dash character must be quoted
$text =~ s/([,.?!:;()'\-])/\n$1\n/g;
$text =~ s/\n+/\n/g;
print $text;
___END

### The text in Swedish to reproduce the bug. Save it in UTF-8
___BEGIN
»Tjuvgömmare!» säga skatorna och se ut som samvetet självt. »Vi äro polisbetjänter, vi. Hit med tjuvgodset!» »Å, tyst, era rackare! Jag är gårdsfogden.» »Just den rätta!» håna de.
___END

--
Pierre Nugues, Lunds Tekniska Högskola, Institutionen för datavetenskap, Box 118, S-221 00 Lund, Sweden.
Tel. (0046) 46 222 96 40, http://www.cs.lth.se/~pierre
Visitors: Lunds Tekniska Högskola, E-huset, rum 4134A, Ole Römers väg 3, S-223 63 Lund.
My book: http://www.cs.lth.se/home/Pierre_Nugues/ilppp/