http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759
--- Comment #5 from Galen Charlton <[email protected]> ---
I wrote a little test program to compare the options:

___BEGIN___
#!/usr/bin/perl

use Modern::Perl;

use Text::Unaccent qw//;
use Text::Unaccent::PurePerl qw//;
use utf8;
use Unicode::Normalize;

binmode STDOUT, ':utf8';

my @str = (
    'été',
    'umlaüt',
    'עברית',
    'חוֹלָם',
    '北京市',
    'Άά Έέ Ήή Ίί Όό Ύύ Ώώ',
    'مُدَرِّسَة'
);

sub unaccent {
    my $str = NFKD(shift);
    $str =~ s/\p{NonspacingMark}//g;
    return $str;
}

foreach (@str) {
    if ($_ eq 'مُدَرِّسَة') {
        # special case to avoid locking my terminal session (!)
        print "Text::Unaccent - $_ => *refusing to let Text::Unaccent do this*\n";
    } else {
        print "Text::Unaccent - $_ => " . Text::Unaccent::unac_string('utf-8', $_) . "\n";
    }
    print "Text::Unaccent::PurePerl - $_ => " . Text::Unaccent::PurePerl::unac_string($_) . "\n";
    print "Strip NonspacingMark - $_ => " . unaccent($_) . "\n";
}
___END___

Here's its output:

Text::Unaccent - été => ete
Text::Unaccent::PurePerl - été => ete
Strip NonspacingMark - été => ete
Text::Unaccent - umlaüt => umlaut
Text::Unaccent::PurePerl - umlaüt => umlaut
Strip NonspacingMark - umlaüt => umlaut
Text::Unaccent - עברית => ×¢×ר×ת
Text::Unaccent::PurePerl - עברית => עברית
Strip NonspacingMark - עברית => עברית
Text::Unaccent - חוֹלָם => ××Ö¹×Ö¸×
Text::Unaccent::PurePerl - חוֹלָם => חוֹלָם
Strip NonspacingMark - חוֹלָם => חולם
Text::Unaccent - 北京市 => åäº¬å¸
Text::Unaccent::PurePerl - 北京市 => 北京市
Strip NonspacingMark - 北京市 => 北京市
Text::Unaccent - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Îα Îε Îη Îι Îο Î¥Ï Î©Ï
Text::Unaccent::PurePerl - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Αα Εε Ηη Ιι Οο Υυ Ωω
Strip NonspacingMark - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Αα Εε Ηη Ιι Οο Υυ Ωω
Text::Unaccent - مُدَرِّسَة => *refusing to let Text::Unaccent do this*
Text::Unaccent::PurePerl - مُدَرِّسَة => مُدَرِّسَة
Strip NonspacingMark - مُدَرِّسَة => مدرسة

Some conclusions:

[1] Text::Unaccent mangles non-Latin characters outright; that's enough reason to get rid of it.

[2] Both Text::Unaccent::PurePerl and stripping NonspacingMark characters are better: they strip accents from Latin scripts and don't mangle non-Latin text. Removing NonspacingMark characters is more aggressive, though. In the output above it also drops the Hebrew vowel points and the Arabic diacritics, which Text::Unaccent::PurePerl leaves alone. I think we need input from Arabic, Hebrew, and Greek users as to whether that is acceptable. Alternatively, we may need a system preference, or need to bite the bullet and package Text::Unaccent::PurePerl.
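
To make the "more aggressive" point concrete for reviewers, here is a minimal sketch that lists which code points the NonspacingMark approach would actually remove from a string. The helper name stripped_marks is made up for this illustration and is not part of Koha or of either module; it uses only the same NFKD and \p{NonspacingMark} combination as the unaccent() sub in the test program, plus charnames::viacode from core Perl to print the character names.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFKD);
use charnames ();

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper (not from Koha): return the characters that
# s/\p{NonspacingMark}//g would delete after NFKD decomposition.
sub stripped_marks {
    my ($str) = @_;
    return grep { /\p{NonspacingMark}/ } split //, NFKD($str);
}

for my $word ('été', 'חוֹלָם', 'مُدَرِّسَة', '北京市') {
    my @marks = stripped_marks($word);
    printf "%s: %d mark(s) would be removed\n", $word, scalar @marks;
    for my $mark (@marks) {
        # viacode returns undef for unnamed code points, hence the fallback
        my $name = charnames::viacode(ord $mark) // '(unnamed)';
        printf "    U+%04X %s\n", ord($mark), $name;
    }
}
```

For 'été' this reports two U+0301 COMBINING ACUTE ACCENT marks, which is the intended effect. For the Hebrew and Arabic samples it lists the vowel points and diacritics that would be lost, which is the part that needs a decision from users of those scripts.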
