http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759
--- Comment #13 from David Cook <dc...@prosentient.com.au> ---
(In reply to Galen Charlton from comment #5)
> Some conclusions:
>
> [1] Text::Unaccent mangles non-Latin characters outright; that's enough
> reason to get rid of it.

As I pointed out in my overly long comments, it doesn't appear that Text::Unaccent is actually mangling non-Latin characters.

Rather, in your example, it looks like the problem is how Perl handles a concatenation of one string with the UTF8 flag set and one string without it. Perl appears to do a utf8::upgrade() on the string without the UTF8 flag (i.e. the one returned from Text::Unaccent's C code). Instead of reading it as a UTF-8 octet string and translating it into the corresponding Unicode code points, it reads each octet as a code point of its own (in effect treating the octets as Latin-1). That produces a completely different string for display purposes, even though the octets that came back from Text::Unaccent are unchanged.

For example, given the octets d9 and 85 (i.e. the Arabic letter Meem), Perl creates a "UTF8 string" with the code points "\x{d9}\x{85}" when it should create a "UTF8 string" with the single code point "\x{645}".

This only appears to be a problem when you put the Text::Unaccent string in the same string as a Perl string with the UTF8 flag set. If you were to break them into two separate lines, they'd display correctly in the terminal. Or you could use Encode::decode("UTF-8",$unaccented) to create a Perl string with the UTF8 flag set and the proper code point "\x{645}".

> [2] Both Text::Unaccent::PurePerl and stripping NonspacingMark characters
> are better -- they strip accents from Latin scripts, and don't mangle
> non-Latin.
> Removing NonspacingMark characters is more aggressive; I think
> we need input from Arabic, Hebrew, and Greek users as to whether that is
> acceptable -- or, alternatively, if we need a system preference, or need to
> bite the bullet and package Text::Unaccent::PurePerl.

I suspect that Text::Unaccent and Text::Unaccent::PurePerl are mostly the same, but that Text::Unaccent::PurePerl doesn't lose the UTF8 flag on the input string.

We could avoid Text::Unaccent::PurePerl if we simply use "Encode::decode("UTF-8",$unaccented)" when using Text::Unaccent, to translate the internal byte string into an internal UTF8 string. While it might not be strictly required, doing so would probably prevent future buggy behaviour.

That said, neither Text::Unaccent nor Text::Unaccent::PurePerl necessarily looks good enough. They miss diacritics in Arabic at least, although I think we definitely need input from Arabic, Hebrew, and CJK users regarding how stripping NonspacingMark affects those strings. My guess is that it's fine to strip the diacritics out of Arabic, but there are people much more qualified than me to answer that question on the listserv.

Greek actually looks OK with Text::Unaccent if the encoding is handled. We can see that a bit more clearly with the following lines:

    use Text::Unaccent qw/unac_debug/;
    unac_debug($Text::Unaccent::DEBUG_HIGH);
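
For anyone who wants to check the two points discussed in this comment without installing Text::Unaccent, here is a minimal sketch using only core modules. It is an illustration under assumptions, not code from Koha or from the patch: the octets "\xd9\x85" merely stand in for what Text::Unaccent would return for Meem, and code_points() and strip_marks() are hypothetical helper names. strip_marks() shows one plausible way of "stripping NonspacingMark characters" (decompose with NFD, then remove \p{Mn}); the actual approach under discussion may differ.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFD);

# Hypothetical helper: list the code points of a string as U+XXXX.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $str;
}

# Hypothetical helper: decompose, then drop nonspacing marks (category Mn).
sub strip_marks {
    my ($str) = @_;
    my $decomposed = NFD($str);
    $decomposed =~ s/\p{Mn}//g;
    return $decomposed;
}

# Stand-in for the byte string returned by Text::Unaccent:
# the UTF-8 octets of Meem (U+0645), with no UTF8 flag set.
my $unaccented = "\xd9\x85";

# A character string that carries the UTF8 flag.
my $flagged = "\x{645}";

# Concatenating upgrades the byte string one octet at a time,
# so the two octets become two separate code points.
my $mixed = $flagged . $unaccented;

# Decoding first turns the two octets into the single character U+0645.
my $fixed = $flagged . decode('UTF-8', $unaccented);

print code_points($mixed), "\n";    # U+0645 U+00D9 U+0085
print code_points($fixed), "\n";    # U+0645 U+0645

# Mark stripping on Latin, Greek, and Arabic samples.
print code_points(strip_marks("\x{e9}")), "\n";           # U+0065
print code_points(strip_marks("\x{3ac}")), "\n";          # U+03B1
print code_points(strip_marks("\x{645}\x{64f}")), "\n";   # U+0645
```

The first two lines of output show the difference between concatenating the raw octets and concatenating the decoded string. The last three show that NFD plus \p{Mn} removal strips the acute accent from Latin "é", the tonos from Greek "ά", and the damma from Arabic Meem, which is why it is more aggressive than Text::Unaccent for Arabic.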