http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759
--- Comment #11 from David Cook <[email protected]> ---
Analyzing what "use utf8" does and it's... interesting.

    #use utf8;
    #binmode STDOUT, ':utf8';
    say "Hex = ".unpack("H*",$_);

    Hex = d985d98fd8afd98ed8b1d990d991d8b3d98ed8a9
    Text::Unaccent - مُدَرِّسَة => مُدَرِّسَة

    echo "مُدَرِّسَة" | xxd -p
    d985d98fd8afd98ed8b1d990d991d8b3d98ed8a90a
    [That last 0a byte is just a LF character (ie \n)]

    use utf8;
    #binmode STDOUT, ':utf8';
    say "Hex = ".unpack("H*",$_);

    Hex = 454f2f4e315051334e29
    Text::Unaccent - مُدَرِّسَة => Ù�Ù�دÙ�رÙ�Ù�سÙ�Ø©

    #use utf8;
    binmode STDOUT, ':utf8';
    say "Hex = ".unpack("H*",$_);

    Hex = d985d98fd8afd98ed8b1d990d991d8b3d98ed8a9
    Text::Unaccent - Ù�Ù�دÙ�رÙ�Ù�سÙ�Ø© => Ù�Ù�دÙ�رÙ�Ù�سÙ�Ø©

    use utf8;
    binmode STDOUT, ':utf8';
    say "Hex = ".unpack("H*",$_);

    Hex = 454f2f4e315051334e29
    Text::Unaccent - مُدَرِّسَة => Ù�Ù�دÙ�رÙ�Ù�سÙ�Ø©

--

I have no idea what 454f2f4e315051334e29 is... it's not UTF-8 or Latin1. In fact, if you try to read it as either, you'll just get "EO/N1PQ3N)".

Ahh, I was missing this error message:

    Character in 'H' format wrapped in unpack at unaccent.pl line 46.

Here's some more info using Devel::Peek::Dump():

    PV = 0x1ba6b20 "\331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251"\0 [UTF8 "\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629}"]

Indeed, if we look back at our UTF-8 table:
http://www.utf8-chartable.de/unicode-utf8-table.pl?start=1536

0645 is the code point for ARABIC LETTER MEEM, which is encoded in UTF-8 as d9 85.

454f2f4e315051334e29 is clearly a butchering of the internal string of Unicode code points "\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629}", where only the low byte of each code point is shown (0x645 becomes 45, 0x64f becomes 4f, and so on).

--

Ahh... I think I might have figured it out.
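For anyone who wants to reproduce the "wrapped" hex value without the rest of the test script, here is a minimal sketch. It assumes nothing beyond core Perl: the word is built from explicit code points, so it does not depend on "use utf8" or on the encoding of the source file. The variable names are mine, not from unaccent.pl.

```perl
use strict;
use warnings;

# The same ten code points that Devel::Peek shows in the [UTF8 "..."] part.
my $chars = "\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629}";

# unpack("H*") on a character string: every code point above 0xFF is
# wrapped to its low byte, which is what the warning is about.
my $wrapped = do {
    no warnings;    # hide "Character in 'H' format wrapped in unpack"
    unpack("H*", $chars);
};

# Encode a copy to UTF-8 bytes first, then hex-dump the bytes.
my $bytes = $chars;
utf8::encode($bytes);
my $hex = unpack("H*", $bytes);

print "wrapped = $wrapped\n";   # 454f2f4e315051334e29
print "encoded = $hex\n";       # d985d98fd8afd98ed8b1d990d991d8b3d98ed8a9
```

So the hex dump is only meaningful when unpack is given bytes; on a character string it silently drops the high part of each code point.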
When you use "use utf8":

    $_ =
    PV = 0xf25f60 "\331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251"\0 [UTF8 "\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629}"]

    Text::Unaccent::unac_string('UTF-8', $_) =
    PV = 0x2a0a0c0 "\331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251"\0

If you print out the content of "Text::Unaccent::unac_string('UTF-8', $_)" on its own, you'll get مُدَرِّسَة.

However, if you mix $_ and $unaccented in a single concatenated string, you're going to wind up with a correct $_ but a double-encoded $unaccented.

If you look at the concatenated string, you'll get a PV of:

    PV = 0x29028c0 "Text::Unaccent - \331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251 - \303\231\302\205\303\231\302\217\303\230\302\257\303\231\302\216\303\230\302\261\303\231\302\220\303\231\302\221\303\230\302\263\303\231\302\216\303\230\302\251 \n"\0 [UTF8 "Text::Unaccent - \x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629} - \x{d9}\x{85}\x{d9}\x{8f}\x{d8}\x{af}\x{d9}\x{8e}\x{d8}\x{b1}\x{d9}\x{90}\x{d9}\x{91}\x{d8}\x{b3}\x{d9}\x{8e}\x{d8}\x{a9} \n"]

So in that UTF8 section you have $_ represented by Unicode code points, while each UTF-8 encoded byte of $unaccented has been turned into a code point of its own (d9 becomes \x{d9}, 85 becomes \x{85}, and so on).

If you wanted to concatenate them both in the string, you'd first have to run "$unaccented = decode('UTF-8', $unaccented)".
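The concatenation problem can be reproduced without Text::Unaccent at all. The sketch below is only an illustration under one assumption: I simulate the value coming back from the XS code by encoding the string to UTF-8 bytes (flag off), rather than calling unac_string itself. The variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Character string with the UTF8 flag on (first three letters of the word).
my $chars = "\x{645}\x{64f}\x{62f}";

# Stand-in for the XS return value: the same text as raw UTF-8 bytes,
# with no UTF8 flag. This is an assumption, not a call to Text::Unaccent.
my $unaccented = encode('UTF-8', $chars);

# Mixing the two: Perl treats each byte as one code point, which is the
# double encoding seen in the dump.
my $mixed = "$chars - $unaccented";

# Decoding the bytes first gives a string of real code points.
my $fixed = "$chars - " . decode('UTF-8', $unaccented);

# Print code points only, so the output is plain ASCII.
for my $pair ([mixed => $mixed], [fixed => $fixed]) {
    my ($name, $str) = @$pair;
    printf "%s (%d chars): %s\n", $name, length $str,
        join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}
```

In the "mixed" line the second half shows up as U+00D9 U+0085 and so on, one code point per byte, while in the "fixed" line both halves are the same three Arabic code points.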
Then your concatenated string would internally look like:

    PV = 0x27812a0 "Text::Unaccent - \331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251 - \331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251 \n"\0 [UTF8 "Text::Unaccent - \x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629} - \x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629} \n"]

And that would be correct:

    Text::Unaccent - مُدَرِّسَة - مُدَرِّسَة
    Strip NonspacingMark - مُدَرِّسَة => مدرسة

I mean... the output still doesn't do us much good, but that explains the mangling.

While we gave Text::Unaccent a Perl string with the UTF8 flag set, it took that string through to some C code using an XS interface, did a few things (depending on the scenario), and then passed back a Perl string without the UTF8 flag set, which seems to confuse Perl.

If we do a utf8::upgrade($unaccented) earlier, it still creates a string with incorrect code points...
