http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759
--- Comment #21 from David Cook <[email protected]> ---
(In reply to Yuval Hager from comment #20)
> > I suspect that will make the output of Text::Unaccent and
> > Text::Unaccent::PurePerl the same.
> >
>
> Not really, it stays the same garbled mess.
>

That's odd. In that case, you could try replacing the following line:

print "Text::Unaccent - $_ => " . Text::Unaccent::unac_string('utf-8', $_) . "\n";

with these lines:

use Encode;
my $unaccented = Text::Unaccent::unac_string('utf-8', $_);
$unaccented = encode("UTF-8",$unaccented);
print "Text::Unaccent - $_ => $unaccented \n";

The garbled mess is, basically, because we're using "use utf8" and Text::Unaccent returns strings without a UTF8 flag.

> > unac_debug($Text::Unaccent::DEBUG_HIGH);
> >
> > That will also tell you what Text::Unaccent is doing (or probably not
> > doing).
>
> I tested on one string:
>
> unac.c:13708: unac_data0[7] & unac_positions[0][8]: 0x05e7 => untouched
> unac.c:13708: unac_data0[24] & unac_positions[0][25]: 0x05b8 => untouched
> unac.c:13708: unac_data0[30] & unac_positions[0][31]: 0x05de => untouched
> unac.c:13708: unac_data0[24] & unac_positions[0][25]: 0x05b8 => untouched
> unac.c:13708: unac_data0[5] & unac_positions[0][6]: 0x05e5 => untouched
> Text::Unaccent - קָמָץ => ×§Ö¸×ָץ
>
>
> > Note that nothing seems to happen with the (Japanese?) ideograms that Galen
> > tested. I wonder if accents are even a thing with CJK languages...
>
> I am definitely not an authoritative source, but I know a tiny bit of
> Japanese. The letters above are Kanji alphabet, and to the best of my
> knowledge do not have diacritics. BUT Japanese has two more alphabets,
> Hiragana and Katakana, both use diacritics, which CANNOT be removed, or they
> change the sound (and potentially the meaning).
> For example, in the word Hiragana, the first syllable is ひ (Hi, pronounce
> Hee). This same syllable, with two ticks is び, and it sounds like Bee. A
> circle makes it ぴ - sounds like Pee.
> Testing those three:
>

I was just reading some comments from a friend who was suggesting the same thing.

> Text::Unaccent - ひびぴ => ã²ã²ã²
> Text::Unaccent::PurePerl - ひびぴ => ひひひ
> Strip NonspacingMark - ひびぴ => ひひひ
>
> So we've changed 'Hee Bee Pee' to 'Hee Hee Hee'. The same result (and same
> syllables) for Katakana:
>
> Text::Unaccent - ヒビピ => ããã
> Text::Unaccent::PurePerl - ヒビピ => ヒヒヒ
> Strip NonspacingMark - ヒビピ => ヒヒヒ
>
> So diacritics, at least in those two alphabets, should not be removed, to
> the best of my knowledge.

In that case, I really wonder whether we should actually be removing accents for any languages, and instead look at why we started stripping accents in the first place. Text::Unaccent is clearly not removing accents for many languages, so clearly it can't be that big of a problem, no?
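
The "Strip NonspacingMark" results quoted above can be reproduced with a minimal Perl sketch. It assumes that approach means canonical decomposition (NFD) followed by removal of every \p{Mn} character; the actual test script is not shown in this thread, so the helper name strip_nonspacing_marks is hypothetical. Only core modules are used.

```perl
use strict;
use warnings;
use utf8;                                   # string literals below are UTF-8
use Unicode::Normalize qw(NFD NFC);

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper (not from the thread): decompose the text, drop all
# nonspacing marks (general category Mn), then recompose what is left.
sub strip_nonspacing_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;
    return NFC($decomposed);
}

# Hiragana, Katakana, and a Latin word for comparison.
for my $word ('ひびぴ', 'ヒビピ', 'café') {
    printf "%s => %s\n", $word, strip_nonspacing_marks($word);
}
```

The kana collapse to ひひひ and ヒヒヒ because び, ぴ, ビ and ピ decompose into a base syllable plus U+3099 or U+309A, and both of those combining sound marks are nonspacing marks. A blanket \p{Mn} strip therefore cannot tell a removable Latin accent from a mark that changes the syllable.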

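On the garbled Text::Unaccent output: the following is a small sketch of the byte-versus-character mismatch described in the comment, using only the core Encode module. It does not call Text::Unaccent. Here encode() merely stands in for a function that hands back UTF-8 octets without the UTF8 flag, which is an assumption about unac_string's return value based on the description above. Under that assumption, decode() is the call that turns the octets back into characters before they are joined with a character string.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

binmode(STDOUT, ':encoding(UTF-8)');

# Three Hebrew letters (qof, mem, final tsadi) as a character string.
my $chars = "\x{5E7}\x{5DE}\x{5E5}";

# Stand-in for a function that returns UTF-8 octets with the UTF8 flag off.
# ASSUMPTION: this mimics what Text::Unaccent::unac_string returns.
my $octets = encode('UTF-8', $chars);

# Joining characters with raw octets: Perl treats each octet as a Latin-1
# character, so the right-hand side comes out as mojibake.
my $garbled = "$chars => $octets";

# Decoding the octets first yields real characters again.
my $decoded = decode('UTF-8', $octets);
my $clean   = "$chars => $decoded";

print "octets carry the UTF8 flag? ", (utf8::is_utf8($octets) ? "yes" : "no"), "\n";
print "garbled: $garbled\n";
print "clean:   $clean\n";
```

If the module really does return octets, the same decode() step applied to its return value should make the output line readable.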