[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

bugzilla-daemon Thu, 10 Dec 2015 21:46:56 -0800

http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759


--- Comment #21 from David Cook <[email protected]> ---
(In reply to Yuval Hager from comment #20)
> > I suspect that will make the output of Text::Unaccent and
> > Text::Unaccent::PurePerl the same. 
> >
> 
> Not really, it stays the same garbled mess.
> 

That's odd.

In that case, you could try replacing the following line:

print "Text::Unaccent           - $_ => " .
Text::Unaccent::unac_string('utf-8', $_) . "\n";

with these lines:

use Encode;
my $unaccented = Text::Unaccent::unac_string('utf-8', $_);
# unac_string returns raw UTF-8 bytes, so decode them into a character string
$unaccented = decode("UTF-8", $unaccented);

print "Text::Unaccent           - $_ => $unaccented\n";

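The garbled mess arises because the script has "use utf8", so its literals are
character strings, while Text::Unaccent returns UTF-8 bytes without the UTF8
flag. When the two are concatenated and printed, Perl treats those bytes as
Latin-1 characters and encodes them a second time.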
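
A minimal sketch of that mechanism, using only the core Encode module.
Text::Unaccent itself is not called here; encode() stands in for it, on the
assumption that unac_string hands back the UTF-8 bytes of its result with no
UTF8 flag set. The variable names are illustrative only.

```perl
use strict;
use warnings;
use utf8;                          # literals in this file are character strings
use Encode qw(decode encode);

my $word = "קָמָץ";

# Stand-in for Text::Unaccent::unac_string (assumption: it returns the
# UTF-8 bytes of its result, without the UTF8 flag).
my $bytes = encode('UTF-8', $word);

# Joining bytes to a character string upgrades each byte as if it were a
# Latin-1 character; this is what shows up as mojibake once printed.
my $garbled = "$word => $bytes";

# Decoding first turns the bytes back into characters.
my $fixed = "$word => " . decode('UTF-8', $bytes);

binmode(STDOUT, ':encoding(UTF-8)');
print "garbled: $garbled\n";
print "fixed:   $fixed\n";
```

If the assumption holds, the "garbled" line should resemble the output quoted
in this thread, and the "fixed" line should show the word twice.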

> > unac_debug($Text::Unaccent::DEBUG_HIGH);
> > 
> > That will also tell you what Text::Unaccent is doing (or probably not 
> > doing).
> 
> I tested on one string:
> 
> unac.c:13708: unac_data0[7] & unac_positions[0][8]: 0x05e7 => untouched
> unac.c:13708: unac_data0[24] & unac_positions[0][25]: 0x05b8 => untouched
> unac.c:13708: unac_data0[30] & unac_positions[0][31]: 0x05de => untouched
> unac.c:13708: unac_data0[24] & unac_positions[0][25]: 0x05b8 => untouched
> unac.c:13708: unac_data0[5] & unac_positions[0][6]: 0x05e5 => untouched
> Text::Unaccent           - קָמָץ => ×§Ö¸×Ö¸×¥
> 
> 
> > Note that nothing seems to happen with the (Japanese?) ideograms that Galen
> > tested. I wonder if accents are even a thing with CJK languages...
> 
> I am definitely not an authoritative source, but I know a tiny bit of
> Japanese. The letters above are Kanji alphabet, and to the best of my
> knowledge do not have diacritics. BUT Japanese has two more alphabets,
> Hiragana and Katakana, both use diacritics, which CANNOT be removed, or they
> change the sound (and potentially the meaning).
> For example, in the word Hiragana, the first syllable is ひ (Hi, pronounce
> Hee). This same syllable, with two ticks is び, and it sounds like Bee. A
> circle makes it ぴ - sounds like Pee. Testing those three:
> 

I was just reading some comments from a friend who was suggesting the same
thing. 

> Text::Unaccent           - ひびぴ => ã²ã²ã²
> Text::Unaccent::PurePerl - ひびぴ => ひひひ
> Strip NonspacingMark     - ひびぴ => ひひひ
> 
> So we've changed 'Hee Bee Pee' to 'Hee Hee Hee'. The same result (and same
> syllables) for Katakana:
> 
> Text::Unaccent           - ヒビピ => ããã
> Text::Unaccent::PurePerl - ヒビピ => ヒヒヒ
> Strip NonspacingMark     - ヒビピ => ヒヒヒ
> 
> So diacritics, at least in those two alphabets, should not be removed, to
> the best of my knowledge.

In that case, I really wonder whether we should be removing accents for any
language at all. Maybe we should instead look at why we started stripping
accents in the first place.

Text::Unaccent is clearly not removing accents for many languages, so it can't
be that big of a problem, no?
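
For reference, the "Strip NonspacingMark" rows quoted above can probably be
reproduced with something like the sketch below. It assumes that approach
means: decompose with NFD, delete characters in the Nonspacing_Mark (\p{Mn})
category, then recompose. The helper name strip_nonspacing_marks is made up
for illustration, and the Latin sample is mine, not from the earlier tests.

```perl
use strict;
use warnings;
use utf8;                              # literals below are character strings
use Unicode::Normalize qw(NFD NFC);    # core module

# Hypothetical helper: decompose, drop nonspacing marks, recompose.
sub strip_nonspacing_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);       # e.g. one voiced kana becomes base + mark
    $decomposed =~ s/\p{Mn}//g;        # remove every combining nonspacing mark
    return NFC($decomposed);
}

binmode(STDOUT, ':encoding(UTF-8)');
for my $sample ("ひびぴ", "ヒビピ", "café") {
    print "$sample => ", strip_nonspacing_marks($sample), "\n";
}
```

The dakuten and handakuten decompose to combining marks in the Mn category, so
a blanket strip collapses all three kana into one, just as it turns an accented
Latin letter into its base letter.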

