Re: Unicode::Normalize surprise with dotless i

SADAHIRO Tomoyuki Thu, 05 Sep 2002 07:09:23 -0700


On Thu, 05 Sep 2002 13:06:49 +0200
[EMAIL PROTECTED] (Andreas J. Koenig) wrote:


> Hi, Tomoyuki,
> 
> is it a bug in Unicode::Normalize or in my code: I expected that for
> combining a circumflex with a small letter i, I'd have to use the
> dotless i, but to my surprise, NFC refuses to combine with the dotless
> i. Here's a demo program:
> 
> % perl -le '
> use Unicode::Normalize;
> use Encode;
> use charnames ":full";
> for my $e (qw(ascii)){
>   print Encode::encode($e,
>     NFKC("combining with i: i\N{COMBINING CIRCUMFLEX ACCENT}
> combining with dotless i: \N{LATIN SMALL LETTER DOTLESS I}\N{COMBINING CIRCUMFLEX ACCENT}"),
>     Encode::FB_PERLQQ); 
> }
> '
> combining with i: \x{00ee}
> combining with dotless i: \x{0131}\x{0302}
> 
> 
> What do you think?

Hello.
I have a short answer and a long answer.

(1) 

<LATIN SMALL LETTER I WITH CIRCUMFLEX> is not
<LATIN SMALL LETTER DOTLESS I WITH CIRCUMFLEX>.

(2)
OK, suppose the NFC of <dotless-i, circumflex> were <i-circumflex>.
If the NFC of one string is equal to the NFC of another string,
the two strings are called canonically equivalent.
Similarly, if the NFKC forms of two strings are equal,
the strings are called compatibility equivalent.
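
As a minimal sketch of these two definitions, the comparison can be
written directly with NFC and NFKC from Unicode::Normalize. The helper
names canon_equiv and compat_equiv are hypothetical and are not part of
the module; the ligature example is only there to show a pair that is
compatibility equivalent without being canonically equivalent.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFKC);

# Hypothetical helpers for illustration; not part of Unicode::Normalize.
# Two strings are canonically equivalent if their NFC forms are equal,
# and compatibility equivalent if their NFKC forms are equal.
sub canon_equiv  { my ($s, $t) = @_; return NFC($s)  eq NFC($t)  ? 1 : 0 }
sub compat_equiv { my ($s, $t) = @_; return NFKC($s) eq NFKC($t) ? 1 : 0 }

my $composed   = "\x{00EE}";   # LATIN SMALL LETTER I WITH CIRCUMFLEX
my $decomposed = "i\x{0302}";  # i + COMBINING CIRCUMFLEX ACCENT
my $ligature   = "\x{FB01}";   # LATIN SMALL LIGATURE FI

# Only ASCII is printed, so no output encoding layer is needed.
print "composed vs decomposed, canonical: ",
    canon_equiv($composed, $decomposed), "\n";   # 1
print "ligature vs 'fi', canonical: ",
    canon_equiv($ligature, "fi"), "\n";          # 0
print "ligature vs 'fi', compatibility: ",
    compat_equiv($ligature, "fi"), "\n";         # 1
```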

Then <dotless-i> would have to be either canonically or compatibility
equivalent to <i>, since <i-circumflex> would be the NFC (or NFKC) of
<dotless-i, circumflex> as well as of <i, circumflex>.
In that case, users of Turkish and some other languages could no
longer use the two letters with different meanings.
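
A small check along these lines should show that the two letters stay
distinct under normalization. This is only a sketch: the codepoints
helper is a hypothetical name introduced here to print ASCII-only
output, and the expected results rest on U+0131 having no decomposition
mapping in the Unicode Character Database.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFKC);

# Hypothetical helper: render a string as a list of U+XXXX code points.
sub codepoints {
    my ($s) = @_;
    return join " ", map { sprintf("U+%04X", ord($_)) } split //, $s;
}

# <i, circumflex> composes to the precomposed letter.
print codepoints(NFC("i\x{0302}")), "\n";         # U+00EE

# <dotless-i, circumflex> has no precomposed form, so it stays as is.
print codepoints(NFC("\x{0131}\x{0302}")), "\n";  # U+0131 U+0302

# <dotless-i> is not even compatibility equivalent to <i>.
print NFKC("\x{0131}") eq NFKC("i") ? "equivalent\n" : "distinct\n";  # distinct
```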

Japanese people also use <i-circumflex> in ROMAJI, the Latin
transliteration of Japanese, for a long "i". (A long "i" is usually
written as "ii" or <i-macron>, though.)
If <i-circumflex> were <dotless-i> with <circumflex> rather than
<i> with <circumflex>, then <i-circumflex> would be a long
<dotless-i> sound, not a long "i".
That would also be surprising.  :)

Regards,
SADAHIRO Tomoyuki
