Re: strcoll and hiragana

Glenn Maynard Mon, 25 Feb 2002 19:24:38 -0800

Testing again, I probably messed up the ja_JP test:

09:54pm [EMAIL PROTECTED]/2 [~] export LC_ALL=en_US.UTF-8
09:54pm [EMAIL PROTECTED]/2 [~] sort
か
き
か
日本
綺麗
日本
eof
か
き
か
日本
日本
綺麗
09:54pm [EMAIL PROTECTED]/2 [~] export LC_ALL=ja_JP.UTF-8
09:54pm [EMAIL PROTECTED]/2 [~] sort
か
き
か
日本
綺麗
日本
eof
か
か
き
日本
日本
綺麗

So, the Han range (or at least the two words I'm trying) is being
ordered in both en_US and ja_JP, but hiragana is only being ordered in ja_JP.

On Tue, Feb 26, 2002 at 01:43:54AM +0100, Pablo Saratxaga wrote:
> Well, do you have LC_COLLATE definitions for those locales ?

Yep.  That shouldn't matter; there should still be a default ordering.
(I'm not saying the ordering is wrong; I'm saying there's no ordering at
all.)

> And language *DOES* matter.

All characters should be collated, regardless of the active LC_COLLATE.
(Obviously, the language affects which ordering is used, but there
should always *be* an ordering, even if it's arbitrary.)

Example (random characters; I have no idea what they mean):

10:14pm [EMAIL PROTECTED]/2 [~] cat blah
課
場
馬
輪
派
吉
10:09pm [EMAIL PROTECTED]/2 [~] export LC_ALL=en_US.UTF-8
10:10pm [EMAIL PROTECTED]/2 [~] sort blah
吉
場
派
課
輪
馬
10:10pm [EMAIL PROTECTED]/2 [~] export LC_ALL=ja_JP.UTF-8
10:10pm [EMAIL PROTECTED]/2 [~] sort blah
課
吉
場
派
馬
輪

Both are sorting, but the sort is different.  (That's OK.)  I think it's
sorting by UCS order in en_US.

This keeps things like "sort | uniq" working, and other things that
depend on sort really sorting.
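
To spell out the dependency: uniq only collapses duplicates that are
adjacent, so it relies on sort having put equal lines next to each other.
A minimal C sketch of that step, assuming a plain byte-wise comparison
(dedupe_adjacent is a hypothetical helper for illustration, not code taken
from uniq):

```c
#include <stddef.h>
#include <string.h>

/* Collapse runs of adjacent identical strings in place, the way
 * uniq treats adjacent lines. Returns the number of entries kept.
 * Hypothetical helper; byte-wise comparison is an assumption. */
size_t dedupe_adjacent(const char **lines, size_t n)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        /* Keep a line only if it differs from the last one kept. */
        if (kept == 0 || strcmp(lines[kept - 1], lines[i]) != 0)
            lines[kept++] = lines[i];
    }
    return kept;
}
```

Given the en_US output from the first test (か, き, か) this keeps all three
lines; given the ja_JP output (か, か, き) it keeps two.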

strcoll(3): "The strcoll() function returns an integer less than, equal to,
or greater than zero if s1 is found, respectively, to be less than, to match,
or be greater than s2, when both are interpreted as appropriate for the
current locale."

"か" is not a match for "あ" by any sane definition of "match".  Make up an
ordering, if necessary, but don't return 0.  (UCS order is an acceptable
ordering, as far as I'm concerned, when real tables aren't available.)
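
As a sketch of what "make up an ordering" could look like on the caller's
side (this is not what glibc does; coll_total_order is a name I made up):
use strcoll() first, and break ties with strcmp(). For UTF-8 strings,
byte-wise comparison gives the same order as comparing code points, so the
tie-break amounts to UCS order.

```c
#include <string.h>

/* Compare with the locale's collation, but never report two
 * byte-wise different strings as equal: ties from strcoll() are
 * broken by strcmp(). Hypothetical wrapper, for illustration only.
 * The program must have called setlocale(LC_COLLATE, "") for
 * strcoll() to follow the environment's locale. */
int coll_total_order(const char *s1, const char *s2)
{
    int r = strcoll(s1, s2);

    if (r != 0)
        return r;
    /* strcmp() compares bytes as unsigned char, which matches
     * code point order for UTF-8 input. */
    return strcmp(s1, s2);
}
```

With this, coll_total_order("か", "あ") is nonzero in any locale, and it
returns 0 only for byte-identical strings.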

Summary: the behavior for kanji in en_US looks OK to me.

> Here kanji collates in japanese, not in english (in other words, it
> behaves the same as kana)
> 
> I use glibc 2.2.4 and sort from textutils 2.0.17

I'm on glibc 2.2.5 and textutils 2.0.  (I assume sort just calls strcoll().)

The only thing I see as a problem is that hiragana and katakana are not
being collated at all, and I have no idea why that's happening.  It
seems obvious that strcoll("か","あ") returning 0 is incorrect.
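
One way to check this without going through sort is to call strcoll()
directly. A rough sketch (the function name is made up, and the locale
name passed in has to be installed on the system):

```c
#include <locale.h>
#include <string.h>

/* Returns 1 if strcoll() in the given locale reports two byte-wise
 * different strings as equal, 0 if it does not, and -1 if the
 * locale could not be selected. Hypothetical probe for illustration. */
int collates_distinct_as_equal(const char *locale,
                               const char *s1, const char *s2)
{
    if (setlocale(LC_COLLATE, locale) == NULL)
        return -1;
    return strcmp(s1, s2) != 0 && strcoll(s1, s2) == 0;
}
```

Calling it with "en_US.UTF-8", "か" and "あ" would show whether the 0 comes
from strcoll() itself or from something sort does on top of it.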

(Hmm.  This brings up a question about multilingual collation, but I
think I'll bump that into another thread.)

-- 
Glenn Maynard
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
