On Tue, Jan 31, 2012 at 06:45:12PM +1100, I wrote:

> Apparently, the mapping from a string of Kanji to its pronunciation
> (ordering) isn't even a deterministic operation, at least for proper
> names.

(Of course I meant "proper nouns". Actual non-determinism might even be limited to proper nouns, though I'm not sure that changes anything from a coding point of view.)

> Thus, the solution would have to involve supplying pronunciations somehow
> for at least some glossary entries.

More precisely, it follows that sorting Kanji entries by pronunciation would in general require supplying pronunciations for some entries.

However, I don't want my unclear wording to contribute to wrong conclusions about what Publican actually requires: I'm not in a position to say whether Publican requires index or glossary entries involving Kanji to be sorted by contextually-correct pronunciation. All I've learnt over the past couple of days is that *outside of* a book index or glossary, Kanji are sorted sometimes by contextually-correct pronunciation and sometimes by some other order (and I think there's even more than one such alternative).

If anyone wants a concrete sample for an "is this output acceptable" question (and isn't using software written specifically for Japanese sorting, like Lingua::JA::Sort::JIS), then I suggest making sure that the collation function is tailored for a Japanese locale (e.g. using Unicode::Collate::Locale->new(locale => 'ja-JP')). Without that, collation software is unlikely to try to use a specifically Japanese ordering of Kanji characters or to intersperse Katakana with Hiragana. In particular, the documentation for plain Unicode::Collate is explicit that it doesn't intersperse Katakana with Hiragana, and that its Kanji ordering is simply by Unicode block and code point rather than by a JIS ordering.

So I think the easiest change with a good chance of getting a "yes, this is acceptable" answer would be to switch from Unicode::Collate to Unicode::Collate::Locale and pass locale => $LANG to the constructor (where $LANG is the Publican language, like en-US or ja-JP).

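To make "interspersing Katakana with Hiragana" concrete, here is a rough sketch in Python (not Perl, and not using either of the modules above). It only contrasts a naive code point sort with a sort that folds Katakana onto Hiragana before comparing; it is not how Unicode::Collate or Unicode::Collate::Locale work internally, and it says nothing about Kanji. The helper names and the sample entries are made up for illustration.

```python
# Illustration only: contrasts plain code point order with an order that
# intersperses Katakana and Hiragana. Not the algorithm used by any
# real collation module.

KATAKANA_START = 0x30A1  # KATAKANA LETTER SMALL A
KATAKANA_END = 0x30F6    # KATAKANA LETTER SMALL KE
KANA_OFFSET = 0x60       # code point distance from Katakana to Hiragana


def fold_katakana(text):
    """Map each Katakana letter to its Hiragana counterpart (hypothetical helper)."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_START <= ord(ch) <= KATAKANA_END
        else ch
        for ch in text
    )


def interspersed_key(text):
    """Sort key: folded kana first, original string as a tie-break."""
    return (fold_katakana(text), text)


# Made-up sample entries mixing the two kana scripts.
entries = ["さくら", "カメラ", "いぬ", "アイコン", "かさ"]

# Plain code point order: every Hiragana entry precedes every Katakana one.
by_code_point = sorted(entries)

# Folded order: the two scripts are interleaved by sound.
by_folded_kana = sorted(entries, key=interspersed_key)

print(by_code_point)   # ['いぬ', 'かさ', 'さくら', 'アイコン', 'カメラ']
print(by_folded_kana)  # ['アイコン', 'いぬ', 'かさ', 'カメラ', 'さくら']
```

A list in the second order is the kind of sample I'd expect to be more useful for an "is this output acceptable" question, though whether it is actually acceptable is still for a native reader to say.
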
Effect on other languages:

Switching to a locale-sensitive collator might also give a better collation for Indic languages (handling of virama, and some related reordering rules). If applied to Spanish indexes, on the other hand, note that it might move entries like chkconfig from near the beginning of the C entries to just before D; it's not clear to me whether that's a good or bad thing for a word like chkconfig, which isn't even Spanish and thus arguably isn't using the Spanish ch digraph.

(In both cases, I haven't actually tested the behaviour, nor have I asked a native speaker for their preferences for index/glossary sorting in technical documentation.)

pjrm.

_______________________________________________
publican-list mailing list
https://www.redhat.com/mailman/listinfo/publican-list
Wiki: https://fedorahosted.org/publican
