Mark Davis <[EMAIL PROTECTED]> writes:
>>ICU's pedantic form
>
>The goal for ICU is to be charset neutral, and support all of the
>conversions that are in modern use. There are a large number of
>variants of character sets;

Fair enough - but as shipped (I downloaded it earlier this week) it
comes with a convrtrs.txt which maps MIME's EUC-JP onto something it
calls ibm-33722, which has the behaviour I reported at the start of
this thread.

>you can use the one you want.

It is not a question of which _I_ want - it is a question of which
one(s) CJK perl users want/expect/need. In so far as _I_ want any
particular one, it is the one which is going to match the X11 font
encoding, so I can in my naive westerner's way see what it looks like -
and I have not a clue which one that is ...

>See:
>
>http://oss.software.ibm.com/icu/charset/index.html

A huge list, and I don't see how to "grep" it for the provenance of the
tables (not that many seem to have any).

So can the experts - ideally native reading experts, not theorists -
tell me which ICU (or other open source) table(s) they
want/expect/need, or failing that which ones have proven troublesome?

There seem to be at least 4 EUC-JP mappings in that list: AIX, Solaris,
glibc and Java.

If we cannot get any answers "quickly" then I think Dan is correct - we
should un-bundle the whole CJK encoding stuff from the "core" into a
family of CPAN modules.

Which gives me a design choice:

A. Bundle a "pragmatic" set of CJK encodings which are fast and cause
   the least build pain for non-CJK users (i.e. compact precompiled
   form).
B. Make it as easy as possible for an end-user to drop in a new
   encoding from (say) a .ucm file.

I can obviously try for both - but they seem to be pulling in opposite
directions at present.

Meanwhile I will go fix the bugs in the core's :encoding logic ...

-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/
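
To make option B concrete, here is a rough sketch (in Python, purely for illustration, not the perl core's actual mechanism) of what "dropping in" a .ucm file amounts to: read the CHARMAP section and build lookup tables at run time. The names `parse_ucm`, `decode_bytes` and the `toy-euc` sample are hypothetical, the sample entries are made up rather than taken from any shipped table, and the reading of the `|0`/`|1`/`|3` flags (round trip, Unicode-to-bytes only, bytes-to-Unicode only) is my understanding of the ICU convention.

```python
import re

# Matches CHARMAP entries such as:  <U3042> \xA4\xA2 |0
_ENTRY = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)(?:\s+\|(\d))?"
)

def parse_ucm(text):
    """Return (decode, encode) dicts built from the CHARMAP section."""
    decode, encode = {}, {}
    in_map = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        if line == "CHARMAP":
            in_map = True
            continue
        if line == "END CHARMAP":
            break
        if not in_map:
            continue                          # header lines are ignored here
        m = _ENTRY.match(line)
        if not m:
            continue
        char = chr(int(m.group(1), 16))
        seq = bytes.fromhex(m.group(2).replace("\\x", ""))
        flag = int(m.group(3) or 0)
        if flag in (0, 1):                    # assumed: usable Unicode -> bytes
            encode[char] = seq
        if flag in (0, 3):                    # assumed: usable bytes -> Unicode
            decode[seq] = char
    return decode, encode

def decode_bytes(data, decode, max_len=2):
    """Naive longest-match decode; unmapped bytes become U+FFFD."""
    out, i = [], 0
    while i < len(data):
        for n in range(max_len, 0, -1):
            chunk = data[i:i + n]
            if chunk in decode:
                out.append(decode[chunk])
                i += len(chunk)
                break
        else:
            out.append("\ufffd")
            i += 1
    return "".join(out)

# Made-up three-entry table, for illustration only.
SAMPLE = r"""<code_set_name> "toy-euc"
<mb_cur_max> 2
CHARMAP
<U0041> \x41 |0
<U3042> \xA4\xA2 |0
<UFF3C> \xA1\xC0 |1
END CHARMAP
"""

decode, encode = parse_ucm(SAMPLE)
```

The tension with option A shows up directly: dictionaries built at load time like this are easy to swap, but they are neither compact nor fast compared with a table precompiled into the build.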