On Wed, Feb 20, 2002 at 07:52:19AM +0000, Nick Ing-Simmons wrote:
> If you - as perl's Big5 expert - say that that is the one to go with that
> is good enough for me.

Alright. I think we could just use Big5+'s First Standard Segment group as
the map here -- it's superior to both Tcl's and iconv's maps in several
places, and afaik has no obvious shortcomings.

> "compile" can take two forms - Tcl's .enc files which are packed UCS2
> values - and ICU's .ucm files which are human readable and commentable
> text files. (Compile can also convert between the two.)

http://autrijus.org/big5.enc.bz2 is the massaged map. The only adjustment
I made is to allow 00A0 and 00FA..00FF to retain their meaning, instead of
ruling them as 'unmapped' characters. The Big5+ spec is undefined on this
point, and keeping them makes conversion of legacy documents slightly easier.

Also, Encode.pm seems unable to handle '00xy' in the map, where 'x' has
its highest bit set. There are six such places:

    Big5    UCS2    Charname
    -----------------------------
    A150    00B7    MIDDLE DOT
    A1B1    00A7    SECTION SIGN
    A1D1    00D7    MULTIPLICATION SIGN
    A1D2    00F7    DIVISION SIGN
    A1D3    00B1    PLUS-MINUS SIGN
    A258    00B0    DEGREE SIGN

For example, decode('big5', "\xA1\x50") returns just "\xB7", instead of
the required UTF-8 expansion "\xC2\xB7".

Can this be fixed?

/Autrijus/
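
For reference, here is a minimal sketch of the UTF-8 form expected for the six code points in the table above. It is written in Python rather than Perl purely for illustration, and it does not use Encode.pm or the big5.enc map; the names HIGH_LATIN1 and utf8_bytes are hypothetical helpers, not part of any library. It only shows that every code point in the range 0080..00FF expands to two bytes in UTF-8.

```python
# Big5 byte pair -> UCS-2 value, copied from the table in the message.
HIGH_LATIN1 = {
    0xA150: 0x00B7,  # MIDDLE DOT
    0xA1B1: 0x00A7,  # SECTION SIGN
    0xA1D1: 0x00D7,  # MULTIPLICATION SIGN
    0xA1D2: 0x00F7,  # DIVISION SIGN
    0xA1D3: 0x00B1,  # PLUS-MINUS SIGN
    0xA258: 0x00B0,  # DEGREE SIGN
}


def utf8_bytes(code_point: int) -> bytes:
    """Return the UTF-8 encoding of a single Unicode code point."""
    return chr(code_point).encode("utf-8")


if __name__ == "__main__":
    for big5, ucs in HIGH_LATIN1.items():
        encoded = utf8_bytes(ucs)
        hex_form = " ".join(f"{b:02X}" for b in encoded)
        print(f"{big5:04X} -> U+{ucs:04X} -> {hex_form}")
```

A decoder that emits the single byte B7 for A150 is returning the Latin-1 form of the character, not the two-byte sequence C2 B7 that this sketch produces.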