Hi,

I have a new revision of my I-D:

  Improving ACE using code point reordering v1.0
  http://www.postel.co.kr/lsb-ace-01.txt
  ( sources included for DUDE and AMC-W )

It reports 30%~58% improvements in compressing ACE labels of typical
Han/Hangul business names in CJK.

Reordering v1.0 is based on character frequency plus WORD ADJACENCY
statistics for modern Han/Hangul business names, and on the fact that the
256 most frequent Han letters have a cumulative usage frequency of nearly
60% ( about 80% for the 256 most frequent Hangul syllables ).

I applied this reordering to both DUDE and AMC-ACE-W, and found that DUDE
outperforms AMC-ACE-W even for Han/Hangul.

I think tricks, tuning, or heuristics that are not based on
language-specific knowledge are not enough to reach the 'ceiling' ACE
compression ratio.

I propose: "Let languages compress themselves in a new reordering layer
between NAMEPREP and ACE, and leave ACE encoding as simple as possible."

Your careful evaluation and feedback, please.

Thanks.

Soobok, [EMAIL PROTECTED]

-------------------------------------------------------------------------------

Example Strings

About 30%~58% improvement in DUDE compression ratio is achieved in these
Hangul examples. LDUDE and LAMCW denote reordering-applied DUDE-02 and
AMC-ACE-W, respectively ( AMCW stands for AMC-ACE-W ). Most examples show
LDUDE outperforming LAMCW.
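A minimal sketch of what a frequency-based reordering layer between NAMEPREP and ACE could look like. It is only an illustration of the idea described above, not the tables or algorithm from lsb-ace-01.txt: the ranking list, the base code point, and the function names (build_reorder_table, reorder, invert) are all assumptions made for this example. The most frequent code points are moved to a compact range so that a delta-style ACE sees small, similar values, and the mapping stays a bijection so it can be undone on decode.

```python
def build_reorder_table(ranked, base):
    """Return a dict that permutes code points so ranked[i] maps to base + i.

    `ranked` is a hypothetical frequency ranking (most frequent first);
    the real draft derives its tables from usage statistics.
    """
    if len(set(ranked)) != len(ranked):
        raise ValueError("ranked code points must be unique")
    targets = [base + i for i in range(len(ranked))]
    table = dict(zip(ranked, targets))
    # Code points pushed out of the target range take the slots that the
    # ranked code points vacated, which keeps the mapping a bijection.
    displaced = sorted(set(targets) - set(ranked))
    vacated = sorted(set(ranked) - set(targets))
    table.update(zip(displaced, vacated))
    return table


def reorder(text, table):
    """Apply the permutation; code points not in the table are unchanged."""
    return "".join(chr(table.get(ord(ch), ord(ch))) for ch in text)


def invert(table):
    """Build the inverse permutation used when decoding."""
    return {dst: src for src, dst in table.items()}


if __name__ == "__main__":
    # Assumed ranking for illustration only: four syllables that appear in
    # the example strings, packed at the start of the Hangul block.
    table = build_reorder_table([0xD55C, 0xAD6D, 0xC774, 0xC758], 0xAC00)
    label = "\ud55c\uad6d\uc5b4"
    packed = reorder(label, table)
    print([hex(ord(ch)) for ch in packed])
    print(reorder(packed, invert(table)) == label)
```

The reordered string would then be handed to the unchanged ACE encoder, and the decoder would apply the inverse table after ACE decoding.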
(K1) Korean String 1: ( 24 hangul syllables )

  u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4 u+C774 u+D55C
  u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C u+B2E4 u+BA74 u+C5BC u+B9C8
  u+B098 u+C88B

  DUDE-02 : 6txiy79ny53nz79a8wizwwnzzuavyizv3atuuiz2vby27jz66iz8sit\
            usauiyz5i23az96iz6ze3xaz2td ( 82 chars )
  LDUDE   : 5suhxb9jt2pydtwetwkxhtsrxhbyhvsmvvk7r2ityd6atqt8etvittk
            ( 55 chars, 32.9% shorter )
  AMCW    : 6tvifgem42ixihhakfnh6nhhem5wrk6fmpmpwim6m5wrmwxn5u8eivw\
            mp6iqige2nem ( 67 chars )
  LAMCW   : 5swhtg8r5tycsb5swfgirxi5sxhsabyg5vypgcz2isa5tyd4d5p5sxj\
            gmbgd5 ( 61 chars )

(K2) Korean String 2: ( 9 hangul syllables )
     <KRNIC in korean>

  U+D55C U+AD6D U+C778 U+D130 U+B137 U+C815 U+BCF4 U+C13C

  DUDE-02 : 7xvNz2vBy4tFtywIyssHz3uCzw8Bz76I ( 32 chars )
  LDUDE   : 5syAB3BIJ7BB7N ( 14 chars, 56.2% shorter )
  AMCW    : 7xxNFmpM52QjsGjzNaxJhwKj6 ( 25 chars )
  LAMCW   : 5ssAsB3AIBwAB3P ( 15 chars )

(K3) Korean String 3: ( 18 hangul syllables )

  U+C804 U+AD6D U+C2E4 U+C9C1 U+B178 U+C219 U+C790 U+B300 U+CC45 U+C885
  U+AD50 U+C2DC U+BBFC U+B2E8 U+CCB4 U+D611 U+C758 U+D68C

  DUDE-02 : 62yEyxyJy92J5uFz25JzvyBx2Jzw3Az9wFw6Ayx7Fy92Nz3uA3tEz8\
            xNt44FttwJtt7E ( 68 chars )
  LDUDE   : 5szAtBtvBt7Mt2Qv4Qu7KtFt5It3MuEvAtvDyJCtuC4G4J
            ( 46 chars, 32% shorter )
  AMCW    : 62sEFmpKzeNqbGm2Ks3M6sG2aPcfNefFksKy6I96GziPfwRstM42Rwn
            ( 55 chars )
  LAMCW   : 5stAsB5tvAGhmGmgG2mGatsE5t7JGbhsDvD5tsAyIK5swJ8RwG
            ( 50 chars )

(K4) Korean String 4: ( 7 hangul syllables )
     <Hynics Semiconductor in korean>

  U+D558 U+C774 U+B2C9 U+C2A4 U+BC18 U+B3C4 U+CCB4

  DUDE-02 : 7xvItuuNzx5PzsyPz85N97Nz9zA ( 27 chars )
  LDUDE   : 5s3C4F5Q7PtwRtMK ( 16 chars, 40% shorter )
  AMCW    : 7xxIM5wGyjKxeJa2G8ePfw ( 22 chars )
  LAMCW   : 5s9CxH8JvE5tzMyAK ( 17 chars )

(K5) Korean String 5: ( 13 hangul syllables )

  U+D658 U+ACBD U+C6B4 U+B3D9 U+C5F0 U+D569 U+BC18 U+D575 U+D2B9 U+BCC4
  U+C704 U+C6D0 U+D68C

  DUDE-02 : 7yvIz48Fy4sJzxyPzyuJts3Jy3zBy3yPz6Ny8zPz56At7EtsxN
            ( 50 chars )
  LDUDE   : 5s7NB4EDvHFtxDv5Kv6NtIt4R5GwK ( 29 chars, 42% shorter )
  AMCW    : 7yxIFf7MxwG83MrsRmjJa2RmxQx3JgeM2eMysRwn ( 40 chars )
  LAMCW   : 5s5N5PtJKuPI5tzMGybGiptF5s5KsNwG ( 32 chars )

About 35%~50% improvement in DUDE compression ratio is achieved in these
UniHan examples.

(TC1) Traditional Chinese String 1: ( 16 letters )

  u+5354 u+91c7 u+5065 u+5eb7 u+4e8b u+696d u+670d u+52d9 u+7db2 u+002d
  u+5354 u+91c7 u+6709 u+9650 u+516c u+53f8

  DUDE-02 : xvve6u3d6t4c87ctsvnuz8g8yavx7eu9ym-u88g6u3d9y6q9txj6z\
            vnu3e ( 58 chars )
  LDUDE   : xs8qy7ny9jhyi6f6bb8h-4iy7nyxkbed ( 32 chars, 44.8% shorter )
  AMCW    : xvxen8huyfafzs2mc5pcipw7jh7u--xxen8hcijqcsvynx9i ( 48 chars )
  LAMCW   : xs2q2xcu4m4n6esb6abug--2q2xcusijpq ( 34 chars )

(TC2) Traditional Chinese String 2: ( 21 letters )

  u+5317 u+4eac u+5e02 u+91ab u+85e5 u+7d93 u+6fdf u+6280 u+8853 u+7d93
  u+71df u+516c u+53f8 u+5fa1 u+91ab u+7db2 u+7d61 u+83ef u+91ab u+7db2
  u+8def

  DUDE-02 : xvzht75mts4q694jtwwq92zgtuwn7xr847d9x6a6wnus5du3e6xj6\
            8sk86tj7d982qtuwe86tj9sxp ( 78 chars )
  LDUDE   : xtwicfz6b99a38g27c2vdd8cz7mzuqdt6izuiy6iz5nz5fy6by6ib
            ( 53 chars, 32.0% shorter )
  AMCW    : xvths4naacn7mj9fh6veq9beakuvh6ve89vynx9iapbn7mh7uyb2v\
            8rn7mh7um9r ( 64 chars )
  LAMCW   : xtuiukr28q5tqu9i4ukutjk9i3uduspqv6g28quug33kuur28quugh
            ( 54 chars )

(TC3) Traditional Chinese String 3: ( 18 letters )

  u+795e u+8fb2 u+7db2 u+990a u+8eab u+4fdd u+5065 u+7db2 u+5065 u+5eb7
  u+4e16 u+754c u+5065 u+5eb7 u+8a2d u+8a08 u+5bb6 u+60e0

  DUDE-02 : z3vq9y8n9usa8w5itz4b6tzgt95iu77hu77h87cts4bv5xkuxuj87\
            c7w3kuf7t5qv5xg ( 68 chars )
  LDUDE   : xwsiw5e9kzyqz8fhb2p2phtvgxtbwuah8qbtwmyg
            ( 40 chars, 41.1% shorter )
  AMCW    : z3xqnpuh7uq2knfmt7puyfh7uuyfafzstgf4nuyfafzmbpsi75gys\
            8a ( 55 chars )
  LAMCW   : xwyiu7nug3wiu4pkmug4mnv3ky2mu4mnwcdvsiyq ( 40 chars )

(SC1) Simplified Chinese String 1: ( 16 letters )
      <ministry of foreign trade and economic cooperation, PRC>

  u+4e2d u+534e u+4eba u+6c11 u+5171 u+548c u+56fd u+5bf9 u+5916 u+8d38
  u+6613 u+7ecf u+6d4e u+5408 u+4f5c u+90e8

  DUDE-02 : w8wpt7ydt79euu4mv7yax9puzb7seu8r7wuq85umt27ntv2bv3wgt\
            5xe795e ( 60 chars )
  LDUDE   : xswjuzru6nu7fv7kv4gutrwgb7mbwiu6cuzqqxm
            ( 39 chars, 35.0% shorter )
  AMCW    : w8up29ps5kdst5uh7ygsup29pm3cb39n8tknpb39hkygswhdysupa\
            qd ( 55 chars )
  LAMCW   : xsujwxgu3kwwrv3fwvduunykm5ab9jwvmuwfmta ( 39 chars )

(SC2) Simplified Chinese String 2: ( 18 letters )

  u+4e2d u+56fd u+4eba u+6c11 u+5927 u+5b66 u+4e2d u+56fd u+8d22 u+653f
  u+91d1 u+878d u+653f u+7b56 u+7814 u+7a76 u+4e2d u+5fc3

  DUDE-02 : w8wpt27at2whuu4mvxvguwbtxwmt27a757r82tp9w8qtyxn8u5ct\
            8yjvwcuycvwxmtt8q ( 69 chars )
  LDUDE   : xswjf5gu7fu6rb4ifz8dx6ju8gnu8kwugy8fd8rd
            ( 40 chars, 42.2% shorter )
  LAMCW   : xsujun3kwwru2abujn36rwsgu8anwsg2uau6fgujk ( 41 chars )

About 20%~35% improvement in DUDE compression ratio is achieved in these
Japanese Kanji/Katakana examples.

(JP1) Japanese String 1: ( 25 letters )

  U+793E U+56E3 U+6CD5 U+4EBA U+65E5 U+672C U+30CD U+30C3 U+30C8 U+30EF
  U+30FC U+30AF U+30A4 U+30F3 U+30D5 U+30A9 U+30E1 U+30FC U+30B7 U+30E7
  U+30F3 U+30BB U+30F3 U+30BF U+30FC

  DUDE-02 : z3xQu97Pv4vGuuyRu5xRu6Jxz8BQMuHtDxDMxHuGzNwItPwMxAtE\
            wIwIwNwD ( 60 chars )
  LDUDE   : xs8Nu2Cu4RvMGBysxGyCKtHtQCPFtAyPyKtPBGPyAyAyFyR
            ( 47 chars, 21.6% shorter )
  AMCW    : z3vQ28DDyxs5KB9fCjnvs6P6DI8R9N4RE9D7F4J8B9N5H8H9D5M9\
            D5R9N ( 57 chars )
  LAMCW   : xs2NwsQu4B3KNPvs6M4JD5E4KIFA5A7P5H4KMPA6A4A6F4K ( 47 chars )

(JP2) Japanese String 2: ( 16 letters )

  U+8CA1 U+56E3 U+6CD5 U+4EBA U+5317 U+6D77 U+9053 U+81EA U+7136 U+4FDD
  U+8B77 U+63A8 U+9032 U+5354 U+4F1A

  DUDE-02 : 266B74wCv4vGuuyRt74Pv8yA97uEtt5J9s7Nv88M6w4K827R9v3K\
            6vyGt6wQ ( 60 chars )
  LDUDE   : xs3Hu9Ju4RvMt5CFvuGvsRxtGw5Iz2Ev6BzIwtJE
            ( 40 chars, 33.3% shorter )
  AMCW    : 264B28DDyxs5KxtHD5zNuvI9kE3yt7PMmzBpiNtuxxEttK ( 46 chars )
  LAMCW   : xs9HwsQu4B3KvuIPwsMvsEytCu4K3uQy8R3Hu2QK ( 40 chars )

(JP3) Japanese String 3: ( 16 letters )

  U+6771 U+4EAC U+90FD U+60C5 U+5831 U+30B5 U+30FC U+30D3 U+30B9 U+7523
  U+696D U+5065 U+5EB7 U+4FDD U+967A U+7D44 U+5408

  DUDE-02 : yztBu37P78xB9svIv29Ey22EwJuRyKwx3Kt6wQv3sI87CttyK734\
            H85vQu3wN ( 61 chars )
  LDUDE   : xttHxPvtFu9CDyssAyEyHyRys9PxQ4KHGEu4CuwJ
            ( 40 chars, 34.4% shorter )
  AMCW    : z3vQ28DDyxs5KB9fCjnvs6P6DI8R9N4RE9D7F4J8B9N5H8H9D5M9\
            D5R9N ( 57 chars )
  LAMCW   : xs2NwsQu4B3KNPvs6M4JD5E4KIFA5A7P5H4KMPA6A4A6F4K ( 47 chars )

LDUDE-2 shows the same good compression ratio for the Latin family of
scripts.

(L1) Vietnamese: ( 38 syllables using diacritical marks )

  Ta<dotbelow>isaoho<dotbelow>kh<ocirc>ngth<ecirc><hookabove>chi\
  <hookabove>no<acute>iti<ecirc><acute>ngVi<ecirc><dotbelow>t

  U+0054 u+0061 u+0323 u+0069 u+0073 u+0061 u+006F u+0068 u+006F u+0323
  u+006B u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+00EA u+0309 u+0063
  u+0068 u+0069 u+0309 u+006E u+006F u+0301 u+0069 u+0074 u+0069 u+00EA
  u+0301 u+006E u+0067 U+0056 u+0069 u+00EA u+0323 u+0074

  DUDE-02 : vEvfvwcvwktktcqhhvwnvwid3n3kjtdtn2cv8dvykmbvyavyhbvyqv\
            yitptp2dv8mvyrjvBvr2dv6jvxh ( 82 chars )
  LDUDE   : uGuh5c5kckqhh5n4atm3n3ktmtdq2cxd7kmb7a7hb7q7irr2dxm7rt\
            muDvr2dvj5f ( 66 chars, 16 chars (19%) shorter )

(L2) Spanish: ( using basic Latin & Latin Supplement )

  Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol

  U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070 u+0075
  u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070 u+006C u+0065
  u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061 u+0062 u+006C u+0061
  u+0072 u+0065 u+006E U+0045 u+0073 u+0070 u+0061 u+00F1 u+006F u+006C

  DUDE-02 : vAvrtpde3n2hbtrftabbmtptketptnjiimtktbpjdqptdthmuMvgdt\
            b3a3qd ( 61 chars )
  LDUDE   : uAurftmtg2q2hbrhcbbmfcepnjiimidpjdqpmrmuMuqmb3a3qd
            ( 51 chars, 10 chars (16%) shorter )

(L3) Czech: ( using Latin Extended A )

  Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky

  U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074 u+011B
  u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D u+0065 u+0073
  u+006B u+0079

  DUDE-02 : vAuctptyctzpctptnhtyrtzfmibtjd3mt8atyitgtitc ( 45 chars )
  LDUDE   : uAukfycypkfepzpzfmibmtb3m8ayiqtik ( 34 chars, 24% shorter )
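For readers who want to recheck the "N% shorter" figures, the following small sketch shows the arithmetic they appear to use: the reduction in label length of LDUDE relative to DUDE-02. The function name percent_shorter is an assumption introduced here, and the inputs are the character counts quoted in the examples.

```python
def percent_shorter(before, after):
    """Relative length reduction, in percent, going from `before` to `after` chars."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # (K2): DUDE-02 is 32 chars, LDUDE is 14 chars.
    print(percent_shorter(32, 14))
    # (TC1): DUDE-02 is 58 chars, LDUDE is 32 chars.
    print(round(percent_shorter(58, 32), 1))
```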
