At 17:39 01/10/19 +0900, Soobok Lee wrote:
>----- Original Message -----
>From: "Martin Duerst" <[EMAIL PROTECTED]>
>
> > >1) saturations in TLD namespaces would require longer names for which
> > >   REORDERING is designed to give greater benefits/compression ratio.
> >
> > No. What James referred to is that saturation tends to fill up the
> > short name slots, and thus flatten the probability distribution.
> > I.e. if somebody doesn't get the name they wanted, the chance is
> > that they go for something like xq.com, because it's easy to
> > remember because it's short. Neither x nor q is a very frequent
> > letter.
>
>Han/hangeul characters carry meanings while latin alphabets
>denote phonemes. Therefore your analogy between latin and han domains
>may be false. Chinese people would rather choose to register
>digit-added variants of already taken desired domains in saturated ML.com,
>instead of choosing nonsense irrelevant rare han characters.

Some really rare and irrelevant han characters may indeed never
be chosen. But still, if you want to name a company, there are many
different possibilities, and people will look for short, not yet
used possibilities (which still make some sense) rather than use
longer and longer names.

>Later, I will provide some proofs that SC and TC only have
>a small partial set of frequent characters. That's already clear in
>the SJIS and KSC5601 han character sets, whose size is less than 5000.

Yes, this is true.

> > >to avoid country-specific biases in han reordering table.
> > >
> > >non-CJK scripts often have a small set of basic alphabets, and their
> > >character usage patterns are more stable than those for han/hangeul.
> >
> > No, many other scripts are used for many more languages, with
> > quite different usage patterns. (A lot of Han usage in Japan,
> > and most of it in Korea, is due to loanwords from Chinese.)
> >
>
>But, even without Urdu consideration in
>arabic reordering, the efficiency of reordering is always better than
>without it, because the lexicographic ordering in un-reordered
>arabic script block can be regarded as *RANDOM* ordering
>in frequency measure (maximum entropy).

It's probably not, because most alphabets contain a few 'late
additions'. And just using first-order frequency to bring the most
frequent characters to the front may not be the most efficient
way for compression.

>Partial reordering (without Urdu consideration) is always better than
>no reordering.

I don't deny that you may be able to squeeze out a few bits.
But I don't think that should be the aim of this exercise.

>If Urdu text samples are available, my arabic reordering table may be
>improved to reflect them, though.

Which might then make it less efficient for Arabic.

Regards, Martin.
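
The trade-off argued over above (a reordering table tuned to one language's character frequencies, then applied to another language in the same script) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the two sample strings are invented stand-ins for two languages, `reorder_table` and `mean_rank` are hypothetical helpers, and "average table index" is only a crude proxy for encoded length, not the cost model of any actual encoding proposal.

```python
from collections import Counter


def reorder_table(sample):
    """Characters of `sample`, most frequent first (ties broken by code point)."""
    counts = Counter(sample)
    return sorted(counts, key=lambda ch: (-counts[ch], ch))


def mean_rank(text, table):
    """Average index in `table` of the characters of `text`.

    Characters missing from the table are ranked just past its end.
    A lower value means frequent characters sit nearer the front.
    """
    rank = {ch: i for i, ch in enumerate(table)}
    return sum(rank.get(ch, len(table)) for ch in text) / len(text)


# Invented toy samples: two "languages" sharing one four-letter script.
lang_a = "ccccaaabbd"
lang_b = "ddddbbbaac"

# Baseline: the script block in its plain code point order.
alphabetical = sorted(set(lang_a) | set(lang_b))

# First-order frequency table built from language A only.
table_a = reorder_table(lang_a)

print(mean_rank(lang_a, alphabetical), mean_rank(lang_a, table_a))
print(mean_rank(lang_b, alphabetical), mean_rank(lang_b, table_a))
```

In this toy case the table built from language A lowers the average index for A (1.3 to 1.0) but raises it for B (1.7 to 2.0). It shows only that such a table can help one language and hurt another; it says nothing about how large either effect is for real Arabic or Urdu text.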
