Re: Unicode, SMS and year 2012
While the authors of HTML5 gave good reasons for ignoring SCSU or BOCU-1, excluding UTF-32, which is the most direct, one-to-one mapping of Unicode code points to byte values, seems shortsighted. We are talking about the whole of Unicode, not just the BMP.

/Sz

On Sat, Apr 28, 2012 at 21:48, Doug Ewell d...@ewellic.org wrote:

anbu at peoplestring dot com wrote:

What are some of the reasons a new encoding will face challenges?

The main challenge to a new encoding is that UTF-8 is already present in numerous applications and operating systems, and that any encoding intended to serve as an alternative to, let alone a replacement for, UTF-8 must be sufficiently better to justify re-engineering these systems.

Some people are simply opposed to additional encoding schemes. The HTML5 specification explicitly forbids the use of UTF-32, SCSU, and BOCU-1 (while allowing many non-Unicode legacy encodings and quietly mapping others to Windows encodings); one committee member was quoted as saying that other encodings of Unicode waste developer time.

Any encoding that does not align code point boundaries along byte boundaries will be criticized for requiring excessive processing. The argument that I made will be made by others: if it is necessary to process bit by bit, one might as well use a general-purpose compression algorithm.

It is popular to present gzip as the ideal compression approach, since it is widely available, especially on Linux-type systems, and publicly documented (and not IP-encumbered).

I may have missed some other objections.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell
Re: Unicode, SMS and year 2012
On 04/28/2012 07:54 AM, a...@peoplestring.com wrote:

I apologise for my poor explanation. I further assure you that the codes are not magically created; they are created by the EBNF below. I regenerated the EBNF to make myself as clear as possible; in fact, there are now two:

1(0|1){1(0|1)}{0(0|1)}0(0|1)1(0|1)
1(0|1){0(0|1)}{1(0|1)}1(0|1)0(0|1)

These oft-repeated incomprehensible strings of symbols would be a whole lot more intuitively understandable if, say, you were to use a _different_ symbol for "either 0 or 1" and not (0|1) (and maybe some spaces to split it up for the eye), and/or there were an actual *explanation* of what they meant, as in:

1 X {1X}... {0X}... 0 X 1 X
1 X {0X}... {1X}... 1 X 0 X

and words like ...

The bits in odd-numbered positions [counting from zero] can be either value and hold the data being transferred; in the even-numbered positions the first [zeroth] bit is 1, followed either by a string of 1s, then 0s, ending with 0 1; or else a string of 0s, then 1s, ending with 1 0.

Or something like that, maybe done better. My eyes glaze over at the sight of what looks like a random selection out of [{}10|()]*, and I'm probably not the only one.

~mark
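The reading mark proposes can be checked mechanically. What follows is only a sketch, assuming a code is handed over as a string of '0' and '1' characters; the function names are hypothetical and not part of the proposal in this thread. It rewrites the two EBNF expressions as regular expressions (EBNF `{...}` means zero or more repetitions) and separates the even-position framing bits from the odd-position data bits.

```python
import re
from typing import Tuple

# The two EBNF patterns from the message above, with {...} written as (?:...)*.
# [01] plays the role of mark's "X": a data bit that may take either value.
PATTERN_A = re.compile(r"1[01](?:1[01])*(?:0[01])*0[01]1[01]")
PATTERN_B = re.compile(r"1[01](?:0[01])*(?:1[01])*1[01]0[01]")


def is_valid_code(bits: str) -> bool:
    """Return True if the whole bit string matches either pattern."""
    return bool(PATTERN_A.fullmatch(bits) or PATTERN_B.fullmatch(bits))


def split_code(bits: str) -> Tuple[str, str]:
    """Split into (framing bits, data bits), counting positions from zero.

    Even positions carry the framing bits, odd positions carry the data.
    """
    return bits[0::2], bits[1::2]
```

For instance, `split_code("1110010011")` separates the framing bits `11001` (a 1, a run of 1s, a run of 0s, then 0 1) from the data bits `10101`. Note that both patterns require at least three bit pairs, so a four-bit string such as `1010` is rejected.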
Re: Kaktovik Inupiaq numerals
On 2012-04-28 12:50, Richard Wordingham richard.wording...@ntlworld.com wrote:

On Fri, 27 Apr 2012 13:50:15 -0700 Ken Whistler k...@sybase.com wrote:

On 4/27/2012 10:45 AM, Richard Wordingham wrote:

If they are to be adopted by the CLDR, the digits need to be coded consecutively.

I doubt this matters in any case, because this proposed use is for a vigesimal system, which has digits 0..19, not digits 0..9. Trying to treat the first 10 digits as decimal digits in CLDR could accomplish nothing, IMO.

I don't believe the exclusion of non-decimal bases is set in stone. So, while they wouldn't fit into CLDR as it stands now, it would not take a huge change to add them.

CLDR used to require sequentially encoded decimal digits, but my understanding is that that is no longer the case. And indeed, the numeral systems need not be decimal, or even positional. Roman numerals are supported, as are (e.g.) Armenian numerals, and traditional Chinese numerals (non-positional, using multiplier words). While vigesimal systems aren't supported (in CLDR) to the degree that any got *named*, in the way some other systems have been, there is still *some* support. See e.g. http://unicode.org/cldr/trac/browser/trunk/common/rbnf/nci.xml (a full-fledged vigesimal system in those rules) for spelling out numbers as words in Classical Nahuatl. There is also http://unicode.org/cldr/trac/browser/trunk/common/rbnf/kl.xml, for spelling out numbers in Kalaallisut (Greenlandic), but it is not full-fledged vigesimal.

These RBNF rules are based on what I could find out from sources on the web a few years ago. If anyone has corrections, extensions, or variations to these, or additions for other languages using vigesimal systems (yes, I did see that there was some data on the Wikipedia pages referenced), please send them to me, preferably with contact information for someone in the know, and I'll see what I can do. I cannot use vigesimal digits, though, since none are as yet encoded.
But if some set of vigesimal digits were to be encoded, supporting them via RBNF would likely be the first point of support in CLDR.

Furthermore, what Inuit has is a vigesimal *counting* system, as the article indicates. But this innovated set of numerals is attempting to turn this into a full-blown radix-20 numerical system, which I doubt has any cultural validity.

I presume you are talking about how the hundreds are (or were) traditionally expressed.

The Inuit number system is another case of the rather widespread use of mixed 5/20 counting systems, which count 4 hands of 5 into groups of 20.

Indeed, it immediately made me think of Welsh, where native-speakers' use of their vigesimal system has been hammered by the use of Arabic numerals. (In England, resistance to this 'heathen notation' collapsed long ago.) Before anyone points it out, I do know that Welsh _pymtheg_ '15' and possibly even _ugain_ '20' ultimately derive from a (superseded) decimal system. However, Welsh goes decimal at 100, so this vigesimal notation would not match the language at all for higher numbers.

I don't think combining diacritics makes sense in this case. Rather, this kind of construction is better handled by taking the graphic elements for 5, 10, and 15, and ligating them in a font for the combined units. So the only elements requiring encoding would be 0, 1, 2, 3, 4, 5, 10, 15, in order to fully represent this system.

No. One must be able to distinguish ONE, FIVE (= '25') and FIVE, ONE (= '101') from the notation for '6'. Or are you suggesting that rendering of ZWJ should be *essential* for the semantics, not just for acceptability?

While I would have liked to have seen the use of combining characters (or ligation) in certain other cases where it is not present in Unicode, I think that that approach would be very inappropriate here (this is for digits for use in a positional system). Just encode (when that time comes) each of the new digits corresponding to 0, ..., 19 *atomically*.
The Kaktovik digits are niftily designed though, with a logic in the (abstract) graphical design, and each of them can be drawn in a single pen stroke. They have found their way into some fonts (http://www.linguistsoftware.com/linup.htm#Kaktovik), and have some support from the Inuit Circumpolar Council (http://inuitcircumpolar.com/section.php?Nav=SectionID=10Lang=En).

/Kent K

The (undemonstrated) use of the notation denoting hands for which I suggested a combining diacritic could be handled by ligatures specified by ZWJ, but there could be a lot of them. Look at the ugly mess in New Tai Lue caused by not anticipating the need for medial 'v' because the UTC knew too little about Tai Lue (or even, more surprisingly, Northern Thai).

Richard.
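The distinction drawn in this message between ONE, FIVE (= 25), FIVE, ONE (= 101) and the single digit for 6 follows directly from reading the digits positionally in base 20. A minimal sketch, in which the helper names are hypothetical and plain integers 0..19 stand in for the digits, since none are encoded yet:

```python
from typing import List


def to_base20(n: int) -> List[int]:
    """Return the base-20 digits of a non-negative integer, most significant first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = []
    while True:
        n, digit = divmod(n, 20)
        digits.append(digit)
        if n == 0:
            break
    return digits[::-1]


def from_base20(digits: List[int]) -> int:
    """Interpret a most-significant-first sequence of digits 0..19 as a base-20 number."""
    value = 0
    for digit in digits:
        if not 0 <= digit < 20:
            raise ValueError("each digit must be in the range 0..19")
        value = value * 20 + digit
    return value
```

Here `to_base20(25)` gives `[1, 5]`, `to_base20(101)` gives `[5, 1]`, and `to_base20(6)` gives `[6]`. If 6 were written as the two separate elements FIVE and ONE, it could not be told apart from the two-digit sequence FIVE, ONE, which is the ambiguity that atomic encoding of all twenty digits avoids.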
Re: Unicode, SMS and year 2012
On 04/29/2012 12:38 PM, a...@peoplestring.com wrote:

Hi! I have noticed that I created the previous definitions in a hurry, to answer the question raised as quickly as possible. They are incomplete. I used the EBNF notation to express my encoding. Please refer to Wikipedia (especially the 'Table of symbols') or other sources on EBNF: http://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form#Table_of_symbols

I am creating a well-defined one.

Yes, I know about EBNF notation. I didn't say it was wrong. I just said it could be made a lot easier to follow and understand.

~mark
Re: Unicode, SMS and year 2012
Szelp, A. Sz. wrote:

Some people are simply opposed to additional encoding schemes. The HTML5 specification explicitly forbids the use of UTF-32, SCSU, and BOCU-1 (while allowing many non-Unicode legacy encodings and quietly mapping others to Windows encodings); one committee member was quoted as saying that other encodings of Unicode waste developer time.

While the authors of HTML5 gave good reasons for ignoring SCSU or BOCU-1, excluding UTF-32, which is the most direct, one-to-one mapping of Unicode code points to byte values, seems shortsighted. We are talking about the whole of Unicode, not just the BMP.

All UTFs (8, 16, 32) can represent all of Unicode, as can SCSU. The only Unicode encoding that can represent only the BMP is UCS-2, which AFAIK is no longer endorsed by the UTC.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell
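The statement that all three UTFs cover code points beyond the BMP is easy to confirm. A small sketch using Python's built-in codecs; the function name is hypothetical, and U+10400 (DESERET CAPITAL LETTER LONG I) is picked only as a convenient supplementary-plane example:

```python
from typing import Dict


def utf_forms(ch: str) -> Dict[str, str]:
    """Encode one character in UTF-8, UTF-16 and UTF-32 (big-endian, no BOM).

    Returns the encoded bytes as hex strings, after checking that each
    encoding decodes back to the original character.
    """
    forms = {}
    for name in ("utf-8", "utf-16-be", "utf-32-be"):
        encoded = ch.encode(name)
        if encoded.decode(name) != ch:
            raise ValueError("round trip failed for " + name)
        forms[name] = encoded.hex()
    return forms


# U+10400 lies outside the BMP (its code point is above U+FFFF).
SUPPLEMENTARY = "\U00010400"
```

`utf_forms(SUPPLEMENTARY)` yields `f0909080` for UTF-8, `d801dc00` for UTF-16 (a surrogate pair) and `00010400` for UTF-32. UCS-2 has no surrogate mechanism, so it has no way to express this character.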
Re: Unicode, SMS and year 2012
On 2012/04/29 18:58, Szelp, A. Sz. wrote:

While the authors of HTML5 gave good reasons for ignoring SCSU or BOCU-1, excluding UTF-32, which is the most direct, one-to-one mapping of Unicode code points to byte values, seems shortsighted.

Well, except that it's hopelessly inefficient and therefore essentially nobody is using it.

We are talking about the whole of Unicode, not just the BMP.

Yes. For transmission, use UTF-8 (or maybe UTF-16).

Regards, Martin.
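Martin's efficiency point can be made concrete by counting bytes. The sketch below is illustrative only: the function name and the sample strings are made up for this note, and the little-endian codec variants are used so that no byte order mark is counted.

```python
from typing import Dict


def encoded_sizes(text: str) -> Dict[str, int]:
    """Return the number of bytes the text occupies in each UTF (no BOM)."""
    return {
        name: len(text.encode(name))
        for name in ("utf-8", "utf-16-le", "utf-32-le")
    }


# Hypothetical sample inputs: ten ASCII characters, and one non-BMP character.
ASCII_SAMPLE = "Hello, SMS"
NON_BMP_SAMPLE = "\U00010400"
```

For the ten-character ASCII sample the sizes are 10, 20 and 40 bytes for UTF-8, UTF-16 and UTF-32 respectively; for the single supplementary character all three take 4 bytes. UTF-32 is therefore never smaller than the other two, and for mostly-ASCII text it is four times the size of UTF-8.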