Re: Unicode, SMS and year 2012

2012-04-29 Thread Szelp, A. Sz.
While the authors of HTML5 had good reasons to ignore SCSU and BOCU-1,
excluding UTF-32, which is the most direct, one-to-one mapping of Unicode
code points to byte values, seems shortsighted. We are talking about the
whole of Unicode, not just the BMP.

/Sz



On Sat, Apr 28, 2012 at 21:48, Doug Ewell d...@ewellic.org wrote:

 anbu at peoplestring dot com wrote:

  What are some of the reasons a new encoding will face challenges?


 The main challenge to a new encoding is that UTF-8 is already present in
 numerous applications and operating systems, and that any encoding intended
 to serve as an alternative, let alone a replacement for UTF-8, must be better
 enough to justify re-engineering of these systems.

 Some people are simply opposed to additional encoding schemes. The HTML5
 specification explicitly forbids the use of UTF-32, SCSU, and BOCU-1 (while
 allowing many non-Unicode legacy encodings and quietly mapping others to
 Windows encodings); one committee member was quoted as saying that other
 encodings of Unicode waste developer time.

 Any encoding that does not align code point boundaries along byte
 boundaries will be criticized for requiring excessive processing. The
 argument that I made will be made by others, that if it is necessary to
 process bit-by-bit, one might as well use a general-purpose compression
 algorithm. It is popular to present gzip as the ideal compression approach,
 since it is widely available, especially on Linux-type systems, and
 publicly documented (and not IP-encumbered).

 I may have missed some other objections.


 --
 Doug Ewell | Thornton, Colorado, USA
 http://www.ewellic.org | @DougEwell




Re: Unicode, SMS and year 2012

2012-04-29 Thread Mark E. Shoulson
On 04/28/2012 07:54 AM, a...@peoplestring.com wrote:
 I apologise for my poor explanation. I assure you that the codes are not
 magically created; they are generated by the EBNF below. I rewrote the
 EBNF to be as clear as possible; in fact, there are now two:

 1(0|1){1(0|1)}{0(0|1)}0(0|1)1(0|1)

 1(0|1){0(0|1)}{1(0|1)}1(0|1)0(0|1)



These oft-repeated incomprehensible strings of symbols would be a whole
lot more intuitively understandable if, say, you were to use a
_different_ symbol for either 0 or 1 and not (0|1) (and maybe some
spaces to split it up for the eye), and/or there were an actual
*explanation* of what they meant, as in:

1 X {1X}... {0X}... 0 X 1 X
1 X {0X}... {1X}... 1 X 0 X

and words like ... The bits in odd-numbered positions [counting from
zero] can be either value and hold the data being transferred; in the
even-numbered positions the first [zeroth] bit is 1, followed either by
a string of 1s, then 0s, ending with 0 1; or else a string of 0s, then
1s, ending with 1 0. Or something like that, maybe done better. My eyes
glaze over at the sight of what looks like a random selection out of
[{}10|()]*, and I'm probably not the only one.
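
As a rough illustration of the description above, the two patterns can be
written as regular expressions and checked mechanically. This is only a
sketch: it assumes the braces in the EBNF mean "zero or more repetitions",
and the names FORM_A, FORM_B, classify and data_bits are made up for this
example; they are not part of the proposal.

```python
import re

# Even positions (counting from zero) carry the framing bits,
# odd positions carry the data bits (written X in the prose above).
# Assumption: "{...}" in the EBNF means zero or more repetitions.
FORM_A = re.compile(r"1[01](?:1[01])*(?:0[01])*0[01]1[01]")  # 1 X {1X} {0X} 0 X 1 X
FORM_B = re.compile(r"1[01](?:0[01])*(?:1[01])*1[01]0[01]")  # 1 X {0X} {1X} 1 X 0 X


def classify(bits):
    """Return "A" or "B" for the pattern the bit string matches, else None."""
    if FORM_A.fullmatch(bits):
        return "A"
    if FORM_B.fullmatch(bits):
        return "B"
    return None


def data_bits(bits):
    """Return the free (X) bits, i.e. those in odd-numbered positions."""
    return bits[1::2]
```

The framing bits of the first pattern always end in 1 and those of the
second always end in 0, so no bit string can match both.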

~mark




Re: Kaktovik Inupiaq numerals

2012-04-29 Thread Kent Karlsson

On 2012-04-28 12:50, Richard Wordingham
richard.wording...@ntlworld.com wrote:

 On Fri, 27 Apr 2012 13:50:15 -0700
 Ken Whistler k...@sybase.com wrote:
 
 On 4/27/2012 10:45 AM, Richard Wordingham wrote:
 If they are to be adopted by the CLDR, the digits need to be coded
 consecutively.
 
 I doubt this matters in any case, because this proposed use is for
 a vigesimal system, which has digits 0..19, not digits 0..9. Trying to
 treat the first 10 digits as decimal digits in CLDR could accomplish
 nothing, IMO.
 
 I don't believe the exclusion of non-decimal bases is set in stone.
 So, while they wouldn't fit in to CLDR as it stands now, it would not
 take a huge change to add them.

CLDR used to require sequentially encoded decimal digits, but my
understanding is that that is no longer the case. And indeed, the
numeral systems need not be decimal, or even positional. Roman numerals
are supported, as are (e.g.) Armenian numerals, and traditional
Chinese numerals (non-positional, using multiplier words).

While vigesimal systems aren't supported (in CLDR) to the degree
that any got *named*, in the way some other systems have been, there
is still *some* support. See e.g.
 http://unicode.org/cldr/trac/browser/trunk/common/rbnf/nci.xml
(a full-fledged vigesimal system in those rules) for spelling out
numbers as words in Classical Nahuatl. There is also
 http://unicode.org/cldr/trac/browser/trunk/common/rbnf/kl.xml,
for spelling out numbers in Kalaallisut (Greenlandic), but it is not
full-fledged vigesimal.

These RBNF rules are based on what I could find out from sources on
the web a few years ago. If anyone has corrections/extensions/variations
to these, or additions for other languages using vigesimal systems (yes,
I did see that there was some data on the Wikipedia pages referenced),
please send them to me, preferably with contact information for someone
in the know, and I'll see what I can do. I cannot use vigesimal digits,
though, since none are as yet encoded. But if some set of vigesimal
digits were to be encoded, supporting them via RBNF would likely be the
first point of support in CLDR.

 Furthermore, what Inuit has is a vigesimal *counting* system, as the
 article indicates. But this innovated set of numerals is attempting
 to turn this into a full-blown radix-20 numerical system, which I
 doubt has any cultural validity.
 
 I presume you are talking about how the hundreds are (or were)
 traditionally expressed.
 
 The Inuit number system is another case of the rather widespread use
 of mixed 5/20 counting systems, which count 4 hands of 5 into
 groups of 20.
 
 Indeed, it immediately made me think of Welsh, where native-speakers'
 use of their vigesimal system has been hammered by the use of Arabic
 numerals.  (In England, resistance to this 'heathen notation' collapsed
 long ago.)  Before anyone points it out,  I do know that Welsh _pymtheg_
 '15' and possibly even _ugain_ '20' ultimately derive from a
 (superseded) decimal system.  However, Welsh goes decimal at 100, so
 this vigesimal notation would not match the language at all for higher
 numbers.
 
 I don't think combining diacritics makes sense in this case. Rather,
 this kind of construction is better handled by taking the graphic
 elements for 5, 10, and 15, and ligating them in a font for the
 combined units. So the only elements requiring encoding would
 be 0, 1, 2, 3, 4, 5, 10, 15, in order to fully represent this system.
 
 No.  One must be able to distinguish ONE, FIVE (= '25') and FIVE,
 ONE (= '101') from the notation for '6'.  Or are you suggesting that
 rendering of ZWJ should be *essential* for the semantics, not just for
 acceptability?

While I would have liked to have seen the use of combining characters
(or ligation) in certain other cases where it is not present in Unicode,
I think that that approach would be very inappropriate here; this is for
digits for use in a positional system. Just encode (when that time comes)
each of the new digits corresponding to 0, ..., 19 *atomically*.
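
To make the positional point concrete, here is a small sketch of how an
integer breaks down into base-20 digit values. The function names
to_vigesimal and hands_and_units are invented for this illustration; no
Kaktovik code points are assumed, since none are encoded yet.

```python
def to_vigesimal(n):
    """Return the base-20 digit values of a non-negative integer,
    most significant digit first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = []
    while True:
        n, d = divmod(n, 20)
        digits.append(d)
        if n == 0:
            break
    return digits[::-1]


def hands_and_units(digit):
    """Split one digit value (0..19) into (fives, units), the 5/20
    sub-structure visible in the glyph design."""
    if not 0 <= digit <= 19:
        raise ValueError("digit must be in 0..19")
    return divmod(digit, 5)
```

With this, 25 comes out as the digit sequence [1, 5] and 101 as [5, 1],
while 6 is the single digit [6] (one five plus one unit). Three distinct
digit strings, which is why each digit value needs to stay atomic.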

The Kaktovik digits are niftily designed though, with a logic in the
(abstract) graphical design, and each of them can be drawn in a single
pen stroke.

They have found their way into some fonts
 (http://www.linguistsoftware.com/linup.htm#Kaktovik), and have some
support from the Inuit Circumpolar Council
 (http://inuitcircumpolar.com/section.php?Nav=SectionID=10Lang=En).

/Kent K


 The (undemonstrated) use of the notation denoting hands for which I
 suggested a combining diacritic could be handled by ligatures
 specified by ZWJ, but there could be a lot of them.  Look at the ugly
 mess in New Tai Lue caused by not anticipating the need for medial 'v'
 because the UTC knew too little about Tai Lue (or even, more
 surprisingly, Northern Thai).
 
 Richard.
 





Re: Unicode, SMS and year 2012

2012-04-29 Thread Mark E. Shoulson
On 04/29/2012 12:38 PM, a...@peoplestring.com wrote:
 Hi!

 I have noticed that I created the previous definitions in a hurry, to
 answer the question raised as quickly as possible.
 They are incomplete.
 I used the EBNF notation to express my encoding.

 Please refer to Wikipedia (in Wikipedia, especially 'Table of Symbols') or
 other sources on EBNF:

 http://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form#Table_of_symbols

 I am creating a well defined one.

Yes, I know about EBNF notation. I didn't say it was wrong. I just said
it would be a lot easier to follow and understand with clearer notation.

~mark



Re: Unicode, SMS and year 2012

2012-04-29 Thread Doug Ewell

Szelp, A. Sz. wrote:


Some people are simply opposed to additional encoding schemes. The
HTML5 specification explicitly forbids the use of UTF-32, SCSU, and
BOCU-1 (while allowing many non-Unicode legacy encodings and quietly
mapping others to Windows encodings); one committee member was quoted
as saying that other encodings of Unicode waste developer time.


While the authors of HTML5 had good reasons to ignore SCSU and BOCU-1,
excluding UTF-32, which is the most direct, one-to-one mapping of
Unicode code points to byte values, seems shortsighted. We are talking
about the whole of Unicode, not just the BMP.


All UTFs (8, 16, 32) can represent all of Unicode, as can SCSU. The only 
Unicode encoding that can represent only the BMP is UCS-2, which AFAIK 
is no longer endorsed by UTC.
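
A quick way to check this, sketched with Python's built-in codecs (the
choice of U+10400 as the sample and the variable names are only for
illustration):

```python
# U+10400 lies outside the BMP (it is in the Deseret block).
ch = "\U00010400"

# Each UTF encodes the character and decodes it back unchanged.
round_trips = {}
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    encoded = ch.encode(codec)
    round_trips[codec] = (encoded.decode(codec) == ch)
    print(codec, encoded.hex(), len(encoded), "bytes")

# UTF-16 needs two 16-bit code units (a surrogate pair) for this
# character, which is exactly what a fixed 16-bit form cannot express.
utf16_units = len(ch.encode("utf-16-le")) // 2
```

All three round trips succeed; the only difference is how many code
units each form needs.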


--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell





Re: Unicode, SMS and year 2012

2012-04-29 Thread Martin J. Dürst

On 2012/04/29 18:58, Szelp, A. Sz. wrote:

While the authors of HTML5 had good reasons to ignore SCSU and BOCU-1,
excluding UTF-32, which is the most direct, one-to-one mapping of Unicode
code points to byte values, seems shortsighted.


Well, except that it's hopelessly inefficient and therefore essentially 
nobody is using it.



We are talking about the whole of Unicode, not just BMP.


Yes. For transmission, use UTF-8 (or maybe UTF-16).
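
The size difference is easy to measure. The following is a minimal
sketch using Python's standard codecs; the helper name encoded_sizes and
the sample strings are just for illustration, and byte order marks are
left out by using the little-endian codec names.

```python
def encoded_sizes(text):
    """Return the number of bytes `text` occupies in each UTF (no BOM)."""
    codecs = ("utf-8", "utf-16-le", "utf-32-le")
    return {codec: len(text.encode(codec)) for codec in codecs}


ascii_sizes = encoded_sizes("hello")           # five ASCII letters
cjk_sizes = encoded_sizes("\u65e5\u672c\u8a9e")  # three BMP CJK characters
```

For the ASCII sample UTF-32 takes four times as many bytes as UTF-8, and
even for the CJK sample it is larger than both UTF-8 and UTF-16.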

Regards,    Martin.