Re: Unicode, SMS and year 2012

2012-05-01 Thread Sarasvati
Hello everyone: The discussion threads with the subjects "Unicode, SMS, and year 2012" and "ece" are now closed. We have received some complaints about intellectual property concerns, and assertions of IP that were raised in this thread. All messages in the affected threads have been expunged from

Re: Unicode, SMS and year 2012

2012-04-30 Thread David Starner
On Sat, Apr 28, 2012 at 6:22 PM, Naena Guru naenag...@gmail.com wrote: How I see Unicode is as a set of character groups, 7-bit, 8-bit (extends and replaces 7-bit), 16-bit, and CJKV that use some sort of 16-bit pairing. That's one lens to see Unicode through, but in most cases it's

Re: Unicode, SMS and year 2012

2012-04-30 Thread Asmus Freytag
Dracula and other novels aside, there are applications where text volume definitely matters. One I've come across in my work is transaction-log filtering. Logs, like HTTP logs, can generate rather interesting streams of text data, where the volume easily becomes so large that merely

Re: Unicode, SMS and year 2012

2012-04-29 Thread Szelp, A. Sz.
While there are good reasons the authors of HTML5 brought to ignore SCSU or BOCU-1, having excluded UTF-32 which is the most direct, one-to-one mapping of Unicode codepoints to byte values seems shortsighted. We are talking about the whole of Unicode, not just BMP. /Sz On Sat, Apr 28, 2012 at

Re: Unicode, SMS and year 2012

2012-04-29 Thread Mark E. Shoulson
On 04/28/2012 07:54 AM, a...@peoplestring.com wrote: I apologise for my poor explanation. I further assure, the codes are not magically created, they are created by the EBNF below. I regenerated the EBNF to make me as clear as possible, in fact, now they are two:

Re: Unicode, SMS and year 2012

2012-04-29 Thread Mark E. Shoulson
On 04/29/2012 12:38 PM, a...@peoplestring.com wrote: Hi! I have noticed that I have created the previous definitions in a hurry to answer the question raised, as quickly as possible. They are incomplete. I used the EBNF notation to express my encoding. Please refer to Wikipedia (in Wikipedia,

Re: Unicode, SMS and year 2012

2012-04-29 Thread Doug Ewell
Szelp, A. Sz. wrote: Some people are simply opposed to additional encoding schemes. The HTML5 specification explicitly forbids the use of UTF-32, SCSU, and BOCU-1 (while allowing many non-Unicode legacy encodings and quietly mapping others to Windows encodings); one committee member was quoted

Re: Unicode, SMS and year 2012

2012-04-29 Thread Martin J. Dürst
On 2012/04/29 18:58, Szelp, A. Sz. wrote: While there are good reasons the authors of HTML5 brought to ignore SCSU or BOCU-1, having excluded UTF-32 which is the most direct, one-to-one mapping of Unicode codepoints to byte values seems shortsighted. Well, except that it's hopelessly

Re: Unicode, SMS and year 2012

2012-04-28 Thread anbu
How data is transformed to this string is undefined, which is a problem. As mentioned in the mail, just like UTF-8 is pre-installed in most systems, this design would also be pre-installed in the systems intending to use them. The example given above does not exist anywhere. One needs to come

Fwd: Re: Unicode, SMS and year 2012

2012-04-28 Thread anbu
increments progressively by two bits. Please refer to the attached spreadsheet for more comparison of values. Original Message Subject: Re: Unicode, SMS and year 2012 Date: Sat, 28 Apr 2012 07:54:02 -0400 From: a...@peoplestring.com To: freak...@googlemail.com How data is transformed

Re: Unicode, SMS and year 2012

2012-04-28 Thread Richard Wordingham
On Fri, 27 Apr 2012 11:21:05 -0700 Doug Ewell d...@ewellic.org wrote: SCSU works equally well, or almost so, with any text sample where the non-ASCII characters fit into a single block of 128 code points. For anything other than Latin-1 you need one byte of overhead, to switch to another

Re: Unicode, SMS and year 2012

2012-04-28 Thread Doug Ewell
anbu at peoplestring dot com wrote: Document encoded in SCSU or BOCU-1, given that the document contains only ASCII characters, may appear corrupt on a system that doesn't recognise SCSU or BOCU-1. This is the curious point of view that ASCII compatibility (or transparency) is a bad thing.

Re: Unicode, SMS and year 2012 - SQU, not UQU

2012-04-28 Thread Richard Wordingham
On Sat, 28 Apr 2012 18:55:00 +0100 Richard Wordingham richard.wording...@ntlworld.com wrote: I wrote: With SCSU that avoids Unicode mode and UQU whenever possible, most alphabetic languages work fairly well. I meant: With SCSU that avoids Unicode mode and SQU whenever possible, most

Re: Unicode, SMS and year 2012

2012-04-28 Thread Doug Ewell
Mark Davis wrote: I suspect the punycode goal is to take a wide character set into a restricted character set, without caring much on resulting string length; if the original string happens to be in other character set than the target restricted character set, then the string length

Re: Fwd: Re: Unicode, SMS and year 2012

2012-04-28 Thread Doug Ewell
anbu at peoplestring dot com wrote: This clearly shows that my design yields number of values more than double that of UTF8 I didn't know we were competing against UTF-8 on efficiency. That's easy. UTF-8 is not at all guaranteed to be the most efficient encoding possible, or even reasonably

Re: Unicode, SMS and year 2012

2012-04-28 Thread Doug Ewell
I wrote: 0xxxxxxx - encodes U+0000 through U+007F; 1xxxxxxx 0xxxxxxx - encodes U+0080 through U+3FFF; 1xxxxxxx 1xxxxxxx - encodes U+4000 through U+10FFFF (and onward to 0x1FFFFF). Last code sequence should be 1xxxxxxx 1xxxxxxx 0xxxxxxx. -- Doug Ewell | Thornton, Colorado, USA
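The three-level scheme in this message (7 payload bits per byte, with the high bit set on every byte except the last) can be sketched as follows. This is only an illustration under stated assumptions: the message does not say how the payload bits are ordered, so the most significant 7-bit group is assumed to come first, and the names `encode_cp` and `decode_all` are hypothetical.

```python
def encode_cp(cp):
    """Encode one code point with 7 payload bits per byte.

    Assumption: most significant 7-bit group first. The high bit of a
    byte is 1 when another byte follows and 0 on the final byte.
    """
    if not 0 <= cp <= 0x1FFFFF:
        raise ValueError("value does not fit in three 7-bit groups")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x3FFF:
        # 1xxxxxxx 0xxxxxxx
        return bytes([0x80 | (cp >> 7), cp & 0x7F])
    # 1xxxxxxx 1xxxxxxx 0xxxxxxx
    return bytes([0x80 | (cp >> 14), 0x80 | ((cp >> 7) & 0x7F), cp & 0x7F])


def decode_all(data):
    """Decode a byte string produced by encode_cp into code points."""
    out = []
    value = 0
    for b in data:
        value = (value << 7) | (b & 0x7F)
        if b & 0x80 == 0:  # final byte of this code point
            out.append(value)
            value = 0
    return out


text = "aé\u4e2d"
encoded = b"".join(encode_cp(ord(ch)) for ch in text)
decoded = "".join(chr(cp) for cp in decode_all(encoded))
```

Here `a` takes one byte, `é` (U+00E9) two, and U+4E2D three, so `encoded` is six bytes long and `decoded` equals `text`.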

Re: Fwd: Re: Unicode, SMS and year 2012

2012-04-28 Thread anbu
There are many reasons why a new encoding that is merely more efficient than UTF-8, especially one that sacrifices byte-based processing or other design features, will face a severe uphill battle in trying to displace UTF-8. What are some of the reasons a new encoding will face? On Sat,

Fwd: Re: Fwd: Re: Unicode, SMS and year 2012

2012-04-28 Thread anbu
The question shall read as: What are some of the reasons a new encoding will face challenges? Original Message Subject: Re: Fwd: Re: Unicode, SMS and year 2012 Date: Sat, 28 Apr 2012 15:32:47 -0400 From: a...@peoplestring.com To: d...@ewellic.org There are many reasons why

Re: Unicode, SMS and year 2012

2012-04-28 Thread Cristian Secară
On Sat, 28 Apr 2012 12:53:17 -0600, Doug Ewell wrote: Not to say this isn’t so, but can you point to a tool or site where a user can type a string and see the output with different parameterizations? Pretty much all of the “Convert to Punycode” pages I see are only able to convert

Re: Unicode, SMS and year 2012

2012-04-28 Thread Doug Ewell
anbu at peoplestring dot com wrote: What are some of the reasons a new encoding will face challenges? The main challenge to a new encoding is that UTF-8 is already present in numerous applications and operating systems, and that any encoding intended to serve as an alternative, let alone a

Re: Unicode, SMS and year 2012

2012-04-28 Thread Cristian Secară
On Sat, 28 Apr 2012 12:41:51 -0600, Doug Ewell wrote: If I'm going to use a variable-length, non-byte-aligned encoding, where there is no chance of realigning in case of a flipped or dropped bit (which seems to be of great concern to many people), I might as well go ahead and use a

Re: Unicode, SMS and year 2012

2012-04-28 Thread Naena Guru
Hi Cristian, This is a bit of a deviation from the issues you raise, but it relates to the subject in a different way. The SMS char set does not seem to follow Unicode. How I see Unicode is as a set of character groups, 7-bit, 8-bit (extends and replaces 7-bit), 16-bit, and CJKV that use some

Re: Fwd: Re: Unicode, SMS and year 2012

2012-04-28 Thread Doug Ewell
On Friday, April 27, anbu at peoplestring dot com wrote: In addition I had a few more questions, of which the one below is the most significant: What if one had to send a text in multiple scripts, like in the case of a text and its translation in the same message? I thought maybe a new

Re: Unicode, SMS and year 2012

2012-04-28 Thread Doug Ewell
Richard Wordingham wrote: With SCSU that avoids Unicode mode and UQU whenever possible, most alphabetic languages work fairly well. However, extra windows are needed to cover the half-blocks from A480 to ABFF, 15 new codes. If I were being miserly, I wouldn't cover A500-A5FF. In November

Unicode, SMS and year 2012

2012-04-27 Thread Cristian Secară
A few years ago there was a discussion here about Unicode and SMS (Subject: Unicode, SMS, PDA/cellphones). Then and now the situation is the same, i.e. an SMS text message that uses characters from the GSM character set can include 160 characters per message (stream of 7 bits × 160), whereas a message
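As a rough check of the figures discussed in this thread, the sketch below derives per-message character counts from a fixed payload size. The 140-octet payload is inferred from the 160 × 7 bits stated above; the 16-bit case assumes that text outside the GSM character set is sent as UCS-2, and `max_chars` is a hypothetical helper name.

```python
# Assumption: one message carries 160 * 7 = 1120 bits, i.e. 140 octets.
PAYLOAD_OCTETS = (160 * 7) // 8


def max_chars(bits_per_char, payload_octets=PAYLOAD_OCTETS):
    """Return how many fixed-width characters fit in one message payload."""
    return (payload_octets * 8) // bits_per_char


gsm_7bit = max_chars(7)    # GSM 7-bit default alphabet
ucs2_16bit = max_chars(16)  # assumed 16-bit UCS-2 code units
```

Under these assumptions `gsm_7bit` is 160 and `ucs2_16bit` is 70, which matches the 70-character limitation mentioned elsewhere in the thread.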

Fwd: Re: Unicode, SMS and year 2012

2012-04-27 Thread anbu
- 4] - 64 Original Message Subject: Re: Unicode, SMS and year 2012 Date: Fri, 27 Apr 2012 07:54:41 -0400 From: a...@peoplestring.com To: or...@secarica.ro, unicode@unicode.org Hi! I also had the same questions. In addition I had a few more questions, of which the one below

Fwd: Re: Unicode, SMS and year 2012

2012-04-27 Thread anbu
Subject: Fwd: Re: Unicode, SMS and year 2012 Date: Fri, 27 Apr 2012 08:14:13 -0400 From: a...@peoplestring.com To: or...@secarica.ro, unicode@unicode.org Please note the following corrections to the mail below: The number of codes supported with a given number of bits, n, is given by: [2 ^ (n

Re: Unicode, SMS and year 2012

2012-04-27 Thread Doug Ewell
Cristian Secară orice at secarica dot ro wrote: It turned out that they (ETSI and its groups) created a way to solve the 70-character limitation, namely “National Language Single Shift” and “National Language Locking Shift” mechanism. This is described in the 3GPP TS 23.038 standard and it was

Re: Unicode, SMS and year 2012

2012-04-27 Thread Mark Davis ☕
Actually, if the goal is to get as many characters in as possible, Punycode might be the best solution. That is the encoding used for internationalized domains. In that form, it uses a smaller number of bytes per character, but a parameterization allows use of all byte values.
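For readers who want to try this, the sketch below compares encoded sizes using Python's built-in `punycode` codec. That codec implements the standard RFC 3492 form used for domain names (ASCII letters, digits and hyphen as output); it does not offer the alternative parameterization with all byte values mentioned above, so the numbers only illustrate the standard form. The helper name `sizes` is hypothetical.

```python
def sizes(text):
    """Return the encoded length in bytes of text in three encodings."""
    return {
        "punycode": len(text.encode("punycode")),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),  # big-endian, no BOM
    }


sample = "bücher"
puny = sample.encode("punycode")       # b"bcher-kva"
restored = puny.decode("punycode")     # round-trips to the original string

for name, length in sizes(sample).items():
    print(f"{name}: {length} bytes")
```

For this short, mostly ASCII sample the standard form is not the smallest of the three; results depend heavily on the script and length of the text.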

RE: Unicode, SMS and year 2012

2012-04-27 Thread Doug Ewell
Mark Davis mark at macchiato dot com wrote: Actually, if the goal is to get as many characters in as possible, Punycode might be the best solution. That is the encoding used for internationalized domains. In that form, it uses a smaller number of bytes per character, but a parameterization

Re: Unicode, SMS and year 2012

2012-04-27 Thread Doug Ewell
Anbu Kaveeswarar Selvaraju anbu at peoplestring dot com wrote: What if one had to send a text in multiple scripts, like in the case of a text and its translation in the same message? I thought maybe a new transition format or a new character encoding would be suitable. I am currently working

Re: Unicode, SMS and year 2012

2012-04-27 Thread Cristian Secară
On Fri, 27 Apr 2012 12:26:25 -0700, Mark Davis ☕ wrote: Actually, if the goal is to get as many characters in as possible, Punycode might be the best solution. That is the encoding used for internationalized domains. In that form, it uses a smaller number of bytes per character, but

Re: Unicode, SMS and year 2012

2012-04-27 Thread Robert Abel
Hi On 2012/04/28 00:23, a...@peoplestring.com wrote: 1. let 'x' be the position of a code positioned at an odd number eg when we take the code '1001010110', the first '1' is positioned at location '1' (so an odd number), the first '0' is positioned at location '2' (not an odd number), the

Re: Unicode, SMS and year 2012

2012-04-27 Thread Mark Davis ☕
That is not correct. One of the chief reasons that punycode was selected was the reduction in size. Tests with the idnbrowser are not relevant. As I said: In that form, it uses a smaller number of bytes per character, but a parameterization allows use of all byte values. That is, the

Re: Unicode, SMS and year 2012

2012-04-27 Thread Martin J. Dürst
On 2012/04/28 4:26, Mark Davis ☕ wrote: Actually, if the goal is to get as many characters in as possible, Punycode might be the best solution. That is the encoding used for internationalized domains. In that form, it uses a smaller number of bytes per character, but a parameterization allows

Re: Unicode, SMS and year 2012

2012-04-27 Thread Martin J. Dürst
On 2012/04/28 7:29, Cristian Secară wrote: On Fri, 27 Apr 2012 12:26:25 -0700, Mark Davis ☕ wrote: Actually, if the goal is to get as many characters in as possible, Punycode might be the best solution. That is the encoding used for internationalized domains. In that form, it uses a

Re: Unicode, SMS and year 2012

2012-04-27 Thread Cristian Secară
On Fri, 27 Apr 2012 17:28:13 -0700, Mark Davis ☕ wrote: That is not correct. One of the chief reasons that punycode was selected was the reduction in size. Tests with the idnbrowser are not relevant. As I said: In that form, it uses a smaller number of bytes per character, but a

Re: Unicode, SMS and year 2012

2012-04-27 Thread Martin J. Dürst
On 2012/04/27 17:06, Cristian Secară wrote: It turned out that they (ETSI and its groups) created a way to solve the 70-character limitation, namely “National Language Single Shift” and “National Language Locking Shift” mechanism. This is described in the 3GPP TS 23.038 standard and it was introduced