Re: Unicode, SMS and year 2012

2012-05-01 Thread Sarasvati
Hello everyone: The discussion threads with the subjects "Unicode, SMS, and year 2012" and "ece" are now closed. We have received some complaints about intellectual property concerns, and assertions of IP that were raised in this thread. All messages in the affected thread

Re: Unicode, SMS and year 2012

2012-04-29 Thread Asmus Freytag
Dracula and other novels aside, there are applications where text volume definitely matters. One I've come across in my work is transaction-log filtering. Logs, like HTTP logs, can generate rather interesting streams of text data, where the volume easily becomes so large that merely attempting

Re: Unicode, SMS and year 2012

2012-04-29 Thread David Starner
On Sat, Apr 28, 2012 at 6:22 PM, Naena Guru wrote: > How I see Unicode is as a > set of character groups, 7-bit, 8-bit (extends and replaces 7-bit), 16-bit, > and CJKV that use some sort of 16-bit pairing. That's one lens to see Unicode through, but in most cases it's substantially distorting. Uni

Re: Unicode, SMS and year 2012

2012-04-29 Thread Martin J. Dürst
On 2012/04/29 18:58, Szelp, A. Sz. wrote: While the authors of HTML5 had good reasons to ignore SCSU or BOCU-1, excluding UTF-32, which is the most direct, one-to-one mapping of Unicode code points to byte values, seems shortsighted. Well, except that it's hopelessly inefficien
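To put rough numbers on the efficiency point, here is a minimal sketch that encodes one sample string in the three Unicode encoding forms and compares the byte counts. The sample string and the helper name `encoded_sizes` are illustrative assumptions, not taken from the thread.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes of `text` for each encoding form."""
    # The "-be" variants are used so that no byte order mark is prepended.
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),
        "utf-32": len(text.encode("utf-32-be")),
    }

# Hypothetical sample: four ASCII letters, one Latin-1 letter, one non-BMP character.
sample = "h\u00e9llo\U0001F600"
sizes = encoded_sizes(sample)
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

UTF-32 always spends four bytes per code point, so it is never smaller than the other two forms; that is the cost of the one-to-one mapping.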

Re: Unicode, SMS and year 2012

2012-04-29 Thread Doug Ewell
Szelp, A. Sz. wrote: Some people are simply opposed to additional encoding schemes. The HTML5 specification explicitly forbids the use of UTF-32, SCSU, and BOCU-1 (while allowing many non-Unicode legacy encodings and quietly mapping others to Windows encodings); one committee member was quoted a

Re: Unicode, SMS and year 2012

2012-04-29 Thread Mark E. Shoulson
On 04/29/2012 12:38 PM, a...@peoplestring.com wrote: > Hi! > > I have noticed that I have created the previous definitions in a hurry to > answer the question raised, as quickly as possible. > They are incomplete. > I used the EBNF notation to express my encoding. > > Please refer to Wikipedia (in Wikip

Re: Unicode, SMS and year 2012

2012-04-29 Thread Mark E. Shoulson
On 04/28/2012 07:54 AM, a...@peoplestring.com wrote: >> I apologise for my poor explanation. I further assure, the codes are not >> magically created, they are created by the EBNF below. I regenerated the >> EBNF to make me as clear as possible, in fact, now they are two: >> >> 1(0|1){1(0|1)}{0(0|1

Re: Unicode, SMS and year 2012

2012-04-29 Thread Szelp, A. Sz.
While the authors of HTML5 had good reasons to ignore SCSU or BOCU-1, excluding UTF-32, which is the most direct, one-to-one mapping of Unicode code points to byte values, seems shortsighted. We are talking about the whole of Unicode, not just the BMP. /Sz On Sat, Apr 28, 2012 at 2

Re: Unicode, SMS and year 2012

2012-04-28 Thread Doug Ewell
Richard Wordingham wrote: With SCSU that avoids Unicode mode and UQU whenever possible, most alphabetic languages work fairly well. However, extra windows are needed to cover the half-blocks from A480 to ABFF, 15 new codes. If I were being miserly, I wouldn't cover A500-A5FF. In November 201

Re: Fwd: Re: Unicode, SMS and year 2012

2012-04-28 Thread Doug Ewell
On Friday, April 27, wrote: In addition I had a few more questions, of which the one below is the most significant: What if one had to send a text in multiple scripts, like in the case of a text and its translation in the same message? I thought maybe a new transition format or a new character

Re: Unicode, SMS and year 2012

2012-04-28 Thread Naena Guru
Hi Cristian, This is a bit of a deviation from the issues you raise, but it relates to the subject in a different way. The SMS char set does not seem to follow Unicode. How I see Unicode is as a set of character groups, 7-bit, 8-bit (extends and replaces 7-bit), 16-bit, and CJKV that use some sor

Re: Unicode, SMS and year 2012

2012-04-28 Thread Cristian Secară
În data de Sat, 28 Apr 2012 12:41:51 -0600, Doug Ewell a scris: > If I'm going to use a variable-length, non-byte-aligned encoding, > where there is no chance of realigning in case of a flipped or > dropped bit (which seems to be of great concern to many people), I > might as well go ahead and use

Re: Unicode, SMS and year 2012

2012-04-28 Thread Doug Ewell
wrote: What are some of the reasons a new encoding will face challenges? The main challenge to a new encoding is that UTF-8 is already present in numerous applications and operating systems, and that any encoding intended to serve as an alternative, let alone a replacement for UTF-8, must be "

Re: Unicode, SMS and year 2012

2012-04-28 Thread Cristian Secară
În data de Sat, 28 Apr 2012 12:53:17 -0600, Doug Ewell a scris: > Not to say this isn’t so, but can you point to a tool or site where a > user can type a string and see the output with different > parameterizations? Pretty much all of the “Convert to Punycode” pages > I see are only able to conver

Fwd: Re: Fwd: Re: Unicode, SMS and year 2012

2012-04-28 Thread anbu
The question shall read as: What are some of the reasons a new encoding will face challenges? Original Message Subject: Re: Fwd: Re: Unicode, SMS and year 2012 Date: Sat, 28 Apr 2012 15:32:47 -0400 From: To: > There are many reasons why a new encoding that is merely m

Re: Fwd: Re: Unicode, SMS and year 2012

2012-04-28 Thread anbu
> There are many reasons why a new encoding that is merely more efficient > than UTF-8, especially one that sacrifices byte-based processing or > other design features, will face a severe uphill battle in trying to > displace UTF-8. What are some of the reasons a new encoding will face? On Sat

Re: Unicode, SMS and year 2012

2012-04-28 Thread Doug Ewell
I wrote: 0xxxxxxx - encodes U+0000 through U+007F; 1xxxxxxx 0xxxxxxx - encodes U+0080 through U+3FFF; 1xxxxxxx 1xxxxxxx - encodes U+4000 through U+10FFFF (and onward to 0x1FFFFF). Last code sequence should be 1xxxxxxx 1xxxxxxx 0xxxxxxx. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.
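A minimal sketch of the byte layout described in this message: each byte carries 7 payload bits, and the high bit is set on every byte except the last of a sequence. The function names `encode_cp` and `decode_cp` are hypothetical, and the most-significant-group-first ordering is an assumption, since the message gives only the bit patterns and the ranges.

```python
def encode_cp(cp: int) -> bytes:
    """Encode one code point as 1 to 3 bytes with 7 payload bits each."""
    if not 0 <= cp <= 0x1FFFFF:
        raise ValueError("value does not fit in three 7-bit groups")
    if cp <= 0x7F:
        groups = [cp]                                      # 0xxxxxxx
    elif cp <= 0x3FFF:
        groups = [cp >> 7, cp & 0x7F]                      # 1xxxxxxx 0xxxxxxx
    else:
        groups = [cp >> 14, (cp >> 7) & 0x7F, cp & 0x7F]   # 1xxxxxxx 1xxxxxxx 0xxxxxxx
    # Set the high (continuation) bit on every byte except the last one.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def decode_cp(data: bytes) -> int:
    """Decode a single sequence produced by encode_cp (assumed well-formed)."""
    cp = 0
    for byte in data:
        cp = (cp << 7) | (byte & 0x7F)
    return cp


for cp in (0x41, 0x80, 0x3FFF, 0x4000, 0x10FFFF):
    encoded = encode_cp(cp)
    print(f"U+{cp:04X} -> {encoded.hex(' ')}")
```

Under these assumptions a clear high bit always marks the last byte of a sequence, and ASCII text is encoded unchanged.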

Re: Fwd: Re: Unicode, SMS and year 2012

2012-04-28 Thread Doug Ewell
wrote: This clearly shows that my design yields number of values more than double that of UTF8 I didn't know we were competing against UTF-8 on efficiency. That's easy. UTF-8 is not at all guaranteed to be the most efficient encoding possible, or even reasonably possible. It was originally

Re: Unicode, SMS and year 2012

2012-04-28 Thread Doug Ewell
Mark Davis 🚙 wrote: >> I suspect the punycode goal is to take a wide character set into a >> restricted character set, without caring much on resulting string >> length; if the original string happens to be in other character set >> than the target restricted character set, then the string length

Re: Unicode, SMS and year 2012 - SQU, not UQU

2012-04-28 Thread Richard Wordingham
On Sat, 28 Apr 2012 18:55:00 +0100 Richard Wordingham wrote: I wrote: > With SCSU that avoids Unicode mode and UQU whenever possible, most > alphabetic languages work fairly well. I meant: "With SCSU that avoids Unicode mode and SQU whenever possible, most alphabetic languages work fairly wel

Re: Unicode, SMS and year 2012

2012-04-28 Thread Doug Ewell
wrote: Document encoded in SCSU or BOCU-1, given that the document contains only ASCII characters, may appear corrupt on a system that doesn't recognise SCSU or BOCU-1. This is the curious point of view that ASCII compatibility (or transparency) is a bad thing. It does not apply to BOCU-1, w

Re: Unicode, SMS and year 2012

2012-04-28 Thread Richard Wordingham
On Fri, 27 Apr 2012 11:21:05 -0700 "Doug Ewell" wrote: > SCSU works equally well, or almost so, with any text sample where the > non-ASCII characters fit into a single block of 128 code points. For > anything other than Latin-1 you need one byte of overhead, to switch > to another window, and for

Fwd: Re: Unicode, SMS and year 2012

2012-04-28 Thread anbu
increments progressively by two bits. Please refer to the attached spreadsheet for further comparison of values. Original Message Subject: Re: Unicode, SMS and year 2012 Date: Sat, 28 Apr 2012 07:54:02 -0400 From: To: > How data is transformed to this string is > undefined, which is a p

Re: Unicode, SMS and year 2012

2012-04-28 Thread anbu
> How data is transformed to this string is > undefined, which is a problem. As mentioned in the mail, just like UTF-8 is pre-installed in most systems, this design would also be pre-installed in the systems intending to use them. The example given above does not exist anywhere. One needs to come

Re: Unicode, SMS and year 2012

2012-04-27 Thread Martin J. Dürst
On 2012/04/27 17:06, Cristian Secară wrote: It turned out that they (ETSI & its groups) created a way to solve the 70 characters limitation, namely “National Language Single Shift” and “National Language Locking Shift” mechanism. This is described in 3GPP TS 23.038 standard and it was introduced

Re: Unicode, SMS and year 2012

2012-04-27 Thread Cristian Secară
În data de Fri, 27 Apr 2012 17:28:13 -0700, Mark Davis ☕ a scris: > That is not correct. One of the chief reasons that punycode was > selected was the reduction in size. Tests with the idnbrowser are not > relevant. As I said: > > > In that form, it uses a smaller number of > > bytes per character

Re: Unicode, SMS and year 2012

2012-04-27 Thread Martin J. Dürst
On 2012/04/28 7:29, Cristian Secară wrote: În data de Fri, 27 Apr 2012 12:26:25 -0700, Mark Davis ☕ a scris: Actually, if the goal is to get as many characters in as possible, Punycode might be the best solution. That is the encoding used for internationalized domains. In that form, it uses a s

Re: Unicode, SMS and year 2012

2012-04-27 Thread Martin J. Dürst
On 2012/04/28 4:26, Mark Davis ☕ wrote: Actually, if the goal is to get as many characters in as possible, Punycode might be the best solution. That is the encoding used for internationalized domains. In that form, it uses a smaller number of bytes per character, but a parameterization allows use

Re: Unicode, SMS and year 2012

2012-04-27 Thread Mark Davis ☕
That is not correct. One of the chief reasons that punycode was selected was the reduction in size. Tests with the idnbrowser are not relevant. As I said: > In that form, it uses a smaller number of > bytes per character, but a parameterization allows use of all byte > values. That is, the paramet

Re: Unicode, SMS and year 2012

2012-04-27 Thread Robert Abel
Hi On 2012/04/28 00:23, a...@peoplestring.com wrote: > 1. let 'x' be the position of a code positioned at an odd number eg when > we take the code '1001010110', the first '1' is positioned at location '1' > (so an odd number), the first '0' is positioned at location '2' (not an odd > number), the

Re: Unicode, SMS and year 2012

2012-04-27 Thread Cristian Secară
În data de Fri, 27 Apr 2012 12:26:25 -0700, Mark Davis ☕ a scris: > Actually, if the goal is to get as many characters in as possible, > Punycode might be the best solution. That is the encoding used for > internationalized domains. In that form, it uses a smaller number of > bytes per character,

Re: Unicode, SMS and year 2012

2012-04-27 Thread Doug Ewell
Anbu Kaveeswarar Selvaraju wrote: > What if one had to send a text in multiple scripts, like in the case > of a text and its translation in the same message? > > I thought maybe a new transition format or a new character encoding > would be suitable. I am currently working on a new form of > repr

RE: Unicode, SMS and year 2012

2012-04-27 Thread Doug Ewell
Mark Davis 🍍 wrote: > Actually, if the goal is to get as many characters in as possible, > Punycode might be the best solution. That is the encoding used for > internationalized domains. In that form, it uses a smaller number of > bytes per character, but a parameterization allows use of all byte

Re: Unicode, SMS and year 2012

2012-04-27 Thread Mark Davis ☕
Actually, if the goal is to get as many characters in as possible, Punycode might be the best solution. That is the encoding used for internationalized domains. In that form, it uses a smaller number of bytes per character, but a parameterization allows use of all byte values.
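As a rough illustration of the size claim, the sketch below compares UTF-8 and Punycode lengths using Python's built-in `punycode` codec. That codec uses the fixed parameters of RFC 3492 (base 36, ASCII letters and digits only), not the all-byte-values parameterization mentioned above, so it only approximates the idea. The sample words and the helper name `compare` are assumptions for illustration.

```python
def compare(text: str) -> tuple:
    """Return (UTF-8 length, Punycode length) in bytes for `text`."""
    utf8 = text.encode("utf-8")
    puny = text.encode("punycode")  # RFC 3492 encoding, without the "xn--" prefix
    return len(utf8), len(puny)


# Hypothetical samples: a mostly-ASCII word and an all-Greek word.
for word in ("b\u00fccher", "\u03b4\u03bf\u03ba\u03b9\u03bc\u03ae"):
    utf8_len, puny_len = compare(word)
    print(f"{word}: UTF-8 {utf8_len} bytes, Punycode {puny_len} bytes")
```

With these fixed parameters the gain shows up for text in a single non-Latin script, where Punycode encodes small deltas between nearby code points. A mostly-ASCII word can come out longer than its UTF-8 form.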

Re: Unicode, SMS and year 2012

2012-04-27 Thread Doug Ewell
Cristian Secară wrote: > It turned out that they (ETSI & its groups) created a way to solve the > 70 characters limitation, namely “National Language Single Shift” and > “National Language Locking Shift” mechanism. This is described in 3GPP > TS 23.038 standard and it was introduced since release

Fwd: Re: Unicode, SMS and year 2012

2012-04-27 Thread anbu
Subject: Fwd: Re: Unicode, SMS and year 2012 Date: Fri, 27 Apr 2012 08:14:13 -0400 From: To: , Please note the following corrections to the mail below: The number of codes supported with a given number of bits, n, is given by: [2 ^ (n ÷ 2)] [n - 4] The total number of codes supported with

Fwd: Re: Unicode, SMS and year 2012

2012-04-27 Thread anbu
- 4] - 64 Original Message Subject: Re: Unicode, SMS and year 2012 Date: Fri, 27 Apr 2012 07:54:41 -0400 From: To: , Hi! I also had the same questions. In addition I had a few more questions, of which the one below is the most significant: What if one had to send a text in

Unicode, SMS and year 2012

2012-04-27 Thread Cristian Secară
A few years ago there was a discussion here about Unicode and SMS (Subject: Unicode, SMS, PDA/cellphones). Then and now the situation is the same, i.e. an SMS text message that uses characters from the GSM character set can include 160 characters per message (stream of 7 bit × 160), whereas a message
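The arithmetic behind the two limits discussed in this thread can be sketched as follows, assuming the usual 140-octet SMS payload and a 16-bit (UCS-2) encoding for messages outside the GSM character set. The constant and function names are illustrative only.

```python
PAYLOAD_OCTETS = 140  # assumed payload size: 160 septets * 7 bits = 1120 bits = 140 octets


def max_chars(bits_per_char: int, payload_octets: int = PAYLOAD_OCTETS) -> int:
    """Number of fixed-width characters that fit in one message payload."""
    return (payload_octets * 8) // bits_per_char


gsm_limit = max_chars(7)    # GSM 7-bit default alphabet
ucs2_limit = max_chars(16)  # 16-bit code units
print(f"GSM 7-bit: {gsm_limit} characters, UCS-2: {ucs2_limit} characters")
```

Both limits come from the same payload: a single character outside the GSM set switches the whole message to 16 bits per character, which is the 70-character limitation referred to elsewhere in the thread.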