Hello everyone:
The discussion threads with the subjects Unicode, SMS, and year 2012
and ece are now closed.
We have received some complaints about intellectual property concerns,
and assertions of IP that were raised in this thread.
All messages in the affected threads have been expunged from
On Sat, Apr 28, 2012 at 6:22 PM, Naena Guru naenag...@gmail.com wrote:
How I see Unicode is as a
set of character groups, 7-bit, 8-bit (extends and replaces 7-bit), 16-bit,
and CJKV that use some sort of 16-bit pairing.
That's one lens to see Unicode through, but in most cases it's
Dracula and other novels aside, there are applications where text volume
definitely matters.
One I've come across in my work is transaction-log filtering. Logs, like
http logs, can generate rather interesting streams of text data, where
the volume easily becomes so large that merely
While the authors of HTML5 had good reasons to ignore SCSU and
BOCU-1, excluding UTF-32, which is the most direct, one-to-one mapping
of Unicode code points to byte values, seems shortsighted. We are talking
about the whole of Unicode, not just the BMP.
/Sz
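The "one-to-one mapping" point can be illustrated with a minimal Python sketch. The sample code point (U+1F600, outside the BMP) and the helper name `utf32_roundtrip` are assumptions chosen for illustration, not taken from the thread. The sketch shows that the four UTF-32 bytes are simply the code point's numeric value, whereas UTF-16 needs a surrogate pair for the same character.

```python
def utf32_roundtrip(code_point):
    """Encode one code point as big-endian UTF-32 and read the bytes back as an integer."""
    data = chr(code_point).encode("utf-32-be")
    return data, int.from_bytes(data, "big")


# U+1F600 is a hypothetical sample outside the BMP.
data, value = utf32_roundtrip(0x1F600)
print(data.hex(), hex(value))

# For comparison: UTF-16 represents the same code point as a surrogate pair.
utf16 = chr(0x1F600).encode("utf-16-be")
print(utf16.hex())
```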
On Sat, Apr 28, 2012 at
On 04/28/2012 07:54 AM, a...@peoplestring.com wrote:
I apologise for my poor explanation. I further assure you that the codes are
not magically created; they are created by the EBNF below. I regenerated the
EBNF to make myself as clear as possible; in fact, there are now two:
On 04/29/2012 12:38 PM, a...@peoplestring.com wrote:
Hi!
I have noticed that I created the previous definitions in a hurry, to
answer the question raised as quickly as possible.
They are incomplete.
I used the EBNF notation to express my encoding.
Please refer to Wikipedia (in Wikipedia,
Szelp, A. Sz. wrote:
Some people are simply opposed to additional encoding schemes. The
HTML5 specification explicitly forbids the use of UTF-32, SCSU, and
BOCU-1 (while allowing many non-Unicode legacy encodings and quietly
mapping others to Windows encodings); one committee member was quoted
On 2012/04/29 18:58, Szelp, A. Sz. wrote:
While the authors of HTML5 had good reasons to ignore SCSU and
BOCU-1, excluding UTF-32, which is the most direct, one-to-one mapping
of Unicode code points to byte values, seems shortsighted.
Well, except that it's hopelessly
How data is transformed to this string is
undefined, which is a problem.
As mentioned in the mail, just as UTF-8 is pre-installed on most
systems, this design would also be pre-installed on the systems intending
to use it. The example given above does not exist anywhere. One needs to
come
increments progressively by two bits.
Please refer to the attached spreadsheet for more comparison of values.
Original Message
Subject: Re: Unicode, SMS and year 2012
Date: Sat, 28 Apr 2012 07:54:02 -0400
From: a...@peoplestring.com
To: freak...@googlemail.com
How data is transformed
On Fri, 27 Apr 2012 11:21:05 -0700
Doug Ewell d...@ewellic.org wrote:
SCSU works equally well, or almost so, with any text sample where the
non-ASCII characters fit into a single block of 128 code points. For
anything other than Latin-1 you need one byte of overhead, to switch
to another
anbu at peoplestring dot com wrote:
Document encoded in SCSU or BOCU-1, given that the document contains
only ASCII characters, may appear corrupt on a system that doesn't
recognise SCSU or BOCU-1.
This is the curious point of view that ASCII compatibility (or
transparency) is a bad thing.
On Sat, 28 Apr 2012 18:55:00 +0100
Richard Wordingham richard.wording...@ntlworld.com wrote:
I wrote:
With SCSU that avoids Unicode mode and UQU whenever possible, most
alphabetic languages work fairly well.
I meant:
With SCSU that avoids Unicode mode and SQU whenever possible, most
Mark Davis wrote:
I suspect the punycode goal is to take a wide character set into a
restricted character set, without caring much on resulting string
length; if the original string happens to be in other character set
than the target restricted character set, then the string length
anbu at peoplestring dot com wrote:
This clearly shows that my design yields more than double the number of
values that UTF-8 does
I didn't know we were competing against UTF-8 on efficiency. That's
easy. UTF-8 is not at all guaranteed to be the most efficient encoding
possible, or even reasonably
I wrote:
0xxxxxxx - encodes U+0000 through U+007F
1xxxxxxx 0xxxxxxx - encodes U+0080 through U+3FFF
1xxxxxxx 1xxxxxxx - encodes U+4000 through U+10FFFF
(and onward to 0x1FFFFF)
Last code sequence should be 1xxxxxxx 1xxxxxxx 0xxxxxxx.
--
Doug Ewell | Thornton, Colorado, USA
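The bit patterns in the message above can be sketched in Python as follows. This is only an illustration under stated assumptions: each byte is taken to carry seven payload bits, the high bit marks "another byte follows", and the most significant group comes first. The function names `encode_cp` and `decode_all` are hypothetical, and the message itself does not specify byte order or error handling.

```python
def encode_cp(cp):
    """Encode one code point with 7 payload bits per byte (assumed layout).

    The high bit is set on every byte except the last one of a sequence.
    """
    if not 0 <= cp <= 0x1FFFFF:
        raise ValueError("value does not fit in three 7-bit groups")
    groups = [cp & 0x7F]
    cp >>= 7
    while cp:
        groups.append(cp & 0x7F)
        cp >>= 7
    groups.reverse()  # most significant group first (an assumption)
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def decode_all(data):
    """Decode a byte string produced by encode_cp into a list of code points."""
    out = []
    value = 0
    pending = False
    for b in data:
        value = (value << 7) | (b & 0x7F)
        if b & 0x80:
            pending = True  # continuation: more bytes belong to this code point
        else:
            out.append(value)
            value = 0
            pending = False
    if pending:
        raise ValueError("input ends in the middle of a sequence")
    return out


text = "aé€\U0001F600"
encoded = b"".join(encode_cp(ord(ch)) for ch in text)
decoded = "".join(chr(cp) for cp in decode_all(encoded))
print(len(text.encode("utf-8")), len(encoded), decoded == text)
```

With this layout, U+0000 through U+007F take one byte, U+0080 through U+3FFF take two, and everything up to 0x1FFFFF takes three, which matches the ranges listed in the message.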
There are many reasons why a new encoding that is merely more efficient
than UTF-8, especially one that sacrifices byte-based processing or
other design features, will face a severe uphill battle in trying to
displace UTF-8.
What are some of the reasons a new encoding will face?
On Sat,
The question should read:
What are some of the reasons a new encoding will face challenges?
Original Message
Subject: Re: Fwd: Re: Unicode, SMS and year 2012
Date: Sat, 28 Apr 2012 15:32:47 -0400
From: a...@peoplestring.com
To: d...@ewellic.org
There are many reasons why
On Sat, 28 Apr 2012 12:53:17 -0600, Doug Ewell wrote:
Not to say this isn’t so, but can you point to a tool or site where a
user can type a string and see the output with different
parameterizations? Pretty much all of the “Convert to Punycode” pages
I see are only able to convert
anbu at peoplestring dot com wrote:
What are some of the reasons a new encoding will face challenges?
The main challenge to a new encoding is that UTF-8 is already present in
numerous applications and operating systems, and that any encoding
intended to serve as an alternative, let alone a
On Sat, 28 Apr 2012 12:41:51 -0600, Doug Ewell wrote:
If I'm going to use a variable-length, non-byte-aligned encoding,
where there is no chance of realigning in case of a flipped or
dropped bit (which seems to be of great concern to many people), I
might as well go ahead and use a
Hi Cristian,
This is a bit of a deviation from the issues you raise, but it relates to
the subject in a different way.
The SMS char set does not seem to follow Unicode. How I see Unicode is as a
set of character groups, 7-bit, 8-bit (extends and replaces 7-bit), 16-bit,
and CJKV that use some
On Friday, April 27, anbu at peoplestring dot com wrote:
In addition I had a few more questions, of which the one below is the
most significant:
What if one had to send a text in multiple scripts, like in the case
of a text and its translation in the same message?
I thought maybe a new
Richard Wordingham wrote:
With SCSU that avoids Unicode mode and UQU whenever possible, most
alphabetic languages work fairly well. However, extra windows are
needed to cover the half-blocks from A480 to ABFF, 15 new codes. If I
were being miserly, I wouldn't cover A500-A5FF.
In November
A few years ago there was a discussion here about Unicode and SMS
(Subject: Unicode, SMS, PDA/cellphones). Then and now the situation is
the same, i.e. an SMS text message that uses characters from the GSM
character set can include 160 characters per message (stream of 7 bit ×
160), whereas a message
- 4] - 64
Original Message
Subject: Re: Unicode, SMS and year 2012
Date: Fri, 27 Apr 2012 07:54:41 -0400
From: a...@peoplestring.com
To: or...@secarica.ro, unicode@unicode.org
Hi!
I also had the same questions.
In addition I had a few more questions, of which the one below
Subject: Fwd: Re: Unicode, SMS and year 2012
Date: Fri, 27 Apr 2012 08:14:13 -0400
From: a...@peoplestring.com
To: or...@secarica.ro, unicode@unicode.org
Please note the following corrections to the mail below:
The number of codes supported with a given number of bits, n, is given by:
[2 ^ (n
Cristian Secară orice at secarica dot ro wrote:
It turned out that they (ETSI and its groups) created a way to solve the
70 characters limitation, namely “National Language Single Shift” and
“National Language Locking Shift” mechanism. This is described in 3GPP
TS 23.038 standard and it was
Actually, if the goal is to get as many characters in as possible, Punycode
might be the best solution. That is the encoding used for internationalized
domains. In that form, it uses a smaller number of bytes per character, but
a parameterization allows use of all byte values.
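A rough way to see the size effect is to compare byte counts in Python. This sketch relies on Python's built-in `punycode` codec, which implements only the domain-name form of the encoding; it does not expose the parameterization mentioned above, so it cannot show the "all byte values" variant. The sample strings and the helper name `sizes` are assumptions chosen for illustration.

```python
def sizes(text):
    """Return (UTF-8 byte count, Punycode byte count) for a string."""
    utf8 = text.encode("utf-8")
    puny = text.encode("punycode")
    # Round-trip check: the codec must give back the original string.
    assert puny.decode("punycode") == text
    return len(utf8), len(puny)


# Hypothetical samples: one mostly-ASCII word and one all-Cyrillic word.
for sample in ["bücher", "пример"]:
    print(sample, sizes(sample))
```

For the mostly-ASCII sample the domain-name form is not shorter than UTF-8, while for the all-Cyrillic sample it is, which suggests the result depends heavily on the script of the text.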
Mark Davis mark at macchiato dot com wrote:
Actually, if the goal is to get as many characters in as possible,
Punycode might be the best solution. That is the encoding used for
internationalized domains. In that form, it uses a smaller number of
bytes per character, but a parameterization
Anbu Kaveeswarar Selvaraju anbu at peoplestring dot com wrote:
What if one had to send a text in multiple scripts, like in the case
of a text and its translation in the same message?
I thought maybe a new transition format or a new character encoding
would be suitable. I am currently working
On Fri, 27 Apr 2012 12:26:25 -0700, Mark Davis ☕ wrote:
Actually, if the goal is to get as many characters in as possible,
Punycode might be the best solution. That is the encoding used for
internationalized domains. In that form, it uses a smaller number of
bytes per character, but
Hi
On 2012/04/28 00:23, a...@peoplestring.com wrote:
1. let 'x' be the position of a code positioned at an odd number eg when
we take the code '1001010110', the first '1' is positioned at location '1'
(so an odd number), the first '0' is positioned at location '2' (not an odd
number), the
That is not correct. One of the chief reasons that punycode was selected
was the reduction in size. Tests with the idnbrowser are not relevant. As I
said:
In that form, it uses a smaller number of
bytes per character, but a parameterization allows use of all byte
values.
That is, the
On 2012/04/28 4:26, Mark Davis ☕ wrote:
Actually, if the goal is to get as many characters in as possible, Punycode
might be the best solution. That is the encoding used for internationalized
domains. In that form, it uses a smaller number of bytes per character, but
a parameterization allows
On 2012/04/28 7:29, Cristian Secară wrote:
On Fri, 27 Apr 2012 12:26:25 -0700, Mark Davis ☕ wrote:
Actually, if the goal is to get as many characters in as possible,
Punycode might be the best solution. That is the encoding used for
internationalized domains. In that form, it uses a
On Fri, 27 Apr 2012 17:28:13 -0700, Mark Davis ☕ wrote:
That is not correct. One of the chief reasons that punycode was
selected was the reduction in size. Tests with the idnbrowser are not
relevant. As I said:
In that form, it uses a smaller number of
bytes per character, but a
On 2012/04/27 17:06, Cristian Secară wrote:
It turned out that they (ETSI and its groups) created a way to solve the
70 characters limitation, namely “National Language Single Shift” and
“National Language Locking Shift” mechanism. This is described in 3GPP
TS 23.038 standard and it was introduced