Hello everyone:
The discussion threads with the subjects "Unicode, SMS, and year 2012"
and "ece" are now closed.
We have received some complaints about intellectual property concerns,
and assertions of IP that were raised in this thread.
All messages in the affected thread
Dracula and other novels aside, there are applications where text volume
definitely matters.
One I've come across in my work is transaction-log filtering. Logs, like
http logs, can generate rather interesting streams of text data, where
the volume easily becomes so large that merely attempting
On Sat, Apr 28, 2012 at 6:22 PM, Naena Guru wrote:
> How I see Unicode is as a
> set of character groups, 7-bit, 8-bit (extends and replaces 7-bit), 16-bit,
> and CJKV that use some sort of 16-bit pairing.
That's one lens to see Unicode through, but in most cases it's
substantially distorting. Uni
On 2012/04/29 18:58, Szelp, A. Sz. wrote:
While the authors of HTML5 had good reasons to ignore SCSU or BOCU-1,
excluding UTF-32, which is the most direct, one-to-one mapping of Unicode
code points to byte values, seems shortsighted.
Well, except that it's hopelessly inefficien
Szelp, A. Sz. wrote:
Some people are simply opposed to additional encoding schemes. The
HTML5 specification explicitly forbids the use of UTF-32, SCSU, and
BOCU-1 (while allowing many non-Unicode legacy encodings and quietly
mapping others to Windows encodings); one committee member was quoted
a
On 04/29/2012 12:38 PM, a...@peoplestring.com wrote:
> Hi!
>
> I have noticed that I have created the previous definitions in a hurry to
> answer the question raised, as quick as possible.
> They are incomplete.
> I used the EBNF notation to express my encoding.
>
> Please refer Wikipedia (in Wikip
On 04/28/2012 07:54 AM, a...@peoplestring.com wrote:
>> I apologise for my poor explanation. I further assure, the codes are not
>> magically created, they are created by the EBNF below. I regenerated the
>> EBNF to make me as clear as possible, in fact, now they are two:
>>
>> 1(0|1){1(0|1)}{0(0|1
While the authors of HTML5 had good reasons to ignore SCSU or BOCU-1,
excluding UTF-32, which is the most direct, one-to-one mapping of Unicode
code points to byte values, seems shortsighted. We are talking
about the whole of Unicode, not just BMP.
/Sz
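The size trade-off behind this point can be checked with a short Python sketch. The sample string and the helper name `encoded_sizes` are illustrative assumptions, not taken from the thread; the `-le` codec variants are used so that no byte order mark is counted.

```python
# Sample text (an assumption for illustration): one ASCII letter, one
# Latin-1 letter, one CJK ideograph and one character outside the BMP.
text = "A\u00e9\u4e2d\U0001F600"

def encoded_sizes(s):
    """Return the encoded length in bytes of s for three Unicode encoding forms."""
    return {name: len(s.encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}

sizes = encoded_sizes(text)
for name, size in sizes.items():
    print(f"{name}: {size} bytes for {len(text)} code points")
```

UTF-32 always spends four bytes per code point, which is what makes the mapping one-to-one and also what makes it larger than the other two forms for this sample.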
On Sat, Apr 28, 2012 at 2
Richard Wordingham wrote:
With SCSU that avoids Unicode mode and UQU whenever possible, most
alphabetic languages work fairly well. However, extra windows are
needed to cover the half-blocks from A480 to ABFF, 15 new codes. If I
were being miserly, I wouldn't cover A500-A5FF.
In November 201
On Friday, April 27, wrote:
In addition I had a few more questions, of which the one below is the
most significant:
What if one had to send a text in multiple scripts, like in the case
of a text and its translation in the same message?
I thought maybe a new transition format or a new character
Hi Cristian,
This is a bit of a deviation from the issues you raise, but it relates to
the subject in a different way.
The SMS char set does not seem to follow Unicode. How I see Unicode is as a
set of character groups, 7-bit, 8-bit (extends and replaces 7-bit), 16-bit,
and CJKV that use some sor
În data de Sat, 28 Apr 2012 12:41:51 -0600, Doug Ewell a scris:
> If I'm going to use a variable-length, non-byte-aligned encoding,
> where there is no chance of realigning in case of a flipped or
> dropped bit (which seems to be of great concern to many people), I
> might as well go ahead and use
wrote:
What are some of the reasons a new encoding will face challenges?
The main challenge to a new encoding is that UTF-8 is already present in
numerous applications and operating systems, and that any encoding
intended to serve as an alternative, let alone a replacement for UTF-8, must
be "
În data de Sat, 28 Apr 2012 12:53:17 -0600, Doug Ewell a scris:
> Not to say this isn’t so, but can you point to a tool or site where a
> user can type a string and see the output with different
> parameterizations? Pretty much all of the “Convert to Punycode” pages
> I see are only able to conver
The question should read:
What are some of the reasons a new encoding will face challenges?
Original Message
Subject: Re: Fwd: Re: Unicode, SMS and year 2012
Date: Sat, 28 Apr 2012 15:32:47 -0400
From:
To:
> There are many reasons why a new encoding that is merely more efficient
> than UTF-8, especially one that sacrifices byte-based processing or
> other design features, will face a severe uphill battle in trying to
> displace UTF-8.
What are some of the reasons a new encoding will face?
On Sat
I wrote:
0xxx - encodes U+0000 through U+007F
1xxx 0xxx - encodes U+0080 through U+3FFF
1xxx 1xxx - encodes U+4000 through U+10
(and onward to 0x1F)
Last code sequence should be 1xxx 1xxx 0xxx.
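
A minimal Python sketch of the byte patterns listed above. It assumes each byte carries a continuation flag in its top bit and seven payload bits; that width is inferred from the listed ranges (U+007F and U+3FFF are the largest 7-bit and 14-bit values) and is not stated in the message as archived. The names `encode_cp` and `decode_cp` are hypothetical.

```python
def encode_cp(cp):
    """Encode one code point as big-endian 7-bit groups.

    The top bit is set on every byte except the last one (assumed layout).
    """
    if not 0 <= cp <= 0x1FFFFF:
        raise ValueError("value does not fit in three 7-bit groups")
    groups = [cp & 0x7F]
    cp >>= 7
    while cp:
        groups.append(cp & 0x7F)
        cp >>= 7
    groups.reverse()
    # Flag all leading bytes as "more follows"; leave the final byte unflagged.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])

def decode_cp(data):
    """Decode the bytes of a single encoded code point back to an integer."""
    value = 0
    for b in data:
        value = (value << 7) | (b & 0x7F)
    return value

for cp in (0x7F, 0x80, 0x3FFF, 0x4000, 0x10FFFF):
    print(f"U+{cp:04X} -> {encode_cp(cp).hex()}")
```

Under this assumption the length changes exactly at U+0080 and U+4000, matching the ranges given in the message.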
--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.
wrote:
This clearly shows that my design yields number of values more than
double that of UTF8
I didn't know we were competing against UTF-8 on efficiency. That's
easy. UTF-8 is not at all guaranteed to be the most efficient encoding
possible, or even reasonably possible. It was originally
Mark Davis 🚙 wrote:
>> I suspect the punycode goal is to take a wide character set into a
>> restricted character set, without caring much on resulting string
>> length; if the original string happens to be in other character set
>> than the target restricted character set, then the string length
On Sat, 28 Apr 2012 18:55:00 +0100
Richard Wordingham wrote:
I wrote:
> With SCSU that avoids Unicode mode and UQU whenever possible, most
> alphabetic languages work fairly well.
I meant:
"With SCSU that avoids Unicode mode and SQU whenever possible, most
alphabetic languages work fairly wel
wrote:
Document encoded in SCSU or BOCU-1, given that the document contains
only ASCII characters, may appear corrupt on a system that doesn't
recognise SCSU or BOCU-1.
This is the curious point of view that ASCII compatibility (or
transparency) is a bad thing. It does not apply to BOCU-1, w
On Fri, 27 Apr 2012 11:21:05 -0700
"Doug Ewell" wrote:
> SCSU works equally well, or almost so, with any text sample where the
> non-ASCII characters fit into a single block of 128 code points. For
> anything other than Latin-1 you need one byte of overhead, to switch
> to another window, and for
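
The "single block of 128 code points" condition described in the quoted text can be tested with a small Python sketch. It is a simplification under stated assumptions: blocks are taken to be aligned on multiples of 0x80, which is narrower than the window positions SCSU actually permits, and the function name `single_window_base` is hypothetical. It does not perform any SCSU encoding.

```python
def single_window_base(text):
    """Return the base of the aligned 128-code-point block that holds every
    non-ASCII character of text, or None if there is no such single block
    (either the text is pure ASCII or it spans several blocks)."""
    bases = {ord(ch) & ~0x7F for ch in text if ord(ch) > 0x7F}
    if len(bases) == 1:
        return bases.pop()
    return None

for sample in ("caf\u00e9", "\u041f\u0440\u0438\u0432\u0435\u0442", "\u041f\u0440\u0438\u0432\u0435\u0442 caf\u00e9"):
    base = single_window_base(sample)
    print(sample, "->", "no single block" if base is None else f"block at U+{base:04X}")
```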
increments progressively by two bits.
Please refer to the attached spreadsheet for a further comparison of values.
Original Message
Subject: Re: Unicode, SMS and year 2012
Date: Sat, 28 Apr 2012 07:54:02 -0400
From:
To:
> How data is transformed to this string is
> undefined, which is a problem.
As mentioned in the mail, just as UTF-8 is pre-installed on most
systems, this design would also be pre-installed on systems intending
to use it. The example given above does not exist anywhere. One needs to
come
On 2012/04/27 17:06, Cristian Secară wrote:
It turned out that they (ETSI & its groups) created a way to solve the
70 characters limitation, namely “National Language Single Shift” and
“National Language Locking Shift” mechanism. This is described in 3GPP
TS 23.038 standard and it was introduced
În data de Fri, 27 Apr 2012 17:28:13 -0700, Mark Davis ☕ a scris:
> That is not correct. One of the chief reasons that punycode was
> selected was the reduction in size. Tests with the idnbrowser are not
> relevant. As I said:
>
> > In that form, it uses a smaller number of
> > bytes per character
On 2012/04/28 7:29, Cristian Secară wrote:
În data de Fri, 27 Apr 2012 12:26:25 -0700, Mark Davis ☕ a scris:
Actually, if the goal is to get as many characters in as possible,
Punycode might be the best solution. That is the encoding used for
internationalized domains. In that form, it uses a s
On 2012/04/28 4:26, Mark Davis ☕ wrote:
Actually, if the goal is to get as many characters in as possible, Punycode
might be the best solution. That is the encoding used for internationalized
domains. In that form, it uses a smaller number of bytes per character, but
a parameterization allows use
That is not correct. One of the chief reasons that punycode was selected
was the reduction in size. Tests with the idnbrowser are not relevant. As I
said:
> In that form, it uses a smaller number of
> bytes per character, but a parameterization allows use of all byte
> values.
That is, the paramet
Hi
On 2012/04/28 00:23, a...@peoplestring.com wrote:
> 1. let 'x' be the position of a code positioned at an odd number eg when
> we take the code '1001010110', the first '1' is positioned at location '1'
> (so an odd number), the first '0' is positioned at location '2' (not an odd
> number), the
În data de Fri, 27 Apr 2012 12:26:25 -0700, Mark Davis ☕ a scris:
> Actually, if the goal is to get as many characters in as possible,
> Punycode might be the best solution. That is the encoding used for
> internationalized domains. In that form, it uses a smaller number of
> bytes per character,
Anbu Kaveeswarar Selvaraju wrote:
> What if one had to send a text in multiple scripts, like in the case
> of a text and its translation in the same message?
>
> I thought maybe a new transition format or a new character encoding
> would be suitable. I am currently working on a new form of
> repr
Mark Davis 🍍 wrote:
> Actually, if the goal is to get as many characters in as possible,
> Punycode might be the best solution. That is the encoding used for
> internationalized domains. In that form, it uses a smaller number of
> bytes per character, but a parameterization allows use of all byte
Actually, if the goal is to get as many characters in as possible, Punycode
might be the best solution. That is the encoding used for internationalized
domains. In that form, it uses a smaller number of bytes per character, but
a parameterization allows use of all byte values.
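
A rough way to try the size claim is Python's built-in `punycode` codec, which implements Punycode with the standard parameters used for internationalized domain names. The sample strings below are illustrative assumptions, and the alternative parameterization mentioned above (using all byte values) is not something this codec exposes.

```python
def compare_sizes(s):
    """Return (UTF-8 byte count, Punycode byte count) for the string s."""
    return len(s.encode("utf-8")), len(s.encode("punycode"))

# Samples chosen for illustration only: a mostly-ASCII word and a Cyrillic word.
for sample in ("b\u00fccher", "\u043f\u0440\u0438\u043c\u0435\u0440"):
    utf8_len, puny_len = compare_sizes(sample)
    encoded = sample.encode("punycode")
    # Round-trip to confirm that nothing is lost.
    assert encoded.decode("punycode") == sample
    print(f"{sample}: UTF-8 {utf8_len} bytes, Punycode {puny_len} bytes ({encoded.decode('ascii')})")
```

In this form the gain depends on the text: a word that is almost entirely ASCII can come out longer than its UTF-8 form, while a word written wholly in one non-Latin script tends to come out shorter.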
Cristian Secară wrote:
> It turned out that they (ETSI & its groups) created a way to solve the
> 70 characters limitation, namely “National Language Single Shift” and
> “National Language Locking Shift” mechanism. This is described in 3GPP
> TS 23.038 standard and it was introduced since release
Subject: Fwd: Re: Unicode, SMS and year 2012
Date: Fri, 27 Apr 2012 08:14:13 -0400
From:
To: ,
Please note the following corrections to the mail below:
The number of codes supported with a given number of bits, n, is given by:
[2 ^ (n ÷ 2)] [n - 4]
The total number of codes supported with
- 4] - 64
Original Message
Subject: Re: Unicode, SMS and year 2012
Date: Fri, 27 Apr 2012 07:54:41 -0400
From:
To: ,
Hi!
I also had the same questions.
In addition I had a few more questions, of which the one below is the most
significant:
What if one had to send a text in
A few years ago there was a discussion here about Unicode and SMS
(Subject: Unicode, SMS, PDA/cellphones). Then and now the situation is
the same, i.e. an SMS text message that uses characters from the GSM
character set can include 160 characters per message (stream of 7 bit ×
160), whereas a message
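
The arithmetic behind these limits can be sketched in a few lines of Python. The payload size comes from the 7 bit × 160 figure above; the 16-bit case is an assumption (one 16-bit code unit per character) that matches the 70-character limitation mentioned elsewhere in the thread. The names `PAYLOAD_BITS` and `max_chars` are hypothetical.

```python
# Payload of one message, taken from the figure above: 160 characters of 7 bits.
PAYLOAD_BITS = 160 * 7

def max_chars(bits_per_char):
    """Return how many fixed-width characters fit into one message payload."""
    return PAYLOAD_BITS // bits_per_char

print("payload:", PAYLOAD_BITS // 8, "octets")
print("7-bit characters:", max_chars(7))
print("16-bit characters (assumed width):", max_chars(16))
```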