Re: Unicode, SMS and year 2012
Hello everyone: The discussion threads with the subjects "Unicode, SMS and year 2012" and "ece" are now closed. We have received some complaints about intellectual property concerns and assertions of IP that were raised in these threads. All messages in the affected threads have been expunged from the mailing list archives. We apologize for any inconvenience this may cause. Regards from your -- Sarasvati
Re: Unicode, SMS and year 2012
On Sat, Apr 28, 2012 at 6:22 PM, Naena Guru naenag...@gmail.com wrote:

> How I see Unicode is as a set of character groups: 7-bit, 8-bit (extends and replaces 7-bit), 16-bit, and CJKV, which uses some sort of 16-bit pairing.

That's one lens to see Unicode through, but in most cases it's substantially distorting. Unicode is a set of 1,112,064 characters, divided up into a flat section of 55,296 characters, a break of 2,048 non-characters, and then another 1,056,768 characters. There are a number of other ways to view it, but there's no guarantee that U+0370 won't be filled with an Egyptian hieroglyph, and any view of Unicode that assumes it won't is thus not a correct view.

> As Unicode says, they are just numeric codes assigned to letters or whatever other ideas. It is the task of the devices to decide what they are and show them.

That is the concept of a character encoding. It has continued to exist since the first days of computing because plain text seems to encode something important and distinct from higher levels.

> It shows perfectly when 'dressed' with a smartfont.

Except in IE, one of the most common browsers on the market. Except to anyone using a screen reader.

> It takes about half the bandwidth to transmit compared to the double-byte set.

Who cares? SMS's restrictions are not technical ones. G.711, the most common digital compression for telephony, uses 8 kB per second.* One byte per character or two, that's faster than you can type. Outside telephony, plain text is trivial; long novels, like Dracula, come in at under a MB, and download instantaneously for me--partially because it's automatically gzipped down to 330 KB. Even on not-so-good connections, the time taken to download a full novel is nowhere near the time needed to read it, is always a fraction of the time needed to download a song, and is less than 1% of the time needed to download a TV show. http://www.lovatasinhala.com/ is 4 kB of text and 8 kB of images.
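The split of the code space quoted above can be checked directly. A minimal sketch (this only verifies the arithmetic, using the standard Unicode code-space boundaries):

```python
# Unicode code space: U+0000..U+10FFFF, minus the surrogate range U+D800..U+DFFF,
# which can never appear in interchange.
flat_section = 0xD800              # U+0000..U+D7FF  -> 55,296 scalar values
surrogate_gap = 0xE000 - 0xD800    # U+D800..U+DFFF  -> 2,048 excluded code points
remainder = 0x110000 - 0xE000      # U+E000..U+10FFFF -> 1,056,768 scalar values

print(flat_section, surrogate_gap, remainder, flat_section + remainder)
# -> 55296 2048 1056768 1112064
```

The two usable sections sum to the 1,112,064 scalar values cited in the post.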
The costs you're trying to impose on everyone to save 4 kB just aren't worth it, especially as you're sending 177 kB of font to avoid it.

* Before anyone starts to mention kb = kilobytes: yes, 64 kilobits/sec = 8 kilobytes/sec.

> In the small market of Singhala, no font is present that goes typographically well with Arial Unicode. There is no incentive or money to make beautiful fonts for a minority language like Singhala.

I'm sorry; unfortunately, that's what's known as a Hard Problem. There is nothing any character encoding can do about that.

> I hope both the mobile device industry and the PC side separate fonts and characters and allow the users to decide the default font sets in their devices.

It'd be nice, but that doesn't have much to do with Unicode.

> This is eminently rational because the rendering of the font happens locally, whereas the characters travel across the network.

I don't see the connection. The font is almost always local, whether or not it's user-selectable.

> This will also help those who, like me, understand that their language is better served by a transliteration solution than by a convoluted double-byte solution that discourages the natives from using their script.

I see no evidence that using an industry-standard solution that treats all scripts equally discourages people from using the script. I do think that "Please get a browser that keeps with times" discourages people. -- Kie ekzistas vivo, ekzistas espero.
Re: Unicode, SMS and year 2012
Dracula and other novels aside, there are applications where text volume definitely matters. One I've come across in my work is transaction-log filtering. Logs, like HTTP logs, can generate rather interesting streams of text data, where the volume easily becomes so large that merely attempting to convert between character encoding forms can become prohibitively costly in a given implementation. E-mail and novels may be produced and consumed at human-limited rates, but the same is not true for all data streams of text or text-like data. Just something to keep in mind, A./
Re: Unicode, SMS and year 2012
While the authors of HTML5 brought good reasons to ignore SCSU or BOCU-1, having excluded UTF-32, the most direct, one-to-one mapping of Unicode code points to byte values, seems shortsighted. We are talking about the whole of Unicode, not just the BMP. /Sz

On Sat, Apr 28, 2012 at 21:48, Doug Ewell d...@ewellic.org wrote:

> Some people are simply opposed to additional encoding schemes. The HTML5 specification explicitly forbids the use of UTF-32, SCSU, and BOCU-1 (while allowing many non-Unicode legacy encodings and quietly mapping others to Windows encodings); one committee member was quoted as saying that other encodings of Unicode waste developer time.
Re: Unicode, SMS and year 2012
On 04/28/2012 07:54 AM, a...@peoplestring.com wrote:

> I apologise for my poor explanation. I further assure you, the codes are not magically created; they are created by the EBNF below. I regenerated the EBNF to make myself as clear as possible; in fact, now there are two:
> 1(0|1){1(0|1)}{0(0|1)}0(0|1)1(0|1)
> 1(0|1){0(0|1)}{1(0|1)}1(0|1)0(0|1)

These oft-repeated incomprehensible strings of symbols would be a whole lot more intuitively understandable if, say, you were to use a _different_ symbol for either 0 or 1 and not (0|1) (and maybe some spaces to split it up for the eye), and/or there were an actual *explanation* of what they meant, as in:

1 X {1X}... {0X}... 0 X 1 X
1 X {0X}... {1X}... 1 X 0 X

and words like "The bits in odd-numbered positions [counting from zero] can be either value and hold the data being transferred; in the even-numbered positions, the first [zeroth] bit is 1, followed either by a string of 1s, then 0s, ending with 0 1; or else a string of 0s, then 1s, ending with 1 0." Or something like that, maybe done better. My eyes glaze over at the sight of what looks like a random selection out of [{}10|()]*, and I'm probably not the only one. ~mark
Re: Unicode, SMS and year 2012
On 04/29/2012 12:38 PM, a...@peoplestring.com wrote:

> Hi! I have noticed that I created the previous definitions in a hurry, to answer the question raised as quickly as possible. They are incomplete. I used EBNF notation to express my encoding. Please refer to Wikipedia (especially the 'Table of symbols') or other sources on EBNF: http://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form#Table_of_symbols I am creating a well-defined one.

Yes, I know about EBNF notation. I didn't say it was wrong. I just said it would be a lot easier to follow and understand. ~mark
Re: Unicode, SMS and year 2012
Szelp, A. Sz. wrote:

>> Some people are simply opposed to additional encoding schemes. The HTML5 specification explicitly forbids the use of UTF-32, SCSU, and BOCU-1 (while allowing many non-Unicode legacy encodings and quietly mapping others to Windows encodings); one committee member was quoted as saying that other encodings of Unicode waste developer time.

> While the authors of HTML5 brought good reasons to ignore SCSU or BOCU-1, having excluded UTF-32, the most direct, one-to-one mapping of Unicode code points to byte values, seems shortsighted. We are talking about the whole of Unicode, not just the BMP.

All UTFs (8, 16, 32) can represent all of Unicode, as can SCSU. The only Unicode encoding that can represent only the BMP is UCS-2, which AFAIK is no longer endorsed by the UTC. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
Re: Unicode, SMS and year 2012
On 2012/04/29 18:58, Szelp, A. Sz. wrote:

> While the authors of HTML5 brought good reasons to ignore SCSU or BOCU-1, having excluded UTF-32, the most direct, one-to-one mapping of Unicode code points to byte values, seems shortsighted.

Well, except that it's hopelessly inefficient, and therefore essentially nobody is using it.

> We are talking about the whole of Unicode, not just the BMP.

Yes. For transmission, use UTF-8 (or maybe UTF-16). Regards, Martin.
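Martin's efficiency point is easy to quantify. A small sketch comparing the three encoding forms on a short ASCII-only string (the sample text is arbitrary):

```python
# For ASCII-heavy text, UTF-8 uses 1 byte per character, UTF-16 uses 2,
# and UTF-32 a fixed 4 bytes per code point.
s = "Unicode, SMS and year 2012"
sizes = {codec: len(s.encode(codec)) for codec in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)  # -> {'utf-8': 26, 'utf-16-le': 52, 'utf-32-le': 104}
```

UTF-32 quadruples the size of ASCII text for no gain in expressiveness, which is the usual argument against it as a transmission format.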
Re: Unicode, SMS and year 2012
On Sat, 28 Apr 2012 01:46:58 +0200, Robert Abel freak...@googlemail.com wrote:

> How data is transformed to this string is undefined, which is a problem.

As mentioned in the mail, just like UTF-8 is pre-installed in most systems, this design would also be pre-installed in the systems intending to use it. The example given above does not exist anywhere. One needs to come up with the correct mapping based on frequency of use; let's say all ANSI characters not encoded in the eight bits would be encoded in 10 bits (instead of the 16 bits of UTF-8), all the Cyrillic characters would be encoded in either 10 or 12 bits (instead of the 16 bits of UTF-8), all the Tamil characters would be assigned 18 bits (instead of the 24 bits of UTF-8), and so on. The above are possibilities. We assign each character of the latter and former scripts to a code point in their specified range (please note that this is not yet done and possibly not the best; the example in the previous mail is just a random assumption for conceptualisation, not based on any theory). We generate a mapping something like this. If we go by assigning all ANSI, then Cyrillic, then the next most suitable, and so on, most of the population would be covered.

> Code words starting with an initial 1 code variable-length values, which are magically created.

As noted above, they are not going to be magically created (once the design is complete); codes from this design need to be predefined to characters. Please note that this encoding is a work in progress, so I am still working on ways to assign the generated codes to the characters. Maybe after I have completed that, you may get a clearer picture of what I want to do.

> * Code words starting with an initial 0 code literal 7-bit ASCII values, which follow the initial zero bit: 0MXX XXXL, where M and L are the MSB and LSB of the respective ASCII value.

Thanks! This is what I wanted to suggest here. No correction to this.

> Code words starting with an initial 1 code variable-length values, which are magically created. Read N bits until a 1 bit is encountered (inclusive) on an even position within the bit string (where the position of the initial code word bit is 0) following a 0 bit on an even position. The complete word is N+2 bits long, including the initial 1 bit.

I apologise for my poor explanation. I further assure you, the codes are not magically created; they are created by the EBNF below. I regenerated the EBNF to make myself as clear as possible; in fact, now there are two:

1(0|1){1(0|1)}{0(0|1)}0(0|1)1(0|1)
1(0|1){0(0|1)}{1(0|1)}1(0|1)0(0|1)

All the codes produced, and only the codes produced, by either of the EBNFs are valid. That is to say, a code produced independently from the first EBNF is valid; similarly, a code produced independently by the second EBNF is also valid. There is one constraint on these EBNFs: at any given point, the code (sentence) produced must always be greater than 8 bits. That is, repeat any of the parts inside the curly braces {} until the code is at least 10 bits.

* Code words starting with an initial 1 code variable-length values, which are created from either of the above EBNFs. Read N bits until a 1 bit is encountered (inclusive) on an even position within the bit string (where the position of the initial code word bit is 0) following a 0 bit on an even position [this statement is correct and valid only if the bit in the third position (position 2, an even position) is a 1 bit]. If the bit in the third position (position 2, an even position) is a 0 bit, then read N bits until a 0 bit is encountered (inclusive) on an even position within the bit string following a 1 bit on an even position other than the first position (position 0). The complete word is N+2 bits long, including the initial 1 bit.

> Also, I wonder how efficiently your encoding can code general texts... Seeing as how your 10-bit codes can only code 192 out of 512 possible values, 12-bit codes only 512 out of 2048 values, and so on... This means you will have a massive amount of bits for rare-ish characters sooner or later...

As for the number of possible values, you are underestimating the future codes. The number of characters at (and the total number of characters up to) 8 bits is 128 values. The actual formula for the number of values at exactly a given width, for widths greater than 8 bits, is: (number of bits - 4) x 2^(number of bits / 2)

8 bits - 128 values (cumulative: 128 values)
10 bits - 192 values (cumulative: 320 values)
12 bits - 512 values (cumulative: 704 values)
14 bits - 1280 values (cumulative: 1792 values)
16 bits - 3072 values (cumulative: 4352; this is double what UTF-8 provides = 128 (Basic Latin) + 1024 (all the 16-bit codes of UTF-8 count to this))

Thank you for your time. Please contact me if you need more clarification; I am always willing to clarify on this. Regards, Anbu
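The two EBNF productions and the per-width counts under discussion can be checked mechanically. A sketch (my own translation of the EBNF into regular expressions; treat it as one reading of the grammar, not a reference implementation):

```python
import re

# EBNF 1: 1(0|1){1(0|1)}{0(0|1)}0(0|1)1(0|1)
# EBNF 2: 1(0|1){0(0|1)}{1(0|1)}1(0|1)0(0|1)
ebnf1 = re.compile(r"1[01](?:1[01])*(?:0[01])*0[01]1[01]")
ebnf2 = re.compile(r"1[01](?:0[01])*(?:1[01])*1[01]0[01]")

def count(nbits):
    """Count the nbits-long words generated by either grammar."""
    total = 0
    for v in range(2 ** (nbits - 1), 2 ** nbits):  # all words with a leading 1 bit
        w = format(v, "b")
        if ebnf1.fullmatch(w) or ebnf2.fullmatch(w):
            total += 1
    return total

print(count(10), count(12))  # -> 192 512
```

Brute-force enumeration reproduces the 192 (10-bit) and 512 (12-bit) figures quoted in the thread, and the two grammars turn out to be disjoint (grammar 1 ends its even-position bits in 1, grammar 2 in 0), so nothing is double-counted.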
Fwd: Re: Unicode, SMS and year 2012
Please note some corrections and additions in the comparison of values. My design provides the following number of values for the specified number of bits:

8 bits - 128 values (cumulative: 128 values)
10 bits - 192 values (cumulative: 320 values)
12 bits - 512 values (cumulative: 832 values)
14 bits - 1280 values (cumulative: 2112 values)
16 bits - 3072 values (cumulative: 5184 values)
Note: UTF-8 has 2048 values of 16 bits (cumulative: 2176). This clearly shows that my design yields more than double the number of values of UTF-8.
18 bits - 7168 values (cumulative: 12352 values)

and so on. At any given number of bits, my design yields more (with the sole exception of 48 bits/6 bytes, where UTF-8 yields more values than my design; but at the immediately next possible width, 50 bits, my design resumes its trajectory of having more values than UTF-8). Another advantage is that my design increments progressively by two bits. Please refer to the attached spreadsheet for more comparison of values.

-------- Original Message -------- Subject: Re: Unicode, SMS and year 2012 Date: Sat, 28 Apr 2012 07:54:02 -0400 From: a...@peoplestring.com To: freak...@googlemail.com

> How data is transformed to this string is undefined, which is a problem.

As mentioned in the mail, just like UTF-8 is pre-installed in most systems, this design would also be pre-installed in the systems intending to use it. The example given above does not exist anywhere. One needs to come up with the correct mapping based on frequency of use; let's say all ANSI characters not encoded in the eight bits would be encoded in 10 bits (instead of the 16 bits of UTF-8), all the Cyrillic characters would be encoded in either 10 or 12 bits (instead of the 16 bits of UTF-8), all the Tamil characters would be assigned 18 bits (instead of the 24 bits of UTF-8), and so on. The above are possibilities.
Re: Unicode, SMS and year 2012
On Fri, 27 Apr 2012 11:21:05 -0700, Doug Ewell d...@ewellic.org wrote:

> SCSU works equally well, or almost so, with any text sample where the non-ASCII characters fit into a single block of 128 code points. For anything other than Latin-1 you need one byte of overhead, to switch to another window, and for many scripts you need two, to define a window and switch to it. But again, two bytes is not what's holding anyone up.

With SCSU that avoids Unicode mode and UQU whenever possible, most alphabetic languages work fairly well. However, extra windows are needed to cover the half-blocks from A480 to ABFF: 15 new codes. If I were being miserly, I wouldn't cover A500-A5FF. SCSU doesn't work well with large syllabaries, especially if they include a lot of unused characters within the half-blocks used. Inuit suffers badly from this, but still achieves noticeable compression. I experimented with compressing Yi transposed to a covered range, and found that it achieved something like 10% compression. Yi suffers from needing the 8 dynamic windows to be switched between 10 half-blocks (with occasional excursions to an 11th). If the Yi characters had been arranged by tone first and initial consonant second, 2 of the half-blocks would never have been used in my sample! Vai (A500-A63F) fits in 3 half-blocks, and I would expect non-Vai characters in it to be in static blocks. Given how well Yi performed, I expect Vai to benefit from SCSU. Has anyone investigated the performance of SCSU with Cuneiform or Egyptian Hieroglyphs? It might achieve better than 50% compression! A fair comparison for Egyptian Hieroglyphs depends on the mark-up used, for Unicode on its own does not enable one to write reasonable Middle Egyptian. Richard.
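The half-block notion above is easy to state concretely: SCSU's dynamic windows each cover a run of 128 code points, so a script compresses well when its non-ASCII characters cluster into few such half-blocks. A rough sketch of that test (this illustrates the windowing idea only; it is not an SCSU implementation):

```python
def half_blocks(text):
    """Return the set of 128-code-point half-blocks used by non-ASCII characters."""
    return {ord(c) & ~0x7F for c in text if ord(c) > 0x7F}

# A short Cyrillic sample clusters into a single half-block (U+0400..U+047F),
# so SCSU needs at most one window; a large syllabary spreads over many.
print(len(half_blocks("Привет")))  # -> 1
```

Counting `half_blocks` over a corpus gives a quick estimate of how many window definitions and switches SCSU would need for a given script.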
Re: Unicode, SMS and year 2012
anbu at peoplestring dot com wrote:

> A document encoded in SCSU or BOCU-1, given that the document contains only ASCII characters, may appear corrupt on a system that doesn't recognise SCSU or BOCU-1.

This is the curious point of view that ASCII compatibility (or transparency) is a bad thing. It does not apply to BOCU-1, which is not ASCII-transparent. Documents encoded in *any* format are likely to appear corrupt on a system that doesn't recognize the encoding. They are guaranteed to appear corrupt if character boundaries do not align with byte boundaries, which is what you propose here:

0111100101011001101110100101010110011000101010100101011101110101

If I'm going to use a variable-length, non-byte-aligned encoding, where there is no chance of realigning in case of a flipped or dropped bit (which seems to be of great concern to many people), I might as well go ahead and use a Huffman or LZ type of encoding (or a combination, like DEFLATE). Is this the same encoding you were proposing a little over a year ago, or an outgrowth of the same ideas? -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
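The DEFLATE alternative mentioned above is easy to try: a general-purpose compressor already handles repetitive text well without inventing a new bit-aligned character encoding. A small sketch using DEFLATE via Python's zlib (the sample text is arbitrary):

```python
import zlib

# Repetitive ASCII text, the easy case for an LZ77 + Huffman combination.
text = ("the quick brown fox jumps over the lazy dog. " * 40).encode("utf-8")
packed = zlib.compress(text, level=9)   # DEFLATE

print(len(text), "->", len(packed))                # large reduction on this input
assert zlib.decompress(packed) == text             # lossless round trip
```

The point being made in the post: once you give up byte alignment anyway, a stock compressor gets the size win without requiring every text-processing layer to learn a new encoding.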
Re: Unicode, SMS and year 2012 - SQU, not UQU
On Sat, 28 Apr 2012 18:55:00 +0100 Richard Wordingham richard.wording...@ntlworld.com wrote: I wrote: With SCSU that avoids Unicode mode and UQU whenever possible, most alphabetic languages work fairly well. I meant: With SCSU that avoids Unicode mode and SQU whenever possible, most alphabetic languages work fairly well. UQU only occurs in Unicode mode, and escapes tag bytes. SQU does not use a window for a character, but passes it as 2 bytes of following data. Of course, an initial byte-order mark may be emitted using SQU; this has only a small impact on performance. Richard.
Re: Unicode, SMS and year 2012
Mark Davis wrote:

> I suspect the punycode goal is to take a wide character set into a restricted character set, without caring much about the resulting string length; if the original string happens to be in a character set other than the target restricted character set, then the string length increases too much to be of interest in the SMS discussion.
>
> That is not correct. One of the chief reasons that punycode was selected was the reduction in size.

But certainly the main motivation behind the development of Punycode, or any of the ACEs (ASCII-Compatible Encodings) that came before it, was to provide a compact encoding given the constraints of the set of characters allowed in domain names. The extensibility of the algorithm to target character sets of different sizes was definitely an advantage.

> Tests with the idnbrowser are not relevant. As I said: in that form, it uses a smaller number of bytes per character, but a parameterization allows use of all byte values. That is, the parameterization of punycode for IDNA is restricted to the 36 IDNA values per byte, thus roughly 5 bits. When you parameterize punycode for a full 8 bits per byte, you get considerably different results.

Not to say this isn't so, but can you point to a tool or site where a user can type a string and see the output with different parameterizations? Pretty much all of the "Convert to Punycode" pages I see are only able to convert to the IDNA target. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
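Not a reparameterizable tool, but for quick experiments Python ships a Punycode codec (the IDNA bootstring parameters, without the xn-- prefix), which at least shows the bootstring behavior: basic code points are copied literally, then a delimiter, then the deltas for the rest:

```python
s = "bücher"
enc = s.encode("punycode")
print(enc)                               # ASCII-only bootstring form, "bcher-..."
assert enc.decode("punycode") == s       # round-trips losslessly
print(len(s.encode("utf-8")), len(enc))  # compare with the UTF-8 size
```

Testing other parameterizations (e.g. a full 8-bit target alphabet) would still require modifying an implementation, as discussed above.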
Re: Fwd: Re: Unicode, SMS and year 2012
anbu at peoplestring dot com wrote:

> This clearly shows that my design yields number of values more than double that of UTF8

I didn't know we were competing against UTF-8 on efficiency. That's easy. UTF-8 is not at all guaranteed to be the most efficient encoding possible, or even reasonably possible. It was originally scoped to be not extravagant in terms of space, while providing other design features like byte boundaries, full ASCII transparency, easy detection, and prefixes that quickly indicate the length of the sequence. It's easy to beat the efficiency of UTF-8 in a byte-based encoding, if many of its other design features are ignored:

0xxxxxxx - encodes U+0000 through U+007F
1xxxxxxx 0xxxxxxx - encodes U+0080 through U+3FFF
1xxxxxxx 1xxxxxxx - encodes U+4000 through U+10FFFF (and onward to 0x1FFFFF)

This is a well-known and freely available technique, sometimes called self-delimiting numeric values (RFC 6256) and sometimes by other names. There are many reasons why a new encoding that is merely more efficient than UTF-8, especially one that sacrifices byte-based processing or other design features, will face a severe uphill battle in trying to displace UTF-8. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
Re: Unicode, SMS and year 2012
I wrote:

> 0xxxxxxx - encodes U+0000 through U+007F
> 1xxxxxxx 0xxxxxxx - encodes U+0080 through U+3FFF
> 1xxxxxxx 1xxxxxxx - encodes U+4000 through U+10FFFF (and onward to 0x1FFFFF)

The last code sequence should be 1xxxxxxx 1xxxxxxx 0xxxxxxx. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
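The corrected scheme is essentially a big-endian base-128 varint, as in the SDNV of RFC 6256: each byte carries 7 payload bits, with the high bit set on every byte except the last. A minimal sketch (function names are mine):

```python
def sdnv_encode(cp: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups, most significant group first."""
    groups = [cp & 0x7F]
    cp >>= 7
    while cp:
        groups.append(cp & 0x7F)
        cp >>= 7
    groups.reverse()
    # Continuation bit (0x80) on all but the final group.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])

def sdnv_decode(data: bytes) -> int:
    value = 0
    for b in data:
        value = (value << 7) | (b & 0x7F)
    return value

# Matches the ranges in the post: 1 byte up to U+007F, 2 bytes up to U+3FFF,
# 3 bytes up to 0x1FFFFF (so all of U+0000..U+10FFFF fits in three bytes).
print([len(sdnv_encode(cp)) for cp in (0x7F, 0x80, 0x3FFF, 0x4000, 0x10FFFF)])
# -> [1, 2, 2, 3, 3]
```

As the post notes, this beats UTF-8's sizes (UTF-8 needs 3 bytes at U+0800 and 4 at U+10000) precisely by giving up UTF-8's self-synchronization and lead-byte length prefixes.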
Re: Fwd: Re: Unicode, SMS and year 2012
On Sat, 28 Apr 2012 13:15:48 -0600, Doug Ewell d...@ewellic.org wrote:

> There are many reasons why a new encoding that is merely more efficient than UTF-8, especially one that sacrifices byte-based processing or other design features, will face a severe uphill battle in trying to displace UTF-8.

What are some of the reasons a new encoding will face?
Fwd: Re: Fwd: Re: Unicode, SMS and year 2012
The question shall read as: What are some of the reasons a new encoding will face challenges?

-------- Original Message -------- Subject: Re: Fwd: Re: Unicode, SMS and year 2012 Date: Sat, 28 Apr 2012 15:32:47 -0400 From: a...@peoplestring.com To: d...@ewellic.org

> There are many reasons why a new encoding that is merely more efficient than UTF-8, especially one that sacrifices byte-based processing or other design features, will face a severe uphill battle in trying to displace UTF-8.

What are some of the reasons a new encoding will face?
Re: Unicode, SMS and year 2012
On Sat, 28 Apr 2012 12:53:17 -0600, Doug Ewell wrote:

> Not to say this isn't so, but can you point to a tool or site where a user can type a string and see the output with different parameterizations? Pretty much all of the "Convert to Punycode" pages I see are only able to convert to the IDNA target.

Not sure when, but I will try to take a look at the code provided here; maybe I can figure out which parameter must be altered in order to do a test: http://phlymail.com/en/downloads/idna/download/ (though I am not very hopeful, PHP not being my primary attraction). Cristi -- Cristian Secară http://www.secarica.ro
Re: Unicode, SMS and year 2012
anbu at peoplestring dot com wrote:

> What are some of the reasons a new encoding will face challenges?

The main challenge to a new encoding is that UTF-8 is already present in numerous applications and operating systems, and that any encoding intended to serve as an alternative, let alone a replacement for UTF-8, must be sufficiently better to justify re-engineering those systems.

Some people are simply opposed to additional encoding schemes. The HTML5 specification explicitly forbids the use of UTF-32, SCSU, and BOCU-1 (while allowing many non-Unicode legacy encodings and quietly mapping others to Windows encodings); one committee member was quoted as saying that other encodings of Unicode waste developer time.

Any encoding that does not align code point boundaries along byte boundaries will be criticized for requiring excessive processing. The argument that I made will be made by others: if it is necessary to process bit-by-bit, one might as well use a general-purpose compression algorithm. It is popular to present gzip as the ideal compression approach, since it is widely available, especially on Linux-type systems, and publicly documented (and not IP-encumbered).

I may have missed some other objections. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
Re: Unicode, SMS and year 2012
On Sat, 28 Apr 2012 12:41:51 -0600, Doug Ewell wrote:

> If I'm going to use a variable-length, non-byte-aligned encoding, where there is no chance of realigning in case of a flipped or dropped bit (which seems to be of great concern to many people), I might as well go ahead and use a Huffman or LZ type of encoding (or a combination, like DEFLATE).

The standard 3GPP TS 23.042 [1] provides a Huffman compression method for SMS, yet it seems to me it needs the language to be known at the time of writing (or at least at the time of effective sending). It also provides per-language predefined dictionaries using code page 850 or 437, but I have not finished reading all the details, so my overview may be distorted. While in theory this standard is promising (and it was issued a long time ago, which is probably why it uses the IBM-like code pages), in practice I am not aware of any implementation (it is certainly not in my device or in the provided PC application). Cristi [1] http://www.3gpp.org/ftp/Specs/html-info/23042.htm -- Cristian Secară http://www.secarica.ro
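The language dependence noted above is inherent to static Huffman schemes like the one in TS 23.042: the code table is derived from an assumed symbol frequency distribution, so a table tuned for one language inflates text in another. A toy sketch of table construction with Python's heapq (the frequencies come from an arbitrary sample, not from the 3GPP tables):

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a prefix code (symbol -> bit string) from a frequency map."""
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

text = "this is an example of a huffman tree"
code = huffman_code(Counter(text))       # table tuned to THIS text's frequencies
bits = "".join(code[c] for c in text)
print(len(text) * 8, "->", len(bits))    # 8-bit chars vs. frequency-tuned bits
```

Encoding a different language with `code` would assign long bit strings (or none at all) to its common symbols, which is exactly why the 3GPP scheme needs per-language dictionaries known to the sender.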
Re: Unicode, SMS and year 2012
Hi Cristian, This is a bit of a deviation from the issues you raise, but it relates to the subject in a different way. The SMS character set does not seem to follow Unicode. How I see Unicode is as a set of character groups: 7-bit, 8-bit (which extends and replaces 7-bit), 16-bit, and CJKV, which uses some sort of 16-bit pairing. As Unicode says, they are just numeric codes assigned to letters or whatever other ideas. It is the task of the devices to decide what they are and show them. You say that there are only two character sets in GSM: 7-bit, which is a reassignment of codes to a selected set of Latin letter shapes, and 16-bit for the rest. It appears as if they decided that a certain set of letters is common to some preferred markets, and that it is efficient to reassign the established Unicode characters to these newly selected letter shapes. Had they simply used the 8-bit ISO-8859-1 set, the number of characters per SMS would have been limited to 140 instead of 160. (Is that why Twitter limits the number of characters to 140?) Of course, that would not have included some users whose letters are 16-bit characters under Unicode. I made a comprehensive transliteration for the Singhala script (Singhala+Sanskrit+Pali). It shows perfectly when 'dressed' with a smartfont. The following are two web sites that illustrate this solution (every character is ISO-8859-1, except for the occasional ZWNJ, which actually should be the 8-bit NBH that somebody decided to leave undefined). Use any browser except IE; IE does not understand OpenType. http://www.lovatasinhala.com (hand coded) http://www.ahangama.com/ (WordPress blog) All Indic languages could be transliterated this way. It makes Indic similar to Latin-based European languages, with intuitive typing and orthographic results, which Unicode Sinhala can't do. It takes about half the bandwidth to transmit compared with the double-byte set.
I just noticed that transliterated Singhala would not be fully covered by SMS 7-bit, because some Unicode 8-bit characters are not in this set. Looking at my iPhone, I see that the International icon brings up key-layout plus font pairs. I think what they should do is separate fonts and key layouts. This way, the user could select the key layout for input and whatever font they want to use to show it. The next thing I am going to say has made many readers here very angry, but may I say it again? The idea of a Last Resort Font that makes basic editors Plain Text is a ploy to brag that the computer can show all the world's languages, most of which you cannot read anyway. Text runs of foreign languages should show as a series of Glyph Not Found characters or the specific hint glyph of a language. The user of a foreign language would know where to download fonts for their native language. In the small market of Singhala, no font is present that goes typographically well with Arial Unicode. There is no incentive or money to make beautiful fonts for a minority language like Singhala. The plain text result for Singhala is ugly. The OS makers unnecessarily made hodge-podge Last Resort Fonts. I hope both the mobile device industry and the PC side separate fonts and characters and allow users to decide the default font sets on their devices. This is eminently rational because the rendering of the font happens locally, whereas the characters travel across the network. This will also help those like me who understand that their language is better served by a transliteration solution than by a convoluted double-byte solution that discourages the natives from using their script. Actually, this is causing bilingual Singhalese to abandon their native language. The government is placing special emphasis on English, as Singhala is terribly difficult to use in the modern setting. This is a grave problem for a society with a near-100% literacy rate and just a few million people.
On Fri, Apr 27, 2012 at 3:06 AM, Cristian Secară or...@secarica.ro wrote: A few years ago there was a discussion here about Unicode and SMS (Subject: Unicode, SMS, PDA/cellphones). Then and now the situation is the same, i.e. an SMS text message that uses characters from the GSM character set can include 160 characters per message (stream of 7 bits × 160), whereas a message that uses anything else can include only 70 characters per message (stream of UCS-2 16 bits × 70). Although my language (Romanian) was and is affected by this discrepancy, at the time I was skeptical about the possibility of improving anything in this area, mostly because back then both the PC and mobile markets suffered from other critical language problems for me (like missing glyphs in fonts, or improper keyboard implementations). Things evolved and now the perspectives are much better. Regarding SMS, at that time Richard Wordingham pointed out that SCSU might be a proper solution for the SMS encoding [when it comes to non-GSM characters]. Recently I studied as many aspects as I could about the SMS standardization, in a step that I started approx a
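The 160-versus-70 split Cristian describes is pure arithmetic over a 140-octet user-data field: 160 × 7 bits = 140 bytes for GSM 7-bit, and 70 × 16 bits = 140 bytes for UCS-2. A minimal sketch of the LSB-first septet packing defined in 3GPP TS 23.038 (the function name is mine):

```python
def pack_septets(septets):
    """Pack 7-bit values into octets, LSB-first as in 3GPP TS 23.038."""
    acc, nbits, out = 0, 0, bytearray()
    for s in septets:
        acc |= (s & 0x7F) << nbits   # append 7 new bits above what we hold
        nbits += 7
        while nbits >= 8:            # emit each completed octet
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:                        # trailing partial octet, zero-padded
        out.append(acc & 0xFF)
    return bytes(out)

# 160 septets fit exactly into the 140-octet SMS payload:
print(len(pack_septets([0x41] * 160)))  # 140
```

The same 140 octets hold only 70 UCS-2 code units, which is the discrepancy the whole thread is about.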
Re: Fwd: Re: Unicode, SMS and year 2012
On Friday, April 27, anbu at peoplestring dot com wrote: In addition I had a few more questions, of which the one below is the most significant: What if one had to send a text in multiple scripts, like in the case of a text and its translation in the same message? I thought maybe a new transition format or a new character encoding would be suitable. As a test, I took the first sentence from Article 1 of the UDHR (an increasingly common benchmark), and used Google Translate to derive the Hindi and Tamil equivalents: All human beings are born free and equal in dignity and rights. सभी मनुष्य स्वतंत्र और गरिमा और अधिकारों में बराबर पैदा होते हैं. எல்லா மனிதர்களும் இலவச மற்றும் கௌரவம் மற்றும் உரிமைகள் சம பிறக்கின்றன. (I don't vouch for the correctness of these translations; if you know Hindi or Tamil and disagree with them, please provide your own.) This is 84 characters from the Basic Latin block (including spaces used in all three languages), 53 from Devanagari, and 62 from Tamil. I encoded the resulting text in SCSU, with each line terminating in CRLF and with the U+FEFF signature (0E FE FF) at the beginning. The Devanagari passage is encoded as one byte per Unicode character, preceded by a single SC4 tag byte to select window 4, which is predefined to the Devanagari block. The Tamil passage is also encoded as one byte per character, preceded by a two-byte SD3 tag to define a window into the Tamil block and select it. The total size of these three lines of text in SCSU, including signature and CRLF, is 211 bytes. That's probably about as good as any non-general-purpose Unicode compression encoding can achieve, and better than most. I'm curious how well Anbu's proprietary encoding will stack up. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
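Doug's Devanagari case can be reproduced with a toy encoder. Per UTS #6, SCSU predefines dynamic window 4 at offset U+0900 (Devanagari), selected by the single tag byte SC4 (0x14); with a window active, bytes 0x80 through 0xFF denote offset+0x00 through offset+0x7F, and ASCII passes through unchanged. The sketch below is deliberately minimal: it handles only ASCII and Devanagari, and ignores the signature, the other windows (including the Tamil window definition), and SCSU's treatment of C0 control bytes.

```python
SC4 = 0x14            # SCSU tag byte: select dynamic window 4
DW4_OFFSET = 0x0900   # window 4 is predefined to the Devanagari block

def scsu_ascii_devanagari(text):
    """Toy SCSU encoder covering ASCII + Devanagari only."""
    out, window4_active = bytearray(), False
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)                      # ASCII is emitted as-is
        elif DW4_OFFSET <= cp <= DW4_OFFSET + 0x7F:
            if not window4_active:
                out.append(SC4)                 # one tag byte, paid once
                window4_active = True
            out.append(cp - DW4_OFFSET + 0x80)  # one byte per character
        else:
            raise ValueError("outside this toy repertoire")
    return bytes(out)

text = "\u0928\u092e\u0938\u094d\u0924\u0947"  # नमस्ते, 6 characters
print(len(scsu_ascii_devanagari(text)), len(text.encode("utf-8")))  # 7 18
```

Six Devanagari characters cost seven bytes, matching the one-tag-byte overhead Doug describes, where UTF-8 needs three bytes per character.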
Re: Unicode, SMS and year 2012
Richard Wordingham wrote: With SCSU that avoids Unicode mode and UQU whenever possible, most alphabetic languages work fairly well. However, extra windows are needed to cover the half-blocks from A480 to ABFF, 15 new codes. If I were being miserly, I wouldn't cover A500-A5FF. In November 2010 I proposed updating the SCSU spec to do just that. (There were a couple of other suggestions in the proposal, all severable.) Reaction to the proposal was not encouraging: http://www.unicode.org/mail-arch/unicode-ml/y2010-m11/0005.html http://www.unicode.org/mail-arch/unicode-ml/y2010-m11/0008.html SCSU doesn't work well with large syllabaries, especially if they include a lot of unused characters within the half-blocks used. Inuit suffers badly from this, but still achieves noticeable compression. I experimented with compressing Yi transposed to a covered range, and found that it achieved something like 10% compression. Yi suffers from needing the 8 dynamic windows to be switched between 10 half-blocks (with occasional excursions to an 11th). If the Yi characters had been arranged by tone first and initial consonant second, 2 of the half-blocks would never have been used in my sample! Medium-sized writing systems such as syllabaries, which span more than one or two 128-blocks and cross among them constantly (not just for isolated characters), have always been the Achilles heel of SCSU. You can't realistically encode something like Canadian Syllabics on its own using 7 bits per character, or even 8. The best hope is to be able to use windows, and hope that window switching can be kept to a minimum. As you noted with Yi, how successful that is depends on character frequency and whether common characters are concentrated in one or two half-blocks, or whether they are scattered. The design goal of SCSU was to encode text about as efficiently as in legacy encodings.
For small alphabetic scripts, the examples were the numerous 8-bit encodings for Latin and Cyrillic and Greek, as well as things like ARMSCII and ISCII. Unicode mode was meant for really large scripts like Han and precomposed Hangul, where 16 bits per character was considered acceptable (and better than UTF-8). The design goal was met, but medium-sized scripts (with no legacy encodings to compete against) didn't fare so well. There is no mechanism in SCSU to encode a character in a non-integral number of bytes, and that's probably good; such a mechanism would have made SCSU, already criticized for its complexity, much more complex. Note that most of the above applies to BOCU-1 as well, for what it's worth. Vai A500-A63F fits in 3 half-blocks, and I would expect non-Vai characters in it to be in static blocks. Given how well Yi performed, I expect Vai to benefit from SCSU. It does benefit by comparison to UTF-8. Addition of window offset bytes to point to this area would help further, but see "not encouraging" above. Has anyone investigated the performance of SCSU with Cuneiform or Egyptian Hieroglyphics? It might achieve better than 50% compression! A fair comparison of Egyptian Hieroglyphics depends on the mark-up used, for Unicode on its own does not enable one to write reasonable Middle Egyptian. If you have realistic samples of text in these scripts that you could send (privately), I could experiment. Most of my samples for experimentation in compression have lately come from the UDHR in Unicode project. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
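The window-switching penalty for medium-sized scripts can be made concrete with a rough cost model. This is my own back-of-the-envelope sketch, not the UTS #6 algorithm: it charges one byte per character, one byte (SCn) to re-select an already defined window, and two bytes (SDn plus an offset byte) to define a new one, and it ignores real SCSU's offset-encoding rules and window-replacement strategy.

```python
def scsu_cost_model(codepoints, num_windows=8):
    """Rough byte-cost model for SCSU single-byte mode (illustrative)."""
    windows, active, cost = [], None, 0
    for cp in codepoints:
        if cp < 0x80:
            cost += 1                 # ASCII passes through
            continue
        block = cp & ~0x7F            # enclosing 128-code-point half-block
        if block != active:
            if block in windows:
                cost += 1             # SCn: re-select a defined window
            else:
                cost += 2             # SDn + offset byte: define a window
                windows.append(block)
                if len(windows) > num_windows:
                    windows.pop(0)    # crude oldest-first replacement
            active = block
        cost += 1                     # the character byte itself
    return cost

# Ten Yi-range characters in one half-block vs. alternating between two:
one_block = [0xA000 + i for i in range(10)]
two_blocks = [0xA000, 0xA080] * 5
print(scsu_cost_model(one_block), scsu_cost_model(two_blocks))  # 12 22
```

Ten characters inside one half-block cost 12 bytes, while the same ten characters alternating between two half-blocks cost 22: the kind of penalty that Yi-style interleaving incurs.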
Fwd: Re: Unicode, SMS and year 2012
Please note the following corrections to the mail below: The number of codes supported with a given number of bits, n, is given by: [2 ^ (n ÷ 2)] [n - 4] The total number of codes supported with a given number of bits, n, and all the numbers of bits less than it is given by: 3 [2 ^ (n ÷ 2)] [n - 4] - 64 Original Message Subject: Re: Unicode, SMS and year 2012 Date: Fri, 27 Apr 2012 07:54:41 -0400 From: a...@peoplestring.com To: or...@secarica.ro, unicode@unicode.org Hi! I also had the same questions. In addition I had a few more questions, of which the one below is the most significant: What if one had to send a text in multiple scripts, like in the case of a text and its translation in the same message? I thought maybe a new transition format or a new character encoding would be suitable. I am currently working on a new form of representation. This is how it goes: All the characters of the block C0 Controls and Basic Latin are included, with their design unaltered, that is, they are encoded in eight bits (including the initial zero), given by 0xxxxxxx. All the other codes would surely be designed greater than eight bits. I was assuming the design, given by the following EBNF, would help: 1(0|1){1(0|1)}(0|1)(0|1){0(0|1)}0(0|1)1(0|1) Please note that this design produces codes whose numbers of bits are even numbers greater than eight. That is, 10, 12, 14, 16, 18, 20, 22, ... and so on. The number of codes supported with a given number of bits, n, is given by: [2 ^ (n ÷ 2)] [n - 4] The total number of codes supported with a given number of bits, n, and all the numbers of bits less than it is given by: 3 [2 ^ (n ÷ 2) - 1] [n - 4] + 74 Please note that the sign '^' represents 'raised to the power of', just as in most computer applications. Further, note that this design is still under development, so it may be subject to minor corrections.
I chose to design codes whose numbers of bits are even numbers only, rather than all integers, so that in the event of corruption of a byte, let's say due to network failure, somewhere between other bytes that conform to this standard, only the part with the corrupt byte and a few consecutive bytes would be affected, making the effect of the byte loss minimal. All the information given above in this mail are my intellectual property and my concern is to be sought before using them for any purpose. Regards, Anbu Kaveeswarar Selvaraju On Fri, 27 Apr 2012 11:06:23 +0300, Cristian Secară or...@secarica.ro wrote: A few years ago there was a discussion here about Unicode and SMS (Subject: Unicode, SMS, PDA/cellphones). Then and now the situation is the same, i.e. an SMS text message that uses characters from the GSM character set can include 160 characters per message (stream of 7 bits × 160), whereas a message that uses anything else can include only 70 characters per message (stream of UCS-2 16 bits × 70). Although my language (Romanian) was and is affected by this discrepancy, at the time I was skeptical about the possibility of improving anything in this area, mostly because back then both the PC and mobile markets suffered from other critical language problems for me (like missing glyphs in fonts, or improper keyboard implementations). Things evolved and now the perspectives are much better. Regarding SMS, at that time Richard Wordingham pointed out that SCSU might be a proper solution for the SMS encoding [when it comes to non-GSM characters]. Recently I studied as many aspects as I could about the SMS standardization, in a step that I started approx a year ago regarding the SMS language discrimination caused simply by the difference in message length and cost for the same sentence written with diacritical marks (correctly for that language) or without diacritical marks (incorrectly for that language).
Or, for the same reason, language discrimination between (say) a French message and (say) a Romanian message, both written correctly. It turned out that they (ETSI and its groups) created a way to solve the 70-character limitation, namely the “National Language Single Shift” and “National Language Locking Shift” mechanisms. This is described in the 3GPP TS 23.038 standard and has been included since release 8. In short, it is a character substitution table, applied per character or per message, defined per language. Personally I find this to be a stone-age-like approach, which in my opinion does not work at all if I enter the message from my PC keyboard via the phone's PC application (because the language cannot always be predicted, mainly if I am using dead keys). It is true that the actual SMS stream limit is not very generous, but I wonder if SCSU would have been a better approach in terms of i18n. I also don't know whether SCSU requires a language to be declared beforehand, or whether it simply guesses the required window for each character by itself. Apparently the SCSU seems to be ok
Fwd: Re: Unicode, SMS and year 2012
Further correction: I was assuming the design, given by the following EBNF, would help: 1(0|1){1(0|1)}(0|1)(0|1)(0|1)(0|1){0(0|1)}0(0|1)1(0|1) The number of codes supported with a given number of bits (greater than eight bits), n, is given by: [2 ^ (n ÷ 2)] [n - 4] Original Message Subject: Fwd: Re: Unicode, SMS and year 2012 Date: Fri, 27 Apr 2012 08:14:13 -0400 From: a...@peoplestring.com To: or...@secarica.ro, unicode@unicode.org [...]
Re: Unicode, SMS and year 2012
Cristian Secară orice at secarica dot ro wrote: It turned out that they (ETSI and its groups) created a way to solve the 70-character limitation, namely the “National Language Single Shift” and “National Language Locking Shift” mechanisms. This is described in the 3GPP TS 23.038 standard and has been included since release 8. In short, it is a character substitution table, applied per character or per message, defined per language. Personally I find this to be a stone-age-like approach, which in my opinion does not work at all if I enter the message from my PC keyboard via the phone's PC application (because the language cannot always be predicted, mainly if I am using dead keys). It is true that the actual SMS stream limit is not very generous, but I wonder if SCSU would have been a better approach in terms of i18n. I also don't know whether SCSU requires a language to be declared beforehand, or whether it simply guesses the required window for each character by itself. I agree that treating character repertoire as simply a matter of language selection, and creating language-specific code pages, is a backward-looking solution. Not only is language tagging not always an option, as Cristian points out, but people don't want to be tied to the absolute minimum character repertoire that someone decided was necessary to write a given language, even in a text message. Just look at the rise of emoji in text messages. And, of course, I agree that SCSU would have been a much better solution. Most of the current arguments against SCSU wouldn't apply to SMS: the cross-site scripting argument wouldn't apply if SCSU were the only extended encoding, or if the protocol tagged it, and the complex-encoder argument wouldn't apply to any phone from the last 5 years that can take pictures and shoot videos and scan bar codes and run numerous apps simultaneously. (SCSU doesn't require a complex encoder anyway, although it can benefit incrementally from one.)
Interestingly, one of the first mentions I can find on the Unicode list of SCSU-like compression — actually a description of RCSU, the predecessor to SCSU — was in the context of SMS message compression: http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML001/0242.html Neither RCSU nor SCSU quite fits the original bill, which was to represent Unicode in 7 bits per character (with some overhead) and thus achieve 160 characters per message. Both schemes use 8-bit code units. Still, 140 characters is much better than 70. Apparently the SCSU seems to be ok for my language, or Hungarian, or Bulgarian, etc., but is this ok also for non-Latin and non-Cyrillic scripts? This versus the language shift mechanism, which is still 7-bit. Release 10 of that standard includes language locking shift tables for Turkish, Portuguese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Oriya, Punjabi, Tamil, Telugu and Urdu. SCSU works equally well, or almost so, with any text sample where the non-ASCII characters fit into a single block of 128 code points. For anything other than Latin-1 you need one byte of overhead, to switch to another window, and for many scripts you need two, to define a window and switch to it. But again, two bytes is not what's holding anyone up. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
Re: Unicode, SMS and year 2012
Actually, if the goal is to get as many characters in as possible, Punycode might be the best solution. That is the encoding used for internationalized domains. In that form, it uses a smaller number of bytes per character, but a parameterization allows use of all byte values. -- Mark https://plus.google.com/114199149796022210033 — Il meglio è l’inimico del bene — On Fri, Apr 27, 2012 at 11:21, Doug Ewell d...@ewellic.org wrote: [...]
RE: Unicode, SMS and year 2012
Mark Davis mark at macchiato dot com wrote: Actually, if the goal is to get as many characters in as possible, Punycode might be the best solution. That is the encoding used for internationalized domains. In that form, it uses a smaller number of bytes per character, but a parameterization allows use of all byte values. That might work well if the goal is to find a compact encoding to 7-bit code units, and then express 8 such code units in 7 bytes. It would certainly be more economical than UTF-7-over-7, which is fine for ASCII and awful for anything else. I don't usually think of Punycode as an ideal general-purpose compression encoding, especially with lines of arbitrary length or consisting primarily of non-ASCII content (Cristian's example), but it's certainly worth experimenting with. One advantage might be that encoders and decoders for Punycode already exist, probably in greater numbers than for SCSU. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
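Doug's last point is easy to check: Python has long shipped a Punycode codec implementing RFC 3492 (the base-36 IDNA parameterization Mark mentioned, not an 8-bit one). A quick size comparison, as a sketch:

```python
# Python's built-in punycode codec uses the base-36 IDNA parameterization,
# so each output byte carries only about 5 bits.
text = "παράδειγμα"  # "example" in Greek, 10 characters

puny = text.encode("punycode")
utf8 = text.encode("utf-8")

print(puny)                  # b'hxajbheg2az3al'
print(len(puny), len(utf8))  # 14 20: shorter than UTF-8 even at base 36
```

Even restricted to 36 output values per byte, the delta coding beats UTF-8's two bytes per Greek character here; an 8-bit parameterization would do considerably better still.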
Re: Unicode, SMS and year 2012
Anbu Kaveeswarar Selvaraju anbu at peoplestring dot com wrote: What if one had to send a text in multiple scripts, like in the case of a text and its translation in the same message? I thought maybe a new transition format or a new character encoding would be suitable. I am currently working on a new form of representation. This is how it goes: I don't see how this is better than SCSU. Perhaps if you can provide some examples of text strings and how they would be represented in your encoding, we can judge. On the other hand... All the information given above in this mail are my intellectual property and my concern is to be sought before using them for any purpose. Never mind. Not interested. If I wanted a compression encoding that was encumbered with IP restrictions, I'd choose BOCU-1. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
Re: Unicode, SMS and year 2012
On Fri, 27 Apr 2012 12:26:25 -0700, Mark Davis ☕ wrote: Actually, if the goal is to get as many characters in as possible, Punycode might be the best solution. That is the encoding used for internationalized domains. In that form, it uses a smaller number of bytes per character, but a parameterization allows use of all byte values. I suspect the Punycode goal is to take a wide character set into a restricted character set, without caring much about the resulting string length; if the original string happens to be in a character set other than the target restricted character set, then the string length increases too much to be of interest in the SMS discussion. Just do a test: write something in a non-Latin alphabetic script into this page here http://demo.icu-project.org/icu-bin/idnbrowser Cristi -- Cristian Secară http://www.secarica.ro
Re: Unicode, SMS and year 2012
Hi On 2012/04/28 00:23, a...@peoplestring.com wrote: 1. let 'x' be the position of a code positioned at an odd number, e.g. when we take the code '1001010110', the first '1' is positioned at location '1' (so an odd number), the first '0' is positioned at location '2' (not an odd number), the next '0' is positioned at location '3' (an odd number) and so on. 2. the program takes into memory all the bits till it reaches the end (whether they are at position 'x' or not), till it has reached the end 3. the program checks each consecutive bit at position 'x'. 4. The program finds the end by the theory 'The bit before the last bit of the code is reached if and only if the bit value at 'x' has changed twice'. Changing twice is that the bit value must change from the initial '1' to '0', then back to '1'. The last bit is immediately after the '1' at position 'x', which in turn itself comes after a '0' at position 'x'. 5. Here we find this doesn't need much or complicated arithmetic. Simple logic is enough. You stated that in a far more complicated way than necessary... From what I understand from your description: * Read data as a string of bits. How data is transformed to this string is undefined, which is a problem. * Code words starting with an initial 0 code literal 7-bit ASCII values, which follow the initial zero bit: 0MXXXXXL, where M and L are the MSB and LSB of the respective ASCII value. * Code words starting with an initial 1 code variable-length values, which are magically created. Read N bits until a 1 bit is encountered (inclusive) on an even position within the bit string (where the position of the initial code word bit is 0), following a 0 bit on an even position. The complete word is N+2 bits long, including the initial 1 bit. Also, I wonder how efficiently your encoding can code general texts... Seeing as how your 10-bit codes can only code 192 out of 512 possible values, 12-bit codes only 512 out of 2048 values, and so on... 
This means you will have a massive amount of bits for rare-ish characters sooner or later... Regards, Robert
Re: Unicode, SMS and year 2012
That is not correct. One of the chief reasons that Punycode was selected was the reduction in size. Tests with the idnbrowser are not relevant. As I said:

> In that form, it uses a smaller number of bytes per character, but a
> parameterization allows use of all byte values.

That is, the parameterization of Punycode for IDNA is restricted to the 36 IDNA values per byte, thus roughly 5 bits. When you parameterize Punycode for a full 8 bits per byte, you get considerably different results.

--
Mark
https://plus.google.com/114199149796022210033
— Il meglio è l’inimico del bene — (The best is the enemy of the good)

2012/4/27 Cristian Secară or...@secarica.ro:

> I suspect the punycode goal is to take a wide character set into a
> restricted character set, without caring much on resulting string
> length; if the original string happens to be in other character set
> than the target restricted character set, then the string length
> increases too much to be of interest in the SMS discussion.
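[The "roughly 5 bits" figure above is just the base-2 logarithm of the output alphabet size; a quick sketch of the arithmetic, not part of the original message:]

```python
import math

# Punycode's output alphabet under the IDNA parameterization is the 36
# characters a-z plus 0-9, so each output character carries log2(36) bits
# of information. A parameterization over all 256 byte values would carry
# 8 bits per output byte instead.
idna_bits = math.log2(36)
full_bits = math.log2(256)

print(round(idna_bits, 2))               # 5.17
print(round(full_bits / idna_bits, 2))   # 1.55 -- ~55% more per symbol
```

This is why the same Punycode algorithm, re-parameterized for full bytes, produces noticeably shorter output than the IDNA form tested in the idnbrowser.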
Re: Unicode, SMS and year 2012
On 2012/04/28 4:26, Mark Davis ☕ wrote:

> Actually, if the goal is to get as many characters in as possible,
> Punycode might be the best solution. That is the encoding used for
> internationalized domains. In that form, it uses a smaller number of
> bytes per character, but a parameterization allows use of all byte
> values.

Because Punycode encodes differences between character numbers, not the character numbers themselves, it can indeed be quite efficient, in particular if the characters used are tightly packed (e.g. Greek, Hebrew, ...). For languages written in Latin script with accented characters, the question is how close these accented characters are to each other in Unicode.

However, Punycode also codes character positions. Because of this, it gets less efficient for longer texts. [Because Punycode uses (circular) position differences rather than simple positions, this contribution is limited by the (rounded-up binary logarithm of the) weighted average distance between two occurrences of the same character in the text/language.]

My guess is therefore that Punycode won't necessarily be super-efficient for texts in the 100+ character range. It's difficult to test quickly, because the Punycode converters on the Web limit the output to 63 characters, the maximum length of a label in a domain name.

Regards, Martin.
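[One way around the 63-character limit of web converters: Python's built-in `punycode` codec implements RFC 3492 with the IDNA parameters and has no length cap, so longer texts can be tested directly. A sketch; the Greek sample sentence is arbitrary:]

```python
# Compare Punycode output length with UTF-8 length as the text grows,
# using Python's stdlib RFC 3492 "punycode" codec (IDNA parameters).
greek = "Η ταχεία καφετιά αλεπού πηδάει πάνω από τον τεμπέλη σκύλο. "

for repeats in (1, 2, 4):
    text = greek * repeats
    puny = text.encode("punycode")
    assert puny.decode("punycode") == text  # the codec round-trips
    print(len(text), len(puny), len(text.encode("utf-8")))
```

The printed triples (characters, Punycode bytes, UTF-8 bytes) let one check Martin's guess about how the position-coding overhead behaves as texts get longer.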
Re: Unicode, SMS and year 2012
On 2012/04/28 7:29, Cristian Secară wrote:

> I suspect the punycode goal is to take a wide character set into a
> restricted character set, without caring much on resulting string
> length; if the original string happens to be in other character set
> than the target restricted character set, then the string length
> increases too much to be of interest in the SMS discussion.

Not exactly. Compression was very much a goal when designing Punycode. It won against a number of other algorithms as the choice for IDNs, and it is clearly very good for that purpose.

> Just do a test: write something in a non-Latin alphabetic script into
> this page here http://demo.icu-project.org/icu-bin/idnbrowser

Well, as a silly example, what about ααα… (that's 57 α characters)? The result is xn--mxa…, which is 63 characters long.

Regards, Martin.
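[Martin's 63-character figure can be reproduced with Python's stdlib `punycode` codec; the `xn--` ACE prefix is added by hand, since the raw codec omits it. A sketch:]

```python
# The 57-α example: raw Punycode of "α" * 57 is "mxa" (the first α)
# followed by 56 "a" digits (each repeated α is a zero delta), and
# prepending the IDNA ACE prefix "xn--" gives a 63-character label.
label = ("α" * 57).encode("punycode").decode("ascii")
ace = "xn--" + label

print(ace[:10] + "...")   # xn--mxaaaa...
print(len(ace))           # 63
assert label == "mxa" + "a" * 56
assert len(ace) == 63
```

The run of single-digit "a"s is exactly the delta coding Martin describes: identical adjacent characters cost one output character each.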
Re: Unicode, SMS and year 2012
On Fri, 27 Apr 2012 17:28:13 -0700, Mark Davis ☕ wrote:

> That is not correct. One of the chief reasons that punycode was
> selected was the reduction in size. Tests with the idnbrowser are not
> relevant. As I said: In that form, it uses a smaller number of bytes
> per character, but a parameterization allows use of all byte values.

Sorry, I didn't understand your point about using all byte values right from the start. Will think about it.

Cristi
--
Cristian Secară
http://www.secarica.ro
Re: Unicode, SMS and year 2012
On 2012/04/27 17:06, Cristian Secară wrote:

> It turned out that they (ETSI and its groups) created a way to solve
> the 70-character limitation, namely the “National Language Single
> Shift” and “National Language Locking Shift” mechanism. This is
> described in the 3GPP TS 23.038 standard and was introduced in
> release 8. In short, it is about a character substitution table, per
> character or per message, defined per language. Personally I find this
> to be a stone-age-like approach,

Fully agreed.

> which in my opinion does not work at all if I enter the message from
> my PC keyboard via the phone's PC application (because the language
> cannot always be predicted, mainly if I am using dead keys). It is
> true that the actual SMS stream limit is not very generous, but I
> wonder whether SCSU would have been a better approach in terms of
> i18n. I also don't know whether SCSU requires a language to be
> declared in advance, or simply guesses the required window for each
> character by itself.

The right approach in this case isn't to discuss clever compression techniques (I've indulged in this in my other mails, too, sorry), but to realize that the underlying mobile/wireless technology has advanced a lot. SMSes are simply a relic of outdated technology, sold at a horrendous price. For more information, see e.g. http://mobile.slashdot.org/comments.pl?sid=433536&cid=22219254 or http://gthing.net/the-true-price-of-sms-messages. That's even for the case of pure ASCII messages. The solution is simply to stop using SMSes and upgrade to a better technology.

Regards, Martin.