Re: Unicode, SMS and year 2012
Hello everyone: The discussion threads with the subjects "Unicode, SMS and year 2012" and "ece" are now closed. We have received some complaints about intellectual property concerns and assertions of IP that were raised in these threads. All messages in the affected threads have been expunged from the mailing list archives. We apologize for any inconvenience this may cause. Regards from your -- Sarasvati
Re: Unicode, SMS and year 2012
On Sat, Apr 28, 2012 at 6:22 PM, Naena Guru naenag...@gmail.com wrote:

> How I see Unicode is as a set of character groups: 7-bit, 8-bit (extends and replaces 7-bit), 16-bit, and CJKV, which uses some sort of 16-bit pairing.

That's one lens to see Unicode through, but in most cases it's substantially distorting. Unicode is a set of 1,112,064 characters, divided up into a flat section of 55,296 characters, a break of 2,048 non-characters, and then another 1,056,768 characters. There are a number of other ways to view it, but there's no guarantee that U+0370 won't be filled with an Egyptian hieroglyph, and any view of Unicode that assumes it won't is thus not a correct view.

> As Unicode says, they are just numeric codes assigned to letters or whatever other ideas. It is the task of the devices to decide what they are and show them.

That is the concept of a character encoding. It has continued to exist since the first days of computing because plain text seems to encode something important and distinct from higher levels.

> It shows perfectly when 'dressed' with a smartfont.

Except in IE, one of the most common browsers on the market. Except to anyone using a screen reader.

> It takes about half the bandwidth to transmit compared to the double-byte set.

Who cares? SMS's restrictions are not technical ones. G.711, the most common digital compression for telephony, uses 8 kB per second.* One byte per character or two, that's faster than you can type. Outside telephony, plain text is trivial; long novels, like Dracula, come in at under a MB, and download instantaneously for me--partially because it's automatically gzipped down to 330 KB. Even on not-so-good connections, the time taken to download a full novel is nowhere near the time needed to read it, is always a fraction of the time needed to download a song, and is less than 1% of the time needed to download a TV show. http://www.lovatasinhala.com/ is 4 kB of text and 8 kB of images.
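The split of the code space quoted above can be checked directly. A minimal sketch (this only verifies the arithmetic, using the standard Unicode code-space boundaries):

```python
# Unicode code space: U+0000..U+10FFFF, minus the surrogate range U+D800..U+DFFF,
# which can never appear in interchange.
flat_section = 0xD800              # U+0000..U+D7FF  -> 55,296 scalar values
surrogate_gap = 0xE000 - 0xD800    # U+D800..U+DFFF  -> 2,048 excluded code points
remainder = 0x110000 - 0xE000      # U+E000..U+10FFFF -> 1,056,768 scalar values

print(flat_section, surrogate_gap, remainder, flat_section + remainder)
# -> 55296 2048 1056768 1112064
```

The two usable sections sum to the 1,112,064 scalar values cited in the post.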
The costs you're trying to impose on everyone to save 4 kB just aren't worth it, especially as you're sending 177 kB of font to avoid it.

* Before anyone starts to mention kb = kilobytes: yes, 64 kilobits/sec = 8 kilobytes/sec.

> In the small market of Singhala, no font is present that goes typographically well with Arial Unicode. There is no incentive or money to make beautiful fonts for a minority language like Singhala.

I'm sorry; unfortunately, that's what's known as a Hard Problem. There is nothing any character encoding can do about that.

> I hope both the mobile device industry and the PC side separate fonts and characters and allow the users to decide the default font sets in their devices.

It'd be nice, but that doesn't have much to do with Unicode.

> This is eminently rational because the rendering of the font happens locally, whereas the characters travel across the network.

I don't see the connection. The font is almost always local, whether or not it's user-selectable.

> This will also help those who, like me, understand that their language is better served by a transliteration solution than by a convoluted double-byte solution that discourages the natives from using their script.

I see no evidence that using an industry-standard solution that treats all scripts equally discourages people from using the script. I do think that "Please get a browser that keeps with times" discourages people. -- Kie ekzistas vivo, ekzistas espero.
Re: Unicode, SMS and year 2012
Dracula and other novels aside, there are applications where text volume definitely matters. One I've come across in my work is transaction-log filtering. Logs, like HTTP logs, can generate rather interesting streams of text data, where the volume easily becomes so large that merely attempting to convert between character encoding forms can become prohibitively costly in a given implementation. E-mail and novels may be produced and consumed at human-limited rates, but the same is not true for all data streams of text or text-like data. Just something to keep in mind, A./
Re: Unicode, SMS and year 2012
While the authors of HTML5 brought good reasons to ignore SCSU or BOCU-1, having excluded UTF-32, the most direct, one-to-one mapping of Unicode code points to byte values, seems shortsighted. We are talking about the whole of Unicode, not just the BMP. /Sz

On Sat, Apr 28, 2012 at 21:48, Doug Ewell d...@ewellic.org wrote:

> Some people are simply opposed to additional encoding schemes. The HTML5 specification explicitly forbids the use of UTF-32, SCSU, and BOCU-1 (while allowing many non-Unicode legacy encodings and quietly mapping others to Windows encodings); one committee member was quoted as saying that other encodings of Unicode waste developer time.
Re: Unicode, SMS and year 2012
On 04/28/2012 07:54 AM, a...@peoplestring.com wrote:

> I apologise for my poor explanation. I further assure you, the codes are not magically created; they are created by the EBNF below. I regenerated the EBNF to make myself as clear as possible; in fact, now there are two:
> 1(0|1){1(0|1)}{0(0|1)}0(0|1)1(0|1)
> 1(0|1){0(0|1)}{1(0|1)}1(0|1)0(0|1)

These oft-repeated incomprehensible strings of symbols would be a whole lot more intuitively understandable if, say, you were to use a _different_ symbol for either 0 or 1 and not (0|1) (and maybe some spaces to split it up for the eye), and/or there were an actual *explanation* of what they meant, as in:

1 X {1X}... {0X}... 0 X 1 X
1 X {0X}... {1X}... 1 X 0 X

and words like "The bits in odd-numbered positions [counting from zero] can be either value and hold the data being transferred; in the even-numbered positions, the first [zeroth] bit is 1, followed either by a string of 1s, then 0s, ending with 0 1; or else a string of 0s, then 1s, ending with 1 0." Or something like that, maybe done better. My eyes glaze over at the sight of what looks like a random selection out of [{}10|()]*, and I'm probably not the only one. ~mark
Re: Unicode, SMS and year 2012
On 04/29/2012 12:38 PM, a...@peoplestring.com wrote:

> Hi! I have noticed that I created the previous definitions in a hurry, to answer the question raised as quickly as possible. They are incomplete. I used EBNF notation to express my encoding. Please refer to Wikipedia (especially the 'Table of symbols') or other sources on EBNF: http://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form#Table_of_symbols I am creating a well-defined one.

Yes, I know about EBNF notation. I didn't say it was wrong. I just said it would be a lot easier to follow and understand. ~mark
Re: Unicode, SMS and year 2012
Szelp, A. Sz. wrote:

>> Some people are simply opposed to additional encoding schemes. The HTML5 specification explicitly forbids the use of UTF-32, SCSU, and BOCU-1 (while allowing many non-Unicode legacy encodings and quietly mapping others to Windows encodings); one committee member was quoted as saying that other encodings of Unicode waste developer time.

> While the authors of HTML5 brought good reasons to ignore SCSU or BOCU-1, having excluded UTF-32, the most direct, one-to-one mapping of Unicode code points to byte values, seems shortsighted. We are talking about the whole of Unicode, not just the BMP.

All UTFs (8, 16, 32) can represent all of Unicode, as can SCSU. The only Unicode encoding that can represent only the BMP is UCS-2, which AFAIK is no longer endorsed by the UTC. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
Re: Unicode, SMS and year 2012
On 2012/04/29 18:58, Szelp, A. Sz. wrote:

> While the authors of HTML5 brought good reasons to ignore SCSU or BOCU-1, having excluded UTF-32, the most direct, one-to-one mapping of Unicode code points to byte values, seems shortsighted.

Well, except that it's hopelessly inefficient, and therefore essentially nobody is using it.

> We are talking about the whole of Unicode, not just the BMP.

Yes. For transmission, use UTF-8 (or maybe UTF-16). Regards, Martin.
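Martin's efficiency point is easy to quantify. A small sketch comparing the three encoding forms on a short ASCII-only string (the sample text is arbitrary):

```python
# For ASCII-heavy text, UTF-8 uses 1 byte per character, UTF-16 uses 2,
# and UTF-32 a fixed 4 bytes per code point.
s = "Unicode, SMS and year 2012"
sizes = {codec: len(s.encode(codec)) for codec in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)  # -> {'utf-8': 26, 'utf-16-le': 52, 'utf-32-le': 104}
```

UTF-32 quadruples the size of ASCII text for no gain in expressiveness, which is the usual argument against it as a transmission format.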
Re: Unicode, SMS and year 2012
On Sat, 28 Apr 2012 01:46:58 +0200, Robert Abel freak...@googlemail.com wrote:

> How data is transformed to this string is undefined, which is a problem.

As mentioned in the mail, just like UTF-8 is pre-installed in most systems, this design would also be pre-installed in the systems intending to use it. The example given above does not exist anywhere. One needs to come up with the correct mapping based on frequency of use; let's say all ANSI characters not encoded in the eight bits would be encoded in 10 bits (instead of the 16 bits of UTF-8), all the Cyrillic characters would be encoded in either 10 or 12 bits (instead of the 16 bits of UTF-8), all the Tamil characters would be assigned 18 bits (instead of the 24 bits of UTF-8), and so on. The above are possibilities. We assign each character of the latter and former scripts to a code point in their specified range (please note that this is not yet done and possibly not the best; the example in the previous mail is just a random assumption for conceptualisation, not based on any theory). We generate a mapping something like this. If we go by assigning all ANSI, then Cyrillic, then the next most suitable, and so on, most of the population would be covered.

> Code words starting with an initial 1 code variable-length values, which are magically created.

As noted above, they are not going to be magically created (once the design is complete); codes from this design need to be predefined to characters. Please note that this encoding is a work in progress, so I am still working on ways to assign the generated codes to the characters. Maybe after I have completed that, you may get a clearer picture of what I want to do.

> * Code words starting with an initial 0 code literal 7-bit ASCII values, which follow the initial zero bit: 0MXX XXXL, where M and L are the MSB and LSB of the respective ASCII value.

Thanks! This is what I wanted to suggest here. No correction to this.

> Code words starting with an initial 1 code variable-length values, which are magically created. Read N bits until a 1 bit is encountered (inclusive) on an even position within the bit string (where the position of the initial code word bit is 0) following a 0 bit on an even position. The complete word is N+2 bits long, including the initial 1 bit.

I apologise for my poor explanation. I further assure you, the codes are not magically created; they are created by the EBNF below. I regenerated the EBNF to make myself as clear as possible; in fact, now there are two:

1(0|1){1(0|1)}{0(0|1)}0(0|1)1(0|1)
1(0|1){0(0|1)}{1(0|1)}1(0|1)0(0|1)

All the codes produced, and only the codes produced, by either of the EBNFs are valid. That is to say, a code produced independently from the first EBNF is valid; similarly, a code produced independently by the second EBNF is also valid. There is one constraint on these EBNFs: at any given point, the code (sentence) produced must always be greater than 8 bits. That is, repeat any of the parts inside the curly braces {} until the code is at least 10 bits.

* Code words starting with an initial 1 code variable-length values, which are created from either of the above EBNFs. Read N bits until a 1 bit is encountered (inclusive) on an even position within the bit string (where the position of the initial code word bit is 0) following a 0 bit on an even position [this statement is correct and valid only if the bit in the third position (position 2, an even position) is a 1 bit]. If the bit in the third position (position 2, an even position) is a 0 bit, then read N bits until a 0 bit is encountered (inclusive) on an even position within the bit string following a 1 bit on an even position other than the first position (position 0). The complete word is N+2 bits long, including the initial 1 bit.

> Also, I wonder how efficiently your encoding can code general texts... Seeing as how your 10-bit codes can only code 192 out of 512 possible values, 12-bit codes only 512 out of 2048 values, and so on... This means you will have a massive amount of bits for rare-ish characters sooner or later...

As for the number of possible values, you are underestimating the future codes. The number of characters at (and the total number of characters up to) 8 bits is 128 values. The actual formula for the number of values at exactly a given width, for widths greater than 8 bits, is: (number of bits - 4) x 2^(number of bits / 2)

8 bits - 128 values (cumulative: 128 values)
10 bits - 192 values (cumulative: 320 values)
12 bits - 512 values (cumulative: 704 values)
14 bits - 1280 values (cumulative: 1792 values)
16 bits - 3072 values (cumulative: 4352; this is double what UTF-8 provides = 128 (Basic Latin) + 1024 (all the 16-bit codes of UTF-8 count to this))

Thank you for your time. Please contact me if you need more clarification; I am always willing to clarify on this. Regards, Anbu
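The two EBNF productions and the per-width counts under discussion can be checked mechanically. A sketch (my own translation of the EBNF into regular expressions; treat it as one reading of the grammar, not a reference implementation):

```python
import re

# EBNF 1: 1(0|1){1(0|1)}{0(0|1)}0(0|1)1(0|1)
# EBNF 2: 1(0|1){0(0|1)}{1(0|1)}1(0|1)0(0|1)
ebnf1 = re.compile(r"1[01](?:1[01])*(?:0[01])*0[01]1[01]")
ebnf2 = re.compile(r"1[01](?:0[01])*(?:1[01])*1[01]0[01]")

def count(nbits):
    """Count the nbits-long words generated by either grammar."""
    total = 0
    for v in range(2 ** (nbits - 1), 2 ** nbits):  # all words with a leading 1 bit
        w = format(v, "b")
        if ebnf1.fullmatch(w) or ebnf2.fullmatch(w):
            total += 1
    return total

print(count(10), count(12))  # -> 192 512
```

Brute-force enumeration reproduces the 192 (10-bit) and 512 (12-bit) figures quoted in the thread, and the two grammars turn out to be disjoint (grammar 1 ends its even-position bits in 1, grammar 2 in 0), so nothing is double-counted.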
Fwd: Re: Unicode, SMS and year 2012
Please note some corrections and additions in the comparison of values. My design provides the following number of values for the specified number of bits:

8 bits - 128 values (cumulative: 128 values)
10 bits - 192 values (cumulative: 320 values)
12 bits - 512 values (cumulative: 832 values)
14 bits - 1280 values (cumulative: 2112 values)
16 bits - 3072 values (cumulative: 5184 values)
Note: UTF-8 has 2048 values of 16 bits (cumulative: 2176). This clearly shows that my design yields more than double the number of values of UTF-8.
18 bits - 7168 values (cumulative: 12352 values)

and so on. At any given number of bits, my design yields more (with the sole exception of 48 bits/6 bytes, where UTF-8 yields more values than my design; but at the immediately next possible width, 50 bits, my design resumes its trajectory of having more values than UTF-8). Another advantage is that my design increments progressively by two bits. Please refer to the attached spreadsheet for more comparison of values.

-------- Original Message -------- Subject: Re: Unicode, SMS and year 2012 Date: Sat, 28 Apr 2012 07:54:02 -0400 From: a...@peoplestring.com To: freak...@googlemail.com

> How data is transformed to this string is undefined, which is a problem.

As mentioned in the mail, just like UTF-8 is pre-installed in most systems, this design would also be pre-installed in the systems intending to use it. The example given above does not exist anywhere. One needs to come up with the correct mapping based on frequency of use; let's say all ANSI characters not encoded in the eight bits would be encoded in 10 bits (instead of the 16 bits of UTF-8), all the Cyrillic characters would be encoded in either 10 or 12 bits (instead of the 16 bits of UTF-8), all the Tamil characters would be assigned 18 bits (instead of the 24 bits of UTF-8), and so on. The above are possibilities.
Re: Unicode, SMS and year 2012
On Fri, 27 Apr 2012 11:21:05 -0700, Doug Ewell d...@ewellic.org wrote:

> SCSU works equally well, or almost so, with any text sample where the non-ASCII characters fit into a single block of 128 code points. For anything other than Latin-1 you need one byte of overhead, to switch to another window, and for many scripts you need two, to define a window and switch to it. But again, two bytes is not what's holding anyone up.

With SCSU that avoids Unicode mode and UQU whenever possible, most alphabetic languages work fairly well. However, extra windows are needed to cover the half-blocks from A480 to ABFF: 15 new codes. If I were being miserly, I wouldn't cover A500-A5FF. SCSU doesn't work well with large syllabaries, especially if they include a lot of unused characters within the half-blocks used. Inuit suffers badly from this, but still achieves noticeable compression. I experimented with compressing Yi transposed to a covered range, and found that it achieved something like 10% compression. Yi suffers from needing the 8 dynamic windows to be switched between 10 half-blocks (with occasional excursions to an 11th). If the Yi characters had been arranged by tone first and initial consonant second, 2 of the half-blocks would never have been used in my sample! Vai (A500-A63F) fits in 3 half-blocks, and I would expect non-Vai characters in it to be in static blocks. Given how well Yi performed, I expect Vai to benefit from SCSU. Has anyone investigated the performance of SCSU with Cuneiform or Egyptian Hieroglyphs? It might achieve better than 50% compression! A fair comparison for Egyptian Hieroglyphs depends on the mark-up used, for Unicode on its own does not enable one to write reasonable Middle Egyptian. Richard.
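The half-block notion above is easy to state concretely: SCSU's dynamic windows each cover a run of 128 code points, so a script compresses well when its non-ASCII characters cluster into few such half-blocks. A rough sketch of that test (this illustrates the windowing idea only; it is not an SCSU implementation):

```python
def half_blocks(text):
    """Return the set of 128-code-point half-blocks used by non-ASCII characters."""
    return {ord(c) & ~0x7F for c in text if ord(c) > 0x7F}

# A short Cyrillic sample clusters into a single half-block (U+0400..U+047F),
# so SCSU needs at most one window; a large syllabary spreads over many.
print(len(half_blocks("Привет")))  # -> 1
```

Counting `half_blocks` over a corpus gives a quick estimate of how many window definitions and switches SCSU would need for a given script.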
Re: Unicode, SMS and year 2012
anbu at peoplestring dot com wrote:

> A document encoded in SCSU or BOCU-1, given that the document contains only ASCII characters, may appear corrupt on a system that doesn't recognise SCSU or BOCU-1.

This is the curious point of view that ASCII compatibility (or transparency) is a bad thing. It does not apply to BOCU-1, which is not ASCII-transparent. Documents encoded in *any* format are likely to appear corrupt on a system that doesn't recognize the encoding. They are guaranteed to appear corrupt if character boundaries do not align with byte boundaries, which is what you propose here:

0111100101011001101110100101010110011000101010100101011101110101

If I'm going to use a variable-length, non-byte-aligned encoding, where there is no chance of realigning in case of a flipped or dropped bit (which seems to be of great concern to many people), I might as well go ahead and use a Huffman or LZ type of encoding (or a combination, like DEFLATE). Is this the same encoding you were proposing a little over a year ago, or an outgrowth of the same ideas? -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
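The DEFLATE alternative mentioned above is easy to try: a general-purpose compressor already handles repetitive text well without inventing a new bit-aligned character encoding. A small sketch using DEFLATE via Python's zlib (the sample text is arbitrary):

```python
import zlib

# Repetitive ASCII text, the easy case for an LZ77 + Huffman combination.
text = ("the quick brown fox jumps over the lazy dog. " * 40).encode("utf-8")
packed = zlib.compress(text, level=9)   # DEFLATE

print(len(text), "->", len(packed))                # large reduction on this input
assert zlib.decompress(packed) == text             # lossless round trip
```

The point being made in the post: once you give up byte alignment anyway, a stock compressor gets the size win without requiring every text-processing layer to learn a new encoding.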
Re: Unicode, SMS and year 2012 - SQU, not UQU
On Sat, 28 Apr 2012 18:55:00 +0100 Richard Wordingham richard.wording...@ntlworld.com wrote: I wrote: With SCSU that avoids Unicode mode and UQU whenever possible, most alphabetic languages work fairly well. I meant: With SCSU that avoids Unicode mode and SQU whenever possible, most alphabetic languages work fairly well. UQU only occurs in Unicode mode, and escapes tag bytes. SQU does not use a window for a character, but passes it as 2 bytes of following data. Of course, an initial byte-order mark may be emitted using SQU; this has only a small impact on performance. Richard.
Re: Unicode, SMS and year 2012
Mark Davis wrote:

> I suspect the punycode goal is to take a wide character set into a restricted character set, without caring much about the resulting string length; if the original string happens to be in a character set other than the target restricted character set, then the string length increases too much to be of interest in the SMS discussion.
>
> That is not correct. One of the chief reasons that punycode was selected was the reduction in size.

But certainly the main motivation behind the development of Punycode, or any of the ACEs (ASCII-Compatible Encodings) that came before it, was to provide a compact encoding given the constraints of the set of characters allowed in domain names. The extensibility of the algorithm to target character sets of different sizes was definitely an advantage.

> Tests with the idnbrowser are not relevant. As I said: in that form, it uses a smaller number of bytes per character, but a parameterization allows use of all byte values. That is, the parameterization of punycode for IDNA is restricted to the 36 IDNA values per byte, thus roughly 5 bits. When you parameterize punycode for a full 8 bits per byte, you get considerably different results.

Not to say this isn't so, but can you point to a tool or site where a user can type a string and see the output with different parameterizations? Pretty much all of the "Convert to Punycode" pages I see are only able to convert to the IDNA target. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
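Not a reparameterizable tool, but for quick experiments Python ships a Punycode codec (the IDNA bootstring parameters, without the xn-- prefix), which at least shows the bootstring behavior: basic code points are copied literally, then a delimiter, then the deltas for the rest:

```python
s = "bücher"
enc = s.encode("punycode")
print(enc)                               # ASCII-only bootstring form, "bcher-..."
assert enc.decode("punycode") == s       # round-trips losslessly
print(len(s.encode("utf-8")), len(enc))  # compare with the UTF-8 size
```

Testing other parameterizations (e.g. a full 8-bit target alphabet) would still require modifying an implementation, as discussed above.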
Re: Fwd: Re: Unicode, SMS and year 2012
anbu at peoplestring dot com wrote:

> This clearly shows that my design yields number of values more than double that of UTF8

I didn't know we were competing against UTF-8 on efficiency. That's easy. UTF-8 is not at all guaranteed to be the most efficient encoding possible, or even reasonably possible. It was originally scoped to be not extravagant in terms of space, while providing other design features like byte boundaries, full ASCII transparency, easy detection, and prefixes that quickly indicate the length of the sequence. It's easy to beat the efficiency of UTF-8 in a byte-based encoding, if many of its other design features are ignored:

0xxxxxxx - encodes U+0000 through U+007F
1xxxxxxx 0xxxxxxx - encodes U+0080 through U+3FFF
1xxxxxxx 1xxxxxxx - encodes U+4000 through U+10FFFF (and onward to 0x1FFFFF)

This is a well-known and freely available technique, sometimes called self-delimiting numeric values (RFC 6256) and sometimes by other names. There are many reasons why a new encoding that is merely more efficient than UTF-8, especially one that sacrifices byte-based processing or other design features, will face a severe uphill battle in trying to displace UTF-8. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
Re: Unicode, SMS and year 2012
I wrote:

> 0xxxxxxx - encodes U+0000 through U+007F
> 1xxxxxxx 0xxxxxxx - encodes U+0080 through U+3FFF
> 1xxxxxxx 1xxxxxxx - encodes U+4000 through U+10FFFF (and onward to 0x1FFFFF)

The last code sequence should be 1xxxxxxx 1xxxxxxx 0xxxxxxx. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
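The corrected scheme is essentially a big-endian base-128 varint, as in the SDNV of RFC 6256: each byte carries 7 payload bits, with the high bit set on every byte except the last. A minimal sketch (function names are mine):

```python
def sdnv_encode(cp: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups, most significant group first."""
    groups = [cp & 0x7F]
    cp >>= 7
    while cp:
        groups.append(cp & 0x7F)
        cp >>= 7
    groups.reverse()
    # Continuation bit (0x80) on all but the final group.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])

def sdnv_decode(data: bytes) -> int:
    value = 0
    for b in data:
        value = (value << 7) | (b & 0x7F)
    return value

# Matches the ranges in the post: 1 byte up to U+007F, 2 bytes up to U+3FFF,
# 3 bytes up to 0x1FFFFF (so all of U+0000..U+10FFFF fits in three bytes).
print([len(sdnv_encode(cp)) for cp in (0x7F, 0x80, 0x3FFF, 0x4000, 0x10FFFF)])
# -> [1, 2, 2, 3, 3]
```

As the post notes, this beats UTF-8's sizes (UTF-8 needs 3 bytes at U+0800 and 4 at U+10000) precisely by giving up UTF-8's self-synchronization and lead-byte length prefixes.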
Re: Fwd: Re: Unicode, SMS and year 2012
On Sat, 28 Apr 2012 13:15:48 -0600, Doug Ewell d...@ewellic.org wrote:

> There are many reasons why a new encoding that is merely more efficient than UTF-8, especially one that sacrifices byte-based processing or other design features, will face a severe uphill battle in trying to displace UTF-8.

What are some of the reasons a new encoding will face?
Fwd: Re: Fwd: Re: Unicode, SMS and year 2012
The question shall read as: What are some of the reasons a new encoding will face challenges?

-------- Original Message -------- Subject: Re: Fwd: Re: Unicode, SMS and year 2012 Date: Sat, 28 Apr 2012 15:32:47 -0400 From: a...@peoplestring.com To: d...@ewellic.org

> There are many reasons why a new encoding that is merely more efficient than UTF-8, especially one that sacrifices byte-based processing or other design features, will face a severe uphill battle in trying to displace UTF-8.

What are some of the reasons a new encoding will face?
Re: Unicode, SMS and year 2012
On Sat, 28 Apr 2012 12:53:17 -0600, Doug Ewell wrote:

> Not to say this isn't so, but can you point to a tool or site where a user can type a string and see the output with different parameterizations? Pretty much all of the "Convert to Punycode" pages I see are only able to convert to the IDNA target.

Not sure when, but I will try to take a look at the code provided here; maybe I can figure out which parameter must be altered in order to do a test: http://phlymail.com/en/downloads/idna/download/ (though I am not very hopeful, PHP not being my primary attraction). Cristi -- Cristian Secară http://www.secarica.ro
Re: Unicode, SMS and year 2012
anbu at peoplestring dot com wrote:

> What are some of the reasons a new encoding will face challenges?

The main challenge to a new encoding is that UTF-8 is already present in numerous applications and operating systems, and that any encoding intended to serve as an alternative, let alone a replacement for UTF-8, must be sufficiently better to justify re-engineering those systems.

Some people are simply opposed to additional encoding schemes. The HTML5 specification explicitly forbids the use of UTF-32, SCSU, and BOCU-1 (while allowing many non-Unicode legacy encodings and quietly mapping others to Windows encodings); one committee member was quoted as saying that other encodings of Unicode waste developer time.

Any encoding that does not align code point boundaries along byte boundaries will be criticized for requiring excessive processing. The argument that I made will be made by others: if it is necessary to process bit-by-bit, one might as well use a general-purpose compression algorithm. It is popular to present gzip as the ideal compression approach, since it is widely available, especially on Linux-type systems, and publicly documented (and not IP-encumbered).

I may have missed some other objections. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
Re: Unicode, SMS and year 2012
On Sat, 28 Apr 2012 12:41:51 -0600, Doug Ewell wrote:

> If I'm going to use a variable-length, non-byte-aligned encoding, where there is no chance of realigning in case of a flipped or dropped bit (which seems to be of great concern to many people), I might as well go ahead and use a Huffman or LZ type of encoding (or a combination, like DEFLATE).

The standard 3GPP TS 23.042 [1] provides a Huffman compression method for SMS, yet it seems to me it needs the language to be known at the time of writing (or at least at the time of effective sending). It also provides per-language predefined dictionaries using code page 850 or 437, but I have not finished reading all the details, so my overview may be distorted. While in theory this standard is promising (and it was issued a long time ago, which is probably why it uses the IBM-like code pages), in practice I am not aware of any implementation (it is certainly not in my device or in the provided PC application). Cristi [1] http://www.3gpp.org/ftp/Specs/html-info/23042.htm -- Cristian Secară http://www.secarica.ro
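The language dependence noted above is inherent to static Huffman schemes like the one in TS 23.042: the code table is derived from an assumed symbol frequency distribution, so a table tuned for one language inflates text in another. A toy sketch of table construction with Python's heapq (the frequencies come from an arbitrary sample, not from the 3GPP tables):

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a prefix code (symbol -> bit string) from a frequency map."""
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

text = "this is an example of a huffman tree"
code = huffman_code(Counter(text))       # table tuned to THIS text's frequencies
bits = "".join(code[c] for c in text)
print(len(text) * 8, "->", len(bits))    # 8-bit chars vs. frequency-tuned bits
```

Encoding a different language with `code` would assign long bit strings (or none at all) to its common symbols, which is exactly why the 3GPP scheme needs per-language dictionaries known to the sender.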
Re: Unicode, SMS and year 2012
Hi Cristian, This is a bit of a deviation from the issues you raise, but it relates to the subject in a different way. The SMS character set does not seem to follow Unicode. How I see Unicode is as a set of character groups: 7-bit, 8-bit (which extends and replaces 7-bit), 16-bit, and CJKV, which uses some sort of 16-bit pairing. As Unicode says, they are just numeric codes assigned to letters or whatever other ideas. It is the task of the devices to decide what they are and show them. You say that there are only two character sets in GSM: 7-bit, which is a reassignment of codes to a selected set of Latin letter shapes, and 16-bit for the rest. It appears as if they decided that a certain set of letters is common to some preferred markets, and that it is efficient to reassign the established Unicode characters to these newly selected letter shapes. Had they simply used the 8-bit ISO-8859-1 set, the number of characters per SMS would have been limited to 140 instead of 160. (Is that why Twitter limits the number of characters to 140?) Of course, that would not have included some users whose letters are 16-bit characters under Unicode. I made a comprehensive transliteration for the Singhala script (Singhala+Sanskrit+Pali). It shows perfectly when 'dressed' with a smartfont. The following are two web sites that illustrate this solution (every character is ISO-8859-1, except for the occasional ZWNJ, which actually should be the 8-bit NBH that somebody decided to leave undefined). Use any browser except IE; IE does not understand OpenType. http://www.lovatasinhala.com (hand coded) http://www.ahangama.com/ (WordPress blog) All Indic languages could be transliterated this way. It makes Indic similar to Latin-based European languages, with intuitive typing and orthographic results, which Unicode Sinhala can't do. It takes about half the bandwidth to transmit compared with the double-byte set.
I just noticed that transliterated Singhala would not be fully covered by SMS 7-bit, because some Unicode 8-bit characters are not in this set. Looking at my iPhone, I see that the International icon brings up key-layout plus font pairs. I think what they should do is separate fonts and key layouts. This way, the user could select the key layout for input and whatever font they want to use to show it. The next thing I am going to say has made many readers here very angry, but may I say it again? The idea of a Last Resort Font that makes basic editors Plain Text is a ploy to brag that the computer can show all the world's languages, most of which you cannot read anyway. Text runs of foreign languages should show as a series of Glyph Not Found characters or the specific hint glyph of a language. The user of a foreign language would know where to download fonts for their native language. In the small market of Singhala, no font is present that goes typographically well with Arial Unicode. There is no incentive or money to make beautiful fonts for a minority language like Singhala. The plain text result for Singhala is ugly. The OS makers unnecessarily made hodge-podge Last Resort Fonts. I hope both the mobile device industry and the PC side separate fonts and characters and allow users to decide the default font sets on their devices. This is eminently rational because the rendering of the font happens locally, whereas the characters travel across the network. This will also help those like me who understand that their language is better served by a transliteration solution than by a convoluted double-byte solution that discourages the natives from using their script. Actually, this is causing bilingual Singhalese to abandon their native language. The government is placing special emphasis on English, as Singhala is terribly difficult to use in the modern setting. This is a grave problem for a society with a near-100% literacy rate and just a few million people.
On Fri, Apr 27, 2012 at 3:06 AM, Cristian Secară or...@secarica.ro wrote: A few years ago there was a discussion here about Unicode and SMS (Subject: Unicode, SMS, PDA/cellphones). Then and now the situation is the same, i.e. an SMS text message that uses characters from the GSM character set can include 160 characters per message (stream of 7 bits × 160), whereas a message that uses anything else can include only 70 characters per message (stream of UCS-2 16 bits × 70). Although my language (Romanian) was and is affected by this discrepancy, at the time I was skeptical about the possibility of improving anything in this area, mostly because back then both the PC and mobile markets suffered from other critical language problems for me (like missing glyphs in fonts, or improper keyboard implementations). Things evolved and now the perspectives are much better. Regarding SMS, at that time Richard Wordingham pointed out that SCSU might be a proper solution for the SMS encoding [when it comes to non-GSM characters]. Recently I studied as many aspects as I could about the SMS standardization, in a step that I started approx a
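The 160-versus-70 split Cristian describes is pure arithmetic over a 140-octet user-data field: 160 × 7 bits = 140 bytes for GSM 7-bit, and 70 × 16 bits = 140 bytes for UCS-2. A minimal sketch of the LSB-first septet packing defined in 3GPP TS 23.038 (the function name is mine):

```python
def pack_septets(septets):
    """Pack 7-bit values into octets, LSB-first as in 3GPP TS 23.038."""
    acc, nbits, out = 0, 0, bytearray()
    for s in septets:
        acc |= (s & 0x7F) << nbits   # append 7 new bits above what we hold
        nbits += 7
        while nbits >= 8:            # emit each completed octet
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:                        # trailing partial octet, zero-padded
        out.append(acc & 0xFF)
    return bytes(out)

# 160 septets fit exactly into the 140-octet SMS payload:
print(len(pack_septets([0x41] * 160)))  # 140
```

The same 140 octets hold only 70 UCS-2 code units, which is the discrepancy the whole thread is about.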
Re: Fwd: Re: Unicode, SMS and year 2012
On Friday, April 27, anbu at peoplestring dot com wrote: In addition I had a few more questions, of which the one below is the most significant: What if one had to send a text in multiple scripts, like in the case of a text and its translation in the same message? I thought maybe a new transition format or a new character encoding would be suitable. As a test, I took the first sentence from Article 1 of the UDHR (an increasingly common benchmark), and used Google Translate to derive the Hindi and Tamil equivalents: All human beings are born free and equal in dignity and rights. सभी मनुष्य स्वतंत्र और गरिमा और अधिकारों में बराबर पैदा होते हैं. எல்லா மனிதர்களும் இலவச மற்றும் கௌரவம் மற்றும் உரிமைகள் சம பிறக்கின்றன. (I don't vouch for the correctness of these translations; if you know Hindi or Tamil and disagree with them, please provide your own.) This is 84 characters from the Basic Latin block (including spaces used in all three languages), 53 from Devanagari, and 62 from Tamil. I encoded the resulting text in SCSU, with each line terminating in CRLF and with the U+FEFF signature (0E FE FF) at the beginning. The Devanagari passage is encoded as one byte per Unicode character, preceded by a single SC4 tag byte to select window 4, which is predefined to the Devanagari block. The Tamil passage is also encoded as one byte per character, preceded by a two-byte SD3 tag to define a window into the Tamil block and select it. The total size of these three lines of text in SCSU, including signature and CRLF, is 211 bytes. That's probably about as good as any non-general-purpose Unicode compression encoding can achieve, and better than most. I'm curious how well Anbu's proprietary encoding will stack up. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
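Doug's Devanagari case can be reproduced with a toy encoder. Per UTS #6, SCSU predefines dynamic window 4 at offset U+0900 (Devanagari), selected by the single tag byte SC4 (0x14); with a window active, bytes 0x80 through 0xFF denote offset+0x00 through offset+0x7F, and ASCII passes through unchanged. The sketch below is deliberately minimal: it handles only ASCII and Devanagari, and ignores the signature, the other windows (including the Tamil window definition), and SCSU's treatment of C0 control bytes.

```python
SC4 = 0x14            # SCSU tag byte: select dynamic window 4
DW4_OFFSET = 0x0900   # window 4 is predefined to the Devanagari block

def scsu_ascii_devanagari(text):
    """Toy SCSU encoder covering ASCII + Devanagari only."""
    out, window4_active = bytearray(), False
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)                      # ASCII is emitted as-is
        elif DW4_OFFSET <= cp <= DW4_OFFSET + 0x7F:
            if not window4_active:
                out.append(SC4)                 # one tag byte, paid once
                window4_active = True
            out.append(cp - DW4_OFFSET + 0x80)  # one byte per character
        else:
            raise ValueError("outside this toy repertoire")
    return bytes(out)

text = "\u0928\u092e\u0938\u094d\u0924\u0947"  # नमस्ते, 6 characters
print(len(scsu_ascii_devanagari(text)), len(text.encode("utf-8")))  # 7 18
```

Six Devanagari characters cost seven bytes, matching the one-tag-byte overhead Doug describes, where UTF-8 needs three bytes per character.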
Re: Unicode, SMS and year 2012
Richard Wordingham wrote: With SCSU that avoids Unicode mode and UQU whenever possible, most alphabetic languages work fairly well. However, extra windows are needed to cover the half-blocks from A480 to ABFF, 15 new codes. If I were being miserly, I wouldn't cover A500-A5FF. In November 2010 I proposed updating the SCSU spec to do just that. (There were a couple of other suggestions in the proposal, all severable.) Reaction to the proposal was not encouraging: http://www.unicode.org/mail-arch/unicode-ml/y2010-m11/0005.html http://www.unicode.org/mail-arch/unicode-ml/y2010-m11/0008.html SCSU doesn't work well with large syllabaries, especially if they include a lot of unused characters within the half-blocks used. Inuit suffers badly from this, but still achieves noticeable compression. I experimented with compressing Yi transposed to a covered range, and found that it achieved something like 10% compression. Yi suffers from needing the 8 dynamic windows to be switched between 10 half-blocks (with occasional excursions to an 11th). If the Yi characters had been arranged by tone first and initial consonant second, 2 of the half-blocks would never have been used in my sample! Medium-sized writing systems such as syllabaries, which span more than one or two 128-blocks and cross among them constantly (not just for isolated characters), have always been the Achilles heel of SCSU. You can't realistically encode something like Canadian Syllabics on its own using 7 bits per character, or even 8. The best hope is to be able to use windows, and hope that window switching can be kept to a minimum. As you noted with Yi, how successful that is depends on character frequency and whether common characters are concentrated in one or two half-blocks, or whether they are scattered. The design goal of SCSU was to encode text about as efficiently as in legacy encodings.
For small alphabetic scripts, the examples were the numerous 8-bit encodings for Latin and Cyrillic and Greek, as well as things like ARMSCII and ISCII. Unicode mode was meant for really large scripts like Han and precomposed Hangul, where 16 bits per character was considered acceptable (and better than UTF-8). The design goal was met, but medium-sized scripts (with no legacy encodings to compete against) didn't fare so well. There is no mechanism in SCSU to encode a character in a non-integral number of bytes, and that's probably good; such a mechanism would have made SCSU, already criticized for its complexity, much more complex. Note that most of the above applies to BOCU-1 as well, for what it's worth. Vai A500-A63F fits in 3 half-blocks, and I would expect non-Vai characters in it to be in static blocks. Given how well Yi performed, I expect Vai to benefit from SCSU. It does benefit by comparison to UTF-8. Addition of window offset bytes to point to this area would help further, but see "not encouraging" above. Has anyone investigated the performance of SCSU with Cuneiform or Egyptian Hieroglyphics? It might achieve better than 50% compression! A fair comparison of Egyptian Hieroglyphics depends on the mark-up used, for Unicode on its own does not enable one to write reasonable Middle Egyptian. If you have realistic samples of text in these scripts that you could send (privately), I could experiment. Most of my samples for experimentation in compression have lately come from the UDHR in Unicode project. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
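The window-switching penalty for medium-sized scripts can be made concrete with a rough cost model. This is my own back-of-the-envelope sketch, not the UTS #6 algorithm: it charges one byte per character, one byte (SCn) to re-select an already defined window, and two bytes (SDn plus an offset byte) to define a new one, and it ignores real SCSU's offset-encoding rules and window-replacement strategy.

```python
def scsu_cost_model(codepoints, num_windows=8):
    """Rough byte-cost model for SCSU single-byte mode (illustrative)."""
    windows, active, cost = [], None, 0
    for cp in codepoints:
        if cp < 0x80:
            cost += 1                 # ASCII passes through
            continue
        block = cp & ~0x7F            # enclosing 128-code-point half-block
        if block != active:
            if block in windows:
                cost += 1             # SCn: re-select a defined window
            else:
                cost += 2             # SDn + offset byte: define a window
                windows.append(block)
                if len(windows) > num_windows:
                    windows.pop(0)    # crude oldest-first replacement
            active = block
        cost += 1                     # the character byte itself
    return cost

# Ten Yi-range characters in one half-block vs. alternating between two:
one_block = [0xA000 + i for i in range(10)]
two_blocks = [0xA000, 0xA080] * 5
print(scsu_cost_model(one_block), scsu_cost_model(two_blocks))  # 12 22
```

Ten characters inside one half-block cost 12 bytes, while the same ten characters alternating between two half-blocks cost 22: the kind of penalty that Yi-style interleaving incurs.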
Fwd: Re: Unicode, SMS and year 2012
Please note the following corrections to the mail below: The number of codes supported with a given number of bits, n, is given by: [2 ^ (n ÷ 2)] [n - 4] The total number of codes supported with a given number of bits, n, and all the numbers of bits less than it is given by: 3 [2 ^ (n ÷ 2)] [n - 4] - 64 Original Message Subject: Re: Unicode, SMS and year 2012 Date: Fri, 27 Apr 2012 07:54:41 -0400 From: a...@peoplestring.com To: or...@secarica.ro, unicode@unicode.org Hi! I also had the same questions. In addition I had a few more questions, of which the one below is the most significant: What if one had to send a text in multiple scripts, like in the case of a text and its translation in the same message? I thought maybe a new transition format or a new character encoding would be suitable. I am currently working on a new form of representation. This is how it goes: All the characters of the block C0 Controls and Basic Latin are included, with their design unaltered, that is, they are encoded in eight bits (including the initial zero), given by 0xxxxxxx. All the other codes would surely be designed greater than eight bits. I was assuming the design, given by the following EBNF, would help: 1(0|1){1(0|1)}(0|1)(0|1){0(0|1)}0(0|1)1(0|1) Please note that this design produces codes whose numbers of bits are even numbers greater than eight. That is, 10, 12, 14, 16, 18, 20, 22, ... and so on. The number of codes supported with a given number of bits, n, is given by: [2 ^ (n ÷ 2)] [n - 4] The total number of codes supported with a given number of bits, n, and all the numbers of bits less than it is given by: 3 [2 ^ (n ÷ 2) - 1] [n - 4] + 74 Please note that the sign '^' represents 'raised to the power of', just as in most computer applications. Further, note that this design is still under development, so it may be subject to minor corrections.
I chose to design codes whose numbers of bits are even numbers only, rather than all integers, so that in the event of corruption of a byte, let's say due to network failure, somewhere between other bytes that conform to this standard, only the part with the corrupt byte and a few consecutive bytes would be affected, making the effect of the byte loss minimal. All the information given above in this mail are my intellectual property and my concern is to be sought before using them for any purpose. Regards, Anbu Kaveeswarar Selvaraju On Fri, 27 Apr 2012 11:06:23 +0300, Cristian Secară or...@secarica.ro wrote: A few years ago there was a discussion here about Unicode and SMS (Subject: Unicode, SMS, PDA/cellphones). Then and now the situation is the same, i.e. an SMS text message that uses characters from the GSM character set can include 160 characters per message (stream of 7 bits × 160), whereas a message that uses anything else can include only 70 characters per message (stream of UCS-2 16 bits × 70). Although my language (Romanian) was and is affected by this discrepancy, at the time I was skeptical about the possibility of improving anything in this area, mostly because back then both the PC and mobile markets suffered from other critical language problems for me (like missing glyphs in fonts, or improper keyboard implementations). Things evolved and now the perspectives are much better. Regarding SMS, at that time Richard Wordingham pointed out that SCSU might be a proper solution for the SMS encoding [when it comes to non-GSM characters]. Recently I studied as many aspects as I could about the SMS standardization, in a step that I started approx a year ago regarding the SMS language discrimination caused simply by the difference in message length and cost for the same sentence written with diacritical marks (correctly for that language) or without diacritical marks (incorrectly for that language).
Or, for the same reason, language discrimination between (say) a French message and (say) a Romanian message, both written correctly. It turned out that they (ETSI and its groups) created a way to solve the 70-character limitation, namely the “National Language Single Shift” and “National Language Locking Shift” mechanisms. This is described in the 3GPP TS 23.038 standard and has been included since release 8. In short, it is a character substitution table, applied per character or per message, defined per language. Personally I find this to be a stone-age-like approach, which in my opinion does not work at all if I enter the message from my PC keyboard via the phone's PC application (because the language cannot always be predicted, mainly if I am using dead keys). It is true that the actual SMS stream limit is not very generous, but I wonder if SCSU would have been a better approach in terms of i18n. I also don't know whether SCSU requires a language to be declared beforehand, or whether it simply guesses the required window for each character by itself. Apparently the SCSU seems to be ok
Fwd: Re: Unicode, SMS and year 2012
Further correction: I was assuming the design, given by the following EBNF, would help: 1(0|1){1(0|1)}(0|1)(0|1)(0|1)(0|1){0(0|1)}0(0|1)1(0|1) The number of codes supported with a given number of bits (greater than eight bits), n, is given by: [2 ^ (n ÷ 2)] [n - 4] Original Message Subject: Fwd: Re: Unicode, SMS and year 2012 Date: Fri, 27 Apr 2012 08:14:13 -0400 From: a...@peoplestring.com To: or...@secarica.ro, unicode@unicode.org [...]
Re: Unicode, SMS and year 2012
Cristian Secară orice at secarica dot ro wrote: It turned out that they (ETSI and its groups) created a way to solve the 70-character limitation, namely the “National Language Single Shift” and “National Language Locking Shift” mechanisms. This is described in the 3GPP TS 23.038 standard and has been included since release 8. In short, it is a character substitution table, applied per character or per message, defined per language. Personally I find this to be a stone-age-like approach, which in my opinion does not work at all if I enter the message from my PC keyboard via the phone's PC application (because the language cannot always be predicted, mainly if I am using dead keys). It is true that the actual SMS stream limit is not very generous, but I wonder if SCSU would have been a better approach in terms of i18n. I also don't know whether SCSU requires a language to be declared beforehand, or whether it simply guesses the required window for each character by itself. I agree that treating character repertoire as simply a matter of language selection, and creating language-specific code pages, is a backward-looking solution. Not only is language tagging not always an option, as Cristian points out, but people don't want to be tied to the absolute minimum character repertoire that someone decided was necessary to write a given language, even in a text message. Just look at the rise of emoji in text messages. And, of course, I agree that SCSU would have been a much better solution. Most of the current arguments against SCSU wouldn't apply to SMS: the cross-site scripting argument wouldn't apply if SCSU were the only extended encoding, or if the protocol tagged it, and the complex-encoder argument wouldn't apply to any phone from the last 5 years that can take pictures and shoot videos and scan bar codes and run numerous apps simultaneously. (SCSU doesn't require a complex encoder anyway, although it can benefit incrementally from one.)
Interestingly, one of the first mentions I can find on the Unicode list of SCSU-like compression — actually a description of RCSU, the predecessor to SCSU — was in the context of SMS message compression: http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML001/0242.html Neither RCSU nor SCSU quite fits the original bill, which was to represent Unicode in 7 bits per character (with some overhead) and thus achieve 160 characters per message. Both schemes use 8-bit code units. Still, 140 characters is much better than 70. Apparently the SCSU seems to be ok for my language, or Hungarian, or Bulgarian, etc., but is this ok also for non-Latin and non-Cyrillic scripts? This versus the language shift mechanism, which is still 7-bit. Release 10 of that standard includes language locking shift tables for Turkish, Portuguese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Oriya, Punjabi, Tamil, Telugu and Urdu. SCSU works equally well, or almost so, with any text sample where the non-ASCII characters fit into a single block of 128 code points. For anything other than Latin-1 you need one byte of overhead, to switch to another window, and for many scripts you need two, to define a window and switch to it. But again, two bytes is not what's holding anyone up. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
Re: Unicode, SMS and year 2012
Actually, if the goal is to get as many characters in as possible, Punycode might be the best solution. That is the encoding used for internationalized domains. In that form, it uses a smaller number of bytes per character, but a parameterization allows use of all byte values. -- Mark https://plus.google.com/114199149796022210033 — Il meglio è l’inimico del bene — On Fri, Apr 27, 2012 at 11:21, Doug Ewell d...@ewellic.org wrote: [...]
RE: Unicode, SMS and year 2012
Mark Davis mark at macchiato dot com wrote: Actually, if the goal is to get as many characters in as possible, Punycode might be the best solution. That is the encoding used for internationalized domains. In that form, it uses a smaller number of bytes per character, but a parameterization allows use of all byte values. That might work well if the goal is to find a compact encoding to 7-bit code units, and then express 8 such code units in 7 bytes. It would certainly be more economical than UTF-7-over-7, which is fine for ASCII and awful for anything else. I don't usually think of Punycode as an ideal general-purpose compression encoding, especially with lines of arbitrary length or consisting primarily of non-ASCII content (Cristian's example), but it's certainly worth experimenting with. One advantage might be that encoders and decoders for Punycode already exist, probably in greater numbers than for SCSU. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
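Doug's last point is easy to check: Python has long shipped a Punycode codec implementing RFC 3492 (the base-36 IDNA parameterization Mark mentioned, not an 8-bit one). A quick size comparison, as a sketch:

```python
# Python's built-in punycode codec uses the base-36 IDNA parameterization,
# so each output byte carries only about 5 bits.
text = "παράδειγμα"  # "example" in Greek, 10 characters

puny = text.encode("punycode")
utf8 = text.encode("utf-8")

print(puny)                  # b'hxajbheg2az3al'
print(len(puny), len(utf8))  # 14 20: shorter than UTF-8 even at base 36
```

Even restricted to 36 output values per byte, the delta coding beats UTF-8's two bytes per Greek character here; an 8-bit parameterization would do considerably better still.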
Re: Unicode, SMS and year 2012
Anbu Kaveeswarar Selvaraju anbu at peoplestring dot com wrote: What if one had to send a text in multiple scripts, like in the case of a text and its translation in the same message? I thought maybe a new transition format or a new character encoding would be suitable. I am currently working on a new form of representation. This is how it goes: I don't see how this is better than SCSU. Perhaps if you can provide some examples of text strings and how they would be represented in your encoding, we can judge. On the other hand... All the information given above in this mail are my intellectual property and my concern is to be sought before using them for any purpose. Never mind. Not interested. If I wanted a compression encoding that was encumbered with IP restrictions, I'd choose BOCU-1. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
Re: Unicode, SMS and year 2012
On Fri, 27 Apr 2012 12:26:25 -0700, Mark Davis ☕ wrote: Actually, if the goal is to get as many characters in as possible, Punycode might be the best solution. That is the encoding used for internationalized domains. In that form, it uses a smaller number of bytes per character, but a parameterization allows use of all byte values. I suspect the Punycode goal is to take a wide character set into a restricted character set, without caring much about the resulting string length; if the original string happens to be in a character set other than the target restricted character set, then the string length increases too much to be of interest in the SMS discussion. Just do a test: write something in a non-Latin alphabetic script into this page here http://demo.icu-project.org/icu-bin/idnbrowser Cristi -- Cristian Secară http://www.secarica.ro
Re: Unicode, SMS and year 2012
Hi On 2012/04/28 00:23, a...@peoplestring.com wrote: 1. let 'x' be the position of a code positioned at an odd number, e.g. when we take the code '1001010110', the first '1' is positioned at location '1' (so an odd number), the first '0' is positioned at location '2' (not an odd number), the next '0' is positioned at location '3' (an odd number) and so on. 2. the program takes into memory all the bits till it reaches the end (whether they are at position 'x' or not), till it has reached the end 3. the program checks each consecutive bit at position 'x'. 4. The program finds the end by the theory 'The bit before the last bit of the code is reached if and only if the bit value at 'x' has changed twice'. Changing twice is that the bit value must change from the initial '1' to '0', then back to '1'. The last bit is immediately after the '1' at position 'x', which in turn itself comes after a '0' at position 'x'. 5. Here we find this doesn't need much or complicated arithmetic. Simple logic is enough. You stated that in a far more complicated way than necessary... From what I understand from your description: * Read data as a string of bits. How data is transformed to this string is undefined, which is a problem. * Code words starting with an initial 0 code literal 7-bit ASCII values, which follow the initial zero bit: 0MXXXXXL, where M and L are the MSB and LSB of the respective ASCII value. * Code words starting with an initial 1 code variable-length values, which are magically created. Read N bits until a 1 bit is encountered (inclusive) on an even position within the bit string (where the position of the initial code word bit is 0), following a 0 bit on an even position. The complete word is N+2 bits long, including the initial 1 bit. Also, I wonder how efficiently your encoding can code general texts... Seeing as how your 10-bit codes can only code 192 out of 512 possible values, 12-bit codes only 512 out of 2048 values, and so on... 
This means you will have a massive amount of bits for rare-ish characters sooner or later... Regards, Robert
Re: Unicode, SMS and year 2012
That is not correct. One of the chief reasons that Punycode was selected was the reduction in size. Tests with the idnbrowser are not relevant. As I said:

> In that form, it uses a smaller number of bytes per character, but a
> parameterization allows use of all byte values.

That is, the parameterization of Punycode for IDNA is restricted to the 36 IDNA values per byte, thus roughly 5 bits. When you parameterize Punycode for a full 8 bits per byte, you get considerably different results.

--
Mark
https://plus.google.com/114199149796022210033
— Il meglio è l’inimico del bene — (The best is the enemy of the good)

2012/4/27 Cristian Secară or...@secarica.ro:

> I suspect the punycode goal is to take a wide character set into a
> restricted character set, without caring much on resulting string
> length; if the original string happens to be in other character set
> than the target restricted character set, then the string length
> increases too much to be of interest in the SMS discussion.
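[The "roughly 5 bits" figure above is just the base-2 logarithm of the output alphabet size; a quick sketch of the arithmetic, not part of the original message:]

```python
import math

# Punycode's output alphabet under the IDNA parameterization is the 36
# characters a-z plus 0-9, so each output character carries log2(36) bits
# of information. A parameterization over all 256 byte values would carry
# 8 bits per output byte instead.
idna_bits = math.log2(36)
full_bits = math.log2(256)

print(round(idna_bits, 2))               # 5.17
print(round(full_bits / idna_bits, 2))   # 1.55 -- ~55% more per symbol
```

This is why the same Punycode algorithm, re-parameterized for full bytes, produces noticeably shorter output than the IDNA form tested in the idnbrowser.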
Re: Unicode, SMS and year 2012
On 2012/04/28 4:26, Mark Davis ☕ wrote:

> Actually, if the goal is to get as many characters in as possible,
> Punycode might be the best solution. That is the encoding used for
> internationalized domains. In that form, it uses a smaller number of
> bytes per character, but a parameterization allows use of all byte
> values.

Because Punycode encodes differences between character numbers, not the character numbers themselves, it can indeed be quite efficient, in particular if the characters used are tightly packed (e.g. Greek, Hebrew, ...). For languages written in Latin script with accented characters, the question is how close these accented characters are to each other in Unicode.

However, Punycode also codes character positions. Because of this, it gets less efficient for longer texts. [Because Punycode uses (circular) position differences rather than simple positions, this contribution is limited by the (rounded-up binary logarithm of the) weighted average distance between two occurrences of the same character in the text/language.]

My guess is therefore that Punycode won't necessarily be super-efficient for texts in the 100+ character range. It's difficult to test quickly, because the Punycode converters on the Web limit the output to 63 characters, the maximum length of a label in a domain name.

Regards, Martin.
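[One way around the 63-character limit of web converters: Python's built-in `punycode` codec implements RFC 3492 with the IDNA parameters and has no length cap, so longer texts can be tested directly. A sketch; the Greek sample sentence is arbitrary:]

```python
# Compare Punycode output length with UTF-8 length as the text grows,
# using Python's stdlib RFC 3492 "punycode" codec (IDNA parameters).
greek = "Η ταχεία καφετιά αλεπού πηδάει πάνω από τον τεμπέλη σκύλο. "

for repeats in (1, 2, 4):
    text = greek * repeats
    puny = text.encode("punycode")
    assert puny.decode("punycode") == text  # the codec round-trips
    print(len(text), len(puny), len(text.encode("utf-8")))
```

The printed triples (characters, Punycode bytes, UTF-8 bytes) let one check Martin's guess about how the position-coding overhead behaves as texts get longer.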
Re: Unicode, SMS and year 2012
On 2012/04/28 7:29, Cristian Secară wrote:

> I suspect the punycode goal is to take a wide character set into a
> restricted character set, without caring much on resulting string
> length; if the original string happens to be in other character set
> than the target restricted character set, then the string length
> increases too much to be of interest in the SMS discussion.

Not exactly. Compression was very much a goal when designing Punycode. It won against a number of other algorithms as the choice for IDNs, and it is clearly very good for that purpose.

> Just do a test: write something in a non-Latin alphabetic script into
> this page here http://demo.icu-project.org/icu-bin/idnbrowser

Well, as a silly example, what about ααα… (that's 57 α characters)? The result is xn--mxa…, which is 63 characters long.

Regards, Martin.
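[Martin's 63-character figure can be reproduced with Python's stdlib `punycode` codec; the `xn--` ACE prefix is added by hand, since the raw codec omits it. A sketch:]

```python
# The 57-α example: raw Punycode of "α" * 57 is "mxa" (the first α)
# followed by 56 "a" digits (each repeated α is a zero delta), and
# prepending the IDNA ACE prefix "xn--" gives a 63-character label.
label = ("α" * 57).encode("punycode").decode("ascii")
ace = "xn--" + label

print(ace[:10] + "...")   # xn--mxaaaa...
print(len(ace))           # 63
assert label == "mxa" + "a" * 56
assert len(ace) == 63
```

The run of single-digit "a"s is exactly the delta coding Martin describes: identical adjacent characters cost one output character each.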
Re: Unicode, SMS and year 2012
On Fri, 27 Apr 2012 17:28:13 -0700, Mark Davis ☕ wrote:

> That is not correct. One of the chief reasons that punycode was
> selected was the reduction in size. Tests with the idnbrowser are not
> relevant. As I said: In that form, it uses a smaller number of bytes
> per character, but a parameterization allows use of all byte values.

Sorry, I didn't understand your point about using all byte values right from the start. Will think about it.

Cristi
--
Cristian Secară
http://www.secarica.ro
Re: Unicode, SMS and year 2012
On 2012/04/27 17:06, Cristian Secară wrote:

> It turned out that they (ETSI and its groups) created a way to solve
> the 70-character limitation, namely the “National Language Single
> Shift” and “National Language Locking Shift” mechanism. This is
> described in the 3GPP TS 23.038 standard and was introduced in
> release 8. In short, it is about a character substitution table, per
> character or per message, defined per language. Personally I find this
> to be a stone-age-like approach,

Fully agreed.

> which in my opinion does not work at all if I enter the message from
> my PC keyboard via the phone's PC application (because the language
> cannot always be predicted, mainly if I am using dead keys). It is
> true that the actual SMS stream limit is not very generous, but I
> wonder whether SCSU would have been a better approach in terms of
> i18n. I also don't know whether SCSU requires a language to be
> declared in advance, or simply guesses the required window for each
> character by itself.

The right approach in this case isn't to discuss clever compression techniques (I've indulged in this in my other mails, too, sorry), but to realize that the underlying mobile/wireless technology has advanced a lot. SMSes are simply a relic of outdated technology, sold at a horrendous price. For more information, see e.g. http://mobile.slashdot.org/comments.pl?sid=433536&cid=22219254 or http://gthing.net/the-true-price-of-sms-messages. That's even for the case of pure ASCII messages. The solution is simply to stop using SMSes and upgrade to a better technology.

Regards, Martin.