Re: benefits of unicode
On Wed, Apr 18, 2001 at 02:09:30PM -0500, Ayers, Mike wrote:
> that the extra symbols can make the read a little easier, but they are not considered[1] necessary. We were discussing adequacy, not excellence, and to me the two are quite distinct.

THEN WHY WASTE A WHOLE BIT ON UPPER CASE? THEY CERTAINLY ARE NOT NECESSARY AND I HAVE FREQUENTLY SEEN PEOPLE NOT USE THEM WHEN AVAILABLE.

--
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
"I don't care if Bill personally has my name and reads my email and laughs at me. In fact, I'd be rather honored." - Joseph_Greg
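The "whole bit" is literal, by the way: ASCII assigns the two cases of a letter code points that differ only in bit 5 (value 0x20), so a caseless encoding really would free one bit per character. A minimal C illustration of that layout:

#include <stdio.h>

/* 'A' is 0x41 and 'a' is 0x61: only bit 5 (0x20) differs, and the
   same holds across the whole Latin alphabet. */
int main(void)
{
    printf("%c\n", 'a' & ~0x20); /* clearing bit 5 gives 'A' */
    printf("%c\n", 'A' | 0x20);  /* setting bit 5 gives 'a'  */
    return 0;
}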
Re: Latin w/ diacritics (was Re: benefits of unicode)
From: "Jungshik Shin" [EMAIL PROTECTED] As long as specific markets remain resistant to the idea of this work being done, this is no mere myth -- it is a reality. As a general statement, I might agree to the above. However, I'm a bit confused as to what you're specifically talking about here (that is, what you meant by 'this work' and 'specific markets'). I guess I'm supposed to read between lines, but I'm rather slow here. Could you elaborate a bit? I know that there has been resistance for CHT, CHS, JPN, and KOR solutions that involved anything that would de-emphasize the existing system of specific ideographs for specific code points and the support for 100% round tripping of data to and from Unicode. Because of this, any attempt to "synthesize" characters, whether from strokes, vowels, consonants, or pieces of chewing gum, has met with resistance. MichKa Michael Kaplan Trigeminal Software, Inc. http://www.trigeminal.com/
Re: Latin w/ diacritics (was Re: benefits of unicode)
On Wed, 18 Apr 2001, Michael (michka) Kaplan wrote:

> I know that there has been resistance to CHT, CHS, JPN, and KOR solutions that involved anything that would de-emphasize the existing system of specific ideographs for specific code points and the support for 100% round tripping of data to and from Unicode. Because of this, any attempt to "synthesize" characters, whether from strokes, vowels, consonants, or pieces of chewing gum, has met with resistance.

How on earth can 'ideographs' be synthesized from consonants and vowels? Moreover, when I wrote that 'CJK don't always go together', I wasn't talking about Chinese characters (ideographs) at all. I was talking about Korean Hangul only. (I think that was pretty clear in the part of my message you didn't quote, where I talked about Thai/Indic scripts and Hangul.) Also, I have no clue why a potentially drastic reduction (in principle/theory) of the font size for Korean by dynamic glyph shaping has anything to do with the round trip of existing data to and from Unicode.

Jungshik Shin
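For Hangul, at least, "synthesis from consonants and vowels" is not hypothetical: the precomposed syllables in Unicode are themselves arithmetically derived from jamo, which is exactly the regularity a dynamically shaping font could exploit. A short C sketch of the standard composition formula (the constants come from the Unicode Standard's Hangul syllable arithmetic):

#include <stdio.h>

/* Compose a precomposed Hangul syllable code point from jamo indices:
   leading consonant L (0..18), vowel V (0..20), trailing consonant T
   (0..27, 0 = none). All 11,172 modern syllables (19 * 21 * 28) are
   generated this way, starting at U+AC00. */
unsigned long hangul_syllable(int L, int V, int T)
{
    return 0xAC00UL + ((unsigned long)L * 21 + V) * 28 + T;
}

int main(void)
{
    /* L=0 (kiyeok), V=0 (a), T=0 -> U+AC00 HANGUL SYLLABLE GA */
    printf("U+%04lX\n", hangul_syllable(0, 0, 0));
    return 0;
}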
Re: Latin w/ diacritics (was Re: benefits of unicode)
> How on earth can 'ideographs' be synthesized from consonants and vowels? Moreover, when I wrote that 'CJK don't always go together', I wasn't talking about Chinese characters (ideographs) at all. I was talking about Korean Hangul only. (I think it was pretty clear in the part of my message you didn't quote, where I talked about Thai/Indic scripts and Hangul.)

I think you kind of missed my point: although you are dealing with three different scripts that have three different sets of issues, there are some similarities at a high level. The thing that is similar here is that in each case there are champions of the current system.

Although it may be useful to talk about font technologies that allow for much smaller font sizes, I doubt that anyone believes that the 12.8 MB for the Gulim ttc file (containing Gulim/GulimChe/Dotum/DotumChe) is made up only of Hangul -- as opposed to Hanja. Heck, I doubt you could claim it's even mainly made up of Hangul. The fact is that there are folks who are opposed to this type of change and are very sensitive about attempts to change things. Though of course if a font used such a method internally and no one ever really knew, then I suppose no one would be unhappy, right?

A similar issue exists for Chinese, where a different proposal often surfaces to try to synthesize characters from the various strokes and radicals. This also is met with opposition, and sometimes the arguments against such ideas have no more merit than in any other such case. I guess I was trying to stress that this is no mere "myth to be dispelled in the i18n community" but is a real issue in the minds of some (many?) customers.

> Also, I have no clue why a potentially drastic reduction (in principle/theory) of the font size for Korean by dynamic glyph shaping has anything to do with the round trip of existing data to and from Unicode.

I think I kind of covered this above... if no one knows that's what is happening in the font, then who will be the wiser? In fact I would hazard a guess that there are indeed fonts out there today that do this. It does not (of course) change the fact that some people are opposed to the idea, just as there are some who are opposed to such "solutions" to large Chinese fonts, etc.

michka
RE: Latin w/ diacritics (was Re: benefits of unicode)
Carl Brown wrote:
> If these folks really want Unicode everywhere I will write Unicode for the IBM 1401 if they are willing to foot the bill. Seriously I would never agree to such a ludicrous idea.

Thanks, Carl, but if "these folks" is me, I don't even know what an IBM 1401 is, let alone need you to write Unicode support for it.

If I am allowed to introduce one more anachronism, there exists a concept called "portability". So, once one of these nutshell implementations of Unicode exists (on, say, a DOS box with a bitmapped font), it would not be necessary to re-write it from scratch for each next "end-of-lifed unsupported OS" or embedded device. I hope this may cast a slightly different light on the effort-to-usefulness ratio of all this.

> Can you imagine a Unicode 3.1 character properties table that uses 16bit addressing?

I am not sure what you mean but, yes, I can imagine it very well. But it would be an unnecessary waste to load the whole database in memory, although it would be possible: the version 3.1 character properties file contains only about 13,000 lines. Multiply this by the 32 bits of a DOS "far pointer", and you obtain an array that still fits in a 64KB segment. OK: this array would overflow the segment as soon as 3,000-odd more characters are added to Unicode... But loading whole tables (or fonts) in memory is not really the way to go; you wouldn't do this even in much more powerful environments. It would be much better to keep the data in a file and access it through an efficient file indexing method and a well-tuned cache algorithm.

> Unicode takes lots of memory.

I promise that I won't use the word "myth" for at least a week. But my impression is that it is rather systems like OpenType and ATSUI that take lots of memory. And this is neither a surprise nor a scandal: these systems are designed for OS's that require lots of memory for *everything*. But this should not draw us to the conclusion that Unicode itself is a memory-eating monster. It is just a character set! The memory and storage requirements of Unicode are not so terribly much greater than those of, say, older double-byte systems.

_ Marco
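One common shape for the indexed access described above -- a sketch, not any specific implementation -- is a two-stage table: 256-entry blocks of property values, deduplicated, reached through a small first-stage index. The table names here are hypothetical, assumed to be generated offline from the Unicode character database:

#include <stdint.h>

#define BLOCK_SHIFT 8
#define BLOCK_SIZE  (1 << BLOCK_SHIFT)

/* Hypothetical tables, built offline from UnicodeData.txt: stage1 has
   one entry per 256-character block; stage2 holds the deduplicated
   blocks themselves, so identical blocks (e.g. long unassigned
   ranges) are stored exactly once, keeping the whole structure far
   smaller than a flat 64K-entry array. */
extern const uint16_t stage1[256];
extern const uint8_t  stage2[][BLOCK_SIZE];

/* Look up the property byte for a 16-bit (UCS-2) code point. */
uint8_t char_property(uint16_t cp)
{
    return stage2[stage1[cp >> BLOCK_SHIFT]][cp & (BLOCK_SIZE - 1)];
}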
Byte Order Marks
Hi,

A quick question relating to the Byte Order Mark of UCS-2. If it's absent, is it safe to assume any particular order (i.e. big or little endian)? I am writing a function to rearrange from big to little endian, but without a byte order mark I'm not sure what the order is. Is there any specification I could refer to?

Thanks,
Tom

Tomas McGuinness
Consultant
University Technology Park, Curraheen Rd, Cork
+353 21 4933 277 / +353 21 4933 201
[EMAIL PROTECTED]
CMG Telecom Products Division
Product Development, Cork
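The rearranging function itself is short enough to sketch in full; the one caveat (a general observation, not from any specification) is that the source order must be decided first, from a BOM or a protocol label, before swapping:

#include <stddef.h>

/* Swap the two bytes of every 16-bit code unit in place, converting
   big-endian UCS-2/UTF-16 to little-endian or vice versa. Apply it
   only after the source order is known; swapping text that was
   already in the desired order just corrupts it. */
void swap_utf16(unsigned char *buf, size_t nbytes)
{
    size_t i;
    for (i = 0; i + 1 < nbytes; i += 2) {
        unsigned char tmp = buf[i];
        buf[i]     = buf[i + 1];
        buf[i + 1] = tmp;
    }
}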
OT Porting to older OSes was RE: Latin w/ diacritics (was Re: benefits of unicode)
Marco,

I still remember the Univac I, which had memory tubes about the size of your fist (the Univac II used core). The 1401, however, was a fully transistorized computer. It used core memory which ranged in size from 1,400 to 16,000 6-bit bytes. (Unicode on 6-bit machines is another challenge.)

You are right about font files being big. However, there is no single "Unicode font", so you have the same large font files even without Unicode. Large font files are why some printers have their own disk drives.

Part of the reason that Unicode implementations are so large is that we need translation tables to maintain compatibility with old code pages. Eliminate these code pages and we reduce the size of the Unicode implementation. At least Windows is going in the right direction: all future scripts will be Unicode only. This way they don't have to carry the other baggage.

People may talk about line breaking, collation, fonts etc. being resource hogs. In actuality you need the same resources for code page systems as well. With Unicode, however, you get to reuse some of these resources if you support multiple scripts. The limit for code-page systems like Windows was reached with the Arabic/French systems. Beyond that you really need to use Unicode or you will have real code bloat. Unicode is the only practical solution for multi-lingual systems.

Carl
Re: Byte Order Marks
There is an RFC about UTF-16 (RFC 2781) that explains this. If the text is labeled by the protocol as:

charset=UTF-16    then the first two bytes are the byte order mark
charset=UTF-16BE  then it is big-endian and the first two bytes are just text
charset=UTF-16LE  then it is little-endian and the first two bytes are just text

If you don't have any clue about the byte order, but you know it is UTF-16, then assume BE. Similar for UTF-32[BE/LE]. If you don't know anything about your text, then you may start some heuristics or reject the text...

markus

Tomas McGuinness wrote:
> A quick question relating to the Byte Order Mark of UCS-2. If it's absent, is it safe to assume any particular order (i.e. big or little endian)?
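The untagged case above folds into a few lines of detection code; a sketch, assuming the RFC's fall-back to big-endian when no byte order mark is present:

#include <stddef.h>

typedef enum { UTF16_BE, UTF16_LE } Utf16Order;

/* Inspect the first two bytes of untagged UTF-16 text. 0xFE 0xFF and
   0xFF 0xFE are the two serializations of the U+FEFF byte order mark;
   with no BOM, fall back to big-endian. *skip_bom is set when the
   first two bytes were a signature rather than text. */
Utf16Order detect_utf16_order(const unsigned char *buf, size_t nbytes,
                              int *skip_bom)
{
    *skip_bom = 0;
    if (nbytes >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) { *skip_bom = 1; return UTF16_BE; }
        if (buf[0] == 0xFF && buf[1] == 0xFE) { *skip_bom = 1; return UTF16_LE; }
    }
    return UTF16_BE; /* no BOM: assume big-endian */
}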
Re: Latin w/ diacritics (was Re: benefits of unicode)
MC> Well, I am not saying that it would be easy, or that it would be worth
MC> doing, but would it really take *millions* of dollars for implementing
MC> Unicode on DOS or Windows 3.1?
MC> BTW, I don't know in detail the current status of Unicode support
MC> on Linux, but I know that projects are ongoing.

Okay, I'll byte, although I prefer to speak of ``free Unix-like systems'' rather than Linux only.

The easiest way of browsing the multilingual web on a 386 with 4 MB of memory and a 10 MB hard disk is probably to use the text-mode ``lynx'' browser in a terminal emulator that supports (a sufficiently large subset of) Unicode. One such terminal emulator is the Linux console, which only supports the very basics of Unicode. An alternative is the XFree86 version of XTerm, which also supports single combining characters and double-width glyphs. (Enough, for example, for Chinese or Thai, but not for Arabic.) In order to use that on a machine such as the one outlined above, you'll probably need to build a custom X server to save space, but it's definitely doable. (Drop me a note if you need a hand.)

I know of the existence of fairly lightweight and fully internationalised graphical browsers for Unix-like systems (Konqueror comes to mind), but I doubt you'll get away with much less than a fast 486 with 12 MB memory and 100 MB of disk.

Regards,
Juliusz
RE: Byte Order Marks
> If you don't have any clue about the byte order, but you know it is UTF-16, then assume BE.

Then why is ICU mapping UTF-16 to UTF16_PlatformEndian and not UTF16_BigEndian? I know that this was a difference between ICU and my library, and when I asked this question a while ago I was told that, despite what some literature suggests, without any clue, platform endianness should be used. That's contradictory.

YA
Fwd: Re: Byte Order Marks
Date: Thu, 19 Apr 2001 12:59:43 -0700
To: Tomas McGuinness [EMAIL PROTECTED]
From: Asmus Freytag [EMAIL PROTECTED]
Subject: Re: Byte Order Marks

At 02:58 PM 4/19/01 +0200, you wrote:
> If its absent is it safe to assume any particular order (i.e. Big or Little Endian?)

The default order is big-endian, but I wouldn't call that a 'safe' assumption. In the most general case I would attempt an autorecognition in the unlabelled case. Where a particular protocol's specification reinforces that the default order SHALL apply for the unlabelled case, the assumption becomes that much stronger, of course.

A./

PS: As an aside, the SCSU encoder can be used to do this form of autorecognition. If text shows much better compression in one byte order than the other, that byte order is overwhelmingly likely to be the true one. The exception would be strings of pure Han ideographs. For these it's necessary to
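A full SCSU encoder is far too long to sketch here, but a much cruder heuristic in the same spirit -- a different, simpler technique than the SCSU comparison described above -- exploits the fact that ASCII-range characters (spaces, line ends, punctuation), present in nearly all text, put zero bytes in the high-byte position:

#include <stddef.h>

/* Count zero bytes at even vs. odd offsets and guess the byte order
   that explains more of them. Effective for text containing spaces
   and other ASCII-range characters (high byte 0x00); like the SCSU
   approach, it is inconclusive for pure Han text, where both bytes
   vary freely. */
int looks_big_endian(const unsigned char *buf, size_t nbytes)
{
    size_t i, zeros_even = 0, zeros_odd = 0;
    for (i = 0; i + 1 < nbytes; i += 2) {
        if (buf[i] == 0x00)     zeros_even++; /* BE puts the high byte first  */
        if (buf[i + 1] == 0x00) zeros_odd++;  /* LE puts the high byte second */
    }
    return zeros_even >= zeros_odd;
}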
Unicode motivation/horror stories (was RE: benefits of unicode)
> Date: Wed, 18 Apr 2001 13:23:40 -0700 (PDT)
> From: Kenneth Whistler [EMAIL PROTECTED]
> Subject: RE: benefits of unicode
>
> > I wonder if we could add a page in this vein to the Unicode site, or failing that, to Tex's Benefits pages? That is, invite people to say which problems brought them to Unicode, and how Unicode addresses those problems. If you like the idea, let us take the discussion back on the list.
>
> It might be kind of fun to have a section of individual stories, "How I ended up doing Unicode", on the website. I wouldn't be the one organizing it, but you could float the idea on the list to see if others would like to participate. Tex might actually be a good place to start, since he already is doing the benefits stuff for the Progress site.
>
> --Ken

Ken told me offline that it was the lack of an IBM type ball with the schwa character that set him on this path. In my case, apart from a lifelong involvement in languages, math, and music, the proverbial last straw was that "Smart Quotes" in PageMaker 3 wrecked my APL listings. It took me two months to discover the cause and turn them off permanently. I first learned about ISO 10646 as a direct result of work on the ISO/ANSI APL standard, and about Unicode from John Dvorak's column in PC Magazine. We know about Joe Becker's work at Xerox, and about Peter and Michael's work creating writing systems. I'm sure the rest of you have stories worth hearing.

So, what do you think? Shall we? Where?

--
Edward Cherlin
Generalist
"A knot!" exclaimed Alice. "Oh, do let me help to undo it."
Alice in Wonderland
RE: benefits of unicode
From: David Starner [mailto:[EMAIL PROTECTED]]
> THEN WHY WASTE A WHOLE BIT ON UPPER CASE? THEY CERTAINLY ARE NOT NECESSARY AND I HAVE FREQUENTLY SEEN PEOPLE NOT USE THEM WHEN AVAILABLE.

Good point. We didn't need 'em to get "Huckleberry Finn", so how necessary can they be?

/|/|IKE

P.S. They are needed for capitalizing sentences, titles, and names, of course!
Re: benefits of unicode
On Thu, Apr 19, 2001 at 06:37:35PM -0500, Ayers, Mike wrote:
> P.S. They are needed for capitalizing sentences, titles, and names, of course!

So? In your previous email, you said:

> The message carried by the most beautifully typeset works of the English language can be communicated effectively in ASCII

Which, to the extent that this is true (show me how you plan to handle The Art of Computer Programming or the Dragon book, for example), is equally true of upper case. Capitalizing sentences is redundant with punctuation, and any additional information can almost always be inferred from context (the best you can say for ASCII -- two different dingbats may have a meaning that will be lost in ASCII, or two names may be separated only by an accent.)

> In my book, adequate computing in a language means that the message gets across without causing pain to the reader. Most readers of English, I am willing to posit, are not aesthetically sensitive enough to be pained by poor typography.

I'm sure that most of the readers of Space:1889 would be pained by the lack of the pound sign or an asterisk instead of a proper multiplication sign. I'm sure that few of the audience of the Anarchist Cookbook were pained by the all-caps in various sections of that document.

> [1] I judge consideration here by external parties. For instance, many symbols, such as copyright, trademark, section, etc. are not used in environments where they are available. This would imply that these symbols are not considered necessary by at least some of the folks who have access to them.

They aren't available on the keyboard (no, alt-some-obscure-code doesn't count.) If I could only type lower-case on my keyboard with exceeding difficulty, I'd send out a lot of messages in all upper-case, or get another keyboard. Since no common US keyboard has more than the ASCII characters, well . . . I'm sure a lot of people using foreign languages have sent out ASCII messages using transliteration who never would have printed a book in that transliteration.

--
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
"I don't care if Bill personally has my name and reads my email and laughs at me. In fact, I'd be rather honored." - Joseph_Greg
Re: Byte Order Marks
Yves Arrouye wrote:
> > If you don't have any clue about the byte order, but you know it is UTF-16, then assume BE.
>
> Then why is ICU mapping UTF-16 to UTF16_PlatformEndian and not UTF16_BigEndian?

ICU does not do Unicode-signature or other encoding detection as part of a converter. When you get text from some protocol, you need to instantiate a converter according to what you know about the encoding. Note that guessing big-endian is only the last, desperate part of detecting the encoding. It is not the first choice. If the text is properly tagged (including maybe a signature), then you will never have to open a "UTF-16" converter. On the other hand, if you get a file from your platform and it is in 16-bit Unicode, then you would appreciate the convenience of the auto-endian alias.

markus
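In ICU's C API the distinction looks roughly like this -- a sketch, assuming the ucnv_open/ucnv_close converter interface, with error handling abbreviated:

#include <stdio.h>
#include "unicode/ucnv.h"

/* A protocol that labels its data chooses the converter explicitly;
   only platform-local data should use the endian-neutral alias. */
int main(void)
{
    UErrorCode status = U_ZERO_ERROR;

    /* Tagged as charset=UTF-16BE by the protocol: be explicit. */
    UConverter *net = ucnv_open("UTF-16BE", &status);

    /* A 16-bit Unicode file produced on this machine: the "UTF-16"
       alias resolves to the platform's own byte order. */
    UConverter *local = ucnv_open("UTF-16", &status);

    if (U_FAILURE(status)) {
        fprintf(stderr, "ucnv_open failed: %s\n", u_errorName(status));
        return 1;
    }
    ucnv_close(net);
    ucnv_close(local);
    return 0;
}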
one question
Well, this is just a technical question that I imagine Unicoders have found a way of resolving. I am finishing a volume of a journal that I am editing, and one text has a summary in Arabic. With Office 2000 used on a Win98 pan-European platform I can enter the summary letter for letter, but where is the right-to-left space?

All the best,
Emil Herak, Zagreb (Croatia)
Re: Byte Order Marks
On Thu, Apr 19, 2001 at 06:24:47PM -0700, Markus Scherer wrote:
> On the other hand, if you get a file from your platform and it is in 16-bit Unicode, then you would appreciate the convenience of the auto-endian alias.

But nothing should be spitting out platform-endian UTF-16! In the case that there's a lot of unmarked big-endian UTF-16 around (as I understand the ISO 10646 standard recommends), the assumption that everything emits unmarked platform-dependent UTF-16 will be wrong. (It's never right to have a program emit platform-endian UTF-16 except in the case of system-local cache files. That breaks interoperating between your program on different systems.)

--
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
"I don't care if Bill personally has my name and reads my email and laughs at me. In fact, I'd be rather honored." - Joseph_Greg