Re: benefits of unicode

2001-04-19 Thread David Starner

On Wed, Apr 18, 2001 at 02:09:30PM -0500, Ayers, Mike wrote:
> that the extra symbols can make the read a little easier, but they are not
> considered[1] necessary.  We were discussing adequacy, not excellence, and
> to me the two are quite distinct.

THEN WHY WASTE A WHOLE BIT ON UPPER CASE? THEY CERTAINLY ARE NOT
NECESSARY AND I HAVE FREQUENTLY SEEN PEOPLE NOT USE THEM WHEN
AVAILABLE.

-- 
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
"I don't care if Bill personally has my name and reads my email and 
laughs at me. In fact, I'd be rather honored." - Joseph_Greg




Re: Latin w/ diacritics (was Re: benefits of unicode)

2001-04-19 Thread Michael \(michka\) Kaplan

From: "Jungshik Shin" [EMAIL PROTECTED]

> > As long as specific markets remain resistant to the idea of this work
> > being done, this is no mere myth -- it is a reality.

> As a general statement, I might agree to the above. However, I'm a bit
> confused as to what you're specifically talking about here (that is,
> what you meant by 'this work' and 'specific markets').  I guess I'm
> supposed to read between the lines, but I'm rather slow here. Could you
> elaborate a bit?

I know that there has been resistance for CHT, CHS, JPN, and KOR solutions
that involved anything that would de-emphasize the existing system of
specific ideographs for specific code points and the support for 100% round
tripping of data to and from Unicode. Because of this, any attempt to
"synthesize" characters, whether from strokes, vowels, consonants, or pieces
of chewing gum, has met with resistance.

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/






Re: Latin w/ diacritics (was Re: benefits of unicode)

2001-04-19 Thread Jungshik Shin




On Wed, 18 Apr 2001, Michael (michka) Kaplan wrote:

 From: "Jungshik Shin" [EMAIL PROTECTED]

   As long as specific markets remain resistant to the idea of this work
 being
   done, this is no mere myth -- it is a reality.
 
As a general statement, I might agree to the above. However, I'm a bit
  confused as to what you're specifically talking about here (that is,
  what you meant by 'this work' and 'specific markets').  I guess  I'm
  supposed to read between lines, but I'm rather slow here. Could you
  elaborate a bit?

 I know that there has been resistance for CHT, CHS, JPN, and KOR solutions
 that involved anything that would de-emphasize the existing system of
 specific ideographs for specific code points and the support for 100% round
 tripping of data to and from Unicode. Because of this, any attempt to
 "synthesize" characters, whether from strokes, vowels, consonants, or pieces
 of chewing gum, has met with resistance.

How on earth can 'ideographs' be synthesized from consonants and
vowels?  Moreover, when I wrote that 'CJK don't always go together', I
wasn't talking about Chinese characters (ideographs) at all. I was talking
about Korean Hangul only. (I think that was pretty clear in the part of
my message you didn't quote, where I talked about Thai/Indic scripts
and Hangul.) Also, I have no clue what the potentially drastic reduction
(in principle/theory) of the font size for Korean by dynamic glyph
shaping has to do with the round-tripping of existing data to and
from Unicode.
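
The arithmetic behind this point: modern precomposed Hangul syllables
(U+AC00..U+D7A3) are defined in the Unicode Standard as a pure formula over
leading-consonant, vowel, and optional trailing-consonant indices, which is
exactly what makes dynamic composition of Hangul from a small jamo set
feasible in principle. A minimal C sketch of that standard formula:

    #include <stdio.h>

    /* Compose a precomposed Hangul syllable (U+AC00..U+D7A3) from jamo
     * indices, per the Unicode Hangul composition arithmetic:
     *   L: leading consonant index,  0..18 (U+1100 + L)
     *   V: vowel index,              0..20 (U+1161 + V)
     *   T: trailing consonant index, 0..27 (0 = none, else U+11A7 + T)
     */
    unsigned long hangul_compose(int L, int V, int T)
    {
        if (L < 0 || L > 18 || V < 0 || V > 20 || T < 0 || T > 27)
            return 0;  /* not a composable combination */
        return 0xAC00UL + (unsigned long)(L * 21 + V) * 28 + T;
    }

    int main(void)
    {
        /* L=18 (HIEUH), V=0 (A), T=4 (NIEUN) gives U+D55C, syllable HAN */
        printf("U+%04lX\n", hangul_compose(18, 0, 4));
        return 0;
    }

No comparable closed formula exists for Han ideographs, which is why the two
cases in this thread are so different.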

  Jungshik Shin





Re: Latin w/ diacritics (was Re: benefits of unicode)

2001-04-19 Thread Michael \(michka\) Kaplan

> How on earth can 'ideographs' be synthesized from consonants and
> vowels?  Moreover, when I wrote that 'CJK don't always go together', I
> wasn't talking about Chinese characters (ideographs) at all. I was talking
> about Korean Hangul only. (I think that was pretty clear in the part of
> my message you didn't quote, where I talked about Thai/Indic scripts
> and Hangul.)

I think you kind of missed my point: although you are dealing with three
different scripts that have three different sets of issues, there are some
similarities at a high level. The thing that is similar here is that in each
case there are champions of the current system.

Although it may be useful to talk about font technologies that allow for
much smaller font sizes, I doubt that anyone believes that the 12.8 MB of
the Gulim TTC file (containing Gulim/GulimChe/Dotum/DotumChe) is made up
only of Hangul -- as opposed to Hanja. Heck, I doubt you could claim it's
even mainly made up of Hangul. The fact is that there are folks who are
opposed to this type of change and are very sensitive about attempts to
change things. Though of course if a font used such a method internally and
no one ever really knew, then I suppose no one would be unhappy, right?

A similar issue exists for Chinese where a different proposal often surfaces
to try to synthesize characters from the various strokes and radicals. This
also is met with opposition, and sometimes the arguments against such ideas
have no more merit than any other such case.

I guess I was trying to stress that this is no mere "myth to be dispelled in
the i18n community" but is a real issue in the minds of some (many?)
customers.

> Also, I have no clue what the potentially drastic reduction
> (in principle/theory) of the font size for Korean by dynamic glyph
> shaping has to do with the round-tripping of existing data to and
> from Unicode.

I think I kind of covered this above... if no one knows that's what is
happening in the font, then who will be the wiser? In fact I would hazard a
guess that there are indeed fonts out there today that do this. It does not
(of course) change the fact that some people are opposed to the idea, just
as there are some who are opposed to such "solutions" to large Chinese
fonts, etc.

michka






RE: Latin w/ diacritics (was Re: benefits of unicode)

2001-04-19 Thread Marco Cimarosti

Carl Brown wrote:
> If these folks really want Unicode everywhere I will write
> Unicode for the IBM 1401 if they are willing to foot the
> bill.  Seriously I would never agree to such a ludicrous
> idea.

Thanks, Carl, but if "these folks" is me, I don't even know what an IBM 1401
is, let alone need you to write Unicode support for it.

If I am allowed to introduce one more anachronism, there exists a concept
called "portability". So, once one of these nutshell implementations of
Unicode exists (on, say, a DOS box with a bitmapped font), it would not be
necessary to re-write it from scratch for each successive "end-of-lifed
unsupported OS" or embedded device.

I hope this may cast a slightly different light on the effort-to-usefulness
ratio of this.

> Can you imagine a Unicode 3.1 character properties table that
> uses 16-bit addressing?

I am not sure what you mean but, yes, I can imagine it very well.

But it would be an unnecessary waste to load the whole database in memory,
although it would be possible: the version 3.1 character properties file
contains only about 13,000 lines. Multiply this by the 32 bits (4 bytes) of
a DOS "far pointer" and you obtain an array of roughly 52 KB, which still
fits in a 64 KB segment. OK: this array would crash as soon as 3,000 more
characters are added to Unicode...

But loading whole tables (or fonts) into memory is not really the way to go;
you wouldn't do this even in much more powerful environments. It would be
much better to keep the data in a file and access it through an efficient
file indexing method and a well-tuned cache algorithm.
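
As an illustration of this approach, a minimal C sketch of a file-backed
property lookup. The file name and record layout here are invented for the
example: a hypothetical uprops.dat of fixed-size records sorted by code
point, each holding a 4-byte big-endian code point plus a 2-byte property
value. A real implementation would also put a small cache of recently read
records in front of the search.

    #include <stdio.h>

    #define REC_SIZE 6  /* 4-byte code point + 2-byte property value */

    /* Read record number 'rec' and return its code point, or -1. */
    static long read_cp(FILE *f, long rec, unsigned char buf[REC_SIZE])
    {
        fseek(f, rec * REC_SIZE, SEEK_SET);
        if (fread(buf, 1, REC_SIZE, f) != REC_SIZE) return -1;
        return ((long)buf[0] << 24) | ((long)buf[1] << 16)
             | ((long)buf[2] << 8)  |  (long)buf[3];
    }

    /* Binary search over the sorted file: about 14 seeks for a
     * 13,000-record table, and almost no memory. */
    int lookup_property(FILE *f, long cp, unsigned *value)
    {
        unsigned char buf[REC_SIZE];
        long lo = 0, hi, mid, got;

        fseek(f, 0, SEEK_END);
        hi = ftell(f) / REC_SIZE - 1;
        while (lo <= hi) {
            mid = (lo + hi) / 2;
            got = read_cp(f, mid, buf);
            if (got < 0) return 0;
            if (got == cp) {
                *value = ((unsigned)buf[4] << 8) | buf[5];
                return 1;
            }
            if (got < cp) lo = mid + 1; else hi = mid - 1;
        }
        return 0;  /* code point not in table */
    }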

> Unicode take lots of memory.

I promise that I won't use the word "myth" for at least a week.

But my impression is that it is rather systems like OpenType and ATSUI that
take lots of memory. And this is neither a surprise nor a scandal: these
systems are designed for OS's that require lots of memory for *everything*.

But this should not lead us to the conclusion that Unicode itself is a
memory-eating monster. It is just a character set! The memory and storage
requirements of Unicode are not so terribly much greater than those of,
say, older double-byte systems.

_ Marco




Byte Order Marks

2001-04-19 Thread Tomas McGuinness

Hi,

A quick question relating to the Byte Order Mark of UCS-2. If it's absent,
is it safe to assume any particular order (i.e., big- or little-endian)?

I am writing a function to rearrange from big- to little-endian, but without
a byte order mark I'm not sure what the order is. Is there any
specification I could refer to?
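
The swap itself is the easy half, a few lines of C; the replies below take
up the real question of when to apply it. A minimal sketch:

    #include <stddef.h>

    /* Swap the byte order of a buffer of UCS-2 code units in place.
     * nunits counts 16-bit units, not bytes. */
    void swap_ucs2(unsigned char *buf, size_t nunits)
    {
        size_t i;
        for (i = 0; i < nunits; i++) {
            unsigned char t = buf[2 * i];
            buf[2 * i]     = buf[2 * i + 1];
            buf[2 * i + 1] = t;
        }
    }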

Thanks.

Tom

Tomas McGuinness, Consultant
CMG, Telecom Products Division, Product Development, Cork
University Technology Park, Curraheen Rd, Cork
+353 21 4933 277 / +353 21 4933 201
[EMAIL PROTECTED]




OT Porting to older OSes was RE: Latin w/ diacritics (was Re: benefits of unicode)

2001-04-19 Thread Carl W. Brown

Marco,

I still remember the Univac I, which had memory tubes about the size of your
fist (the Univac II used core).  The 1401, however, was a fully
transistorized computer.  It used core memory which ranged in size from
1,400 to 16,000 6-bit bytes.  (Unicode on 6-bit machines is another
challenge.)

You are right about font files being big.  However, there is no such thing
as a single "Unicode font", so you have the same large font files even
without Unicode.  Large font files are why some printers have their own
disk drives.

Part of the reason that Unicode implementations are so large is that we need
translation tables to maintain compatibility with old code pages.  Eliminate
these code pages and we reduce the size of the Unicode implementation.  At
least Windows is going in the right direction.  All future scripts will be
Unicode only.  This way they don't have to carry the other baggage.

People may talk about line breaking, collation, fonts, etc. being resource
hogs.  In actuality you need the same resources for code page systems as
well.  With Unicode, however, you get to reuse some of these resources if
you support multiple scripts.

The practical limit for code-page systems like Windows was reached with
combinations like the Arabic/French systems.  Beyond that you really need
to use Unicode or you will have real code bloat.  Unicode is the only
practical solution for multi-lingual systems.

Carl








Re: Byte Order Marks

2001-04-19 Thread Markus Scherer

There is an RFC about UTF-16 (RFC 2781) that explains this. If the text is
labeled by the protocol as:

- charset=UTF-16, then the first two bytes are the byte order mark
- charset=UTF-16BE, then it is big-endian and the first two bytes are just text
- charset=UTF-16LE, then it is little-endian and the first two bytes are just text

If you don't have any clue about the byte order, but you know it is UTF-16,
then assume BE.

Similar for UTF-32[BE/LE].

If you don't know anything about your text, then you may start some
heuristics or reject the text...
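
In code, the whole decision procedure above is tiny. A minimal sketch
following RFC 2781's rules for untagged UTF-16 (honor a BOM when present,
otherwise assume big-endian):

    #include <stddef.h>

    typedef enum { UTF16_BE, UTF16_LE } utf16_order;

    /* Per RFC 2781: use the BOM when there is one, else assume BE.
     * Sets *skip to 2 when a leading BOM should be stripped. */
    utf16_order detect_utf16_order(const unsigned char *p, size_t len,
                                   size_t *skip)
    {
        *skip = 0;
        if (len >= 2) {
            if (p[0] == 0xFE && p[1] == 0xFF) { *skip = 2; return UTF16_BE; }
            if (p[0] == 0xFF && p[1] == 0xFE) { *skip = 2; return UTF16_LE; }
        }
        return UTF16_BE;  /* no BOM: assume big-endian */
    }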

markus

Tomas McGuinness wrote:
> A quick question relating to the Byte Order Mark of UCS-2. If it's absent,
> is it safe to assume any particular order (i.e., big- or little-endian)?




Re: Latin w/ diacritics (was Re: benefits of unicode)

2001-04-19 Thread Juliusz Chroboczek

MC Well, I am not saying that it would be easy, or that it would be worth
MC doing, but would it really take *millions* of dollars for implementing
MC Unicode on DOS or Windows 3.1?

MC BTW, I don't know in detail the current status of Unicode support
MC on Linux, but I know that projects are ongoing.

Okay, I'll byte, although I prefer to speak of ``free Unix-like
systems'' rather than Linux only.

The easiest way of browsing the multilingual Web on a 386 with 4 MB of
memory and a 10 MB hard disk is probably to use the text-mode ``lynx''
browser in a terminal emulator that supports (a sufficiently large
subset of) Unicode.

One such terminal emulator is the Linux console, which only supports
the very basics of Unicode.  An alternative is the XFree86 version of
XTerm, which also supports single combining characters and
double-width glyphs.  (Enough, for example, for Chinese or Thai, but
not for Arabic.)  In order to use that on a machine such as the one
outlined above, you'll probably need to build a custom X server to
save space, but it's definitely doable.  (Drop me a note if you need a
hand.)

I know of the existence of fairly lightweight and fully
internationalised graphical browsers for Unix-like systems (Konqueror
comes to mind), but I doubt you'll get away with much less than a fast
486 with 12 MB of memory and 100 MB of disk.

Regards,

Juliusz




RE: Byte Order Marks

2001-04-19 Thread Yves Arrouye

> If you don't have any clue about the byte order, but you know it is
> UTF-16, then assume BE.

Then why is ICU mapping UTF-16 to UTF16_PlatformEndian and not
UTF16_BigEndian? I know that was a difference between ICU and my library,
and when I asked this question a while ago I was told that, despite what
some literature suggests, w/o any clue, platform endianness should be used.
That's contradictory.

YA




Fwd: Re: Byte Order Marks

2001-04-19 Thread Asmus Freytag


Date: Thu, 19 Apr 2001 12:59:43 -0700
To: Tomas McGuinness [EMAIL PROTECTED]
From: Asmus Freytag [EMAIL PROTECTED]
Subject: Re: Byte Order Marks

At 02:58 PM 4/19/01 +0200, you wrote:
> If it's absent, is it safe to assume any particular order (i.e., big- or
> little-endian)?


The default order is big-endian, but I wouldn't call that a 'safe'
assumption. In the most general case I would attempt autorecognition in
the unlabelled case. Where a particular protocol's specification reinforces
that the default order SHALL apply for the unlabelled case, the assumption
becomes that much stronger, of course.
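
One autorecognition heuristic, much cruder than the SCSU-based trick in the
postscript below but workable whenever the text contains any ASCII/Latin-range
characters: in UTF-16 their zero bytes fall on even offsets if the text is
big-endian and on odd offsets if little-endian. A minimal sketch:

    #include <stddef.h>

    /* Returns >0 for likely BE, <0 for likely LE, 0 for no opinion
     * (e.g. pure Han text, which gives no clear zero-byte signal). */
    int guess_utf16_order(const unsigned char *p, size_t len)
    {
        size_t i;
        long even_zeros = 0, odd_zeros = 0;

        for (i = 0; i + 1 < len; i += 2) {
            if (p[i] == 0)     even_zeros++;  /* zero high byte first: BE */
            if (p[i + 1] == 0) odd_zeros++;   /* zero high byte second: LE */
        }
        if (even_zeros > 2 * odd_zeros) return 1;
        if (odd_zeros > 2 * even_zeros) return -1;
        return 0;
    }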

A./

PS: as an aside: the SCSU encoder can be used to do this form of 
autorecognition. If text shows much better compression in one byte order 
than the other, that byte order is overwhelmingly likely to be the true 
one. The exception would be strings of pure Han ideographs. For these it's 
necessary to





Unicode motivation/horror stories (was RE: benefits of unicode)

2001-04-19 Thread Edward Cherlin

Date: Wed, 18 Apr 2001 13:23:40 -0700 (PDT)
From: Kenneth Whistler [EMAIL PROTECTED]
Subject: RE: benefits of unicode
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
X-Sun-Charset: US-ASCII


> I wonder if we could add a page in this vein to the Unicode site, or
> failing that, to Tex's Benefits pages? That is, invite people to say
> which problems brought them to Unicode, and how Unicode addresses
> those problems. If you like the idea, let us take the discussion back
> on the list.


It might be kind of fun to have a section of individual
stories, "How I ended up doing Unicode", on the website.
I wouldn't be the one organizing it, but you could float the
idea on the list to see if others would like to participate.
Tex might actually be a good place to start, since he already
is doing the benefits stuff for the Progress site.

--Ken

Ken told me offline that it was the lack of an IBM type ball with the
schwa character that set him on this path. In my case, apart from a
lifelong involvement in languages, math, and music, the proverbial
last straw was that "Smart Quotes" in PageMaker 3 wrecked my APL
listings. It took me two months to discover the cause and turn them
off permanently.

I first learned about ISO 10646 as a direct result of work on the 
ISO/ANSI APL standard, and about Unicode from John Dvorak's column in 
PC Magazine.

We know about Joe Becker's work at Xerox, and about Peter and 
Michael's work creating writing systems. I'm sure the rest of you 
have stories worth hearing.

So, what do you think? Shall we? Where?
-- 

Edward Cherlin
Generalist
"A knot!" exclaimed Alice. "Oh, do let me help to undo it."
Alice in Wonderland




RE: benefits of unicode

2001-04-19 Thread Ayers, Mike


From: David Starner [mailto:[EMAIL PROTECTED]]

> THEN WHY WASTE A WHOLE BIT ON UPPER CASE? THEY CERTAINLY ARE NOT
> NECESSARY AND I HAVE FREQUENTLY SEEN PEOPLE NOT USE THEM WHEN
> AVAILABLE.
 

Good point.  We didn't need 'em to get "Huckleberry Finn", so how
necessary can they be?


/|/|IKE

P.S.  They are needed for capitalizing sentences, titles, and names, of
course!




Re: benefits of unicode

2001-04-19 Thread David Starner

On Thu, Apr 19, 2001 at 06:37:35PM -0500, Ayers, Mike wrote:
> P.S.  They are needed for capitalizing sentences, titles, and names, of
> course!

So? In your previous email, you said:

> The message carried by the most beautifully typeset works of the
> English language can be communicated effectively in ASCII

Which, to the extent that this is true (show me how you plan to
handle The Art of Computer Programming or the Dragon book, for
example), is equally true of upper case. Capitalizing sentences is
redundant with punctuation, and any additional information can
almost always be inferred from context (the best you can say for
ASCII is that only rarely will two different dingbats carry a meaning
that is lost in ASCII, or two names be separated only by an accent.)

> In my book, adequate computing in a language means that the message
> gets across without causing pain to the reader.  Most readers of
> English, I am willing to posit, are not aesthetically sensitive
> enough to be pained by poor typography

I'm sure that most of the readers of Space:1889 would be pained by
the lack of the pound sign or an asterisk instead of a proper
multiplication sign. I'm sure that few of the audience of the
Anarchist Cookbook were pained by the all-caps in various sections
of that document.

> [1] I judge consideration here by external parties.  For instance,
> many symbols, such as copyright, trademark, section, etc. are not
> used in environments where they are available.  This would imply
> that these symbols are not considered necessary by at least some of
> the folks who have access to them.

They aren't available on the keyboard (no, alt-plus-some-obscure-code
doesn't count.) If I could type lower-case on my keyboard only with
exceeding difficulty, I'd send out a lot of messages in all
upper-case, or get another keyboard. Since no common US keyboard has
more than the ASCII characters, well . . . I'm sure a lot of people
writing in foreign languages have sent out ASCII messages in a
transliteration that they never would have used to print a book.

-- 
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
"I don't care if Bill personally has my name and reads my email and 
laughs at me. In fact, I'd be rather honored." - Joseph_Greg




Re: Byte Order Marks

2001-04-19 Thread Markus Scherer

Yves Arrouye wrote:
> > If you don't have any clue about the byte order, but you know it is
> > UTF-16, then assume BE.
>
> Then why is ICU mapping UTF-16 to UTF16_PlatformEndian and not
> UTF16_BigEndian?

ICU does not do Unicode-signature or other encoding detection as part of a
converter. When you get text from some protocol, you need to instantiate a
converter according to what you know about the encoding.

Note that guessing big-endian is only the last, desperate part of detecting
the encoding. It is not the first choice. If the text is properly tagged
(including maybe a signature), then you will never have to open a "UTF-16"
converter.

On the other hand, if you get a file from your platform and it is in 16-bit
Unicode, then you would appreciate the convenience of the auto-endian alias.
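
To make the distinction concrete, a minimal sketch against ICU4C's converter
API (the ucnv_open and ucnv_toUChars entry points): "UTF-16BE" pins the byte
order, while the bare "UTF-16" name is the auto-endian alias under
discussion.

    #include <stdio.h>
    #include <unicode/ucnv.h>

    int main(void)
    {
        static const char bytes[] = { 0x00, 0x41, 0x00, 0x42 };  /* "AB", BE */
        UChar out[8];
        UErrorCode err = U_ZERO_ERROR;
        UConverter *cnv;
        int32_t n;

        cnv = ucnv_open("UTF-16BE", &err);  /* explicit byte order */
        if (U_FAILURE(err)) return 1;
        n = ucnv_toUChars(cnv, out, 8, bytes, (int32_t)sizeof bytes, &err);
        ucnv_close(cnv);
        if (U_FAILURE(err)) return 1;

        printf("%d UChars, first is U+%04X\n", (int)n, (unsigned)out[0]);
        return 0;
    }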

markus




one question

2001-04-19 Thread Emil Herak

Well, this is just a technical question that I imagine Unicoders have found
a way of resolving. I am finishing a volume of a journal that I am editing,
and one text has a summary in Arabic. With Office 2000 on a Win98
Pan-European platform I can enter the summary letter by letter, but where
is the right-to-left space?

All the best,

Emil Herak,
Zagreb (Croatia)





Re: Byte Order Marks

2001-04-19 Thread David Starner

On Thu, Apr 19, 2001 at 06:24:47PM -0700, Markus Scherer wrote:
> On the other hand, if you get a file from your platform and it is in
> 16-bit Unicode, then you would appreciate the convenience of the
> auto-endian alias.

But nothing should be spitting out platform-endian UTF-16! If there's a
lot of unmarked big-endian UTF-16 around (as I understand the ISO 10646
standard recommends), then the assumption that everything emits unmarked
platform-endian UTF-16 will be wrong. (It's never right to have a program
emit platform-endian UTF-16 except for system-local cache files; that
breaks interoperation between copies of your program on different
systems.)
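
The interoperable behavior argued for here is cheap to implement: serialize
to a fixed byte order no matter what the host is. A minimal sketch that
always writes big-endian, with a leading BOM so receivers never need to
guess (unmarked big-endian, which ISO 10646 recommends as noted above,
would work as well):

    #include <stdio.h>

    /* Write 16-bit code units as UTF-16BE with a leading byte order
     * mark, independent of the host's own endianness. */
    int write_utf16be(FILE *out, const unsigned short *units, size_t n)
    {
        size_t i;

        if (fputc(0xFE, out) == EOF || fputc(0xFF, out) == EOF) return -1;
        for (i = 0; i < n; i++) {
            if (fputc((units[i] >> 8) & 0xFF, out) == EOF) return -1;
            if (fputc(units[i] & 0xFF, out) == EOF) return -1;
        }
        return 0;
    }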

-- 
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
"I don't care if Bill personally has my name and reads my email and 
laughs at me. In fact, I'd be rather honored." - Joseph_Greg