RE: "Giga Character Set": Nothing but noise

2000-10-19 Thread Marco Cimarosti

Jon Babcock wrote:
 BTW, Marco, as near as I can recall, the above quotation is not from
 me.

Did it again! Shame on me! Sorry!

_ Marco



RE: "Giga Character Set": Nothing but noise

2000-10-18 Thread Marco Cimarosti

Jon Babcock wrote:
 It seems to me that if not for that, how could anyone
 make a Chinese font? Who is going to sit down and
 draw a *myriad* or more characters? Since elements
 recur, this reduces the amount of labour required
 greatly.

I too would have bet that all CJK foundries used some form of (automatic?)
composition to build their fonts.

But, after a few enquiries, it seems that they don't (or they do, but
zealously guard the secret).

_ Marco



RE: "Giga Character Set": Nothing but noise

2000-10-18 Thread James E. Agenbroad

On Wed, 18 Oct 2000 [EMAIL PROTECTED] wrote:

 Jon Babcock wrote:
  It seems to me that if not for that, how could anyone
  make a Chinese font? Who is going to sit down and
  draw a *myriad* or more characters? Since elements
  recur, this reduces the amount of labour required
  greatly.
 
 I too would have bet that all CJK foundries used some form of (automatic?)
 composition to build their fonts.
 
 But, after a few enquiries, it seems that they don't (or they do, but
 zealously guard the secret).
 
 _ Marco
 
Wednesday, October 18, 2000
If I had to make a guess, it would be that transforming the glyphs of parts
of characters so they fit together in a pleasing fashion would take about
as much effort as (or more than) designing a separate glyph for each new
character.
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 The above are purely personal opinions, not necessarily the official
views of any government or any agency thereof.
Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library
of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.  




Re: "Giga Character Set": Nothing but noise

2000-10-15 Thread 11digitboy

It seems to me that if not for that, how could anyone
make a Chinese font? Who is going to sit down and
draw a *myriad* or more characters? Since elements
recur, this reduces the amount of labour required
greatly.
..
..

[OT] Are there any character-encoding schemes that
have CENTESIMAL DIGIT TEN, CENTESIMAL DIGIT ELEVEN,
... CENTESIMAL DIGIT NINETY-NINE? I had a clock with
SEXAGESIMAL DIGITs ZERO through FIFTY-NINE on one
wheel. Then I got sick of the noise it made sometimes
and ripped the digits out.  It seems to me that
in vertical text, what would be better than

san
zen
go
hyaku
yon
juu
hachi
nin

or

3
5
4
8
nin

would be

35
48
nin

but this is not allowed, is it?
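
A minimal Python sketch of the digit-pairing idea (the grouping rule
here is my guess at how such a layout could work, not any standard's
actual algorithm):

    # Split a decimal number into two-digit groups, one group per
    # vertical line, as in the "35 / 48 / nin" example above.
    def vertical_digit_lines(number, group=2):
        digits = str(number)
        first = len(digits) % group or group  # short leading group, if any
        lines = [digits[:first]]
        lines += [digits[i:i + group] for i in range(first, len(digits), group)]
        return lines

    print(vertical_digit_lines(3548))  # ['35', '48']
    print(vertical_digit_lines(123))   # ['1', '23']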




 Jon Babcock [EMAIL PROTECTED] wrote:
 [...]





RE: "Giga Character Set": Nothing but noise

2000-10-15 Thread Michael Everson

At 18:30 -0800 2000-10-14, Doug Ewell wrote:

Yes, but 1500 times faster?  I don't know if 11-Digit Boy was right
about using Intercal, but their Unicode implementation must have been
really slow.

Speed is an issue, it seems. The two third-party Mac demos that use the
Unicode keyboards under Mac OS 9 are very slow indeed.

Michael Everson  **  Everson Gunn Teoranta  **   http://www.egt.ie
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
Vox +353 1 478 2597 ** Fax +353 1 478 2597 ** Mob +353 86 807 9169
27 Páirc an Fhéithlinn;  Baile an Bhóthair;  Co. Átha Cliath; Éire





RE: "Giga Character Set": Nothing but noise

2000-10-15 Thread Carl W. Brown

Michael,

Windows NT has had this problem.  However, all Unicode applications can now
run at close to the same speed.  It took years to get there, and the battle
will not be won until Windows Me is replaced, so that there is an all-Unicode
platform and a new crop of applications is written as pure Unicode
applications.

It is largely a chicken and egg issue.

Carl

-Original Message-
From: Michael Everson [mailto:[EMAIL PROTECTED]]
Sent: Sunday, October 15, 2000 5:22 AM
To: Unicode List
Subject: RE: "Giga Character Set": Nothing but noise


At 18:30 -0800 2000-10-14, Doug Ewell wrote:

Yes, but 1500 times faster?  I don't know if 11-Digit Boy was right
about using Intercal, but their Unicode implementation must have been
really slow.

Speed is an issue, it seems. The two third-party Mac demos that use the
Unicode keyboards under Mac OS 9 are very slow indeed.






RE: "Giga Character Set": Nothing but noise

2000-10-15 Thread Doug Ewell

Michael Everson [EMAIL PROTECTED] wrote:

 Speed is an issue, it seems. The two third-party Mac demos that use
 the Unicode keyboards under Mac OS 9 are very slow indeed.

and "Carl W. Brown" [EMAIL PROTECTED] responded:

 Windows NT has had this problem.  However, all Unicode applications
 can now run at close to the same speed.  It took years to get there,
 and the battle will not be won until Windows Me is replaced, so that
 there is an all-Unicode platform and a new crop of applications is
 written as pure Unicode applications.

But none of this proves that Unicode is inherently slower or less
efficient than any competing character encoding, only that an optimized
solution is better than an unoptimized, hybrid one.

-Doug Ewell
 Fullerton, California



RE: "Giga Character Set": Nothing but noise

2000-10-14 Thread Doug Ewell

"Carl W. Brown" [EMAIL PROTECTED] wrote:

 The problem with languages like Korean is that they are carrying a
 lot of history.  Today with the newer font technology there is no
 reason to have preformed characters.  If you were to start all over
 again with no interest in compatibility with existing code pages, you
 could drop the preformed characters.

Yes, I agree that it is more sensible (at least for some purposes) to
use jamos for Hangul rather than allocating 11,000 code points for
precomposed characters.  Of course, we all know that compatibility with
existing code pages was a deliberate design decision, without which
Unicode would have been much less likely to succeed.
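
As an illustration, the 11,000-plus precomposed syllables are pure
arithmetic over jamo indices; here is a minimal Python sketch of the
standard Unicode Hangul composition formula:

    # S = SBase + (LIndex * VCount + VIndex) * TCount + TIndex,
    # per the Unicode Standard's conjoining-jamo behavior.
    S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
    V_COUNT, T_COUNT = 21, 28

    def compose_hangul(l, v, t=''):
        """Compose leading consonant, vowel, and optional trailing jamo."""
        l_idx = ord(l) - L_BASE
        v_idx = ord(v) - V_BASE
        t_idx = ord(t) - T_BASE if t else 0
        return chr(S_BASE + (l_idx * V_COUNT + v_idx) * T_COUNT + t_idx)

    # U+1112 HIEUH + U+1161 A + U+11AB NIEUN -> U+D55C HAN
    print(compose_hangul('\u1112', '\u1161', '\u11AB'))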

 This may be what they are talking about being more efficient.

Yes, but 1500 times faster?  I don't know if 11-Digit Boy was right
about using Intercal, but their Unicode implementation must have been
really slow.

 You can come close to selecting Han based on radicals.  They probably
 have a way to select among duplicate matches.  Then you could cut
 the character set down with bopomofo or even Latin pinyin.

I don't know enough about Chinese input methods to comment on the rest
of this, but if Unicode had implemented anything that merely "came
close," you would never hear the end of how inadequate it was.

-Doug Ewell
 Fullerton, California



Re: "Giga Character Set": Nothing but noise

2000-10-14 Thread Jon Babcock


"Carl W. Brown" [EMAIL PROTECTED] wrote:

 If you were to start all over again with no interest in
 compatibility with existing code pages, you could drop the preformed
 characters.

Since I've commented more than once on this mailing list over the past
few years about the possibility of using a set of fewer than 2000 or so
characters to represent all Chinese graphs, I'll be brief this time.

Such a system was developed nearly fifty years ago by Peter
A. Boodberg, at the Department of Oriental Languages at the University
of California, Berkeley. His work was based directly on a study of
Chinese sources, especially the Shuowenjiezi Dictionary.  I was
fortunate to be able to study under Professor Boodberg during his last
couple years at Berkeley, shortly before his death in 1972. I've
rewritten some of his ideas and placed them on my web site (kanji.com)
under the name of CHA (Chinese Hemigram Annotation).  And because it
is difficult to find his original writings on this subject, I intend
to host a few of Boodberg's key 'cedules' soon.

When I first heard about Unicode (probably in late 1991), I naively
assumed that it would employ some version of the Boodberg approach,
i.e., the use of a 'small' subset of Chinese from which the entirety
is composed. But, as has been stated many times on this list, the
preferred approach was to base the Unicode Han repertoire on lists of
precomposed hanzi/hanja/kanji that were actually in use in computers
and, for the most part, were sanctioned by national governments. This
was natural given the fact that the details (and here the details mean
everything) of a system such as the one Dr. Boodberg envisioned were
probably not available to the Unicode people, nor were they in use by
any national, commercial, or even academic body. In other words, it
would have meant that such an approach would have had to have been
developed by what came to be known as the Unicode Consortium itself.

Although difficult, I believe that within the decade, the composition
of the Chinese script will be recognized and well-understood, and the
option to treat each of the tens of thousands of Chinese graphs,
including new ones but excluding of course the 300 or so unsegmentable
wen, as a digraph that can be decomposed into hemigrams will be made
available, perhaps even in Unicode.
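
As a toy Python sketch of the general idea (not Boodberg's actual
system, whose details, as noted, are hard to find), one can treat each
composite graph as a digraph of two hemigrams; the table entries below
are merely illustrative sample decompositions:

    # Hypothetical hemigram table: composite graph -> component pair.
    HEMIGRAMS = {
        '\u660E': ('\u65E5', '\u6708'),  # 'bright' = sun + moon
        '\u597D': ('\u5973', '\u5B50'),  # 'good' = woman + child
        '\u6797': ('\u6728', '\u6728'),  # 'grove' = tree + tree
    }

    def decompose(graph):
        """Return the hemigram pair for a composite graph, if known."""
        return HEMIGRAMS[graph]

    print(decompose('\u660E'))  # ('日', '月')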

In the meantime, vis-a-vis Unicode and the Han repertoire, it's a case
of 'get over it'. I had to.

Jon

-- 
Jon Babcock [EMAIL PROTECTED]



Re: "Giga Character Set": Nothing but noise

2000-10-14 Thread Jon Babcock

I see I was *doubly* "brief". Sorry for the duplicate message. Jon
-- 
Jon Babcock [EMAIL PROTECTED]



Re: "Giga Character Set": Nothing but noise

2000-10-13 Thread Doug Ewell

John Jenkins [EMAIL PROTECTED] wrote:

 Have we figured out yet what part of "Hamlet" the Giga people claim
 cannot be encoded in Unicode?

 I had to do some head scratching on that one.  I finally figured out
 that it was meant rhetorically.  Would the inability to encode Hamlet
 be acceptable?  No.  So why foist on the world a character set (viz.,
 Unicode) that can't handle Chinese properly?  Isn't Chinese as
 important as English?

 At least, I think that's what they meant.

Yes, I finally figured that out after reading the white paper and doing
a general Web search on Coventive and their "Giga Character Set."  As
Ken pointed out, they are based in Taiwan and have the usual focus on
"efficient" CJK encoding and language-specific Han glyphs, along with a
deep conviction that Western-based organizations couldn't possibly get
this stuff right if they tried.

In the white paper, they tip their hand by continually referring to
"display codes" as if displaying glyphs were the only thing character
codes were used for.  (What about input, storage, comparison,
collation, etc.?)

There are several misstatements about Unicode, ranging from merely
ignorant to -- David Starner had the right word for it -- outright
slanderous.  First, of course, is the premise that "16-bit" Unicode has
room for only 65,536 characters.  Most of the perceived shortcomings of
Unicode are based on this falsehood and can be quickly dismissed.
There is also a statement that contiguous ranges of Unicode code units
are assigned to languages, when in fact Unicode maintains a studied
ignorance of language and doesn't even require all characters in the
same script to be encoded in the same block.
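
The arithmetic is easy to check: UTF-16 surrogate pairs, part of
Unicode since version 2.0, reach code points all the way to U+10FFFF.
A minimal Python sketch:

    # Encode a supplementary-plane code point as a UTF-16 surrogate pair.
    def to_surrogates(cp):
        assert 0x10000 <= cp <= 0x10FFFF
        cp -= 0x10000
        return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

    # U+20000, a supplementary-plane code point:
    print(tuple(hex(u) for u in to_surrogates(0x20000)))
    # ('0xd840', '0xdc00')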

Of course, there is the usual claim that "Unicode can not easily
include the new characters that continue to be formed."  Try telling
someone who was in Boston or Athens recently that Unicode's rigid
structure doesn't permit the addition of new characters!  Then, another
news flash: Unicode doesn't provide for the reality that "the
directionality of written language can vary."  So I guess that means
the Bidirectional Algorithm, the Bidirectional Category field in
UnicodeData.txt, the directional override codes, etc. don't actually
exist.
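
That data is directly queryable; for example, Python's unicodedata
module exposes the Bidirectional Category field:

    import unicodedata

    # Latin capital A, Hebrew alef, and a European digit:
    for ch in ('A', '\u05D0', '1'):
        print(hex(ord(ch)), unicodedata.bidirectional(ch))
    # 0x41 L   (left-to-right)
    # 0x5d0 R  (right-to-left)
    # 0x31 EN  (European number)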

You gotta love the separate, proprietary, *patented* algorithms that
are created to handle each specific language's "peculiarities."  Note
how English, French, Spanish, German, Italian, and Portuguese -- all
at least 98% covered by Latin-1 -- each have their own GCS encodings.
When do you suppose we will see the Basque, Sami, Azeri, Yi, Thaana,
etc. algorithms?  When Coventive unilaterally decides to support them?
(Ah, but they have thrown in Klingon, just to prove it can be done.)

And, of course, Coventive claims to have improved display performance
dramatically -- 1500x for Korean! -- by composing glyphs dynamically
from component pieces rather than referencing a precomposed glyph from
a "behemoth look-up table."  (Do they think some kind of search must
take place to locate the glyph for code point U+mumble?)  Conveniently
ignored is the fact that not all CJK characters are decomposable in
this way, the severe performance hit imposed on searching and sorting,
and the fact that an approach like this would only work for CJK in any
event.
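
For the record, mapping a code point to a glyph is an indexed table
lookup, not a search of the repertoire. A simplified Python model (real
fonts use compact array-based cmap formats; the entries here are
hypothetical):

    # cmap-style table: code point -> glyph index.
    cmap = {0x4E00: 1342, 0x4E01: 1343, 0x4E09: 1351}

    def glyph_index(cp):
        """Near-constant-time lookup; glyph 0 is conventionally .notdef."""
        return cmap.get(cp, 0)

    print(glyph_index(0x4E00))  # 1342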

An article in the October 12, 2000 issue of Linux Weekly News
<http://lwn.net/bigpage.php3> tries to explain the benefit:  "Many
Asian characters are composites, made up of one or more simpler
characters.  Unicode simply makes a big catalog of characters, without
recognizing their internal structure; GCS apparently handles things in
a more natural manner."  However, the article does not go on to specify
just what is better, more efficient, or more "natural" about the GCS
approach.

(BTW, an article in the online Taipei Times mentioned that GCS assigns
4 bytes for each code point.  So who's inefficient now?)
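
As a quick Python check of that arithmetic, a BMP ideograph costs two
bytes in UTF-16 and three in UTF-8, versus a flat four bytes per GCS
code:

    han = '\u6F22'  # the 'han' ideograph of hanzi/kanji
    print(len(han.encode('utf-16-le')))  # 2
    print(len(han.encode('utf-8')))      # 3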

I am sorely tempted to point out that their criticism of CJK glyph
unification in Unicode could be addressed by judicious use of Plane 14
tags, but no matter; Giga is DOA.  It is false economy, it attempts to
solve perceived CJK problems by introducing bogus distinctions, it
considers only one aspect of character code processing (display) while
ignoring all others, and it is the patented, proprietary work of one
company.  We will never have to worry about Giga, and in a year or so
we will forget it ever existed.

-Doug Ewell
 Fullerton, California