Java and Unicode

2000-11-14 Thread Jani Kajala

As Unicode will soon contain characters defined beyond the code point range
[0, 65535], I'm wondering how Java is going to handle this.

I didn't find any hints in the JDK documentation either. When I browsed the
Java internationalization documentation a few days ago, I just saw a comment
that 'Unicode is a 16-bit encoding.' (Two errors in one sentence.)


Regards,
Jani Kajala





A very basic question about Big5/x-Jis/Unicode....

2000-11-14 Thread Maikki . Frisk

Hi
I have recently started to study Unicode and tried to understand what it is,
beyond the fact that it is a system that supports double-byte languages. While
doing this, I've bumped into Big5, Shift-JIS, and x-Jis. Are these names for
different Chinese and Japanese character sets, and for which? I'm especially
interested in the various Japanese systems. What are they? Which one should I
prefer when creating (multilingual) web sites? Is there something special I'd
need to consider when using Japanese, or does using Unicode 3.0 simply solve
my problems?

Thx
/maikki




Errors in Unihan?

2000-11-14 Thread Pierpaolo Bernardi


Hello,

In the Unihan.txt database, in the kMandarin field there are entries
with duplicate pronunciations. For example:

U+4E21  kMandarin   1 LIANG3 2 LIANG3 3 LIANG4
U+4E4E  kMandarin   1 HU1 HU2 2 HU1
U+4E86  kMandarin   1 LIAO3 2 LE LIAO3

Is there a reason for these duplicates? If this is the case, the
format of this field should be documented better in the header. If
these duplications are errors, I can supply a list of them.
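
For what it's worth, here is a minimal sketch (in Java) of the kind of scan
that would produce such a list. It assumes the tab-separated code point /
field / value layout of Unihan.txt; the class name and the output format are
just illustrative.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

// Lists the kMandarin entries that repeat a pronunciation once the
// isolated dictionary-sense numbers are stripped out.
public class KMandarinDuplicates {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader("Unihan.txt"));
        String line;
        while ((line = in.readLine()) != null) {
            int tab1 = line.indexOf('\t');
            int tab2 = line.indexOf('\t', tab1 + 1);
            if (tab1 < 0 || tab2 < 0) continue;
            if (!line.substring(tab1 + 1, tab2).equals("kMandarin")) continue;
            Set seen = new HashSet();
            boolean duplicate = false;
            StringTokenizer st = new StringTokenizer(line.substring(tab2 + 1));
            while (st.hasMoreTokens()) {
                String token = st.nextToken();
                if (Character.isDigit(token.charAt(0))) continue; // isolated number
                if (!seen.add(token)) duplicate = true;           // repeated reading
            }
            if (duplicate) System.out.println(line);
        }
        in.close();
    }
}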

Also, what's the meaning of the isolated numbers?



Other entries certainly contain errors, for example:

U+5594  kMandarin   1 WO1 2 01
^ this is zero.

U+4EC0  kMandarin   1 SHI2 2 SHEN2 3 SHI2 SHIU2SHEN2 SHI2
   ?? -- shi2 shen2 ??

Regards,
  Pierpaolo Bernardi



Re: OT: Devanagari question

2000-11-14 Thread David Starner

On Tue, Nov 14, 2000 at 08:22:21AM -0800, D.V. Henkel-Wallace wrote:
 Sadly, it seems unlikely that any future change or adoption of orthography 
 will use characters not already supported by the then major computer 
 systems.  In fact the trend seems to be the other way, viz Spain's changing 
 of its collation rules.
 
 For a minority language (which all remaining unwritten languages are) the 
 pressure will be strong to use existing combinations (since they won't 
 constitute a large enough community for people to write special rendering 
 support).

I don't know about that. On the one hand, you have Chimchim(sp?), whose current
alphabet uses g and x as special vowels, and Cherokee, which is usually (often?)
written in an ASCII-compatible orthography using ? as a letter. But on the
other hand, Esperanto and Lakota have both introduced new letters without
problems, and Lakota still can't be written in Unicode*. And I don't see why
adding new letters would be a problem - when the Cherokee syllabary is used, it
appears to be used with one of two different 7-bit font-based encodings, not
Unicode. Even if new letters were done right with Unicode, there's plenty of
space in the Private Use areas.

* There was some discussion of this on the list in September, which ended with
someone finding U+019E LATIN SMALL LETTER N WITH LONG RIGHT LEG. Unfortunately,
there's no corresponding LATIN CAPITAL LETTER N WITH LONG RIGHT LEG, which
Lakota needs.

-- 
David Starner - [EMAIL PROTECTED]
http://dvdeug.dhis.org
As centuries of pulp novels and late-night Christian broadcasting have taught 
us, anything we don't understand can be used for the purposes of Evil.
-- Kenneth Hite, Suppressed Transmissions



Re: Errors in Unihan?

2000-11-14 Thread John Jenkins


On Tuesday, November 14, 2000, at 08:24 AM, Pierpaolo Bernardi wrote:

 In the Unihan.txt database, in the kMandarin field there are entries
 with duplicate pronunciations. For example:
 
 U+4E21  kMandarin   1 LIANG3 2 LIANG3 3 LIANG4
 U+4E4E  kMandarin   1 HU1 HU2 2 HU1
 U+4E86  kMandarin   1 LIAO3 2 LE LIAO3
 
 Is there a reason for these duplicates? If this is the case, the
 format of this field should be documented better in the header. If
 these duplications are errors, I can supply a list of them.
 

That would be very helpful, yes.  

 Also, what's the meaning of the isolated numbers?
 

The value of the field was obtained from dictionaries.  When a dictionary provides 
more than one meaning, it is not infrequent that one pronunciation is specific to a 
particular meaning and another pronunciation specific to another.  This is where the 
numbers come from.

Inasmuch as the database doesn't maintain the link between specific definitions and 
pronunciations, the isolated numbers should also be removed.




Re: OT: Devanagari question

2000-11-14 Thread John Cowan

"D.V. Henkel-Wallace" wrote:
 
 For a minority language (which all remaining unwritten languages are) the
 pressure will be strong to use existing combinations (since they won't
 constitute a large enough community for people to write special rendering
 support).

OTOH minority languages have come to be written with novel scripts like
Pollard and UCAS.

-- 
There is / one art   || John Cowan [EMAIL PROTECTED]
no more / no less|| http://www.reutershealth.com
to do / all things   || http://www.ccil.org/~cowan
with art- / lessness \\ -- Piet Hein



Re: Devanagari question

2000-11-14 Thread Antoine Leca

Mark Davis wrote:
 
 The Unicode Standard does define the rendering of such combinations, which
 is, in the absence of any other information, to stack outwards.
 
 A dumb implementation would simply move the accent outwards if there was
 already one in the same position. This will not necessarily produce optimal
 positioning, but it should be readable.

Note that it should also increase the line spacing.
Note also that the renderer should notice that event, even when there are
interleaved irrelevant (zero-width) characters.
And we are still talking about a dumb implementation.

Anyway, my point was not about this, which is, as you say, the basics of
the dumbest renderer.
No, I was thinking about the implications of mixing Nagari consonants
with kana diacritics (or vice versa); or circling (U+20DD) around
Indian conjuncts, or else around superscript digits; or Tibetan letters
subjoined below Latin letters (how do they attach?); or jamos followed
by a virama or a Telugu length mark. Etc.
My point was that it is *not* a good idea to render an out-of-context
Telugu length mark (U+0C55) as a macron when it follows, for example, a
Latin vowel, even if this is the "logical" behaviour. Such code would be,
IMHO, just a waste.


 If it takes megabytes of code to do [that], there is probably something
 else wrong.

I do not count a dumb implementation as "decent".

And yes, I was overemphasizing with "megabytes". The OT support in FreeType,
which does only a small part of this task, is only 315 Kbytes of C code.
So I expect a not-so-dumb renderer based on it to be around 0.5 megabyte,
which does not take into account the code embedded in the OT fonts themselves.

As a result, yes, please remove the "s".

 
Antoine



Re: Java and Unicode

2000-11-14 Thread John O'Conner

You can currently store UTF-16 in the String and StringBuffer classes. However,
all operations are on char values, i.e. 16-bit code units. The upcoming release
of the J2SE platform will include support for Unicode 3.0 (maybe 3.0.1)
properties, case mapping, collation, and character break iteration. There is no
explicit support for Unicode surrogate pairs at this time, although you can
certainly find out whether a code unit is a surrogate unit.
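
To illustrate what that looks like with the current 16-bit char API, here is
a minimal sketch. The constants come straight from the UTF-16 definition; the
class and method names are just for illustration and are not part of the JDK.

// Detecting and combining surrogate code units by hand, since String and
// StringBuffer only deal in 16-bit chars.
public class SurrogateDemo {
    static boolean isHighSurrogate(char c) { return c >= 0xD800 && c <= 0xDBFF; }
    static boolean isLowSurrogate(char c)  { return c >= 0xDC00 && c <= 0xDFFF; }

    // Combine a high/low surrogate pair into its supplementary code point.
    static int toCodePoint(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        // U+10400 (a Deseret letter) encoded as the surrogate pair D801 DC00.
        String s = "\uD801\uDC00";
        System.out.println("length() counts code units: " + s.length()); // prints 2
        char hi = s.charAt(0);
        char lo = s.charAt(1);
        if (isHighSurrogate(hi) && isLowSurrogate(lo)) {
            System.out.println("code point: U+"
                + Integer.toHexString(toCodePoint(hi, lo)).toUpperCase());
        }
    }
}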

In the future, as characters beyond 0xFFFF become more important, you can
expect that more robust, official support will follow.

-- John O'Conner

Jani Kajala wrote:

 As Unicode will soon contain characters defined beyond the code point range
 [0, 65535], I'm wondering how Java is going to handle this.

 I didn't find any hints in the JDK documentation either. When I browsed the
 Java internationalization documentation a few days ago, I just saw a comment
 that 'Unicode is a 16-bit encoding.' (Two errors in one sentence.)

 Regards,
 Jani Kajala




Lakota (was Re: OT: Devanagari question)

2000-11-14 Thread Rick McGowan

[EMAIL PROTECTED] wrote:

 Unfortunately, there's no corresponding LATIN CAPITAL LETTER N WITH LONG
 RIGHT LEG, which Lakota needs.

To my knowledge, the discussion in September between John Cowan and Curtis Clark 
didn't terminate with any actual proposal, and I'm not clear on whether the above 
assertion is a fact.  I'm not saying I know anything about this field either.  Does 
Lakota REALLY need a letter that isn't in Unicode?

Are you in a position to provide documents and evidence, and/or make a definite 
proposal for adding this character?  It would be a good thing to add, if it's really 
needed.

Rick

 


RE: Devanagari question

2000-11-14 Thread Ayers, Mike


 From: D.V. Henkel-Wallace [mailto:[EMAIL PROTECTED]]


 At 06:30 2000-11-14 -0800, Marco Cimarosti wrote:

  But my point was: not even Mr. Ethnologue himself knows exactly *which*
  combinations are meaningful, in all orthographic systems. And, clearly, no
  one can figure out which combinations may become meaningful in the *future*
  -- e.g. when a previously unwritten language gets its orthography, or when
  the spelling of an already written language gets changed.

 Sadly, it seems unlikely that any future change or adoption of orthography
 will use characters not already supported by the then major computer
 systems.  In fact the trend seems to be the other way, viz. Spain's changing
 of its collation rules.

I do not think that this is a trend.  The last I knew,
computer-savvy Taiwan and Hong Kong were continuing to invent new
characters.  In the end, the onus is on the computer to support the user.
Only during the current frenzy of computerization is the reverse permitted -
this will pass.

 For a minority language (which all remaining unwritten languages are) the
 pressure will be strong to use existing combinations (since they won't
 constitute a large enough community for people to write special rendering
 support).

That depends on how you look at it.  From what I understand (which I
freely admit I have learned only from this list), Indic languages tend to be
supported in toto, and therefore even the currently unwritten ones will
belong to a highly non-minority language family.


$.02,

/|/|ike



RE: Devanagari question

2000-11-14 Thread Rick McGowan

Mike Ayers wrote:

 The last I knew,
 computer-savvy Taiwan and Hong Kong were continuing to invent new
 characters.  In the end, the onus is on the computer to support the user.

Yes, the computer should support the user, but... The invention of new characters to 
serve multitudes is OK, and international standards will probably continue to support 
that.  But I don't think it's reasonable or appropriate to keep inventing new 
characters willy-nilly for individuals (as reported), and then expect them to be added 
to an international standard.  That's silly.  The onus is not on international 
standards to support the whimsical production of novel, rarely-used, or nonce 
characters of the type reported to be generated.

In any case, I still have never seen actual documentary evidence that would prove to 
me that in fact Taiwan and Hong Kong *ARE* creating new characters at the drop of a 
hat.  People just keep saying that to scare everyone.  Sounds like an urban myth to me.

Rick

 


RE: Devanagari question

2000-11-14 Thread Thomas Chan

On Tue, 14 Nov 2000, Rick McGowan wrote:

 Mike Ayers wrote:
  The last I knew,
  computer-savvy Taiwan and Hong Kong were continuing to invent new
  characters.  In the end, the onus is on the computer to support the user.
 
 Yes, the computer should support the user, but... The invention of new
 characters to serve multitudes is OK, and international standards will
 probably continue to support that.  But I don't think it's reasonable or
 appropriate to keep inventing new characters willy-nilly for individuals
 (as reported), and then expect them to be added to an international
 standard.  That's silly.  The onus is not on international standards to
 support the whimsical production of novel, rarely-used, or nonce characters
 of the type reported to be generated.

 In any case, I still have never seen actual documentary evidence that would
 prove to me that in fact Taiwan and Hong Kong *ARE* creating new characters
 at the drop of a hat.  People just keep saying that to scare everyone.
 Sounds like an urban myth to me.

I think there is some confusion between two senses of "new characters": on the
one hand, characters that were never available in any standard but are taken
from pre-existing print sources, which people would now like to properly add;
and on the other, "new characters" that were made up "yesterday" for frivolous
reasons.


Thomas Chan
[EMAIL PROTECTED]