Java and Unicode
As Unicode will soon contain characters defined beyond the code point range [0,65535], I'm wondering how Java is going to handle this. I didn't find any hints in the JDK documentation either; at least a few days ago, when I browsed the Java documentation about internationalization, I saw only a comment that 'Unicode is a 16-bit encoding.' (Two errors in one sentence.) Regards, Jani Kajala
A very basic question about Big5/x-Jis/ Unicode....
Hi, I have recently started to study Unicode and have tried to understand what it is, beyond the fact that it is a system that supports double-byte languages. In doing so, I've bumped into Big5, Shift-JIS, and x-JIS. Are these synonyms for different Chinese and Japanese character sets, and for which? I'm especially interested in the various Japanese systems. What are they? Which one should I prefer when creating (multilingual) web sites? Is there something special I'd need to consider when using Japanese, or does using Unicode 3.0 simply solve my problems? Thx /maikki
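For readers puzzling over the same question: Big5 and Shift_JIS are legacy multi-byte encodings (for Traditional Chinese and Japanese respectively), while Unicode is a single repertoire covering both. A small Python sketch (an illustration added here, not part of the original mail) shows that the same text maps to different byte sequences under each scheme:

```python
# Encode the same characters under legacy encodings and under UTF-8 (Unicode).
# Shift_JIS is a Japanese double-byte encoding; Big5 covers Traditional Chinese.

japanese = "日本"   # "Japan"
chinese = "中文"    # "Chinese (language)"

print(japanese.encode("shift_jis"))  # legacy Japanese bytes
print(japanese.encode("utf-8"))      # Unicode bytes for the same text
print(chinese.encode("big5"))        # legacy Chinese bytes
print(chinese.encode("utf-8"))
```

This is why a multilingual site is simpler with Unicode: one encoding covers both languages, instead of one legacy charset per language.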
Errors in Unihan?
Hello, In the Unihan.txt database, the kMandarin field has entries with duplicate pronunciations. For example:

U+4E21 kMandarin 1 LIANG3 2 LIANG3 3 LIANG4
U+4E4E kMandarin 1 HU1 HU2 2 HU1
U+4E86 kMandarin 1 LIAO3 2 LE LIAO3

Is there a reason for these duplicates? If so, the format of this field should be documented better in the header. If these duplications are errors, I can supply a list of them. Also, what is the meaning of the isolated numbers? Other entries certainly contain errors, for example:

U+5594 kMandarin 1 WO1 2 01          (the "01" starts with a zero)
U+4EC0 kMandarin 1 SHI2 2 SHEN2 3 SHI2 SHIU2SHEN2 SHI2   (shi2 shen2 ??)

Regards, Pierpaolo Bernardi
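A mechanical check like the one Pierpaolo offers to supply could look like the sketch below (added for illustration; the function name and the simplified field format are hypothetical, not the exact Unihan.txt syntax). It drops the isolated sense numbers and reports any pronunciation listed more than once:

```python
def duplicate_readings(kmandarin_value):
    """Return pronunciations listed more than once in a kMandarin value.

    Isolated numbers like '1', '2' appear to mark alternatives tied to
    particular dictionary senses; we drop them and compare only the
    remaining reading tokens.
    """
    tokens = kmandarin_value.split()
    readings = [t for t in tokens if not t.isdigit()]
    seen, dups = set(), []
    for r in readings:
        if r in seen and r not in dups:
            dups.append(r)
        seen.add(r)
    return dups

# The first example from the mail flags LIANG3 as duplicated:
print(duplicate_readings("1 LIANG3 2 LIANG3 3 LIANG4"))
```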
Re: OT: Devanagari question
On Tue, Nov 14, 2000 at 08:22:21AM -0800, D.V. Henkel-Wallace wrote: Sadly, it seems unlikely that any future change or adoption of orthography will use characters not already supported by the then-major computer systems. In fact the trend seems to be the other way, viz. Spain's changing of its collation rules. For a minority language (which all remaining unwritten languages are) the pressure will be strong to use existing combinations (since they won't constitute a large enough community for people to write special rendering support). I don't know about that. On one hand, you have Chimchim(sp?), whose current alphabet uses g and x as special vowels, and Cherokee, which is usually (often?) written in an ASCII-compatible orthography using ? as a letter. But on the other, Esperanto and Lakota have both introduced new letters without problem, and Lakota still can't be written in Unicode*. And I don't see why adding new letters would be a problem: when the Cherokee syllabary is used, it appears to be used with one of two different 7-bit font-based encodings, not Unicode. Even if new letters were done right with Unicode, there's lots of space in the Private Use areas. * There was some discussion of this on the list in September, which ended with someone finding U+019E LATIN SMALL LETTER N WITH LONG RIGHT LEG. Unfortunately, there's no corresponding LATIN CAPITAL LETTER N WITH LONG RIGHT LEG, which Lakota needs. -- David Starner - [EMAIL PROTECTED] http://dvdeug.dhis.org As centuries of pulp novels and late-night Christian broadcasting have taught us, anything we don't understand can be used for the purposes of Evil. -- Kenneth Hite, Suppressed Transmissions
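Claims like this about the character database can be checked mechanically. A small Python sketch (using the `unicodedata` module, a modern convenience added here for illustration and obviously not available to the original posters):

```python
import unicodedata

ch = "\u019E"
print(unicodedata.name(ch))   # LATIN SMALL LETTER N WITH LONG RIGHT LEG

# Whether an uppercase counterpart exists shows up in the case mapping:
# if ch.upper() returns ch unchanged, Unicode defines no capital form.
# (A capital form, U+0220, was in fact added in a later Unicode version,
# so on a modern Python the mapping does exist.)
print(hex(ord(ch.upper())))
```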
Re: Errors in Unihan?
On Tuesday, November 14, 2000, at 08:24 AM, Pierpaolo Bernardi wrote: In the Unihan.txt database, in the kMandarin field there are entries with duplicate pronunciations. For example: U+4E21 kMandarin 1 LIANG3 2 LIANG3 3 LIANG4 U+4E4E kMandarin 1 HU1 HU2 2 HU1 U+4E86 kMandarin 1 LIAO3 2 LE LIAO3 Is there a reason for these duplicates? If this is the case, the format of this field should be documented better in the header. If these duplications are errors, I can supply a list of them. That would be very helpful, yes. Also, what's the meaning of the isolated numbers? The values of the field were obtained from dictionaries. When a dictionary provides more than one meaning, it is not infrequent that one pronunciation is specific to a particular meaning and another pronunciation specific to another. This is where the numbers come from. Inasmuch as the database doesn't maintain the link between specific definitions and pronunciations, the isolated numbers should also be removed.
Re: OT: Devanagari question
"D.V. Henkel-Wallace" wrote: For a minority language (which all remaining unwritten languages are) the pressure will be strong to use existing combinations (since they won't constitute a large enough community for people to write special rendering support). OTOH minority languages have come to be written with novel scripts like Pollard and UCAS. -- There is / one art || John Cowan [EMAIL PROTECTED] no more / no less|| http://www.reutershealth.com to do / all things || http://www.ccil.org/~cowan with art- / lessness \\ -- Piet Hein
Re: Devanagari question
Mark Davis wrote: The Unicode Standard does define the rendering of such combinations, which, in the absence of any other information, is to stack outwards. A dumb implementation would simply move the accent outwards if there was already one in the same position. This will not necessarily produce an optimal positioning, but should be readable. Note that it should also increase the line spacing, and that the renderer should notice that event even when there are interleaved irrelevant (zero-width) characters. And we are using a dumb implementation. Anyway, my point was not about this, which is, as you say, the basics of the dumbest renderer. No, I was thinking about the implications of mixing Nagari consonants with kana diacritics (or the contrary); or circling (U+20DD) around Indic conjuncts, or around superscript digits; or Tibetan subjoined letters below Latin letters (how do they attach?); or jamos followed by a virama or a Telugu length mark. Etc. My point was that it is *not* a good idea to render an out-of-context Telugu length mark (U+0C55), when it follows for example a Latin vowel, as a macron, even if this is the "logical" behaviour. Such code will be, IMHO, just waste. If it takes megabytes of code to do [that], there is probably something else wrong. I do not count a dumb implementation as "decent". And yes, I was overemphasizing with "megabytes". The OT support in FreeType, which does only a small part of this task, is only 315 Kbytes of C code, so I expect a not-so-dumb renderer based on it to be around 0.5 megabyte. And that does not take into account the code embedded in the OT fonts themselves. As a result, yes, please remove the "s". Antoine
Re: Java and Unicode
You can currently store UTF-16 in the String and StringBuffer classes. However, all operations are on char values, or 16-bit code units. The upcoming release of the J2SE platform will include support for Unicode 3.0 (maybe 3.0.1) properties, case mapping, collation, and character break iteration. There is no explicit support for surrogate pairs in Java at this time, although you can certainly find out if a code unit is a surrogate unit. In the future, as characters beyond 0xFFFF become more important, you can expect that more robust, official support will follow. -- John O'Conner Jani Kajala wrote: As Unicode will soon contain characters defined beyond the code point range [0,65535], I'm wondering how Java is going to handle this. I didn't find any hints in the JDK documentation either; at least a few days ago, when I browsed the Java documentation about internationalization, I saw only a comment that 'Unicode is a 16-bit encoding.' (Two errors in one sentence.) Regards, Jani Kajala
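The surrogate-pair arithmetic underlying this is simple enough to sketch; here in Python for brevity (the same computation applies to Java's 16-bit char values — the function names are illustrative, not any library's API):

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (>= 0x10000) into a UTF-16 pair."""
    assert code_point >= 0x10000
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return high, low

def from_surrogate_pair(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+1D11E MUSICAL SYMBOL G CLEF becomes the pair D834 DD1E:
print([hex(u) for u in to_surrogate_pair(0x1D11E)])
```

An application on a 16-bit-char platform sees such a character as two code units, which is why string lengths and index-based operations need care once supplementary characters appear.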
Lakota (was Re: OT: Devanagari question)
[EMAIL PROTECTED] wrote: Unfortunately, there's no corresponding LATIN CAPITAL LETTER N WITH LONG RIGHT LEG, which Lakota needs. To my knowledge, the discussion in September between John Cowan and Curtis Clark didn't terminate with any actual proposal, and I'm not clear on whether the above assertion is a fact. I'm not saying I know anything about this field either. Does Lakota REALLY need a letter that isn't in Unicode? Are you in a position to provide documents and evidence, and/or make a definite proposal for adding this character? It would be a good thing to add, if it's really needed. Rick
RE: Devanagari question
From: D.V. Henkel-Wallace [mailto:[EMAIL PROTECTED]] At 06:30 2000-11-14 -0800, Marco Cimarosti wrote: But my point was: not even Mr. Ethnologue himself knows exactly *which* combinations are meaningful, in all orthographic systems. And, clearly, no one can figure out which combinations may become meaningful in the *future* -- e.g. when a previously unwritten language gets its orthography, or when the spelling of an already written language gets changed. Sadly, it seems unlikely that any future change or adoption of orthography will use characters not already supported by the then-major computer systems. In fact the trend seems to be the other way, viz. Spain's changing of its collation rules. I do not think that this is a trend. The last I knew, computer-savvy Taiwan and Hong Kong were continuing to invent new characters. In the end, the onus is on the computer to support the user. Only during the current frenzy of computerization is the reverse permitted - this will pass. For a minority language (which all remaining unwritten languages are) the pressure will be strong to use existing combinations (since they won't constitute a large enough community for people to write special rendering support). That depends on how you look at it. From what I understand (which I freely admit I have learned only from this list), Indic languages tend to be supported in toto, and therefore even the currently unwritten ones will belong to a highly non-minority language family. $.02, /|/|ike
RE: Devanagari question
Mike Ayers wrote: The last I knew, computer-savvy Taiwan and Hong Kong were continuing to invent new characters. In the end, the onus is on the computer to support the user. Yes, the computer should support the user, but... The invention of new characters to serve multitudes is OK, and international standards will probably continue to support that. But I don't think it's reasonable or appropriate to keep inventing new characters willy-nilly for individuals (as reported), and then expect them to be added to an international standard. That's silly. The onus is not on international standards to support the whimsical production of novel, rarely-used, or nonce characters of the type reported to be generated. In any case, I still have never seen actual documentary evidence that would prove to me that in fact Taiwan and Hong Kong *ARE* creating new characters at the drop of a hat. People just keep saying that to scare everyone. Sounds like an urban myth to me. Rick
RE: Devanagari question
On Tue, 14 Nov 2000, Rick McGowan wrote: Mike Ayers wrote: The last I knew, computer-savvy Taiwan and Hong Kong were continuing to invent new characters. In the end, the onus is on the computer to support the user. Yes, the computer should support the user, but... The invention of new characters to serve multitudes is OK, and international standards will probably continue to support that. But I don't think it's reasonable or appropriate to keep inventing new characters willy-nilly for individuals (as reported), and then expect them to be added to an international standard. That's silly. The onus is not on international standards to support the whimsical production of novel, rarely-used, or nonce characters of the type reported to be generated. In any case, I still have never seen actual documentary evidence that would prove to me that in fact Taiwan and Hong Kong *ARE* creating new characters at the drop of a hat. People just keep saying that to scare everyone. Sounds like an urban myth to me. I think there is some confusion between "new characters" in the sense that they were never available in any standard, but which are taken from pre-existing print sources, and now people would like to properly add them; versus "new characters" that were made up "yesterday" for frivolous reasons. Thomas Chan [EMAIL PROTECTED]