Common XML Locale Data Repository V1.0 Alpha Available!
Forwarded on behalf of Helena Chapman: The OpenI18N WG of the Free Standards Group is pleased to inform you that the CLDR (Common XML Locale Data Repository) V1.0 Alpha snapshot is available. The CLDR repository provides application developers with a consistent and uniform resource for managing the locale-sensitive data used for formatting, parsing, and analysis. It also includes comparison charts that demonstrate the locale data differences on various platforms. For details on the locale data comparison charts, please see http://oss.software.ibm.com/cvs/icu/~checkout~/locale/all_diff_xml/comparison_charts.html. The V1.0 alpha is available at http://oss.software.ibm.com/cvs/icu/~checkout~/locale/common/xml/ via CVS under the tag release-1-0-alpha. To report problems, comments or defects, please submit a bug report at http://www.openi18n.org/locale-bugs/public. The V1.0 Locale Data Markup Language specification on which the CLDR data is based can be found at http://www.openi18n.org/specs/ldml/. Thank you. Regards, Helena Shih Chapman Manager, SWG Customer Satisfaction, Quality and ISO9K2K Co-Chair of OpenI18N / Free Standards Group -- Vladimir Weinstein, IBM GCoC-Unicode/ICU San Jose, CA [EMAIL PROTECTED]
Re: Unicode transliterations (and other operations)
[EMAIL PROTECTED] writes:

There have been some messages in this thread discussing whether something is transliteration or transcription. On that point I have two comments. First, ISO TC 46 has created definitions for these two terms that apply to ISO standards under their purview; these definitions can be found at http://www.elot.gr/tc46sc2/purpose.html. Second, it is my impression that many people use the term transliteration in a broader sense than the strict definition given by TC 46. That appears to be the case for the help file associated with the ICU demo, which defines transliteration as "the general process of converting characters from one particular script to another one." Moreover, there is a need for a term to describe a

This is because the ICU implementation of transliteration actually allows for an even more general thing: converting characters according to a given set of rules. It can be used for both transliteration and transcription as defined by TC 46.

For example, Kashmiri (India / Pakistan) is written in Devanagari and in Nastaliq-style Arabic (aka Persio-Arabic); Wolaytta (Ethiopia) is written in Ethiopic and Roman; Tai Dam is written in Tai Dam script, in Lao script and in Roman with Vietnamese-style diacritics.

Let me add Serbian to this list - it is written in both Latin and Cyrillic scripts, with a mapping that is almost one to one.

In the case of Serbian:

There are, in principle, three potential ways to deal with publishing in multiple writing systems:

1. Separate documents are created manually, one for each writing system. This method is not feasible at all in the case of Serbian.

2. A document is created manually in one writing system, and different parallel documents are generated through an automated process for the other writing systems. This is the most common practice, although with some interesting consequences; see below.

3. A single document is created that can be displayed in terms of alternate writing systems using font mechanisms, possibly relying on transduction done within smart fonts. This one is also used.

Here is the case of Serbian. It uses 30 Cyrillic letters or 30 Latin letters. However, some of the letters in the Latin alphabet are represented as two letters. Here are the pairs:

\u0409/\u0459 == Lj/lj
\u040A/\u045A == Nj/nj
\u040F/\u045F == D\u017E/d\u017E
\u0402/\u0452 is occasionally represented in Latin as Dj/dj, but usually by \u0110/\u0111

Transliteration from Cyrillic to Latin is very easy. The only problem is the transliteration of the upper-case letters above, which can be transliterated either to an upper/lower-case combination or to two upper-case letters, depending on the case of the following letters.

Transliteration of Serbian from Latin to Cyrillic is a little more complicated, even when the text is Unicode encoded, for two reasons: 1) if foreign names are not transcribed or tagged, they will simply be transliterated to Cyrillic form, which is always a source of a good laugh for Serbian readers; 2) this one happens extremely rarely - some words that use two-letter Latin letters should be transliterated to two Cyrillic letters instead of just one. This is the case with some adopted foreign words. However, it is not of interest in everyday practice.

An interesting and wrong practice, used by a lot of magazines that print in Cyrillic and also have a Latin Internet publication, is to use a Latin-based encoding for the Cyrillic version, where q, w, x and y are used for Cyrillic letters that use two letters in the Latin representation; for example, W and w represent \u040A and \u045A. However, foreign names are not transcribed, but written in their original form in Latin script. So, after moving from Cyrillic to Latin, Washington becomes Njashington. Of course, if Unicode were used for storing the text, transliteration from Cyrillic to Latin would be correct and almost trivial.
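
As a rough illustration of the case rule and of the Washington mishap described above, here is a minimal Python sketch. It is not the ICU rule set and not the macros mentioned in this thread. The function names are made up for the example, the single-letter table is the standard Serbian alphabet correspondence, and the fallback to the preceding letter when a digraph ends a word is an assumption of this sketch. Only the W/w pair of the Latin-based encoding is modeled, since that is the only pair given in the message.

```python
# Sketch only: Serbian Cyrillic -> Latin with case-aware digraphs.
SINGLE_CYR = "абвгдђежзијклмнопрстћуфхцчш"
SINGLE_LAT = "abvgdđežzijklmnoprstćufhcčš"
DIGRAPHS = {"љ": "lj", "њ": "nj", "џ": "dž"}  # letters that need two Latin letters

TABLE = dict(zip(SINGLE_CYR, SINGLE_LAT))
TABLE.update(zip(SINGLE_CYR.upper(), SINGLE_LAT.upper()))
TABLE.update(DIGRAPHS)


def cyr_to_lat(text):
    """Transliterate Serbian Cyrillic to Latin, leaving other characters alone."""
    out = []
    for i, ch in enumerate(text):
        lower = ch.lower()
        if ch != lower and lower in DIGRAPHS:
            # Upper-case digraph letter: look at the following letter;
            # if there is none, fall back to the preceding one (assumption).
            nxt = text[i + 1:i + 2]
            prev = text[i - 1:i] if i > 0 else ""
            neighbour = nxt if nxt.isalpha() else prev
            latin = DIGRAPHS[lower]
            out.append(latin.upper() if neighbour.isupper() else latin.capitalize())
        else:
            out.append(TABLE.get(ch, ch))
    return "".join(out)


# The Latin-based encoding described in the message: only the W/w pair is
# given there, so only that pair is modeled. Other letters pass through,
# which matches what a round trip to Latin would give for them anyway.
PSEUDO_ENCODING = {"W": "\u040A", "w": "\u045A"}


def pseudo_encoded_to_latin(text):
    """Treat every character as encoded Cyrillic, then transliterate to Latin."""
    return "".join(
        cyr_to_lat(PSEUDO_ENCODING[c]) if c in PSEUDO_ENCODING else c for c in text
    )


print(cyr_to_lat("Љубав"), cyr_to_lat("ЉУБАВ"), cyr_to_lat("КОЊ"))
print(pseudo_encoded_to_latin("Washington"))
```

The second function shows why the untagged foreign name goes wrong: the converter cannot tell an encoded \u040A from a real Latin W, which is exactly the information that Unicode storage or explicit tagging would preserve.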
My experience with transliteration says that 'pure' Unicode text is not enough for comfortable transliteration, especially for texts that tend to mix Latin and Cyrillic, as is the case with most technical texts. Some additional tagging is required to make it fully automatic. Otherwise, additional proofreading is required. I had reasonable success in writing MS Word macros that did transliteration - what helped was formatting foreign words differently, using italic or bold. Hope this makes sense, V. -- Vladimir Weinstein, IBM GCoC-Unicode/ICU Cupertino, CA, [EMAIL PROTECTED]
Re: Unicode transliterations (and other operations)
I trust that 'moving' a name or a term between languages would be called transcription, not transliteration. Transliteration just tries to 'move' from script to script. Markus Scherer writes: Looks interesting. How are you approaching the complication that transliteration is between pairs of languages? I know what you mean: Gorbachev is Gorbatschow in German. This would then be an example of transcription, which differs on a language-pair basis, as it tries to get the speakers to pronounce the same word. I think that the rules that we have in ICU are probably English-centric where it makes a difference. V. -- Vladimir Weinstein, IBM GCoC-Unicode/ICU Cupertino, CA, [EMAIL PROTECTED]
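
To make the language-pair point concrete, here is a small Python sketch of transcription tables keyed by (source, target) language. It is a toy: the tables cover only the letters of the one name discussed above, the language codes and the function name are invented for the example, and none of this reflects the actual ICU rules.

```python
# Sketch only: per-language-pair transcription tables for a single name.
COMMON = {"г": "g", "о": "o", "р": "r", "б": "b", "а": "a"}

RULES = {
    # Russian -> English conventions for the remaining letters of the name
    ("ru", "en"): {**COMMON, "ч": "ch", "ё": "e", "в": "v"},
    # Russian -> German conventions for the same letters
    ("ru", "de"): {**COMMON, "ч": "tsch", "ё": "o", "в": "w"},
}


def transcribe_name(name, source, target):
    """Transcribe a single capitalized name using the table for the pair."""
    table = RULES[(source, target)]
    latin = "".join(table.get(ch, ch) for ch in name.lower())
    return latin.capitalize()


print(transcribe_name("Горбачёв", "ru", "en"))
print(transcribe_name("Горбачёв", "ru", "de"))
```

A script-to-script transliteration would need only one table per script pair; keying the tables by language pair is what distinguishes the transcription case described in the message.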
Re: Slovenian and Croat letters
Michael Everson writes: At 12:14 +0200 2001-07-02, Martin Kotulla wrote: Can anyone give me some information on the Slovenian and Croat letters in the Unicode range U+0200 to U+0217? Which purpose do these characters have? At what point in time have they been in use? They are used to mark tone in linguistic discussion of traditional poetic texts. Would you mind pointing to some examples? Thanks, V. -- Vladimir Weinstein, IBM GCoC-Unicode/ICU Cupertino, CA, [EMAIL PROTECTED]
Re: Slovenian and Croat letters
It depends on what you consider everyday usage. Newspaper texts would be fine. Literature and language scholars might require them. However, every speaker of the language would understand the text without the accent marks. BTW, these accent marks are also used in the Latin version of Serbian. Also, in these languages, there are two more accent marks used - grave and acute. V. Martin Kotulla writes: Michael Everson wrote: At 12:14 +0200 2001-07-02, Martin Kotulla wrote: Can anyone give me some information on the Slovenian and Croat letters in the Unicode range U+0200 to U+0217? Which purpose do these characters have? At what point in time have they been in use? They are used to mark tone in linguistic discussion of traditional poetic texts. Which means that a Croat/Slovenian typeface would be regarded complete for everyday use even without those characters. Right? -Martin -- Vladimir Weinstein, IBM GCoC-Unicode/ICU Cupertino, CA, [EMAIL PROTECTED]
Re: Cyrillic -
Hi, I have looked at your web site. If I am not mistaken, you are using a codepage that is commonly referred to as Cyrillic YUSCII. This makes the page almost unusable except for people who have the 'Pulshelvetika7' font installed. As you have correctly assumed, the best thing would be to convert the page to Unicode (although you could also convert it to Microsoft cp1251 or ISO 8859-5). You will not lose your pages - just work on copies of them. One possible way would be, as Markus already mentioned, to use the ICU converter framework - but you would have to make a converter table. There is also a set of macros for Word that handles ex-YU codepage conversions, which can be found at http://solair.eunet.yu/~minya/ Once you have converted your text to Unicode, you should add information to your page about the encoding used. Most modern browsers should be able to swallow and correctly display such a page. Should you have more questions, please contact me directly. Hope this helps, -- Vladimir Weinstein Software Engineer, Unicode Technology Group IBM JTC Cupertino 408-777-5844 (t/l 240-5844)
Re: Unicode keyboard editor utility
You might also want to check out Keyboard Layout Manager at http://solair.eunet.yu/~minya/Programs/klm/klm.html which allows redefinition of keyboard mapping files for most Windows systems (Win9x, NT, 2K). Hope this helps, Vladimir Alan Wood wrote: Magda Danish asked if anyone knew of a Unicode keyboard editor utility. There is a beta release of version 5.0 of Keyboard Manager that supports Unicode characters and works with Windows NT. It can be downloaded from http://www.tavultesoft.com/keyman/. Janko's Keyboard Generator is for Windows 95 and 98, and does not seem to support Unicode. http://solair.eunet.yu/~janko/engdload.htm I have recently come across 3-D Keyboard, but I have not had time to investigate it. http://www.fingertipsoft.com/3dkbd/index.html When I get time, I will add the latter 2 to my page of font and keyboard utilities at: http://www.hclrss.demon.co.uk/unicode/utilities_fonts.html Alan Wood (Documentation Writer / Web Master) Context Limited (Electronic publishers of UK and EU legal and official documents) mailto:[EMAIL PROTECTED] http://www.context.co.uk/ http://www.alanwood.net/ (Unicode, special characters, pesticide names) -- Vladimir Weinstein Software Engineer, Unicode Technology Group IBM JTC Cupertino 408-777-5844 (t/l 240-5844)