Common XML Locale Data Repository V1.0 Alpha Available!

2003-10-09 Thread Vladimir Weinstein
Forwarded on behalf of Helena Chapman:

The OpenI18N WG of the Free Standards Group is pleased to inform you that the CLDR 
(Common XML Locale Data Repository) V1.0 Alpha snapshot is available. The CLDR 
repository provides application developers with a consistent and uniform resource for 
managing the locale-sensitive data used for formatting, parsing, and analysis. 
It also includes comparison charts that demonstrate the locale data 
differences on various platforms.

For details on the locale data comparison charts, please see 
http://oss.software.ibm.com/cvs/icu/~checkout~/locale/all_diff_xml/comparison_charts.html. 

The V1.0 alpha is available at 
http://oss.software.ibm.com/cvs/icu/~checkout~/locale/common/xml/ via CVS under 
the tag release-1-0-alpha.
To report problems, comments, or defects, please submit a bug report at 
http://www.openi18n.org/locale-bugs/public.
The V1.0 Locale Data Markup Language specification on which the CLDR data is 
based can be found at http://www.openi18n.org/specs/ldml/.

Thank you.

Regards,

Helena Shih Chapman
Manager, SWG Customer Satisfaction, Quality and ISO9K2K
Co-Chair of OpenI18N / Free Standards Group
--
Vladimir Weinstein, IBM GCoC-Unicode/ICU  San Jose, CA [EMAIL PROTECTED]



Re: Unicode transliterations (and other operations)

2001-07-04 Thread Vladimir Weinstein

[EMAIL PROTECTED] writes:
  There have been some messages in this thread discussing whether something
  is transliteration or transcription. On that point I have two comments:
  first, ISO TC 46 has created definitions for these two terms that apply to
  ISO standards under their purview; these definitions can be found at
  http://www.elot.gr/tc46sc2/purpose.html. Secondly, it is my impression that
  many people use the term transliteration in a broader sense than the
  strict definition defined by TC 46. That appears to be the case for the
  help file associated with the ICU demo, which defines transliteration as,
  the general process of converting characters from one particular script to
  another one. Moreover, there is a need for a term to describe a

This is because the ICU implementation of transliteration actually allows for something 
even more general: converting characters according to a given set of rules. It can be 
used for both transliteration and transcription as defined by TC 46.

  For example, Kashmiri (India / Pakistan) is written in Devanagari and in
  Nastaliq-style Arabic (aka Persio-Arabic); Wolaytta (Ethiopia) is written
  in Ethiopic and Roman; Tai Dam is written in Tai Dam script, in Lao script
  and in Roman with Vietnamese-style diacritics.

Let me add Serbian to this list. It is written in both the Latin and Cyrillic scripts, 
with a mapping that is almost one-to-one. 

In case of Serbian, 
  There are, in principle, three potential ways to deal with publishing in
  multiple writing systems:
  
  1. Separate documents are created manually, one for each writing system.

This method is not feasible at all in the case of Serbian.

  2. A document is created manually in one writing system, and different
  parallel documents are generated through an automated process for the other
  writing systems.

This is the most common practice, although it has some interesting consequences; 
see below.

  3. A single document is created that can be displayed in terms of alternate
  writing systems using font mechanisms, possibly relying on transduction
  done within smart fonts.

This one is also used.

Here is the case of Serbian. It uses 30 Cyrillic letters or 30 Latin letters. However, 
some of the letters in the Latin alphabet are written as two characters. Here are 
the pairs:
\u0409/\u0459 == Lj/lj
\u040A/\u045A == Nj/nj
\u040F/\u045F == D\u017E/d\u017E
\u0402/\u0452 is occasionally represented in Latin as Dj/dj, but usually by 
\u0110/\u0111

Transliteration from Cyrillic to Latin is very easy. The only problem is the 
transliteration of the upper case letters above, which can become either an 
upper/lower case combination or two upper case letters, depending on the case of the 
following letters.
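
The case rule above can be sketched in Python. This is only a minimal illustration, 
not the ICU implementation: the function name to_latin and the fallback for a 
two-letter equivalent at the end of a word (look at the preceding letter) are 
assumptions that go beyond what is described here.

```python
# Serbian Cyrillic alphabet (30 lowercase letters) and the Latin equivalents,
# listed in the same order. Three Cyrillic letters map to Latin digraphs.
CYRILLIC = "абвгдђежзијклљмнњопрстћуфхцчџш"
LATIN = "a b v g d đ e ž z i j k l lj m n nj o p r s t ć u f h c č dž š".split()
TO_LATIN = dict(zip(CYRILLIC, LATIN))


def to_latin(text: str) -> str:
    """Transliterate Serbian Cyrillic text to Latin, one letter at a time."""
    out = []
    for i, ch in enumerate(text):
        low = ch.lower()
        if low not in TO_LATIN:
            out.append(ch)  # not a Serbian Cyrillic letter: copy unchanged
            continue
        lat = TO_LATIN[low]
        if ch == low:
            out.append(lat)  # lowercase input stays lowercase
        elif len(lat) == 1:
            out.append(lat.upper())
        else:
            # Upper case letter with a two-letter equivalent: the case of the
            # following letter decides between "Nj" and "NJ".
            nxt = text[i + 1] if i + 1 < len(text) else ""
            prev = text[i - 1] if i > 0 else ""
            # Assumption: at the end of a word, fall back to the previous letter.
            all_caps = nxt.isupper() or (not nxt.isalpha() and prev.isupper())
            out.append(lat.upper() if all_caps else lat.capitalize())
    return "".join(out)


print(to_latin("Љубав"))  # Ljubav
print(to_latin("ЉУБАВ"))  # LJUBAV
```

Characters outside the 30-letter alphabet pass through untouched, so digits, 
punctuation and any Latin text already in the input are preserved.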

Transliteration of Serbian from Latin to Cyrillic is a little more complicated, 
even when the text is Unicode encoded, for two reasons:
1) if foreign names are not transcribed or tagged, they are simply transliterated 
to Cyrillic, which is always a source of a good laugh for Serbian readers,
2) this one happens extremely rarely: in some words, a two-letter Latin sequence 
should be transliterated to two Cyrillic letters instead of just one. This is the 
case with some adopted foreign words. However, it rarely matters in everyday 
practice.

An interesting but wrong practice, used by a lot of magazines that print in Cyrillic and 
also have a Latin Internet edition, is to use a Latin-based encoding for the Cyrillic 
version, where q, w, x and y stand for the Cyrillic letters that take two letters in 
the Latin representation; for example, W and w represent \u040A and \u045A. However, 
foreign names are not transcribed but written in their original form in Latin script. So, 
after conversion from Cyrillic to Latin, Washington becomes Njashington. Of course, if 
Unicode were used for storing the text, transliteration from Cyrillic to Latin would be 
correct and almost trivial.

My experience with transliteration is that 'pure' Unicode text is not enough for 
comfortable transliteration, especially for texts that tend to mix Latin and Cyrillic, 
as is the case with most technical texts. Some additional tagging is required to 
make it fully automatic. Otherwise, additional proofreading is required. 

I had reasonable success writing MS Word macros that did transliteration. What 
helped was formatting foreign words differently, using italic or bold.
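
As a rough sketch of how tagging helps in the Latin to Cyrillic direction, the 
Python below skips anything inside a marker. The <foreign> tag and the function 
name to_cyrillic are hypothetical; in the Word macros the role of the tag was played 
by italic or bold formatting. The rare words that need two Cyrillic letters for a 
Latin pair are not handled.

```python
import re

CYRILLIC = "абвгдђежзијклљмнњопрстћуфхцчџш"
LATIN = "a b v g d đ e ž z i j k l lj m n nj o p r s t ć u f h c č dž š".split()
TO_CYRILLIC = dict(zip(LATIN, CYRILLIC))

# Longest alternatives first, so that "nj" is matched before "n" followed by "j".
TOKEN = re.compile(
    "|".join(sorted(map(re.escape, LATIN), key=len, reverse=True)),
    re.IGNORECASE,
)
# Hypothetical tag that marks text to be left in Latin script.
FOREIGN = re.compile(r"<foreign>(.*?)</foreign>", re.DOTALL)


def _convert(match: re.Match) -> str:
    src = match.group(0)
    cyr = TO_CYRILLIC[src.lower()]
    return cyr.upper() if src[0].isupper() else cyr


def to_cyrillic(text: str) -> str:
    """Transliterate Latin to Cyrillic, leaving tagged spans untouched."""
    parts = FOREIGN.split(text)
    # With one capture group, odd-numbered parts are the tagged contents.
    return "".join(
        part if i % 2 else TOKEN.sub(_convert, part)
        for i, part in enumerate(parts)
    )


print(to_cyrillic("Washington"))                             # Wасхингтон
print(to_cyrillic("<foreign>Washington</foreign> je grad"))  # Washington је град
```

The untagged result mixes a Latin W with Cyrillic letters, which is the kind of 
output that then needs proofreading.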

Hope this makes sense,
V.

-- 
Vladimir Weinstein, IBM GCoC-Unicode/ICU  Cupertino, CA,  [EMAIL PROTECTED]







Re: Unicode transliterations (and other operations)

2001-07-03 Thread Vladimir Weinstein

I trust that 'moving' a name or a term between languages would be called 
transcription, not transliteration. Transliteration just tries to 'move' from script 
to script.

Markus Scherer writes:
   Looks interesting.  How are you approaching the complication that transliteration 
 is between pairs of languages?
  
  I know what you mean: Gorbachev is Gorbatschow in German.

This would then be an example of transcription, which differs for each language pair, 
since it tries to get speakers of the target language to pronounce the word the same way.
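
A toy Python illustration of that point: the same Cyrillic name run through two 
different rule tables. The tables cover only the letters of this one name and are 
made up for the example; they are not the rules shipped with ICU.

```python
# Rules for an English reader (only the letters needed for this example).
RU_TO_EN = {
    "Г": "G", "о": "o", "р": "r", "б": "b", "а": "a",
    "ч": "ch", "\u0451": "e", "в": "v",  # \u0451 is the letter "yo"
}
# Rules for a German reader: same source letters, different spellings
# where German orthography needs them to get the same pronunciation.
RU_TO_DE = {**RU_TO_EN, "ч": "tsch", "\u0451": "o", "в": "w"}


def transcribe(text: str, rules: dict) -> str:
    """Apply a per-letter rule table; unknown characters are copied as-is."""
    return "".join(rules.get(ch, ch) for ch in text)


name = "Горбач\u0451в"
print(transcribe(name, RU_TO_EN))  # Gorbachev
print(transcribe(name, RU_TO_DE))  # Gorbatschow
```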


  
  I think that the rules that we have in ICU are probably English-centric where it 
 makes a difference.


V.

-- 
Vladimir Weinstein, IBM GCoC-Unicode/ICU  Cupertino, CA,  [EMAIL PROTECTED]







Re: Slovenian and Croat letters

2001-07-02 Thread Vladimir Weinstein

Michael Everson writes:
  At 12:14 +0200 2001-07-02, Martin Kotulla wrote:
  Can anyone give me some information on the Slovenian and Croat letters
  in the Unicode range U+0200 to U+0217?
  
  Which purpose do these characters have? At what point in time have they
  been in use?
  
  They are used to mark tone in linguistic discussion of traditional 
  poetic texts.

Would you mind pointing to some examples?

Thanks, 
V.

-- 
Vladimir Weinstein, IBM GCoC-Unicode/ICU  Cupertino, CA,  [EMAIL PROTECTED]










Re: Slovenian and Croat letters

2001-07-02 Thread Vladimir Weinstein


It depends on what you consider everyday usage. Newspaper texts would be fine. 
Literature and language scholars might require them. 

However, every speaker of the language would understand the text without the accent 
marks.

BTW, these accent marks are also used in the Latin version of Serbian. These 
languages also use two more accent marks: grave and acute.

V.

Martin Kotulla writes:
  Michael Everson wrote:
  
   At 12:14 +0200 2001-07-02, Martin Kotulla wrote:
   Can anyone give me some information on the Slovenian and Croat letters
   in the Unicode range U+0200 to U+0217?
   
   Which purpose do these characters have? At what point in time have they
   been in use?
  
   They are used to mark tone in linguistic discussion of traditional
   poetic texts.
  
  Which means that a Croat/Slovenian typeface would be regarded complete for
  everyday use even without those characters. Right?
  
  -Martin
  
  
  
  
  

-- 
Vladimir Weinstein, IBM GCoC-Unicode/ICU  Cupertino, CA,  [EMAIL PROTECTED]







Re: Cyrillic -

2000-10-01 Thread Vladimir Weinstein

Hi,

I have looked at your web site. If I am not mistaken, you are using a
codepage that is commonly referred to as Cyrillic YUSCII. This makes the
page almost unusable except for people who have the 'Pulshelvetika7'
font installed.

As you have correctly assumed, the best thing would be to convert the
page to Unicode (although you could also convert it to Microsoft cp1251 or
ISO 8859-5). You will not lose your pages; just work on copies of
them.

One possible way would be, as Markus already mentioned, to use the ICU
converter framework, but you would have to make a converter table.
There is also a set of macros for Word that handles ex-YU codepage
conversions, which can be found at http://solair.eunet.yu/~minya/

Once you have converted your text to Unicode, you should add information
to your page about the encoding used.

Most modern browsers should be able to swallow and correctly display
such a page.
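
As a rough Python sketch of that last step (not the ICU converter or the Word
macros mentioned above): once the text is available as Unicode, it can be
written out in any of the three encodings together with a matching charset
declaration. The function name html_page is made up for this example, and the
body text is assumed to be HTML-safe already.

```python
def html_page(body_text: str, charset: str = "utf-8") -> bytes:
    """Wrap text in a minimal HTML page that declares its own encoding."""
    page = (
        "<html><head>"
        f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">'
        "</head><body>" + body_text + "</body></html>"
    )
    # The bytes on disk must use the same encoding the page declares.
    return page.encode(charset)


text = "Ћирилица"  # sample Serbian Cyrillic text, already in Unicode
for charset in ("utf-8", "windows-1251", "iso-8859-5"):
    data = html_page(text, charset)
    # Decoding with the declared charset gives the original text back.
    assert text in data.decode(charset)
    print(charset, len(data), "bytes")
```

The YUSCII page itself would still need a custom mapping table before this
step, since that encoding is not one of Python's standard codecs.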

Should you have more questions, please contact me directly.

Hope this helps,

--
Vladimir Weinstein
Software Engineer, Unicode Technology Group
IBM JTC Cupertino
408-777-5844 (t/l 240-5844)



Re: Unicode keyboard editor utility

2000-07-26 Thread Vladimir Weinstein

You might also want to check out Keyboard Layout Manager at
http://solair.eunet.yu/~minya/Programs/klm/klm.html

which allows redefinition of keyboard mapping files for most Windows
systems (Win9x, NT, 2K).

Hope this helps,
Vladimir



Alan Wood wrote:

 Magda Danish asked if anyone knew of a Unicode keyboard editor utility.
 There is a beta release of version 5.0 of Keyboard Manager that supports
 Unicode characters and works with Windows NT.  It can be downloaded from
 http://www.tavultesoft.com/keyman/.
 Janko's Keyboard Generator is for Windows 95 and 98, and does not seem to
 support Unicode.  http://solair.eunet.yu/~janko/engdload.htm
 I have recently come across 3-D Keyboard, but I have not had time to
 investigate it. http://www.fingertipsoft.com/3dkbd/index.html

 When I get time, I will add the latter 2 to my page of font and keyboard
 utilities at:
 http://www.hclrss.demon.co.uk/unicode/utilities_fonts.html

 Alan Wood
 (Documentation Writer / Web Master)
 Context Limited
 (Electronic publishers of UK and EU legal and official documents)
 mailto:[EMAIL PROTECTED]
 http://www.context.co.uk/
 http://www.alanwood.net/ (Unicode, special characters, pesticide names)

--
Vladimir Weinstein
Software Engineer, Unicode Technology Group
IBM JTC Cupertino
408-777-5844 (t/l 240-5844)