Re: [dev] BCP-47 based proposal for "IsoStrings", Locale Variants and describing languages ?

Eike Rathke Thu, 22 Apr 2010 12:56:06 -0700

Hi Caolan,

On Thursday, 2010-04-22 17:52:56 +0100, Caolan McNamara wrote:


> > Actually the aImplIsoNoneStdLangEntries can never be the result of
> > a conversion as valid ISO code combinations exist for all LangIDs in
> > aImplIsoLangEntries.
> > MsLangId::convertLanguageToIsoNames() methods is moot. I don't recall if
> > it was ever used that way since we write XML, but I doubt it.
> 
> > Accepting makes of course sense, but a conversion will always result
> > in the corresponding ISO codes.
> 
> Set some text to Azeri (Cyrillic) in writer with 3.2 and save as .odt,
> the result is <style:text-properties fo:language="az"
> fo:country="cyrillic"/>

Ouch.. sigh.. that's clearly a bug from an ODF view. It starts with
offering both, Cyrillic and Latin, in the language list despite having
the (non-unique) mapping for LANGUAGE_AZERI_CYRILLIC commented out in
the mapping table. One more reason to support script codes..


> Woops, right, I used an invalid 5 letter example. Anyway, checking for 4
> letter encodings which plausibly could show up in a Unix locale, take
> LANG=ja_JP.Sjis as a better example.

With my proposal that would be
Language = ja
Country = JP
Variant = -::ja_JP.Sjis
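
A minimal sketch of how such a Variant could be taken apart, assuming ':'
acts as the subfield separator as in the example above. The name
splitVariantSubfields is hypothetical and std::string stands in for
rtl::OUString so the snippet compiles on its own. It only splits and
keeps empty subfields; it does not interpret what a subfield means.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split a Variant string into its ':'-separated
// subfields. Empty subfields are kept so positions stay meaningful.
std::vector<std::string> splitVariantSubfields(const std::string& rVariant)
{
    std::vector<std::string> aFields;
    std::string::size_type nStart = 0;
    for (;;)
    {
        const std::string::size_type nPos = rVariant.find(':', nStart);
        if (nPos == std::string::npos)
        {
            // Last (or only) subfield: everything up to the end.
            aFields.push_back(rVariant.substr(nStart));
            break;
        }
        aFields.push_back(rVariant.substr(nStart, nPos - nStart));
        nStart = nPos + 1;
    }
    return aFields;
}
```

For "-::ja_JP.Sjis" this yields three subfields: "-", an empty one, and
"ja_JP.Sjis".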

> > > The other consideration is that if you enforce a script code as the
> > > first tag in a Variant, it becomes trivial to pull out the script tag
> > > from a Variant string with a two liner without any other processing,
> > > e.g.
> > > 
> > > sal_Int32 nIndex = 0;
> > > rtl::OUString aScriptSubtag = rVariant.getToken(0, '-', nIndex);
> > 
> > That's indeed neat. But again, see my previous mail, not all BCP47 tags
> > would fulfill this requirement if they contained extlang subtags.
> 
> I had sort of imagined something like zh-cmn-Latn-CN would appear as
> Language = zh-cmn
> Country = CN 
> Variant = Latn
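
For reference, the two-liner quoted above can be tried outside the
OOo tree with a stand-in for rtl::OUString::getToken(). The getToken()
below is a simplified re-implementation on std::string written for this
sketch, not the sal one, and getScriptSubtag is a hypothetical name. It
assumes the script subtag is enforced as the first '-'-separated tag of
the Variant.

```cpp
#include <string>

// Simplified stand-in for rtl::OUString::getToken(): returns the
// nToken-th cSep-separated token counted from rIndex and moves rIndex
// behind that token, or sets rIndex to -1 once the string is used up.
std::string getToken(const std::string& rStr, int nToken, char cSep, int& rIndex)
{
    if (rIndex < 0 || nToken < 0
        || static_cast<std::string::size_type>(rIndex) > rStr.size())
    {
        rIndex = -1;
        return std::string();
    }
    std::string::size_type nStart = static_cast<std::string::size_type>(rIndex);
    for (;;)
    {
        const std::string::size_type nEnd = rStr.find(cSep, nStart);
        const bool bLast = (nEnd == std::string::npos);
        if (nToken == 0)
        {
            rIndex = bLast ? -1 : static_cast<int>(nEnd + 1);
            return bLast ? rStr.substr(nStart) : rStr.substr(nStart, nEnd - nStart);
        }
        if (bLast)
        {
            // Fewer tokens than requested.
            rIndex = -1;
            return std::string();
        }
        nStart = nEnd + 1;
        --nToken;
    }
}

// Hypothetical wrapper: the script subtag is simply the first token.
std::string getScriptSubtag(const std::string& rVariant)
{
    int nIndex = 0;
    return getToken(rVariant, 0, '-', nIndex);
}
```

With the Variant "Latn" from the zh-cmn-Latn-CN example this returns
"Latn" without any further processing.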

See also your other mail: we'd have to extract ISO codes for storage
anyway, plus in this case store an *:rfc-language-tag attribute.
Luckily, AFAIK a language type subtag exists for every extlang type
subtag so far; here 'zh-cmn' would map to 'cmn'. I'm quite sure that by
enhancing aImplIsoLangEntries and the methods accessing it we could set
up proper mappings.
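
A rough sketch of the 'zh-cmn' to 'cmn' mapping on the tag level,
assuming that a 3-letter subtag directly following a 2- or 3-letter
language subtag is an extlang and that an identical language subtag
exists for it. The function names are hypothetical, nothing here
touches aImplIsoLangEntries, the subtags are not checked against the
registry, and grandfathered tags such as zh-min-nan would need special
casing.

```cpp
#include <cctype>
#include <string>

// True if rSub consists of exactly nLen letters.
static bool isAlphaOfLength(const std::string& rSub, std::string::size_type nLen)
{
    if (rSub.size() != nLen)
        return false;
    for (unsigned char c : rSub)
        if (!std::isalpha(c))
            return false;
    return true;
}

// Hypothetical helper: if the tag looks like language-extlang[-...],
// drop the language prefix so that the extlang serves as language.
// Any other tag is returned unchanged.
std::string dropExtlangPrefix(const std::string& rTag)
{
    const std::string::size_type nFirst = rTag.find('-');
    if (nFirst == std::string::npos)
        return rTag;
    const std::string aLang = rTag.substr(0, nFirst);
    if (!isAlphaOfLength(aLang, 2) && !isAlphaOfLength(aLang, 3))
        return rTag;
    const std::string::size_type nSecond = rTag.find('-', nFirst + 1);
    const std::string aSecond = (nSecond == std::string::npos)
        ? rTag.substr(nFirst + 1)
        : rTag.substr(nFirst + 1, nSecond - nFirst - 1);
    // Scripts have 4 letters and regions 2 letters or 3 digits, so only
    // a 3-letter subtag in second position is treated as an extlang.
    if (!isAlphaOfLength(aSecond, 3))
        return rTag;
    return rTag.substr(nFirst + 1);
}
```

Under these assumptions "zh-cmn-Latn-CN" becomes "cmn-Latn-CN", while
"sr-Latn" and "ja-JP" pass through unchanged.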

Additionally, we'll have to examine how ICU handles these cases; passing
down script codes and variants will need some extra work.


> > As a quick solution I'd come up with:
> > [...]
> Sounds good.

Thanks. Even on the third reading I didn't find any flaws.

> 
> > * If only a BCP47 variant is involved, with or without script, we could
> >   add the variant to the first subfield, having
> >   '-' [script] '-' [variant]
> >   for easier extraction with rVariant.getToken(1, '-', nIndex).
> 
> Sounds like gilding the lily. Do we really need to easily extract that,
> and anyway can't there be multiple BCP47 variant tags as opposed to only
> one script tag ?

True, there's no limit on variant subtags. So this doesn't scale well.


> > And, maybe, using such a Locale with Java might lead to unpredictable
> > results, I don't know.
> 
> It would definitely help if anyone knew what on earth the java Variant
> field ever gets used for.

Citing http://www.docjar.com/docs/api/java/util/Locale.html

| The variant argument is a vendor or browser-specific code. For example,
| use WIN for Windows, MAC for Macintosh, and POSIX for POSIX. Where there
| are two variants, separate them with an underscore, and put the most
| important one first. For example, a Traditional Spanish collation might
| construct a locale with parameters for language, country and variant as:
| "es", "ES", "Traditional_WIN".

http://www.joconner.com/javai18n/articles/Locale.html says

| Operating system (OS), browser, and other application vendors can use
| the variant to provide additional functionality or customization that
| isn't possible with just a language and country designation. For
| example, a software company may need to indicate a locale for a specific
| operating system, so they may create an es_ES_MAC or es_ES_WIN locale
| for the Macintosh or Windows platforms. One historical example from the
| Java 2 platform itself is the use of the EURO variant for European
| locales that use the Euro currency. During the transition period for
| those countries, the Java platform (version 1.3) used this variant. For
| example, although a de_DE (German-speaking Germany) locale existed,
| a de_DE_EURO (German-speaking German locale with a Euro variant) was
| added to the Java environment. Because the Euro currency is now the
| standard currency for the affected locales at this point, those variants
| have been removed since version 1.4 of the platform. Most application
| designs will probably not require variant locale definitions.

So essentially it sounds pretty much like "if Language and Country fields
aren't sufficient, put stuff here".

I'm just not sure that there aren't Java locale services that somehow
check the content of the Variant field and stumble over ':' or '-'
delimiters, for example, or bail out on unregistered content.

Btw, there's the OpenJDK Locale Enhancement Project
http://sites.google.com/site/openjdklocale/
I once asked those guys what the recommendation would be for
transporting a BCP47 tag in the old Locale struct. The answer was more
or less that there isn't a recommendation, but that a dialect variant,
for example, would be "mapped to the Variant field". Adding the script
to the Language field, as in "sr_Latn", was considered not a good idea.
Seconded.

  Eike

-- 
 OOo/SO Calc core developer. Number formatter stricken i18n transpositionizer.
 SunSign   0x87F8D412 : 2F58 5236 DB02 F335 8304  7D6C 65C9 F9B5 87F8 D412
 OpenOffice.org Engineering at Sun: http://blogs.sun.com/GullFOSS
 Please don't send personal mail to the e...@sun.com account, which I use for
 mailing lists only and don't read from outside Sun. Use er...@sun.com Thanks.
