Our three Locale classes, i.e. rtl::Locale,
com::sun::star::lang::Locale and comphelper::Locale, all have three
fields, i.e.

OUString Language
OUString Country
OUString Variant

where Language is typically an ISO-639 [*] code, Country is typically an
ISO-3166 code and Variant is basically an undefined bucket with API
documentation copied and pasted from the Java Locale class.

We don't have a good way to describe various languages with just
Language and Country, typically those that can be written in different
scripts.

For some cases we're already kludging things a bit, e.g. using sh-RS to
describe "Serbian using the Latin script in Serbia" (as opposed to sr-RS
to describe Serbian written in the default Cyrillic script in Serbia),
and in some cases we don't have a way to describe languages we could
support if we had a way to transport that info around, e.g. Inuktitut
Syllabics Canada (as opposed to iu-CA to describe that language written
in the default Latin alphabet, which we could currently support).

In BCP-47 (http://www.rfc-editor.org/rfc/bcp/bcp47.txt) the examples
above could be described as sr-Latn-RS and iu-Cans-CA, where basically
if the second tag exists (of an arbitrary number of tags) and is four
letters long it must be an ISO 15924 script code. I.e. the current sr-RS
and iu-CA remain valid BCP-47 strings.
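
A minimal sketch of that "four-letter second tag is the script" rule.
The function name split_bcp47 is made up for illustration, and it only
handles the simple language[-Script][-REGION][-more] shape, not extlang,
grandfathered or private-use tags:

```python
def split_bcp47(tag):
    """Split a simple BCP-47 tag into (language, script, region, rest).

    Hypothetical helper: assumes language[-Script][-REGION][-more...].
    """
    subtags = tag.split("-")
    language = subtags[0].lower()
    rest = subtags[1:]
    script = ""
    region = ""
    # A four-letter second subtag is an ISO 15924 script code.
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        script = rest.pop(0).title()
    # The next subtag is a region if it is two letters or three digits.
    if rest and ((len(rest[0]) == 2 and rest[0].isalpha())
                 or (len(rest[0]) == 3 and rest[0].isdigit())):
        region = rest.pop(0).upper()
    return language, script, region, rest
```

With this, sr-Latn-RS yields a script of "Latn" while sr-RS and iu-CA
yield an empty script, so the existing strings still parse unchanged.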

In parallel with all this, the glibc locale string is typically
language_country.encod...@modifier_string, where language and country
are as above, while the modifier strings are typically "cyrillic" and
"latin" for those variants, among other more arbitrary ones, e.g.
"sr_rs.ut...@latin" for what would be "sr-Latn-RS" in BCP-47 style.

I note here that the rtl Locale parser converts a Unix locale string
into a rtl::Locale along the lines of

Language = language, e.g. sr
Country = Country, e.g. RS
Variant = all_the_rest_of_the_string, e.g. .ut...@latin

and depends on being able to reverse this conversion back to the
original Unix locale string, i.e. it needs to rebuild sr_rs.ut...@latin
from its rtl::Locale structure.

The comphelper one does something similar, while
com::sun::star::lang::Locale.Variant seems to be pretty much unused
throughout OOo.
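
To make the round-trip requirement concrete, here is a rough sketch of
that kind of split-and-rebuild. It is not the actual sal code; the
function names are invented, and "sr_RS.UTF-8@latin" is just a stand-in
input, since the exact strings quoted above are truncated:

```python
import re

# language, optional _country, then everything else kept verbatim.
_UNIX_LOCALE = re.compile(r"^([A-Za-z]+)(?:_([A-Za-z0-9]+))?(.*)$")


def parse_unix_locale(unix):
    """Split a Unix locale string into (Language, Country, Variant)."""
    m = _UNIX_LOCALE.match(unix)
    if m is None:
        raise ValueError("not a Unix locale string: %r" % unix)
    # Case is preserved as given so the original can be rebuilt exactly.
    return m.group(1), m.group(2) or "", m.group(3)


def unparse_unix_locale(language, country, variant):
    """Rebuild the Unix locale string from the three fields."""
    unix = language
    if country:
        unix += "_" + country
    return unix + variant
```

The point is only that Variant carries the unparsed remainder verbatim,
which is what makes the reverse direction lossless.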

In our xml format afaics, where the com::sun::star::lang::Locale is
basically the structure that backs it, we have just "language" and
"country" tags.

So..., how about we adopt a BCP-47 based approach? I.e.

a) Where we are currently describing locales as a string in "iso-format"
we use BCP-47. Currently valid locale strings get to remain valid.

b) Where we use a Locale structure, Language and Country stay the same,
but we specify a format for the remaining Variant field: a BCP-47-based
sequence of tags separated by '-'. The Variant field becomes the
equivalent BCP-47 locale string for the totality, minus the language and
region tags, with the added rule that the first tag entry *must* be a
Script Code to ensure forward and backward conversion to an unambiguous
BCP-47 string. In this scheme the script tag at the start of the Variant
can (and must) be empty to denote the default script.

c) Where we use "language" and "country" codes in our xml format we add
a "language-tags" attribute which maps directly to that Variant field.

i.e. sr-Latn-RS becomes

Language = sr
Country = RS
Variant = Latn

i.e. sr-Latn-RS-whatever-foo becomes

Language = sr
Country = RS
Variant = Latn-whatever-foo

A BCP-47 string of de-DE-1901 becomes

Language = de
Country = DE
Variant = -1901

de-DE remains

Language = de
Country = DE
Variant =
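
The four mappings above can be sketched as a pair of conversions. The
names bcp47_to_locale and locale_to_bcp47 are hypothetical, and the
sketch assumes a two-letter region and simple tags only:

```python
def bcp47_to_locale(tag):
    """Map a simple BCP-47 tag to (Language, Country, Variant)."""
    subtags = tag.split("-")
    language = subtags[0]
    rest = subtags[1:]
    script = ""
    # Four-letter second subtag: ISO 15924 script code.
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        script = rest.pop(0)
    country = ""
    # Simplification: only two-letter region codes are recognised.
    if rest and len(rest[0]) == 2 and rest[0].isalpha():
        country = rest.pop(0)
    # Variant starts with the (possibly empty) script tag.
    variant = "-".join([script] + rest) if (script or rest) else ""
    return language, country, variant


def locale_to_bcp47(language, country, variant):
    """Rebuild the BCP-47 tag from the three fields."""
    parts = variant.split("-") if variant else [""]
    script, rest = parts[0], parts[1:]
    subtags = [language]
    if script:
        subtags.append(script)
    if country:
        subtags.append(country)
    return "-".join(subtags + rest)
```

The leading empty script in "-1901" is what keeps the reverse direction
unambiguous: the first entry of Variant is always read as the script.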

Parsers that want to convert a Unix Locale into the above structure can
take, e.g.
aa_er.ut...@saaho

and make it into

Language = aa
Country = ER
Variant = -.ut...@saaho

to give a reversible scheme where the original Unix Locale string can be
reconstructed, and for Unix Locale strings which hint at the script in
use, we can parse sr_rs.ut...@latin into

Language = sr
Country = RS
Variant = Latn-.ut...@latin

and remain reversible into the original Unix Locale string, and also
provide a non-null script tag which allows continued conversion from the
rtl::Locale class to the com::sun::star::lang::Locale one without losing
script tag information.
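
A sketch of how such a parser might look, under a few assumptions: the
function names and the modifier table are invented for illustration,
only "latin" and "cyrillic" are mapped, the script code is written in
ISO 15924 title case (BCP-47 tags are case-insensitive), and
"sr_RS.UTF-8@latin" / "aa_ER.UTF-8@saaho" are stand-in inputs because
the strings quoted above are truncated:

```python
import re

# Assumed mapping from glibc modifiers to ISO 15924 script codes.
MODIFIER_TO_SCRIPT = {"latin": "Latn", "cyrillic": "Cyrl"}

_UNIX_LOCALE = re.compile(r"^([A-Za-z]+)(?:_([A-Za-z0-9]+))?(.*)$")


def unix_to_locale(unix):
    """Unix locale string -> (Language, Country, Variant)."""
    m = _UNIX_LOCALE.match(unix)
    if m is None:
        raise ValueError("not a Unix locale string: %r" % unix)
    language, country, rest = m.group(1), m.group(2) or "", m.group(3)
    # Text after '@' is the modifier; unknown modifiers give no script.
    modifier = rest.partition("@")[2]
    script = MODIFIER_TO_SCRIPT.get(modifier.lower(), "")
    # Script tag first (maybe empty), then the raw remainder verbatim.
    variant = script + "-" + rest if rest else ""
    return language, country, variant


def locale_to_unix(language, country, variant):
    """Rebuild the original Unix locale string."""
    # Everything after the first '-' is the untouched Unix remainder.
    rest = variant.partition("-")[2]
    unix = language
    if country:
        unix += "_" + country
    return unix + rest
```

Splitting only on the first '-' matters here, since the remainder can
itself contain a '-' (as in an encoding name like UTF-8).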

The xml format for a style that sets the Language of a paragraph to
Inuktitut Syllabics Canada could then use an additional language-tags
attribute, e.g.

<style:text-properties fo:language="iu" fo:country="CA"
fo:language-tags="Cans"/> 

while the spellchecker's "Locales" string can use BCP-47 format, e.g.
support "iu-Cans-CA".
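
As a rough illustration of the xml side, the three fields could be
written out like this. The helper name text_properties is made up, and
fo:language-tags is the attribute proposed here, not an existing one:

```python
from xml.sax.saxutils import quoteattr


def text_properties(language, country, variant):
    """Serialise (Language, Country, Variant) as a text-properties tag."""
    attrs = [("fo:language", language)]
    if country:
        attrs.append(("fo:country", country))
    # Proposed attribute: only written when Variant is non-empty, so
    # documents without script information look the same as today.
    if variant:
        attrs.append(("fo:language-tags", variant))
    body = " ".join("%s=%s" % (name, quoteattr(value))
                    for name, value in attrs)
    return "<style:text-properties %s/>" % body
```

Leaving the attribute out for an empty Variant means existing
language/country-only documents would be written exactly as before.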

C.

1. See #i111019# for the sal issue to parse three-letter codes there,
and not limit it to two-letter codes.
2. Presumably it would be best to prefer *generating* sh-RS for
backwards compatibility, while still accepting sr-Latn-RS.
3. comphelper::Locale is very little used; it looks like a good idea to
move uses of it over to com::sun::star::lang::Locale and convert it to
some calls that operate on that instead, and/or merge the unused bits
over to e.g. MSLangId.

