Hi Caolan,

On Wednesday, 2010-04-21 13:02:53 +0100, Caolan McNamara wrote:

> > "get to remain valid":  Is it the case that all currently valid locale 
> > strings happen to adhere to the BCP-47 restrictions, so would 
> > automatically be valid BCP-47 strings?
> 
> Well, there is one problem. aImplIsoNoneStdLangEntries in
> i18npool/source/isolang/isolang.cxx has...
> 
> { LANGUAGE_SERBIAN_LATIN,               "sr", "latin"    },
> { LANGUAGE_SERBIAN_CYRILLIC,            "sr", "cyrillic" },
> { LANGUAGE_AZERI_LATIN,                 "az", "latin"    },
> { LANGUAGE_AZERI_CYRILLIC,              "az", "cyrillic" },
> 
> Considering the way the various tables work, that means there is one
> combination of language of "az" and country of "cyrillic" which has
> escaped out into the file format as fo:language="az"
> fo:country="cyrillic", so az-cyrillic would have to be accepted in
> addition though it's not valid BCP-47.

Actually, the aImplIsoNoneStdLangEntries entries can never be the result
of a conversion, because valid ISO code combinations exist for all those
LangIDs in aImplIsoLangEntries. The corresponding code in both
MsLangId::convertLanguageToIsoNames() methods is moot. I don't recall
whether it was ever used that way since we write XML, but I doubt it.


> Given that, it makes sense to continue to accept as input the other
> entries in that above table and aImplIsoNoneStdLangEntries2 +
> aImplOtherEntries as acceptable input for an "iso-string" (though they
> never were generated as output). So BCP-47 + some extra grandfathered
> tags.

Accepting them of course makes sense, but a conversion will always
result in the corresponding ISO codes.


> > Is the requirement "that the first tag entry *must* be a Script Code to 
> > ensure forward and backward conversion to an unambiguous BCP-47 string" 
> > really necessary?  A <langtag> w/o <language> and <region> parts would be
> > 
> >    [script] *("-" variant) *("-" extension) ["-" privateuse]
> > 
> > where the syntactic forms allowed for <script> are disjoint of those 
> > allowed for <variant>, <extension>, and <privateuse>.
> 
> The need for conversion from a Unix locale string in rtl to a rtl_Locale
> and back again is what bothers me. Following the above protects against
> converting an unknown existing or future Unix locale string into a
> rtl_Locale which if used anywhere following this convention gives
> incorrect results. E.g., there are some glibc locales like zh_TW.euctw so
> LANG=zh_TW.Euctw is acceptable
> 
> which currently will give
> rtl_Locale of...
> Language = zh
> Country = TW
> Variant = Euctw
> 
> If a future iso-15924 adds Euctw as a script code, then there's a
> problem.

They should not; an ISO 15924 alpha code is defined to be a 4 letter
code, and "Euctw" has 5 letters. Anyway, a script code in the BCP47
context would have to be registered with IANA, and they certainly
(hopefully..) would reject a non-4-letter code.
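
To illustrate the 4 letter argument, here is a minimal sketch of a shape
check, not anything that exists in the code base. The name
hasScriptCodeShape is made up for this mail, and std::string stands in
for rtl::OUString. It only tests the syntactic form, not whether the code
is actually registered.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true if s has the ISO 15924 alpha-4 shape,
// i.e. exactly four ASCII letters. No registry lookup is done.
bool hasScriptCodeShape(const std::string& s)
{
    if (s.size() != 4)
        return false;
    for (char c : s)
    {
        const bool bUpper = (c >= 'A' && c <= 'Z');
        const bool bLower = (c >= 'a' && c <= 'z');
        if (!bUpper && !bLower)
            return false;
    }
    return true;
}
```

With that, "Latn" or "Cyrl" pass, while a codeset name like "Euctw"
fails on length alone and so cannot be mistaken for a script code.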


> The other consideration is that if you enforce a script code as the
> first tag in a Variant, it becomes trivial to pull out the script tag
> from a Variant string with a two liner without any other processing,
> e.g.
> 
> sal_Int32 nIndex = 0;
> rtl::OUString aScriptSubtag = rVariant.getToken(0, '-', nIndex);

That's indeed neat. But again, see my previous mail, not all BCP47 tags
would fulfill this requirement if they contained extlang subtags.


> > Is reversibility necessary here?  I ask because this makes the Variant 
> > contain data that does not adhere to the above BCP-47 <langtag> w/o 
> > <language> and <region> parts.
> 
> I feel it is because if we look into sal/osl/unx/nlsupport.c and e.g.
> osl_getTextEncodingFromLocale there we use _compose_locale to regenerate
> from rtl_Locale a string to pass to setlocale(LC_CTYPE and some other
> similar examples in there. So it looks to me that a rtl_Locale that
> originates from _parse_locale on a given string has to be convertible
> back to that string in order to be useful with setlocale.

Lovely :-/

So then my proposed "Variant contains either 4 letter script code or
something else" also wouldn't work. Any other approach to transport
BCP47 in Variant wouldn't either, if the Locale struct is used for those
rtl calls. This is a mess. On the other hand, where do these methods get
called? If from the applications' core then the Variant probably is
empty anyway as they convert back and forth between Locale and MsLangID.

As a quick solution I'd come up with:

* Divide Variant into three subfields, separated by a ':' colon.
* First subfield is either a 4 letter script code followed by '-', or
  only '-' to indicate absence of script.
  * This enables the extraction with rVariant.getToken(0, '-', nIndex).
* Second subfield is a full BCP47 string in case Language is "x-bcp47"
  or a BCP47 variant is involved, otherwise empty.
* Third subfield is the _full_ Unix locale string, or empty.
  * _compose_locale() could extract this with
    rVariant.getToken(2, ':', nIndex)
* Variant can be empty.
  * Extraction of script code still delivers a null string.
  * _compose_locale() in this case will have to concatenate
    Language-Country as it currently does.

* If only a BCP47 variant is involved, with or without script, we could
  add the variant to the first subfield, having
  '-' [script] '-' [variant]
  for easier extraction with rVariant.getToken(1, '-', nIndex).
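
A rough sketch of the three-subfield layout (without the variant
shortcut of the last bullet), just to show that the extraction stays
simple. All names here (nthToken, composeVariant, unixLocaleFor) are
hypothetical, std::string stands in for rtl::OUString, and nthToken only
imitates what getToken() would deliver. The '_' separator in the fallback
is my assumption of what _compose_locale() produces.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Return the n-th (0-based) token of s when split at sep; empty string
// if there are fewer tokens.
std::string nthToken(const std::string& s, std::size_t n, char sep)
{
    std::size_t nStart = 0;
    for (std::size_t i = 0; i < n; ++i)
    {
        const std::size_t nPos = s.find(sep, nStart);
        if (nPos == std::string::npos)
            return std::string();
        nStart = nPos + 1;
    }
    const std::size_t nEnd = s.find(sep, nStart);
    if (nEnd == std::string::npos)
        return s.substr(nStart);
    return s.substr(nStart, nEnd - nStart);
}

// Build "<script>-:<bcp47>:<unix locale>"; any part may be empty.
std::string composeVariant(const std::string& rScript,
                           const std::string& rBcp47,
                           const std::string& rUnixLocale)
{
    return rScript + "-:" + rBcp47 + ":" + rUnixLocale;
}

// What a _compose_locale() replacement might do: prefer the stored full
// Unix locale string, else fall back to Language and Country.
std::string unixLocaleFor(const std::string& rLanguage,
                          const std::string& rCountry,
                          const std::string& rVariant)
{
    const std::string aStored = nthToken(rVariant, 2, ':');
    if (!aStored.empty())
        return aStored;
    return rCountry.empty() ? rLanguage : rLanguage + "_" + rCountry;
}
```

For example, composeVariant("Latn", "sr-Latn-RS", "sr_RS@latin") gives
"Latn-:sr-Latn-RS:sr_RS@latin". The script comes back with
nthToken(aVariant, 0, '-') and the Unix locale string with
nthToken(aVariant, 2, ':'). An empty Variant yields an empty script and
the Language/Country fallback.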

Create a Bcp47 class that transparently handles all cases and tells
which subtags are involved. It should be capable of parsing a BCP47
string or Unix locale string and converting those to a Locale, and of
analyzing a Locale struct and constructing a BCP47 string from it. This
is needed for document access to decide which attributes are to be
written, to read *:rfc-language-tag attributes, and to invoke spell
checkers etc. etc. We'd need that anyway.

Did I miss anything?

Ah, yes, of course, adapt the gazillion places that drop Variant.

And, maybe, using such a Locale with Java might lead to unpredictable
results, I don't know.

  Eike

-- 
 OOo/SO Calc core developer. Number formatter stricken i18n transpositionizer.
 SunSign   0x87F8D412 : 2F58 5236 DB02 F335 8304  7D6C 65C9 F9B5 87F8 D412
 OpenOffice.org Engineering at Sun: http://blogs.sun.com/GullFOSS
 Please don't send personal mail to the e...@sun.com account, which I use for
 mailing lists only and don't read from outside Sun. Use er...@sun.com Thanks.
