Re: allowing an @modifier for documentlanguage locale-based argument

Gavin Smith Tue, 14 Apr 2026 12:04:40 -0700

On Mon, Apr 13, 2026 at 11:58:31PM +0200, Patrice Dumas wrote:
> > I don't like the "latn" and "cyrl" abbreviations used in BCP 47 for "latin"
> > and "cyrillic" (I'm aware these abbreviated names come from incorporation
> > of another ISO standard) and think we should just stick to "latin" and
> > "cyrillic" as already used in .po files.
> 
> Maybe we could accept both the 4 letter codes of ISO 15924 and aliases,
> with the constraint that aliases should be at least 5 letters long,
> allowing the ones used in po files for the most common cases (as seen
> in bcp47.c:
>   { "latin",      "Latn" },
>   { "cyrillic",   "Cyrl" },
>   { "hebrew",     "Hebr" },
>   { "arabic",     "Arab" },
>   { "devanagari", "Deva" },
>   { "gurmukhi",   "Guru" },
>   { "mongolian",  "Mong" }
> 
> 
> Looking at the list of ISO 15924 codes and names, many names are quite long
> and there need to be disambiguation:
> https://en.wikipedia.org/wiki/ISO_15924
> 
> The 4 letter codes are practical to have short but readable and unique names,
> therefore I think that we should accept the 4 letter codes in any case.


I'm okay with adding some kind of @documentscript command to distinguish
the Latin and Cyrillic alphabets, as this is a legit problem for Serbian
(which is a language that does have translations for Texinfo, and likely some
users as well).  ISO 15924 is probably ok to reference for script
identifiers, although it's likely that hardly any of these would ever be
used in Texinfo documents, apart from Latn and Cyrl, as that's all that
is currently supported with the gettext framework.  (I think we should
still support using "latin" and "cyrillic", though.)
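
To make the idea concrete, here is a rough sketch of how a script argument
could be normalised, accepting either a 4-letter ISO 15924 code or one of the
longer aliases (5 letters or more, as Patrice suggested).  The alias table is
copied from the bcp47.c excerpt quoted above; the names normalize_script,
script_aliases and the helper functions are made up for illustration and are
not existing Texinfo code.  It does not check 4-letter codes against the
ISO 15924 registry, it only fixes their capitalisation.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical alias table, taken from the bcp47.c excerpt quoted above. */
struct script_alias { const char *alias; const char *code; };

static const struct script_alias script_aliases[] = {
  { "latin",      "Latn" },
  { "cyrillic",   "Cyrl" },
  { "hebrew",     "Hebr" },
  { "arabic",     "Arab" },
  { "devanagari", "Deva" },
  { "gurmukhi",   "Guru" },
  { "mongolian",  "Mong" }
};

/* Return 1 if C is an ASCII letter, independently of the locale. */
static int
is_ascii_letter (char c)
{
  return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Return 1 if A and B are equal, ignoring ASCII case. */
static int
equal_nocase (const char *a, const char *b)
{
  for (; *a && *b; a++, b++)
    if (tolower ((unsigned char) *a) != tolower ((unsigned char) *b))
      return 0;
  return *a == '\0' && *b == '\0';
}

/* Store the 4-letter form of ARG in OUT.  Return 1 on success and 0 if
   ARG is neither a 4-letter code nor a known alias.  4-letter codes are
   NOT validated against the ISO 15924 registry (assumption: the caller
   does that, or does not care). */
int
normalize_script (const char *arg, char out[5])
{
  size_t len = strlen (arg);
  size_t i;

  if (len == 4)
    {
      for (i = 0; i < 4; i++)
        if (!is_ascii_letter (arg[i]))
          return 0;
      /* ISO 15924 codes are written in title case, e.g. "Latn". */
      out[0] = (char) toupper ((unsigned char) arg[0]);
      for (i = 1; i < 4; i++)
        out[i] = (char) tolower ((unsigned char) arg[i]);
      out[4] = '\0';
      return 1;
    }

  /* Aliases are at least 5 letters long, so they never clash with codes. */
  if (len >= 5)
    for (i = 0; i < sizeof script_aliases / sizeof script_aliases[0]; i++)
      if (equal_nocase (arg, script_aliases[i].alias))
        {
          strcpy (out, script_aliases[i].code);
          return 1;
        }

  return 0;
}
```

With this, "cyrillic", "Cyrl" and "cyrl" would all end up as "Cyrl", while an
unknown long name would be rejected rather than guessed at.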

I'm still sceptical about other ways of specifying language variants using
codes from BCP 47.

I notice that ISO language codes and ISO country codes work well enough and
are simple to understand.  I don't know exactly how ISO make their decisions
about country codes or languages, but they seem to do it well enough and
without any controversy.  It means users of the system can accept their
decisions and avoid the need to argue about the existence of countries
(Kosovo, Palestine) or languages (like Scots, or the question of whether
Chinese (zh) is "just one language").

So in principle I'm fine with accepting such codes from standards bodies.

However, how do we know that the BCP 47 system promulgated by the
IETF and/or IANA really classifies dialects of a language in the most
appropriate way?  As far as I know, the IETF and IANA are separate
organisations from the ISO.  If the "variant" codes were a straightforward
extension of ISO 639, then it would be easier to accept, but it seems
to be its own system run by different organisations.

It may not be the case that adopting BCP 47 achieves the aim of being able
to designate all the languages that users want to use in their documents.

It seems that dialect or language classification is a hard and probably
never-ending job.  When it gets to finer distinctions of dialect, they
become more questionable, and no map or schema ever captures the full truth.
A finer-grained classification than the ISO 639 language codes may be harder
to achieve successfully.  So the extent to which BCP 47 solves a problem
depends on how well the IETF and/or IANA maintain the language codes.

For example, the IANA subtag registry gives about 10 variants of Occitan,
but not that many for most other languages.  Is this just because somebody
wrote in asking for them to be added, or do they have some process of getting
linguistic experts to check information?  Do they have people working
all over the world studying dialects?

It seems that the problem of dialect classification is potentially
very open to input from biased people pushing pet theories (which
would especially be a problem for more obscure languages than Occitan
- presumably somebody would pick up if somebody was trying to invent
a non-existent dialect of Occitan, but this might not be the case for
other languages).  Trying to read about dialect variations on the
Internet can be a depressing process due to the large amount of ignorant
anecdotal opinions expressed by non-experts.

Suppose there turned out to be a flaw in the way that the IANA allocated
sub-language codes.  Then we'd be stuck with referencing a broken system.

I imagine there may well be disputes as to the best way to study, document
and classify the underlying linguistic reality, especially when it comes
to minor linguistic variations.  It's possible there may be systems for
classifying languages and dialects that may be better than what BCP
47 does, or that such systems may exist in the future.

For example, if there was a real practical need for distinguishing
variants of a language, maybe the ISO would invent new top-level codes
for them.  There are ISO 639 codes for languages that could be considered
dialects.  There is "yue" for Yue (Cantonese) even though there is already
"zh" for Chinese.  There is "oc" for Occitan as well as codes for closely
related Romance languages.  There is "sco" for Scots even though there
is also "en" for English.

If the proposal to reference the distinctions enabled in BCP 47 were
driven by present practical needs, then we could better tell if using
it were likely to be sufficient for those needs.  However, the inability to
distinguish dialects of Occitan, or to distinguish Ekavian and Ijekavian
pronunciations of Serbian, appears to be a merely hypothetical problem.

The only problem that it is definitely worth solving is how to distinguish
between Cyrillic and Latin alphabet use for Serbian.
