On Tue, Apr 14, 2026 at 10:36:25PM +0200, Patrice Dumas wrote:
> On Tue, Apr 14, 2026 at 08:04:16PM +0100, Gavin Smith wrote:
> > I'm still sceptical about other ways of specifying language variants using
> > codes from BCP 47.
> > 
> > However, how do we know that the BCP 47 system promulgated by the
> > IETF and/or IANA really classifies dialects of a language in the most
> > appropriate way?
> 
> We do not know.  But the ISO is not infallible either.
> 
> > As far as I know, the IETF and IANA are separate
> > organisations from the ISO.  If the "variant" codes were a straightforward
> > extension of ISO 639, then it would be easier to accept, but it seems
> > to be its own system run by different organisations.
> 
> I do not understand why you treat ISO, IANA and IETF differently (or
> the Unicode Consortium, which relies on IANA too for its data).  All are
> standards bodies; IETF and IANA are associated with the internet, while
> ISO is more generic, but none is more trustworthy than another.

...

> > For example, the IANA subtag registry gives about 10 variants of Occitan,
> > but not that many for most other languages.  Is this just because somebody
> > wrote in asking for them to be added, or do they have some process of
> > getting linguistic experts to check information?  Do they have people
> > working all over the world studying dialects?
> 
> I do not know.  All I can say is that, for somebody who is not very
> knowledgeable on the matter, not a speaker nor reader but has an interest
> in Occitan, these look like good choices.  Among others, of course,
> there is no absolute certainty that the choices made are the best.  It
> is the same for languages, actually.  The languages tags change over
> time.  But, even if the IANA work is imperfect, which I have no evidence
> of, I still rest my case that ignoring all the language variants is much
> worse than being occasionally wrong for some variants.

Section 3.5 of RFC 5646 (BCP 47) explains how anyone can submit a request
to add language subtags:

   The subtag registration form MUST be sent to
   <[email protected]>.  Registration requests receive a two-week
   review period before being approved and submitted to IANA for
   inclusion in the registry. 

The submissions are reviewed by the Language Subtag Reviewer, who
is an individual appointed by the IESG (section 3.2).  (Some pages
online identify the individual as Doug Ewell.)

https://www.rfc-editor.org/rfc/rfc5646.txt

This is not the same level of review as assigning a new ISO country code,
which is going to get vastly more scrutiny, and I expect that ISO language
codes (especially the two-letter codes) likewise get more scrutiny than
these variant subtags.

> > It seems that the problem of dialect classification is potentially
> > very open to input from biased people pushing pet theories (which
> > would especially be a problem for more obscure languages than Occitan
> > - presumably somebody would pick up if somebody was trying to invent
> > a non-existent dialect of Occitan, but this might not be the case for
> > other languages).
> 
> Is that a real issue?  Worse than not supporting any of the language
> variants?

Some of the closed non-English Wikipedias may be examples of cases where
it was difficult to know whether the language being used was valid.

https://en.wikipedia.org/wiki/List_of_Wikipedias#Wikipedia_editions

For example:

  "The project has only ever attracted a single user who claimed to
  be fluent in Norfuk; he or she didn't add much content, and has not
  edited since 2012. As a result, with neither fluent speakers nor written
  literature to "sanity check" the project, we have no idea whether this
  project is even in Norfuk at all, or would be useful to the very rare
  Norfuk speaker who'd rather use a tiny Norfuk edition of Wikipedia rather
  than English Wikipedia. "

https://meta.wikimedia.org/wiki/Proposals_for_closing_projects/Closure_of_Pitkern_%26_Norfuk_Wikipedia_3

en-boont represents Boontling, which doesn't appear to be a very important
phenomenon, based on: https://en.wikipedia.org/wiki/Boontling

The reason that Boontling gets a code, and dozens of traditional English
dialects don't, is just that somebody proposed Boontling for inclusion,
and nobody (I expect) proposed the other dialects.  (Such dialects were
much in decline in the second half of the 20th century, as far as I am
aware.)

It just seems that the task of dialect classification is very susceptible
to low quality information and even hoaxes.

There was the "Focurc" hoax that got some attention online:

  Tucked away in the villages outside Falkirk in the Scottish central
  lowlands, a language has been slowly developing for centuries. A few
  hundred native speakers remain, buffeted by modernity, snobbery and a
  schools system that makes people speak “proper” English. It sounds
  a lot like Scots, but it’s not Scots: it’s called Focurc.
  
  That’s the contention of Mark O’Donnell, 22, the language’s main
  cheerleader online – and the only person ever to have documented it.

  ...

  And we should note that what O’Donnell says is not given much credence
  by academics who have devoted their careers to the study of Scots.
  
  “Having considered, with all good will, the evidence presented for
  there being a separate West Germanic language, closely related to, but
  not the same as, Scots, spoken in the Falkirk area, there appears to be no
  reason for supporting such a hypothesis,” says Robert Millar, professor
  of linguistics and Scottish language at the University of Aberdeen.

https://inews.co.uk/news/long-reads/focurc-newly-documented-language-found-one-scottish-area-86908

Now I'm not saying that there is anything wrong with how the IANA is allocating
codes, but we should be aware of the potential for issues and controversies.


> > Suppose there turned out to be a flaw in the way that the IANA allocated
> > sub-language codes.  Then we'd be stuck with referencing a broken system.
> 
> Not only us.  Everybody, as all the internet uses the IANA system,
> through HTML, XML, libreoffice, LaTeX, wikipedia...  I do not believe
> that we would be the most impacted, HTML, libreoffice or wikipedia have
> a goal of handling all the languages and have users and content in many
> language variants.  We definitely are not at the forefront of
> supporting the diversity of languages...  In any case, we could change
> the data we use if we are dissatisfied with IANA.

My concern is that there have been very few apparent problems with the
current way of naming a language, whether in Texinfo, in POSIX locales, or
in the gettext translation framework.  Hence I'm cautious about
complicating it by opening it up to this system for classifying dialects.

> > I imagine there well may be disputes as to the best way to study, document
> > and classify the underlying linguistic reality, especially when it comes
> > to minor linguistic variations.  It's possible there may be systems for
> > classifying languages and dialects that may be better than what BCP
> > 47 does, or that such systems may exist in the future.
> 
> Then we can switch.  All the evidence points towards the IANA selection
> being the state of the art nowadays.  But we can switch whenever we
> want.

Then we should design the Texinfo language so that it is easy to switch.

> > If the proposal to reference the distinctions enabled in BCP 47 were
> > driven by present practical needs, then we could better tell if using
> > it were likely to be sufficient for those needs.  However, the inability
> > to distinguish dialects of Occitan, or to distinguish Ekavian and
> > Ijekavian pronunciations of Serbian, appear merely hypothetical problems.
> 
> For Occitan it is not a hypothetical issue, that's for sure.  What
> is not known is whether there are people knowledgeable in Occitan willing
> to write manuals and translated strings.  Outside of Texinfo, variants
> are definitely used, see babel for example for their list, it includes
> be-tarask, sr-ijekavsk, de-AT-1901, el-polyton.  (No Occitan variant
> though, although I would have bet on it ;-).

I think it's okay to add a command for specifying a language variant,
as long as the complexity of referencing BCP 47 is confined to that
one command.  As before:

@documentlanguage sr
@documentlanguagevariant ekavsk
@documentscript latin

Only the "@documentlanguagevariant" command would need to be defined in
terms of BCP 47 or the IANA language subtag registry.

Then in the future, if there is any issue about the definition of language
variants, or other systems become more prominent, we would just need
to change the documented usage of this one command.

The ISO language code (with an optional country code) would continue to
be given on the @documentlanguage line, as in "@documentlanguage fr_CA".

If there were a need for more than one variant, these could be given as
a comma-separated list on the @documentlanguagevariant line, like the
argument to @example.

Script, as you say, is less specific than language dialect (if I understand
correctly), so it can be given in a separate command.  (First you specify
the spoken language, then the script for writing it down.)  I think we
can avoid the use of @modifier suffixes to language identifiers.
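
To illustrate how the separate commands could still be combined into a
single BCP 47 tag when an output format needs one (e.g. for an HTML "lang"
attribute), here is a rough sketch in Python.  It is purely illustrative:
the function name, the script table and the handling of the arguments are
my own assumptions, not anything that exists in texi2any.

```python
# Hypothetical sketch: none of these names exist in Texinfo or texi2any.

# Assumed mapping from a spelled-out @documentscript argument to an
# ISO 15924 script subtag; only a couple of entries are listed.
SCRIPT_SUBTAGS = {"latin": "Latn", "cyrillic": "Cyrl"}


def bcp47_tag(documentlanguage, documentlanguagevariant="", documentscript=""):
    """Combine the three command arguments into one BCP 47 language tag."""
    # "@documentlanguage fr_CA" carries a language code and an optional
    # country code separated by an underscore.
    language, _, region = documentlanguage.partition("_")
    subtags = [language.lower()]
    # BCP 47 orders subtags as language, script, region, variants.
    if documentscript:
        key = documentscript.strip().lower()
        subtags.append(SCRIPT_SUBTAGS.get(key, documentscript.strip().title()))
    if region:
        subtags.append(region.upper())
    # Several variants may be given as a comma-separated list.
    for variant in documentlanguagevariant.split(","):
        variant = variant.strip().lower()
        if variant:
            subtags.append(variant)
    return "-".join(subtags)


print(bcp47_tag("sr", "ekavsk", "latin"))  # sr-Latn-ekavsk
print(bcp47_tag("fr_CA"))                  # fr-CA
print(bcp47_tag("de_AT", "1901"))          # de-AT-1901
```

The point is that only the variant argument depends on the IANA registry.
If we later wanted to use a different system for variants, only the last
loop (and the documentation of that one command) would need to change.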
