On Tue, 16 Sept 2025 at 01:46, Chet Ramey <chet.ra...@case.edu> wrote:

> On 9/15/25 10:46 AM, Robert Elz wrote:
>
> > If it was intended to mean "parsing the script" it would certainly say so.
>
> Doesn't the fact that the discussion of token recognition includes the
> references to <blank>s imply this?
>

There's an argument from utility that strongly suggests that this is not
what's intended. Locale awareness makes programs more portable in the sense
that it makes them easier to use by a wider range of people. Influencing
the parsing of a script based on a locale that's outside the script's
control has the opposite effect: it makes scripts less portable, because
what runs OK in one locale won't even parse correctly in another.

Having encoding be specified as a sub-aspect of a locale seemed like a
reasonable design choice in the 1970s,(*1) but with Unicode displacing
other encodings, it's become apparent that this entanglement has problems,
not least that it leads discussions (like this one) down rabbit holes when
some participants assume encoding is intrinsically part of the locale, and
that anything that depends on encoding must therefore be
locale-dependent. In practice character categories like <blank>
are generally invariant(*2) for a given encoding,(*3) even among different
locales.
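
For anyone who wants to test that claim on their own system, here is a
minimal sketch. The helper name is_blank_in is made up for illustration,
and it assumes a shell (such as Bash) that re-reads its locale when LC_ALL
is assigned; it says nothing about how any shell tokenises its input.

```shell
# Hypothetical helper: succeeds if the single character $2 is in the
# <blank> class of the locale named by $1.
is_blank_in() {
    (
        # Assumption: assigning LC_ALL inside the subshell switches the
        # locale used for pattern matching (true for Bash; other shells
        # may simply ignore it). The subshell keeps the change contained.
        LC_ALL=$1
        case $2 in
            [[:blank:]]) exit 0 ;;
            *) exit 1 ;;
        esac
    )
}

# Possible usage: list installed locales that classify a given character
# as <blank>. The bytes below are U+00A0 encoded as UTF-8, so the result
# is only meaningful for UTF-8 locales.
#   ch=$(printf '\302\240')
#   for l in $(locale -a); do is_blank_in "$l" "$ch" && echo "$l"; done
```

If the categories really are invariant for a given encoding, every UTF-8
locale on the system should give the same answer for the same character.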

Perhaps it's worth considering the <alpha> category, which in some national
ASCII variants(*4) would include codepoints 0x40, 0x5b…0x5e, 0x60 &
0x7b…0x7e.  It
would make no sense(*6) to treat the US-ASCII symbols @ [ \ ] ^ ` { | } ~
as “part of an identifier”(*7) while parsing a script, and LC_CTYPE(*8)
should *clearly* be ignored in this case. So I would argue that for
consistency it should be ignored entirely while parsing.
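
To make "ignored while parsing" concrete, here is a rough sketch of a name
check that cannot be influenced by LC_CTYPE or LC_COLLATE, because it
spells out every permitted character instead of using [[:alpha:]] or a
range like [a-z]. The function name is_portable_name is hypothetical, and
this is not how Bash itself validates names.

```shell
# Hypothetical helper: succeeds if $1 consists only of ASCII letters,
# digits and underscore, and does not start with a digit.
is_portable_name() {
    case $1 in
        # Reject the empty string and anything starting with a digit.
        '' | [0123456789]*)
            return 1 ;;
        # Reject anything containing a character outside the explicit set.
        # No classes or ranges are used, so the locale has no say here.
        *[!ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_]*)
            return 1 ;;
    esac
    return 0
}
```

With this, "foo_1" is accepted while "1foo", "a[b" and "a{b" are rejected,
whatever LANG, LC_CTYPE or LC_ALL happen to be set to.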

Languages like C make a clear distinction between the compile-time
environment, where charset encoding is honoured(*9) but natural language
features are ignored, and the run-time environment, where the full suite of
localisation functionality is available (but still optional). Bash scripts
aren't “compiled” but they're certainly subject to multiple phases of
interpretation, and locales should be ignored in some and honoured in
others.

I support Greg Wooledge's request: if you feel obliged to implement
locale-aware parsing, please gate it on being in Posix mode, and/or a new
default-off option. Additionally, please can this be highlighted in the
manual as “experimental” and “may be withdrawn in a future version of Bash”
or similar, until such time as there's an official interpretation issued by
the Posix WG.

-Martin

(*1: an alternative could have been to include the encoding in $TERM.)

(*2: Arguably there are exceptions, but then the question arises of whether
you're really looking at “the same” encoding.)

(*3: Extensible encodings like UTF-8 may be revised to assign new
codepoints, and include them in various categories. (*99))

(*4: Sure, nobody uses ISO/IEC-646 national variant encodings outside a
museum.(*5))

(*5: Some museums also function as private residences.)

(*6: well, it makes no sense to me. Given my unfortunate prior tendency to
employ hyperbole and exaggeration, this time I stopped and thought really
hard about whether I should include this statement. Personally I think that
this hypothetical treatment of national-variant ASCII codepoints would be
completely nuts^W bonkers^W^W extremely counterproductive, and I hope that
this is a widely shared opinion.)

(*7: I'm aware that this isn't exactly how “identifier” is defined, but I'm
trying to make a case for rationality and consistency rather than strict
adherence to a (possibly erroneous) interpretation of the standard. I await
the WG's clarification on this point.)

(*8: Including LC_CTYPE inferred from LANG or overridden by LC_ALL.)

(*9: It is possible to encode C source code in EBCDIC, but conversely not
possible to encode C source code in some national ASCII variants (C23
§5.2.1). Multi-byte character encodings are expressly permitted in source
code (C23 §5.2.2, §6.4.4.5, §6.4.5) and moreover, there's explicit support
for translating the source encoding to UTF-8, UTF-16, or UTF-32.)

(*99: Open question: should Bash treat ISO-10646-1991 as “the same”
encoding as ISO-10646-2025? As ISO-10646-2091? If not, why not? If so, with
what caveats & meta rules?)
