On Tue, 16 Sept 2025 at 01:46, Chet Ramey <chet.ra...@case.edu> wrote:
> On 9/15/25 10:46 AM, Robert Elz wrote:
>
> > If it was intended to mean "parsing the script" it would certainly say
> > so.
>
> Doesn't the fact that the discussion of token recognition includes the
> references to <blank>s imply this?

There's an argument from utility that strongly suggests that this is not what's intended.

Locale awareness makes programs more portable in the sense that it makes them easier to use by a wider range of people. Influencing the parsing of a script based on a locale that's outside the script's control has the opposite effect: it makes scripts less portable, because what runs OK in one locale won't even parse correctly in another.

Having encoding be specified as a sub-aspect of a locale seemed like a reasonable design choice in the 1970s,(*1) but with Unicode displacing other encodings, it's become apparent that this entanglement has problems, not least that it leads discussions (like this one) down rabbit holes when some participants assume encoding is intrinsically part of the locale, and that anything that depends on encoding must therefore be locale-dependent.

In practice character categories like <blank> are generally invariant(*2) for a given encoding,(*3) even among different locales.

Perhaps it's worth considering the <alpha> category, which in some national ASCII variants(*4) would include codepoints 0x40, 0x5b…0x5e, 0x60 & 0x7b…0x7e. It would make no sense(*6) to treat the US-ASCII symbols @ [ \ ] ^ ` { | } ~ as “part of an identifier”(*7) while parsing a script, and LC_CTYPE(*8) should *clearly* be ignored in this case. So I would argue that for consistency it should be ignored entirely while parsing.

Languages like C make a clear distinction between the compile-time environment, where charset encoding is honoured(*9) but natural language features are ignored, and the run-time environment, where the full suite of localisation functionality is available (but still optional).
Bash scripts aren't “compiled” but they're certainly subject to multiple phases of interpretation, and locales should be ignored in some and honoured in others.

I support Greg Wooledge's request: if you feel obliged to implement locale-aware parsing, please gate it on being in Posix mode, and/or a new default-off option. Additionally, please can this be highlighted in the manual as “experimental” and “may be withdrawn in a future version of Bash” or similar, until such time as there's an official interpretation issued by the Posix WG.

-Martin

(*1: An alternative could have been to include the encoding in $TERM.)

(*2: Arguably there are exceptions, but then the question arises of whether you're really looking at “the same” encoding.)

(*3: Extensible encodings like UTF-8 may be revised to assign new codepoints, and include them in various categories. (*99))

(*4: Sure, nobody uses ISO/IEC-646 national variant encodings outside a museum.(*5))

(*5: Some museums also function as private residences.)

(*6: Well, it makes no sense to me. Given my unfortunate prior tendency to employ hyperbole and exaggeration, this time I stopped and thought really hard about whether I should include this statement. Personally I think that this hypothetical treatment of national-variant ASCII codepoints would be completely nuts^W bonkers^W^W extremely counterproductive, and I hope that this is a widely shared opinion.)

(*7: I'm aware that this isn't exactly how “identifier” is defined, but I'm trying to make a case for rationality and consistency rather than strict adherence to a (possibly erroneous) interpretation of the standard. I await the WG's clarification on this point.)

(*8: Including LC_CTYPE inferred from LANG or overridden by LC_ALL.)

(*9: It is possible to encode C source code in EBCDIC, but conversely not possible to encode C source code in some national ASCII variants (C23 §5.2.1). Multi-byte character encodings are expressly permitted in source code (C23 §5.2.2, §6.4.4.5, §6.4.5) and moreover, there's explicit support for translating the source encoding to UTF-8, UTF-16, or UTF-32.)

(*99: Open question: should Bash treat ISO-10646-1991 as “the same” encoding as ISO-10646-2025? As ISO-10646-2091? If not, why not? If so, with what caveats & meta rules?)
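
To make the <alpha> point concrete, here is a minimal sketch in Python. It is not anything Bash does today, and the table is an assumption: it lists only the letter replacements that (as far as I recall) the Swedish ISO 646 variant makes, and omits that variant's other differences from US-ASCII. The names ISO646_SE and decode are made up for this illustration.

```python
# Assumed letter replacements of the Swedish ISO 646 variant; illustrative
# only, other differences from US-ASCII are deliberately left out.
ISO646_SE = {
    0x5B: "Ä", 0x5C: "Ö", 0x5D: "Å",
    0x7B: "ä", 0x7C: "ö", 0x7D: "å",
}


def decode(data, overrides):
    """Decode 7-bit bytes as US-ASCII, except where `overrides` applies."""
    return "".join(overrides.get(b, chr(b)) for b in data)


# The same bytes on disk, read under two different encodings.
script = b"echo {a,b}"

as_ascii = decode(script, {})
as_swedish = decode(script, ISO646_SE)

print(as_ascii)    # echo {a,b}
print(as_swedish)  # echo äa,bå

# Byte 0x7B is punctuation under one reading and a letter under the other.
print(chr(0x7B).isalpha(), ISO646_SE[0x7B].isalpha())  # False True
```

If the parser consulted <alpha> for the second reading, the brace expansion would vanish and the word would become a run of letters, which is the outcome argued against above.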