Date: Sun, 14 Sep 2025 13:23:06 -0400
From: Lawrence Velázquez <v...@larryv.me>
Message-ID: <bbe1cd4d-2e4d-41a1-be56-f49ef925c...@app.fastmail.com>

  | POSIX seems to require delimiting on all <blank>s [*], without
  | qualification.
  |
  |     7. If the current character is an unquoted <blank>, any
  |        token containing the previous character is delimited
  |        and the current character shall be discarded.

It does indeed, but what is less clear is which locale should be used
when doing that.

There is a fairly strong argument that parsing scripts (including the
arg to -c etc) should always be done in the C locale (aka POSIX),
completely ignoring the LC_* (and LANG) variable values, and using
those only when processing user data, rather than the script itself.

That would make scripts far more portable, and more like what any
compiled language does - C programs don't change their syntax and run
differently (in terms of what code is actually executed) just because
the user changes locale before running them - so why should any other
program, regardless of the (computer) language in which it is written?

Note that the first sentences in section 2.3 (Token recognition), which
you quoted from (as included above), are:

    The shell shall read its input in terms of lines. (For details
    about how the shell reads its input, see the description of sh.)

In the XCU 3 page for sh, we find:

    LC_CTYPE  Determine the locale for the interpretation of sequences
              of bytes of text data as characters (for example,
              single-byte as opposed to multi-byte characters in
              arguments and input files), which characters are defined
              as letters (character class alpha), and the behavior of
              character classes within pattern matching.

What seems relevant there is the "in arguments and input files" - and
that it doesn't say anything about parsing the script (unless one
classified the script as an input file, and given the context, that
seems like a stretch).

There is really nothing in the standard that I can find which is
explicit about how locales affect the parsing of programs written in
any of the languages it defines (awk, bc, sed, sh ...) - and I would
certainly hesitate to jump to the conclusion that the locale settings
are intended to affect that; in fact I think that would be an extremely
poor result.

kre
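
One way to probe the question on a given system is sketched below. It is only an illustration under stated assumptions: count_fields is a hypothetical helper (not anything from the standard or from any shell), and U+00A0 NO-BREAK SPACE is used merely as an example of a non-ASCII character that some platforms' locales may classify as <blank>. The helper hands a script to a fresh sh via -c and reports how many arguments that shell saw.

```shell
# Hypothetical helper: report how many arguments a freshly started sh sees
# when it parses the script text  set -- a<NBSP>b  under the locale named
# in $1.
count_fields() {
    # \302\240 is the UTF-8 encoding of U+00A0 NO-BREAK SPACE; whether any
    # locale treats it as <blank> is platform-dependent (an assumption here).
    script=$(printf 'set -- a\302\240b; echo "$#"')
    LC_ALL=$1 sh -c "$script"
}

# In the C locale only <space> and <tab> are <blank>, so the two bytes
# cannot delimit a token and the shell sees a single argument.
count_fields C    # prints 1

# Under a UTF-8 locale the answer may differ between shells and platforms
# (the locale name below is an assumption, and may not be installed):
# count_fields en_US.UTF-8
```

If the second call prints 2 on some system, then the same script text is being tokenized differently depending on the locale of whoever runs it, which is the portability concern raised above.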