    Date:        Sun, 14 Sep 2025 13:23:06 -0400
    From:        Lawrence Velázquez <v...@larryv.me>
    Message-ID:  <bbe1cd4d-2e4d-41a1-be56-f49ef925c...@app.fastmail.com>


  | POSIX seems to require delimiting on all <blank>s [*], without
  | qualification.
  |
  |     7.  If the current character is an unquoted <blank>, any
  |         token containing the previous character is delimited
  |         and the current character shall be discarded.


It does indeed, but what is less clear is which locale should be used
when doing that.  There is a fairly strong argument that parsing scripts
(including the argument to -c, etc.) should always be done in the C (aka POSIX)
locale, completely ignoring the values of the LC_* (and LANG) variables, and
using those only when processing user data, rather than the script itself.
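
To illustrate that argument, here is a minimal Python sketch (not any real
shell's lexer) of rule 7 with the <blank> set pinned to the C locale, where
<blank> is just space and tab.  The names split_on_blanks and C_LOCALE_BLANKS
are hypothetical, quoting is reduced to single quotes only, and the idea of
a locale that classes U+00A0 as <blank> is an assumption made purely for
the comparison.

```python
# Sketch only: rule 7 of token recognition with the <blank> set fixed to
# the POSIX (C) locale, where <blank> is just space and tab.
# All names are hypothetical; only single-quote quoting is modelled.
C_LOCALE_BLANKS = {" ", "\t"}


def split_on_blanks(line, blanks=C_LOCALE_BLANKS):
    """Split one input line into tokens on unquoted blanks."""
    tokens = []
    current = None   # None means no token is in progress
    quoted = False   # True while inside a '...' quoted section
    for ch in line:
        if ch == "'":
            quoted = not quoted
            current = (current or "") + ch
        elif ch in blanks and not quoted:
            # Rule 7: delimit any token in progress, discard the blank.
            if current is not None:
                tokens.append(current)
                current = None
        else:
            current = (current or "") + ch
    if current is not None:
        tokens.append(current)
    return tokens


# Same script text, two different blank sets.
script = "echo\u00a0hi"
fixed = split_on_blanks(script)
varying = split_on_blanks(script, blanks={" ", "\t", "\u00a0"})
```

With the fixed set, the script text is one word ("echo", a no-break space,
"hi" all in a single token); with the assumed locale-dependent set it is two
words, so a different command would be run.  That is the portability problem
described above.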

That would make scripts far more portable, and more like what any compiled
language does - C programs don't change their syntax and run differently
(in terms of what code is actually executed) just because the user changes
locale before running them - why should any other program behave differently,
regardless of the (computer) language in which it is written?

Note that the first sentences of section 2.3 (Token Recognition), from which
you quoted (as included above), are:

        The shell shall read its input in terms of lines. (For details about
        how the shell reads its input, see the description of sh.)

In the XCU 3 page for sh, we find:

LC_CTYPE Determine the locale for the interpretation of sequences of bytes
         of text data as characters (for example, single-byte as opposed to
         multi-byte characters in arguments and input files), which characters
         are defined as letters (character class alpha), and the behavior
         of character classes within pattern matching.

What seems relevant there is the "in arguments and input files" - and that it
doesn't say anything about parsing the script (unless one classifies
the script as an input file, and given the context, that seems like a stretch).
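
For the user-data side of this, a small Python sketch of what "interpretation
of sequences of bytes of text data as characters" means in practice.  The two
codecs stand in for what a UTF-8 and an ISO 8859-1 LC_CTYPE setting would
select; Python's decode() is used here only as an analogy for that, it is not
how a shell consults the locale.

```python
# Sketch only: the same bytes read as characters under two encodings,
# standing in for two different LC_CTYPE settings.
data = b"Vel\xc3\xa1zquez"            # 10 bytes

as_utf8 = data.decode("utf-8")        # 0xC3 0xA1 is one multi-byte character
as_latin1 = data.decode("latin-1")    # every byte is its own character

utf8_chars = len(as_utf8)
latin1_chars = len(as_latin1)
```

Here utf8_chars is 9 and latin1_chars is 10: the bytes are identical, only the
character interpretation differs, which is exactly what one wants for
arguments and input files.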

There is really nothing in the standard I can find which is explicit about
how locales affect parsing of programs using any of the languages it defines
(awk, bc, sed, sh ...) - and I would certainly hesitate to jump to the
conclusion that the locale settings are intended to affect that; in fact,
I think that would be an extremely poor result.

kre

