A philosophical question regarding shell vars & shell built-in utilities

Robert Elz via austin-group-l at The Open Group Thu, 19 Oct 2023 12:47:02 -0700

While generating

    https://www.austingroupbugs.net/view.php?id=1778#c6550


   (note 6550 to bug 1778, mostly about field splitting with the read utility,
   and in particular whether reading into some vars should have unspecified
   effects if changes to those variables could affect the field splitting
   behaviour - reading into, and hence changing, IFS is an obvious example) 

and even earlier, I started to consider what the relationship should be
between shell variables, and shell built-in utilities.

Utilities like read (also getopts, cd, ...) which (almost) must be built
in as they are specified to alter shell variables are something of a special
case, so I'll defer discussion of those until later in this message.  [Aside:
just "almost must be built in" for some of these, as an implementation could
have some other method to allow a utility to interact with the shell, and use
that to allow designated utilities to alter shell variables, or other aspects
of the shell environment.]

So, for now, let's just consider the "often" built in utilities, like
printf, echo, test (aka '[') etc.

With those, if a shell does something like

        unset LANG LC_ALL LC_CTYPE LC_COLLATE LC_MONETARY LC_TIME ....
        LANG=weird
        printf format arg arg arg

Is printf allowed, required, or prohibited from doing its output as
if LANG==weird ?    Note that LANG here is not exported (that was part
of the point of the unset) and if printf were not built in, it would
have no access to the shell's internal LANG variable.   But if it is
builtin, it does.

Is there any language in the current (or forthcoming) standard that
is intended to specify this?  (If anyone knows of some, please reference
or quote it.)

Similarly with test, and the collating sequence for the weird LANG.

Note that if we were instead to do

        export LANG=weird
        printf format arg arg arg

or

        LANG=weird printf format arg arg arg

then it is clear that the exported LANG is intended (required) for printf
to use (and similarly for any other utilities, built-in or not).
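
The exported versus not-exported distinction above can be seen with a utility that is never built in. This is only a minimal sketch: DEMO_VAR is a made-up stand-in for LANG (so no real locale setting is disturbed), and a child "sh -c" plays the part of the non-built-in utility. It says nothing about what a built-in printf is allowed to do, which is the open question.

```shell
# DEMO_VAR is a hypothetical stand-in for LANG.
unset DEMO_VAR
DEMO_VAR=weird                  # a shell variable only, not exported

# A separate process sees only what is exported to it.
not_exported=$(sh -c 'printf %s "${DEMO_VAR-unset}"')

# A prefix assignment exports the value for that one command.
prefixed=$(DEMO_VAR=weird sh -c 'printf %s "${DEMO_VAR-unset}"')

# After export, every later command gets the value.
export DEMO_VAR
exported=$(sh -c 'printf %s "${DEMO_VAR-unset}"')

unset DEMO_VAR                  # tidy up
echo "not exported: $not_exported"
echo "prefix:       $prefixed"
echo "exported:     $exported"
```

Only the first case reports "unset"; a built-in utility is the one kind of command that could see "weird" there.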

Now we get to the issue of those utilities which are required to alter
shell variables, where for consistency I think some of the answers will
depend upon the answer to the question above.

Let's take a particularly simple (and now clear) example first

        X=whatever
        X=something unset X

In the forthcoming standard, it is clear that when this completes, X must
be unset, and not have either "whatever" or "something" as its value, and
must not be exported.   That applies to any special built-in utility which
modifies shell variables.

Now let's look at a closely related (but much more complex)
case

        X=whatever
        X=something . script

and assume the script does

        X=newvalue

as one of its commands (whole command, not a var-assign for something else),
and that that is the sole mention of X in "script" (or perhaps it is expanded
as well, but that doesn't affect its value).

Since '.' is a special builtin, I believe the same rule applies, and that
when the dot script completes, the shell environment should have X=newvalue
as part of it, though it is less clear to me what the requirement is wrt
X's export status (must be, must not be, unspecified whether ...).

If we had instead

        unset X; X=newvalue

in the script, then I think it would be clear, when the script is complete
the shell environment must have X=newvalue and X must not be exported.

[Aside: for anyone wanting to make exceptions in case X is readonly, then
we know here it cannot be, as we are making assignments to X before running
the dot script.]

To make this less abstract, a more likely example perhaps

        PATH=/where/my/script/lives . script

and "script" sets PATH to whatever I really want it to be.  That might be
all it does, script might be a single line containing
        PATH=/bin:/usr/bin
(or something).   There'd be no question if I instead did

        . /where/my/script/lives/script

but I didn't, I chose to find the script using the temporary exported PATH.

All of this is now (will be in POSIX Issue 8) specified for special built
in utilities.   In the PATH example, in both invocations, PATH must end
up being what the script set it to, not whatever it had previously held, and
not the value exported into the script in the first invocation (though that
would be what it would be required to be if the script did not set PATH).
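
The dot script cases above can be tried with a small probe. This is a sketch under stated assumptions: the temporary directory name, the function name probe_dot and the result format are all made up for illustration, and the probe only reports what the running shell does, not what Issue 8 requires. A child "sh -c" is used as a rough check on whether X ended up exported.

```shell
# Hypothetical scratch location for the one-line dot script.
probe_dir=${TMPDIR:-/tmp}/dotprobe.$$
mkdir -p "$probe_dir"
printf '%s\n' 'X=newvalue' > "$probe_dir/script"

probe_dot() {
    unset X
    X=whatever
    X=something . "$1"          # the script assigns X=newvalue
    # Report the value the shell now holds, and what a child sees.
    printf 'X=%s child-sees=%s' "${X-unset}" \
        "$(sh -c 'printf %s "${X-unset}"')"
}

# Run in a subshell so X does not leak into the calling shell.
dot_result=$(probe_dot "$probe_dir/script")
rm -rf "$probe_dir"
echo "$dot_result"
```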


But all that doesn't cover other utilities that are built in, which are
not special built-in, like read, cd and getopts, but which do set variables.
It would (or could) also cover extensions in various shells, like bash's
printf's -v option (write the output into a shell variable) or its %n
format specifier (next arg is a var name, which gets set to the number of
bytes (or maybe chars, doesn't matter here) which have been output before
that format specifier, just like printf(3)).

OK, first question here, and remember to consider your answer to
the questions above about built in printf, test, etc, if we do

        unset IFS
        IFS=:
        read -r a b c

If the line read from standard input (without the leading white space,
that's just for this email, only the spaces between y and p, and q and r
exist in the line) is

        x:y p:q r

should we end up with

        a=x   b='y p'  c='q r'
or
        a=x:y b=p:q    c=r

?       Why, and what in the standard specifies (or allows) the answer
you believe to be correct?
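
The two candidate answers can be produced side by side. In this sketch (the variable names builtin_result and external_result are made up) the first read is the shell's own built-in, and the second runs in a child "sh -c", standing in for a read that has no access to the parent's unexported IFS. It shows what happens, not which result the standard intends.

```shell
line='x:y p:q r'

unset IFS
IFS=:                           # a shell variable only, never exported
read -r a b c <<EOF
$line
EOF
builtin_result="a=$a b=$b c=$c"
unset IFS                       # back to default splitting

# The child shell starts with the default IFS (space, tab, newline).
external_result=$(printf '%s\n' "$line" |
    sh -c 'read -r a b c; printf "a=%s b=%s c=%s" "$a" "$b" "$c"')

echo "built-in read: $builtin_result"
echo "child read:    $external_result"
```

In bash and dash the built-in read splits on the colon and the child splits on the spaces.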


Second question (and for this one, assume that the read implementation
behaves as is specified in mantis bug 1778 - which all shells do for this
purpose as far as I can tell)

If we have

        X=foo
        IFS= X=bar read -r X

and the input is "abc\n"  (the \n becomes the line delimiter, and is removed)
then what value will be in X after this is complete?   Why?

Note that here we explicitly export IFS to read, avoiding the issue of the
previous question, and by giving read an empty value for IFS, we explicitly
say "no field splitting happens", so the line read is to be assigned to the
variable named, which is X in this case, so read will assign "abc" to X.
But what happens in the shell afterwards?   Where is that specified?
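
A probe for this second question, as a sketch only: probe_read and read_result are made-up names, and the output is whatever the running shell happens to do, so no particular answer is claimed to be the correct one.

```shell
probe_read() {
    X=foo
    IFS= X=bar read -r X <<'EOF'
abc
EOF
    # Which of foo, bar or abc survives in the shell?
    printf '%s' "${X-unset}"
}

# Run in a subshell so X does not leak into the calling shell.
read_result=$(probe_read)
echo "X after read: $read_result"
```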

Now let's go one step further, and consider combining all of this into
a more complex example, where hopefully the answers we get will be
consistent with the answers to the previous questions.


Consider now a case where a variable that a (not special) built-in utility
uses as part of its operation is exported to that utility, and the utility
sets that variable (into the shell environment) as a part of its operation.

As a first, and particularly absurd, example - and one which is using
extensions to the standard, let's consider bash's printf %n format operator.
The idea here is to get some general rules for how built-in utilities are
supposed to operate in general, not to comment on bash, or %n formats, etc,
so please just assume all of that part were to be made standard for this.

Consider (with all locale vars initially unset)

        LANG=en_US.UTF-8 printf $'Hello %s%n \xC3\xAC\n' World LANG

Now after printf has output "Hello World" it has written 11 bytes (and 
characters) to standard output, so the %n format causes LANG=11 to
be executed.   First question for this example - is the change made to
LANG in the shell environment (remember that LANG there is unset, and
so not exported) supposed to have any influence at all upon how printf
operates, or should it see only the value exported to it in its environment?

If the answer is the latter (no changes should be visible inside printf)
then there's nothing more to answer.   Otherwise, what does printf output
for the two bytes given in hex here, one UTF-8 encoded character, or just
2 bytes as we end up by default with LANG=C (as there almost certainly is no
language called "11" - let's just assume that is true for the purposes of
this message)?   While that probably makes no difference in this particular
example, as the bytes output would be the same, consider instead a case
where the default shell LANG at the time wasn't unset, so we had (all
other locale vars still unset)

        LANG=en_US.UTF-8
        LANG=zh_TW.BIG5 printf $'Hello %s%n \xC3\xAC\n' World LANG

When the shell encodes the $'' string, it is using en_US.UTF-8 (I presume,
assume these are 2 interactive commands entered at the keyboard, the first
is parsed and executed before the 2nd is read) and so that will be a single
UTF-8 encoded character (I presume, I really know nothing about char encoding
or locales) but when printf comes to output those bytes, are they to
be output as Taiwanese Chinese BIG 5 encoding, or as 2 bytes of (kind of)
ascii ?

Note here I am not really interested in the locale or encoding aspects, the
underlying question is whether when a built-in utility alters the value of
a shell variable, should that utility be able to (or required to, or not
permitted to) access that variable for its own uses later, and does that
differ if a particular value for the variable has been exported into its
environment for it to use ?

Now all this is leading up to this example ...

        unset IFS
        IFS=$' \t\n' read IFS v1 v2 v3

(and for this, ignore what is happening in bug 1778, and concentrate just
on the philosophy of what should happen).

Take almost the same example input as earlier:

        : x:y p:q r

Now if you're of the opinion that read should not alter any of the
variables until field splitting has completed, the answer here is
clear, the resulting variables after the read has finished are

        IFS=: v1=x:y v2=p:q v3=r

as the field splitting happens with space in IFS, so that's what is
used to split the fields.

On the other hand, for the interesting case, let's assume you're in
the "as each field is split out, assign it to its variable" camp (which
several shells implement).   In that case, the IFS=: assignment happens
before field splitting gets to the next field, and the shells that
implement things this way use that value for splitting the remaining
fields, leading to

        IFS=: v1=x v2='y p' v3='q r'

as the result.   But should they?   Not should they assign field by
field, that's not the question here, but when field splitting continues,
should it be using the value that was passed in the environment for read
to use, or should it be digging into the shell to look at variable values
that are not exported to it, just because the utility might have changed
that value?
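
To see which of the two results a given shell produces, something like the following can be used. It is a sketch with made-up names (probe_read_ifs, ifs_result, sp_tab_nl), and it builds the space, tab, newline value with printf rather than $'...' so that it also runs in shells lacking that quoting form. It reports behaviour only and takes no position on what should happen.

```shell
probe_read_ifs() {
    # Build space, tab, newline; the trailing x protects the newline
    # from being stripped by the command substitution.
    sp_tab_nl=$(printf ' \t\nx')
    sp_tab_nl=${sp_tab_nl%x}

    unset IFS
    IFS=$sp_tab_nl read IFS v1 v2 v3 <<'EOF'
: x:y p:q r
EOF
    printf 'IFS=%s v1=%s v2=%s v3=%s' "${IFS-unset}" "$v1" "$v2" "$v3"
}

# Run in a subshell so the altered IFS does not leak out.
ifs_result=$(probe_read_ifs)
echo "$ifs_result"
```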

I have one more very much related example for everyone to consider...

        pwd; OLDPWD=/foo; OLDPWD=/bar cd /tmp; echo $OLDPWD

where shells produce different results, but where in this case I
believe the standard (the current draft for Issue 8 anyway) is clear
what is required.   I believe that the output should be two identical
lines, containing the path of whatever the current directory was
before this command line was executed (and let's assume that was not /tmp
just to avoid complications).   And I believe that is what Draft 3 of
POSIX Issue 8 requires.   Fortunately no shell I tested outputs /bar for
the echo (that would clearly be wrong).   But many output /foo which makes
no sense to me at all, and can only be a bug, there's nothing in the
standard I can find which permits that result - the standard says that cd
sets OLDPWD to the path to the current working directory before it was
changed.   That's what the initial "pwd" printed, which is why the two
output lines should be the same (and several shells implement it that way).
The messing around with the assignments to OLDPWD should be irrelevant,
none of that should have any effect upon anything - yet in several shells
it does.

Amazingly, the (now older) AST version of ksh93 got this correct, but the
current (community maintained) version seems to have broken it.   (It is
unfortunate we cannot play games with setting PWD here, as doing that makes
cd's behaviour unspecified - but setting OLDPWD does that only in the case
of "cd -" which we're not doing here.)
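
The OLDPWD case is easy to check in any shell. In this sketch (probe_oldpwd and oldpwd_result are made-up names) the starting directory is forced to / so that it is known not to be /tmp, and /tmp is assumed to exist. If the reading of the draft given above is right, the two reported paths should match.

```shell
probe_oldpwd() {
    cd / || return              # a known starting point that is not /tmp
    before=$(pwd)
    OLDPWD=/foo
    OLDPWD=/bar cd /tmp || return
    printf 'before=%s OLDPWD=%s' "$before" "$OLDPWD"
}

# Run in a subshell so the caller's directory is left alone.
oldpwd_result=$(probe_oldpwd)
echo "$oldpwd_result"
```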


Enough for now - more might follow based upon responses.

One last point, I am asking here more about what should happen, rather
than what does happen in current shells - we can get to the latter question
if we can form some common opinion about the desired outcomes here.
For shell implementers, and users with a favourite shell, please don't
just consider what your own shell does here, unless you particularly
carefully pondered this question before, and either implemented what
you believe is correct, or picked a shell which had done that - which
exception, my guess is, applies to just about no-one (including me).

kre
