On Fri, 2022-02-18 at 00:35 +0000, Thorsten Glaser wrote:
> You can have nōn-POSIX locales. For example, in mksh, I have a UTF-8
> mode, but I specify that only the "C" locale attempts POSIX
> conformance.

But that sounds like a violation of POSIX, e.g. if you had a locale 'C'
which would encode '.' as 0x2E and another one which encodes the same
as something else - you couldn't just say that only the other locale is
non-POSIX, but then your whole implementation wouldn't be compliant.

Same if any of the other hard rules are broken... like chars from the
portable charset being just one byte long, etc.



> Switching the locale during shell runtime is not allowed to change
> the way the script is parsed, so the variables etc. are all that is
> permitted to “change”, by means of reinterpretation.

Yes I found that now in the standard and I guess it's more or less
clear when a locale change does apply and when not:

- foo=x
  => clear,... all lexical, change doesn't apply

- printf '%s' 'foo'
  => clear, printf is like a command.. so printf would use the changed
     locale, depending on what the format actually is (which is also a
     bit ambiguous, see https://www.austingroupbugs.net/view.php?id=1562)

- ${var%foo}
  => well,.. semi-clear

- expanding $# or $?
  => in principle, POSIX seems to allow locales that have different
     encodings for the chars from the portable charset.
     So in principle, one could ask whether $# and $? give the digits
     from the new locale, when it was changed internally.
     However, POSIX also says that when the portable charset chars are
     encoded differently, and such locales are used, the results are
     unspecified.
     So it doesn't really matter.


> But here you’re lucky again that <period> has to have the exact same
> encoding across *all* locales supported in one POSIX “universe”, and
> that it must not occur as part of a multibyte encoding in a supported
> locale on the same universe.

Despite that.. and despite the solution that had been discussed here
before, ... having spent quite some thought (and hopefully learned a
bit) about it... I'm still unsure about how to make the command
substitution with trailing newlines really 100% portable in any
situation (i.e. locale) allowed by POSIX and especially with any shell
conforming to POSIX.

... (below)



> > But at least, it should still work portably, when doing the
> > LC_ALL=C
> 
> No, absolutely not.
> 
> In all supporta̲b̲l̲e̲ scenarios (i.e. those in which you’re not entering
> unspecified behaviour already anyway), you’ll be safe with:
> 
> x=$(command; echo .); x=${x%.}
> 
> (Or a variant that carries over $?, of course.)
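
Just to have it written down, a minimal sketch of what such a variant
that carries over $? might look like. The function command_under_test is
made up and only stands in for "command"; whether the bare '.' is
sufficient is exactly the open question discussed further down.

```shell
# Hypothetical stand-in for "command": its output ends in two newlines
# and it fails with exit status 3.
command_under_test() {
    printf 'some output\n\n'
    return 3
}

# Inside the substitution: remember the status, emit the sentinel, then
# exit with the remembered status so that it becomes the status of the
# assignment.
x=$(command_under_test; rc=$?; echo .; exit "$rc")
rc=$?

# Strip the sentinel again; the newlines in front of it survive.
x=${x%.}
```

Afterwards $rc holds the status of the command and $x its output
including both trailing newlines.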

...

I know you've said earlier that you considered using '.' enough, but
Chet Ramey, Geoff Clare and others still said the LC_ALL=C switch would
be necessary.

It was brought up before that an implementation would be allowed to not
handle it gracefully, if the string was say: "<some invalid
encoding>."... and even if that couldn't form a new character because
of the special properties of '.' ... it could still fail to be
stripped off properly.

Your argument was that any shell that fails to do that would have a
bug... but it's unclear whether that's really mandated by the standard.


That's why I've asked before:
> I tried to find out in the standard, what POSIX actually says that
> "${tmp%∈}" operates on: bytes or characters.
> 
> And that seems a bit ambiguous (well, to me at least).
> 
> - In some earlier discussion it was pointed out that shell variables
>   should be strings (of bytes, other than NUL)

If variables are byte strings... (which is also disputed, btw.)...

> 
> - 2.6.2 Parameter Expansion
>   doesn't seem to say, what the #, ##, % and %% special forms of
>   expansion work on: bytes or characters
> 
> - 2.13. Pattern Matching Notation says:
>   "The pattern matching notation described in this section is used to
>   specify patterns for matching strings in the shell."
>   => strings... would mean bytes

... and pattern matching notation works on strings (=bytes)...

> 
> - 2.13.1 Patterns Matching a Single Character however says:
>   "The following patterns matching a single character shall match a
>   single character: ordinary characters,..."

... but matches characters ...

... does that imply what you said before (i.e. any shells not ignoring
invalid characters have a bug as per POSIX and thus just '.' would be
enough)?
Or does it imply the standard is contradictory there?
Or does it imply the use of the term "string" is simply an error and it
should have been "character string"... and thus only a variable that
contains purely valid characters would work for sure?

It doesn't seem to be really standardised in any way.
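
For what it's worth, a small probe one could run to see what a given
shell actually does here. This is only a sketch: the helper name
count_bytes is made up, and the result for the current locale will
differ between shells and locales, which is the whole point. Only the
C locale line should be the same everywhere, since all characters are
single bytes there.

```shell
# '∈' is U+2208, which is three bytes in UTF-8: E2 88 88
# (written in octal for printf).
s=$(printf '\342\210\210')

# Count bytes independently of the shell's own notion of "character".
count_bytes() {
    # The arithmetic expansion drops the padding some wc versions add.
    echo $(( $(printf '%s' "$1" | wc -c) ))
}

# Remove "one character" from the end, in whatever locale is in effect.
in_current_locale=$(count_bytes "${s%?}")

# The same, forced to the C locale in a subshell so nothing leaks out.
in_c_locale=$(LC_ALL=C; export LC_ALL; count_bytes "${s%?}")

echo "current locale: $in_current_locale byte(s) left"
echo "C locale:       $in_c_locale byte(s) left"
```

A shell that matches characters in a UTF-8 locale leaves 0 bytes in the
first case, one that matches bytes leaves 2.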

So as for your point that *just* using '.' without any LC_ALL=C is
enough... it may or may not be - at least from the standard's PoV.



I'm not fully convinced of either... and the locale switch in turn -
whether it's actually needed or not - seems to bring in its own
problems as well, especially since I may have to use unset on LC_ALL
when restoring its previous state.

However, it seems that shells that support `local` vars are quite
incompatible with respect to that, and that this may even cause
troubles when one doesn't use `local` oneself (but e.g. a function
cmd_subst_with_trailing_nl() is called by someone who does).
(see
https://lore.kernel.org/dash/2a150a397fcdda7dd357dd5b125ec585acce3a70.ca...@scientia.org/T/#u
)
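
To illustrate what I mean by restoring with unset, a rough sketch of
such a function. The global result variable cmd_subst_result and the
underscore-prefixed helper variables are made up for this example. It
deliberately uses no `local`, and it does not solve the problem of a
caller who has made LC_ALL `local` himself.

```shell
# Hypothetical helper: runs "$@", leaves the output including trailing
# newlines in cmd_subst_result and returns the command's exit status.
cmd_subst_with_trailing_nl() {
    cmd_subst_result=$("$@"; _rc=$?; echo .; exit "$_rc")
    _rc=$?

    # Remember whether LC_ALL was set at all, not just its value, so
    # that "unset" and "set but empty" can be told apart later.
    if [ -n "${LC_ALL+set}" ]; then
        _old_lc_all=$LC_ALL
        _lc_all_was_set=1
    else
        _lc_all_was_set=0
    fi

    # Strip the sentinel while in the C locale.
    LC_ALL=C
    cmd_subst_result=${cmd_subst_result%.}

    # Restore the previous state, which may mean unsetting LC_ALL.
    if [ "$_lc_all_was_set" -eq 1 ]; then
        LC_ALL=$_old_lc_all
    else
        unset LC_ALL
    fi

    return "$_rc"
}

# Usage: LC_ALL was unset before the call and is unset again afterwards.
unset LC_ALL
cmd_subst_with_trailing_nl printf 'a\n\n'
status=$?
```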



One could now say:
The hell... why are you asking... just use solely "." or use it
together with LC_ALL=C ... and it will work in 99% of all cases.

And true... it likely would... but then I could also just use "x",
which would still work in 98% of all cases (is anyone still using
non-UTF-8? ^^)... but neither would solve the original problem - a
command substitution with trailing newlines that works in 100% of all
cases.

Shell scripts are used in sooo many places - not rarely in security
relevant ones... filenames may contain trailing newlines... and it may
be that an attacker tries to abuse this somehow... yet, for a script it
seems to be quite difficult to do a simple
  filename="$( basename "$pathname" )"
and be 100% (and not just 99%) sure to have gotten the real
filename.
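
To make the loss visible, here is a small sketch with a made-up
pathname whose last component ends in a newline. The sentinel variant
shown next to it is the '.' approach from above, with all the open
questions about it still applying.

```shell
# Made-up pathname; the last component is "precious" plus a newline.
pathname='/srv/uploads/precious
'
nl='
'

# Plain command substitution: every trailing newline is stripped, so
# the newline that belongs to the name is lost as well.
plain="$( basename "$pathname" )"

# Sentinel variant: the '.' protects the newlines. Afterwards drop the
# '.' and the single newline that basename itself appends.
safe="$( basename "$pathname"; echo . )"
safe=${safe%.}
safe=${safe%"$nl"}
```

Here $plain ends up as "precious", i.e. the name of a different file,
while $safe still has the trailing newline.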


Simple (made up) example of how this might be abused:
A script that e.g. runs from cron and cleans up files created by users
in some dir.
It goes through all files of a particular user (e.g. nobody, for files
resulting from anonymous uploads or so) and deletes them.
It does some processing of the pathname... like above with basename, or
e.g. with sed... perhaps stripping off a file extension or whatsoever.
An attacker somehow guesses the name of a file that should not be
deleted, uploads a file with that name plus trailing newlines, these
get lost in the process and the precious file gets removed.

Not a brilliant example, I know,... but it's 06:00 am.. and I'm bad at
making good examples ;-)



> > It should also mean, that regardless of what's chosen as sentinel
> > (e.g. '.', 'bbb' or even a multibyte '∈')... as long as these are
> > valid characters with respect to the locale/encoding in which the
> > shell parses, they should yield the same bytes all over:
> 
> Using <period> is more robust because it *additionally* covers the
> case in which you happen upon other-encoding data.

The thing with ∈ was just in order to document how that would work (in
principle) and then explain why it's actually problematic.


> > > tmp="$(command; printf ∈)"
> > > LC_ALL='C'
> > > tmp="${tmp%∈}"
> > 
> > So the printf gets the very same bytes as sentinel (whether it's
> > '.', 'bbb' or '∈') ... as does the pattern in the parameter
> > expansion, that strips off the sentinel... at least from the
> > lexical PoV.
> 
> From the lexical PoV, sure… but do consider:
> 
> LC_ALL=$value1
> foo() {
>       tmp=$(command; echo ∈)
>       tmp=${tmp%∈}
> }
> LC_ALL=$value2
> foo
> 
> In this scenario, the %∈ is parsed in $value1 locale,

Only(!) if $value1 is also the locale in which the shell was started
respectively in which the script is encoded - not when it's just set to
that value, within the script.


>  but the
> command is run in $value2 locale. On the other hand, the echo
> will still get just a string… unless ∈ suddenly contains back‐
> slashes (or percent, in your printf case… please don’t overuse
> printf(1) like that when echo suffices).

Well I'd guess both, echo and printf have the same problem here...
depending on how weird the locales get.
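
At least the percent part can be taken out of the picture by never
using the sentinel as the format itself. A tiny sketch, with a
deliberately awkward (made-up) sentinel containing a percent sign and a
backslash; it says nothing about locales that encode these characters
differently.

```shell
# Deliberately awkward, made-up sentinel.
sentinel='%s\n'

# Used as the format operand, the percent sign and the backslash are
# both interpreted by printf.
as_format=$(printf "$sentinel" X)

# Used as an argument to a fixed '%s' format, it is copied through
# unchanged.
as_argument=$(printf '%s' "$sentinel")
```

With echo, whether the backslash is interpreted depends on the
implementation.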

POSIX does not seem to forbid having a locale foo in which % is encoded
to another value than in the locale bar.
It just says that using both in the same program leads to unspecified
behaviour.




Thanks,
Chris.
