Re: sh(1): is roundtripping of the positional parameter stack possible? (Was: Re: Shell parameter expansions involving '#")

2017-05-16 Thread Jilles Tjoelker
On Tue, May 16, 2017 at 07:41:43AM +0100, Stephane Chazelas wrote:
> 2017-05-16 10:03:56 +0700, Robert Elz:
> > Date:Mon, 15 May 2017 18:36:58 +0200
> > From:Steffen Nurpmeso 
> > Message-ID:  <20170515163658.b7ljs%stef...@sdaoden.eu>
> [...]
> > Alternatively, you could implement this as an external #! script, that
> > would be a lot slower, but would at least avoid that issue.

> Or just write it as quote() (...) instead of quote() { ...;}

> > quote() {
> > case "$1" in
> > *\'*)   ;;   # the harder case, we will get to below.
> > *)  printf "'%s'" "$1"; return 0;;
> > esac
> > 
> > _save_IFS="${IFS}" # if possible just make IFS "local"
> > _save_OPTS="$(set +o)"  # quotes there not really needed.
> > IFS=\'
> > set -f
> > set -- $1
> > _result_="${1}"   # we know at least $1 and $2 exist, as there
> > shift             # was one quote in the input.
> > 
> > for __arg__
> > do
> > _result_="${_result_}'\\''${__arg__}"
> > done
> > printf "'%s'" "${_result_}"
> > 
> > # now clean up
> > 
> > IFS="${_save_IFS}"  # none of this is needed with a good "local"...
> > eval "${_save_OPTS}"
> > unset __arg__ _result_ _save_IFS _save_OPTS
> > 
> > return 0;
> > }
> [...]

> That doesn't work properly in POSIX shells. In AT&T ksh, $IFS is the
> Internal Field Delimiter (or Terminator, not "Separator") for
> splitting (as opposed to for "$*")

> That was fixed in pdksh, zsh and the Almquist shell, but
> unfortunately in the meantime, POSIX specified the AT&T ksh way
> (later, some variants of ash and pdksh have changed back to the
> POSIX way).

> So in POSIX shells both a''b' and a''b are split into ("a", "",
> "b") so that quote() would quote both to 'a'\'''\''b'

This is easily fixed by adding a quoted empty string:
set -- $1''

If $1 ends with an IFS character, you get a final empty field.
If $1 does not end with an IFS character, the empty string does nothing.
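
A minimal sketch of the effect, not taken from the thread: the helper name split_on_quote is hypothetical, and the body runs in a subshell so that the IFS and noglob changes stay local. It assumes a shell that treats IFS characters as field terminators, as POSIX specifies.

```shell
# split_on_quote STRING: split STRING on single quotes and print the number
# of resulting fields, then each field in angle brackets.
# Hypothetical helper for illustration only.
split_on_quote() (
  IFS=\'
  set -f
  set -- $1''        # the quoted empty string keeps a trailing empty field
  printf '%s:' "$#"
  printf '<%s>' "$@"
  echo
)

split_on_quote "a''b'"    # trailing quote: a final empty field is kept
split_on_quote "a''b"     # no trailing quote: the '' changes nothing
```

With the extra empty string the two inputs no longer split into the same list, which is what quote() needs in order to tell them apart.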

-- 
Jilles Tjoelker



[OT] of the merit of using awk for performance or who's got the fastest quote() (Was: sh(1): is roundtripping of the positional parameter stack possible?)

2017-05-16 Thread Stephane Chazelas
2017-05-16 17:29:13 +0100, Stephane Chazelas:
[...]
> >   | Here, I'd fire awk and quote more than one arg at a time:
> > 
> > Hmm - you're really aiming for maximum sluggishness...  I could beat that
> > by just adding a couple of sleeps ...
> 
> Depends. If quoting only a handful of arguments, then that call
> to awk might cost you a couple of milliseconds indeed. But
> if processing thousands, you might find that it saves a few
> seconds.

Actually, even for a handful of arguments, and even with gawk,
it seems it's generally quicker to use awk in my tests:

With 5 arguments (where "a" uses your quote() and "b" uses mine, see below):

(zsh syntax below)

$ szsh() (exec -a sh zsh "$@")
$ for shell (dash bash mksh szsh ksh93 yash) for file (a b) (TIMEFMT="$shell $file %*E"; time (repeat 100 $shell ./$file "a'b'c"{1..5}) > /dev/null)
dash a 0.329
dash b 0.407
bash a 0.942
bash b 0.528
mksh a 1.598
mksh b 0.540
szsh a 0.763
szsh b 0.622
ksh93 a 0.667
ksh93 b 0.464
yash a 0.738
yash b 0.429

In mksh, printf is not built-in which doesn't help. In all but
ksh93, that still does 5 forks because of the $(set +o).

dash is the only one that manages to be quicker (not if I use
mawk instead of gawk though).
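
The fork in "a" comes from the $(set +o) command substitution. A possible way to avoid it, sketched here under the assumption that noglob is the only option the function changes, is to inspect $- instead. The function name quote_nofork and the _q_ variable names are made up for this sketch; it also folds in the $1'' fix and restores an unset IFS as unset.

```shell
# quote_nofork STRING: pure-shell quoting without any command substitution.
# Hypothetical variant for illustration, not the script that was timed.
quote_nofork() {
  case $1 in
    *\'*) ;;
    *) printf "'%s'" "$1"; return 0 ;;
  esac

  # Remember whether IFS was set, its value, and whether noglob was on.
  _q_ifs_set=${IFS+y} _q_ifs=${IFS-}
  case $- in
    *f*) _q_noglob=y ;;
    *)   _q_noglob= ;;
  esac

  IFS=\'
  set -f
  set -- $1''          # '' preserves a trailing empty field
  _q_res=$1
  shift
  for _q_arg
  do
    _q_res="${_q_res}'\\''${_q_arg}"
  done

  # Restore IFS (including the unset state) and the noglob option.
  if [ -n "$_q_ifs_set" ]; then IFS=$_q_ifs; else unset IFS; fi
  [ -n "$_q_noglob" ] || set +f

  printf "'%s'" "$_q_res"
  unset _q_ifs_set _q_ifs _q_noglob _q_res _q_arg
}
```

Whether this changes the timings above has not been measured; the sketch only shows that the option state can be saved without a subshell.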

For 3000 arguments, that's where we see the real advantage of using a real
programming language instead of inadequate features of a shell :-b:

$ for shell (dash bash mksh szsh ksh93 yash) for file (a b) (TIMEFMT="$shell $file %*E"; time ($shell ./$file "a'b'c"{1..3000}) > /dev/null)
dash a 0.827
dash b 0.019
bash a 7.712
bash b 0.080
mksh a 9.928
mksh b 0.022
szsh a 2.274
szsh b 0.028
ksh93 a 1.184
ksh93 b 0.022
yash a 2.655
yash b 0.035



the scripts under test:



$ cat a
quote() {
    case "$1" in
    *\'*)   ;;   # the harder case, we will get to below.
    *)  printf "'%s'" "$1"; return 0;;
    esac

    _save_IFS="${IFS}"      # if possible just make IFS "local"
    _save_OPTS="$(set +o)"  # quotes there not really needed.
    IFS=\'
    set -f
    set -- $1
    _result_="${1}"   # we know at least $1 and $2 exist, as there
    shift             # was one quote in the input.

    for __arg__
    do
        _result_="${_result_}'\\''${__arg__}"
    done
    printf "'%s'" "${_result_}"

    # now clean up

    IFS="${_save_IFS}"  # none of this is needed with a good "local"...
    eval "${_save_OPTS}"
    unset __arg__ _result_ _save_IFS _save_OPTS

    return 0;
}
s=$(
  for i do
quote "$i"; printf ' '
  done
)
printf '%s\n' "$s"

$ cat b
quote() {
  LC_ALL=C awk -v q="'" -v b='\\' '
function quote(s) {
  gsub(q, q b q q, s)
  return q s q
}
BEGIN {
  sep = ""
  for (i = 1; i < ARGC; i++) {
    printf "%s", sep quote(ARGV[i])
    sep = " "
  }
  if (sep) print ""
}' "$@"
}
s=$(quote "$@")
printf '%s\n' "$s"

-- 
Stephane



Re: sh(1): is roundtripping of the positional parameter stack possible? (Was: Re: Shell parameter expansions involving '#")

2017-05-16 Thread Stephane Chazelas
2017-05-16 17:33:26 +0700, Robert Elz:
[...]
>   | Or just write it as quote() (...) instead of quote() { ...;}
> 
> Yes, as you would have seen later, I mentioned that in a subsequent
> message.

Sorry about that. I hadn't seen that message at the time I
replied.

[...]
>   | Here, I'd fire awk and quote more than one arg at a time:
> 
> Hmm - you're really aiming for maximum sluggishness...  I could beat that
> by just adding a couple of sleeps ...

Depends. If quoting only a handful of arguments, then that call
to awk might cost you a couple of milliseconds indeed. But
if processing thousands, you might find that it saves a few
seconds.


> I deliberately did not do multiple arg quoting, as what you want in
> that case depends upon the application, just quoting each separately
> is not necessarily the desired result.   And given the ability to quote
> a single string, adding the mechanism to quote multiple strings is
> not very hard ...(call the function over and over) and you get to
> deal with the multiple results in whatever way your application needs.

My quote() works like your quote() when passed a single
argument, mine can take more than one and still produce a
useful outcome (and helps with performance).
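
For comparison, the "call the function over and over" approach could look roughly like the sketch below. The names quote_one and quote_all are hypothetical, and quote_one uses plain parameter expansion rather than the IFS-splitting technique discussed in this thread. The output format (items joined by single spaces, newline only if there was at least one item) mirrors the awk version.

```shell
# quote_one STRING: single-argument quoting using only parameter expansion
# (a swapped-in technique for this sketch).
quote_one() {
  _q_in=$1 _q_out=
  while :
  do
    case $_q_in in
      *\'*)
        _q_head=${_q_in%%\'*}        # text before the first single quote
        _q_in=${_q_in#*\'}           # text after the first single quote
        _q_out="${_q_out}${_q_head}'\\''"
        ;;
      *) break ;;
    esac
  done
  printf "'%s'" "${_q_out}${_q_in}"
  unset _q_in _q_out _q_head
}

# quote_all ARGS...: quote each argument separately, joined by spaces.
quote_all() {
  _q_sep=
  for _q_item
  do
    printf '%s' "$_q_sep"
    quote_one "$_q_item"
    _q_sep=' '
  done
  [ -z "$_q_sep" ] || echo
  unset _q_sep _q_item
}
```

This does one shell loop iteration per argument and per embedded quote, which is where the awk version is expected to gain on long argument lists.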

> What is slightly harder to fix, but can be done if you really wanted it,
> is to omit redundant (quoting) 's in the result, so we don't end up
> with stuff like
> 
>   'a'\'''\'''\''b'
> when
>   a\'\'\'b
> 
> is all that is really needed...   If the aim is just, as was originally
> stated (save & restore,) then it doesn't matter, but if you are ever
> going to show the result to a human, it does.

When quoting shell code, it's better to quote everything as the
parsing depends on the locale (and with single quote as that's a
safe character in most usable encodings)

There are shell quoting libraries out there that try to be smart
by not quoting everything but they end up introducing
vulnerabilities. 

See for instance this bug in perl's String::ShellQuote
https://rt.cpan.org/Public/Bug/Display.html?id=118508

>   | Using LC_ALL=C on the assumption that the encoding of ' (0x27 in
> 
> This is the shell, there is exactly one single quote character, and it
> is that one.   The data can be anything, the characters used in the
> syntax elements cannot.  Nor do non-ascii chars ever expand to anything
> or have any meaning different from themselves as a data char.
> 
> If we start having shell parsing differently depending on what locale the
> user happens to be using, we may as well all give up now, and go find
> something else to do.

Yes, as I said, single-quote is safe in all charsets on my
system. But backslash and backtick are not for instance.

On one given POSIX system, ` and \ being part of the portable
character set are guaranteed to be encoded the same in every
charset supported on the system. For instance, on ASCII-based
systems (the norm nowadays), \ is 0x5c. The shell syntax
(POSIX operators and keywords) use only characters from the
portable charset, but that is not to say that 0x5c cannot be
found in the multi-byte encoding of other characters.

For instance the α character in BIG5-HKSCS is encoded as
0xa3 0x5c. In POSIX shells like bash or ksh93 (also zsh), in a
zh_HK.big5hkscs Hong Kong locale,

echo α

would /work/. It would not issue a PS2 prompt because of that
trailing 0x5c byte (\ in ASCII and in BIG5-HKSCS).

Yet, if you did a LC_ALL=C sed 's/\\/&&/g' on that α, it would
effectively turn it into α\ which would make things worse.

The single quote character doesn't have such a problem.
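
The byte-level effect can be made visible without needing the Hong Kong locale installed. This is only a sketch: the helper name double_backslash_hex is hypothetical, and od and tr are there just to print the bytes in hex.

```shell
# double_backslash_hex: emit the bytes 0xa3 0x5c (the BIG5-HKSCS encoding of
# α given above) plus a newline, double every 0x5c byte the way a
# byte-oriented sed does, and print the resulting bytes in hex.
double_backslash_hex() {
  printf '\243\134\n' |
    LC_ALL=C sed 's/\\/&&/g' |
    od -An -tx1 |
    tr -d ' \n'
  echo
}

double_backslash_hex    # prints a35c5c0a
```

The second 5c is the stray byte: read back in the BIG5-HKSCS locale, the result is α followed by a lone backslash.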

(and yes, before you mention it, I agree, all multi-byte
character sets other than UTF-8 should really be retired as
they're a source of countless issues).

> 
>   | Also note that if $IFS was previously unset upon calling your
>   | quote() (as is common when you want to restore splitting to its
>   | default behaviour), it would leave it assigned an empty value
>   | (which means "no splitting").
> 
> Yes, mentioned that in my following message too.

Sorry again.

> stephane.chaze...@gmail.com in another message said:
>   | No, the split+glob operator that is done upon unquoted parameter expansion
>   | (or command substitution or arithmetic expansion) is completely different
>   | from the shell syntax parsing. It is not affected by quotes. 
> 
> I'm not sure what point you were making there, but all I was saying was
> that in my original (test, not in the function) I did
> 
>   y=$(quote $x)
> 
> (by accident, I normally quote everything.)   That version doesn't
> work properly at all - of course (depending upon what $x expands to
> of course).   When quoted ("$x") it does work.  That is, when quoted
> like this the value of $x is certainly a single arg to the quote
> function, and any glob meta chars in it will just be themselves,
> not expanded as file names, which the unquoted version would do.
[...]

Sorry that's probably my misinterpretation of:

kre> $ y=$(quote "$x")

Re: sh(1): is roundtripping of the positional parameter stack possible? (Was: Re: Shell parameter expansions involving '#")

2017-05-16 Thread Chet Ramey
On 5/16/17 6:33 AM, Robert Elz wrote:

> If we start having shell parsing differently depending on what locale the
> user happens to be using, we may as well all give up now, and go find
> something else to do.

Too late:

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_03

7. If the current character is an unquoted <blank>, any token containing
the previous character is delimited and the current character shall be
discarded.

<blank> is locale-dependent:

3.74 Blank Character (<blank>)

One of the characters that belong to the blank character class as defined
via the LC_CTYPE category in the current locale. In the POSIX locale, a
<blank> character is either a <tab> or a <space>.


-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/



Re: sh(1): is roundtripping of the positional parameter stack possible? (Was: Re: Shell parameter expansions involving '#")

2017-05-16 Thread Robert Elz
Date:Tue, 16 May 2017 07:41:43 +0100
From:Stephane Chazelas 
Message-ID:  <20170516064143.ga3...@chaz.gmail.com>

  | Or just write it as quote() (...) instead of quote() { ...;}

Yes, as you would have seen later, I mentioned that in a subsequent
message.

  | That doesn't work properly in POSIX shells.

Hmm, yes you're right (I did say it might not be bug free and had not
been tested much) - but that's relatively easy to fix, and does not
need ...

  | Here, I'd fire awk and quote more than one arg at a time:

Hmm - you're really aiming for maximum sluggishness...  I could beat that
by just adding a couple of sleeps ...

And if I wanted slow, but not quite that slow, I'd much prefer sed to
awk for generic string processing - but (I assumed) that the aim here would
be to produce something quick enough to use frequently, and for most
relatively simple string processing, which doesn't need REs, what the
shell offers is just fine.

I deliberately did not do multiple arg quoting, as what you want in
that case depends upon the application, just quoting each separately
is not necessarily the desired result.   And given the ability to quote
a single string, adding the mechanism to quote multiple strings is
not very hard ...(call the function over and over) and you get to
deal with the multiple results in whatever way your application needs.

What is slightly harder to fix, but can be done if you really wanted it,
is to omit redundant (quoting) 's in the result, so we don't end up
with stuff like

'a'\'''\'''\''b'
when
a\'\'\'b

is all that is really needed...   If the aim is just, as was originally
stated (save & restore,) then it doesn't matter, but if you are ever
going to show the result to a human, it does.


  | Using LC_ALL=C on the assumption that the encoding of ' (0x27 in

This is the shell, there is exactly one single quote character, and it
is that one.   The data can be anything, the characters used in the
syntax elements cannot.  Nor do non-ascii chars ever expand to anything
or have any meaning different from themselves as a data char.

If we start having shell parsing differently depending on what locale the
user happens to be using, we may as well all give up now, and go find
something else to do.

  | Also note that if $IFS was previously unset upon calling your
  | quote() (as is common when you want to restore splitting to its
  | default behaviour), it would leave it assigned an empty value
  | (which means "no splitting").

Yes, mentioned that in my following message too.

stephane.chaze...@gmail.com in another message said:
  | No, the split+glob operator that is done upon unquoted parameter expansion
  | (or command substitution or arithmetic expansion) is completely different
  | from the shell syntax parsing. It is not affected by quotes. 

I'm not sure what point you were making there, but all I was saying was
that in my original (test, not in the function) I did

y=$(quote $x)

(by accident, I normally quote everything.)   That version doesn't
work properly at all - of course (depending upon what $x expands to
of course).   When quoted ("$x") it does work.  That is, when quoted
like this the value of $x is certainly a single arg to the quote
function, and any glob meta chars in it will just be themselves,
not expanded as file names, which the unquoted version would do.

kre



Re: sh(1): is roundtripping of the positional parameter stack possible? (Was: Re: Shell parameter expansions involving '#")

2017-05-16 Thread Steffen Nurpmeso
"Schwarz, Konrad"  wrote:
 |> -Original Message-
 |> From: Stephane Chazelas [mailto:stephane.chaze...@gmail.com]
 |> To: Robert Elz
 |> Cc: Steffen Nurpmeso; austin-group-l@opengroup.org
 |> Subject: Re: sh(1): is roundtripping of the positional parameter stack
 |
 |> Here, I'd fire awk and quote more than one arg at a time:
 |> 
 |> quote() {
 |>   LC_ALL=C awk -v q="'" -v b='\\' '
 |> function quote(s) {
 |>   gsub(q, q b q q, s)
 |>   return q s q
 |>}
 |> BEGIN {
 |>   sep = ""
 |>   for (i = 1; i < ARGC; i++) {
 |> printf "%s", sep quote(ARGV[i])
 |>  sep = " "
 |>}
 |>   if (sep) print ""
 |>}' "$@"
 |>}
 |
 |> Also note that if $IFS was previously unset upon calling your
 |> quote() (as is common when you want to restore splitting to its default
 |> behaviour), it would leave it assigned an empty value (which means "no
 |> splitting"). One common way to address  that is to do:
 |> 
 |>_save_IFS=$IFS; ${IFS+":"} unset _save_IFS
 |>...
 |>IFS=$_save_IFS; ${_save_IFS+":"} unset IFS
 |
 |Really excellent work all around -- I'm very impressed.

I concur.
Thanks, for all the answers!

--steffen
|
|Ralph says i must not use signatures which spread the light!



RE: sh(1): is roundtripping of the positional parameter stack possible? (Was: Re: Shell parameter expansions involving '#")

2017-05-16 Thread Schwarz, Konrad
> -Original Message-
> From: Stephane Chazelas [mailto:stephane.chaze...@gmail.com]
> To: Robert Elz
> Cc: Steffen Nurpmeso; austin-group-l@opengroup.org
> Subject: Re: sh(1): is roundtripping of the positional parameter stack

> Here, I'd fire awk and quote more than one arg at a time:
> 
> quote() {
>   LC_ALL=C awk -v q="'" -v b='\\' '
> function quote(s) {
>   gsub(q, q b q q, s)
>   return q s q
> }
> BEGIN {
>   sep = ""
>   for (i = 1; i < ARGC; i++) {
> printf "%s", sep quote(ARGV[i])
>   sep = " "
>   }
>   if (sep) print ""
> }' "$@"
> }

> Also note that if $IFS was previously unset upon calling your
> quote() (as is common when you want to restore splitting to its default
> behaviour), it would leave it assigned an empty value (which means "no
> splitting"). One common way to address  that is to do:
> 
>_save_IFS=$IFS; ${IFS+":"} unset _save_IFS
>...
>IFS=$_save_IFS; ${_save_IFS+":"} unset IFS

Really excellent work all around -- I'm very impressed.

Konrad



Re: sh(1): is roundtripping of the positional parameter stack possible? (Was: Re: Shell parameter expansions involving '#")

2017-05-16 Thread Stephane Chazelas
2017-05-16 10:03:56 +0700, Robert Elz:
[...]
> $ y=$(quote "$x") 
[...]
> Just remember to always quote variable references "$x" unless you are
> 100% certain what the content of the variable is, eg: as above with $y
> where we know it is the result of the quote function, so is safe.
[...]

No, the split+glob operator that is done upon unquoted parameter
expansion (or command substitution or arithmetic expansion) is
completely different from the shell syntax parsing. It is not
affected by quotes.


a="'a  b'"
printf '<%s>\n' $a

Still outputs (assuming the default value of $IFS):

<'a>
<b'>

And

touch "'a'" "'b'"
a="'?'"
echo $a

still outputs

'a' 'b'

(and with a="' * '" you'd list the current directory)

Those variables output by quote() are intended to be passed to
eval, but still need to be quoted:

eval "set -- $a"

certainly *not* eval set -- $a, which because of the glob part
(or if ' or possibly \ was in $IFS) would  be a command
injection vulnerability (if the content of $a was not
controlled).

You'd leave a variable unquoted if you wanted it to be either
split or globbed or both, but would then need to set $IFS and/or
disable globbing.
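
As a sketch of that last point: deliberately leaving an expansion unquoted in order to split it, with $IFS set explicitly and globbing disabled. The helper name split_list is hypothetical; the body is a subshell so neither setting leaks into the caller.

```shell
# split_list SEP STRING: print each SEP-delimited element of STRING on its
# own line. Hypothetical helper for illustration.
split_list() (
  IFS=$1
  set -f                 # disable the glob part of split+glob
  for _elem in $2        # unquoted on purpose: we want the split
  do
    printf '<%s>\n' "$_elem"
  done
)

split_list : '/bin:*:/usr/bin'
```

Without the set -f, the * element would be replaced by the names of the files in the current directory.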

-- 
Stephane




Re: sh(1): is roundtripping of the positional parameter stack possible? (Was: Re: Shell parameter expansions involving '#")

2017-05-16 Thread Stephane Chazelas
2017-05-16 10:03:56 +0700, Robert Elz:
> Date:Mon, 15 May 2017 18:36:58 +0200
> From:Steffen Nurpmeso 
> Message-ID:  <20170515163658.b7ljs%stef...@sdaoden.eu>
[...]
> Alternatively, you could implement this as an external #! script, that
> would be a lot slower, but would at least avoid that issue.

Or just write it as quote() (...) instead of quote() { ...;}

> quote() {
>   case "$1" in
>   *\'*)   ;;   # the harder case, we will get to below.
>   *)  printf "'%s'" "$1"; return 0;;
>   esac
> 
>   _save_IFS="${IFS}" # if possible just make IFS "local"
>   _save_OPTS="$(set +o)"  # quotes there not really needed.
>   IFS=\'
>   set -f
>   set -- $1
>   _result_="${1}"   # we know at least $1 and $2 exist, as there
>   shift             # was one quote in the input.
> 
>   for __arg__
>   do
>   _result_="${_result_}'\\''${__arg__}"
>   done
>   printf "'%s'" "${_result_}"
> 
>   # now clean up
> 
>   IFS="${_save_IFS}"  # none of this is needed with a good "local"...
>   eval "${_save_OPTS}"
>   unset __arg__ _result_ _save_IFS _save_OPTS
> 
>   return 0;
> }
[...]

That doesn't work properly in POSIX shells. In AT&T ksh, $IFS is the
Internal Field Delimiter (or Terminator, not "Separator") for
splitting (as opposed to for "$*")

That was fixed in pdksh, zsh and the Almquist shell, but
unfortunately in the meantime, POSIX specified the AT&T ksh way
(later, some variants of ash and pdksh have changed back to the
POSIX way).

So in POSIX shells both a''b' and a''b are split into ("a", "",
"b") so that quote() would quote both to 'a'\'''\''b'

Here, I'd fire awk and quote more than one arg at a time:

quote() {
  LC_ALL=C awk -v q="'" -v b='\\' '
function quote(s) {
  gsub(q, q b q q, s)
  return q s q
}
BEGIN {
  sep = ""
  for (i = 1; i < ARGC; i++) {
    printf "%s", sep quote(ARGV[i])
    sep = " "
  }
  if (sep) print ""
}' "$@"
}

Using LC_ALL=C on the assumption that the encoding of ' (0x27 in
ASCII) is not found in any other character in the locale of the
user (0x27 is not found in any character of any charset (other
than ' of course) on my system). Using LC_ALL=C is to ensure
that awk still manages to handle sequence of bytes that wouldn't
form valid characters in the user's locale but that most shells
are still able to store in their parameters.

So you can do:

set -- x 'a b' "foo'" "'bar" $'A\x80B'
saved_parameters=$(quote "$@")
eval "set -- $saved_parameters"
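
One way to check the round trip, as a sketch: quote() is the awk function from this message, while roundtrip_ok is a hypothetical helper that compares the argument list before and after. The $'A\x80B' form is not available in every sh, so the check below builds the odd byte with printf instead.

```shell
quote() {
  LC_ALL=C awk -v q="'" -v b='\\' '
    function quote(s) {
      gsub(q, q b q q, s)
      return q s q
    }
    BEGIN {
      sep = ""
      for (i = 1; i < ARGC; i++) {
        printf "%s", sep quote(ARGV[i])
        sep = " "
      }
      if (sep) print ""
    }' "$@"
}

# roundtrip_ok ARGS...: succeed if quoting ARGS and evaluating the result
# gives back the same number of arguments with the same values.
# Hypothetical helper for illustration.
roundtrip_ok() {
  _rt_before=$(printf '<%s>\n' "$#" "$@")
  _rt_saved=$(quote "$@") || return
  eval "set -- $_rt_saved"
  _rt_after=$(printf '<%s>\n' "$#" "$@")
  [ "$_rt_before" = "$_rt_after" ]
}

roundtrip_ok x 'a b' "foo'" "'bar" "$(printf 'A\200B')" && echo ok
```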

Also note that if $IFS was previously unset upon calling your
quote() (as is common when you want to restore splitting to its
default behaviour), it would leave it assigned an empty value
(which means "no splitting"). One common way to address  that is
to do:

   _save_IFS=$IFS; ${IFS+":"} unset _save_IFS
   ...
   IFS=$_save_IFS; ${_save_IFS+":"} unset IFS
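
Wrapped into a pair of helpers, the idiom might look like the sketch below. The names save_ifs and restore_ifs are hypothetical; the two expansion lines are the ones shown above.

```shell
# save_ifs: remember the current IFS, including whether it is unset.
save_ifs() {
  _save_IFS=$IFS
  # If IFS is set, this line expands to ": unset _save_IFS" (a no-op);
  # if IFS is unset, it expands to "unset _save_IFS".
  ${IFS+":"} unset _save_IFS
}

# restore_ifs: put IFS back the way save_ifs found it.
restore_ifs() {
  IFS=$_save_IFS
  # If _save_IFS is unset, IFS was originally unset, so unset it again.
  ${_save_IFS+":"} unset IFS
  unset _save_IFS
}

unset IFS
save_ifs
IFS=:
restore_ifs
echo "${IFS+IFS is set}${IFS-IFS is unset}"
```

The quotes around ":" matter: they keep the colon from being field-split away when IFS itself contains a colon.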

-- 
Stephane