Re: sh(1): is roundtripping of the positional parameter stack possible? (Was: Re: Shell parameter expansions involving '#")
On Tue, May 16, 2017 at 07:41:43AM +0100, Stephane Chazelas wrote:
> 2017-05-16 10:03:56 +0700, Robert Elz:
> >     Date:        Mon, 15 May 2017 18:36:58 +0200
> >     From:        Steffen Nurpmeso
> >     Message-ID:  <20170515163658.b7ljs%stef...@sdaoden.eu>
> [...]
> > Alternatively, you could implement this as an external #! script, that
> > would be a lot slower, but would at least avoid that issue.

> Or just write it as quote() (...) instead of quote() { ...;}

> > quote() {
> >     case "$1" in
> >     *\'*)   ;;      # the harder case, we will get to below.
> >     *)      printf "'%s'" "$1"; return 0;;
> >     esac
> >
> >     _save_IFS="${IFS}"      # if possible just make IFS "local"
> >     _save_OPTS="$(set +o)"  # quotes there not really needed.
> >     IFS=\'
> >     set -f
> >     set -- $1
> >     _result_="${1}"         # we know at least $1 and $2 exist, as there
> >     shift                   # was one quote in the input.
> >
> >     for __arg__
> >     do
> >         _result_="${_result_}'\\''${__arg__}"
> >     done
> >     printf "'%s'" "${_result_}"
> >
> >     # now clean up
> >
> >     IFS="${_save_IFS}"      #none of this is needed with a good "local"...
> >     eval "${_save_OPTS}"
> >     unset __arg__ _result_ _save_IFS _save_OPTS
> >
> >     return 0;
> > }
> [...]

> That doesn't work properly in POSIX shells. In AT&T ksh, $IFS is
> Internal Field Delimiter (or Terminator, not "Separator") for
> splitting (as opposed to for "$*")

> That was fixed in pdksh, zsh and the Almquist shell, but
> unfortunately in the meantime, POSIX specified the AT&T ksh way
> (later, some variants of ash and pdksh have changed back to the
> POSIX way).

> So in POSIX shells both a''b' and a''b are split into ("a", "",
> "b") so that quote() would quote both to 'a'\'''\''b'

This is easily fixed by adding a quoted empty string:

  set -- $1''

If $1 ends with an IFS character, you get a final empty field. If $1
does not end with an IFS character, the empty string does nothing.

-- 
Jilles Tjoelker
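To make the effect of the appended '' visible, here is a minimal sketch. The helper name count_fields and its second argument are made up for illustration; the function body runs in a subshell so the IFS and set -f changes do not leak. It only reports how many fields the split produces, which is enough to see whether a trailing single quote survives.

```shell
# Hypothetical helper: split "$1" on single quotes and print the number
# of resulting fields. "$2" selects whether the '' suffix is appended.
count_fields() (
    IFS=\'
    set -f
    if [ "$2" = fixed ]; then
        set -- $1''     # quoted empty string keeps a trailing empty field
    else
        set -- $1       # terminator-style shells drop the trailing field
    fi
    echo "$#"
)

plain_trailing=$(count_fields "a'b'" plain)
fixed_trailing=$(count_fields "a'b'" fixed)
fixed_inner=$(count_fields "a'b" fixed)

echo "a'b' plain=$plain_trailing fixed=$fixed_trailing; a'b fixed=$fixed_inner"
```

The "plain" count depends on whether the shell treats IFS characters as terminators or separators, which is exactly the difference discussed above. The "fixed" counts should be 3 and 2 either way.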
[OT] of the merit of using awk for performance or who's got the fastest quote() (Was: sh(1): is roundtripping of the positional parameter stack possible?)
2017-05-16 17:29:13 +0100, Stephane Chazelas:
[...]
> >   | Here, I'd fire awk and quote more than one arg at a time:
> >
> > Hmm - you're really aiming for maximum sluggishness...   I could beat that
> > by just adding a couple of sleeps ...
>
> Depends. If quoting only a handful of arguments, then that call
> to awk might cost you a couple of milliseconds indeed. But
> if processing thousands, you might find that it saves a few
> seconds.

Actually, even for a handful of arguments, and even with gawk,
it seems it's generally quicker to use awk in my tests:

With 5 arguments (where "a" uses your quote() and "b" uses mine,
see below):

(zsh syntax below)

$ szsh() (exec -a sh zsh "$@")
$ for shell (dash bash mksh szsh ksh93 yash) for file (a b) (TIMEFMT="$shell $file %*E"; time (repeat 100 $shell ./$file "a'b'c"{1..5}) > /dev/null)
dash a 0.329
dash b 0.407
bash a 0.942
bash b 0.528
mksh a 1.598
mksh b 0.540
szsh a 0.763
szsh b 0.622
ksh93 a 0.667
ksh93 b 0.464
yash a 0.738
yash b 0.429

In mksh, printf is not built-in which doesn't help. In all but
ksh93, that still does 5 forks because of the $(set +o).

dash is the only one that manages to be quicker (not if I use
mawk instead of gawk though).

For 3000 arguments, that's where we see the real advantage of
using a real programming language instead of inadequate features
of a shell :-b:

$ for shell (dash bash mksh szsh ksh93 yash) for file (a b) (TIMEFMT="$shell $file %*E"; time ($shell ./$file "a'b'c"{1..3000}) > /dev/null)
dash a 0.827
dash b 0.019
bash a 7.712
bash b 0.080
mksh a 9.928
mksh b 0.022
szsh a 2.274
szsh b 0.028
ksh93 a 1.184
ksh93 b 0.022
yash a 2.655
yash b 0.035

the scripts under test:

$ cat a
quote() {
    case "$1" in
    *\'*)   ;;      # the harder case, we will get to below.
    *)      printf "'%s'" "$1"; return 0;;
    esac

    _save_IFS="${IFS}"      # if possible just make IFS "local"
    _save_OPTS="$(set +o)"  # quotes there not really needed.
    IFS=\'
    set -f
    set -- $1
    _result_="${1}"         # we know at least $1 and $2 exist, as there
    shift                   # was one quote in the input.

    for __arg__
    do
        _result_="${_result_}'\\''${__arg__}"
    done
    printf "'%s'" "${_result_}"

    # now clean up

    IFS="${_save_IFS}"      #none of this is needed with a good "local"...
    eval "${_save_OPTS}"
    unset __arg__ _result_ _save_IFS _save_OPTS

    return 0;
}

s=$(
  for i do
    quote "$i"; printf ' '
  done
)
printf '%s\n' "$s"
$ cat b
quote() {
  LC_ALL=C awk -v q="'" -v b='\\' '
    function quote(s) {
      gsub(q, q b q q, s)
      return q s q
    }
    BEGIN {
      sep = ""
      for (i = 1; i < ARGC; i++) {
        printf "%s", sep quote(ARGV[i])
        sep = " "
      }
      if (sep) print ""
    }' "$@"
}
s=$(quote "$@")
printf '%s\n' "$s"

-- 
Stephane
Re: sh(1): is roundtripping of the positional parameter stack possible? (Was: Re: Shell parameter expansions involving '#")
2017-05-16 17:33:26 +0700, Robert Elz:
[...]
>   | Or just write it as quote() (...) instead of quote() { ...;}
>
> Yes, as you would have seen later, I mentioned that in a subsequent
> message.

Sorry about that. I hadn't seen that message at the time I
replied.

[...]
>   | Here, I'd fire awk and quote more than one arg at a time:
>
> Hmm - you're really aiming for maximum sluggishness...   I could beat that
> by just adding a couple of sleeps ...

Depends. If quoting only a handful of arguments, then that call
to awk might cost you a couple of milliseconds indeed. But
if processing thousands, you might find that it saves a few
seconds.

> I deliberately did not do multiple arg quoting, as what you want in
> that case depends upon the application, just quoting each separately
> is not necessarily the desired result. And given the ability to quote
> a single string, adding the mechanism to quote multiple strings is
> not very hard ...(call the function over and over) and you get to
> deal with the multiple results in whatever way your application needs.

My quote() works like your quote() when passed a single
argument; mine can take more than one and still produce a useful
outcome (and helps with performance).

> What is slightly harder to fix, but can be done if you really wanted it,
> is to omit redundant (quoting) 's in the result, so we don't end up
> with stuff like
>
>     'a'\'''\'''\''b'
> when
>     a\'\'\'b
>
> is all that is really needed...   If the aim is just, as was originally
> stated (save & restore,) then it doesn't matter, but if you are ever
> going to show the result to a human, it does.

When quoting shell code, it's better to quote everything, as the
parsing depends on the locale (and with single quotes, as that's a
safe character in most usable encodings).

There are shell quoting libraries out there that try to be smart
by not quoting everything but they end up introducing
vulnerabilities. See for instance this bug in perl's
String::ShellQuote:

https://rt.cpan.org/Public/Bug/Display.html?id=118508

>   | Using LC_ALL=C on the assumption that the encoding of ' (0x27 in
>
> This is the shell, there is exactly one single quote character, and it
> is that one. The data can be anything, the characters used in the
> syntax elements cannot. Nor do non-ascii chars ever expand to anything
> or have any meaning different from themselves as a data char.
>
> If we start having shell parsing differently depending on what locale the
> user happens to be using, we may as well all give up now, and go find
> something else to do.

Yes, as I said, single-quote is safe in all charsets on my
system. But backslash and backtick are not for instance.

On one given POSIX system, ` and \ being part of the portable
character set are guaranteed to be encoded the same in every
charset supported on the system. For instance, on ASCII-based
systems (the norm nowadays), \ is 0x5c.

The shell syntax (POSIX operators and keywords) uses only
characters from the portable charset, but that is not to say
that 0x5c cannot be found in the multi-byte encoding of other
characters.

For instance the α character in BIG5-HKSCS is encoded as 0xa3
0x5c.

In POSIX shells like bash or ksh93 (also zsh), in a
zh_HK.big5hkscs Hong Kong locale,

echo α

would /work/. It would not issue a PS2 prompt despite that
trailing 0x5c byte (\ in ASCII and in BIG5-HKSCS).

Yet, if you did a LC_ALL=C sed 's/\\/&&/g' on that α, it would
effectively turn it into α\ which would make things worse.

The single quote character doesn't have such a problem.

(and yes, before you mention it, I agree, all multi-byte
character sets other than UTF-8 should really be retired as
they're a source of countless issues).
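The byte-level effect described above can be reproduced without a BIG5-HKSCS locale installed. The following is only a sketch: it writes the two bytes 0xa3 0x5c with octal escapes (so the script itself stays ASCII), runs them through the same byte-wise sed, and dumps the result in hex. The variable names are made up for illustration.

```shell
# The two bytes 0xa3 0x5c are the BIG5-HKSCS encoding of the character
# mentioned above; octal escapes keep this script locale-independent.
alpha_bytes=$(printf '\243\134')

# Byte-wise backslash doubling, then a hex dump with spaces and newlines
# removed so the result is a single word.
escaped_hex=$(printf '%s\n' "$alpha_bytes" |
    LC_ALL=C sed 's/\\/&&/g' |
    od -An -tx1 | tr -d ' \n')

# Prints a35c5c0a: the character's own second byte was "escaped", leaving
# a stray 0x5c after it (0a is the trailing newline).
echo "$escaped_hex"
```

In a locale where 0xa3 0x5c is one character, the extra 0x5c byte is a real backslash following that character, which is the "α\" result described in the message.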
>   | Also note that if $IFS was previously unset upon calling your
>   | quote() (as is common when you want to restore splitting to its
>   | default behaviour), it would leave it assigned an empty value
>   | (which means "no splitting").
>
> Yes, mentioned that in my following message too.

Sorry again.

> stephane.chaze...@gmail.com in another message said:
>   | No, the split+glob operator that is done upon unquoted parameter expansion
>   | (or command substitution or arithmetic expansion) is completely different
>   | from the shell syntax parsing. It is not affected by quotes.
>
> I'm not sure what point you were making there, but all I was saying was
> that in my original (test, not in the function) I did
>
>     y=$(quote $x)
>
> (by accident, I normally quote everything.) That version doesn't
> work properly at all - of course (depending upon what $x expands to
> of course). When quoted ("$x") it does work. That is, when quoted
> like this the value of $x is certainly a single arg to the quote
> function, and any glob meta chars in it will just be themselves,
> not expanded as file names, which the unquoted version would do.
[...]

Sorry that's probably my misinterpretation of:

kre> $ y=$(quote "$x")
Re: sh(1): is roundtripping of the positional parameter stack possible? (Was: Re: Shell parameter expansions involving '#")
On 5/16/17 6:33 AM, Robert Elz wrote:

> If we start having shell parsing differently depending on what locale the
> user happens to be using, we may as well all give up now, and go find
> something else to do.

Too late:

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_03

7. If the current character is an unquoted <blank>, any token containing
   the previous character is delimited and the current character shall
   be discarded.

<blank> is locale-dependent:

3.74 Blank Character (<blank>)

One of the characters that belong to the blank character class as defined
via the LC_CTYPE category in the current locale. In the POSIX locale, a
<blank> character is either a <tab> or a <space>.

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
		 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/
Re: sh(1): is roundtripping of the positional parameter stack possible? (Was: Re: Shell parameter expansions involving '#")
    Date:        Tue, 16 May 2017 07:41:43 +0100
    From:        Stephane Chazelas
    Message-ID:  <20170516064143.ga3...@chaz.gmail.com>

  | Or just write it as quote() (...) instead of quote() { ...;}

Yes, as you would have seen later, I mentioned that in a subsequent
message.

  | That doesn't work properly in POSIX shells.

Hmm, yes you're right (I did say it might not be bug free and had not
been tested much) - but that's relatively easy to fix, and does not
need ...

  | Here, I'd fire awk and quote more than one arg at a time:

Hmm - you're really aiming for maximum sluggishness...   I could beat that
by just adding a couple of sleeps ...

And if I wanted slow, but not quite that slow, I'd much prefer sed to awk
for generic string processing - but (I assumed) that the aim here would be
to produce something quick enough to use frequently, and for most
relatively simple string processing, which doesn't need REs, what the
shell offers is just fine.

I deliberately did not do multiple arg quoting, as what you want in
that case depends upon the application, just quoting each separately
is not necessarily the desired result. And given the ability to quote
a single string, adding the mechanism to quote multiple strings is
not very hard ...(call the function over and over) and you get to
deal with the multiple results in whatever way your application needs.

What is slightly harder to fix, but can be done if you really wanted it,
is to omit redundant (quoting) 's in the result, so we don't end up
with stuff like

    'a'\'''\'''\''b'
when
    a\'\'\'b

is all that is really needed...   If the aim is just, as was originally
stated (save & restore,) then it doesn't matter, but if you are ever
going to show the result to a human, it does.

  | Using LC_ALL=C on the assumption that the encoding of ' (0x27 in

This is the shell, there is exactly one single quote character, and it
is that one. The data can be anything, the characters used in the
syntax elements cannot. Nor do non-ascii chars ever expand to anything
or have any meaning different from themselves as a data char.

If we start having shell parsing differently depending on what locale the
user happens to be using, we may as well all give up now, and go find
something else to do.

  | Also note that if $IFS was previously unset upon calling your
  | quote() (as is common when you want to restore splitting to its
  | default behaviour), it would leave it assigned an empty value
  | (which means "no splitting").

Yes, mentioned that in my following message too.

stephane.chaze...@gmail.com in another message said:
  | No, the split+glob operator that is done upon unquoted parameter expansion
  | (or command substitution or arithmetic expansion) is completely different
  | from the shell syntax parsing. It is not affected by quotes.

I'm not sure what point you were making there, but all I was saying was
that in my original (test, not in the function) I did

    y=$(quote $x)

(by accident, I normally quote everything.) That version doesn't
work properly at all - of course (depending upon what $x expands to
of course). When quoted ("$x") it does work. That is, when quoted
like this the value of $x is certainly a single arg to the quote
function, and any glob meta chars in it will just be themselves,
not expanded as file names, which the unquoted version would do.

kre
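For reference, the quote() (...) form mentioned at the top of this message could look roughly like the sketch below. This is not code posted by anyone in the thread: it assumes the same split-on-quote approach, runs the body in a subshell so that IFS and set -f need no restoring, and appends '' to the expansion as suggested elsewhere in the thread to keep a trailing empty field. The variable names result and arg are illustrative.

```shell
# Sketch only: subshell-bodied variant of the split-on-quote approach.
quote() (
    case $1 in
    *\'*) ;;                             # contains a quote: handled below
    *)    printf "'%s'" "$1"; return 0;;
    esac

    IFS=\'                               # changes stay inside the subshell
    set -f
    set -- $1''                          # '' preserves a trailing empty field
    result=$1
    shift

    for arg
    do
        result="$result'\\''$arg"        # rejoin the pieces with '\''
    done
    printf "'%s'" "$result"
)

x="a'b'"
y=$(quote "$x")
eval "z=$y"                              # z should now equal x
printf '%s -> %s -> %s\n' "$x" "$y" "$z"
```

The price of the subshell is a fork per call in most shells, which matters for the performance comparison made elsewhere in the thread.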
Re: sh(1): is roundtripping of the positional parameter stack possible? (Was: Re: Shell parameter expansions involving '#")
"Schwarz, Konrad"wrote: |> -Original Message- |> From: Stephane Chazelas [mailto:stephane.chaze...@gmail.com] |> To: Robert Elz |> Cc: Steffen Nurpmeso; austin-group-l@opengroup.org |> Subject: Re: sh(1): is roundtripping of the positional parameter stack | |> Here, I'd fire awk and quote more than one arg at a time: |> |> quote() { |> LC_ALL=C awk -v q="'" -v b='\\' ' |> function quote(s) { |> gsub(q, q b q q, s) |> return q s q |>} |> BEGIN { |> sep = "" |> for (i = 1; i < ARGC; i++) { |> printf "%s", sep quote(ARGV[i]) |> sep = " " |>} |> if (sep) print "" |>}' "$@" |>} | |> Also note that if $IFS was previously unset upon calling your |> quote() (as is common when you want to restore splitting to its default |> behaviour), it would leave it assigned an empty value (which means "no |> splitting"). One common way to address that is to do: |> |>_save_IFS=$IFS; ${IFS+":"} unset _save_IFS |>... |>IFS=$_save_IFS; ${_save_IFS+":"} unset IFS | |Really excellent work all around -- I'm very impressed. I concur. Thanks, for all the answers! --steffen | |Ralph says i must not use signatures which spread the light!
RE: sh(1): is roundtripping of the positional parameter stack possible? (Was: Re: Shell parameter expansions involving '#")
> -----Original Message-----
> From: Stephane Chazelas [mailto:stephane.chaze...@gmail.com]
> To: Robert Elz
> Cc: Steffen Nurpmeso; austin-group-l@opengroup.org
> Subject: Re: sh(1): is roundtripping of the positional parameter stack

> Here, I'd fire awk and quote more than one arg at a time:
>
> quote() {
>   LC_ALL=C awk -v q="'" -v b='\\' '
>     function quote(s) {
>       gsub(q, q b q q, s)
>       return q s q
>     }
>     BEGIN {
>       sep = ""
>       for (i = 1; i < ARGC; i++) {
>         printf "%s", sep quote(ARGV[i])
>         sep = " "
>       }
>       if (sep) print ""
>     }' "$@"
> }

> Also note that if $IFS was previously unset upon calling your
> quote() (as is common when you want to restore splitting to its default
> behaviour), it would leave it assigned an empty value (which means "no
> splitting"). One common way to address that is to do:
>
>   _save_IFS=$IFS; ${IFS+":"} unset _save_IFS
>   ...
>   IFS=$_save_IFS; ${_save_IFS+":"} unset IFS

Really excellent work all around -- I'm very impressed.

Konrad
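The save/restore idiom quoted above is compact enough to be easy to misread, so here is a small sketch that walks through the "IFS was unset" case. The idiom lines are taken from the quoted message; the state_after variable and the echo are additions for illustration.

```shell
unset IFS                                    # start from the "unset" state

# Save: if IFS is unset, ${IFS+":"} expands to nothing and the command
# that runs is "unset _save_IFS"; otherwise it is the no-op ": unset ...".
_save_IFS=$IFS; ${IFS+":"} unset _save_IFS

IFS=\'                                       # temporary change

# Restore: assign the saved value back, then unset IFS again if the saved
# copy is unset.
IFS=$_save_IFS; ${_save_IFS+":"} unset IFS

state_after=${IFS+set}                       # empty when IFS is unset
echo "IFS after restore: ${state_after:-unset}"
```

This prints "IFS after restore: unset". A plain IFS=$_save_IFS without the second half would have left IFS set to the empty string, which disables splitting.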
Re: sh(1): is roundtripping of the positional parameter stack possible? (Was: Re: Shell parameter expansions involving '#")
2017-05-16 10:03:56 +0700, Robert Elz:
[...]
> $ y=$(quote "$x")
[...]
> Just remember to always quote variable references "$x" unless you are
> 100% certain what the content of the variable is, eg: as above with $y
> where we know it is the result of the quote function, so is safe.
[...]

No, the split+glob operator that is done upon unquoted parameter
expansion (or command substitution or arithmetic expansion) is
completely different from the shell syntax parsing. It is not
affected by quotes.

a="'a b'"
printf '<%s>\n' $a

Still outputs (assuming the default value of $IFS):

<'a>
<b'>

And

touch "'a'" "'b'"
a="'?'"
echo $a

still outputs 'a' 'b' (and with a="' * '" you'd list the current
directory)

Those variables output by quote() are intended to be passed to
eval, but still need to be quoted:

eval "set -- $a"

certainly *not* eval set -- $a, which because of the glob part
(or if ' or possibly \ was in $IFS) would be a command injection
vulnerability (if the content of $a was not controlled).

You'd leave a variable unquoted if you wanted it to be either
split or globbed or both, but would then need to set $IFS and/or
disable globbing.

-- 
Stephane
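The difference between split+glob and shell syntax parsing can be checked side by side. This is a minimal sketch assuming the default value of $IFS; the variable names other than a are made up for illustration.

```shell
a="'a b'"

# Unquoted expansion: the value is split on $IFS and the ' characters
# are ordinary data, so two fields result.
set -- $a
unquoted_count=$#

# Quoted expansion handed to eval: the shell parser sees the single
# quotes as syntax, so one argument results.
eval "set -- $a"
evaled_count=$#
evaled_first=$1

echo "unquoted: $unquoted_count fields; eval: $evaled_count field <$evaled_first>"
```

This prints "unquoted: 2 fields; eval: 1 field <a b>".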
Re: sh(1): is roundtripping of the positional parameter stack possible? (Was: Re: Shell parameter expansions involving '#")
2017-05-16 10:03:56 +0700, Robert Elz:
>     Date:        Mon, 15 May 2017 18:36:58 +0200
>     From:        Steffen Nurpmeso
>     Message-ID:  <20170515163658.b7ljs%stef...@sdaoden.eu>
[...]
> Alternatively, you could implement this as an external #! script, that
> would be a lot slower, but would at least avoid that issue.

Or just write it as quote() (...) instead of quote() { ...;}

> quote() {
>     case "$1" in
>     *\'*)   ;;      # the harder case, we will get to below.
>     *)      printf "'%s'" "$1"; return 0;;
>     esac
>
>     _save_IFS="${IFS}"      # if possible just make IFS "local"
>     _save_OPTS="$(set +o)"  # quotes there not really needed.
>     IFS=\'
>     set -f
>     set -- $1
>     _result_="${1}"         # we know at least $1 and $2 exist, as there
>     shift                   # was one quote in the input.
>
>     for __arg__
>     do
>         _result_="${_result_}'\\''${__arg__}"
>     done
>     printf "'%s'" "${_result_}"
>
>     # now clean up
>
>     IFS="${_save_IFS}"      #none of this is needed with a good "local"...
>     eval "${_save_OPTS}"
>     unset __arg__ _result_ _save_IFS _save_OPTS
>
>     return 0;
> }
[...]

That doesn't work properly in POSIX shells. In AT&T ksh, $IFS is
Internal Field Delimiter (or Terminator, not "Separator") for
splitting (as opposed to for "$*")

That was fixed in pdksh, zsh and the Almquist shell, but
unfortunately in the meantime, POSIX specified the AT&T ksh way
(later, some variants of ash and pdksh have changed back to the
POSIX way).

So in POSIX shells both a''b' and a''b are split into ("a", "",
"b") so that quote() would quote both to 'a'\'''\''b'

Here, I'd fire awk and quote more than one arg at a time:

quote() {
  LC_ALL=C awk -v q="'" -v b='\\' '
    function quote(s) {
      gsub(q, q b q q, s)
      return q s q
    }
    BEGIN {
      sep = ""
      for (i = 1; i < ARGC; i++) {
        printf "%s", sep quote(ARGV[i])
        sep = " "
      }
      if (sep) print ""
    }' "$@"
}

Using LC_ALL=C on the assumption that the encoding of ' (0x27 in
ASCII) is not found in any other character in the locale of the
user (0x27 is not found in any character of any charset (other
than ' of course) on my system).

Using LC_ALL=C is to ensure that awk still manages to handle
sequences of bytes that wouldn't form valid characters in the
user's locale but that most shells are still able to store in
their parameters.

So you can do:

set -- x 'a b' "foo'" "'bar" $'A\x80B'
saved_parameters=$(quote "$@")
eval "set -- $saved_parameters"

Also note that if $IFS was previously unset upon calling your
quote() (as is common when you want to restore splitting to its
default behaviour), it would leave it assigned an empty value
(which means "no splitting"). One common way to address that is
to do:

  _save_IFS=$IFS; ${IFS+":"} unset _save_IFS
  ...
  IFS=$_save_IFS; ${_save_IFS+":"} unset IFS

-- 
Stephane
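As a quick check of the save-and-restore round trip described above, the sketch below reuses the awk-based quote() from this message unchanged, clobbers the positional parameters, and restores them with eval. The test arguments differ slightly from the ones in the message: the non-portable $'...' argument is left out and a '*' is added to show that no globbing happens on restore.

```shell
quote() {
    LC_ALL=C awk -v q="'" -v b='\\' '
        function quote(s) {
            gsub(q, q b q q, s)        # each quote becomes the 4 bytes: quote, backslash, quote, quote
            return q s q
        }
        BEGIN {
            sep = ""
            for (i = 1; i < ARGC; i++) {
                printf "%s", sep quote(ARGV[i])
                sep = " "
            }
            if (sep) print ""
        }' "$@"
}

set -- x 'a b' "foo'" "'bar" '*'
saved_parameters=$(quote "$@")

set --                                 # clobber the positional parameters
eval "set -- $saved_parameters"        # quoted: no split+glob before eval parses it

echo "$# parameters restored; third is <$3>"
```

This prints "5 parameters restored; third is <foo'>". Since awk only runs its BEGIN block here, the arguments are never opened as input files.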