    Date:        Thu, 30 Jul 2020 15:53:53 +0200
    From:        Steffen Nurpmeso <stef...@sdaoden.eu>
    Message-ID:  <20200730135353.qwslp%stef...@sdaoden.eu>


  | The problem being that what is in the wild does not work out for
  | many languages.

I admit to not knowing a lot about internationalisation issues,
or about Unicode, but I don't understand this at all.

The quoting mechanisms in the shell provide a means to create
specific bit patterns to assign to variables, pass as parameters
to programs, etc.   I don't see that the mechanism by which they're
encoded in the sh language should matter all that much, the same
thing could be read from a file instead ( var=$(cat file) ) in which
case the shell spec has no control over the bit patterns at all.
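To make that concrete, here is a minimal sketch. The scratch file and the
variable names are made up for illustration, and mktemp is assumed to be
available. The same bytes end up in a variable whether they are spelled
with quoting or read from a file:

```shell
# Hypothetical scratch file; mktemp is assumed to exist on this system.
tmp=$(mktemp) || exit 1

# Put some bytes in the file, outside the shell's quoting rules entirely.
printf '%s' "Don't you worry" > "$tmp"

# Route 1: build the value from small, differently quoted pieces.
from_quotes=Don"'"t' you worry'

# Route 2: read the same bytes from the file.
from_file=$(cat "$tmp")
rm -f "$tmp"

if [ "$from_quotes" = "$from_file" ]; then
    echo "identical"
else
    echo "different"
fi
```

One caveat: command substitution strips trailing newlines, so the two
routes would differ if the file ended with one.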

Of course the quoting mechanisms make a difference to the ease of
use for the sh programmer, but that's an entirely different issue.

  | The in-use shell quote pattern consisting of small, isolated parts
  | which depend on which kind of escaping and expanding is necessary
  | just does not work out for many languages.

Can you give an example of something which cannot be done (assuming
$'' as currently intended to be specified)?   Note: not an example of
someone using the mechanisms to do the wrong thing - there are zillions
of ways to write bad code, but an example of something which cannot be
done correctly as specified.   Then we'll see if that really matters.

  |   ? echo Don"'"t you worry$'\x21' The sun shines on us. $'\u263A'
  |
  | The latter is what i mean.  There are many languages on this world
  | where these \u expansions do not work out that way, but where the
  | "entire sentence must be interpreted as a unity" in order to get
  | the iconv(3) conversion to nl_langinfo(CODESET) correctly, aka
  | the way it is _desired_.

Surely this depends upon how the shell works - if the shell is attempting
to convert just the \u escape into some other codeset, I can see your point,
but it doesn't need to work like that - it can work internally in ISO 10646
code points (whether encoded in 16 or 32 bit values, or as UTF-8), and
only convert to the desired charset when actually used (that is, when
about to run "echo", at which point the entire string is available).
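A rough sketch of "convert late, and convert the whole string" follows,
done by hand rather than by the shell. It assumes iconv(1) and od(1) are
installed, UTF-16BE is only a stand-in for whatever the target codeset
would be, and the variable names are mine:

```shell
# UTF-8 bytes for "sun " followed by U+263A, written with printf octal
# escapes so the example depends neither on $'' nor on the locale.
word=$(printf 'sun \342\230\272')

# Convert the complete string in one go, only at the point of use.
# UTF-16BE is just an example target; iconv(1) is assumed to be installed.
hex=$(printf '%s' "$word" | iconv -f UTF-8 -t UTF-16BE | od -An -tx1 | tr -d ' \n')

# Show the converted bytes as hex digits.
echo "$hex"
```

Nothing is converted when the escape is written; the converter only ever
sees the finished string.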

In any case, if the user has specified a specific Unicode code point,
shouldn't that always be what is generated, regardless of whether it
makes sense or not?

  | And for that it would be tremendous if $'' would be defined so
  | that it can be used as the sole quoting mechanism,

No thanks.   Partly because $'' is already implemented (widely)
and used (perhaps slightly less yet) - so that ship has sailed.

I believe I've seen $" ... " used that way somewhere though (don't
recall where) and I believe it is a mistake.

As soon as you have multiple different types of expansions that
can occur, there are problems with which one gets priority, which
is performed first.   So, assuming there is a $"..." which works
as you desire, what happens with

        $"${VAR+foo\x7Dbar}"

Do we get foo}bar or foobar} ?   (assuming VAR is set, of course).
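Both answers can be reproduced today with plain parameter expansion.
This is only a sketch of the two possible orderings, since a $"..." that
decodes C escapes is hypothetical; the variable names are mine:

```shell
VAR=set

# Ordering A: the escape is decoded first, so \x7D is already a literal }
# by the time the expansion is parsed, and it closes ${VAR+...} early.
decode_first=${VAR+foo}bar}

# Ordering B: the expansion is parsed first, so the } produced by the
# escape is ordinary data inside the word (spelled here as \}).
expand_first=${VAR+foo\}bar}

echo "$decode_first"
echo "$expand_first"
```

The first echo prints foobar} and the second prints foo}bar, and neither
is obviously the wrong reading.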

Whichever way you pick, there will be arguments for doing it
the other way, in some other case.   This stuff simply becomes
a mess.   Please, don't go there.   If we wanted to add C type
encodings along with the others, we'd need to do it in a way that
is consistent with the other expansions, perhaps using something
like $[x7D] or $[u263A] or $[n] (but no, this is not a serious
suggestion).

And I cannot fathom how this in any way overcomes your earlier
objection: quoted strings in sh are not units, they're simply
pieces of some longer word (or can be) - your Don"'"t example
above (and the worry$'\x21') are both examples of that.
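A small sketch of that, with '!' written in place of $'\x21' so that it
also runs in shells without $'' (the variable names are illustrative):

```shell
# Two words, each assembled from differently quoted pieces.
set -- Don"'"t worry'!'

count=$#     # the shell sees two words, not four quoted units
first=$1     # Don't
second=$2    # worry!

echo "$count"
echo "$first"
echo "$second"
```

The quotes are gone by the time the words exist; no command ever sees
where one quoted piece ended and the next began.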

kre

