Hello,
Recently I started implementing support for $'...' strings in my shell
(a dash-based personal hobby project), based on the proposed text in
http://austingroupbugs.net/view.php?id=249 and the later comments there,
which has left me with a few questions. I understand that the text is
not final and that any answers I receive now may be invalidated by
future changes to it.
1. Are the rules for determining the end of a dollar-quoted string
intended to be fully specified, especially when taking into account
strings that are never expanded, such as:
: || : $'\c' #1
: || : $'\c\'' #2
Unlike other shells, mksh takes the second ' in #1 as the operand to
\c, and the backslash by itself in #2, so it complains about a missing
closing quote for both. Is any particular behaviour intended?
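For reference, the divergence can be probed mechanically. This is only a
hypothetical test script (which shells are installed is an assumption);
-n makes each shell parse the command without executing it, so a nonzero
exit status can only come from tokenisation:

```shell
# Hypothetical probe: ask each shell to merely parse (-n) the two
# candidate strings and report whether it finds a closing quote.
# Which shells are installed is an assumption; missing ones are skipped.
for sh in bash mksh zsh; do
  command -v "$sh" >/dev/null 2>&1 || continue
  n=1
  for s in ": || : \$'\\c'" ": || : \$'\\c\\''"; do
    if "$sh" -n -c "$s" 2>/dev/null; then
      printf '%s accepts #%d\n' "$sh" "$n"
    else
      printf '%s rejects #%d\n' "$sh" "$n"
    fi
    n=$((n + 1))
  done
done
```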
2. I am not able to fully understand the permitted unspecified effects
of null byte handling. The proposed text reads:
If a \xXX or \XXX escape sequence yields a byte whose value is 0,
it is unspecified whether that nul byte is included in the result
or if that byte and any following regular characters and escape
sequences up to the terminating unescaped single-quote are evaluated
and discarded.
a. As mentioned in a comment already, \u0000 can be used to produce a
null byte as well, which is missing from the text, but is this also
intended to apply to implementation-defined or unspecified forms of
producing a null byte? Examples are \400, which produces a null byte on
most implementations but in mksh does not terminate the string the way
\0 does, and \c@.
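As an illustration, here is a hypothetical probe for the \400 case.
Which shells are installed, and what bytes they print, are assumptions;
the variation in output is precisely the point:

```shell
# Hypothetical probe: what byte, if any, does \400 (octal 256) yield?
# Shell availability is an assumption; od shows the resulting bytes.
for sh in bash mksh ksh; do
  command -v "$sh" >/dev/null 2>&1 || continue
  printf '%s: ' "$sh"
  "$sh" -c "printf '%s' \$'\\400'" 2>/dev/null | od -An -to1
done
```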
b. If the implementation includes that null byte in the result, how
does it follow that
printf a$'b\0c\''d
is required by this standard to produce:
abd
while historic versions of ksh93 produced:
ab
If the null byte is included in the string, would passing the complete
string, including the null byte, to the printf utility not cause the
output to be "ab"?
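The mechanics behind the question: a null byte can exist in a data
stream, but arguments are handed to a utility as NUL-terminated C
strings, so anything after an embedded NUL in an argument would be cut
off. A small POSIX-sh sketch of the first half (no $'...' needed, since
the printf utility's format string supports \ddd octal escapes):

```shell
# A NUL byte is perfectly representable in a data stream, as od shows:
printf 'ab\0cd' | od -An -c
# ...but an exec'd utility such as printf could never receive "ab\0cd"
# as one argument intact: the argument string would end at the NUL.
```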
3. About the addition "during token recognition" for the handling of
\uXXXX and \UXXXXXXXX: when exactly does token recognition take place?
Does this require, allow, or forbid implementations to parse multiple
commands prior to executing them, when one command may change the
locale but the $'...' string is part of a different, later command?
Does this require functions containing $'\uXXXX' to expand them
according to the locale that was in effect when the function was
defined, or are shells allowed to (re-)tokenise the function's commands
when the function is executed, or to behave as if such re-tokenisation
takes place?
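A hypothetical illustration of the timing problem (assumes bash is
installed; which bytes get printed, and whether other shells agree, is
exactly what the question asks):

```shell
# The function body is tokenised when the definition is read; the
# assignment to LC_ALL only runs afterwards. Which locale governs the
# conversion of \u00e9? (Assumes bash; the answer may differ per shell.)
if command -v bash >/dev/null 2>&1; then
  bash <<'EOF' | od -An -to1
f() { printf '%s' $'\u00e9'; }  # tokenised here
LC_ALL=C                        # locale changed here
f                               # expanded and printed here
EOF
fi
```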
4. Does the implementation-defined handling "during token recognition"
of \uXXXX and \UXXXXXXXX sequences that denote characters not in the
current locale's character set allow implementations to issue an error
message at parse time? That is, assuming a shell decides to provide an
error message for
LC_ALL=C
: $'\u1234'
as zsh does, does this allow and/or require that same error message to
be produced for
LC_ALL=C
: || : $'\u1234'
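Again a hypothetical probe (installed shells are an assumption; whether
the two cases must behave the same is the question being asked):

```shell
# Compare the directly executed and the never-executed case under
# LC_ALL=C. Exit status 0 is reported as "no error".
for sh in bash zsh; do
  command -v "$sh" >/dev/null 2>&1 || continue
  for cmd in ": \$'\\u1234'" ": || : \$'\\u1234'"; do
    if LC_ALL=C "$sh" -c "$cmd" >/dev/null 2>&1; then
      printf '%s: no error for %s\n' "$sh" "$cmd"
    else
      printf '%s: error for %s\n' "$sh" "$cmd"
    fi
  done
done
```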
Cheers,
Harald van Dijk