Hello,

Recently I started implementing, in my shell (a dash-based personal hobby project), support for $'...' strings based on the proposed text in http://austingroupbugs.net/view.php?id=249 and the later comments there, which has left me with a few questions. I understand that the text is not final and that any answers I receive now may be invalidated by future changes to the text.

1. Are the rules for determining the end of a dollar-quoted string intended to be fully specified, especially when taking into account strings that are never expanded?

   : || : $'\c'   #1
   : || : $'\c\'' #2

Unlike other shells, mksh takes the second ' in #1 as the operand to \c, and the backslash by itself in #2, so it complains about a missing closing quote in both cases. Is any particular behaviour intended?

2. I am not able to fully understand the unspecified behaviour that is allowed for null byte handling.

    If a \xXX or \XXX escape sequence yields a byte whose value is 0,
    it is unspecified whether that nul byte is included in the result
    or if that byte and any following regular characters and escape
    sequences up to the terminating unescaped single-quote are evaluated
    and discarded.

a. As mentioned in a comment already, \u0000 can be used to produce a null byte as well, which is missing from the text. Is this also intended to apply to implementation-defined or unspecified ways of producing a null byte? Examples are \400, which produces a null byte on most implementations but in mksh does not terminate the string the way \0 does, and \c@.

b. If the implementation includes that null byte in the result, how does it follow that

                        printf a$'b\0c\''d
                    is required by this standard to produce:
                        abd
                    while historic versions of ksh93 produced:
                        ab

If the null byte is included in the string, would passing the complete string, null byte included, to the printf utility not cause the output to be "ab"?

3. About the addition "during token recognition" for the handling of \uXXXX and \UXXXXXXXX: when exactly does token recognition take place? Does this permit, require, or forbid implementations to parse multiple commands before executing them, when one command may change the locale but the $'...' string is part of a different, later command? Does it require functions containing $'\uXXXX' to expand the sequence according to the locale that was in effect when the function was defined, or may shells (re-)tokenise the function's commands when the function is executed, or behave as if such re-tokenisation takes place?

4. Does the implementation-defined handling, during token recognition, of \uXXXX and \UXXXXXXXX sequences denoting characters that are not part of the current locale's character set allow implementations to issue an error message at parse time? That is, assuming a shell decides to provide an error message for

  LC_ALL=C
  : $'\u1234'

as zsh does, does this allow and/or require that same error message to be produced for

  LC_ALL=C
  : || : $'\u1234'

Cheers,
Harald van Dijk
