Re: Raw string literals and Unicode escapes

Alex Buckley Wed, 14 Feb 2018 12:25:13 -0800

On 2/13/2018 2:19 PM, Jim Laskey wrote:

10a. String s = `abc`;
10b. String s = \u0060abc`;
...
So, change the scanner to

A) Peek back to make sure the first open backtick was exactly a
   backtick.
B) Turn off Unicode escapes immediately so that only backtick
   characters can be part of the delimiter.
C) Turn on Unicode escapes only after a valid closing delimiter is
   encountered.

Based on this all your examples are illegal.
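
Rules A-C can be sketched as a small scanner over the untranslated source text. This is only an illustration, not code from javac or the JEP: the class name StrictRawStringScanner and the method scan are hypothetical, and treating a backtick run of a different length as part of the body is an assumption of the sketch.

```java
final class StrictRawStringScanner {
    /**
     * Scans a raw string literal that is assumed to start at index 0 of the
     * untranslated source text. Returns the index just past the closing
     * delimiter, or -1 if no well-formed literal starts there.
     */
    static int scan(String source) {
        // Rule A: the first delimiter character must be an actual backtick.
        if (source.isEmpty() || source.charAt(0) != '`') {
            return -1;
        }
        // Rule B: no escape translation from here on, so only actual
        // backticks can extend the opening delimiter.
        int open = 0;
        while (open < source.length() && source.charAt(open) == '`') {
            open++;
        }
        int i = open;
        while (i < source.length()) {
            if (source.charAt(i) != '`') {
                i++;
                continue;
            }
            int run = 0;
            while (i + run < source.length() && source.charAt(i + run) == '`') {
                run++;
            }
            if (run == open) {
                // Rule C: escape processing would resume after this point.
                return i + run;
            }
            // Assumption: a run of a different length belongs to the body.
            i += run;
        }
        return -1; // no closing delimiter before the end of the input
    }
}
```

With this sketch, scan("`abc`") returns 5 (example 10a), while scan("\\u0060abc`") returns -1 (example 10b), because the first source character is a backslash rather than a backtick.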

I am not opposed to saying that a delimiter must be constructed from
actual ` characters (that is, the RawInputCharacter ` rather than the
UnicodeEscape \u0060). It would be silly if the opening delimiter was
\u0060 because the closing delimiter cannot be identical -- that hurts
readability. (Clearly the six characters \ u 0 0 6 0 inside a raw string
literal get no special processing.)

Unfortunately, there is nothing in the lexical grammar that prevents
\u0060Hello` or \u0060Hello\u0060 or in fact any of the examples below
from being lexed as a RawStringLiteral. The JLS will need a semantic
rule to force each RawStringDelimiter to be composed of actual `
characters. As you say, this will make all the examples below illegal.

There is plenty of precedent for semantic rules ("It is a compile-time
error ...") in the interpretation of Literal tokens, so that's fine. In
fact, JLS 3.10.4 already has a semantic rule that appears to constrain a
delimiter in a CharacterLiteral token:


  It is a compile-time error for the character following the
  SingleCharacter or EscapeSequence to be other than a '.

although it doesn't mean to force an actual ' character (that is, the
RawInputCharacter ' and not the UnicodeEscape \u0027). It means:


  It is a compile-time error for the character following the
  SingleCharacter or EscapeSequence to be other than a ' (or the
  Unicode escape thereof).
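
The point about character literals can be checked with ordinary Java, since Unicode escapes are translated before tokens are formed. The class and method names below are made up for the illustration.

```java
final class CharDelimiterDemo {
    static char plainDelimiters() {
        return 'x';
    }

    static char escapedDelimiters() {
        // Both single quotes on the next line are written as Unicode
        // escapes for U+0027, yet the token is an ordinary CharacterLiteral.
        return \u0027x\u0027;
    }
}
```

Both methods compile and return the same character, which shows that the existing rule in 3.10.4 accepts an escaped ' as the delimiter.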

Alex

On Feb 13, 2018, at 1:58 PM, Alex Buckley <[email protected]>
wrote:

I suspect the trickiest part of specifying raw string literals will
be the lexer's modal behavior for Unicode escapes. As such, I am
going to put the behavior under the microscope. Here is what the
JEP has to say:

-----
Unicode escapes, in the form \uxxxx, are processed as part of
character input prior to interpretation by the lexer. To support
the raw string literal as-is requirement, Unicode escape processing
is disabled when the lexer encounters an opening backtick and
reenabled when encountering a closing backtick.
-----

I would like to assume that if the lexer comes across the six
tokens \ u 0 0 6 0  then it should interpret them as a Unicode
escape representing a backtick _and then continue as if consuming
the tokens of a raw string literal_. However, the mention of _an_
opening backtick and _a_ closing backtick gave me pause, given that
repeated backticks can serve as the opening delimiter and the
closing delimiter. For absolute clarity, let's write out examples
to confirm intent: (Jim, please confirm or deny as you see fit!)

1.  String s = \u0060`;

Illegal. The RHS is lexed as ``;   which is disallowed by the
grammar.

2.  String s = \u0060Hello\u0060;

Illegal. The RHS is lexed as `Hello\u0060;   and so on for the rest
of the compilation unit -- the six tokens \ u 0 0 6 0 are not
treated as a Unicode escape since we're lexing a raw string
literal. And without a closing delimiter before the end of the
compilation unit, a compile-time error occurs.

3a.  String s = \u0060Hello`;

Legal. The RHS is lexed as `Hello`;   which is well formed.

3b.  String s = \u0060\u0060Hello`;

Depends! If you take the JEP literally, then just the Unicode
escape which serves as the first opening backtick ("_an_ opening
backtick") is enough to enter raw-string mode. That makes the code
legal: the RHS is lexed as `\u0060Hello`;   which is well formed.
On the other hand, you might think that we shouldn't enter
raw-string mode until the lexer in traditional mode has lexed the
opening delimiter fully (i.e. ALL the opening backticks). Then, the
code in 3b is illegal, because the opening delimiter (``) and the
closing delimiter (`) are not symmetric.
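
The two readings differ only in how long the opening delimiter is taken to be. A minimal sketch of that difference follows; the helper openingLength and its boolean flag are hypothetical names introduced for illustration, and the sketch recognizes only the escape for U+0060 with a single u.

```java
final class OpeningDelimiterSketch {
    /**
     * Hypothetical helper: length of the opening delimiter at the start of rhs.
     * waitForFullDelimiter = false models the literal reading, where the first
     * backtick (escaped or not) switches Unicode escape processing off.
     * waitForFullDelimiter = true models the alternative, where the whole
     * delimiter is lexed in traditional mode, so escaped backticks all count.
     */
    static int openingLength(String rhs, boolean waitForFullDelimiter) {
        int i = 0;
        int count = 0;
        while (i < rhs.length()) {
            // Under the literal reading, escapes are on only for the first backtick.
            boolean escapesOn = waitForFullDelimiter || count == 0;
            if (rhs.charAt(i) == '`') {
                i += 1;
            } else if (escapesOn && rhs.startsWith("\\u0060", i)) {
                i += 6;
            } else {
                break;
            }
            count++;
        }
        return count;
    }
}
```

For the source text of 3b, the literal reading yields an opening delimiter of length 1, which matches the single closing backtick. The alternative yields length 2, which does not.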

I think we should take the JEP literally, so that 3b is legal. And
then, some more examples:

4a.  String s = \u0060`Hello``;

Legal. The RHS is lexed as ``Hello``;   which is well formed.

4b.  String s = \u0060\u0060Hello``;

Illegal. The RHS is lexed as `\u0060Hello``;   which is disallowed
by the grammar. A raw string literal containing 11 tokens is
immediately followed by a ` token and a ; token which are not
expected.

4c.  String s = \u0060\u0060Hello`\u0060;

Depends! If you take the JEP literally, where _a_ closing backtick
is enough to re-enable Unicode escape processing, then the RHS is
lexed as `\u0060Hello``;  which is illegal per 4b. On the other
hand, if you think that we shouldn't re-enter traditional mode
until the lexer in raw-string mode has lexed the closing delimiter
fully (i.e. ALL the closing backticks), then presumably you think
analogously about the opening delimiter, so the RHS would be lexed
as ``Hello`\u0060;   which is illegal per 2 (no closing delimiter
`` before the end of the compilation unit).

5.  String s = \u0060`Hello`\u0060;

I put this here because it looks nice. It hits the same issues as
3b and 4c.
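
The "lexed as" strings above can be reproduced with a toy model of the most literal reading of the JEP, in which every backtick the lexer sees (escaped or not) toggles Unicode escape processing. This is a sketch under that assumption only: the names ModalEscapeSketch, lexedAs and isEscapeAt are hypothetical, it models the character stream rather than delimiter matching, and it simplifies escape recognition to a backslash, a single u and four hex digits.

```java
final class ModalEscapeSketch {
    /** Returns the character stream the lexer would see under the toggle reading. */
    static String lexedAs(String source) {
        StringBuilder out = new StringBuilder();
        boolean escapesOn = true;
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            int next = i + 1;
            if (escapesOn && isEscapeAt(source, i)) {
                // Translate the six source characters into one character.
                c = (char) Integer.parseInt(source.substring(i + 2, i + 6), 16);
                next = i + 6;
            }
            out.append(c);
            if (c == '`') {
                // Assumption: each backtick flips the mode, as in the JEP wording.
                escapesOn = !escapesOn;
            }
            i = next;
        }
        return out.toString();
    }

    /** Simplified check: backslash, one u, then exactly four hex digits. */
    static boolean isEscapeAt(String s, int i) {
        if (i + 6 > s.length() || s.charAt(i) != '\\' || s.charAt(i + 1) != 'u') {
            return false;
        }
        for (int k = i + 2; k < i + 6; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

Feeding it the right-hand sides of examples 2, 3b and 4c gives the same character streams as described above: the trailing escape in 2 stays untranslated, the second escape in 3b stays untranslated, and the final escape in 4c is translated into a backtick.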

Alex
